We have been trying to figure out what is happening for too long, so I decided to ask here. Generally speaking, we started seeing these lines in the logs of our Kong 2.3.3 community gateway:
```
2023/01/09 21:17:47 [error] 23#0: *80971136 recv() failed (104: Connection reset by peer), context: ngx.timer, client: 127.0.0.1, server: 127.0.0.1:8001
2023/01/09 21:17:47 [error] 24#0: *80971147 recv() failed (104: Connection reset by peer), context: ngx.timer
```
Basically, we had been using our Kong + ingress controller setup successfully with OKD 4.11 in AWS and never saw the behaviour described above. The logs above come from the same OKD 4.11 setup, but running on vSphere. What we have tried so far:
- Updated Kong to 3.0.1 and the ingress controller to 2.7.0, thinking it could be an issue with the old router.
- Replaced the network SDN plug-ins.
- Captured and inspected the Kong pod's packets with Wireshark.
Speaking of Wireshark, we found that there is a packet pattern after which Kong receives an RST packet, which spawns the error log lines. Here they are
It seems that Kong is trying to resolve the IP of a Kubernetes service using a short name pattern .. instead of the FQDN ...cluster.local, and receives "no such name" in response. The same packet capture in the AWS setup shows the same queries with the same responses, but without the RST packet. It looks like the upstream DNS server is dropping the connection, but why this happens with vSphere and not with the AWS VPC resolver, we do not know yet.
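For anyone reading along, the short-name queries themselves are expected: a glibc-style resolver (and Kong's own resolv.conf-based resolver) expands an unqualified name against every `search` domain before (or after, depending on `ndots`) trying it as-is, so several NXDOMAIN answers per lookup are normal. A minimal sketch of that expansion logic, with hypothetical service/namespace names and the typical Kubernetes pod `search` list and `ndots:5`:

```python
# Sketch of glibc/Kubernetes-style search-domain expansion.
# Names, namespaces, and search domains below are hypothetical examples,
# not taken from our cluster.

def expand_query(name: str, search_domains: list[str], ndots: int = 5) -> list[str]:
    """Return candidate FQDNs in the order a resolver would try them."""
    if name.endswith("."):  # already fully qualified: no expansion at all
        return [name]
    candidates = [f"{name}.{d}" for d in search_domains]
    # Names containing >= ndots dots are tried as-is first; short names
    # (the usual case with Kubernetes' ndots:5) go through the search
    # list first, producing an NXDOMAIN for each non-matching domain.
    if name.count(".") >= ndots:
        return [name + "."] + candidates
    return candidates + [name + "."]

# Typical pod resolv.conf search list (hypothetical namespace):
search = ["my-namespace.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for fqdn in expand_query("my-service.my-namespace", search):
    print(fqdn)
```

Every candidate except the matching one returns "no such name", so the NXDOMAIN responses in the capture are part of normal operation; the anomaly in the vSphere setup is only the RST that follows them.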
So right now our best workaround is to forward the error logs to /dev/null, since the issue does not seem to affect functionality, but it would be great to know what is actually happening and how to fix it. Would be grateful for any ideas!
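If silencing the logs really is the interim fix, it can be done in Kong's own configuration rather than an external pipe. A sketch assuming the standard `kong.conf` log properties (the sample lines reference `127.0.0.1:8001`, i.e. the admin interface); raising `log_level` instead would keep other errors visible:

```
# kong.conf (or the matching KONG_* environment variables)
admin_error_log = /dev/null
proxy_error_log = /dev/null
```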