This may be a shot in the dark, as I don’t have anything to back up my findings.
We’re sometimes seeing very long synchronization times in the ingress-controller logs, measured as the gap between two consecutive messages:
level=info msg="syncing configuration" component=controller
and
level=info msg="no configuration change, skipping sync to kong" component=controller
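To put numbers on those gaps, something like the following rough Python sketch could be run over the controller logs. It assumes each log line carries a logrus-style time="..." field in RFC 3339 format with whole seconds, which the excerpts above do not show, so the regular expression may need adjusting. The names sync_gaps and parse_time are only illustrative.

```python
import re
from datetime import datetime

# Assumption: lines look like
#   time="2020-05-04T10:00:00Z" level=info msg="..." component=controller
# The time field is not visible in the excerpts above; adjust if your format differs.
LINE_RE = re.compile(r'time="(?P<time>[^"]+)".*?msg="(?P<msg>[^"]*)"')

START_MSG = "syncing configuration"
END_MSG = "no configuration change, skipping sync to kong"


def parse_time(value):
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def sync_gaps(lines):
    """Return (start_time, seconds) for each start message followed by an end message."""
    gaps = []
    started = None
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not match the assumed format
        stamp = parse_time(match.group("time"))
        msg = match.group("msg")
        if msg == START_MSG:
            # If two start messages appear in a row, the later one wins.
            started = stamp
        elif msg == END_MSG and started is not None:
            gaps.append((started, (stamp - started).total_seconds()))
            started = None
    return gaps
```

Passing sys.stdin (or an open log file) to sync_gaps and filtering for gaps above a few seconds should show how often the slow syncs happen and whether they line up with the pod restarts.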
As a result, Kong misses pod IP changes: when a pod is deleted and recreated, Kong keeps routing traffic to the old IP until synchronization has finished. Routing to the old, dead IPs stopped after 1.5 to 3 minutes, which suggests that Kong took that long to sync.
After digging around in the kong-ingress-controller source, I found that synchronization is enqueued every second (kubernetes-ingress-controller/controller.go at d8f58ead71664b042c854c1a3e429fe95f17eb24 · Kong/kubernetes-ingress-controller · GitHub), so this should be OK? What else could cause these long synchronization times?
As a short-term mitigation, we applied the
ingress.kubernetes.io/service-upstream: true annotation to our services, but that does not seem to resolve the underlying issue.