Failed recreating load balancer after a few days of pods running

We have been using Kong on our production clusters for a while. While only about 20% of our applications were running behind it, we had no problems.

We recently migrated the rest of our workload, and since then we've been seeing an error on some of our Kong replicas:

failed recreating balancer for app.namespace.80.svc: timeout waiting for balancer for a9a7385e-ad84-474e-9b1c-2....

To “solve” this problem, we are forced to restart the pod that returns this error.

Unfortunately, if we don't detect the problem quickly enough, every pod eventually hits the error and 100% of our traffic is interrupted.
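As a stopgap while this is being diagnosed, detection can be automated by watching the proxy logs for the error string. A minimal sketch (the namespace, label selector, and `kubectl` commands in the comments are illustrative assumptions; here the pattern is checked against a sample log line rather than a live cluster):

```shell
#!/bin/sh
# Pattern taken from the error reported above
PATTERN='failed recreating balancer'

# In a live cluster you would feed real logs instead, e.g.:
#   kubectl logs -n kong -l app=kong --tail=500 | grep -q "$PATTERN"
# and restart the affected pod on a match:
#   kubectl delete pod -n kong <pod-name>

# Illustrative check against a sample line:
SAMPLE='failed recreating balancer for app.namespace.80.svc: timeout waiting for balancer'
if printf '%s\n' "$SAMPLE" | grep -q "$PATTERN"; then
  echo "balancer error detected"
fi
```

Wired into a liveness probe or a small cron job, a check like this would at least bound the blast radius to one replica instead of all of them.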

We did not notice any excessive RAM consumption on the pods, and no other errors were returned by the pod before the error above.

Can you file a bug on the Kong/kong GitHub issue tracker? They should be able to guide you through diagnosing this particular issue and can develop a fix if it's a code issue.

For background, our Kubernetes tooling creates Kong configuration, and the proxy then takes that and uses it to build a number of optimized internal structures for routing requests. The balancer is one of those internal structures, derived from Kong services and upstreams (roughly analogous to Kubernetes Services and Endpoints). When something in that area changes (e.g. a Pod is rescheduled, so the set of IPs in its attached Service’s Endpoints changes), the proxy has to rebuild the balancer.
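To make that mapping concrete, here is a hedged sketch of a Kong declarative configuration fragment (the names and IPs are made up; the upstream name follows the `app.namespace.80.svc` pattern from the error message). The upstream's targets play the role of Kubernetes Endpoints, so when a Pod is rescheduled and its IP changes, the target list changes and the proxy must rebuild the balancer for that upstream:

```yaml
_format_version: "2.1"
services:
- name: app-service              # roughly analogous to a Kubernetes Service
  host: app.namespace.80.svc     # resolved via the upstream below
upstreams:
- name: app.namespace.80.svc
  targets:                       # roughly analogous to Endpoints; pod IPs land here
  - target: 10.0.1.5:80
  - target: 10.0.1.6:80
```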

Balancer builds shouldn’t fail, and I’m not sure offhand what would cause a build to time out. The core team (who review the Kong repo issues) are more familiar with that codebase.

© 2019 Kong Inc.