We’ve got an intermittent issue that causes Kong API Gateway to stop responding to requests. The root cause is not clear at this point, but we have made a few observations that may help with troubleshooting. First, this is our environment:
Kong 3.3.0 in DB-less mode
KIC 2.8.1
Python Kong-PDK 0.33
k8s 1.24.12
Kong Helm Chart 2.23.0
Everything seems to be running fine from the API perspective, but the Kong logs are full of messages like:
declarative reconfigure was started on worker #0
[DB cache] purging (local) cache
building a new plugins iterator
AFAIU a reconfigure can be triggered by k8s infrastructure changes, but what is unclear is why it purges the whole cache, which results in rebuilding the plugins iterator. This can happen a dozen times in a single second, and occasionally Kong becomes completely unresponsive; when that happens, we start to see:
Could not claim instance_id for {{PLUGIN_NAME}} (key: {{PLUGIN_ID}})
Memory and CPU usage is stable and below 50%.
Any idea what the root cause could be? What k8s changes trigger a reconfigure?
It turns out that if the declarative configuration has changed, all caches are purged and the previous information about plugins is invalidated, which leads to new plugin instances being created. The thing is that our upstream services run on spot instances and their IPs change fairly often, which causes frequent updates to the declarative configuration; the plugins themselves, however, never change, so it does not make sense to reload them every time. The problem we are observing, where one of the pods becomes unresponsive (it gets stuck on Could not claim instance_id), could be mitigated by avoiding frequent plugin reloads: it makes sense to reload plugins only when the plugins hash has changed.
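To illustrate the "reload only when the plugins hash has changed" idea, here is a minimal Python sketch. All names here (Reconfigurer, plugins_hash, the config shape) are hypothetical, not Kong's actual API (Kong's internals are Lua); the point is only that a stable hash over the plugins section lets upstream IP churn flow through without touching plugin instances.

```python
import hashlib
import json

def plugins_hash(declarative_config: dict) -> str:
    """Stable digest over only the plugins section of the config."""
    plugins = declarative_config.get("plugins", [])
    canonical = json.dumps(plugins, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

class Reconfigurer:
    """Illustrative only: skip plugin reloads when the plugins hash is unchanged."""

    def __init__(self):
        self._last_plugins_hash = None

    def reconfigure(self, new_config: dict) -> bool:
        """Apply a new declarative config; return True if plugins were reloaded."""
        new_hash = plugins_hash(new_config)
        if new_hash == self._last_plugins_hash:
            # Spot-instance IP churn lands here: upstreams/targets still get
            # updated elsewhere, but plugin instances stay untouched.
            return False
        self._last_plugins_hash = new_hash
        # ...rebuild the plugins iterator / recreate plugin instances here...
        return True
```

With this in place, two consecutive configs that differ only in upstream IPs would trigger a single plugin reload rather than one per config push.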
A potential issue could be that reset_instance is called only for the "not ready" or "no plugin instance" errors. If any other error occurs, the non-initialized plugin instance is never cleaned up, and other threads will never make it through the while loop in get_instance_id.
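A small Python sketch of that failure mode, under stated assumptions: this is not Kong's actual code (which is Lua), and PluginInstanceCache, FlakyRPC, and the PENDING sentinel are invented for illustration. It shows how cleaning up only on a known error list lets an unexpected error leave a half-claimed instance behind, so callers spin in the wait loop until they time out.

```python
import time

PENDING = object()  # sentinel: a claim attempt is in flight

class FlakyRPC:
    """Fails the first start_instance call with the given error, then succeeds."""

    def __init__(self, error):
        self.error, self.calls = error, 0

    def start_instance(self):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError(self.error)
        return 42

class PluginInstanceCache:
    """Illustrative model of a get_instance_id wait loop with selective cleanup."""

    RESETTABLE = ("not ready", "no plugin instance")

    def __init__(self, rpc):
        self.rpc = rpc
        self.instance_id = None

    def reset_instance(self):
        self.instance_id = None

    def get_instance_id(self, timeout=1.0):
        deadline = time.monotonic() + timeout
        while True:
            if self.instance_id is PENDING:
                # Someone is (supposedly) still claiming; wait for them.
                if time.monotonic() > deadline:
                    raise TimeoutError("Could not claim instance_id")
                time.sleep(0.01)
                continue
            if self.instance_id is not None:
                return self.instance_id
            self.instance_id = PENDING
            try:
                new_id = self.rpc.start_instance()
            except RuntimeError as err:
                if str(err) in self.RESETTABLE:
                    self.reset_instance()  # clears PENDING; next pass retries
                # The bug being described: any other error skips the reset,
                # PENDING is never cleared, and every caller spins above.
                continue
            self.instance_id = new_id
            return new_id
```

A "not ready" failure recovers on the retry, while an unlisted error (say, a closed socket) leaves the cache stuck at PENDING, matching the observed "Could not claim instance_id" hang.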