Slowly increasing CPU usage for Kong with Ingress Controller

I have just made the switch from Kong 0.13.0 to 2.0.2 (with the ingress controller) in our production environment, and we’re seeing much higher CPU usage and Redis operation timeouts (for plugins like “rate-limiting”) after a few hours with 2.0.2. The CPU usage eventually climbs to the point where it’s hitting the container cgroup CPU limits.
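For context, here is roughly how I've been confirming that the slowdowns line up with cgroup CPU throttling rather than general node pressure (a minimal sketch; it assumes cgroup v1 and that the proxy container has a shell, and the namespace/pod names are placeholders):

# Check the cgroup v1 CPU throttling counters from inside the proxy container.
# A steadily rising nr_throttled / throttled_time means we're hitting the container CPU limit.
$ kubectl exec -n <namespace> <kong-pod> -c proxy -- cat /sys/fs/cgroup/cpu/cpu.stat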

Is there a way that I can profile the Kong and plugin Lua code? I suspect that there’s some contention within Lua, but I really have no idea where to start on trying to track it down.
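For reference, the closest thing I've found to a general approach so far is sampling the worker processes directly (a rough sketch, not Kong-specific tooling; it assumes perf and a full procps ps are available in the container and that the pod has the privileges perf needs):

# Find the busiest nginx worker, then sample it with perf.
# Caveat: LuaJIT-compiled frames mostly show up as unresolved addresses, so this
# only gives a coarse picture of where the CPU time is going.
$ ps -C nginx -o pid,%cpu,cmd --sort=-%cpu | head
$ perf top -p <worker-pid>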


After a bit of debugging (by disabling global plugins one at a time), I determined that the high CPU usage was caused by the local copy of the “request-transformer” plugin that we created back in 0.13.0 so we could strip incoming headers globally even when the “request-transformer” plugin is also used on individual services/routes. I have no idea why, but something about that plugin’s 0.13.0-era code was making Kong 2.0.2 a bit angry.

So that plugin was a red herring. The CPU usage was slowly climbing again even with it disabled. On a hunch, I re-enabled the plugin and the CPU usage dropped back to ~8%.

It seems that it’s the act of updating the config (by turning that plugin on/off) that makes the CPU usage “reset” back to baseline.

As an additional note, it seems that the config doesn’t even need to change. Just POSTing Kong’s current config back to it will also trigger whatever is happening to “reset” the CPU usage back to baseline.
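For example, the minimal version of that “reset” on a single node looks like this (assuming the Admin API is listening on localhost:8444, which is where ours is):

# Fetch the current declarative config and immediately POST it back, unchanged.
$ curl -sS localhost:8444/config | curl -sS -X POST localhost:8444/config -d @- -H "Content-Type: application/json"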

Interesting.
A few more questions to dig into this:
What other plugins are you using?
How many worker processes are you running?
What is the baseline CPU usage, and how high does it climb?
Are you seeing any change in the amount of memory consumed when the CPU usage changes?

There are some profiling capabilities available, but they have not yet made their way into the open-source world.

We are using acl, basic-auth, correlation-id, datadog, http-log, ip-restriction, key-auth, rate-limiting, request-termination, request-transformer, as well as a number of custom plugins that mostly do basic stuff like redirect-or-allow logic. These custom plugins have been used for a while with Kong 0.13.0 without any issues, and have been brought over to Kong 2.0.2 mostly untouched.

We were originally configured for 2 workers but have now increased it to 4. The rate of CPU usage growth went down somewhat with this change, which suggests the growth is tied to the volume of traffic each worker receives.

The baseline CPU usage is around 5-10%, and I’ve seen it hit the container resource limit of 200% when there were only 2 workers. The total container CPU usage at the current moment is hovering around 100% (of 200%, with 4 workers), and we’re currently seeing the errors about Redis operation timeouts (we set the timeout to 500ms) sporadically.

There does not seem to be any meaningful change in memory usage between the low and high CPU usage states.
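In case it's useful, this is roughly how I've been looking at per-worker numbers rather than the container totals (a sketch; it assumes a full procps ps is available in the image, which the default Alpine-based image may not have; namespace/pod names are placeholders):

# Show CPU and resident memory per nginx worker inside the proxy container, sorted by CPU.
$ kubectl exec -n <namespace> <kong-pod> -c proxy -- ps -C nginx -o pid,%cpu,rss,cmd --sort=-%cpu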

As a side note, I tried doing a DELETE to the /cache endpoint because I saw the log messages about purging (local) cache when you load a new config, but that alone had no effect on the CPU usage.
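For reference, that was just the following (again assuming the Admin API on localhost:8444):

# Purge this node's local entity cache (the same purge the config-reload log messages mention).
$ curl -sS -X DELETE localhost:8444/cache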

Although the memory usage doesn’t change appreciably, it does go from a “slightly bumpy” to a “smooth” pattern for a little while after reloading the config. You can see what I mean in this graph. The memory limit for the container is 1GB, which gives you an idea of the scale of those small spikes. The size of the spikes starts to slowly grow over time, noticeably starting about an hour after the config reload.

[Screenshot: container memory usage graph, 2020-04-20]
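To put numbers behind those spikes rather than eyeballing the container graph, I've also been peeking at the memory the node reports about itself (a sketch; it assumes this Kong version exposes per-worker memory under a “memory” key on /status, and that jq is available in the container):

# Per-worker LuaJIT VM allocation and shared-dict usage, as reported by the node.
$ curl -sS localhost:8444/status | jq '.memory'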

Here’s a view of the last 4 hours of CPU/memory usage of one of the pods. We re-posted the current config at ~3:10am, which is where you see the drop off.

[Screenshots: CPU and memory usage over the last 4 hours, 2020-04-21]

For now, I’ve put a k8s CronJob in place that runs the following command once an hour. This is not an ideal solution, but it at least keeps this from becoming a problem that the on-call has to deal with.

$ kubectl get pod -n infra | grep kong-ingress-kong | awk '{ print $1 }' \
    | while read -r pod; do
        (set -x; kubectl exec -n infra "$pod" -c proxy -- bash -c \
          'curl -sS localhost:8444/config | curl -sS -X POST localhost:8444/config -d @- -H "Content-Type: application/json" >/dev/null')
      done

To clarify, are you using the ingress controller, or Kong in a DB-less setup without the controller?
If the latter, then why are you not sending any configuration in the call to /config?
Does even an unsuccessful (4xx from kong) /config API call fix the CPU problem?

This is a DB-less Kong setup with the ingress controller. I am submitting a valid config to the /config endpoint. That command is fetching the current config from /config and then POSTing it back by piping the output of the first curl into the second and passing it as the request body with -d @-.

Got it.
Can you share the declarative configuration that you get from /config? Please sanitize it to mask sensitive information. This will help us investigate the issue further.

Also, what does your traffic profile look like? How many requests do you see per second/minute, and what are the request durations?

That would take a whole lot of sanitization to prevent exposing DNS names, credentials, service names, etc., plus the config references a bunch of our custom plugins.

Our typical traffic is between 200 and 600 requests/minute, but that’s spread across 6 instances with 4 workers each. We don’t really need that many instances/workers, but we’ve scaled it like that to slow down the CPU usage growth.

That’s completely reasonable. Can you provide a guesstimate of how many services/routes/certificates you have?

Is it possible for you to experiment with the number of workers? Does the bug show up if you set the number of workers to 1? Does the growth rate change?
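If it helps, changing the worker count is usually just an environment variable on the deployment (a sketch; it assumes the proxy deployment is named kong-ingress-kong, based on the pod names in the command above, and that it reads KONG_NGINX_WORKER_PROCESSES, which maps to the nginx_worker_processes setting):

# Set the nginx worker count via Kong's environment variable and let the deployment roll.
$ kubectl set env -n infra deployment/kong-ingress-kong KONG_NGINX_WORKER_PROCESSES=1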

We have 51 services and 469 routes (this should be closer to 51, but the use of the ingress controller means each individual path/host combo on an Ingress ends up as a separate rule, which means a separate route). There are no certificates.
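For anyone wanting to pull the same counts, this is roughly how I got them from the Admin API (a sketch; it assumes jq is available and that everything fits in a single page at size=1000, the Admin API's maximum page size):

# Count services and routes known to this node (follow the "next" field if there are more than 1000).
$ curl -sS 'localhost:8444/services?size=1000' | jq '.data | length'
$ curl -sS 'localhost:8444/routes?size=1000' | jq '.data | length'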

I cannot easily experiment with the workers, as a restart of Kong can be somewhat disruptive to production traffic due to the time it takes for the config to update/settle.

I’m also looking into upgrading to Kong 2.0.4, since it has a fix for at least one issue that I know I have, as well as a few fixes that look like they might be relevant to this issue. Unfortunately, there is no image for 2.0.4 available on Docker Hub yet.

Because of some other issues that we encountered in Kong 2.0.x, we decided to downgrade to 1.5.1. It’s only been running for ~2 hours, but it seems to be suffering from the same problem.

Kong 2.0.4 is in progress:

I don’t have any updates on fixing the issue here yet.

Do you have any suggestions on other things that I could check or try?

I did end up experimenting with different numbers of replicas/workers with 2.0.2. With 12 replicas and 1 worker each, I was seeing the same CPU usage increase behavior.

I was wrong earlier about the amount of traffic that Kong is handling, as I misinterpreted the way the “kong.request.count” Datadog metric worked.

It looks like our request rate is ~100-200/sec across all of our Kong instances.

That’s not a lot of traffic for Kong; you shouldn’t need this many workers to handle it.

Now, this will be hard, but is it possible to see what happens if you disable the plugins?
If you disable all the plugins and this behavior goes away, then we can be sure that it is one of the plugins that is the culprit here. If not, then we have to dig deeper into the core itself.
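With the ingress controller, one relatively low-risk way to do that incrementally is to toggle plugins out of the global set rather than deleting them (a sketch; it assumes the global plugins are KongClusterPlugin resources selected via a "global" label, which is how recent controller versions mark them, and the resource name is a placeholder):

# List cluster-wide plugin resources, then take one out of the global set by
# removing its "global" label (the trailing "-" removes the label).
$ kubectl get kongclusterplugins
$ kubectl label kongclusterplugin <plugin-resource-name> global-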