Linear increase in ingress-controller memory with increasing number of ingress routes

Why do we observe a linear increase in memory consumption for the ingress-controller container as the number of Ingress resources increases?

According to my understanding, the ingress controller watches the Kubernetes API for Ingress/Service/etc. resource updates, and when it sees them, it sends requests to the Kong proxy’s admin API to create Kong configuration, which is then stored in Kong’s database.

Is there any caching mechanism within the ingress controller? Please clarify.

Thanks.

Copied from the other thread where this was asked. Didn’t see the new thread initially; makes more sense to make this a new discussion:

How many Ingresses are you creating?

The controller does cache objects it builds config from, and there is a linear correspondence between the size of that cache/the size of generated configuration and the number of watched resources. 2.5GB seems rather high given what I’ve seen in practice, however.

There is a --profiling flag to enable pprof metrics collection. It may make sense to take a look at that to see if there’s anything unusual about the memory allocation.
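For reference, pprof data from a Go process is normally exposed over HTTP via net/http/pprof. A minimal sketch of that pattern (not the controller’s actual wiring, and the listen address here is just an assumption) looks like this:

```go
// Minimal sketch of exposing Go pprof endpoints over HTTP. Illustrative only:
// the controller enables this behind its --profiling flag, and the address
// below is an assumption, not the controller's real listen address.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// Heap, goroutine, and allocation profiles become available under
	// /debug/pprof/ once this server is running.
	log.Fatal(http.ListenAndServe("localhost:10256", nil))
}
```

With something like that running, `go tool pprof http://localhost:10256/debug/pprof/heap` pulls a heap profile you can inspect interactively.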

We’re working on getting a better understanding of the controller’s performance characteristics and improving problem areas in upcoming versions. Historically that hasn’t been much of a concern since the controller is normally fairly lightweight compared to Kong itself, so we were more focused on building out functionality first and optimizing performance after.


Thanks for your reply.
We are currently testing the maximum number of Ingress objects that Kong (dbmode: postgres) can support with a resource allocation of 4GB memory and 2 cores (evaluated as 1 unit).

When we started creating Ingress objects (each with 2 plugins and 1 KongIngress), we found that the ingress-controller container’s memory consumption grew linearly with the number of Ingress objects created.

Since our use case requires supporting the maximum number of ingress routes with decent throughput, we were able to achieve ~25K Ingress objects with a throughput of 3K/sec using the resource allocation below:
Memory (4GB):
  Ingress-controller container: 3GB
  Proxy container: 1GB
CPU (2 cores):
  Ingress-controller container: 1 core
  Proxy container: 1 core

This leads to a few questions we had:

  1. You mentioned that there’s a caching mechanism in the ingress controller whose footprint corresponds to the number of watched resources. Is this the same memcached in-memory store mentioned in the docs, whose size we can configure? Or is it non-configurable, so that it will always keep all watched resource configurations in memory and load them all back from the database upon restart?

  2. We have observed that the proxy container also consumes memory through warmup cache routines that preload services, plugins, and DNS entries into the core cache.
    Does the proxy container contain its own memcached keystore? We have observed that memory consumption also increases for the proxy container, but only until it reaches its allocated limit; after that it operates well within its limits. In contrast, for the ingress-controller container, if we try to create more Ingress objects we see container failures with continuous restarts due to OOM errors.

Please shed some light on the memory usage and caching patterns/flow to clarify our doubts.

The cache in question is part of client-go. We use the most basic version, which doesn’t have any mechanism for expiring objects other than deleting them entirely. It will load resources of interest at startup, add new ones as they’re created, and remove them if they’re deleted.

I didn’t dig into the implementation, but my brief read of the library docs indicates that it just uses a Go map; there’s no external cache implementation like memcached (not sure where you saw that: as far as I know, the only place we use memcached anywhere in our product suite is the OIDC plugin, where it’s an optional store for session data).

There’s no way to size-limit that cache and I don’t think it’d be advisable to impose one: the purpose of the controller cache is to reduce hits to etcd, and clearing cached resources that it actually needs would require that it fetch them again from etcd, which could result in broader cluster instability if it’s fetching too much. The only stock policy to expire resources from cache is TTL-based, which is designed for applications that only operate on a resource briefly. The ingress controller generates configuration from the full set of relevant resources each time it generates configuration, so that wouldn’t work for us.
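To make “the most basic version” concrete, here’s a rough sketch of the client-go store pattern in play (a simplification of the informer machinery, not our actual code): objects are keyed by namespace/name, live in a plain in-memory map, and only leave when they’re deleted.

```go
// Rough sketch of the client-go store behavior described above: objects are
// keyed by "namespace/name" and stay in memory until deleted. There is no
// size limit and no expiry. Simplified illustration, not the controller's code.
package main

import (
	"fmt"

	netv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// The basic store: a thread-safe map keyed by "namespace/name".
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)

	ing := &netv1.Ingress{
		ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "example"},
	}

	// Watch events translate into Add/Update/Delete calls; nothing else ever
	// removes an entry, so memory grows with the number of watched objects.
	_ = store.Add(ing)
	fmt.Println("cached objects:", len(store.List()))

	_ = store.Delete(ing)
	fmt.Println("cached objects after delete:", len(store.List()))
}
```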

2.5GB for 100k objects does seem unusually high based on our testing. We observed 356MiB of usage with 30k resources (split evenly between 3 types of resources) and 560MiB for 40k, so about 10-14kB/resource. We don’t expect the distribution of resources to matter much, since they’re all roughly the same size and all cache the serialized Go struct built from their YAML spec only (we don’t generate any persistent derived resources from anything, and the ephemeral resources we generate when building configuration use very similar structs).
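As a rough back-of-envelope using those numbers (and treating your setup as roughly 100k cached resources, per the counts above):

$$
\frac{356\ \mathrm{MiB}}{30{,}000} \approx 12\ \mathrm{KiB/resource},\qquad
\frac{560\ \mathrm{MiB}}{40{,}000} \approx 14\ \mathrm{KiB/resource},\qquad
100{,}000 \times 14\ \mathrm{KiB} \approx 1.4\ \mathrm{GiB}
$$

so even at the higher per-resource figure we’d expect something closer to 1.4 GiB than the 2.5–3 GB being reported.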

While your usage does seem beyond what we expect, most future performance optimization will go into the upcoming 2.x version of the controller, where we’ve refactored around new Kubernetes client libraries created since our controller’s original implementation. We observed a small reduction in memory usage with it (roughly 80% of the 1.x usage in the larger test), which is about what we expected: it doesn’t use a different caching strategy, so the change is likely due to other, non-scaling memory consumption elsewhere. If you’re interested in testing against it, the latest alpha release is not expected to change much. We’re still working on finalizing documentation for a beta soon-ish, but it’s a drop-in replacement for 1.x for the most part (a few more obscure flags are the only breaking changes we know of). Alternately, if you can provide your test resource YAMLs, we can try to reproduce your results independently.

The proxy uses an internal Lua cache of entities in Postgres when using database-backed mode. The general forum may be better able to answer questions about the specifics, but in general it functions as you describe, just without any external caching system. It’s better able to impose a limit because it can evaluate how useful an entity is based on observed requests and use a least-recently-used policy to discard cached entities it hasn’t used recently.
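To illustrate the least-recently-used idea in isolation (a conceptual Go sketch, not Kong’s actual Lua cache; the size cap and key names are made up): a bounded cache that tracks access order and evicts whatever was touched longest ago once it hits its cap.

```go
// Conceptual LRU cache sketch to illustrate the eviction strategy described
// above. This is NOT Kong's cache implementation; it just shows how a
// size-capped, least-recently-used store can discard entities that haven't
// been requested recently.
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key   string
	value string
}

type lruCache struct {
	limit int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> element holding an *entry
}

func newLRU(limit int) *lruCache {
	return &lruCache{limit: limit, order: list.New(), items: map[string]*list.Element{}}
}

func (c *lruCache) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el) // mark as recently used
	return el.Value.(*entry).value, true
}

func (c *lruCache) Put(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, value})
	if c.order.Len() > c.limit {
		// Evict the least recently used entity, like a proxy dropping cached
		// config for an entity it hasn't served a request for in a while.
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newLRU(2)
	c.Put("service-a", "config-a")
	c.Put("service-b", "config-b")
	c.Get("service-a")             // service-a is now most recently used
	c.Put("service-c", "config-c") // evicts service-b
	_, ok := c.Get("service-b")
	fmt.Println("service-b still cached:", ok) // false
}
```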

The controller can’t really use the same strategy since it’s always generating a config representing all resources to either apply in its entirety (in DB-less mode) or diff against the current Kong configuration (in DB-backed mode). We can’t easily get away from that without the possibility of config drift, so our recommendation to users with larger configurations has been to move some of their configuration out of Kubernetes resources (it’s more common to have a number of consumers that’s an order of magnitude larger than the number of routes/Ingresses) and manage it in the Kong database alone.

In your case, are you able to reduce the number of Ingresses by using more rules per Ingress? I’m not sure how much that would reduce the total size, as we haven’t researched it much, but there should be some reduction from cutting down on repetitive boilerplate metadata in favor of meaningful configuration. Another strategy would be to separate the controller from your Deployment: in database-backed mode there’s no reason to run one in each Pod alongside Kong, and we simply default to that because it’s simple and allows us to hide the admin API as a basic zero-configuration security measure. You can run the controller independently, which would not reduce its memory consumption, but would mean that you don’t have to account for the limit on each Kong replica. In database mode, only one controller replica actively submits configuration updates, but each replica always pulls resources into cache.
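On the “more rules per Ingress” point above, here’s a rough sketch of what consolidation looks like using the networking.k8s.io/v1 Go types (hostnames, service names, and ports are made up; in practice you’d write the equivalent YAML):

```go
// Sketch of consolidating multiple routes into a single Ingress using the
// networking.k8s.io/v1 Go types. Hostnames, service names, and ports are
// invented for illustration; the point is one Ingress object carrying several
// rules instead of one object per route.
package main

import (
	"fmt"

	netv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func rule(host, svc string, port int32) netv1.IngressRule {
	pathType := netv1.PathTypePrefix
	return netv1.IngressRule{
		Host: host,
		IngressRuleValue: netv1.IngressRuleValue{
			HTTP: &netv1.HTTPIngressRuleValue{
				Paths: []netv1.HTTPIngressPath{{
					Path:     "/",
					PathType: &pathType,
					Backend: netv1.IngressBackend{
						Service: &netv1.IngressServiceBackend{
							Name: svc,
							Port: netv1.ServiceBackendPort{Number: port},
						},
					},
				}},
			},
		},
	}
}

func main() {
	className := "kong"
	ing := netv1.Ingress{
		ObjectMeta: metav1.ObjectMeta{Name: "consolidated", Namespace: "default"},
		Spec: netv1.IngressSpec{
			IngressClassName: &className,
			// Two routes share one object's metadata instead of paying the
			// per-Ingress boilerplate twice.
			Rules: []netv1.IngressRule{
				rule("a.example.com", "service-a", 80),
				rule("b.example.com", "service-b", 80),
			},
		},
	}
	out, _ := yaml.Marshal(ing)
	fmt.Println(string(out))
}
```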

I don’t know that our planned optimizations would help much in your case: our first issue to tackle is making our Kubernetes resource cache less naive, as it currently pulls in all resources of a given type and only filters out the relevant ones when building configuration. We will pull a Service into cache even if no Ingress references it, simply because we lack any filter, and thus pull in many resources we’ll never incorporate into config. Improving that to filter out irrelevant resources before they reach the cache is an easy win for most environments, but it wouldn’t make much difference if you actually have a large number of resources that will be rendered into config.


Thanks for this detailed explanation. Really appreciate it. Couldn’t have asked for more.
I think I now understand both in-memory cache use cases.
I get that we can’t do anything about the ingress controller’s cache storage, although I’m still wondering how it consumes ~3GB for 100K resources when our average YAML file size is approx. 2KB!
Link to sample resource yaml files
Please have a look at these samples.
If we can’t do anything about this, your recommendation of running standalone controller nodes seems the way to go. What I understood is that it removes the need to maintain a redundant ingress-controller instance (along with its cached data) for each Kong proxy replica.
If I understand correctly, we can have a one-to-many mapping between an ingress-controller instance and proxy instances. What I mean is: if we have 10K Ingress resources, a single ingress controller can handle configuration syncing for all of them against, say, a cluster of 3 Kong proxy instances, which in turn take care of serving traffic!
And going one step further: the standalone controller configuration docs mention that we need separate ingressClass names for different ingress controllers. I think that’s equivalent to running multiple ingress controllers within the same Kubernetes cluster. In my view, each ingress controller should have its own separate cluster of proxy instances pointing to a separate database, to keep different Kong clusters isolated from each other.

I attempted my own run (again with the new 2.x controller codebase, so not quite the same as what you’d see on 1.x) with 100k resources to gather pprof data: Analyze 2.x config update frequency and fluctuation · Issue #1519 · Kong/kubernetes-ingress-controller · GitHub

I’m not 100% sure how to interpret that effectively (in particular, I’m not sure why the heap seems so much smaller in that data than what the metrics show), but assuming it’s proportional, there’s not much of interest beyond what we already know: memory scales with watched resources, so we should pare down the total we watch.

You can indeed have a one-to-many controllers to proxies relationship, since the controller just needs to interact with one of the proxy instances to insert configuration into Postgres, which will then share it with the other proxy instances.

Separating ingress classes won’t actually help yet, since the actual watches don’t filter anything: we just fetch all Ingresses, Services, etc. from Kubernetes and then filter them afterwards. A controller configured to use class foo will ingest Ingresses with classes foo and bar equally, but will only turn the foo Ingresses into actual proxy configuration. Splitting that now will likely just get you multiple controller instances that have about the same memory usage since they’re not excluding irrelevant resources from their cache.

That is the next major performance improvement we intend to target, but it has some complexity: we can easily filter resources with classes, but most lack them. An Ingress has a class, but the Service it points to does not, so we need to somehow perform an initial filter on labeled resources and then derive a graph of unlabeled resources they reference (doable, but much more complex than our current naive “give us everything” approach).
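To sketch what that derivation might look like (a simplified illustration, not our implementation; it checks only spec.ingressClassName and ignores the legacy class annotation): filter Ingresses by class first, then keep only the Services those Ingresses actually reference.

```go
// Simplified sketch of the filtering described above: select Ingresses by
// class, then derive the set of Services they reference and drop everything
// else before it ever reaches the cache. Not the controller's real code.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	netv1 "k8s.io/api/networking/v1"
)

// relevantServices returns only the Services referenced by Ingresses of the
// given class. Ingresses of other classes, and Services nothing points at,
// are filtered out.
func relevantServices(class string, ingresses []netv1.Ingress, services []corev1.Service) []corev1.Service {
	referenced := map[string]bool{} // "namespace/name" of referenced Services

	for _, ing := range ingresses {
		if ing.Spec.IngressClassName == nil || *ing.Spec.IngressClassName != class {
			continue // wrong class: contributes nothing to our config
		}
		for _, rule := range ing.Spec.Rules {
			if rule.HTTP == nil {
				continue
			}
			for _, path := range rule.HTTP.Paths {
				if path.Backend.Service != nil {
					// Backends always reference Services in the Ingress's own namespace.
					referenced[ing.Namespace+"/"+path.Backend.Service.Name] = true
				}
			}
		}
	}

	var keep []corev1.Service
	for _, svc := range services {
		if referenced[svc.Namespace+"/"+svc.Name] {
			keep = append(keep, svc)
		}
	}
	return keep
}

func main() {
	fmt.Println(len(relevantServices("kong", nil, nil))) // 0 with no input
}
```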