We recently upgraded Kong from 0.11.2 to 0.14.0, and this resulted in a significant increase in database (Postgres) load from the cluster. This led to much higher latency for our API. Increasing our Postgres CPU capacity has brought latency back down to normal levels, but we’d rather not spend the extra money.
Our cluster uses the rate-limiting and key-auth plugins.
I see from the changelog that there was a move from using Serf to the database for cache invalidation - is this likely to be the cause?
We tried tuning the mem_cache_size and db_update_frequency settings, but it had no impact.
Is there anything we can do to get our postgres load back to how it was, or is this the new normal with Kong?
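For context, these are the kong.conf knobs mentioned above (the values shown are illustrative examples, not recommendations):

```
# kong.conf
mem_cache_size = 128m        # size of the in-memory entity cache per worker
db_update_frequency = 5      # seconds between polls for cache invalidation events
```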
Community member here. I would assume the move from Serf to database-driven cache invalidation would put more stress on the DB itself (although I believe Kong had many problems around Serf as well, which is why they moved away from it). Do you tend to add/edit/delete Kong resources frequently? At high volume, rebuilding those entities (routes/services/APIs) in Kong can add stress on the DB, though they are cooking up a fix for that right now. Also, is the rate-limiting plugin you run using the DB-backed cluster policy or the local node policy?
I would highly advise running the local node policy, and just thinking about the number of Kong nodes you have behind your LB to get an even distribution, if you are currently running cluster mode against your DB. We have found cluster mode highly DB-intensive, to the point of breaking our DB under decent traffic. I am almost certain this is related to the rate-limiting settings you are running.
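To illustrate the trade-off being discussed (this is a simplified sketch, not Kong’s actual implementation): the cluster policy keeps one shared counter in the database, costing DB operations on every request, while the local policy keeps an independent in-memory counter per node, which eliminates that DB traffic but lets a limit of L admit up to L × num_nodes requests cluster-wide.

```python
class ClusterCounter:
    """Shared counter: one DB read + one DB write per request (the expensive part)."""
    def __init__(self):
        self.count = 0
        self.db_ops = 0

    def allow(self, limit):
        self.db_ops += 2          # read current value, write incremented value
        if self.count >= limit:
            return False
        self.count += 1
        return True


class LocalCounter:
    """Per-node in-memory counter: no DB traffic at all for counting."""
    def __init__(self):
        self.count = 0

    def allow(self, limit):
        if self.count >= limit:
            return False
        self.count += 1
        return True


limit, nodes, requests = 100, 3, 500

shared = ClusterCounter()
cluster_allowed = sum(shared.allow(limit) for _ in range(requests))

node_counters = [LocalCounter() for _ in range(nodes)]
# Perfectly even load balancing: requests round-robin across nodes.
local_allowed = sum(node_counters[i % nodes].allow(limit) for i in range(requests))

print(cluster_allowed)   # exact limit enforcement, at the cost of DB operations
print(shared.db_ops)
print(local_allowed)     # effective limit becomes roughly limit * nodes
```

With an even LB you can approximate the old behavior by dividing your configured limit by the node count; with a skewed LB (as below) that arithmetic breaks down.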
@tgu Thanks for sharing your troubles with us. Interestingly though, Kong 0.11 and Kong 0.14 are not much different, caching-wise. In fact, the dependency on Serf was removed starting with 0.11.0 (see the Changelog).
May I ask, how many plugins do you have configured, roughly? Not just which plugins, but how many instances of the rate-limiting and key-auth plugins? Knowing the answer to @jeremyjpj0916’s question about the frequency of edits made to the Kong resources (Admin API) would also be helpful.
Thank you for your replies, @jeremyjpj0916 and @thibaultcha.
- The frequency of edits to resources via the admin API is very low - a handful per day.
- We are using cluster mode for rate-limiting.
- The number of plugins (rate-limiting per consumer) is large: ~13,000.
We can switch to local policy, potentially sacrificing accuracy for cost, though it’s odd that this was not an issue in v0.11.
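If we do switch, the change should be a single Admin API call per plugin instance. A sketch of the request (the plugin id is a placeholder; list the real ids via GET /plugins):

```
curl -s -X PATCH http://localhost:8001/plugins/<plugin-id> \
  --data "config.policy=local"
```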
Yeah, I think it depends on how confident you are in your LB to distribute traffic evenly. Internally our LB is bad: it sends 100% of the traffic in one direction for 30 seconds, then 100% in the other direction for the next 30 seconds. So our local policy is set under the assumption that the Kong node in one DC is getting essentially all of the traffic. Our use case for rate limiting is just semi-accurate prevention of overstressing back-ends, but everyone’s use case is different. It would be interesting to see before/after charts if you go from cluster to local.