We have been trying to evaluate kong-ingress for some time now in our environment, but we are facing some performance issues. We host on AWS EKS, and our current request flow is: User -> AWS ALB -> nginx ingress -> Kong gateway -> apps
We are looking to replace nginx ingress + Kong gateway with kong-ingress in DB-less mode so that we can use a declarative approach for our Kong configuration.
The problem is that the liveness and readiness probes frequently fail for both containers (proxy and ingress-controller) inside the kong-ingress pod, similar to the events below:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 4m34s (x82 over 3d5h) kubelet, ip-100-64-22-175.eu-central-1.compute.internal Liveness probe failed: Get http://100.64.30.47:9001/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4m30s (x29 over 3d5h) kubelet, ip-100-64-22-175.eu-central-1.compute.internal Readiness probe failed: Get http://100.64.30.47:9001/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
and the containers are killed and recreated.
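For reference, the kubelet's default probe timeoutSeconds is 1, so a worker that is briefly too busy to answer /health within a second gets killed even though it is otherwise healthy. A sketch of inspecting and loosening the timeout, assuming the deployment is named ingress-kong in a kong namespace (substitute your actual names):

```shell
# Show the current liveness probe on each container
# ("kong" namespace and "ingress-kong" deployment are assumptions).
kubectl -n kong get deployment ingress-kong \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.livenessProbe}{"\n"}{end}'

# Give a briefly busy worker more headroom before it is killed.
kubectl -n kong patch deployment ingress-kong --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds",
   "value": 5}
]'
```

If the failures stop after this, the probes were flagging transient busyness rather than a real hang.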
We thought it might be due to load and scaled up to 10 replicas, but we still hit the issue during repetitive functional testing alone, without any load.
Secondly, we tried splitting the containers (as I read in some of the posts here in the discussions) and ran the proxy as a DaemonSet with a quorum of 5 ingress-controller replicas. To our surprise, our calls succeeded some of the time and failed the rest, because the Kong configuration was getting messed up and the proxy configuration (routes and services) was out of sync among the proxies.
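One way to confirm drift like this is to checksum the configuration each proxy is actually serving via its Admin API; a rough sketch, assuming the proxy pods carry an app=kong-proxy label and expose the Admin API on localhost:8001 (both are assumptions to adjust for your install):

```shell
# Hash the route list each proxy has loaded; differing checksums across
# pods suggest the controllers pushed conflicting configuration.
for pod in $(kubectl -n kong get pods -l app=kong-proxy -o name); do
  printf '%s  ' "$pod"
  kubectl -n kong exec "$pod" -c proxy -- \
    curl -s http://localhost:8001/routes | md5sum
done
```

Identical JSON yields identical checksums; differing pagination or field ordering can produce false positives, so diff the raw output if the hashes disagree.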
Also, looking at the metrics exported through Prometheus, we cannot make much sense of the ones under Caching, especially kong_process_events. What does this metric signify? When the containers start fresh it sits around 5%, but as soon as we run a test it jumps to 100% and stays there (we are not requesting/limiting the pod resources in any way).
One more update: we are currently on Kong 0.14.1, which I know is far too old, which is why we are trying to move to the latest version along with the ingress controller. In addition to the above, we have observed that all our upstream latencies have increased: where our 50K-user test previously produced only about 100-120 responses with status 502, we now get around 1500-1700 502 errors with the same app pods behind Kong.
Please check the response below:
HTTP response:
status=
502 Bad Gateway
headers=
Date: Tue, 07 Apr 2020 04:58:13 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 77
Connection: keep-alive
Server: kong/2.0.2
X-Kong-Upstream-Latency: 1200
X-Kong-Proxy-Latency: 0
Via: kong/2.0.2
body=
{
"message": "An invalid response was received from the upstream server"
}
X-Kong-Upstream-Latency used to be between 200-250 ms, and now it is between 900-1200 ms.
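For what it's worth, X-Kong-Upstream-Latency is the time in milliseconds Kong spent waiting for the upstream service to respond, while X-Kong-Proxy-Latency is the time spent inside Kong itself; a proxy latency of 0 with a rising upstream latency points at the app side or the network path to it. A small sketch for pulling both values out of a captured response (the inline headers stand in for output saved with curl -D -):

```shell
# Parse Kong's latency headers from a captured response.
headers='Server: kong/2.0.2
X-Kong-Upstream-Latency: 1200
X-Kong-Proxy-Latency: 0'

upstream=$(printf '%s\n' "$headers" | awk -F': ' '/^X-Kong-Upstream-Latency/ {print $2}')
proxy=$(printf '%s\n' "$headers" | awk -F': ' '/^X-Kong-Proxy-Latency/ {print $2}')
echo "upstream=${upstream}ms proxy=${proxy}ms"
```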
It will be really helpful if anyone can shed some light on this.
This is really strange. Are your k8s worker nodes over-subscribed by a huge amount?
You can do that with DB-mode but not DB-less mode.
That is expected. Because of the way the shm is used, it will be reported at 100%. You don't need to worry about it.
In one post you mention DB-less and then later on you mention 0.14.1 with DB.
Please stick to DB-less and don’t use a database for your use-case.
As to why health-check probes are failing, make sure that you have sufficient resources to run Kong.
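Concretely, if the pod carries no resource requests at all, the scheduler can place it on an already-busy node and the proxy gets starved. A sketch of setting requests and limits with kubectl (deployment, namespace, container name, and the sizes are placeholders to tune for your workload):

```shell
# Reserve CPU/memory for the proxy container so the scheduler accounts
# for it ("kong"/"ingress-kong"/"proxy" and the sizes are assumptions).
kubectl -n kong set resources deployment ingress-kong \
  --containers=proxy \
  --requests=cpu=500m,memory=512Mi \
  --limits=cpu=2,memory=2Gi
```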
Sharing logs of the controller and proxy will help debug it.
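A sketch of collecting those logs, including from a container instance that was already killed by a failed liveness probe (names are placeholders):

```shell
# Current logs from both containers of the kong-ingress pod.
kubectl -n kong logs deploy/ingress-kong -c proxy --since=1h > proxy.log
kubectl -n kong logs deploy/ingress-kong -c ingress-controller --since=1h > controller.log

# Logs from the previous instance of a container that was restarted
# after a failed liveness probe (run against the pod, not the deploy).
kubectl -n kong logs <pod-name> -c proxy --previous
```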
Thanks for clarifying the 2nd and 3rd points. Regarding the 1st: no, our nodes are not that overloaded, and I can hit the probes perfectly fine using http://<pod-ip>:9001/health or http://<pod-ip>:10254/healthz from a separate bash pod in the same namespace.
To clarify: we currently use Kong 0.14.1 (just as a gateway), and we now want to upgrade to the latest version along with introducing kong-ingress.
I have been restarting the containers very frequently (after every performance test) to add/remove environment variables and make other changes, since we are trying out different things to get better performance. As a result I haven't hit the probe timeouts again and have no logs to share yet.
Moreover, our main concern now is getting better performance than our existing setup, which we have not managed; it seems like Kong is throttling somewhere, since we are getting more 502s and the latencies have increased manyfold.
You should generally see higher performance with DB-less setups.
If you can share a Deployment spec of Kong Ingress, that will help.
we are getting more 502 and the latencies have increased manyfold
This seems more like an issue with Kong's integration with the existing cluster and network than any kind of throttling in Kong. Check whether anything is off with the DNS settings in Kong and in the cluster.
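A quick way to sanity-check DNS from inside the proxy container; the service name is a placeholder, and this assumes nslookup is present in the image:

```shell
# Try an upstream Service lookup from inside the proxy container
# (names are placeholders).
kubectl -n kong exec deploy/ingress-kong -c proxy -- \
  nslookup my-app.my-namespace.svc.cluster.local

# See which resolver settings Kong was started with.
kubectl -n kong exec deploy/ingress-kong -c proxy -- env | grep -i KONG_DNS
kubectl -n kong exec deploy/ingress-kong -c proxy -- cat /etc/resolv.conf
```

Slow or failing lookups of upstream Services would show up as inflated X-Kong-Upstream-Latency.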
To be honest, that's because we currently run nginx-ingress in front of the Kong API gateway, where I could see those various NGINX directives, so I thought some of them might be making the difference; but the results are more or less the same.
If I share with you the results of our performance test:
With our current setup (nginx ingress (2 replicas) + kong api gateway (8 replicas)):
We just finished one more test using kong-ingress (with Postgres on AWS RDS, kong-proxy running as a DaemonSet, and 3 ingress-controller replicas), at a lower RPS, and the results look somewhat better:
Hi @ajay, were you able to identify the root cause of this issue?
I am having a similar issue with performance testing. Direct application calls scale up under load; with Kong in front, they aren't scaling.