Performance issue with kong-ingress

Hi All,

We have been evaluating kong-ingress in our environment for some time now, but are facing some performance issues. We host our environment on AWS EKS, and our current request flow is:
User -> AWS ALB -> nginx ingress -> kong gateway -> apps

We are simply looking to replace nginx ingress + kong gateway with kong-ingress in DB-less mode, so that we can use a declarative approach for our kong configuration.

The problem is that the Liveness and Readiness probes frequently fail for both containers (proxy and ingress-controller) inside the kong-ingress pod, similar to the events below:

Events:
  Type     Reason     Age                    From                                                     Message
  ----     ------     ----                   ----                                                     -------
  Warning  Unhealthy  4m34s (x82 over 3d5h)  kubelet, ip-100-64-22-175.eu-central-1.compute.internal  Liveness probe failed: Get http://100.64.30.47:9001/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  4m30s (x29 over 3d5h)  kubelet, ip-100-64-22-175.eu-central-1.compute.internal  Readiness probe failed: Get http://100.64.30.47:9001/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

and the containers are killed and recreated.
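
For context, the probe configuration on the proxy container currently looks like this (the full spec is shared later in the thread); the inline comments mark values we are considering relaxing as an experiment, they are guesses on our part and not an official recommendation:

livenessProbe:
  httpGet:
    path: /health
    port: 9001
    scheme: HTTP
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 1   # experiment: raise to e.g. 5 to rule out a slow /health response
  failureThreshold: 3 # experiment: raise so a brief stall does not kill the container
# the readinessProbe uses the same values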

We thought it might be due to load and scaled up to 10 replicas, but we see the issue even during repetitive functional testing without any load.

Secondly, we tried splitting the containers (as suggested in one of the posts here in the discussions) and running the proxy as a DaemonSet with the ingress-controller as a quorum of 5 replicas. To our surprise, our calls succeeded some of the time and failed the rest of the time, because the kong configuration was getting messed up and the proxy configuration (routes and services) was out of sync among the proxies.

Also, looking at the metrics exported through Prometheus, we don't fully understand the ones under Caching, especially kong_process_events. What does this metric signify? When the containers start fresh it sits at around 5%, but as soon as we run a test it goes to 100% and stays that way (we are not setting any resource limits/requests on the pod).

So our queries are:

  1. Why are the Liveness and Readiness probes failing frequently?
  2. Why does the proxy containers' configuration get out of sync after splitting the containers and running the proxy as a DaemonSet?
  3. What does kong_process_events signify, why does it go to 100% after a single test run, and how do we tackle this if it is a problem?

We are using below versions (deployed using https://github.com/Kong/kubernetes-ingress-controller/blob/master/deploy/single/all-in-one-dbless.yaml):
ingress-controller: 0.7.1
kong: 1.4.3

Please do let me know if any other information is required to understand the problem.

Hi Guys,

Just an update here: I tried again with a Postgres DB as per https://github.com/Kong/kubernetes-ingress-controller/blob/master/deploy/single/all-in-one-postgres.yaml, first with a containerized Postgres and then with an AWS RDS instance, but the performance stays the same.

Moreover, I tried running ingress-kong as a DaemonSet just to see if there was any contention, but it still stays the same.

From what I have observed, we are getting more 502s with the kong-ingress setup.

Any help with my above 3 queries will be much appreciated.

Thanks
Ajay

Hi,

I would like to add one more update. We are currently on Kong 0.14.1, which I know is too old, and that is why we are trying to move to the latest version along with the ingress controller. In addition to the above, we have observed strange behavior: all our upstream latencies have increased. Previously we got only about 100-120 502 responses for our 50K-user test; now we get around 1500-1700 502 errors with the same app pods behind it.
Please check the response below:

HTTP response:
status=
502 Bad Gateway
headers= 
Date: Tue, 07 Apr 2020 04:58:13 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 77
Connection: keep-alive
Server: kong/2.0.2
X-Kong-Upstream-Latency: 1200
X-Kong-Proxy-Latency: 0
Via: kong/2.0.2

body=
{
  "message": "An invalid response was received from the upstream server"
}

X-Kong-Upstream-Latency used to be between 200-250 ms; now it is between 900-1200 ms :frowning: (and X-Kong-Proxy-Latency is 0, which suggests the extra time is spent waiting on the upstream or the network path to it rather than inside Kong's own processing).

It will be really helpful if anyone can shed some light on this.

  1. This is really strange. Are your k8s worker nodes over-subscribed by a huge amount?
  2. You can do that with DB-mode but not DB-less mode.
  3. That is expected. Because of the way the shm is used, it will be reported at 100%. You don’t need to worry about it.

In one post you mention DB-less and then later on you mention 0.14.1 with DB.

Please stick to DB-less and don’t use a database for your use-case.

As to why health-check probes are failing, make sure that you have sufficient resources to run Kong.
Sharing logs of the controller and proxy will help debug it.
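
For example (numbers purely illustrative, not an official sizing recommendation), giving the proxy container explicit requests and limits along these lines makes it easier to rule resource starvation in or out:

resources:
  requests:
    cpu: "1"
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 2Gi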

Thanks Harry for your response. I’ll go through each point and try to share as many logs as possible in the morning.

Best Regards

Ajay Yadav

Hi @hbagdi,

Thanks for clarifying the 2nd and 3rd points. Regarding the 1st: no, our nodes are not that overloaded, and I am able to hit the probes just fine using http://<pod-ip>:9001/health or http://<pod-ip>:10254/healthz from a separate bash pod in the same namespace.

To clarify, we currently use Kong 0.14.1 (just as a gateway), which we now want to upgrade to the latest version while also introducing kong-ingress.

Also, I have been restarting the containers quite frequently (after every performance test) to add/remove environment variables or make other changes while we try different things to get better performance, so I am not hitting the probe timeouts at the moment and cannot yet share logs for them.

Moreover, our main concern now is to get better performance than our existing setup, which we are not able to achieve; it seems like Kong is throttling somewhere, since we are getting more 502s and the latencies have increased manyfold.

You should generally see higher performance with DB-less setups.

If you can share a Deployment spec of Kong Ingress, that will help.

we are getting more 502 and the latencies have increased manyfold

This seems to be more of an issue with Kong’s integration with the existing cluster and network than any kind of throttling in Kong. Check whether there is anything off with the DNS settings in Kong and the cluster.
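
For reference, Kong's DNS behaviour is controlled by kong.conf settings that can be set as environment variables on the proxy container; the values below are just the documented defaults, listed only to show which knobs to inspect, not a tuning recommendation:

- name: KONG_DNS_ORDER          # order in which cached record types are resolved
  value: "LAST,SRV,A,CNAME"
- name: KONG_DNS_STALE_TTL      # seconds a stale record may still be served while it is refreshed
  value: "4"
- name: KONG_DNS_NOT_FOUND_TTL  # cache TTL in seconds for empty/NXDOMAIN answers
  value: "30"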

Hi @hbagdi,

I have been using the deployment spec below:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: ingress-kong
  name: ingress-kong
  namespace: kong
spec:
  replicas: 10
  selector:
    matchLabels:
      app: ingress-kong
  template:
    metadata:
      annotations:
        kuma.io/gateway: enabled
        prometheus.io/port: "9542"
        prometheus.io/scrape: "true"
        traffic.sidecar.istio.io/includeInboundPorts: ""
      labels:
        app: ingress-kong
    spec:
      containers:
      - env:
        - name: KONG_DATABASE
          value: "off"
        - name: KONG_NGINX_WORKER_PROCESSES
          value: "3"
        - name: KONG_NGINX_HTTP_INCLUDE
          value: /kong/servers.conf
        - name: KONG_ADMIN_ACCESS_LOG
          value: /dev/stdout
        - name: KONG_ADMIN_ERROR_LOG
          value: /dev/stderr
        - name: KONG_ADMIN_LISTEN
          value: 0.0.0.0:8001
        - name: KONG_TRUSTED_IPS
          value: "0.0.0.0/0"
        image: kong:2.0
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - kong quit
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9001
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: proxy
        ports:
        - containerPort: 8000
          name: proxy
          protocol: TCP
        - containerPort: 8001
          name: proxy-admin
          protocol: TCP
        - containerPort: 8443
          name: proxy-ssl
          protocol: TCP
        - containerPort: 9542
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9001
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          runAsUser: 1000
        volumeMounts:
        - mountPath: /kong
          name: kong-server-blocks
        resources:
          #limits:
          #  memory: '700Mi'
          #  cpu: '1000m'
          requests:
            cpu: '500m'
            memory: '500Mi'
      - args:
        - /kong-ingress-controller
        - --kong-admin-url=http://localhost:8001
        - --publish-service=kong/kong-proxy
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: kong-docker-kubernetes-ingress-controller.bintray.io/kong-ingress-controller:0.8.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: ingress-controller
        ports:
        - containerPort: 8080
          name: webhook
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
      serviceAccountName: kong-serviceaccount
      volumes:
      - configMap:
          name: kong-server-blocks
        name: kong-server-blocks 

And I have been experimenting with various combinations of the variables below (with and without them):

- name: KONG_NGINX_HTTP_UPSTREAM_KEEPALIVE
  value: "60"
- name: KONG_NGINX_PROXY_PROXY_CONNECT_TIMEOUT
  value: 5s
- name: KONG_NGINX_PROXY_PROXY_SEND_TIMEOUT
  value: 60s
- name: KONG_NGINX_PROXY_PROXY_READ_TIMEOUT
  value: 60s
- name: KONG_NGINX_PROXY_PROXY_BUFFER_SIZE
  value: 16k
- name: KONG_NGINX_PROXY_PROXY_BUFFERS
  value: "4 16k"
- name: KONG_NGINX_PROXY_PROXY_REQUEST_BUFFERING
  value: "on"
- name: KONG_NGINX_PROXY_PROXY_NEXT_UPSTREAM_TRIES
  value: "3"
- name: KONG_NGINX_PROXY_CLIENT_MAX_BODY_SIZE
  value: 25m
- name: KONG_NGINX_PROXY_PROXY_HTTP_VERSION
  value: "1.1"
- name: KONG_NGINX_PROXY_PROXY_SET_HEADER
  value: "Connection \"\""

To be honest, that is because we currently run nginx-ingress in front of the Kong API gateway, where I could see these various nginx directives, so I thought maybe some of them were making a difference; but the results are more or less the same.

Sharing the results of our performance tests:

  1. With our current setup (nginx ingress (2 replicas) + kong api gateway (8 replicas)):

  2. With kong-ingress (DB-less, 10 replicas, each with 3 workers):

We just finished one more test using kong-ingress (with Postgres on AWS RDS, kong-proxy running as a DaemonSet, and 3 instances of the ingress-controller) at a lower RPS, and the results look somewhat better:

Hi @ajay, were you able to identify the root cause of this issue?
I am having a similar issue with performance testing. Direct application calls under load scale up; with Kong they aren't scaling.