Kong 1.4, K8S, DB-less, 504 Gateway timeout

Hi all,

We have 3 Kong replicas deployed in Kubernetes (from the stable/kong 0.19.1 Helm chart). We have upgraded Kong to 1.4 using the alpine images. Kong is configured in DB-less mode.
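
For reference, the deployment can be driven with Helm values along these lines (a rough sketch; the release name, image tag and value keys below are assumptions based on the stable/kong chart, not our exact command):

# Rough sketch of the Helm upgrade to Kong 1.4 alpine in DB-less mode.
# Release name "stingray", the image tag and the value keys are assumptions based on the stable/kong chart.
helm upgrade stingray stable/kong --version 0.19.1 \
  --set replicaCount=3 \
  --set image.repository=kong \
  --set image.tag=1.4-alpine \
  --set-string env.database=off \
  --set ingressController.enabled=true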

We have verified that when the edge load balancer sends traffic to a particular replica, a particular route does not respond and returns a 504 Gateway Timeout after 60 seconds: 2 out of 3 replicas fail and 1 out of 3 responds correctly.

When accessing the route through a different replica there is no issue.

The next picture shows how requests balanced to pod “stingray-kong-6bf98f84d6-q7vrn” respond properly with a 200, while requests to the other two pods fail with a client-closed-request (499).

This behaviour cannot be reproduced when requesting other apps; for those, all three nodes serve requests correctly.

We can also see that the pod which serves that route properly has a different memory consumption.

The default Kong Ingress configuration is used, and no plugins are applied to that Ingress.

Updated with more info:

  • Restarting the Kong deployment makes the issue disappear for some hours (see the sketch after this list).
  • If the upstream is removed (by deleting the pod) the issue persists, and restarting the upstream does not help. The upstream itself does not seem to be the problem; Kong may simply be unable to proxy the request.
  • Deleting the ingress and creating it again doesn’t fix the issue.
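
The restart mentioned in the first bullet is done roughly like this (a sketch; the namespace and deployment name are assumptions based on the pod names above):

# Workaround sketch: force the Kong pods to be recreated by the Deployment.
# Namespace ("ingress") and deployment name ("stingray-kong") are assumptions.
kubectl -n ingress scale deployment stingray-kong --replicas=0
kubectl -n ingress scale deployment stingray-kong --replicas=3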

Interesting.

Do you see any error in the controller of the two pods that were not able to serve requests?

Nothing!

Currently one of them has been deleted and Kong returns 503 (the right behaviour) and 504 alternately.
We can access directly with port-forwarding without any issue.

Attached is recent evidence gathered with ApacheBench. Kong pods “-7rw…” and “-2cv…” return 504, while “-dnx…” returns 503.
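
The run was along these lines (a sketch; host, path and request counts are placeholders):

# ApacheBench sketch used to gather the evidence above; host, path and counts are placeholders.
ab -n 300 -c 10 https://<kong-host>/testIssue3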

It seems that the unstable pods are not releasing memory after 17:30.

What is your k8s environment and version?
The issue doesn’t seem to be with Kong itself but with Kubernetes or the load balancer in front of Kong.

Kong is responding with 499, meaning the load balancer closed the connection before Kong could send back the response.

Some things to ensure:

  1. Is this issue happening with specific k8s worker nodes? Kong might not be able to reach services in the rest of the cluster due to a networking problem (see the sketch after this list).
  2. If you are running in a cloud provider, make sure the load balancer is able to reach Kong pods in other zones.
  3. All the worker nodes have proper networking configured and are not out of sync in any way.
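
A quick sketch for point 1 (namespace, label selector and the backend address are examples; it assumes curl, or a substitute like wget, is available in the Kong image):

# From each Kong pod, try to reach the backend service directly, bypassing Kong's proxy.
# Namespace, label selector and backend address are examples; swap in your own values.
for pod in $(kubectl -n ingress get pods -l app=kong -o name); do
  echo "== ${pod}"
  kubectl -n ingress exec "${pod#pod/}" -- \
    curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://jenkins.ci.svc.cluster.local:8080/
done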

Thank you,

  • Each of the 3 replicas is deployed on a different node.
  • Currently there is no upstream target; it is not deployed.
  • The load balancer has been ruled out: I have just tried requesting each pod directly (with port-forwarding) using curl, and only “-dnx” responds properly (sketch after this list).
  • The Kubernetes version is 1.11.
  • We have a lot of environments deployed in the cluster and these are the only two pods with the issue.
  • One pod is Jenkins and the other is this: https://hub.docker.com/r/brndnmtthws/nginx-echo-headers/
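
The per-pod test from the third bullet looks roughly like this (pod name, proxy port and Host header are placeholders):

# Port-forward one Kong pod at a time and curl it directly, taking the external LB out of the picture.
# Pod name, proxy port (8000) and the Host header are placeholders.
kubectl -n ingress port-forward <kong-pod> 8000:8000 &
# (give the port-forward a moment to establish before curling)
curl -s -o /dev/null -w "%{http_code}\n" -H "Host: <ingress-host>" http://127.0.0.1:8000/testIssue3
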
  1. Do you mean that only two services that are being proxied by Kong have this issue, or is it two Kong pods that have this issue?

  2. After LB is removed, what error do you get back from Kong when you curl it directly over the port-forwarding tunnel?

Yes, only these two services have the issue; the others work perfectly.

I have repeated the test, requesting Jenkins resources directly from a Kong pod (without the ELB), and this is the result:

> 
> 2019/11/21 21:42:46 [error] 36#0: *182367757 [lua] init.lua:800: balancer(): failed to retry the dns/balancer resolver for jenkins.ci.svc' with: dns server error: 100 cache only lookup failed while connecting to upstream, client: 127.0.0.1, server: kong, request: "GET /testIssue3 HTTP/1.1", upstream: "http://100.67.34.45:80/testIssue3", host: "****" 
> 
> 
> 2019/11/21 21:42:46 [error] 36#0: *182367757 [lua] balancer.lua:900: balancer_execute(): DNS resolution failed: dns server error: 100 cache only lookup failed. Tried: ["(short)jenkins.ci.svc:(na) - cache-miss","jenkins.ci.svc.ingress.svc.cluster.local:33 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.svc.cluster.local:33 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.cluster.local:33 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.eu-west-1.compute.internal:33 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc:33 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.ingress.svc.cluster.local:1 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.svc.cluster.local:1 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.cluster.local:1 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.eu-west-1.compute.internal:1 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc:1 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.ingress.svc.cluster.local:5 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.svc.cluster.local:5 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.cluster.local:5 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.eu-west-1.compute.internal:5 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc:5 - cache only lookup failed/dns server error: 100 cache only lookup failed"] while connecting to upstream, client: 127.0.0.1, server: kong, request: "GET /testIssue3 HTTP/1.1", upstream: "http://100.67.34.45:80/testIssue3", host: "*****" t
> 
> 
> 2019/11/21 21:42:46 [error] 36#0: *182367757 upstream timed out (110: Operation timed out) while connecting to upstream, client: 127.0.0.1, server: kong, request: "GET /testIssue3 HTTP/1.1", upstream: "http://100.67.34.45:80/testIssue3"

I have tried an nslookup inside the Kong pod and jenkins.ci.svc resolves correctly to 100.67.34.45.
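The check was roughly (pod name is a placeholder):

# DNS sanity check from inside a Kong pod; the pod name is a placeholder.
kubectl -n ingress exec -it <kong-pod> -- nslookup jenkins.ci.svc
# Expected answer: 100.67.34.45, the ClusterIP of the jenkins service.
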
The Jenkins Kubernetes service has the following config:

kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"name":"jenkins"},"name":"jenkins","namespace":"ci"},"spec":{"ports":[{"name":"jenkins-http","port":8080,"protocol":"TCP","targetPort":8080}],"selector":{"name":"jenkins"}}}
  creationTimestamp: "2019-07-30T12:56:37Z"
  labels:
    name: jenkins
  name: jenkins
  namespace: ci
  resourceVersion: "50311438"
  selfLink: /api/v1/namespaces/ci/services/jenkins
  uid: 777266da-b2c9-11e9-9c10-027d1765881c
spec:
  clusterIP: 100.67.34.45
  ports:
  - name: jenkins-http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    name: jenkins
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Why do the traces show the upstream with port 80?
If I get the upstream from the Kong Admin API it displays the right Jenkins pod IP and port.
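
The Admin API lookup was along these lines (a sketch; the admin port-forward and the upstream name jenkins.ci.svc are assumptions):

# Inspect the upstream targets through the Admin API of one Kong pod.
# The admin port (8444, HTTPS) and the upstream name are assumptions.
kubectl -n ingress port-forward <kong-pod> 8444:8444 &
curl -sk https://127.0.0.1:8444/upstreams/jenkins.ci.svc/targets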

{
  "data": [
    {
      "created_at": 1574373014.286,
      "upstream": {
        "id": "e9753ef9-c994-59e5-b348-4a132b793242"
      },
      "id": "9f8052ed-8645-5129-8eab-84f168238129",
      "target": "100.96.8.221:8080",
      "weight": 100
    }
  ]
}


NAME            ENDPOINTS            AGE
jenkins         100.96.8.221:8080    114d
jenkins-agent   100.96.8.221:50000   114d

Can you share the Ingress resource for Jenkins?

By any chance, do you have the ingress.kubernetes.io/service-upstream annotation applied anywhere?

Yes!

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    configuration.konghq.com: jenkins-ingress
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"Ingress","metadata":{"annotations":{"kubernetes.io/ingress.class":"kong"},"name":"jenkins-ingress","namespace":"ci"},"spec":{"rules":[{"host":"****","http":{"paths":[{"backend":{"serviceName":"jenkins","servicePort":"jenkins-http"}}]}}]}}
    kubernetes.io/ingress.class: kong
  creationTimestamp: "2019-11-15T14:08:19Z"
  generation: 3
  name: jenkins-ingress
  namespace: ci
  resourceVersion: "51274360"
  selfLink: /apis/extensions/v1beta1/namespaces/ci/ingresses/jenkins-ingress
  uid: 607ae6ed-07b1-11ea-9c10-027d1765881c
spec:
  rules:
  - host: *****
    http:
      paths:
      - backend:
          serviceName: jenkins
          servicePort: jenkins-http
status:
  loadBalancer:
    ingress:
    - hostname: ****

We are not using service-upstream annotations.
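
A quick sketch of how to double-check that:

# Confirm that no Ingress or Service carries the service-upstream annotation.
kubectl get ingress,svc --all-namespaces -o json | grep -i "service-upstream" \
  || echo "no service-upstream annotation found"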

And this is the KongIngress resource content:

proxy:
  protocol: http
route:
  connect_timeout: 20000
  preserve_host: false
  regex_priority: 1
  strip_path: false

There is another interesting point here: the “proxy” and “route” config works perfectly within the KongIngress CRD, but the “upstream” section doesn’t. We tried to update the http_path and http_status for the healthchecks, but the Kong Admin API always returns the default upstream configuration.
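
What we tried looks roughly like the sketch below (the healthcheck values are illustrative, the field layout follows Kong's upstream schema as mirrored by the KongIngress CRD, and the Admin API call assumes a port-forward on 8444):

# Sketch of the attempted change: add an "upstream" section with active healthchecks to the
# KongIngress, then read the upstream back from the Admin API to see whether it was applied.
# The healthcheck values below are illustrative, not our exact settings.
kubectl -n ci apply -f - <<'EOF'
apiVersion: configuration.konghq.com/v1
kind: KongIngress
metadata:
  name: jenkins-ingress
proxy:
  protocol: http
route:
  connect_timeout: 20000
  preserve_host: false
  regex_priority: 1
  strip_path: false
upstream:
  healthchecks:
    active:
      http_path: /login
      healthy:
        http_statuses: [200, 302]
        interval: 5
EOF
# Check what Kong actually holds for the upstream (the upstream name is an assumption):
curl -sk https://127.0.0.1:8444/upstreams/jenkins.ci.svc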

Okay, nothing interesting here.
Kong should not be doing a DNS lookup for the jenkins.ci.svc hostname. That should be coming from Upstreams.

Do you see an upstream in Kong’s Admin API associated with jenkins.ci.svc (and also targets)?

Yes! The upstream and targets are discovered correctly by Kong. However, we get the same issue both with and without targets.

The first picture (NGINX traces) in this topic shows 200 and 499 responses when the pod is running and the target is discovered.

The third picture, also NGINX traces, shows 503 and 499 responses when the pod is not running.

Currently, with the pod running, the target matches the pod IP and port:

{
  "created_at": 1574461462.702,
  "upstream": {
    "id": "e9753ef9-c994-59e5-b348-4a132b793242"
  },
  "id": "97a6c375-0209-516a-948e-964bacce09f8",
  "target": "100.96.8.25:8080",
  "weight": 100,
  "health": "HEALTHCHECKS_OFF"
}
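
For completeness, the target and its health state were pulled with something like this (admin port-forward assumed; the upstream ID is the one shown earlier):

# List the targets and their health for the jenkins upstream (admin port-forward assumed).
curl -sk https://127.0.0.1:8444/upstreams/e9753ef9-c994-59e5-b348-4a132b793242/targets
curl -sk https://127.0.0.1:8444/upstreams/e9753ef9-c994-59e5-b348-4a132b793242/health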

Hi!

I have realised that the Grafana Kong dashboard shows the following result:

What is happening with kong_process_events?

You can ignore kong_process_events being full. It is allocated to its full size but is not in complete use. At least, that’s the case in most DB-less deployments.

Regarding the original problem, I’ve no clue why Kong is trying to resolve the DNS name jenkins.ci.svc.
Can you make sure that the route -> service -> upstream -> target connections hold up correctly?
Meaning, is the service <-> upstream connection set up correctly?
The two are connected when the host property of the service object is the same as the name property of the upstream object.
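
A quick sketch to verify that (admin address is an example):

# Verify that each Kong service's "host" has a matching upstream "name".
# If there is no matching upstream, Kong falls back to a plain DNS lookup of the host,
# which would explain the dns/balancer errors shown earlier. Admin address is an example.
curl -sk https://127.0.0.1:8444/services | grep -o '"host":"[^"]*"'
curl -sk https://127.0.0.1:8444/upstreams | grep -o '"name":"[^"]*"'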

I have applied the changes described in this topic and I will respond with feedback.

These changes didn’t fix my issue; I will try to investigate more.

After one week running Kong 1.4.1, I can confirm that the new version solves this issue.

Glad to finally hear this!
Thank you for leaving this message here.

Hello abenitovsc,
I have the same issue. Is the cause of the issue clear now?
Please let me know which version of Kong you are using.

“balancer.lua:917: balancer_execute(): DNS resolution failed: dns server error: 100 cache only lookup failed.”

I am using Kong 1.4.2 with Kong Ingress Controller 0.6.0 in DB-less mode.

Hello,

The real issue here was the 504 HTTP code. Can you provide metrics for your Kong pods?
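
For example, something like this would already help (requires metrics-server; the namespace is a placeholder):

# Basic pod metrics for the Kong deployment; requires metrics-server, namespace is a placeholder.
kubectl -n <kong-namespace> top pods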

It is different from my issue.
I use a cookie to store the JWT token.
If there are 10 or more JWT tokens, Kong raises the error "DNS resolution failed: dns server error: 100 cache only lookup failed".
I saw your error message from Nov '19, so I thought it was the same issue, but I think the original issue is different.