Kong 1.4, K8S, DB-less, 504 Gateway timeout

Hi all,

We have 3 Kong replicas deployed in Kubernetes (from the stable/kong 0.19.1 Helm chart). We have upgraded Kong to 1.4 using the alpine images. Kong is configured in DB-less mode.
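
For reference, the deployment can be driven with Helm values along these lines (a rough sketch; the release name, image tag and value keys below are assumptions based on the stable/kong chart, not our exact command):

# Rough sketch of the Helm upgrade to Kong 1.4 alpine in DB-less mode.
# Release name "stingray", the image tag and the value keys are assumptions based on the stable/kong chart.
helm upgrade stingray stable/kong --version 0.19.1 \
  --set replicaCount=3 \
  --set image.repository=kong \
  --set image.tag=1.4-alpine \
  --set-string env.database=off \
  --set ingressController.enabled=true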

We have verified that when the edge load balancer sends traffic to a particular replica, a particular route does not respond and returns a 504 Gateway Timeout after 60 seconds: 2 out of 3 replicas fail and 1 out of 3 responds correctly.

When accessing the route through a different replica there is no issue.

The next picture shows how requests balanced to pod “stingray-kong-6bf98f84d6-q7vrn” respond properly with a 200, while requests to the other two pods fail with a client-closed-request (499).

This behaviour cannot be reproduced when requesting other apps; for those, all three nodes serve requests correctly.

We can also see that the pod which serves that route properly has a different memory consumption.

The default Kong Ingress configuration is used, and no plugins are applied to that Ingress.

Updated with more info:

  • Restarting the Kong deployment makes the issue disappear for some hours (see the sketch after this list).
  • If the upstream is removed (by deleting the pod) the issue persists, and restarting the upstream does not help. The upstream itself does not seem to be the problem; Kong may simply be unable to proxy the request.
  • Deleting the ingress and creating it again doesn’t fix the issue.
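
The restart mentioned in the first bullet is done roughly like this (a sketch; the namespace and deployment name are assumptions based on the pod names above):

# Workaround sketch: force the Kong pods to be recreated by the Deployment.
# Namespace ("ingress") and deployment name ("stingray-kong") are assumptions.
kubectl -n ingress scale deployment stingray-kong --replicas=0
kubectl -n ingress scale deployment stingray-kong --replicas=3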

Interesting.

Do you see any error in the controller of the two pods that were not able to serve requests?

Nothing!

Currently one of them has been deleted and Kong returns 503 (the right behaviour) and 504 alternately.
We can access directly with port-forwarding without any issue.

Attached is recent evidence gathered with ApacheBench. Kong pods “-7rw…” and “-2cv…” return 504, while “-dnx…” returns 503.
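
The run was along these lines (a sketch; host, path and request counts are placeholders):

# ApacheBench sketch used to gather the evidence above; host, path and counts are placeholders.
ab -n 300 -c 10 https://<kong-host>/testIssue3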

It seems that the unstable pods are not releasing memory after 17:30.

What is your k8s environment and version?
The issue doesn’t seem to be with Kong itself but with Kubernetes or the load balancer in front of Kong.

Kong is responding with 499, meaning the load balancer closed the connection before Kong could send back the response.

Some things to ensure:

  1. Is this issue happening with specific k8s worker nodes? Kong might not be able to reach services in the rest of the cluster due to a networking problem (see the sketch after this list).
  2. If you are running in a cloud provider, make sure the load balancer is able to reach Kong pods in other zones.
  3. All the worker nodes have proper networking configured and are not out of sync in any way.
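
A quick sketch for point 1 (namespace, label selector and the backend address are examples; it assumes curl, or a substitute like wget, is available in the Kong image):

# From each Kong pod, try to reach the backend service directly, bypassing Kong's proxy.
# Namespace, label selector and backend address are examples; swap in your own values.
for pod in $(kubectl -n ingress get pods -l app=kong -o name); do
  echo "== ${pod}"
  kubectl -n ingress exec "${pod#pod/}" -- \
    curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://jenkins.ci.svc.cluster.local:8080/
done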

Thank you,

  • Each of the 3 replicas is deployed on a different node.
  • Currently there is no upstream target; it is not deployed.
  • The load balancer has been ruled out: I have just tried requesting each pod directly (with port-forwarding) using curl, and only “-dnx” responds properly (sketch after this list).
  • The Kubernetes version is 1.11.
  • We have a lot of environments deployed in the cluster and these are the only two pods with the issue.
  • One pod is Jenkins and the other is this: https://hub.docker.com/r/brndnmtthws/nginx-echo-headers/
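
The per-pod test from the third bullet looks roughly like this (pod name, proxy port and Host header are placeholders):

# Port-forward one Kong pod at a time and curl it directly, taking the external LB out of the picture.
# Pod name, proxy port (8000) and the Host header are placeholders.
kubectl -n ingress port-forward <kong-pod> 8000:8000 &
# (give the port-forward a moment to establish before curling)
curl -s -o /dev/null -w "%{http_code}\n" -H "Host: <ingress-host>" http://127.0.0.1:8000/testIssue3
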
  1. Do you mean that only two services that are being proxied by Kong have this issue, or is it two Kong pods that have this issue?

  2. After LB is removed, what error do you get back from Kong when you curl it directly over the port-forwarding tunnel?

Yes, only these two services have the issue; the others work perfectly.

I have repeated the test, requesting Jenkins resources directly from a Kong pod (without the ELB), and this is the result:

> 
> 2019/11/21 21:42:46 [error] 36#0: *182367757 [lua] init.lua:800: balancer(): failed to retry the dns/balancer resolver for jenkins.ci.svc' with: dns server error: 100 cache only lookup failed while connecting to upstream, client: 127.0.0.1, server: kong, request: "GET /testIssue3 HTTP/1.1", upstream: "http://100.67.34.45:80/testIssue3", host: "****" 
> 
> 
> 2019/11/21 21:42:46 [error] 36#0: *182367757 [lua] balancer.lua:900: balancer_execute(): DNS resolution failed: dns server error: 100 cache only lookup failed. Tried: ["(short)jenkins.ci.svc:(na) - cache-miss","jenkins.ci.svc.ingress.svc.cluster.local:33 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.svc.cluster.local:33 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.cluster.local:33 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.eu-west-1.compute.internal:33 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc:33 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.ingress.svc.cluster.local:1 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.svc.cluster.local:1 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.cluster.local:1 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.eu-west-1.compute.internal:1 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc:1 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.ingress.svc.cluster.local:5 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.svc.cluster.local:5 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.cluster.local:5 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc.eu-west-1.compute.internal:5 - cache only lookup failed/dns server error: 100 cache only lookup failed","jenkins.ci.svc:5 - cache only lookup failed/dns server error: 100 cache only lookup failed"] while connecting to upstream, client: 127.0.0.1, server: kong, request: "GET /testIssue3 HTTP/1.1", upstream: "http://100.67.34.45:80/testIssue3", host: "*****" t
> 
> 
> 2019/11/21 21:42:46 [error] 36#0: *182367757 upstream timed out (110: Operation timed out) while connecting to upstream, client: 127.0.0.1, server: kong, request: "GET /testIssue3 HTTP/1.1", upstream: "http://100.67.34.45:80/testIssue3"

I have tried an nslookup inside the Kong pod and jenkins.ci.svc resolves correctly to 100.67.34.45.
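The check was roughly (pod name is a placeholder):

# DNS sanity check from inside a Kong pod; the pod name is a placeholder.
kubectl -n ingress exec -it <kong-pod> -- nslookup jenkins.ci.svc
# Expected answer: 100.67.34.45, the ClusterIP of the jenkins service.
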
The Jenkins Kubernetes service has the following config:

kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"name":"jenkins"},"name":"jenkins","namespace":"ci"},"spec":{"ports":[{"name":"jenkins-http","port":8080,"protocol":"TCP","targetPort":8080}],"selector":{"name":"jenkins"}}}
  creationTimestamp: "2019-07-30T12:56:37Z"
  labels:
    name: jenkins
  name: jenkins
  namespace: ci
  resourceVersion: "50311438"
  selfLink: /api/v1/namespaces/ci/services/jenkins
  uid: 777266da-b2c9-11e9-9c10-027d1765881c
spec:
  clusterIP: 100.67.34.45
  ports:
  - name: jenkins-http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    name: jenkins
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Why do the traces show the upstream with port 80?
If I get the upstream from the Kong Admin API it displays the right Jenkins pod IP and port.
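
The Admin API lookup was along these lines (a sketch; the admin port-forward and the upstream name jenkins.ci.svc are assumptions):

# Inspect the upstream targets through the Admin API of one Kong pod.
# The admin port (8444, HTTPS) and the upstream name are assumptions.
kubectl -n ingress port-forward <kong-pod> 8444:8444 &
curl -sk https://127.0.0.1:8444/upstreams/jenkins.ci.svc/targets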

{
  "data": [
    {
      "created_at": 1574373014.286,
      "upstream": {
        "id": "e9753ef9-c994-59e5-b348-4a132b793242"
      },
      "id": "9f8052ed-8645-5129-8eab-84f168238129",
      "target": "100.96.8.221:8080",
      "weight": 100
    }
  ]
}


NAME            ENDPOINTS            AGE
jenkins         100.96.8.221:8080    114d
jenkins-agent   100.96.8.221:50000   114d

Can you share the Ingress resource for Jenkins?

By any chance, do you have the ingress.kubernetes.io/service-upstream annotation applied anywhere?

Yes!

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    configuration.konghq.com: jenkins-ingress
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"Ingress","metadata":{"annotations":{"kubernetes.io/ingress.class":"kong"},"name":"jenkins-ingress","namespace":"ci"},"spec":{"rules":[{"host":"****","http":{"paths":[{"backend":{"serviceName":"jenkins","servicePort":"jenkins-http"}}]}}]}}
    kubernetes.io/ingress.class: kong
  creationTimestamp: "2019-11-15T14:08:19Z"
  generation: 3
  name: jenkins-ingress
  namespace: ci
  resourceVersion: "51274360"
  selfLink: /apis/extensions/v1beta1/namespaces/ci/ingresses/jenkins-ingress
  uid: 607ae6ed-07b1-11ea-9c10-027d1765881c
spec:
  rules:
  - host: *****
    http:
      paths:
      - backend:
          serviceName: jenkins
          servicePort: jenkins-http
status:
  loadBalancer:
    ingress:
    - hostname: ****

We are not using service-upstream annotations.
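
A quick sketch of how to double-check that:

# Confirm that no Ingress or Service carries the service-upstream annotation.
kubectl get ingress,svc --all-namespaces -o json | grep -i "service-upstream" \
  || echo "no service-upstream annotation found"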

And this is the KongIngress resource content:

proxy:
  protocol: http
route:
  connect_timeout: 20000
  preserve_host: false
  regex_priority: 1
  strip_path: false

There is another interesting point here: the “proxy” and “route” config works perfectly within the KongIngress CRD, but the “upstream” section doesn’t. We tried to update the http_path and http_status for the healthchecks, but the Kong Admin API always returns the default upstream configuration.
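
What we tried looks roughly like the sketch below (the healthcheck values are illustrative, the field layout follows Kong's upstream schema as mirrored by the KongIngress CRD, and the Admin API call assumes a port-forward on 8444):

# Sketch of the attempted change: add an "upstream" section with active healthchecks to the
# KongIngress, then read the upstream back from the Admin API to see whether it was applied.
# The healthcheck values below are illustrative, not our exact settings.
kubectl -n ci apply -f - <<'EOF'
apiVersion: configuration.konghq.com/v1
kind: KongIngress
metadata:
  name: jenkins-ingress
proxy:
  protocol: http
route:
  connect_timeout: 20000
  preserve_host: false
  regex_priority: 1
  strip_path: false
upstream:
  healthchecks:
    active:
      http_path: /login
      healthy:
        http_statuses: [200, 302]
        interval: 5
EOF
# Check what Kong actually holds for the upstream (the upstream name is an assumption):
curl -sk https://127.0.0.1:8444/upstreams/jenkins.ci.svc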

Okay, nothing interesting here.
Kong should not be doing a DNS lookup for the jenkins.ci.svc hostname. That should be coming from Upstreams.

Do you see an upstream in Kong’s Admin API associated with jenkins.ci.svc (and also targets)?

Yes! The upstream and targets are discovered correctly by Kong. However, we get the same issue both with and without targets.

The first picture (NGINX traces) in this topic shows 200 and 499 responses when the pod is running and the target is discovered.

The third picture, also NGINX traces, shows 503 and 499 responses when the pod is not running.

Currently, with the pod running, the target matches the pod IP and port:

{
  "created_at": 1574461462.702,
  "upstream": {
    "id": "e9753ef9-c994-59e5-b348-4a132b793242"
  },
  "id": "97a6c375-0209-516a-948e-964bacce09f8",
  "target": "100.96.8.25:8080",
  "weight": 100,
  "health": "HEALTHCHECKS_OFF"
}
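
For completeness, the target and its health state were pulled with something like this (admin port-forward assumed; the upstream ID is the one shown earlier):

# List the targets and their health for the jenkins upstream (admin port-forward assumed).
curl -sk https://127.0.0.1:8444/upstreams/e9753ef9-c994-59e5-b348-4a132b793242/targets
curl -sk https://127.0.0.1:8444/upstreams/e9753ef9-c994-59e5-b348-4a132b793242/health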

Hi!

I have realised that the Grafana Kong dashboard shows the following result:

What is happening with kong_process_events?

You can ignore kong_process_events being full. It is allocated to its full size but is not in complete use. At least, that’s the case in most DB-less deployments.

Regarding the original problem, I’ve no clue why Kong is trying to resolve the DNS name jenkins.ci.svc.
Can you make sure that the route -> service -> upstream -> target connections hold up correctly?
Meaning, is the service <-> upstream connection set up correctly?
The two are connected when the host property of the service object is the same as the name property of the upstream object.
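
A quick sketch to verify that (admin address is an example):

# Verify that each Kong service's "host" has a matching upstream "name".
# If there is no matching upstream, Kong falls back to a plain DNS lookup of the host,
# which would explain the dns/balancer errors shown earlier. Admin address is an example.
curl -sk https://127.0.0.1:8444/services | grep -o '"host":"[^"]*"'
curl -sk https://127.0.0.1:8444/upstreams | grep -o '"name":"[^"]*"'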

I have applied the changes described in this topic and I will respond with feedback.

These changes didn’t fix my issue; I will try to investigate more.

After one week running Kong 1.4.1, I can confirm that the new version solves this issue.

Glad to finally hear this!
Thank you for leaving this message here.

Hello abenitovsc,
I have the same issue. Is the cause of the issue clear now?
Please let me know which version of Kong you are using.

“balancer.lua:917: balancer_execute(): DNS resolution failed: dns server error: 100 cache only lookup failed.”

I am using Kong 1.4.2 with Kong Ingress Controller 0.6.0 in DB-less mode.

Hello,

The real issue here was the 504 HTTP code. Can you provide metrics for your Kong pods?
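
For example, something like this would already help (requires metrics-server; the namespace is a placeholder):

# Basic pod metrics for the Kong deployment; requires metrics-server, namespace is a placeholder.
kubectl -n <kong-namespace> top pods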

It is different from my issue.
I use a cookie to store the JWT token.
If there are 10 or more JWT tokens, Kong raises the error "DNS resolution failed: dns server error: 100 cache only lookup failed".
I saw your error message from Nov '19, so I thought it was the same issue, but I think the original issue is different.