Timeout in connection to upstream

Hi guys,

We are using the Kong Ingress Controller to route all incoming requests to the several GUIs our app exposes. We are randomly experiencing an error where we get a “Kong Error” message in the browser. Looking at the Kong logs we found this:

2019/12/03 16:35:33 [error] 36#0: *28506728 upstream timed out (110: Operation timed out) while connecting to upstream, client: X.X.X.X, server: kong, request: "GET /mat/preprod/workflow/ HTTP/1.1", upstream: "http://10.233.69.49:8080/mat/preprod/workflow/", host: "mat.iquall.net"

We suspected an issue in the pod serving the application, so we restarted the app pod, to no avail. Only after restarting the Kong pods (Kong is deployed as a DaemonSet) did we get the service up and running again.
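
A sketch of how the app pod can be checked directly from inside the cluster, bypassing Kong, to rule the application out (the Kong pod name and its "kong" namespace below are placeholders for whatever your deployment uses, and this assumes curl is available in the Kong image):

# Hit the application pod directly, bypassing Kong proxying entirely.
# The pod IP, ClusterIP and path are the ones from the Service and Pod describes below.
kubectl exec -n kong <kong-pod-name> -- \
  curl -sv --max-time 10 http://10.233.84.241:8080/mat/preprod/workflow/

# The same request through the ClusterIP Service, to also rule out kube-proxy:
kubectl exec -n kong <kong-pod-name> -- \
  curl -sv --max-time 10 http://10.233.47.57:8080/mat/preprod/workflow/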

Our Kubernetes cluster has 4 worker nodes; one of the pods in the DaemonSet worked correctly, while the other 3 pods were unable to contact the upstream server.
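
Since the behaviour differs per node, a sketch of how each Kong DaemonSet pod can be tested individually (assuming the DaemonSet runs in a "kong" namespace with label app=kong, the proxy listens on port 8000, and curl is available in the image; adjust to your deployment):

# Send the same request through every Kong proxy pod to see which instances
# still route correctly and which ones time out.
for pod in $(kubectl get pods -n kong -l app=kong -o name); do
  echo "== ${pod}"
  kubectl exec -n kong "${pod#pod/}" -- \
    curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 \
    -H 'Host: mat.iquall.net' http://localhost:8000/mat/preprod/workflow/
done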

We are using kong version 1.2.2, and our configuration is as follows:

########### INGRESS #################
# kubectl describe ingresses -n mat workflow-preprod-webserver 
Name:             workflow-preprod-webserver
Namespace:        mat
Address:          
Default backend:  default-http-backend:80 (<none>)
Rules:
  Host                       Path  Backends
  ----                       ----  --------
  mat.iquall.net  
                             /mat/preprod/workflow/   workflow-preprod-webserver:8080 (10.233.84.241:8080)
Annotations:
  configuration.konghq.com:     kongingress-webserver-preprod-workflow
  kubernetes.io/ingress.class:  kong
Events:                         <none>

########### SERVICE #################

root@mat-master2:~# kubectl describe svc workflow-preprod-webserver -n mat
Name:              workflow-preprod-webserver
Namespace:         mat
Labels:            apps=workflow-webserver
                   env=preprod
Annotations:       <none>
Selector:          app=workflow-preprod-webserver
Type:              ClusterIP
IP:                10.233.47.57
Port:              webserver  8080/TCP
TargetPort:        8080/TCP
Endpoints:         10.233.84.241:8080
Session Affinity:  None
Events:            <none>

########### POD ###################

root@mat-master2:~# kubectl describe pods -n mat workflow-preprod-webserver-6566cf6f46-6d89m 
Name:           workflow-preprod-webserver-6566cf6f46-6d89m
Namespace:      mat
Priority:       0
Node:           mat-worker1/10.48.72.39
Start Time:     Tue, 03 Dec 2019 13:26:56 -0300
Labels:         app=workflow-preprod-webserver
                pod-template-hash=6566cf6f46
Annotations:    <none>
Status:         Running
IP:             10.233.84.241
Controlled By:  ReplicaSet/workflow-preprod-webserver-6566cf6f46
Containers:
  workflow:
.....

############# KONG CONFIGURATION #############

kong=# SELECT * FROM routes WHERE "paths" = '{/mat/preprod/workflow/}';
-[ RECORD 1 ]--------------+-------------------------------------
id                         | 07fd60f7-78cc-4327-982e-d53591e528aa
created_at                 | 2019-12-03 16:51:07+00
updated_at                 | 2019-12-03 16:51:07+00
service_id                 | 2edac309-4c14-4e78-b471-6e1c09bd3b0b
protocols                  | {http,https}
methods                    | 
hosts                      | {mat-operaciones.claro.amx}
paths                      | {/mat/preprod/workflow/}
regex_priority             | 0
strip_path                 | f
preserve_host              | t
name                       | mat.workflow-preprod-webserver.00
snis                       | 
sources                    | 
destinations               | 
tags                       | {managed-by-ingress-controller}
https_redirect_status_code | 426

kong=# SELECT * FROM services WHERE "id" = '2edac309-4c14-4e78-b471-6e1c09bd3b0b';
-[ RECORD 1 ]---+-------------------------------------
id              | 2edac309-4c14-4e78-b471-6e1c09bd3b0b
created_at      | 2019-10-10 18:43:04+00
updated_at      | 2019-10-10 18:43:04+00
name            | mat.workflow-preprod-webserver.8080
retries         | 5
protocol        | http
host            | workflow-preprod-webserver.mat.svc
port            | 80
path            | /
connect_timeout | 60000
write_timeout   | 60000
read_timeout    | 60000
tags            | {managed-by-ingress-controller}
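
The route and service rows above rarely change; what the controller rewrites when a pod restarts is the target list of the corresponding upstream, so that is worth checking too. A sketch, assuming the standard Kong 1.x schema and that the upstream is named after the service host (workflow-preprod-webserver.mat.svc):

# List the upstream and its targets (pod IPs) as stored in Kong's database.
psql -d kong -c "SELECT id, name FROM upstreams
                 WHERE name = 'workflow-preprod-webserver.mat.svc';"
psql -d kong -c "SELECT target, weight, created_at FROM targets
                 WHERE upstream_id = (SELECT id FROM upstreams
                                      WHERE name = 'workflow-preprod-webserver.mat.svc')
                 ORDER BY created_at;"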

What caught my attention is that the IP in the Backends column of the Ingress is the pod’s IP and not the Service’s IP. Is this correct?

Any clues on where to look, or further troubleshooting steps we could take?

Thanks in advance.

Regards,

Diego

Yes. Kong by default routes traffic directly to the pod and bypasses kube-proxy.
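
You can see this through the Admin API: the ingress controller creates an upstream for the Service and registers the pod IPs (the Endpoints) as its targets, rather than the ClusterIP. Roughly, assuming the Admin API listens on localhost:8001 inside the Kong pods and that the upstream is named after the service host:

# List the targets Kong load-balances across for this Service's upstream.
kubectl exec -n kong <kong-pod-name> -- \
  curl -s http://localhost:8001/upstreams/workflow-preprod-webserver.mat.svc/targets

# Compare with what Kubernetes currently reports as Endpoints for the Service:
kubectl get endpoints -n mat workflow-preprod-webserver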

OK, and how does it get updated when a pod is restarted?
In Kong’s configuration in the DB I only have the service name, so Kong’s configuration should somehow get updated when the Service endpoints change.

The behaviour I’m seeing seems to point to the pod IP address not being updated in some of the Kong instances. I cannot reproduce the issue on demand, since it happens randomly (or at least I couldn’t determine a cause).

Kong’s controller is notified about the change and then the configuration is put into Kong’s database.
There can be a small delay (up to 5-10 seconds) for this change to take effect.
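
If you want to observe that propagation, you could watch the Endpoints object while the app pod restarts and, in another terminal, poll Kong’s target list until it catches up (the Kong pod name, namespace and Admin API port are placeholders as above):

# Terminal 1: watch Kubernetes update the Endpoints when the pod restarts.
kubectl get endpoints -n mat workflow-preprod-webserver -w

# Terminal 2: poll the target list Kong has for the corresponding upstream.
watch -n 2 "kubectl exec -n kong <kong-pod-name> -- curl -s http://localhost:8001/upstreams/workflow-preprod-webserver.mat.svc/targets"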

Are all of these instances connected to the same DB, and can they all reach it?

Yes, they all share a single Postgres instance. The cluster is deployed on VMs within a single datacenter.
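
To double-check that, a sketch of how we could verify that every Kong pod reaches Postgres, using the Admin API /status endpoint (which reports database reachability in Kong 1.x; namespace, label and Admin port are placeholders again):

# Ask each Kong pod whether it can reach the database.
for pod in $(kubectl get pods -n kong -l app=kong -o name); do
  echo "== ${pod}"
  kubectl exec -n kong "${pod#pod/}" -- \
    curl -s http://localhost:8001/status | grep -o '"reachable":[a-z]*'
done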

That delay sounds reasonable. We’ve also found that deleting and re-creating the Ingress fixes the issue, which (following what you’ve answered so far) should delete the old config and put the new config into Kong’s database.
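
Concretely, the workaround amounts to something like this (the manifest filename is a placeholder for wherever the Ingress definition lives):

# Delete and re-create the Ingress so the controller re-syncs it into Kong.
kubectl delete ingress -n mat workflow-preprod-webserver
kubectl apply -f workflow-preprod-webserver-ingress.yaml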

Maybe I should try upgrading Kong to a newer version? Looking at the 1.3 changelog here, it sounds like we might be hitting the “Route” bugs listed there.

I’ll keep you posted on this after testing the upgrade.