DNS resolution failed in kong ingress pod

I ran into this error and found the following workaround:
Set "dns_stale_ttl = 125" in kong.conf.
See: https://github.com/Kong/kong/issues/4496
After applying this setting, the problem seemed to disappear at first.

However, after applying the setting and restarting the Kong pod, the same problem comes back after a while (Kong ingress cannot find the service address and returns error code 404).
Kong ingress works correctly right after a pod restart, but some time later it stops finding the service address again.

[env]

  • Kubernetes : v1.15.0
  • KongIngress : 0.6.0
  • kong : 1.4.2
  • dbless
  • use lrucache in custom plugin (size 10^7)

[kong.conf]
dns_stale_ttl = 125
client_max_body_size=256m
client_body_buffer_size=256k
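
In a Kubernetes deployment these settings are often supplied as container environment variables rather than through a kong.conf file. Kong reads any setting from an environment variable named `KONG_` plus the upper-cased key. The sketch below only shows that name mapping; the helper name `kong_env_vars` is made up for illustration, and how the variables are attached to the pod is not shown.

```python
def kong_env_vars(conf_text):
    """Map kong.conf style 'key = value' lines to KONG_* environment names."""
    env = {}
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        env["KONG_" + key.upper()] = value
    return env


conf = """dns_stale_ttl = 125
client_max_body_size=256m
client_body_buffer_size=256k"""

for name, value in kong_env_vars(conf).items():
    print(f"{name}={value}")
```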

[First Error]
2020/01/22 11:27:48 [error] 32#0: *185079 upstream timed out (110: Operation timed out) while connecting to upstream, client: 10.251.94.11, server: kong, request: “GET / HTTP/1.1”, upstream: “http://10.221.32.177:80/”, host: “10.251.147.11”
10.251.94.64 - - [22/Jan/2020:11:27:48 +0000] “GET / HTTP/1.1” 500 131 “-” “Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36”
2020/01/22 11:27:48 [error] 32#0: *185079 [lua] balancer.lua:917: balancer_execute(): DNS resolution failed: dns server error: 100 cache only lookup failed.
Tried: ["(short)test-service.name-space.svc:(na) - cache-miss",
"test-service.name-space.svc.name-space.svc.cluster.local:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.svc.cluster.local:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.cluster.local:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.openstacklocal:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.name-space.svc.cluster.local:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.svc.cluster.local:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.cluster.local:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.openstacklocal:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.name-space.svc.cluster.local:5 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.svc.cluster.local:5 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.cluster.local:5 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc.openstacklocal:5 - cache only lookup failed/dns server error: 100 cache only lookup failed",
"test-service.name-space.svc:5 - cache only lookup failed/dns server error: 100 cache only lookup failed"] while connecting to upstream, client: 10.251.94.11, server: kong, request: "GET / HTTP/1.1", upstream: "http://10.221.32.177:80/", host: "10.251.147.11"
2020/01/22 11:27:48 [error] 32#0: *185079 [lua] init.lua:800: balancer(): failed to retry the dns/balancer resolver for test-service.name-space.svc’ with: dns server error: 100 cache only lookup failed while connecting to upstream, client: 10.251.94.11, server: kong, request: “GET / HTTP/1.1”, upstream: “http://10.221.32.177:80/”, host: “10.251.147.11”
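
The "Tried:" list in the log above is easier to read once it is grouped by record type. The number after each name is the standard DNS record type code (1 = A, 5 = CNAME, 33 = SRV). The following is a rough sketch for summarizing such a list; the function names and the regular expression are my own and only assume the entry layout seen in this log.

```python
import re
from collections import Counter

# Standard DNS record type codes that appear in the log.
RECORD_TYPES = {1: "A", 5: "CNAME", 33: "SRV"}

# One entry looks like: "name:33 - result text" or "name:(na) - result text".
ENTRY = re.compile(r'"([^":]+):(\(na\)|\d+) - ([^"]+)"')


def parse_tried(text):
    """Return (name, record_type, result) tuples from a 'Tried: [...]' list."""
    # Forum software often turns straight quotes into curly ones; undo that.
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    attempts = []
    for name, code, result in ENTRY.findall(text):
        if code == "(na)":
            rtype = "n/a"
        else:
            rtype = RECORD_TYPES.get(int(code), "TYPE" + code)
        attempts.append((name, rtype, result.strip()))
    return attempts


def attempts_by_type(text):
    """Count how many lookups were tried per record type."""
    return Counter(rtype for _, rtype, _ in parse_tried(text))


sample = (
    '["(short)test-service.name-space.svc:(na) - cache-miss", '
    '"test-service.name-space.svc.cluster.local:33 - cache only lookup failed'
    '/dns server error: 100 cache only lookup failed", '
    '"test-service.name-space.svc.cluster.local:1 - cache only lookup failed'
    '/dns server error: 100 cache only lookup failed"]'
)
print(attempts_by_type(sample))
```

Run against the full list, this shows that every search-domain candidate was tried as SRV, A, and CNAME, and that all of them failed in the same way.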

[Error after applying configuration]
2020/01/29 09:23:31 [notice] 1#0: signal 17 (SIGCHLD) received from 31
2020/01/29 09:23:31 [alert] 1#0: worker process 31 exited on signal 9
2020/01/29 09:23:31 [notice] 1#0: start worker process 34
2020/01/29 09:23:31 [notice] 1#0: signal 29 (SIGIO) received
10.251.91.11 - - [29/Jan/2020:09:23:43 +0000] "POST /test-service/api/v1/aaa HTTP/1.1" 404 48 “http://10.251.147.11/aaa” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”
10.251.91.11 - - [29/Jan/2020:09:23:39 +0000] “POST /test-service/api/v1/aaa HTTP/1.1” 404 48 “http://10.251.147.11/aaa” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”

Can you verify if the 404 is being returned by Kong or by your upstream service?
Since there is a 404, this doesn’t seem to be a DNS error.
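
One way to check this from the client side is to look at the response itself. The sketch below is based on assumptions about Kong's default behavior that should be verified against the running version: that Kong adds an `X-Kong-Upstream-Latency` header only when it actually proxied the request, and that its own "no route" 404 carries the JSON body shown in the constant. That body is 48 bytes long, which happens to match the "404 48" in the access log, but this is an inference rather than proof. The function name is hypothetical.

```python
# Assumed default body of Kong's own 404 when no Route matches.
KONG_NO_ROUTE_BODY = '{"message":"no Route matched with those values"}'


def looks_like_kong_generated_404(headers, body):
    """Guess whether a 404 was produced by Kong itself rather than the upstream.

    headers: mapping of response header names to values
    body:    response body as text
    """
    names = {name.lower() for name in headers}
    # Assumption: this header is present only on responses proxied from an upstream.
    proxied = "x-kong-upstream-latency" in names
    return (not proxied) and "no Route matched" in body


print(len(KONG_NO_ROUTE_BODY))
print(looks_like_kong_generated_404({"Server": "kong/1.4.2"}, KONG_NO_ROUTE_BODY))
```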

I already checked the logs of the Kong ingress pod and the service pod for the same time window.
The Kong ingress pod logged the 404 status code.
There was no matching request log in the service pod.
If the service pod had received the request, it would have logged it.
The service pod was writing other logs at that time, so it was running.
I sent the request 10 times, and every attempt failed with 404.

The next time I see the 404 in the Kong ingress pod log, I will restart the service pod first.
But I think the service pod was working correctly at that time.

I hit the same issue again today.
What do signal 17 and signal 9 mean?
After these log lines, Kong ingress could no longer find the service.

I restarted the upstream service, but Kong still logged 404.
I also sent a request to a different service pod, and Kong ingress logged 404 for that as well.
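
For reference, the numbers in these nginx master-process messages are Linux signal numbers: 17 is SIGCHLD (the master is notified that a child exited), 29 is SIGIO, and 9 is SIGKILL, which a process cannot catch or handle. The kernel's OOM killer is one common sender of SIGKILL, though the log line alone does not prove that. A small sketch for pulling killed workers out of such a log follows; the function name and the lookup table are my own, and the table only covers the signals seen in this thread.

```python
import re

# Linux signal numbers that appear in this thread (not a complete table).
LINUX_SIGNALS = {9: "SIGKILL", 17: "SIGCHLD", 29: "SIGIO"}

EXITED = re.compile(r"worker process (\d+) exited on signal (\d+)")


def killed_workers(log_text):
    """Return (timestamp, worker_pid, signal_name) for each worker ended by a signal."""
    hits = []
    for line in log_text.splitlines():
        match = EXITED.search(line)
        if match:
            pid, signo = int(match.group(1)), int(match.group(2))
            name = LINUX_SIGNALS.get(signo, "signal %d" % signo)
            # The nginx error log starts with a 19-character timestamp.
            hits.append((line[:19], pid, name))
    return hits


sample = """2020/01/29 09:23:31 [notice] 1#0: signal 17 (SIGCHLD) received from 31
2020/01/29 09:23:31 [alert] 1#0: worker process 31 exited on signal 9
2020/01/29 09:23:31 [notice] 1#0: start worker process 34"""

for timestamp, pid, name in killed_workers(sample):
    print(timestamp, "worker", pid, "ended by", name)
```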

[First Error]
2020/01/30 06:24:25 [notice] 1#0: signal 17 (SIGCHLD) received from 32
2020/01/30 06:24:25 [alert] 1#0: worker process 32 exited on signal 9
2020/01/30 06:24:25 [notice] 1#0: start worker process 34
2020/01/30 06:24:25 [notice] 1#0: signal 29 (SIGIO) received
10.222.22.22 - - [30/Jan/2020:06:24:25 +0000] “GET /icon-48-px-error.png HTTP/1.1” 404 48 “http://10.222.11.11/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”
10.222.22.22 - - [30/Jan/2020:06:24:40 +0000] “GET /runtime-es2015.js HTTP/1.1” 404 48 “http://10.222.11.11/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”
10.222.22.22 - - [30/Jan/2020:06:24:40 +0000] “GET /polyfills-es2015.js HTTP/1.1” 404 48 “http://10.222.11.11/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”
10.222.22.22 - - [30/Jan/2020:06:24:40 +0000] “GET /styles-es2015.js HTTP/1.1” 404 48 “http://10.222.11.11/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”
10.222.22.22 - - [30/Jan/2020:06:24:40 +0000] “GET /vendor-es2015.js HTTP/1.1” 404 48 “http://10.222.11.11/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”
10.222.22.22 - - [30/Jan/2020:06:24:40 +0000] “GET /main-es2015.js HTTP/1.1” 404 48 “http://10.222.11.11/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”
10.222.22.22 - - [30/Jan/2020:06:24:45 +0000] “GET /service-1/api/v1/get-data-list HTTP/1.1” 404 48 “http://10.222.11.11/service-1” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”
10.222.22.22 - - [30/Jan/2020:06:24:45 +0000] “GET /service-1/api/v1/get-data-list HTTP/1.1” 404 48 “http://10.222.11.11/service-1” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”
10.222.22.22 - - [30/Jan/2020:06:24:45 +0000] “GET /service-1/api/v1/get-name-list HTTP/1.1” 404 48 “http://10.222.11.11/service-1” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”
10.222.22.22 - - [30/Jan/2020:06:24:45 +0000] “GET /service-1/api/v1/get-name-list HTTP/1.1” 404 48 “http://10.222.11.11/service-1” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”

[Request to another service]
10.222.22.22 - - [30/Jan/2020:06:25:30 +0000] “GET /service-3/api/v1/request-server-name HTTP/1.1” 404 48 “http://10.222.11.11/service-2/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 SECSSOBrowserChrome”

Yes, you are right. The 404 is not a DNS error.
The upstream service was also fine at that time.

I think signal 9 (SIGKILL) means the worker was killed for running out of memory (OOM).
Kong ingress memory grew to 2.5G in our testbed.
I use lrucache in a custom plugin, and the cache grew to over 1.5G.
So the Kong worker was probably OOM-killed (signal 9).
Some time after the OOM kill (signal 9), Kong starts working correctly again.
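
A note on why a cache like this can grow so large: as far as I know, the size passed to lua-resty-lrucache is a maximum number of items, not a number of bytes. With 10^7 slots, values averaging only 150 bytes would already add up to about 1.5 GB. The sketch below illustrates the alternative of bounding a cache by total value size. It is written in Python purely as an illustration, it is not the plugin's Lua code, and the class name `ByteBoundedLRU` is invented.

```python
from collections import OrderedDict


class ByteBoundedLRU:
    """LRU cache that evicts by total stored bytes instead of item count."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self._items = OrderedDict()  # oldest entry first

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # Mark the entry as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def set(self, key, value):
        """Store a bytes value, evicting least recently used entries if needed."""
        if key in self._items:
            self.used -= len(self._items.pop(key))
        self._items[key] = value
        self.used += len(value)
        while self.used > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used -= len(evicted)


cache = ByteBoundedLRU(max_bytes=10)
cache.set("a", b"12345")
cache.set("b", b"12345")
cache.get("a")            # "a" becomes the most recently used entry
cache.set("c", b"123")    # total would be 13 bytes, so "b" is evicted
print(cache.used)
```

A smaller item limit or a TTL on cache entries would bound memory in a similar way; which one fits depends on what the plugin stores.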

Since I fixed the memory issue in the custom plugin, Kong ingress memory has not grown.
Thank you for your help.