Upgrading Kong 3.7.1 to 3.9.0: DNS resolution issues during migrations up/finish

We are using the 3.9.0 Ubuntu image available at hub.docker.com.

We are deploying the image to an Azure AKS environment. This is an upgrade, so bootstrap is not being used.

DNS resolution failures are being reported, such as:

Error: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 276 ms. Tried: [["psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:A","DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 276 ms"]]

We are on PostgreSQL 14. If I connect to the pod running Kong, I am able to do an nslookup on the endpoint without issue.

Here are some of the logs. I exec'd into the pod running the control plane (CP) and ran kong migrations --v list:

$ kong migrations --v list

2025/01/17 19:16:59 [notice] 1449#0: using the "epoll" event method

2025/01/17 19:16:59 [notice] 1449#0: openresty/1.25.3.2

2025/01/17 19:16:59 [notice] 1449#0: OS: Linux 5.15.173.1-1.cm2

2025/01/17 19:16:59 [notice] 1449#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576

2025/01/17 19:16:59 [notice] 1449#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 ipv6 , context: ngx.timer

2025/01/17 19:16:59 [verbose] Kong: 3.9.0

2025/01/17 19:16:59 [verbose] no config file found at /etc/kong/kong.conf

2025/01/17 19:16:59 [verbose] no config file found at /etc/kong.conf

2025/01/17 19:16:59 [verbose] no config file, skip loading

2025/01/17 19:16:59 [verbose] prefix in use: /kong_prefix

2025/01/17 19:16:59 [notice] 1449#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 , context: ngx.timer

2025/01/17 19:16:59 [verbose] preparing nginx prefix directory at /kong_prefix

2025/01/17 19:16:59 [verbose] SSL enabled on proxy, no custom certificate set: using default certificates

2025/01/17 19:16:59 [verbose] proxy SSL certificate found at /kong_prefix/ssl/kong-default.crt

2025/01/17 19:16:59 [verbose] proxy SSL certificate found at /kong_prefix/ssl/kong-default-ecdsa.crt

2025/01/17 19:16:59 [verbose] SSL enabled on admin_gui, no custom certificate set: using default certificates

2025/01/17 19:16:59 [verbose] admin_gui SSL certificate found at /kong_prefix/ssl/admin-gui-kong-default.crt

2025/01/17 19:16:59 [verbose] admin_gui SSL certificate found at /kong_prefix/ssl/admin-gui-kong-default-ecdsa.crt

2025/01/17 19:16:59 [verbose] generating trusted certs combined file in /kong_prefix/.ca_combined

2025/01/17 19:16:59 [info] 1449#0: *2 [lua] node.lua:303: new(): kong node-id: 6bcb69bd-c5a8-4a66-abea-2cbc5ee0c453, context: ngx.timer

Error:

/usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 409 ms. Tried: [["psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:A","DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 409 ms"]]

stack traceback:

[C]: in function 'assert'

/usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: in function 'cmd_exec'

/usr/local/share/lua/5.1/kong/cmd/init.lua:31: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:31>

[C]: in function 'xpcall'

/usr/local/share/lua/5.1/kong/cmd/init.lua:31: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:15>

(command line -e):5: in function 'inline_gen'

init_worker_by_lua(nginx.conf:185):44: in function <init_worker_by_lua(nginx.conf:185):43>

[C]: in function 'xpcall'

init_worker_by_lua(nginx.conf:185):52: in function <init_worker_by_lua(nginx.conf:185):50>

This roughly lines up with what was described in the earlier Kong discussion, although that one appears to have been on 3.7.1.

If I run it in debug mode, I get some additional info:

2025/01/17 19:19:01 [debug] 1465#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:-1

2025/01/17 19:19:01 [debug] 1465#0: *2 [lua] client.lua:534: [dns_client] cache miss, try to query psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:-1

2025/01/17 19:19:02 [debug] 1465#0: *2 [lua] client.lua:362: resolve_query(): [dns_client] r:query(psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:1) ans:- t:451 ms

2025/01/17 19:19:02 [debug] 1465#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:-1 ans:- hlv:fail

Error:

/usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 451 ms. Tried: [["psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:A","DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 451 ms"]]

stack traceback:

[C]: in function 'assert'

/usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: in function 'cmd_exec'

/usr/local/share/lua/5.1/kong/cmd/init.lua:31: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:31>

[C]: in function 'xpcall'

/usr/local/share/lua/5.1/kong/cmd/init.lua:31: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:15>

(command line -e):5: in function 'inline_gen'

init_worker_by_lua(nginx.conf:185):44: in function <init_worker_by_lua(nginx.conf:185):43>

[C]: in function 'xpcall'

init_worker_by_lua(nginx.conf:185):52: in function <init_worker_by_lua(nginx.conf:185):50>

Here is the nslookup output I see for the endpoint in question, run from inside the pod:

nslookup psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com
;; Got recursion not available from 10.0.0.10
;; Got recursion not available from 10.0.0.10
;; Got recursion not available from 10.0.0.10
;; Got recursion not available from 10.0.0.10
Server: 10.0.0.10
Address: 10.0.0.10#53

Non-authoritative answer:
psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com canonical name = psql-hcp-apim-dmz-cp-dev-centralus.privatelink.postgres.database.azure.com.
Name: psql-hcp-apim-dmz-cp-dev-centralus.privatelink.postgres.database.azure.com
Address: 10.15.34.69

So at the pod level the name does resolve, but when running migrations it doesn't.
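One way to compare the two observations more directly: the Kong error is about a single UDP query to 10.0.0.10:53 timing out, while nslookup applies its own retry behaviour. The sketch below sends bare UDP A-record queries to the nameserver and counts how many go unanswered. It is only a diagnostic sketch and does not replicate Kong's resolver. The function names, the 0.5 s timeout, and the 20 attempts are assumptions chosen for illustration. The server address and hostname are the ones from the error above.

```python
import socket
import struct
import time


def build_a_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Encode a minimal DNS query for an A record with recursion desired."""
    # Header: id, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length; a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server: str, hostname: str, attempts: int = 20, timeout_s: float = 0.5):
    """Send one UDP query per attempt; return (replies, timeouts, latencies in ms)."""
    replies, timeouts, latencies_ms = 0, 0, []
    query = build_a_query(hostname)
    for _ in range(attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout_s)
            start = time.monotonic()
            try:
                sock.sendto(query, (server, 53))
                sock.recvfrom(4096)  # any reply counts; the answer is not parsed
                replies += 1
                latencies_ms.append((time.monotonic() - start) * 1000)
            except socket.timeout:
                timeouts += 1
    return replies, timeouts, latencies_ms


# Example, to be run from inside the Kong pod:
# replies, timeouts, ms = probe(
#     "10.0.0.10",
#     "psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com",
# )
# print(f"replies={replies} timeouts={timeouts}")
```

If a noticeable share of these queries time out, the problem is likely between the pod and the cluster DNS service rather than in the migrations themselves.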

Also, here is the pod's /etc/resolv.conf:

$ cat /etc/resolv.conf
search dmz-kong.svc.cluster.local svc.cluster.local cluster.local 13jqinnqegaetjxzt0guttm2sb.gx.internal.cloudapp.net
nameserver 10.0.0.10
options ndots:5
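For reference, the hostname in the error has four dots, which is below the ndots:5 threshold in this file. Under the rules in resolv.conf(5), a system resolver would therefore try the search domains before the name as written. The sketch below lists that order for a given resolv.conf. The function names are hypothetical, and whether Kong's own DNS client follows exactly the same order is an assumption I have not verified.

```python
def parse_resolv_conf(text: str):
    """Extract the search list and the ndots value from resolv.conf contents."""
    search, ndots = [], 1  # resolv.conf(5) documents 1 as the default for ndots
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "search":
            search = parts[1:]
        elif parts[0] == "options":
            for opt in parts[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return search, ndots


def candidate_names(hostname: str, search, ndots: int):
    """List the names tried, in order, under the resolv.conf(5) search rules."""
    if hostname.endswith("."):
        # A trailing dot marks the name as absolute, so no search list applies
        return [hostname]
    expanded = [f"{hostname}.{domain}" for domain in search]
    if hostname.count(".") >= ndots:
        # Enough dots: try the name as written first, then the search list
        return [hostname] + expanded
    # Too few dots: try the search list first, then the name as written
    return expanded + [hostname]


# Example with the resolv.conf shown above:
# search, ndots = parse_resolv_conf(open("/etc/resolv.conf").read())
# for name in candidate_names(
#     "psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com", search, ndots
# ):
#     print(name)
```

Appending a trailing dot to the hostname is the usual way to skip the search list when testing with tools that honour it.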

I don't know if this is pertinent, but although I get the DNS lookup failure most of the time, I just ran it once and got this:

$ kong migrations list
Executed migrations:
core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200, 009_200_to_210, 010_210_to_211, 011_212_to_213, 012_213_to_220, 013_220_to_230, 014_230_to_270, 015_270_to_280, 016_280_to_300, 017_300_to_310, 018_310_to_320, 019_320_to_330, 020_330_to_340, 021_340_to_350, 022_350_to_360, 023_360_to_370, 024_380_to_390
acl: 000_base_acl, 002_130_to_140, 003_200_to_210, 004_212_to_213
acme: 000_base_acme, 001_280_to_300, 002_320_to_330, 003_350_to_360
ai-proxy: 001_360_to_370
basic-auth: 000_base_basic_auth, 002_130_to_140, 003_200_to_210
bot-detection: 001_200_to_210
hmac-auth: 000_base_hmac_auth, 002_130_to_140, 003_200_to_210
http-log: 001_280_to_300
ip-restriction: 001_200_to_210
jwt: 000_base_jwt, 002_130_to_140, 003_200_to_210
key-auth: 000_base_key_auth, 002_130_to_140, 003_200_to_210, 004_320_to_330
oauth2: 000_base_oauth2, 003_130_to_140, 004_200_to_210, 005_210_to_211, 006_320_to_330, 007_320_to_330
opentelemetry: 001_331_to_332
post-function: 001_280_to_300
pre-function: 001_280_to_300
rate-limiting: 000_base_rate_limiting, 003_10_to_112, 004_200_to_210, 005_320_to_330, 006_350_to_360
response-ratelimiting: 000_base_response_rate_limiting, 001_350_to_360
session: 000_base_session, 001_add_ttl_index, 002_320_to_330

I then ran it again and saw the DNS resolution failure again.

@rick Is there someone who can look at this issue?
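Since the failure is intermittent, it may help to put a number on it. The sketch below runs kong migrations list repeatedly and counts how many runs contain the "DNS resolution failed" text seen in the errors above. The helper names and the attempt count are assumptions for illustration. It classifies on the error text only and deliberately ignores the exit code.

```python
import subprocess


def classify(output: str) -> str:
    """Label one run by whether the DNS error text from the thread appears."""
    if "DNS resolution failed" in output:
        return "dns_failure"
    return "no_dns_error"


def tally(run_once, attempts: int = 20):
    """Call run_once() repeatedly and count the outcomes."""
    counts = {"dns_failure": 0, "no_dns_error": 0}
    for _ in range(attempts):
        counts[classify(run_once())] += 1
    return counts


def run_kong_migrations_list() -> str:
    """Run the command from the thread and return stdout and stderr combined."""
    proc = subprocess.run(
        ["kong", "migrations", "list"], capture_output=True, text=True
    )
    return proc.stdout + proc.stderr


# Example, to be run from inside the Kong pod:
# print(tally(run_kong_migrations_list, attempts=20))
```

A failure rate measured this way, together with the pod's resolv.conf, gives a concrete data point to include in a bug report.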

@dgresham This appears to be an environment issue with sporadic failures making it very challenging to assist. If you’re able to create deterministically reproducible instructions and configuration, please file a GitHub issue in the project repository.

@rick Thanks, I posted an issue there. Is there an SLA, or any guidance on how often the issues are monitored? I'm just trying to manage expectations on when I might get a response. Another team member also filed an issue the other day.

@rick I should also mention that my company now has enterprise support, but we are currently still using OSS until we can plan the move to the Enterprise images.

@dgresham I believe you are able to open a support ticket with your enterprise support agreement.

OK, I'll look into that, but FYI, the GitHub issues are now 5 days old without any response.
Not sure what our expectation should be on this.
@rick