X-Kong-Proxy-Latency sometimes high

Hi, we noticed that Kong's response is sometimes quite slow. We have a test that sends 1000 sequential requests to the server (one by one), and some requests (about 0.1%) slow down to about 2000ms. The interesting thing is that X-Kong-Proxy-Latency is always around 2000ms (we never see something like 3000ms). How should I dig into this to find out what is causing it and how to fix it?
I have already deleted all the plugins, and the problem is still there.

Most responses (99.9%) have low latency, but some do not. For example:

<CIMultiDictProxy('Server': 'nginx/1.13.8', 'Date': 'Thu, 22 Mar 2018 03:07:29 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Kong-Upstream-Latency': '13', 'X-Kong-Proxy-Latency': '2023', 'Via': 'kong/0.10.0', 'Content-Encoding': 'gzip')>
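For context, the test is essentially something like the following (a minimal sketch; the host and path are placeholders, and we use aiohttp, which is where the CIMultiDictProxy header dump above comes from):

# Minimal sketch of the test: 1000 sequential requests, flagging any
# response whose X-Kong-Proxy-Latency header is unusually high.
# The URL is a placeholder for our real route.
import asyncio
import aiohttp

async def main():
    slow = []
    async with aiohttp.ClientSession() as session:
        for i in range(1000):
            async with session.get("http://kong.example.org:8000/blabla") as resp:
                await resp.read()
                proxy_ms = int(resp.headers.get("X-Kong-Proxy-Latency", "0"))
                upstream_ms = int(resp.headers.get("X-Kong-Upstream-Latency", "0"))
                if proxy_ms > 1000:
                    slow.append((i, proxy_ms, upstream_ms))
    print(f"{len(slow)} slow responses out of 1000: {slow}")

asyncio.run(main())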

Here's the Kong version and some other information:

{
  timers: {
    running: 0,
    pending: 5
  },
  configuration: {
    pg_password: "******",
    cassandra_lb_policy: "RoundRobin",
    lua_code_cache: "on",
    serf_path: "serf",
    nginx_kong_conf: "/usr/local/kong/nginx-kong.conf",
    proxy_listen: "0.0.0.0:8000",
    ssl_cert: "/usr/local/kong/ssl/kong-default.crt",
    lua_ssl_verify_depth: 1,
    admin_listen: "0.0.0.0:8001",
    admin_listen_ssl: "0.0.0.0:8444",
    ssl_cert_default: "/usr/local/kong/ssl/kong-default.crt",
    cassandra_repl_factor: 1,
    cluster_profile: "wan",
    upstream_keepalive: 60,
    lua_package_cpath: "",
    cassandra_consistency: "ONE",
    admin_ssl_cert_default: "/usr/local/kong/ssl/admin-kong-default.crt",
    pg_host: "alihk-master.qbtrade.org",
    pg_port: 5432,
    cluster_ttl_on_failure: 3600,
    nginx_err_logs: "/usr/local/kong/logs/error.log",
    plugins: {
      syslog: true,
      ldap-auth: true,
      rate-limiting: true,
      correlation-id: true,
      jwt: true,
      runscope: true,
      hmac-auth: true,
      http-log: true,
      tcp-log: true,
      response-transformer: true,
      statsd: true,
      basic-auth: true,
      request-transformer: true,
      bot-detection: true,
      request-size-limiting: true,
      galileo: true,
      datadog: true,
      acl: true,
      file-log: true,
      ip-restriction: true,
      udp-log: true,
      cors: true,
      loggly: true,
      aws-lambda: true,
      oauth2: true,
      response-ratelimiting: true,
      key-auth: true
    },
    nginx_optimizations: true,
    cassandra_keyspace: "kong",
    cassandra_data_centers: [
      "dc1:2",
      "dc2:3"
    ],
    ssl_cert_key: "/usr/local/kong/ssl/kong-default.key",
    ssl: true,
    cassandra_ssl: false,
    admin_ssl: true,
    ssl_cert_key_default: "/usr/local/kong/ssl/kong-default.key",
    anonymous_reports: true,
    kong_env: "/usr/local/kong/.kong_env",
    serf_log: "/usr/local/kong/logs/serf.log",
    log_level: "notice",
    proxy_listen_ssl: "0.0.0.0:8443",
    nginx_worker_processes: "auto",
    cassandra_contact_points: [
      "127.0.0.1"
    ],
    proxy_ip: "0.0.0.0",
    pg_ssl: false,
    proxy_ssl_port: 8443,
    nginx_admin_acc_logs: "/usr/local/kong/logs/admin_access.log",
    pg_database: "kong",
    admin_ip: "0.0.0.0",
    admin_ssl_ip: "0.0.0.0",
    proxy_port: 8000,
    admin_ssl_port: 8444,
    serf_node_id: "/usr/local/kong/serf/serf.id",
    database: "postgres",
    cassandra_port: 9042,
    dns_resolver: { },
    admin_port: 8001,
    cassandra_repl_strategy: "SimpleStrategy",
    cluster_listen: "0.0.0.0:7946",
    admin_ssl_cert: "/usr/local/kong/ssl/admin-kong-default.crt",
    dns_hostsfile: "/etc/hosts",
    ssl_cert_csr_default: "/usr/local/kong/ssl/kong-default.csr",
    prefix: "/usr/local/kong",
    admin_ssl_cert_key: "/usr/local/kong/ssl/admin-kong-default.key",
    nginx_daemon: "off",
    custom_plugins: { },
    lua_socket_pool_size: 30,
    serf_pid: "/usr/local/kong/pids/serf.pid",
    mem_cache_size: "128m",
    pg_ssl_verify: false,
    cassandra_username: "kong",
    pg_user: "postgres",
    cassandra_timeout: 5000,
    lua_package_path: "?/init.lua;./kong/?.lua",
    cassandra_ssl_verify: false,
    proxy_ssl_ip: "0.0.0.0",
    nginx_conf: "/usr/local/kong/nginx.conf",
    admin_ssl_cert_key_default: "/usr/local/kong/ssl/admin-kong-default.key",
    serf_event: "/usr/local/kong/serf/serf_event.sh",
    nginx_pid: "/usr/local/kong/pids/nginx.pid",
    nginx_acc_logs: "/usr/local/kong/logs/access.log",
    admin_ssl_cert_csr_default: "/usr/local/kong/ssl/admin-kong-default.csr",
    cluster_listen_rpc: "127.0.0.1:7373"
  },
  version: "0.10.0",
  prng_seeds: {
    "pid: 80": 154111922457,
    "pid: 82": 254202138298,
    "pid: 81": 226802372218,
    "pid: 79": 176114396661
  },
  lua_version: "LuaJIT 2.1.0-beta2",
  tagline: "Welcome to kong",
  hostname: "kong-952928534-z9spv",
  plugins: {
    enabled_in_cluster: { },
    available_on_server: {
      syslog: true,
      ldap-auth: true,
      rate-limiting: true,
      correlation-id: true,
      jwt: true,
      runscope: true,
      request-transformer: true,
      http-log: true,
      loggly: true,
      response-transformer: true,
      basic-auth: true,
      tcp-log: true,
      key-auth: true,
      oauth2: true,
      acl: true,
      galileo: true,
      udp-log: true,
      cors: true,
      file-log: true,
      ip-restriction: true,
      datadog: true,
      request-size-limiting: true,
      bot-detection: true,
      aws-lambda: true,
      statsd: true,
      response-ratelimiting: true,
      hmac-auth: true
    }
  }
}

I changed the upstream from http://xxx.example.org/blabla to http://192.168.3.4/blabla and the latency issue disappeared, so it is a DNS issue. I see that upstream connections are kept alive (upstream_keepalive is 60), so DNS queries should be very few. Could this be a keepalive error?
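One way to confirm this independently of Kong is to time resolution of the upstream hostname from the same machine. A rough sketch (the hostname is the placeholder from above, and note that the OS or a local caching resolver may hide the delay that Kong's own resolver sees):

# Time repeated lookups of the upstream hostname to see whether DNS
# resolution itself occasionally stalls for ~2000ms.
import socket
import time

for i in range(100):
    start = time.monotonic()
    socket.getaddrinfo("xxx.example.org", 80)
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > 100:
        print(f"lookup {i}: {elapsed_ms:.0f} ms")
    time.sleep(1)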

Check your logs and count the occurrences. Kong's DNS cache is stored per nginx worker, and the number of workers (by default, unless changed) matches the number of CPU cores on your machine.
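Because the cache is per worker, the expected number of delayed requests equals the worker count. A rough check, assuming nginx_worker_processes is left at "auto" as in the configuration above:

# With nginx_worker_processes = "auto", Kong starts one nginx worker per
# CPU core, and each worker resolves (and caches) DNS on its own.
import os

print(f"expected workers, and thus first-request DNS delays: {os.cpu_count()}")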

You should only see one delayed request per worker: the first one. Upon the initial request Kong cannot proxy yet, since it must do DNS resolution before it can proxy. Once that is done, you should generally not see the delay again. When the TTL of the DNS record expires, a refresh query will be done in the background (for DNS_STALE_TTL number of seconds; see the docs for that setting).

DNS refresh is only triggered by requests, so in case of low request volumes the record might disappear after both the TTL and the extension based on DNS_STALE_TTL have expired. In that case the next request will have to wait for DNS resolution again, just like the initial request did.
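To make that lifecycle concrete, here is a minimal sketch of the serve-stale-while-refreshing idea described above. This is not Kong's actual resolver code; the names, the TTL values, and the blocking getaddrinfo stand-in are all invented for illustration:

# Illustration only: the serve-stale-while-refresh pattern described above.
import socket
import threading
import time

TTL = 30        # seconds the record counts as fresh
STALE_TTL = 4   # grace period during which a stale record may still be served

cache = {}  # hostname -> (addresses, resolved_at)

def resolve(hostname):
    # Blocking lookup, standing in for a real DNS query.
    return sorted({info[4][0] for info in socket.getaddrinfo(hostname, 80)})

def refresh(hostname):
    cache[hostname] = (resolve(hostname), time.monotonic())

def lookup(hostname):
    entry = cache.get(hostname)
    if entry:
        addresses, resolved_at = entry
        age = time.monotonic() - resolved_at
        if age < TTL:
            return addresses  # fresh hit: no DNS traffic at all
        if age < TTL + STALE_TTL:
            # Stale but within the grace period: serve the old answer and
            # refresh in the background so this request is not delayed.
            threading.Thread(target=refresh, args=(hostname,), daemon=True).start()
            return addresses
    # No record yet (first request on this worker), or the grace period has
    # expired (low traffic): the request must wait, hence the latency spike.
    refresh(hostname)
    return cache[hostname][0]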