Latency spikes under nominal load

We are running Kong 0.13 and, under nominal load (10-20% CPU, 40K RPM), we are seeing latency spikes inside of Kong that reach 2-4 seconds at the 95th/99th percentiles, with a mean latency of 20-40 ms. This seems unusually high for our level of traffic. The latency seems to just “cut off” at 2 s or 4 s, so it looks like some kind of timeout issue. Has anyone run into this?

Could you include the x-axis in your screenshot? And/or describe the periodicity of these latency spikes?
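
Also, if you haven’t already, comparing the latency headers Kong adds to proxied responses (X-Kong-Proxy-Latency vs. X-Kong-Upstream-Latency, on by default unless latency tokens have been disabled) is a quick way to confirm the time is being spent inside Kong rather than in the upstream. A minimal check - host, port, and route below are placeholders:

# How much time is Kong itself adding vs. the upstream, on a single request?
curl -s -o /dev/null -D - http://your-kong-host:8000/some-route | grep -i '^x-kong-'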

Hey! So the X-axis is just time and the Y-axis is latency in ms, and we couldn’t see a real pattern to the spikes along the X-axis. However, we were able to isolate the problem to what we think is a hardware issue with DNS resolution. We were able to mitigate the latency spikes by raising dns_stale_ttl to something high like 60 seconds (up from the default of 4), and that change resolved the issue. This makes us think stale DNS entries were periodically taking longer than 4 seconds + 1 second (dns_stale_ttl + dns_error_ttl) to refresh, causing some requests to wait for the DNS cache to be updated. If that doesn’t sound way off base, then we think it is definitely a hardware/network DNS issue. We are running the Kong Docker container inside a Rancher/Cattle service on AWS, so we are now looking into what could be causing network issues that lead to long DNS resolution times.
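
For anyone wanting to try the same change, it’s a single kong.conf property, which with the official Docker image can also be passed as an environment variable (the docker run line below is trimmed to just that flag; everything else you already pass stays the same):

# kong.conf: dns_stale_ttl = 60   (default is 4)
# Same property as an env var with the official Docker image:
docker run -d -e "KONG_DNS_STALE_TTL=60" ... kong:0.13
# ("..." = the rest of your usual flags: ports, database env, network, etc.)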

I am occasionally seeing spikes around what you report as well, and I also run with dns_stale_ttl at 4. As you can see from our cool Dynatrace tracing integration, we can see where the transaction is being held up, and it’s always in Kong on these big spikes. I will try your change and report back if it improves things.

Ha, yeah @bringking, I knew the X-axis was time - I was wondering how much time, as I was guessing that the rough periodicity of the latency might have something to do with… DNS! We’ve seen other users hit Kong latency issues whose root cause was DNS.

@jeremyjpj0916 and @bringking we’d love to hear how you made your DNS consistently fast!

Currently investigating our prod environment a bit more to see the actual impact before I change the default Kong DNS settings. I also don’t know a good way to really debug this - I don’t want to throw Kong into debug mode because it prints soooo many logs, when really the best way to validate the root cause here would be a way to isolate DNS-related logging and alert on anomalies found there (like a DNS lookup taking longer than 2 s, if that is indeed the root cause for bringking or myself); a rough sketch of such a check is further down this post. Results from my digging so far are as follows:

Queries I used to look up real prod traffic:

index=cba_kong host="gateway" RoutingURL!="*localhost*" TotalLatency >= 2000 BackendLatency <= 1000
index=cba_kong host="gateway" RoutingURL!="*localhost*"

Results over a 7-day period are as follows:

As you can see, a little over 4% of traffic is impacted by whatever is causing this. I will try some changes to see if this can be resolved. @bringking, I’m not sure how deep your analytics go, but I am interested to see the % impact to your services and whether it’s around the same % I see.
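
In lieu of a DNS-specific logging toggle, the crude check I have in mind is a small shell loop run next to Kong that times a lookup periodically and only logs the slow ones. The hostname and threshold are placeholders, and it assumes GNU date for the millisecond math:

#!/bin/sh
# Time a DNS lookup every 10 s; log only lookups slower than the threshold.
HOST=backend.example.internal
THRESHOLD_MS=2000
while true; do
  START=$(date +%s%3N)
  nslookup "$HOST" > /dev/null 2>&1
  ELAPSED=$(( $(date +%s%3N) - START ))
  if [ "$ELAPSED" -gt "$THRESHOLD_MS" ]; then
    echo "$(date -Is) slow DNS lookup for $HOST: ${ELAPSED}ms"
  fi
  sleep 10
done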

Edit - Maybe to further prove out whether it’s DNS or not, I could turn off DNS caching altogether in our Stage env and see if doing look-ups every time yields more frequent latency spikes, but I don’t see an option for that in kong.conf. If anyone knows which Lua file and lines I need to modify (only if it’s a trivial small snippet change) before starting Kong to basically disable all DNS caching, I will apply that patch in Stage and run some tests.

Edit Edit - I researched our Stage env too and noticed the same ~4% of requests with latency spikes on the Kong side, go figure :smiley: . I modified our Stage env to the same dns_stale_ttl = 60 value you tried; I will watch it next week and see if things improve there and the 4% goes to 0%. If not, back to the drawing board on my side.

@Cooper @jeremyjpj0916 So we figured this out, and it turns out the DNS change was only masking the real issue. Caching DNS for 60 s shouldn’t be necessary, so we dug into why our DNS would periodically take so long. We decided to check the kernel log on the server by running dmesg | tail -n 20, which showed us -

[1889097.392072] nf_conntrack: table full, dropping packet
[1889097.398947] nf_conntrack: table full, dropping packet
[1889097.398959] nf_conntrack: table full, dropping packet
[1889097.398966] nf_conntrack: table full, dropping packet
[1889097.398974] nf_conntrack: table full, dropping packet
[1889097.398979] nf_conntrack: table full, dropping packet
[1889097.398987] nf_conntrack: table full, dropping packet
[1889097.398993] nf_conntrack: table full, dropping packet
[1889097.399002] nf_conntrack: table full, dropping packet
...

This looked promising. It means netfilter was dropping packets because the conntrack table was full (which would definitely add latency), so we looked at our historical logs and found that this had been happening for a while. After that, we used sysctl and saw that our nf_conntrack_max setting was far too low, likely due to a change in one of our Chef templates. After raising nf_conntrack_max to 524288 and setting nf_conntrack_tcp_timeout_established to 86400, we immediately saw all the issues go away.
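
For anyone hitting the same messages, these are roughly the commands involved. The /etc/sysctl.d file is just one way to persist the values; in our case the permanent fix lives in the Chef template:

# Check how close the live count gets to the limit (the "table full" lines mean it hit it)
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Apply the values mentioned above at runtime
sudo sysctl -w net.netfilter.nf_conntrack_max=524288
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400

# One way to persist them across reboots
printf '%s\n' \
  'net.netfilter.nf_conntrack_max = 524288' \
  'net.netfilter.nf_conntrack_tcp_timeout_established = 86400' \
  | sudo tee /etc/sysctl.d/90-conntrack.conf
sudo sysctl --system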

We should have known something was causing network issues, because we had been seeing that latencies would rise with high concurrency while the CPU stayed very low (10-20%). After this change we were able to stress the system with 4-5x more RPM with no increase in latency, and with the network freed up we were able to peg the CPU during the load test as well. Here is a graph showing the change -



Doing some research also turned up a good write-up on the change here - http://antmeetspenguin.blogspot.com/2011/01/high-performance-linux-router.html
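
For anyone wanting to run a similar stress test, a minimal sketch (wrk is just an example tool here; the thread/connection counts, duration, and route are placeholders, not our exact setup):

# Drive sustained concurrency through the proxy port and watch p95/p99, not just the mean
wrk -t8 -c400 -d120s --latency http://your-kong-host:8000/some-route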

Thanks for the great product, turns out it was our fault after all :slight_smile:


Thanks for sharing your results. I did decide to check those values, although I now think we are not experiencing the same root-cause problem. Mine seem to be configured okay by default:

 $ sysctl -a | grep netfilter
net.netfilter.nf_conntrack_acct = 0
net.netfilter.nf_conntrack_buckets = 131072
net.netfilter.nf_conntrack_checksum = 1
net.netfilter.nf_conntrack_count = 85
net.netfilter.nf_conntrack_dccp_loose = 1
net.netfilter.nf_conntrack_dccp_timeout_closereq = 64
net.netfilter.nf_conntrack_dccp_timeout_closing = 64
net.netfilter.nf_conntrack_dccp_timeout_open = 43200
net.netfilter.nf_conntrack_dccp_timeout_partopen = 480
net.netfilter.nf_conntrack_dccp_timeout_request = 240
net.netfilter.nf_conntrack_dccp_timeout_respond = 480
net.netfilter.nf_conntrack_dccp_timeout_timewait = 240
net.netfilter.nf_conntrack_events = 1
net.netfilter.nf_conntrack_events_retry_timeout = 15
net.netfilter.nf_conntrack_expect_max = 1024
net.netfilter.nf_conntrack_frag6_high_thresh = 4194304
net.netfilter.nf_conntrack_frag6_low_thresh = 3145728
net.netfilter.nf_conntrack_frag6_timeout = 60
net.netfilter.nf_conntrack_generic_timeout = 600
net.netfilter.nf_conntrack_helper = 1
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_icmpv6_timeout = 30
net.netfilter.nf_conntrack_log_invalid = 0
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_sctp_timeout_closed = 10
net.netfilter.nf_conntrack_sctp_timeout_cookie_echoed = 3
net.netfilter.nf_conntrack_sctp_timeout_cookie_wait = 3
net.netfilter.nf_conntrack_sctp_timeout_established = 432000
net.netfilter.nf_conntrack_sctp_timeout_heartbeat_acked = 210
net.netfilter.nf_conntrack_sctp_timeout_heartbeat_sent = 30
net.netfilter.nf_conntrack_sctp_timeout_shutdown_ack_sent = 3
net.netfilter.nf_conntrack_sctp_timeout_shutdown_recd = 0
net.netfilter.nf_conntrack_sctp_timeout_shutdown_sent = 0
net.netfilter.nf_conntrack_tcp_be_liberal = 0
net.netfilter.nf_conntrack_tcp_loose = 1
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_timestamp = 0
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
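
One thing I will keep in mind is that nf_conntrack_count above is only a point-in-time snapshot (85 when I ran it), and the table-full drops only happen at peaks, so it’s probably worth sampling it during a busy window rather than trusting a single reading:

# Sample the live count against the limit every 5 s during a busy period
watch -n 5 'sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max'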

Glad to hear your problem is fixed though!