I’m currently investigating our prod environment a bit more to see the actual impact before I change the default Kong DNS settings. I also don’t know a good way to debug this. I don’t want to put Kong in debug mode because it prints so many logs; the best way to validate the root cause would be to isolate the DNS-related logging and alert only on anomalies found there (such as a DNS lookup taking longer than 2s, if that is indeed the root cause for @bringking or myself). Results from my digging so far are as follows:
Queries I used to look up real prod traffic (the first matches the impacted requests, the second all traffic):
index=cba_kong host="gateway" RoutingURL!="*localhost*" TotalLatency >= 2000 BackendLatency <= 1000
index=cba_kong host="gateway" RoutingURL!="*localhost*"
Results over a 7-day period are as follows:
As you can see, a little over 4% of traffic is impacted by whatever is causing this. I will try some changes to see if this can be resolved. @bringking, not sure how deep your analytics go, but I am interested to see the % impact on your services and whether it’s around the same % I see.
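For anyone who wants to reproduce the percentage outside of Splunk, here is a minimal Python sketch of the same calculation. It assumes the log records have been exported as dictionaries with the RoutingURL, TotalLatency and BackendLatency fields used in the queries above, and that both latencies are in milliseconds; the sample records are made up for illustration and are not real traffic.

```python
def excludes_localhost(record):
    """Mirror the RoutingURL!="*localhost*" filter from the queries."""
    return "localhost" not in record["RoutingURL"]


def is_impacted(record):
    """Slow overall (>= 2000) while the backend itself was fast (<= 1000)."""
    return (
        excludes_localhost(record)
        and record["TotalLatency"] >= 2000
        and record["BackendLatency"] <= 1000
    )


def impact_percent(records):
    """Percentage of non-localhost requests that match the impacted filter."""
    considered = [r for r in records if excludes_localhost(r)]
    if not considered:
        return 0.0
    impacted = sum(1 for r in considered if is_impacted(r))
    return 100.0 * impacted / len(considered)


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    sample = [
        {"RoutingURL": "https://svc-a.example/api", "TotalLatency": 2300, "BackendLatency": 120},
        {"RoutingURL": "https://svc-a.example/api", "TotalLatency": 150, "BackendLatency": 110},
        {"RoutingURL": "https://svc-b.example/api", "TotalLatency": 3100, "BackendLatency": 2900},
        {"RoutingURL": "https://svc-b.example/api", "TotalLatency": 90, "BackendLatency": 60},
        {"RoutingURL": "http://localhost:8001/status", "TotalLatency": 5000, "BackendLatency": 10},
    ]
    print(f"{impact_percent(sample):.1f}% of requests impacted")
```

The third sample record is slow because the backend itself was slow, so it is deliberately not counted; only requests where the extra time is spent on the Kong side are.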
Edit - To further prove whether it’s DNS or not, I could turn off DNS caching altogether in our Stage env and see if doing lookups every time yields more frequent latency spikes, but I don’t see an option for that in kong.conf. If anyone knows which Lua file and lines I need to modify (only if it’s a trivial snippet change) before starting Kong to disable all DNS caching, I will apply that patch in Stage and run some tests.
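On the idea of alerting only on slow DNS lookups instead of enabling full debug logging, a rough sketch of an external probe in Python could look like the following. This is an assumption-heavy illustration: it times lookups through the operating system's resolver via socket.getaddrinfo, not through Kong's own DNS client, so it can only show whether the nameserver path itself is slow from that host. The 2s threshold comes from the figure mentioned earlier; the function names are hypothetical.

```python
import socket
import time

SLOW_LOOKUP_SECONDS = 2.0  # threshold taken from the 2s figure discussed above


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Run one lookup and return (elapsed_seconds, error_or_None).

    The resolver argument is injectable so the timing logic can be
    exercised without real network access.
    """
    start = time.monotonic()
    error = None
    try:
        resolver(hostname, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    return time.monotonic() - start, error


def is_anomalous(elapsed, error, threshold=SLOW_LOOKUP_SECONDS):
    """Flag lookups that failed or took longer than the threshold."""
    return error is not None or elapsed > threshold


if __name__ == "__main__":
    # "localhost" is only a placeholder; substitute an upstream hostname.
    elapsed, error = time_lookup("localhost")
    status = "ANOMALY" if is_anomalous(elapsed, error) else "ok"
    print(f"{status}: lookup took {elapsed:.3f}s, error={error!r}")
```

Run on a schedule from the gateway host, something like this would give a much smaller signal than debug logs, though any OS-level DNS caching on the host could hide slow lookups that Kong itself would still see.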
Edit Edit - I looked into our Stage env too and noticed 4% latency spikes on the Kong side there as well, go figure. I changed our Stage env to the same DNS STALE TTL value of 60 that you tried, and will watch it next week to see whether there is any improvement and the 4% goes to 0%. If not, it’s back to the drawing board on my side.
