I have set up Kong Ingress Controller 0.9.0 and Kong 2.0.4 on EKS. I used the all-in-one-dbless template with customizations (internal NLB, resource limits, etc.).
Requests from Kong to upstream services are made over HTTPS. The setup is working as expected, and I am now running performance tests.
I have 2 Kong pods and an upstream service with a simple API (minimal compute per request).
HPA is configured for Kong to scale up at 60% CPU usage (reduced from 80%)
HPA is configured for the upstream to scale up at 80% CPU
Tests are run using wrk with 40 to 100 connections and 20 to 80 threads for a 10-minute duration
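For reference, the Kong HPA described above would correspond to something like the following manifest. Only the 60% CPU target and the 2-pod baseline come from the description; the object names, API version, and the max replica count are assumptions:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: kong-proxy            # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingress-kong        # assumed deployment name from the template
  minReplicas: 2              # matches the 2 Kong pods mentioned above
  maxReplicas: 5              # assumed
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # reduced from 80, per the description
```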
As we drive higher load at this setup, we do see the HPA kick in for the upstream service as expected.
However, as the load increases, liveness probes start failing on the Kong proxy, causing it to be restarted. This results in errors on the client side.
Throughout this period the Kong pods' CPU and memory usage remain under 25% and 40% respectively, so no HPA is triggered.
In the Kong error logs I see three categories of errors:
- "peer closed connection in SSL handshake while SSL handshaking to upstream" (~95%)
- "balancer.lua:628: get_balancer(): balancer not found for" (~2.5%)
- "connect() failed (111: Connection refused) while connecting to upstream" (~2.5%)
It seems like the Kong pods are hitting some network or socket I/O limit. What performance tuning options are available in Kong?
Forgot to mention that I increased the liveness probe timeout from 1 second to 9 seconds, which allowed Kong to run a little longer before failing the probes.
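The probe change described above amounts to something like this in the proxy container spec. The 9-second timeout is from the description; the probe path, port, and the other probe fields are assumptions based on typical Kong deployment manifests:

```yaml
livenessProbe:
  httpGet:
    path: /status        # assumption: Kong's status endpoint
    port: 8100           # assumption: status listen port
  timeoutSeconds: 9      # raised from the 1s default, as described above
  periodSeconds: 10      # assumed
  failureThreshold: 3    # assumed
```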
Kong is only making HTTPS connections to the upstream service on its HTTPS port. What would be the recommended CPU and memory limits for the Kong proxy under this load?
I think I found the bottleneck: my custom plugin. This plugin calls our authentication service to validate the token passed along with the request. If validation succeeds, the request is allowed through to the upstream service; otherwise it returns 401.
To call our authentication service from my custom plugin code, I am using the ssl.https module like below:
local http = require "ssl.https"
local ltn12 = require "ltn12"

local t = {}
local r, resp_code, resp_headers = http.request{
  url = validationUrl,
  method = "GET",
  headers = token_api_headers,
  protocol = conf.tls_protocol,
  verify = verify_cert_option,
  cafile = conf.cluster_ca_file,
  sink = ltn12.sink.table(t) -- sink to collect the response body
}
If I disable this call in the plugin code, performance is much better and the Kong pods autoscale as expected.
We have independently tested the authentication service under the same load and it shows no signs of slowness, so this ssl.https call definitely looks like the culprit. (ssl.https comes from LuaSocket/LuaSec, which performs blocking I/O; inside Nginx's event loop, a blocking call stalls the entire worker, which would also explain the failing liveness probes despite low CPU usage.)
I am new to Lua and Kong, so I am unaware of the best practice for making such outbound calls from a Kong plugin. Do you have any recommendations?
Yesterday I found out about lua-resty-http and switched the implementation in my plugin to:
local restyHttp = require "resty.http"

local HTTP_OK = 200 -- was undefined in the original snippet

. . .

-- Make an HTTPS call and return the response body
function make_https_call_with_resty(url, headers, method, conf)
  local parameters = {
    method = method,
    headers = headers,
    ssl_verify = conf.verify_certs
  }
  local httpClient = restyHttp.new()
  local res, err = httpClient:request_uri(url, parameters)
  if res then
    -- note: has_body is not set on request_uri() responses;
    -- request_uri() reads the entire body into res.body as a string
    if res.status == HTTP_OK and res.body then
      local resp_body = res.body
      kong.log.notice("Auth response Body: ", resp_body)
      return resp_body
    else
      kong.log.notice("validation api call returned ", res.status)
    end
  else
    kong.log.notice("validation api call encountered error ", err)
  end
end
But now I am facing a weird issue: these outbound requests from the plugin fail alternately, i.e. 1st succeeds, 2nd fails, 3rd succeeds, 4th fails, and so on.
The auth service being called is the same as before. If I switch back to my old implementation using the ssl.https library, requests work 100% of the time. Am I missing something when using lua-resty-http?
I got my issue resolved now. For our needs I had to pass the incoming request headers through to our auth service from the plugin, and one of those headers was causing the problem. It took a while to figure out because it was not failing all the time, but at a precise 50/50 pass/fail rate.
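The post does not name the offending header, but an exact every-other-request failure pattern is characteristic of connection-reuse problems, for example forwarding hop-by-hop headers such as Connection so that every pooled connection gets closed after use. A sketch of filtering those out before forwarding, under that assumption (strip_hop_by_hop is an illustrative helper, not from the original plugin; the header list follows RFC 7230):

```lua
-- Hypothetical helper: copy request headers, dropping hop-by-hop
-- headers that must not be forwarded over a pooled connection
-- (RFC 7230 section 6.1 lists these as connection-specific).
local HOP_BY_HOP = {
  ["connection"] = true,
  ["keep-alive"] = true,
  ["proxy-authenticate"] = true,
  ["proxy-authorization"] = true,
  ["te"] = true,
  ["trailer"] = true,
  ["transfer-encoding"] = true,
  ["upgrade"] = true,
}

local function strip_hop_by_hop(headers)
  local out = {}
  for name, value in pairs(headers) do
    if not HOP_BY_HOP[string.lower(name)] then
      out[name] = value
    end
  end
  return out
end

-- Usage (names assumed from the snippets above):
-- local safe_headers = strip_hop_by_hop(kong.request.get_headers())
-- local body = make_https_call_with_resty(url, safe_headers, "GET", conf)
```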