Interestingly enough, at a decent rate of 1k+ TPS this issue still rears its head. The keepalive change I mentioned in the comment above seems to have helped to a degree, but I still see results like this:
2018/04/12 02:39:49 [error] 56#0: *10324471 peer closed connection in SSL handshake (104: Connection reset by peer), context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 56#0: *10324471 [lua] handler.lua:85: [http-log] failed to do SSL handshake with splunk.company.com:443: handshake failed, context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 56#0: *10324471 [lua] handler.lua:91: [http-log] failed to send data to splunk.company.com:443: closed, context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 55#0: *10324478 peer closed connection in SSL handshake (104: Connection reset by peer), context: ngx.timer, client: 10.129.4.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 55#0: *10324478 [lua] handler.lua:85: [http-log] failed to do SSL handshake with splunk.company.com:443: handshake failed, context: ngx.timer, client: 10.129.4.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 55#0: *10324478 [lua] handler.lua:91: [http-log] failed to send data to splunk.company.com:443: closed, context: ngx.timer, client: 10.129.4.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 54#0: *10324489 peer closed connection in SSL handshake (104: Connection reset by peer), context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 54#0: *10324489 [lua] handler.lua:85: [http-log] failed to do SSL handshake with splunk.company.com:443: handshake failed, context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 54#0: *10324489 [lua] handler.lua:91: [http-log] failed to send data to splunk.company.com:443: closed, context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 57#0: *10324485 peer closed connection in SSL handshake (104: Connection reset by peer), context: ngx.timer, client: 10.131.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 57#0: *10324485 [lua] handler.lua:85: [http-log] failed to do SSL handshake with splunk.company.com:443: handshake failed, context: ngx.timer, client: 10.131.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 57#0: *10324485 [lua] handler.lua:91: [http-log] failed to send data to splunk.company.com:443: closed, context: ngx.timer, client: 10.131.2.1, server: 0.0.0.0:8443
Current code:
-- Log to an HTTP endpoint.
-- This is structured as a timer callback.
-- @param `premature` see the OpenResty ngx.timer.at function
-- @param `conf` plugin configuration table, holds HTTP endpoint details
-- @param `body` raw HTTP body to be logged
-- @param `name` the plugin name (used for logging purposes in case of errors, etc.)
local function log(premature, conf, body, name)
  if premature then
    return
  end
  name = "[" .. name .. "] "

  local ok, err
  local parsed_url = parse_url(conf.http_endpoint)
  local host = parsed_url.host
  local port = tonumber(parsed_url.port)

  local sock = ngx.socket.tcp()
  sock:settimeout(conf.timeout)

  ok, err = sock:connect(host, port)
  if not ok then
    ngx.log(ngx.ERR, name .. "failed to connect to " .. host .. ":" .. tostring(port) .. ": ", err)
    return
  end

  if parsed_url.scheme == HTTPS then
    local _, err = sock:sslhandshake(true, host, false)
    if err then
      ngx.log(ngx.ERR, name .. "failed to do SSL handshake with " .. host .. ":" .. tostring(port) .. ": ", err)
    end
  end

  ok, err = sock:send(generate_post_payload(conf.method, conf.content_type, parsed_url, body, conf.splunk_access_token))
  if not ok then
    ngx.log(ngx.ERR, name .. "failed to send data to " .. host .. ":" .. tostring(port) .. ": ", err)
  end

  -- ok, err = sock:setkeepalive(conf.keepalive)
  -- if not ok then
  --   ngx.log(ngx.ERR, name .. "failed to keepalive to " .. host .. ":" .. tostring(port) .. ": ", err)
  --   return
  -- end
end
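One thing I notice in the callback above (not claiming it is the root cause of the resets): when `sslhandshake` fails, the error is logged but the function falls through and still calls `send` on the broken socket, which is why every handshake failure in the log is paired with a "failed to send data ... closed" line. A sketch of the tail of the function that bails out early on handshake failure and, on success, returns the connection to the per-worker cosocket pool via `setkeepalive`, so subsequent timer runs can reuse the TLS connection instead of paying a fresh handshake. This assumes the same `conf`, `parsed_url`, `host`, `port`, and `sock` variables as above, and that `conf.keepalive` is the keepalive idle timeout in milliseconds:

```lua
-- Sketch only; replaces the HTTPS branch and everything after it.
if parsed_url.scheme == HTTPS then
  local session, err = sock:sslhandshake(true, host, false)
  if not session then
    ngx.log(ngx.ERR, name .. "failed to do SSL handshake with " .. host .. ":" .. tostring(port) .. ": ", err)
    sock:close()
    return  -- don't fall through: send() on a failed handshake just adds a "closed" error
  end
end

ok, err = sock:send(generate_post_payload(conf.method, conf.content_type, parsed_url, body, conf.splunk_access_token))
if not ok then
  ngx.log(ngx.ERR, name .. "failed to send data to " .. host .. ":" .. tostring(port) .. ": ", err)
  return
end

-- Put the connection (TLS state included) back into the per-worker pool instead
-- of letting it be closed; the next timer run's connect() to the same host:port
-- can then reuse it and skip both the TCP connect and the TLS handshake.
ok, err = sock:setkeepalive(conf.keepalive)
if not ok then
  ngx.log(ngx.ERR, name .. "failed to set keepalive for " .. host .. ":" .. tostring(port) .. ": ", err)
end
```

With reuse working, each worker should only handshake when the pool is empty or an idle connection has timed out, which ought to cut the handshake volume the upstream sees at 1-5k TPS considerably.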
Do any experts on https://github.com/openresty/lua-nginx-module#ngxsockettcp have an idea of what is going on, and a remedy? I would love to achieve zero dropped transactions.
My KONG_LUA_SOCKET_POOL_SIZE is set to 30; is that a reasonable value for an 8-worker-process node handling 1-5k TPS? Something has to give here.
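For context on that number, my understanding is that `lua_socket_pool_size` is a per-worker, per-"host:port" cap on idle pooled connections, so 8 workers at 30 each allows up to 240 idle connections to the Splunk endpoint; connections beyond the cap are simply closed rather than erroring. A hypothetical fragment of the generated nginx config (I believe KONG_LUA_SOCKET_POOL_SIZE maps to this directive via Kong's template, but that mapping is an assumption on my part):

```nginx
# Per-worker cosocket pool tuning (values are illustrative):
lua_socket_pool_size 30;            # idle connections kept per worker, per upstream host:port
lua_socket_keepalive_timeout 60s;   # how long an idle pooled connection is retained
```

Note the pool only ever holds idle connections, so it does not limit concurrent in-flight sends; raising it mainly helps when many timer callbacks finish around the same time and would otherwise discard reusable connections.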
I would also like to point out that when I see these errors, it is not 100% of transaction logging failing, just a large chunk. Interestingly, I also run the statsd plugin against an app sidecar to Kong that streams metrics into Grafana, and it reports the correct TPS. When it was reporting around 5k TPS (the true rate), Splunk was receiving about 3k, so roughly 40% transaction loss at high TPS over HTTPS.