HTTP-Log Plugin Error

  ok, err = sock:setkeepalive(conf.keepalive)
  if not ok then
    ngx.log(ngx.ERR, name .. "failed to keepalive to " .. host .. ":" .. tostring(port) .. ": ", err)
    return
  end

I see this error in our logs:

2018/02/01 03:06:32 [error] 35#0: *706228 [lua] handler.lua:103: [http-log] failed to keepalive to splunk.company.com:443: connection in dubious state, context: ngx.timer, client: 10.1.51.1, server: 0.0.0.0:8443

I saw the OpenResty dev mention a similar error here:

This seems to not happen all the time. Can any Kong developers give some insight into this? Is there a way to fix it, either on the Kong side or the Splunk side? And is it a dire issue, or does it just mean Nginx had to open a new TCP connection to Splunk and re-initiate the SSL handshake for a given transaction?

Edit - I think when this is occurring, Kong -> Splunk is actually losing transactions :frowning: .

Edit2 - Commenting out the keepalive code block entirely seemed to cause no loss of logs, FYI.

Thanks,
-Jeremy

Well, we ended up keeping the keepalive portion commented out; no more issues, but I do imagine our overhead with Splunk as the logging endpoint is probably a bit heavier now.

Interestingly enough, at a decent TPS of 1k+ this issue still seems to rear its head. Commenting out the keepalive portion, as I mentioned above, seems to have helped to a degree, but I still see results like this:

2018/04/12 02:39:49 [error] 56#0: *10324471 peer closed connection in SSL handshake (104: Connection reset by peer), context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 56#0: *10324471 [lua] handler.lua:85: [http-log] failed to do SSL handshake with splunk.company.com:443: handshake failed, context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 56#0: *10324471 [lua] handler.lua:91: [http-log] failed to send data to splunk.company.com:443: closed, context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 55#0: *10324478 peer closed connection in SSL handshake (104: Connection reset by peer), context: ngx.timer, client: 10.129.4.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 55#0: *10324478 [lua] handler.lua:85: [http-log] failed to do SSL handshake with splunk.company.com:443: handshake failed, context: ngx.timer, client: 10.129.4.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 55#0: *10324478 [lua] handler.lua:91: [http-log] failed to send data to splunk.company.com:443: closed, context: ngx.timer, client: 10.129.4.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 54#0: *10324489 peer closed connection in SSL handshake (104: Connection reset by peer), context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 54#0: *10324489 [lua] handler.lua:85: [http-log] failed to do SSL handshake with splunk.company.com:443: handshake failed, context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 54#0: *10324489 [lua] handler.lua:91: [http-log] failed to send data to splunk.company.com:443: closed, context: ngx.timer, client: 10.129.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 57#0: *10324485 peer closed connection in SSL handshake (104: Connection reset by peer), context: ngx.timer, client: 10.131.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 57#0: *10324485 [lua] handler.lua:85: [http-log] failed to do SSL handshake with splunk.company.com:443: handshake failed, context: ngx.timer, client: 10.131.2.1, server: 0.0.0.0:8443
2018/04/12 02:39:49 [error] 57#0: *10324485 [lua] handler.lua:91: [http-log] failed to send data to splunk.company.com:443: closed, context: ngx.timer, client: 10.131.2.1, server: 0.0.0.0:8443

Current code:

-- Log to an HTTP endpoint.
-- This is basically structured as a timer callback.
-- @param `premature` see the OpenResty ngx.timer.at function
-- @param `conf` plugin configuration table, holds HTTP endpoint details
-- @param `body` raw HTTP body to be logged
-- @param `name` the plugin name (used for logging purposes in case of errors, etc.)
local function log(premature, conf, body, name)
  if premature then
    return
  end
  name = "[" .. name .. "] "
  local ok, err
  local parsed_url = parse_url(conf.http_endpoint)
  local host = parsed_url.host
  local port = tonumber(parsed_url.port)
  local sock = ngx.socket.tcp()
  sock:settimeout(conf.timeout)

  ok, err = sock:connect(host, port)
  if not ok then
    ngx.log(ngx.ERR, name .. "failed to connect to " .. host .. ":" .. tostring(port) .. ": ", err)
    return
  end

  if parsed_url.scheme == HTTPS then
    local _, err = sock:sslhandshake(true, host, false)
    if err then
      ngx.log(ngx.ERR, name .. "failed to do SSL handshake with " .. host .. ":" .. tostring(port) .. ": ", err)
    end
  end

  ok, err = sock:send(generate_post_payload(conf.method, conf.content_type, parsed_url, body, conf.splunk_access_token))
  if not ok then
    ngx.log(ngx.ERR, name .. "failed to send data to " .. host .. ":" .. tostring(port) .. ": ", err)
  end

--  ok, err = sock:setkeepalive(conf.keepalive)
--  if not ok then
--    ngx.log(ngx.ERR, name .. "failed to keepalive to " .. host .. ":" .. tostring(port) .. ": ", err)
--    return
--  end
end
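For anyone else hitting this, here is an untested sketch of what I think the tail end of log() would need to look like for setkeepalive to succeed: return early when the SSL handshake or the send fails instead of falling through, and drain the HTTP response before pooling the socket, since my understanding is a cosocket that has hit an error or still has a half-finished exchange on it gets refused by the pool ("connection in dubious state" / "unread data in buffer"). The response-draining code is my own guess, not Kong code:

```lua
  if parsed_url.scheme == HTTPS then
    local _, ssl_err = sock:sslhandshake(true, host, false)
    if ssl_err then
      ngx.log(ngx.ERR, name .. "failed to do SSL handshake with " .. host .. ":" .. tostring(port) .. ": ", ssl_err)
      return  -- don't keep writing to a socket whose handshake failed
    end
  end

  ok, err = sock:send(generate_post_payload(conf.method, conf.content_type, parsed_url, body, conf.splunk_access_token))
  if not ok then
    ngx.log(ngx.ERR, name .. "failed to send data to " .. host .. ":" .. tostring(port) .. ": ", err)
    return  -- a failed send leaves nothing worth pooling
  end

  -- Drain the response (status line + headers + body) so the connection is
  -- in a clean, reusable state before it goes back into the pool.
  local content_length = 0
  local line
  repeat
    line, err = sock:receive("*l")
    if not line then
      ngx.log(ngx.ERR, name .. "failed to read response from " .. host .. ": ", err)
      return
    end
    local len = line:match("^[Cc]ontent%-[Ll]ength:%s*(%d+)")
    if len then
      content_length = tonumber(len)
    end
  until line == ""  -- blank line marks the end of the headers

  if content_length > 0 then
    local _, body_err = sock:receive(content_length)  -- discard the response body
    if body_err then
      ngx.log(ngx.ERR, name .. "failed to read response body from " .. host .. ": ", body_err)
      return
    end
  end

  ok, err = sock:setkeepalive(conf.keepalive)
  if not ok then
    ngx.log(ngx.ERR, name .. "failed to keepalive to " .. host .. ":" .. tostring(port) .. ": ", err)
    return
  end
```

Would be great if someone who knows the cosocket internals could confirm whether reading the response like this is what setkeepalive needs here.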

Do any experts on https://github.com/openresty/lua-nginx-module#ngxsockettcp have an idea of what is going on, and a remedy? I would love to be able to achieve no dropped transactions.

My KONG_LUA_SOCKET_POOL_SIZE is set to 30. Is this a decent size for an 8-worker-process node handling 1-5k TPS? Something has to give here.
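For context (and someone correct me if I'm wrong), my understanding is the cosocket connection pool is per worker process and per host:port, so 8 workers x 30 would cap out at 240 pooled connections to Splunk. Bumping it is just a kong.conf / environment change; the value below is a guess, not a recommendation:

```shell
# Hypothetical value - lua_socket_pool_size in kong.conf maps to the
# KONG_LUA_SOCKET_POOL_SIZE environment variable.
export KONG_LUA_SOCKET_POOL_SIZE=256
kong restart
```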

I would also like to point out that when I see these errors, it's not 100% of transaction logging failing, just a large chunk. Interestingly enough, I also run the statsD plugin to an app sidecar to Kong that streams metrics into Grafana, and it does report the proper TPS. When it was logging around 5k TPS (true TPS), Splunk was getting about 3k, I want to say. So probably ~40% transaction loss at high TPS using HTTPS.