104: Connection reset by peer while reading response header from upstream

We have Kong running on ECS in a Docker container, behind an Elastic Load Balancer. We've been getting 502 responses back from Kong. Our setup is the following:

clients -> elastic load balancer -> kong ecs -> kong docker containers -> microservice load balancer -> microservice ecs -> microservice containers

When checking the CloudWatch logs I found this:

2017/12/14 09:35:56 [error] 53#0: *273045 upstream prematurely closed connection while reading response header from upstream, client: …, server: kong, request: "POST /v1/user/settings HTTP/1.1", upstream: "https://x:443/user/settings", host: …

As a test I started using http routes instead, and the errors I’m getting now are:

2017/12/18 12:07:58 [error] 53#0: *38590 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.0.32.239, server: kong, request: "POST /v1/user/settings HTTP/1.1", upstream: "http://x:80/user/settings", host: …

I’m a bit at a loss, and I’m not really sure where the problem is. I’m using Kong 0.11.2.
It doesn’t seem to hit our microservices when this error occurs, so that makes me think it must be something on the kong level.

I've been thinking that maybe it's related to Connection: keep-alive, and somehow our microservices not honoring it or closing the connections. Is there a way for me to make Kong omit Connection: keep-alive when it makes the upstream requests, as a test?

Any other ideas? I’m not seeing much more in the logs except for these errors. I have the following setup:
export KONG_PROXY_ACCESS_LOG="/dev/stdout"
export KONG_ADMIN_ACCESS_LOG="/dev/stdout"
export KONG_PROXY_ERROR_LOG="/dev/stderr"
export KONG_ADMIN_ERROR_LOG="/dev/stderr"
export KONG_LOG_LEVEL="debug"

The route is configured like this (we’re using kongfig)

- name: "user_settings"
  ensure: "present"
  attributes:
    upstream_url: "%route%/settings"
    uris: "/v1/user/settings"
    methods: ["GET", "POST"]
    https_only: false
    upstream_connect_timeout: 60000
    upstream_read_timeout: 60000
    upstream_send_timeout: 60000
    retries: 10
  plugins:
    - name: jwt
      attributes:
        config:
          key_claim_name: "iss"
          claims_to_verify: "exp"
          secret_is_base64: false
          uri_param_names: ""

We're using Postgres as the datastore.


Hi @donalddw,

Welcome to the fabulous world of reverse proxies! "Connection reset by peer" should probably be considered the initiation rite of any Konger/NGINX user: it isn't an uncommon error at all. What it means (you might be familiar with it, but future readers might not be) is that the remote server (your upstream) hung up on Kong and sent an RST packet without bothering to do the usual FIN/ACK handshake. Quite rude.

As the error says, it seems like the remote is closing its connection while Kong is reading the response headers. You already saw the same behavior after switching to plain HTTP, which at least rules out TLS, but unfortunately didn't resolve the issue.

You might also want to:

  • investigate your remote server’s logs to find out if any error occurred at that time that may have caused it to react so abruptly.
  • capture the traffic between Kong and the remote server with tools such as tcpdump or Wireshark to gather more information and draw better hypotheses about the root cause; you might even be able to replicate the issue without Kong if you find out what triggers it (a sample capture invocation follows this list).
  • restart Kong with log_level = debug and keep an eye on the logs as such errors occur (you already did so, but future readers of this issue may not have).
  • ensure that your remote is accepting HTTP/1.1 traffic (vs 1.0).
  • as you suggested, try tweaking the ngx_http_proxy_module directives so that Kong sends Connection: close by default (although how it is interpreted is up to your remote anyway; and a reminder: in HTTP/1.1, all connections are considered kept-alive unless Connection: close is explicitly set). Another possible tweak is setting proxy_http_version to 1.0 in case the above turns out to be a requirement. The only way to tweak those settings is by using a custom NGINX configuration template; see the snippets right after this list.
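
To make that last point concrete, here is a minimal sketch of the two ngx_http_proxy_module directives in question as they could appear in the proxy section of a custom template. The placement and the surrounding template contents are assumptions to adapt, not a drop-in configuration:

    # in the proxy section of a custom Kong NGINX template (placement assumed)
    proxy_http_version 1.0;            # speak HTTP/1.0 to the upstream
    proxy_set_header Connection close; # ask the upstream to close the connection after each response

And for the capture suggestion above, something along these lines run on the Kong host (or inside the container) is usually enough to see which side sends the first RST; the interface name and upstream address are placeholders:

    # capture Kong <-> upstream traffic to a file, without name resolution, full packet size
    tcpdump -i eth0 -nn -s 0 -w kong-upstream.pcap host <upstream-ip> and port 80
    # then open kong-upstream.pcap in Wireshark and look at who sends the first RST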

I think those steps already amount to a good, standard way of debugging this issue. If I can think of anything else, I'll come back and update this list. I'm hoping that since this error is particularly common, your topic will come up when users dig for answers before posting their own question.

Good hunting and let us know of your findings!

Best,


Hi,

We are facing the same problem. We are running Kong 0.11.2 in Kubernetes. Our upstream servers are running in Tomcat.

Will disabling keep-alive solve the problem?

@chandresh_pancholi you can certainly try it, but the correct solution depends on how your system is configured; that's why @thibaultcha included a list of possible things to try. If modifying the keep-alive behavior doesn't help you, try one of the other suggestions.
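
Since your upstreams run in Tomcat, one additional place worth checking is the HTTP connector's keep-alive settings in server.xml; a keep-alive timeout on the Tomcat side that is shorter than what the proxy expects can look exactly like a reset from Kong's point of view. A sketch with example values only (not recommendations):

    <!-- server.xml HTTP connector; connectionTimeout and keepAliveTimeout are in milliseconds -->
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               keepAliveTimeout="60000"
               maxKeepAliveRequests="100" />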

hi @thibaultcha

Thanks for the tips and things to try.

After trying a lot of different things, it seems that setting proxy_http_version to 1.0 and setting the Connection header to close resolves most of the issues. I'm not 100% sure, but it might be related to the Elastic Load Balancer closing the connection. We're changing our infrastructure soon, and hopefully we can then revert to 1.1, because I do want to keep using connection keep-alive.

Turning off keep-alive introduced another problem, though, although it doesn't always seem to happen: we're now getting occasional timeouts.

2018/01/17 19:51:11 [error] 51#0: *48021 upstream timed out (110: Operation timed out) while connecting to upstream, client: 10.0.33.226, server: kong, request: "GET /v1/relationship/following HTTP/1.1", upstream: "https://<ip>:443/relationship/following", host: "<host>"

2018/01/17 19:51:11 [error] 51#0: *48021 [lua] init.lua:314: balancer(): failed to retry the dns/balancer resolver for '<host>' with: dns server error: 4 server failure, cache only lookup failed while connecting to upstream, client: 10.0.33.226, server: kong, request: "GET /v1/relationship/following HTTP/1.1", upstream: "https://<ip>:443/relationship/following", host: "<host>"

Often our API calls are very slow to return a result. I'm not sure if this is a consequence of turning off keep-alive. Note that none of our API calls should take longer than 5 seconds at most.

What I notice, for example, is that a lot of the slow calls return at around 31 seconds. Now, 30 seconds is the idle timeout that we set on our AWS Elastic Load Balancers, so it doesn't seem to be a coincidence. It's almost as if Kong hangs for 30 seconds, then retries and returns the response to the client. I don't see any mention of the retries in the logs though, even though the log level is set to debug.
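
(For anyone who wants to double-check their own setup: the idle timeout can be read back with the AWS CLI, assuming a Classic ELB; the load balancer name below is a placeholder.)

    aws elb describe-load-balancer-attributes --load-balancer-name my-kong-elb
    # the output includes "ConnectionSettings": { "IdleTimeout": 30 }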

Any ideas what might be the cause here? Any help is appreciated

@donalddw Have you noticed the API entity's upstream_*_timeout attributes? Those are upstream_connect_timeout, upstream_send_timeout, and upstream_read_timeout. Could it be that you set the connect timeout to 30s? Are there any errors in the logs indicating that Kong could not set the upstream timeouts? Are there any errors prior to those at all?
The second error could be a consequence of the first one, in which Kong failed to resolve the DNS host for your upstream service, so subsequent retries of the upstream request error out as well. Maybe you could share your API's attributes?
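
If it helps, those attributes can be read back (and patched) through the Admin API; a rough example, assuming the API is named user_settings and the Admin API listens on the default port 8001:

    # inspect the API's timeout/retry attributes
    curl -s http://localhost:8001/apis/user_settings

    # example: lower the connect timeout (values are in milliseconds)
    curl -s -X PATCH http://localhost:8001/apis/user_settings --data upstream_connect_timeout=2000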

@thibaultcha In the meantime a lot has changed. We moved to Kubernetes, we're no longer using ELBs, and we're getting very good results now. I suspect there was some issue with the way the ELB was set up; maybe it was killing connections. We had an ELB in front of the gateway, and then an ELB in front of each individual microservice. Maybe the combination of those two layers of ELBs was not good.

We also reduced the upstream_connect_timeout to 1 or 2 seconds with 5 retries. That also seems to have gotten rid of the timeouts we used to see from Kong to the upstream. All in all, we're running stable now.
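
In kongfig terms the change is essentially this on each API's attributes (approximate values, see our earlier config above):

    upstream_connect_timeout: 2000   # down from 60000 (milliseconds)
    retries: 5                       # down from 10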


Hi,

We have the same problem. We opened an issue;

Is there any way to solve this upstream problem?

Many thanks.