Kong 1.4.0 - Getting random 503 status codes on requests

Hello,

We’ve been having random 5xx errors with Kong; the vast majority are 503s, which I will focus on in this thread. It has been happening since we were on version 0.14.1. We upgraded to 1.4.0 a month ago, and since that didn’t resolve the issue, we figured it was time to investigate further. Our setup, all in AWS, is the following:

  • Route53 has an A record as an Alias pointing to CloudFront.

  • CloudFront points to Route53 that has an A record as an ALIAS for the Kong Application ELB.

  • Kong Application ELB has the target group to two EC2 Debian linux instances that are only running Kong.

  • We use Postgres RDS instance for storing Kong configurations.

  • We use mainly the default settings for Kong/Nginx. Below is what we have in our /etc/kong/kong.conf file. I’ve XXX’d out the sensitive settings.

    prefix = /usr/local/kong/
    log_level = notice
    proxy_access_log = logs/access.log upstreaminfo
    proxy_error_log = logs/error.log
    admin_access_log = logs/admin_access.log
    admin_error_log = logs/error.log
    proxy_listen = 0.0.0.0:80, 0.0.0.0:443 ssl
    admin_listen = 0.0.0.0:8001, 127.0.0.1:8444 ssl
    headers = off
    nginx_http_set_real_ip_from = 10.252.0.0/19
    real_ip_header = X-Forwarded-For
    nginx_http_lua_shared_dict = prometheus_metrics 50m
    database = postgres
    pg_host = XXXXXXXXXXXXX
    pg_port = 5432
    pg_user = XXXX
    pg_password = XXXXXXXXXX
    pg_database = XXXX

Things we have checked when the 503 happens:

  • The last evidence of the request is at the ELB level: the ELB access logs show a 503 target_status_code returned by the Kong EC2 instances.
  • We see the same request come back with a status of 200 both before and after the 503 happens, all within a few milliseconds of each other.
  • We don’t see any logging of these 503 requests in the /usr/local/kong/logs/*.log files, nor any errors in those Kong log files that would indicate Kong is having an issue.
  • We checked the ELB CloudWatch metrics for UnHealthyHostCount, and it reports 0 unhealthy hosts during the times when we get 503 statuses, so our Kong nodes are healthy when the 503s happen. The ELB target group health check uses port 80 and path / on the Kong nodes, and looks for success status codes 200-499.
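
One way to rule the ELB in or out is to hit the Kong nodes directly with the same kind of probe the target group uses and log anything outside the 200-499 success range with a timestamp. A minimal sketch (the node address in the example is hypothetical, not from our setup):

```python
import time
import urllib.request
import urllib.error

def probe(url, timeout=5):
    """Return the HTTP status code for a single GET, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # urllib raises on 4xx/5xx; the status code is still here

def watch(url, interval=1.0, rounds=60):
    """Poll the node and print a timestamped line for anything outside
    200-499, mirroring the ELB health check's success range."""
    for _ in range(rounds):
        status = probe(url)
        if not (200 <= status <= 499):
            print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} got {status} from {url}")
        time.sleep(interval)

# Example (hypothetical node address):
# watch("http://10.252.234.132/")
```

If the 503s show up here too, the ELB is off the hook and the problem is on the Kong node itself.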

We’ve noticed the topic “Kong v 1.4.0 - Getting 504s when adding plugin to route via REST (SOLVED)” on Kong Nation, which isn’t really related to this issue, but we haven’t tried setting anonymous_reports to false.

These 5xx errors are very random and don’t have a major impact on our services running Kong, but it would be good to figure out the root cause and get this resolved. So we’re looking to Kong Nation for advice on what else to troubleshoot and how to resolve it.

Thank you,
Mike

Could you share more details about your Kong configuration?

Do these 503s happen for a specific service in Kong or for all the services?
If it is a specific service in Kong, can you share the configuration of the service/upstream/route for this combination?

Hello @hbagdi

I can’t confirm it happens on all our services, as we have quite a few, but it does happen on many of them. There doesn’t seem to be a pattern to which service it happens on.

Is this what you need for more details on our Kong configuration? This is the output of http://localhost:8001/; I XXXX’d out the pg_host value. We run these on t3.medium EC2 instances.

Please let me know if you need more details.

Thanks,
Mike

{
  "plugins": {
    "enabled_in_cluster": [
      "cors",
      "response-transformer",
      "request-termination",
      "key-auth",
      "jwt",
      "basic-auth",
      "acl",
      "post-function",
      "aws-lambda",
      "prometheus",
      "ip-restriction"
    ],
    "available_on_server": {
      "correlation-id": true,
      "pre-function": true,
      "cors": true,
      "ldap-auth": true,
      "loggly": true,
      "hmac-auth": true,
      "zipkin": true,
      "request-size-limiting": true,
      "azure-functions": true,
      "request-transformer": true,
      "oauth2": true,
      "response-transformer": true,
      "ip-restriction": true,
      "statsd": true,
      "jwt": true,
      "proxy-cache": true,
      "basic-auth": true,
      "key-auth": true,
      "http-log": true,
      "datadog": true,
      "tcp-log": true,
      "rate-limiting": true,
      "post-function": true,
      "prometheus": true,
      "acl": true,
      "kubernetes-sidecar-injector": true,
      "syslog": true,
      "file-log": true,
      "udp-log": true,
      "response-ratelimiting": true,
      "aws-lambda": true,
      "session": true,
      "bot-detection": true,
      "request-termination": true
    }
  },
  "tagline": "Welcome to kong",
  "configuration": {
    "error_default_type": "text/plain",
    "admin_listen": [
      "0.0.0.0:8001",
      "127.0.0.1:8444 ssl"
    ],
    "proxy_access_log": "logs/access.log upstreaminfo",
    "trusted_ips": {},
    "prefix": "/usr/local/kong",
    "loaded_plugins": {
      "session": true,
      "pre-function": true,
      "cors": true,
      "ldap-auth": true,
      "loggly": true,
      "hmac-auth": true,
      "zipkin": true,
      "request-size-limiting": true,
      "azure-functions": true,
      "request-transformer": true,
      "oauth2": true,
      "response-transformer": true,
      "syslog": true,
      "statsd": true,
      "jwt": true,
      "proxy-cache": true,
      "basic-auth": true,
      "key-auth": true,
      "http-log": true,
      "datadog": true,
      "tcp-log": true,
      "correlation-id": true,
      "post-function": true,
      "bot-detection": true,
      "acl": true,
      "kubernetes-sidecar-injector": true,
      "ip-restriction": true,
      "file-log": true,
      "udp-log": true,
      "response-ratelimiting": true,
      "aws-lambda": true,
      "rate-limiting": true,
      "prometheus": true,
      "request-termination": true
    },
    "cassandra_username": "kong",
    "ssl_cert_key": "/usr/local/kong/ssl/kong-default.key",
    "admin_ssl_cert_key": "/usr/local/kong/ssl/admin-kong-default.key",
    "dns_resolver": {},
    "pg_user": "kong",
    "mem_cache_size": "128m",
    "ssl_ciphers": "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256",
    "nginx_admin_directives": {},
    "nginx_http_upstream_directives": [
      {
        "value": "60s",
        "name": "keepalive_timeout"
      },
      {
        "value": "100",
        "name": "keepalive_requests"
      },
      {
        "value": "60",
        "name": "keepalive"
      }
    ],
    "nginx_http_directives": [
      {
        "value": "TLSv1.1 TLSv1.2 TLSv1.3",
        "name": "ssl_protocols"
      },
      {
        "value": "10.252.224.0/19",
        "name": "set_real_ip_from"
      },
      {
        "value": "prometheus_metrics 50m",
        "name": "lua_shared_dict"
      }
    ],
    "pg_host": "XXXXXXXXXXXXXXXXXXXX",
    "nginx_acc_logs": "/usr/local/kong/logs/access.log",
    "pg_semaphore_timeout": 60000,
    "proxy_listen": [
      "0.0.0.0:80",
      "0.0.0.0:443 ssl"
    ],
    "nginx_kong_stream_conf": "/usr/local/kong/nginx-kong-stream.conf",
    "stream_listeners": {},
    "nginx_http_lua_shared_dict": "prometheus_metrics 50m",
    "dns_no_sync": false,
    "db_update_propagation": 0,
    "cassandra_data_centers": [
      "dc1:2",
      "dc2:3"
    ],
    "stream_listen": [
      "off"
    ],
    "nginx_err_logs": "/usr/local/kong/logs/error.log",
    "router_update_frequency": 1,
    "cassandra_port": 9042,
    "dns_order": [
      "LAST",
      "SRV",
      "A",
      "CNAME"
    ],
    "dns_error_ttl": 1,
    "headers": [
      "off"
    ],
    "cassandra_ssl": false,
    "cassandra_lb_policy": "RequestRoundRobin",
    "status_listen": [
      "off"
    ],
    "dns_stale_ttl": 4,
    "nginx_optimizations": true,
    "nginx_http_upstream_keepalive_timeout": "60s",
    "db_update_frequency": 5,
    "pg_timeout": 60000,
    "nginx_conf": "/usr/local/kong/nginx.conf",
    "nginx_http_upstream_keepalive_requests": "100",
    "database": "postgres",
    "ssl_cert_csr_default": "/usr/local/kong/ssl/kong-default.csr",
    "admin_ssl_cert": "/usr/local/kong/ssl/admin-kong-default.crt",
    "client_ssl_cert_default": "/usr/local/kong/ssl/kong-default.crt",
    "pg_database": "kong",
    "nginx_worker_processes": "auto",
    "admin_error_log": "logs/error.log",
    "lua_package_cpath": "",
    "client_ssl": false,
    "admin_acc_logs": "/usr/local/kong/logs/admin_access.log",
    "anonymous_reports": true,
    "cassandra_refresh_frequency": 60,
    "lua_package_path": "./?.lua;./?/init.lua;",
    "nginx_pid": "/usr/local/kong/pids/nginx.pid",
    "upstream_keepalive": 60,
    "nginx_http_status_directives": {},
    "ssl_cipher_suite": "modern",
    "proxy_error_log": "logs/error.log",
    "db_cache_ttl": 0,
    "pg_max_concurrent_queries": 0,
    "nginx_stream_directives": {},
    "real_ip_header": "X-Forwarded-For",
    "proxy_listeners": [
      {
        "listener": "0.0.0.0:80",
        "proxy_protocol": false,
        "reuseport": false,
        "transparent": false,
        "ssl": false,
        "ip": "0.0.0.0",
        "deferred": false,
        "http2": false,
        "port": 80,
        "bind": false
      },
      {
        "listener": "0.0.0.0:443 ssl",
        "proxy_protocol": false,
        "reuseport": false,
        "transparent": false,
        "ssl": true,
        "ip": "0.0.0.0",
        "deferred": false,
        "http2": false,
        "port": 443,
        "bind": false
      }
    ],
    "proxy_ssl_enabled": true,
    "nginx_http_upstream_keepalive": "60",
    "db_cache_warmup_entities": [
      "services",
      "plugins"
    ],
    "enabled_headers": {
      "latency_tokens": false,
      "X-Kong-Response-Latency": false,
      "Server": false,
      "X-Kong-Admin-Latency": false,
      "X-Kong-Upstream-Status": false,
      "Via": false,
      "X-Kong-Proxy-Latency": false,
      "server_tokens": false,
      "X-Kong-Upstream-Latency": false
    },
    "nginx_http_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3",
    "lua_ssl_verify_depth": 1,
    "db_resurrect_ttl": 30,
    "cassandra_timeout": 60000,
    "cassandra_consistency": "ONE",
    "client_max_body_size": "0",
    "service_mesh": false,
    "status_listeners": {},
    "dns_not_found_ttl": 30,
    "pg_ssl": false,
    "pg_password": "******",
    "status_access_log": "off",
    "cassandra_repl_strategy": "SimpleStrategy",
    "status_error_log": "logs/status_error.log",
    "admin_access_log": "logs/admin_access.log",
    "origins": {},
    "log_level": "notice",
    "nginx_kong_conf": "/usr/local/kong/nginx-kong.conf",
    "cassandra_schema_consensus_timeout": 60000,
    "dns_hostsfile": "/etc/hosts",
    "admin_listeners": [
      {
        "listener": "0.0.0.0:8001",
        "proxy_protocol": false,
        "reuseport": false,
        "transparent": false,
        "ssl": false,
        "ip": "0.0.0.0",
        "deferred": false,
        "http2": false,
        "port": 8001,
        "bind": false
      },
      {
        "listener": "127.0.0.1:8444 ssl",
        "proxy_protocol": false,
        "reuseport": false,
        "transparent": false,
        "ssl": true,
        "ip": "127.0.0.1",
        "deferred": false,
        "http2": false,
        "port": 8444,
        "bind": false
      }
    ],
    "kong_env": "/usr/local/kong/.kong_env",
    "ssl_cert": "/usr/local/kong/ssl/kong-default.crt",
    "nginx_http_set_real_ip_from": "10.252.0.0/19",
    "admin_ssl_cert_key_default": "/usr/local/kong/ssl/admin-kong-default.key",
    "cassandra_ssl_verify": false,
    "plugins": [
      "bundled"
    ],
    "pg_ssl_verify": false,
    "real_ip_recursive": "off",
    "cassandra_repl_factor": 1,
    "nginx_proxy_directives": {},
    "nginx_daemon": "on",
    "router_consistency": "strict",
    "lua_socket_pool_size": 30,
    "nginx_sproxy_directives": {},
    "pg_port": 5432,
    "client_ssl_cert_key_default": "/usr/local/kong/ssl/kong-default.key",
    "client_body_buffer_size": "8k",
    "ssl_preread_enabled": true,
    "admin_ssl_cert_default": "/usr/local/kong/ssl/admin-kong-default.crt",
    "ssl_cert_key_default": "/usr/local/kong/ssl/kong-default.key",
    "cassandra_keyspace": "kong",
    "ssl_cert_default": "/usr/local/kong/ssl/kong-default.crt",
    "admin_ssl_enabled": true,
    "cassandra_contact_points": [
      "127.0.0.1"
    ]
  },
  "version": "1.4.0",
  "node_id": "6426075c-1e06-4615-aaa9-a37b89f246a8",
  "lua_version": "LuaJIT 2.1.0-beta3",
  "prng_seeds": {
    "pid: 973": 193236251134,
    "pid: 996": 129392142058,
    "pid: 995": 132695812116
  },
  "timers": {
    "pending": 7,
    "running": 0
  },
  "hostname": "ip-10-252-234-132"
}

Hello @hbagdi

Is there any more information you need from me on this?

Thanks,
Mike

Dropped the ball on this one. My bad.
I’m not sure what is going wrong here, but it is likely a configuration issue.

Is your connection queue in the kernel large enough?
Since there are no logs in Kong, either the request is being processed successfully by Kong, or it is not arriving at Kong at all.
Some network captures might be useful here.
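
On Linux, the relevant numbers are the listen backlog cap (net.core.somaxconn), the SYN backlog (net.ipv4.tcp_max_syn_backlog), and the TcpExt ListenOverflows/ListenDrops counters, which grow when the accept queue fills up. A rough sketch for reading them from /proc (standard Linux paths, but worth verifying on your distribution):

```python
def listen_queue_stats(proc_root="/proc"):
    """Collect kernel listen-queue settings and overflow counters from /proc.
    Returns whatever is readable; keys are simply missing on systems
    where these files don't exist."""
    stats = {}
    for key, path in [
        ("somaxconn", f"{proc_root}/sys/net/core/somaxconn"),
        ("tcp_max_syn_backlog", f"{proc_root}/sys/net/ipv4/tcp_max_syn_backlog"),
    ]:
        try:
            with open(path) as f:
                stats[key] = int(f.read().strip())
        except OSError:
            pass
    # /proc/net/netstat holds pairs of lines: a header line of counter names
    # followed by a line of values. ListenOverflows / ListenDrops increment
    # when the accept queue is full and connections are refused or dropped.
    try:
        with open(f"{proc_root}/net/netstat") as f:
            lines = f.read().splitlines()
        for header, values in zip(lines[::2], lines[1::2]):
            if header.startswith("TcpExt:"):
                fields = dict(zip(header.split()[1:],
                                  (int(v) for v in values.split()[1:])))
                for k in ("ListenOverflows", "ListenDrops"):
                    if k in fields:
                        stats[k] = fields[k]
    except OSError:
        pass
    return stats

# Example:
# print(listen_queue_stats())
```

If ListenOverflows keeps climbing while the 503s happen, the kernel queue is too small for the traffic bursts.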

Hello @hbagdi

I can work on getting some network captures to see if anything interesting shows up there. How would I view the kernel connection queue settings you’re asking about? I run Debian Linux.

Mike

We ended up adding retries to our API calls, since this happens rarely, and that has resolved the API issues we had with the 503s for now.
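
For anyone doing the same, the client-side workaround is just a retry with a short backoff on transient 5xx responses. A minimal sketch (not our actual client code; `request` is any callable returning a status and body):

```python
import time

# Status codes generally safe to retry for idempotent requests.
RETRYABLE = {502, 503, 504}

def call_with_retries(request, attempts=3, backoff=0.1):
    """Call `request()` (which returns (status, body)) and retry transient
    5xx responses with exponential backoff. Returns the last response if
    every attempt fails."""
    for attempt in range(attempts):
        status, body = request()
        if status not in RETRYABLE:
            return status, body
        if attempt < attempts - 1:
            time.sleep(backoff * (2 ** attempt))
    return status, body
```

Since the 503s resolve within milliseconds, even a single retry with a tiny backoff makes them effectively invisible to callers.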

Mike

Maybe there is a problem with your upstream service node.

I might be experiencing the same issue. I noticed that Kong returns a 503, but the upstream service has no evidence of being queried:

Kong

10.0.0.51 - - [18/Jul/2020:14:57:17 +0000] "GET /myroute HTTP/1.1" 503 95 "-" "Dart/2.8 (dart:io)"

When I look at the logs for the upstream service handling /myroute, there is no indication of a 503 response:
72.229.241.xxx, 10.0.0.38 - - [18/Jul/2020:00:18:52 +0000] "GET /myroute HTTP/1.1" 200 137453 2.4286
72.229.241.xxx, 10.0.0.51 - - [18/Jul/2020:02:48:18 +0000] "GET /myroute HTTP/1.1" 200 137453 0.9166
72.229.241.xxx 10.0.0.51 - - [18/Jul/2020:14:57:38 +0000] "GET /myroute HTTP/1.1" 200 137453 0.8844
72.229.241.xxx, 10.0.1.169 - - [18/Jul/2020:15:57:56 +0000] "GET /myroute HTTP/1.1" 200 137453 1.3237
72.229.241.xxx, 10.0.1.169 - - [18/Jul/2020:15:58:06 +0000] "GET /myroute HTTP/1.1" 200 137453 2.5357
72.229.241.xxx, 10.0.0.51 - - [18/Jul/2020:15:58:07 +0000] "GET /myroute HTTP/1.1" 200 137453 1.4012

Note that in the logs above the timestamps don’t match up (there is one that is close), and there are no 503s.

I’m deploying Kong from a Helm chart, and I have two Kong services running. The 503s are very intermittent, so it’s very challenging to reproduce. We also have Istio running. I honestly am not sure what steps I can take to provide helpful information, so any pointers in the right direction would be valuable.

Thanks!
Jonathan

I found that my upstream application’s Puma worker was crashing, which is why I wasn’t seeing it in the upstream’s logs.