Does the Load Balancing work correctly?

I encountered a case where I was getting 502 (Bad Gateway) responses while testing what happens when I take down one of two targets. It seems to me this should not happen, as all traffic should be routed to the remaining target.
(I'm stripping the protocol from the URLs because of the five-URL limit when posting a message.)
So I set up a basic test case for it, and I can reproduce it very easily:

  1. Run Kong 0.13.1 in Docker on a Mac, with a Cassandra DB (also in Docker)

  2. Create upstream:
    curl -X POST localhost:8801/upstreams -H"Content-Type:application/json" -d'{"name":"test.upstream", "healthchecks.active.http_path":"/", "healthchecks.active.healthy.interval":60, "healthchecks.active.healthy.http_statuses":[200], "healthchecks.active.healthy.successes":2, "healthchecks.active.unhealthy.interval":30}'

  3. Create targets:
    curl -X POST localhost:8801/upstreams/test.upstream/targets -H"Content-Type:application/json" -d'{"target":"docker.for.mac.localhost:5555", "weight":100}'
    curl -X POST localhost:8801/upstreams/test.upstream/targets -H"Content-Type:application/json" -d'{"target":"docker.for.mac.localhost:5556", "weight":100}'

  4. Create service:
    curl -X POST localhost:8801/services/ -H"Content-Type:application/json" -d'{"name":"test.the.upstream", "url":"http://test.upstream"}'

  5. Create route:
    curl -X POST localhost:8801/services/test.the.upstream/routes -H"Content-Type:application/json" -d'{"paths":["/index.html"], "preserve_host":true, "strip_path":false}'

  6. Create index.html in a folder:
    mkdir simpleServer
    cd simpleServer
    echo "HELLO" > index.html

  7. Run two simpleHTTPServers in the folder above:
    python -m SimpleHTTPServer 5555
    python -m SimpleHTTPServer 5556

  8. Test that the targets work:
    curl localhost:5555/index.html
    HELLO
    curl localhost:5556/index.html
    HELLO

  9. Test that Kong works:
    $ curl localhost:8000/index.html
    HELLO

  10. Run that same curl in a loop against Kong:
    $ while :; do curl -s localhost:8000/index.html | tr '\n' ','; done
    HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,

  11. Run the same curl loop against Kong, but while it is running, stop one of the SimpleHTTPServers. I'd expect the same output - a bunch of "HELLO"s - but instead I get some "An invalid response was received from the upstream server" responses mixed in:
    $ while :; do curl -s localhost:8000/index.html | tr '\n' ','; done
    HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,

  12. Start the stopped SimpleHTTPServer again and notice that no more "errors" are reported and only "HELLO"s are showing.
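
While reproducing this, it can also help to watch what Kong itself believes about the targets as you stop one of the servers. A minimal sketch, assuming the Admin API is on localhost:8801 as above and that this Kong version exposes the upstream health endpoint (it appears in the 0.13 Admin API docs):

```shell
# Poll Kong's view of target health once per second while running the chaos test.
# Assumes the Admin API listens on localhost:8801, as in the steps above.
while :; do
  curl -s localhost:8801/upstreams/test.upstream/health/
  sleep 1
done
```

If neither target ever transitions to an unhealthy state here, the 502s in step 11 are expected: the balancer has no reason to skip a target it still considers healthy.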

Did I configure anything wrong? Because this feels like too simple a config to break this easily…

This configuration is insufficient:

{"healthchecks.active.healthy.interval":60, "healthchecks.active.healthy.http_statuses":[200], "healthchecks.active.healthy.successes":2, "healthchecks.active.unhealthy.interval":30}

You also need to specify the number of tcp_failures, timeouts and http_failures in the unhealthy case.

Also, given an interval of 60, it will take up to one minute (if you set the above missing values to 1) to detect that a target is down.
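
To make that arithmetic concrete, here is a back-of-the-envelope sketch (the failure threshold is an assumption, matching the "set the above missing values to 1" case):

```shell
# Worst-case time for active checks to mark a target down is roughly
# (required failures) x (probe interval), since probes only run once per interval.
interval=60      # healthchecks.active.healthy.interval from the original config
failures=1       # hypothetical unhealthy.http_failures threshold
echo $(( interval * failures ))   # up to 60 seconds of 502s before detection
```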

I suggest using lower interval values in active healthchecks for unhealthy traffic, and enabling passive healthchecks for healthy traffic.
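
Something along these lines, for illustration only (the numbers are assumptions, not recommendations; the Admin API port matches the steps above):

```shell
# Hypothetical upstream config: short active unhealthy probe interval,
# all unhealthy thresholds set, and passive checks enabled.
curl -X POST localhost:8801/upstreams -H"Content-Type:application/json" -d'{
  "name":"test.upstream",
  "healthchecks.active.http_path":"/",
  "healthchecks.active.healthy.interval":10,
  "healthchecks.active.healthy.successes":1,
  "healthchecks.active.unhealthy.interval":2,
  "healthchecks.active.unhealthy.http_failures":1,
  "healthchecks.active.unhealthy.tcp_failures":1,
  "healthchecks.active.unhealthy.timeouts":1,
  "healthchecks.passive.healthy.successes":1,
  "healthchecks.passive.unhealthy.http_failures":1,
  "healthchecks.passive.unhealthy.tcp_failures":1,
  "healthchecks.passive.unhealthy.timeouts":1
}'
```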

Thanks hisham, but I'm not sure exactly which values make sense for detecting that targets are down and keeping them marked down.

I tried the exact same test, changing only the upstream creation to use this setup:
curl -X POST http://localhost:8801/upstreams -H"Content-Type:application/json" -d'{"name":"test.upstream", "healthchecks.active.http_path":"/", "healthchecks.active.timeout":1, "healthchecks.active.concurrency":10, "healthchecks.active.healthy.interval":10, "healthchecks.active.unhealthy.interval":5, "healthchecks.active.healthy.http_statuses":[200], "healthchecks.active.healthy.successes":1, "healthchecks.active.unhealthy.interval":30, "healthchecks.active.unhealthy.tcp_failures":1, "healthchecks.active.unhealthy.timeouts":1, "healthchecks.active.unhealthy.http_failures":1, "healthchecks.passive.healthy.successes":1, "healthchecks.passive.healthy.http_statuses":[200], "healthchecks.passive.unhealthy.tcp_failures":1}'

And I still got the same problem:
$ while :; do curl -s http://localhost:8000/index.html | tr '\n' ','; done
HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO

Would you not expect some failures to still occur? Your check interval is now 10 seconds, if I am reading this correctly. Even if you checked every second, if your curl loop is running at more than 1 TPS you cannot achieve flawless load balancing based on upstream health: in the instant between a target going down and the health check acknowledging the failures, a few requests will be proxied to the down side. It's the nature of the beast, I believe. Load balancing is great because it stops routing to down services fairly quickly, but expecting 100% of transactions to be balanced away during a chaos test is not possible, as the check does not run on a per-transaction basis (and would be extremely inefficient if it did).
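
As a rough illustration of the size of that window (all numbers here are assumptions, not measurements from the test above):

```shell
# With round-robin over two targets, roughly half of the requests issued during
# the detection window land on the dead target before it is marked unhealthy.
rate=10     # assumed requests per second from the curl loop
window=5    # assumed seconds until the health check marks the target down
echo $(( rate * window / 2 ))   # ~25 requests would see an error
```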

The only slightly concerning thing to me is that you still have two errors in your log (unless you switched which node was responsive during your chaos test again), because Kong should never have attempted to proxy to that down node again if you did not switch your backends.
