Does the Load Balancing work correctly?

I encountered a case where I was getting 502 (Bad Gateway) responses while testing what happens when I take down one of two targets. It seems to me this should not happen, as all traffic should be routed to the remaining target.
(I'm stripping the protocol from the URLs because of the five-URL limit when posting a message.)
So I set up a basic test case for it, and I can reproduce it very easily:

  1. Run Kong 0.13.1 in Docker on a Mac, with a Cassandra DB (also in Docker)

  2. Create upstream:
    curl -X POST localhost:8801/upstreams -H"Content-Type:application/json" -d'{"name":"test.upstream", "healthchecks.active.http_path":"/", "healthchecks.active.healthy.interval":60, "healthchecks.active.healthy.http_statuses":[200], "healthchecks.active.healthy.successes":2, "healthchecks.active.unhealthy.interval":30}'

  3. Create targets:
    curl -X POST localhost:8801/upstreams/test.upstream/targets -H"Content-Type:application/json" -d'{"target":"docker.for.mac.localhost:5555", "weight":100}'
    curl -X POST localhost:8801/upstreams/test.upstream/targets -H"Content-Type:application/json" -d'{"target":"docker.for.mac.localhost:5556", "weight":100}'

  4. Create service:
    curl -X POST localhost:8801/services/ -H"Content-Type:application/json" -d'{"name":"test.the.upstream", "url":"http://test.upstream"}'

  5. Create route:
    curl -X POST localhost:8801/services/test.the.upstream/routes -H"Content-Type:application/json" -d'{"paths":["/index.html"], "preserve_host":true, "strip_path":false}'

  6. Create index.html in a folder:
    mkdir simpleServer
    cd simpleServer
    echo "HELLO" > index.html

  7. Run two simpleHTTPServers in the folder above:
    python -m SimpleHTTPServer 5555
    python -m SimpleHTTPServer 5556

  8. Test that the targets work:
    curl localhost:5555/index.html
    HELLO
    curl localhost:5556/index.html
    HELLO

  9. Test that Kong works:
    $ curl localhost:8000/index.html
    HELLO

  10. Run that same curl in a loop against Kong:
    $ while :; do curl -s localhost:8000/index.html | tr '\n' ','; done
    HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,

  11. Run the same curl loop against Kong, but while it is running, stop one of the SimpleHTTPServers. I'd expect the same output - a bunch of "HELLO"s - but instead I get some "An invalid response was received from the upstream server" responses mixed in:
    $ while :; do curl -s localhost:8000/index.html | tr '\n' ','; done
    HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,

  12. Start the stopped SimpleHTTPServer again and notice that no more "errors" are reported and only "HELLO"s are showing.
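
While reproducing this, it can also help to watch what Kong itself believes about the targets as you stop one of the servers. A minimal sketch, assuming the Admin API is on localhost:8801 as above and that this Kong version exposes the upstream health endpoint (it appears in the 0.13 Admin API docs):

```shell
# Poll Kong's view of target health once per second while running the chaos test.
# Assumes the Admin API listens on localhost:8801, as in the steps above.
while :; do
  curl -s localhost:8801/upstreams/test.upstream/health/
  sleep 1
done
```

If neither target ever transitions to an unhealthy state here, the 502s in step 11 are expected: the balancer has no reason to skip a target it still considers healthy.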

Did I configure anything wrong? Because this feels like too simple a config to break this easily…

This configuration is insufficient:

{"healthchecks.active.healthy.interval":60, "healthchecks.active.healthy.http_statuses":[200], "healthchecks.active.healthy.successes":2, "healthchecks.active.unhealthy.interval":30}

You also need to specify the number of tcp_failures, timeouts and http_failures in the unhealthy case.

Also, given an interval of 60, it will take up to one minute (if you set the above missing values to 1) to detect that a target is down.
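
To make that arithmetic concrete, here is a back-of-the-envelope sketch (the failure threshold is an assumption, matching the "set the above missing values to 1" case):

```shell
# Worst-case time for active checks to mark a target down is roughly
# (required failures) x (probe interval), since probes only run once per interval.
interval=60      # healthchecks.active.healthy.interval from the original config
failures=1       # hypothetical unhealthy.http_failures threshold
echo $(( interval * failures ))   # up to 60 seconds of 502s before detection
```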

I suggest using lower interval values in active healthchecks for unhealthy traffic, and enabling passive healthchecks for healthy traffic.
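
Something along these lines, for illustration only (the numbers are assumptions, not recommendations; the Admin API port matches the steps above):

```shell
# Hypothetical upstream config: short active unhealthy probe interval,
# all unhealthy thresholds set, and passive checks enabled.
curl -X POST localhost:8801/upstreams -H"Content-Type:application/json" -d'{
  "name":"test.upstream",
  "healthchecks.active.http_path":"/",
  "healthchecks.active.healthy.interval":10,
  "healthchecks.active.healthy.successes":1,
  "healthchecks.active.unhealthy.interval":2,
  "healthchecks.active.unhealthy.http_failures":1,
  "healthchecks.active.unhealthy.tcp_failures":1,
  "healthchecks.active.unhealthy.timeouts":1,
  "healthchecks.passive.healthy.successes":1,
  "healthchecks.passive.unhealthy.http_failures":1,
  "healthchecks.passive.unhealthy.tcp_failures":1,
  "healthchecks.passive.unhealthy.timeouts":1
}'
```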

Thanks hisham, but I'm not sure exactly which values make sense for detecting that targets are down and keeping them marked down.

I tried the exact same test, changing only the upstream creation to use this setup:
curl -X POST http://localhost:8801/upstreams -H"Content-Type:application/json" -d'{"name":"test.upstream", "healthchecks.active.http_path":"/", "healthchecks.active.timeout":1, "healthchecks.active.concurrency":10, "healthchecks.active.healthy.interval":10, "healthchecks.active.unhealthy.interval":5, "healthchecks.active.healthy.http_statuses":[200], "healthchecks.active.healthy.successes":1, "healthchecks.active.unhealthy.interval":30, "healthchecks.active.unhealthy.tcp_failures":1, "healthchecks.active.unhealthy.timeouts":1, "healthchecks.active.unhealthy.http_failures":1, "healthchecks.passive.healthy.successes":1, "healthchecks.passive.healthy.http_statuses":[200], "healthchecks.passive.unhealthy.tcp_failures":1}'

And I still got the same problem:
$ while :; do curl -s http://localhost:8000/index.html | tr '\n' ','; done
HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,An invalid response was received from the upstream server,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO,HELLO

Would you not expect some failures to still occur? Your check interval is now 10 seconds, if I am reading this correctly. Even if you checked every second, if your curl loop is running at more than 1 TPS you cannot achieve flawless load balancing based on upstream health: in the instant between a target going down and the health check acknowledging the failures, a few requests will be proxied to the down side. It's the nature of the beast, I believe. Load balancing is great because it stops routing to down services fairly quickly, but expecting 100% of transactions to be balanced away during a chaos test is not possible, as the check does not run on a per-transaction basis (and would be extremely inefficient if it did).
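
As a rough illustration of the size of that window (all numbers here are assumptions, not measurements from the test above):

```shell
# With round-robin over two targets, roughly half of the requests issued during
# the detection window land on the dead target before it is marked unhealthy.
rate=10     # assumed requests per second from the curl loop
window=5    # assumed seconds until the health check marks the target down
echo $(( rate * window / 2 ))   # ~25 requests would see an error
```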

The only slightly concerning thing to me is that you still have two errors in your log (unless you switched which node was responsive during your chaos test again), because Kong should never have attempted to proxy to that down node again if you did not switch your backends.
