Failover API handling

Kong does handle failover. There is no explicit failover feature, but the existing functionality covers it.

The main property involved is the retries property of the Service entity: the number of times Kong retries a request when proxying to the backend fails. The question now is: how does Kong select the next upstream service when a failure occurs?

When using simple DNS-based load balancing (without an upstream entity), Kong will do a round-robin on the DNS record. So if it is an A record or SRV record with multiple entries, Kong picks the next upstream service from those entries.
The catch is that with an SRV record and non-equal weights, multiple tries might end up at the same backend service. Let’s use an example to explain this. Say we have an (extreme) situation with an SRV record containing:

  • name = a.service.local, weight = 1
  • name = b.service.local, weight = 1000

And let’s assume the Service.retries = 5.

In this case, if b.service.local returns a 500, we have 5 more tries to go, but due to the weights (1 vs 1000) it is very likely that each retry will hit b.service.local again. Because DNS records have no notion of health, Kong will just try the next one in line, which might actually be the same one. This problem does not occur with equal weights (or with A records, since those don’t carry any weight info), because every next entry is then a different one.
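To put a rough number on that likelihood, here is a minimal Python sketch. It assumes a simplified model that is not taken from Kong's source: the weights are expanded into a contiguous "wheel" of slots and each retry just takes the next slot. The function name and the wheel layout are hypothetical, for illustration only.

```python
def fraction_stuck_on_same_backend(weights, failed, retries):
    """Fraction of starting slots on `failed` where all retries hit it again.

    Assumed model: each entry gets `weight` contiguous slots on a wheel,
    and every retry simply moves to the next slot (wrapping around).
    """
    wheel = [name for name, weight in weights.items() for _ in range(weight)]
    size = len(wheel)
    starts = [i for i, name in enumerate(wheel) if name == failed]
    stuck = sum(
        1
        for i in starts
        if all(wheel[(i + k) % size] == failed for k in range(1, retries + 1))
    )
    return stuck / len(starts)


skewed = fraction_stuck_on_same_backend(
    {"a.service.local": 1, "b.service.local": 1000}, "b.service.local", 5
)
equal = fraction_stuck_on_same_backend(
    {"a.service.local": 1, "b.service.local": 1}, "b.service.local", 5
)
print(skewed, equal)
```

Under this model, the skewed weights leave almost every request stuck on b.service.local for all 5 retries, while equal weights always reach the other entry on the first retry.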

All in all, this really is a corner case, and if you have a proper set of backend services and a well-chosen retries value, this should be of no concern to you.

Slightly more complex is the load balancer case (with an upstream entity). Here Kong does the exact same thing: it selects the next entry in the load balancer. But in this case the balancer does have a notion of health. Once an upstream backend service is considered unhealthy (the circuit breaker tripped after a number of failures), the balancer will not retry that same backend service.
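The interaction between retries and the health state can be sketched in a few lines of Python. This is a toy model, not Kong's implementation: the Balancer class, the proxy function and the failure counter are all hypothetical names, and the wheel is again a simple list of slots.

```python
class Balancer:
    """Toy round-robin balancer that skips targets marked unhealthy."""

    def __init__(self, wheel, unhealthy_after):
        self.wheel = wheel
        self.pos = 0
        self.unhealthy_after = unhealthy_after
        self.failures = {name: 0 for name in wheel}

    def healthy(self, name):
        return self.failures[name] < self.unhealthy_after

    def next_target(self):
        # Walk the wheel at most once, skipping unhealthy targets.
        for _ in range(len(self.wheel)):
            name = self.wheel[self.pos]
            self.pos = (self.pos + 1) % len(self.wheel)
            if self.healthy(name):
                return name
        return None  # every target is unhealthy

    def report_failure(self, name):
        self.failures[name] += 1


def proxy(balancer, retries, is_up):
    """One initial try plus up to `retries` retries; returns (target, attempts)."""
    attempts = []
    for _ in range(retries + 1):
        target = balancer.next_target()
        if target is None:
            break
        attempts.append(target)
        if is_up(target):
            return target, attempts
        balancer.report_failure(target)
    return None, attempts


# Heavily skewed wheel: ten slots for "b" (which is down), one for "a".
balancer = Balancer(["b"] * 10 + ["a"], unhealthy_after=3)
served_by, attempts = proxy(balancer, retries=5, is_up=lambda t: t == "a")
print(served_by, attempts)
```

In this run the first three attempts fail on "b", which trips the failure threshold, so the fourth attempt skips the remaining "b" slots and lands on "a" with retries to spare.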

If you set up “passive” healthchecks, each of the failures will count against the health of the backend service. So with a Service.retries = 5 setting and a passive healthcheck that marks a backend unhealthy after e.g. 3 failures, you’re completely covered: the failing backend is taken out of rotation while retries are still left.
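As a rough sketch, the two settings could be expressed as Admin API request bodies like the ones below. The entity names (example-upstream, example-service) are placeholders, the thresholds are just the values from the example above, and nothing is sent anywhere; the script only builds and prints the JSON. Please check the field names against the Admin API reference for your Kong version before using them.

```python
import json

# Upstream with a passive healthcheck: 3 failures mark a target unhealthy.
upstream = {
    "name": "example-upstream",  # placeholder name
    "healthchecks": {
        "passive": {
            "unhealthy": {
                "http_failures": 3,
                "tcp_failures": 3,
                "timeouts": 3,
            }
        }
    },
}

# Service that points at the upstream by using its name as the host.
service = {
    "name": "example-service",  # placeholder name
    "host": upstream["name"],
    "retries": 5,
}

threshold = upstream["healthchecks"]["passive"]["unhealthy"]["http_failures"]
# Retries left over once the failing target has been marked unhealthy.
retries_left = service["retries"] + 1 - threshold

print(json.dumps(upstream, indent=2))
print(json.dumps(service, indent=2))
print("attempts left after the breaker trips:", retries_left)
```

The point of the last calculation is simply that the retries value should be larger than the failure threshold, so there are still attempts left once the failing backend is skipped.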

Does that help?
