Upstreams and failover

From one of the older posts, “Failover API handling”:

Unfortunately, we do not observe this behavior.
We run Kong 0.14.1.
Our setup:
K1, K2, K3 - three Kong instances behind a load balancer LB, exposed under the DNS name ‘dns.name’

We have an upstream U1 configured with 3 targets of equal weight: T1, T2, T3.

We have a plugin on a Service S1 that essentially calls set_upstream(U1) with the name of the configured upstream.
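
Roughly, the entities involved look like this (a sketch with made-up target addresses; in our setup a custom plugin selects the upstream at request time rather than relying on this static host mapping):

services:
- name: S1
  host: U1                     # a host matching an upstream name is balanced across that upstream's targets
  port: 80
  protocol: http
  retries: 5

upstreams:
- name: U1
  slots: 100
  targets:
  - target: t1.internal:8080   # T1
    weight: 100
  - target: t2.internal:8080   # T2
    weight: 100
  - target: t3.internal:8080   # T3
    weight: 100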

We run requests against the configured S1, and while doing so we terminate the service on T1, so T1 returns 502.

Given the service retry policy of 5 and the two remaining healthy targets, my intuition says I should not see errors on the client,
but in reality I do see some errors on the client. Is this a bug in the load balancer?

Hmm, I believe passive health-checks guarantee that only a __ # of tx’s fail before marking the target down, while active health-checks hit an endpoint every __ seconds and, if it reports down _ consecutive times, mark the target down. I think you want passive health-checks if you want the client to only see a ___ number of failures before the target is marked down.
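
In Kong’s upstream config those knobs map to fields roughly like these (the values below are only placeholders, not recommendations):

healthchecks:
  active:
    http_path: /status        # endpoint the prober hits out-of-band
    healthy:
      interval: 5             # probe healthy targets every N seconds
      successes: 2            # consecutive successful probes before re-marking a target healthy
    unhealthy:
      interval: 5             # probe unhealthy targets every N seconds
      http_failures: 3        # consecutive failed probes before marking a target down
  passive:
    unhealthy:
      http_failures: 2        # failed proxied (client) requests before marking a target down
      http_statuses:
      - 500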

Another important point about the healthcheck lib that I haven’t looked into is whether its state is global or local to each worker. If it is global you would not see many failures; if it is per worker, or it takes time for the workers to reach consensus on the number of errors, then you could see discrepancies in the client error count even in active health-check mode.

I have a discovery service that does active monitoring of the servers and a Kong plugin that updates the upstream’s targets based on that, so I really don’t want yet another active monitor. Even if the healthcheck state is per worker, I think the retry logic should not be affected, and I believe healthchecks are not related to this.

Suppose I have 3 servers and 100 slots. According to the documentation, the balancer builds a 100-entry list with roughly a 33% share per target and walks it round-robin, so if I happen to hit the dead server it should retry on the next entries, up to the maximum number of retries, and that is (or should be) done in the context of the same worker. The probability of hitting the dead server 5 times in a row in this case is negligible (roughly (1/3)^5 ≈ 0.4% even under independent random selection, and lower still with round-robin).

Unless I don’t understand how that works.

What errors are you getting? What’s in the Kong logs?

In my test I have 3 targets; 2 out of 3 targets were returning 500, and the other one was returning 200.

I run ~1M requests and I see 500s returned to my client; Kong lists the 500s in its access logs.

My understanding is that I should not see any errors, since the number of retries I have is 5, and 5 > 3 (the number of targets).

@likharev please be more detailed in your descriptions: include some log snippets and the specifics of the config, the upstream config and/or the targets. It is really hard to reason about these things without having the details.

Unfortunately I cannot provide this information right now; I have to repeat my tests and will do it as soon as I can. What information are you looking for specifically: the normal log, the debug log, etc.?

Please provide a minimal example to reproduce the problem, and logs showing the error message; debug-level logs would be nice.

Hi all, I have almost the same problem. I would like to share my kong.yml, because in the end @likharev did not provide the examples or configuration leading to this behaviour.

I have a service and a related upstream with two targets. To simulate failover, one target always returns 500 on a certain path and the other one returns 200. The targets have the same weight.
If I send 100 requests I would expect to receive all 200s. Instead I receive between 3 and 15 500 responses, depending on how I tune values such as retries, successes, http_failures and so on.

Here is my config:

_format_version: '1.1'

services:
- name: my-service
  host: my-upstream
  port: 8000
  protocol: http
  retries: 3
  routes:
  - name: my-route
    paths:
    - /

upstreams:
- name: my-upstream
  targets:
  - target: 192.168.3.23:10102
    weight: 50
  - target: 192.168.3.28:10102
    weight: 50
  healthchecks:
    active:
      concurrency: 100
      healthy:
        http_statuses:
        - 200
        - 302
        interval: 1
        successes: 1
      http_path: /testKong
      timeout: 1
      type: http
      unhealthy:
        http_failures: 3
        http_statuses:
        - 429
        - 404
        - 500
        - 501
        - 502
        - 503
        - 504
        - 505
        interval: 1
        tcp_failures: 3
        timeouts: 3
    passive:
      healthy:
        http_statuses:
        - 200
        - 201
        - 202
        - 203
        - 204
        - 205
        - 206
        - 207
        - 208
        - 226
        - 300
        - 301
        - 302
        - 303
        - 304
        - 305
        - 306
        - 307
        - 308
        successes: 1
      type: http
      unhealthy:
        http_failures: 1
        http_statuses:
        - 429
        - 500
        - 503
        tcp_failures: 1
        timeouts: 1
  slots: 100

Any suggestions? The two targets have the same weight, so (as far as I understand) 50% of the requests go to the unhealthy server. But I do not receive 50 errors, far fewer. So most requests are correctly redirected and some of them are not. Am I right?

Thanks

A small update: with active checks enabled, the target is relabeled healthy each time it is labeled unhealthy by the passive checks (presumably because the active probe path /testKong still returns 200 on that node, while only the tested path returns 500).
I disabled active checks so that the target has to be manually re-marked as healthy (not great to do manually, but it is just to test and understand for now). Now only the first request towards the “bad node” fails with a 500 error; then the target is marked unhealthy and all subsequent requests go to the other one.
However, I would like that single request to also go to the second target without failing.
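
For what it’s worth, one way to stop the flip-flopping might be to point the active probe at the same path that returns 500, so the active and passive checks agree about that node (just a sketch; the path name below is made up):

upstreams:
- name: my-upstream
  healthchecks:
    active:
      type: http
      http_path: /path/that/returns/500   # hypothetical: the path the failing target serves with 500
      healthy:
        interval: 1
        successes: 1
      unhealthy:
        interval: 1
        http_failures: 3
        http_statuses:
        - 500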

@scortopassi - I guess you should be able to achieve the requirement with the configuration you have posted above.

Since your Service.retries = 3 and healthchecks.passive.unhealthy.http_failures = 1, all the requests that you send will result in 200, as long as at least one of the targets is alive at all times.
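
The relevant pieces, trimmed out of the config posted above (not a complete file), are:

services:
- name: my-service
  retries: 3                 # how many times Kong may retry a failed proxy attempt

upstreams:
- name: my-upstream
  healthchecks:
    passive:
      unhealthy:
        http_failures: 1     # a single proxied response matching http_statuses marks the target unhealthy
        http_statuses:
        - 429
        - 500
        - 503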