We are trying to understand some behavior of Kong DB-less mode.
We currently have:
- Amazon EKS 1.16
- Kong 2.0
- Kong Ingress Controller 0.8.1
- Flagger (Kubernetes Operator to automate Canary releases)
- Linkerd 2.7.1 (Service Mesh to enable Canary deployments in combination with Flagger)
- SM v1
- SM v2
Note: SM is our microservice
To deploy our application, we are using Helm.
We deploy Kong and SM in two different Helm releases.
Our SM chart contains everything (Deployment, Service, …) plus the Ingress declaration.
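For context, the Ingress in the SM chart looks roughly like this (host, names, and ports are placeholders, not our real values; Kubernetes 1.16 still uses the `v1beta1` Ingress API):

```yaml
# Illustrative Ingress from the SM chart -- all names are placeholders.
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: sm
  annotations:
    kubernetes.io/ingress.class: "kong"
spec:
  rules:
    - host: sm.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: sm
              servicePort: 80
```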
Our use case:
- SM v1 is deployed and we are sending traffic to our microservice
- Update the SM Helm release to SM v2. At this point we have two versions of the microservice, and the Ingress declaration hasn't changed
- Observe the results
Our results:
Before the update, everything is fine, all requests are responding 200.
During the update, we have a small time window (~4-5 seconds) where we have errors
After that time window, everything is fine.
Reading the logs, we observe the following:
Proxy logs: https://pastebin.com/3jRtBuC8
Ingress Controller logs: https://pastebin.com/z70DyHfE
This time window seems to correspond with the Kong configuration update.
My questions are:
- Have we correctly understood the behavior of Kong/Kong Ingress Controller?
- Is this behavior caused by DB-less mode?
- How can we avoid it?
Thanks in advance !
Is the KONG_NGINX_WORKER_PROCESSES environment variable set to 2 or higher? Older defaults had it set to 1, which we determined to cause issues around the time of config updates.
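If it isn't set, it can be raised via the proxy container's environment, along these lines (a sketch; the Kong Helm chart also exposes this through its `env` values):

```yaml
# Sketch: raising the worker count on the Kong proxy container.
# Only the variable name and value matter; the surrounding Deployment
# fields are assumptions.
env:
  - name: KONG_NGINX_WORKER_PROCESSES
    value: "2"
```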
If that doesn’t clear the issue, are those 503s definitely coming from Kong, or do they appear in the application logs as well? If they are coming from Kong, they should indicate DNS resolution failures, which usually means that no Pods providing that service are ready yet. However, normal rollouts in Kubernetes shouldn’t bring down the existing Pods or update the DNS listing until a sufficient number of new replicas become ready.
For the KONG_NGINX_WORKER_PROCESSES environment variable, we have set it to
Currently, we don't have logs for a 503 error in our SM logs.
It's not a "normal" rollout; by that I mean it's not only one Deployment.
To perform a canary release we have two Deployments:
a "primary" Deployment, which doesn't change, and a "canary" Deployment, which will be updated.
Then we have a TrafficSplit, managed by Linkerd, that will handle the weight distribution between the 2 services, primary and canary. So technically, Kong will contact a service without knowing what “sub-services” it will contact.
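The split looks roughly like this (an illustrative SMI TrafficSplit as supported by Linkerd 2.7; names and weights are placeholders):

```yaml
# Illustrative TrafficSplit -- service names and weights are placeholders.
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: sm
spec:
  service: sm            # the apex Service that Kong routes to
  backends:
    - service: sm-primary
      weight: 90
    - service: sm-canary
      weight: 10
```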
With Linkerd in the mix, perhaps you need to enable the service-upstream annotation on those services? Not sure exactly how the apparent DNS failure is occurring still, but you usually should use that annotation regardless if you have sidecars from a mesh proxy in charge of routing decisions.
We have added ingress.kubernetes.io/service-upstream: "true" to our Services, and it works!
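For anyone hitting the same issue: the annotation goes on the Service objects Kong routes to, roughly like this (names and ports are placeholders):

```yaml
# Illustrative Service with the service-upstream annotation, so Kong
# proxies to the Service's cluster IP (letting the mesh handle the
# split) instead of resolving individual Pod endpoints.
apiVersion: v1
kind: Service
metadata:
  name: sm
  annotations:
    ingress.kubernetes.io/service-upstream: "true"
spec:
  selector:
    app: sm
  ports:
    - port: 80
      targetPort: 8080
```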