We are trying to understand some behavior of Kong DB-less mode.
We currently have:
Amazon EKS 1.16
Kong 2.0
Kong Ingress Controller 0.8.1
Flagger (Kubernetes Operator to automate Canary releases)
Linkerd 2.7.1 (Service Mesh to enable Canary deployment in combination with Flagger)
SM v1
SM v2
Note : SM is our microservice
To deploy our application, we are using Helm.
We are deploying Kong and SM in two different Helm releases.
Our SM chart contains everything (Deployment, Service …) plus the Ingress declaration.
Our use case:
SM v1 is deployed and we are sending traffic to our microservice.
Update the SM Helm release to SM v2.
At this point, we have two versions of the microservice, and the Ingress declaration hasn’t changed.
Observe results
Our results:
Before the update, everything is fine, all requests are responding 200.
During the update, there is a small time window (~4-5 seconds) where requests return errors.
After that time window, everything is fine.
Is your KONG_NGINX_WORKER_PROCESSES environment variable set to 2 or higher? Older defaults set it to 1, which we determined caused issues around the time of config updates.
If that doesn’t clear the issue, are those 503s definitely coming from Kong, or do they appear in the application logs as well? If they are coming from Kong, they should indicate DNS resolution failures, which usually means that no Pods providing that service are ready yet. However, normal rollouts in Kubernetes shouldn’t bring down the existing Pods or update the DNS listing until a sufficient number of new replicas become ready.
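That last guarantee comes from the Deployment's rolling-update settings. If you want to make it explicit, something along these lines keeps existing Pods serving until replacements pass their readiness probes (a sketch only; the surge/unavailable values here are illustrative, not your actual manifest):

```yaml
# Sketch: Deployment strategy fragment that avoids taking old Pods
# down before new ones are ready during a rollout.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove an existing Pod early
      maxSurge: 1         # bring one new Pod up first, wait for readiness
```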
For the environment variable KONG_NGINX_WORKER_PROCESSES, we have set it to auto.
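For reference, pinning it explicitly instead of relying on auto would look roughly like this (a sketch of the Kong proxy container's env section; the surrounding layout is assumed from a typical chart, not taken from your values):

```yaml
# Sketch: env fragment on the Kong proxy container
env:
  - name: KONG_NGINX_WORKER_PROCESSES
    value: "2"   # pin to 2 or higher rather than "auto"
```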
Currently, we don’t see any 503 errors in our SM logs.
It’s not a “normal” rollout; by that I mean it’s not just one Deployment.
To perform a canary release we have two Deployments.
We have a “primary” Deployment, which doesn’t change, and a “canary” Deployment, which is the one being updated.
Then we have a TrafficSplit, managed by Linkerd, that handles the weight distribution between the two services, primary and canary. So technically, Kong contacts a single service without knowing which “sub-services” it will actually reach.
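For context, the TrafficSplit Flagger maintains looks roughly like this (a sketch; the service names and weights are assumptions based on our “SM” naming, not our exact manifest):

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: sm
spec:
  service: sm               # apex service that Kong sends traffic to
  backends:
    - service: sm-primary   # stable Deployment
      weight: 90
    - service: sm-canary    # Deployment being updated
      weight: 10
```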
With Linkerd in the mix, perhaps you need to enable the service-upstream annotation on those Services? I’m still not sure exactly how the apparent DNS failure is occurring, but you should generally use that annotation whenever sidecars from a mesh proxy are in charge of routing decisions.
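For anyone following along, the annotation goes on the Kubernetes Service the Ingress points at. A sketch, assuming a Service named sm with illustrative ports:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sm
  annotations:
    # Tells the Kong Ingress Controller to proxy to the Service's
    # cluster IP instead of individual Pod endpoints, leaving the
    # actual routing decision to the mesh sidecar.
    ingress.kubernetes.io/service-upstream: "true"
spec:
  selector:
    app: sm
  ports:
    - port: 80
      targetPort: 8080
```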