Admin API DB Health Enhancement

Right now the /status endpoint returns JSON that includes whether the db is “reachable”. I assume that means that as long as Kong can establish some form of connection to the database, it reports reachable: true.
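For reference, here is a minimal sketch of reading that flag today. It assumes the Admin API is on the default local address (http://localhost:8001) and that the payload has a database.reachable field; the helper names db_reachable and fetch_status are my own, not part of Kong.

```python
import json
from urllib.request import urlopen

# Assumption: the Admin API listens on the default local address.
ADMIN_URL = "http://localhost:8001"


def db_reachable(status: dict) -> bool:
    """Return the database.reachable flag from a /status payload (False if absent)."""
    return bool(status.get("database", {}).get("reachable", False))


def fetch_status(admin_url: str = ADMIN_URL, timeout: float = 5.0) -> dict:
    """GET /status from the Admin API and decode the JSON body."""
    with urlopen(f"{admin_url}/status", timeout=timeout) as resp:
        return json.load(resp)


# Abbreviated example payload; a real response also carries server metrics.
sample = json.loads('{"database": {"reachable": true}, "server": {}}')
print(db_reachable(sample))  # True
```

Against a live node this would be db_reachable(fetch_status()). Note that it only tells you a connection could be made, not that reads and writes succeed.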

I think it would be nice to also have a “healthy”: true/false field to indicate whether Kong can actually read from and write to the database with its currently configured settings.

This could help detect when a C* cluster has lost too many nodes to reach quorum, or when Postgres has exhausted its connections, or something of the sort.

The way the ‘reachable’ check establishes a connection with Postgres would cover the latter; e.g., if Postgres connections were exhausted, the connection attempt would fail.

Checking for the former (Cassandra) should probably be in the realm of monitoring tools that directly observe Cassandra. An indirect data point would be a useful alarm, but there should be more direct/proactive monitoring of the Cassandra cluster itself (at least, this is how the Kong Cloud team sets up monitoring for this infrastructure).

Fully agree there. There should be tooling directly checking C* cluster health and availability, for sure. I was thinking more of a Kong-side db validation of sorts, to catch things that are Kong specific. For instance, maybe your R/W quorum settings cause Kong’s reads and writes to fail against the C* cluster, while other apps that use ONE or something similar see no issues.
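To make the consistency-level point concrete, here is a small sketch of the replica arithmetic. It only models the standard rule that QUORUM needs floor(RF / 2) + 1 replicas; the function names are hypothetical and nothing here talks to a real cluster.

```python
def required_replicas(consistency: str, replication_factor: int) -> int:
    """Replicas that must respond for a request at the given consistency level."""
    if consistency == "ONE":
        return 1
    if consistency == "QUORUM":
        return replication_factor // 2 + 1
    if consistency == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency}")


def can_serve(consistency: str, replication_factor: int, live_replicas: int) -> bool:
    """True if enough replicas are alive to satisfy the consistency level."""
    return live_replicas >= required_replicas(consistency, replication_factor)


# Replication factor 3 with only one replica still alive:
print(can_serve("ONE", 3, 1))     # True
print(can_serve("QUORUM", 3, 1))  # False
```

So an app reading at ONE keeps working while a client configured for QUORUM fails, which is exactly the case a generic cluster-level check might not surface.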

Even then, I could write something that uses the same R/W consistency that Kong does; it would just be nifty if Kong set a programmatic alerting flag to raise alarms when it notices issues. It is certainly something I could write into an external app today, but having Kong do it itself would be another value add when looking at future enhancements, imo. It might be tricky to decide when to flag unhealthy for Postgres vs C* vs any future db integrations, but generally, if there is a read or write issue, I would trip the flag for some X amount of time while the issue is presenting itself.
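A rough sketch of the "trip the flag for X amount of time" idea, as it might look in an external app. The HealthFlag class and its hold_seconds parameter are hypothetical, not anything Kong exposes; the caller is assumed to report the outcome of each read or write.

```python
import time


class HealthFlag:
    """Reports unhealthy until hold_seconds have passed since the last failure."""

    def __init__(self, hold_seconds: float, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_failure = None    # timestamp of the most recent failure

    def record(self, ok: bool) -> None:
        """Record the outcome of one database read or write."""
        if not ok:
            self._last_failure = self._clock()

    @property
    def healthy(self) -> bool:
        if self._last_failure is None:
            return True
        return self._clock() - self._last_failure >= self.hold_seconds


# Demo with a fake clock so the behaviour is deterministic.
now = [0.0]
flag = HealthFlag(hold_seconds=30.0, clock=lambda: now[0])
flag.record(ok=False)   # a write fails at t=0
print(flag.healthy)     # False
now[0] = 30.0           # 30 seconds later with no new failures
print(flag.healthy)     # True
```

Successes do not clear the flag early here, so a flapping database stays flagged for the full window; whether that is the right trade-off per db type is the tricky part.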