I think that in your case, because you are deploying multiple datacenters, your effective cluster size is 3, not 6 (3 nodes in each DC, and each DC has its own RF). It doesn't change much regarding C/A (consistency/availability), but it does mean that each of your nodes holds more than just 33% of your data (with an RF of 2 across 3 nodes, each node holds roughly two thirds of it).
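For reference, with Cassandra's NetworkTopologyStrategy the RF is declared per datacenter, so your current topology would look something like the following (the keyspace and DC names `kong`, `dc1`, and `dc2` are hypothetical; adjust them to your deployment):

```sql
-- Hypothetical names: adjust the keyspace and DC names to your deployment.
-- RF 2 per DC of 3 nodes: each node owns roughly 2/3 of the local data.
CREATE KEYSPACE kong
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 2,
    'dc2': 2
  };
```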
It seems to me like you have two problems here:
- The OAuth2 plugin performs a read right after a write, and due to Cassandra's eventually consistent nature and the LB policy in effect, there are occasional failures in the plugin.
- Increasing the consistency setting fixes the first problem, but with an RF of 2, a QUORUM read needs both replicas to answer, which means that you cannot survive the loss of a node anymore.
I see two options:
- Using a consistency of ONE and the request-aware LB policy should allow you to keep an RF of 2, survive the loss of a node, and ensure that subsequent reads after an insert are done on the same node, thus avoiding potential consistency issues -> not quite true, see below.
However, the request-aware LB policy does not guarantee that the same node will systematically be used for subsequent queries: if the node becomes unreachable between 2 queries, the policy falls back to another node in a round-robin fashion, and even with an RF of 2, the other node might not have received the token yet (at ONE, the write to the second replica is asynchronous).
- Increasing the consistency to LOCAL_QUORUM, but also increasing the RF to 3 in order to be able to survive the loss of a node (both options are sketched in kong.conf form below). To survive the loss of more than one node, you'd need a larger cluster.
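For illustration, here is roughly what both options look like in kong.conf. This is a sketch only: the exact property names (`cassandra_consistency`, `cassandra_lb_policy`, `cassandra_local_datacenter`) and accepted values may vary across Kong versions, so check your version's kong.conf.default.

```
# Option 1 (sketch): keep RF 2, read/write at ONE, and pin a request's
# queries to one node when possible via the request-aware policy.
cassandra_consistency = ONE
cassandra_lb_policy = RequestRoundRobin

# Option 2 (sketch): quorum within the local DC; requires raising the
# keyspace RF to 3 first (see below).
#cassandra_consistency = LOCAL_QUORUM
#cassandra_lb_policy = RequestDCAwareRoundRobin
#cassandra_local_datacenter = dc1    # hypothetical DC name
```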
In the context of Kong, which isn't a very write-heavy application, and considering your clusters are relatively small, I think setting your RF to 3 would be fine. Of course, you know better than I do the size of your dataset (which comprises all your entities: consumers, oauth2_tokens, rate-limiting rows, etc.) and the available storage on your nodes.
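If you do raise the RF to 3, the change itself is a keyspace alteration, followed by a repair on each node so that existing rows are streamed to their new replicas (same hypothetical keyspace/DC names as above):

```sql
-- Hypothetical names: adjust to your deployment.
ALTER KEYSPACE kong
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
  };
-- Then, on each node:
--   nodetool repair kong
```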
That said, a consistency of ONE is slightly more performant, and isn't as disruptive to your current deployment. The likelihood of a node becoming unreachable between the insert and the subsequent read (combined with the write not yet having propagated to the other replica) is small, but should be assessed as a risk.