Lua Cassandra Connection to 1.2.3.4 Failed

While starting a fresh deploy of Kong for the first time, we noticed this in our logs:

2018/06/04 23:30:28 [warn] 31#0: *10791 [lua] cluster.lua:136: set_peer_down(): [lua-cassandra] setting host at 1.2.3.4 DOWN, context: ngx.timer
2018/06/04 23:31:38 [warn] 32#0: *267 [lua] cluster.lua:136: set_peer_down(): [lua-cassandra] setting host at 1.2.3.4 DOWN, client: 127.0.0.1, server: kong_admin, request: "GET /status HTTP/1.1", host: "localhost:8001"

Kong starts just fine, SSL works, and we can configure proxies. When we call the Admin API endpoint, Kong lists its Cassandra contact points correctly (none of our hosts actually resolve to 1.2.3.4).

This behavior persists even when we only have a single contact point (the rest of Kong still works).

Is this possibly just the result of some error handling subroutine, and the actual problem is elsewhere?

Edit: we also see these entries on startup, but they stop appearing after about 10 seconds:

2018/06/04 23:17:07 [warn] 29#0: *69 [lua] cluster.lua:136: set_peer_down(): [lua-cassandra] setting host at 1.2.3.4 DOWN, context: ssl_certificate_by_lua*, client: 10.0.01, server: 0.0.0.0:8443
2018/06/04 23:17:07 [error] 29#0: *80 [lua] responses.lua:121: load_plugin_configuration(): failed to get from node cache: could not acquire callback lock: timeout, context: ssl_certificate_by_lua*, client: 10.0.01, server: 0.0.0.0:8443
2018/06/04 23:17:07 [error] 29#0: *80 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/kong/tools/responses.lua:125: API disabled in the current context
stack traceback:
coroutine 0:
	[C]: in function '__newindex'
	/usr/local/share/lua/5.1/kong/tools/responses.lua:125: in function 'load_plugin_configuration'
	/usr/local/share/lua/5.1/kong/core/plugins_iterator.lua:189: in function '(for generator)'
	/usr/local/share/lua/5.1/kong/init.lua:302: in function 'ssl_certificate'
	ssl_certificate_by_lua:2: in function <ssl_certificate_by_lua:1>, context: ssl_certificate_by_lua*, client: 10.128.2.1, server: 0.0.0.0:8443
2018/06/04 23:17:07 [crit] 29#0: *79 SSL_do_handshake() failed (SSL: error:1408A179:SSL routines:ssl3_get_client_hello:cert cb error) while SSL handshaking, client: 10.0.01, server: 0.0.0.0:8443
2018/06/04 23:17:07 [error] 27#0: *82 [lua] responses.lua:121: load_plugin_configuration(): failed to get from node cache: could not acquire callback lock: timeout, context: ssl_certificate_by_lua*, client: 10.0.01, server: 0.0.0.0:8443
2018/06/04 23:17:07 [error] 27#0: *82 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/kong/tools/responses.lua:125: API disabled in the current context
stack traceback:
coroutine 0:

Kong gets the list of Cassandra hosts from the Cassandra system keyspace by connecting to one of the provided contact points. Are you sure your list of peers is up to date and not corrupted in any way?

This is the CQL query executed to retrieve the list of peers:

SELECT peer,data_center,rpc_address,release_version FROM system.peers
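
If you want to see exactly what the driver sees, you can run that query yourself from cqlsh, or through the standalone peer API of lua-cassandra. A rough sketch (the host and port are placeholders for one of your contact points):

-- rough sketch using lua-cassandra's single-host module to inspect
-- system.peers; host and port are placeholders for a contact point
local cassandra = require "cassandra"

local peer, err = cassandra.new {
  host = "10.0.0.1",
  port = 9042,
}
assert(peer, err)
assert(peer:connect())

local rows, err = peer:execute(
  "SELECT peer,data_center,rpc_address,release_version FROM system.peers")
assert(rows, err)

for _, row in ipairs(rows) do
  print(row.peer, row.data_center, row.rpc_address, row.release_version)
end

peer:close()

Whatever rpc_address the nodes advertise in that table is the address the driver will subsequently try to connect to, so a bogus entry ends up in the peer list exactly as logged above.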

The error appears to be a direct consequence of a given worker failing to connect to a valid Cassandra host in time, which causes other workers waiting on the same cache key to time out. The Cassandra driver has retry policies for peers that are marked as down, but the core cache instance only allows a hard-coded 5-second timeout for each cache access, meaning that the callback (in this case, the Cassandra query) must return within 5 seconds.
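
For context on the "could not acquire callback lock: timeout" errors: the node cache follows a single-flight pattern in which one worker runs the cache-miss callback (here, the Cassandra lookup) while the other workers wait on a shared lock for the same key and give up once the lock timeout elapses. A rough sketch of that pattern with lua-resty-mlcache, not Kong's actual cache code, assuming an OpenResty context with a lua_shared_dict named cache_shm declared in nginx.conf:

-- rough sketch of the callback-lock behavior, not Kong's cache code
local mlcache = require "resty.mlcache"

local cache, err = mlcache.new("db_cache", "cache_shm", {
  ttl = 60,
  -- resty.lock options: a worker waiting on another worker's callback
  -- gives up after 'timeout' seconds with a lock timeout error
  resty_lock_opts = { timeout = 5, exptime = 10 },
})
if not cache then
  ngx.log(ngx.ERR, "failed to create cache: ", err)
  return
end

-- placeholder callback standing in for the Cassandra query; if it takes
-- longer than the lock timeout, other workers waiting on the same key
-- fail with "could not acquire callback lock: timeout"
local function load_from_cassandra(key)
  return "value for " .. key
end

local value, err = cache:get("some-cache-key", nil, load_from_cassandra, "some-cache-key")
if not value then
  ngx.log(ngx.ERR, "cache lookup failed: ", err)
end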

As always, your troubleshooting suggestion reveals the culprit 100% :slight_smile:. We have a troubled node; see the 1.2.3.4 :open_mouth:? Next up is figuring out why rpc_address is being set like that, so off to cassandra.yaml!

*****@cqlsh> SELECT peer,data_center,rpc_address,release_version FROM system.peers;

 peer         | data_center | rpc_address  | release_version
--------------+-------------+--------------+-----------------
 10.xx.xxx.xx |         ELP | 10.xx.xxx.xx |          3.11.2
 10.xx.xxx.xx |         CTP | 10.xx.xxx.xx |          3.11.2
 10.xx.xxx.xx |         CTP |      1.2.3.4 |          3.11.2
 10.xx.xxx.xx |         CTP | 10.xx.xxx.xx |          3.11.2
 10.xx.xxx.xx |         ELP | 10.xx.xxx.xx |          3.11.2

Much appreciated.

EDIT: yep, broadcast_rpc_address had been set to 1.2.3.4 instead of a proper hostname; changed it and restarting the bad C* node now.
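
For anyone landing on the same symptom: the advertised address comes from cassandra.yaml on the affected node, and it is what shows up in the rpc_address column of system.peers that the driver reads. Something along these lines (the addresses are placeholders; use one that is actually reachable from Kong):

# cassandra.yaml on the affected node (placeholder addresses)
# address the node binds for client (CQL) connections
rpc_address: 10.20.30.40
# address the node advertises to clients and peers; this is the value that
# ends up in system.peers and that Kong will try to connect to
broadcast_rpc_address: 10.20.30.40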

In a Cassandra cluster of 3 nodes, if 1 is marked down, shouldn't Kong still work since the other two nodes are up?

@mgkong Yes, that is the case indeed.
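
For reference, a sketch of the related kong.conf settings, with placeholder hostnames; as long as at least one contact point is reachable at startup, the driver discovers the rest of the ring from system.peers and keeps working around peers it has marked down:

# kong.conf (placeholder hostnames; adjust to your cluster)
cassandra_contact_points = cassandra-1.internal,cassandra-2.internal,cassandra-3.internal
cassandra_port = 9042
cassandra_keyspace = kong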