Lua Cassandra Connection to 1.2.3.4 Failed

While starting a fresh deploy of Kong for the first time, we noticed this in our logs:

2018/06/04 23:30:28 [warn] 31#0: *10791 [lua] cluster.lua:136: set_peer_down(): [lua-cassandra] setting host at 1.2.3.4 DOWN, context: ngx.timer
2018/06/04 23:31:38 [warn] 32#0: *267 [lua] cluster.lua:136: set_peer_down(): [lua-cassandra] setting host at 1.2.3.4 DOWN, client: 127.0.0.1, server: kong_admin, request: "GET /status HTTP/1.1", host: "localhost:8001"

Kong starts just fine, SSL works, and we can configure proxies. When we call the Admin API endpoint, Kong lists its Cassandra contact points correctly (none of our hosts actually resolve to 1.2.3.4).

This behavior persists even when we only have a single contact point (the rest of Kong still works).

Is this possibly just the result of some error handling subroutine, and the actual problem is elsewhere?

Edit: we also see these entries on startup, but they stop appearing after about 10 seconds:

2018/06/04 23:17:07 [warn] 29#0: *69 [lua] cluster.lua:136: set_peer_down(): [lua-cassandra] setting host at 1.2.3.4 DOWN, context: ssl_certificate_by_lua*, client: 10.0.01, server: 0.0.0.0:8443
2018/06/04 23:17:07 [error] 29#0: *80 [lua] responses.lua:121: load_plugin_configuration(): failed to get from node cache: could not acquire callback lock: timeout, context: ssl_certificate_by_lua*, client: 10.0.01, server: 0.0.0.0:8443
2018/06/04 23:17:07 [error] 29#0: *80 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/kong/tools/responses.lua:125: API disabled in the current context
stack traceback:
coroutine 0:
	[C]: in function '__newindex'
	/usr/local/share/lua/5.1/kong/tools/responses.lua:125: in function 'load_plugin_configuration'
	/usr/local/share/lua/5.1/kong/core/plugins_iterator.lua:189: in function '(for generator)'
	/usr/local/share/lua/5.1/kong/init.lua:302: in function 'ssl_certificate'
	ssl_certificate_by_lua:2: in function <ssl_certificate_by_lua:1>, context: ssl_certificate_by_lua*, client: 10.128.2.1, server: 0.0.0.0:8443
2018/06/04 23:17:07 [crit] 29#0: *79 SSL_do_handshake() failed (SSL: error:1408A179:SSL routines:ssl3_get_client_hello:cert cb error) while SSL handshaking, client: 10.0.01, server: 0.0.0.0:8443
2018/06/04 23:17:07 [error] 27#0: *82 [lua] responses.lua:121: load_plugin_configuration(): failed to get from node cache: could not acquire callback lock: timeout, context: ssl_certificate_by_lua*, client: 10.0.01, server: 0.0.0.0:8443
2018/06/04 23:17:07 [error] 27#0: *82 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/kong/tools/responses.lua:125: API disabled in the current context
stack traceback:
coroutine 0:

Kong gets the list of Cassandra hosts from the Cassandra system keyspace by connecting to one of the provided contact points. Are you sure your list of peers is up to date and not corrupted in any way?

This is the CQL query executed to retrieve the list of peers:

SELECT peer,data_center,rpc_address,release_version FROM system.peers
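
If you want to see exactly what the driver sees, you can run that query yourself from cqlsh, or through the standalone peer API of lua-cassandra. A rough sketch (the host and port are placeholders for one of your contact points):

-- rough sketch using lua-cassandra's single-host module to inspect
-- system.peers; host and port are placeholders for a contact point
local cassandra = require "cassandra"

local peer, err = cassandra.new {
  host = "10.0.0.1",
  port = 9042,
}
assert(peer, err)
assert(peer:connect())

local rows, err = peer:execute(
  "SELECT peer,data_center,rpc_address,release_version FROM system.peers")
assert(rows, err)

for _, row in ipairs(rows) do
  print(row.peer, row.data_center, row.rpc_address, row.release_version)
end

peer:close()

Whatever rpc_address the nodes advertise in that table is the address the driver will subsequently try to connect to, so a bogus entry ends up in the peer list exactly as logged above.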

The error appears to be a direct consequence of a given worker failing to connect to a valid Cassandra host in time, which causes other workers waiting on the same cache key to time out. The Cassandra driver has retry policies for peers that are marked as down, but the core cache instance only allows a hard-coded 5-second timeout for each cache access, meaning that the callback (in this case, the Cassandra query) must return within 5 seconds.
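
For context on the "could not acquire callback lock: timeout" errors: the node cache follows a single-flight pattern in which one worker runs the cache-miss callback (here, the Cassandra lookup) while the other workers wait on a shared lock for the same key and give up once the lock timeout elapses. A rough sketch of that pattern with lua-resty-mlcache, not Kong's actual cache code, assuming an OpenResty context with a lua_shared_dict named cache_shm declared in nginx.conf:

-- rough sketch of the callback-lock behavior, not Kong's cache code
local mlcache = require "resty.mlcache"

local cache, err = mlcache.new("db_cache", "cache_shm", {
  ttl = 60,
  -- resty.lock options: a worker waiting on another worker's callback
  -- gives up after 'timeout' seconds with a lock timeout error
  resty_lock_opts = { timeout = 5, exptime = 10 },
})
if not cache then
  ngx.log(ngx.ERR, "failed to create cache: ", err)
  return
end

-- placeholder callback standing in for the Cassandra query; if it takes
-- longer than the lock timeout, other workers waiting on the same key
-- fail with "could not acquire callback lock: timeout"
local function load_from_cassandra(key)
  return "value for " .. key
end

local value, err = cache:get("some-cache-key", nil, load_from_cassandra, "some-cache-key")
if not value then
  ngx.log(ngx.ERR, "cache lookup failed: ", err)
end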

As always, your troubleshooting suggestion reveals the culprit 100% :slight_smile:. We have a troubled node; see the 1.2.3.4 :open_mouth:? Next up is figuring out why rpc_address is being set like that, so off to cassandra.yaml!

*****@cqlsh> SELECT peer,data_center,rpc_address,release_version FROM system.peers;

 peer         | data_center | rpc_address  | release_version
--------------+-------------+--------------+-----------------
 10.xx.xxx.xx |         ELP | 10.xx.xxx.xx |          3.11.2
 10.xx.xxx.xx |         CTP | 10.xx.xxx.xx |          3.11.2
 10.xx.xxx.xx |         CTP |      1.2.3.4 |          3.11.2
 10.xx.xxx.xx |         CTP | 10.xx.xxx.xx |          3.11.2
 10.xx.xxx.xx |         ELP | 10.xx.xxx.xx |          3.11.2

Much appreciated.

EDIT: yep, broadcast_rpc_address had been set to 1.2.3.4 instead of a proper hostname; changed it and restarting the bad C* node now.
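
For anyone landing on the same symptom: the advertised address comes from cassandra.yaml on the affected node, and it is what shows up in the rpc_address column of system.peers that the driver reads. Something along these lines (the addresses are placeholders; use one that is actually reachable from Kong):

# cassandra.yaml on the affected node (placeholder addresses)
# address the node binds for client (CQL) connections
rpc_address: 10.20.30.40
# address the node advertises to clients and peers; this is the value that
# ends up in system.peers and that Kong will try to connect to
broadcast_rpc_address: 10.20.30.40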

In a Cassandra cluster of 3 nodes, if 1 is marked down, shouldn't Kong still work since the other two nodes are up?

@mgkong Yes, that is the case indeed.
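
For reference, a sketch of the related kong.conf settings, with placeholder hostnames; as long as at least one contact point is reachable at startup, the driver discovers the rest of the ring from system.peers and keeps working around peers it has marked down:

# kong.conf (placeholder hostnames; adjust to your cluster)
cassandra_contact_points = cassandra-1.internal,cassandra-2.internal,cassandra-3.internal
cassandra_port = 9042
cassandra_keyspace = kong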