Kong 1.0.3 on Datastax 6.0 with multiple data-centers unable to bootstrap

#1

Hi all,

I’m experimenting a blocking issue using Kong 1.0.3 on a Cassandra Datastax DSE6.0 cluster.
While deploying Kong 1.0.3 on a single node DSE6.0 cluster works fine, trying to deploy it on a 3 nodes cluster always ends with the following error during the database bootstrap process.

$ kong migrations bootstrap --vv -c ./kong.conf --db-timeout 120 --lock-timeout 120
2019/03/21 13:50:28 [verbose] Kong: 1.0.3
2019/03/21 13:50:28 [debug] ngx_lua: 10013
2019/03/21 13:50:28 [debug] nginx: 1013006
2019/03/21 13:50:28 [debug] Lua: LuaJIT 2.1.0-beta3
2019/03/21 13:50:28 [verbose] reading config file at ../etc/kong.conf
2019/03/21 13:50:28 [debug] reading environment variables
2019/03/21 13:50:28 [debug] KONG_ADMIN_LISTEN ENV found with "0.0.0.0:8001"
2019/03/21 13:50:28 [debug] KONG_PREFIX ENV found with "/opt/test/var"
2019/03/21 13:50:28 [debug] KONG_CASSANDRA_USERNAME ENV found with "admtest"
2019/03/21 13:50:28 [debug] KONG_CASSANDRA_PASSWORD ENV found with "******"
2019/03/21 13:50:28 [debug] KONG_DB_UPDATE_PROPAGATION ENV found with "3"
2019/03/21 13:50:28 [debug] KONG_CASSANDRA_PORT ENV found with "1524"
2019/03/21 13:50:28 [debug] KONG_DATABASE ENV found with "cassandra"
2019/03/21 13:50:28 [debug] KONG_CASSANDRA_CONTACT_POINTS ENV found with "10.10.163.1, 10.10.162.255, 10.10.162.254"
2019/03/21 13:50:28 [debug] KONG_LOG_LEVEL ENV found with "debug"
2019/03/21 13:50:28 [debug] KONG_NGINX_DAEMON ENV found with "off"
2019/03/21 13:50:28 [debug] admin_access_log = "logs/admin_access.log"
2019/03/21 13:50:28 [debug] admin_error_log = "logs/error.log"
2019/03/21 13:50:28 [debug] admin_listen = {"0.0.0.0:8001"}
2019/03/21 13:50:28 [debug] admin_ssl_cert = "/opt/test/etc/certs/changeme-kong_test_com-super-bundle.pem"
2019/03/21 13:50:28 [debug] admin_ssl_cert_key = "/opt/test/etc/certs/changeme-kong_test_com-key.pem"
2019/03/21 13:50:28 [debug] anonymous_reports = false
2019/03/21 13:50:28 [debug] cassandra_consistency = "LOCAL_QUORUM"
2019/03/21 13:50:28 [debug] cassandra_contact_points = {"10.10.163.1","10.10.162.255","10.10.162.254"}
2019/03/21 13:50:28 [debug] cassandra_data_centers = {"dc1:1"}
2019/03/21 13:50:28 [debug] cassandra_keyspace = "kong"
2019/03/21 13:50:28 [debug] cassandra_lb_policy = "RequestRoundRobin"
2019/03/21 13:50:28 [debug] cassandra_local_datacenter = "dc1"
2019/03/21 13:50:28 [debug] cassandra_password = "******"
2019/03/21 13:50:28 [debug] cassandra_port = 1524
2019/03/21 13:50:28 [debug] cassandra_repl_factor = 1
2019/03/21 13:50:28 [debug] cassandra_repl_strategy = "NetworkTopologyStrategy"
2019/03/21 13:50:28 [debug] cassandra_schema_consensus_timeout = 120000
2019/03/21 13:50:28 [debug] cassandra_ssl = false
2019/03/21 13:50:28 [debug] cassandra_ssl_verify = false
2019/03/21 13:50:28 [debug] cassandra_timeout = 5000
2019/03/21 13:50:28 [debug] cassandra_username = "admtest"
2019/03/21 13:50:28 [debug] client_body_buffer_size = "8k"
2019/03/21 13:50:28 [debug] client_max_body_size = "0"
2019/03/21 13:50:28 [debug] client_ssl = false
2019/03/21 13:50:28 [debug] database = "cassandra"
2019/03/21 13:50:28 [debug] db_cache_ttl = 0
2019/03/21 13:50:28 [debug] db_resurrect_ttl = 30
2019/03/21 13:50:28 [debug] db_update_frequency = 5
2019/03/21 13:50:28 [debug] db_update_propagation = 3
2019/03/21 13:50:28 [debug] dns_error_ttl = 1
2019/03/21 13:50:28 [debug] dns_hostsfile = "/etc/hosts"
2019/03/21 13:50:28 [debug] dns_no_sync = false
2019/03/21 13:50:28 [debug] dns_not_found_ttl = 30
2019/03/21 13:50:28 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2019/03/21 13:50:28 [debug] dns_resolver = {}
2019/03/21 13:50:28 [debug] dns_stale_ttl = 4
2019/03/21 13:50:28 [debug] error_default_type = "application/json"
2019/03/21 13:50:28 [debug] headers = {"server_tokens","latency_tokens"}
2019/03/21 13:50:28 [debug] log_level = "debug"
2019/03/21 13:50:28 [debug] lua_package_cpath = ""
2019/03/21 13:50:28 [debug] lua_package_path = "/opt/test/lib/?.lua;;"
2019/03/21 13:50:28 [debug] lua_socket_pool_size = 30
2019/03/21 13:50:28 [debug] lua_ssl_trusted_certificate = "/opt/test/etc/combined-serverCA.crt"
2019/03/21 13:50:28 [debug] lua_ssl_verify_depth = 9
2019/03/21 13:50:28 [debug] mem_cache_size = "128m"
2019/03/21 13:50:28 [debug] nginx_admin_directives = {{value="/opt/test/lib/TEST_nginx_admin.conf",name="include"}}
2019/03/21 13:50:28 [debug] nginx_admin_include = "/opt/test/lib/TEST_nginx_admin.conf"
2019/03/21 13:50:28 [debug] nginx_daemon = "off"
2019/03/21 13:50:28 [debug] nginx_http_directives = {{value="/opt/test/lib/TEST_nginx_http.conf",name="include"}}
2019/03/21 13:50:28 [debug] nginx_http_include = "/opt/test/lib/TEST_nginx_http.conf"
2019/03/21 13:50:28 [debug] nginx_optimizations = true
2019/03/21 13:50:28 [debug] nginx_proxy_directives = {{value="/opt/test/lib/TEST_nginx_proxy.conf",name="include"}}
2019/03/21 13:50:28 [debug] nginx_proxy_include = "/opt/test/lib/TEST_nginx_proxy.conf"
2019/03/21 13:50:28 [debug] nginx_user = "bob test"
2019/03/21 13:50:28 [debug] nginx_worker_processes = "auto"
2019/03/21 13:50:28 [debug] origins = {}
2019/03/21 13:50:28 [debug] pg_database = "kong"
2019/03/21 13:50:28 [debug] pg_host = "127.0.0.1"
2019/03/21 13:50:28 [debug] pg_port = 5432
2019/03/21 13:50:28 [debug] pg_ssl = false
2019/03/21 13:50:28 [debug] pg_ssl_verify = false
2019/03/21 13:50:28 [debug] pg_timeout = 5000
2019/03/21 13:50:28 [debug] pg_user = "kong"
2019/03/21 13:50:28 [debug] plugins = {"bundled"}
2019/03/21 13:50:28 [debug] prefix = "/opt/test/var"
2019/03/21 13:50:28 [debug] proxy_access_log = "logs/access.log"
2019/03/21 13:50:28 [debug] proxy_error_log = "logs/error.log"
2019/03/21 13:50:28 [debug] proxy_listen = {"127.0.0.1:8000","0.0.0.0:8443 ssl"}
2019/03/21 13:50:28 [debug] real_ip_header = "X-Real-IP"
2019/03/21 13:50:28 [debug] real_ip_recursive = "off"
2019/03/21 13:50:28 [debug] ssl_cert = "/opt/test/etc/certs/changeme-kong_test_com-super-bundle.pem"
2019/03/21 13:50:28 [debug] ssl_cert_key = "/opt/test/etc/certs/changeme-kong_test_com-key.pem"
2019/03/21 13:50:28 [debug] ssl_cipher_suite = "modern"
2019/03/21 13:50:28 [debug] ssl_ciphers = "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"
2019/03/21 13:50:28 [debug] stream_listen = {"off"}
2019/03/21 13:50:28 [debug] trusted_ips = {}
2019/03/21 13:50:28 [debug] upstream_keepalive = 60
2019/03/21 13:50:28 [verbose] prefix in use: /opt/test/var
2019/03/21 13:50:29 [debug] loading subsystems migrations...
2019/03/21 13:50:29 [verbose] retrieving keyspace schema state...
2019/03/21 13:50:29 [verbose] schema state retrieved
2019/03/21 13:50:29 [info] bootstrapping database...
2019/03/21 13:50:29 [debug] creating 'kong' keyspace if not existing...
2019/03/21 13:50:29 [debug] successfully created 'kong' keyspace
2019/03/21 13:50:29 [debug] creating 'schema_meta' table if not existing...
2019/03/21 13:50:29 [debug] successfully created 'schema_meta' table
2019/03/21 13:50:29 [debug] creating 'locks' table if not existing...
2019/03/21 13:50:29 [debug] successfully created 'locks' table
2019/03/21 13:50:29 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2019/03/21 13:50:30 [verbose] Cassandra schema consensus: reached
Error:
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: [Cassandra error] failed to insert cluster lock: [Write timeout] Operation timed out - received only 0 responses.
stack traceback:
        [C]: in function 'error'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: in function 'bootstrap'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
        /usr/local/bin/kong:7: in function 'file_gen'
        init_worker_by_lua:54: in function <init_worker_by_lua:52>
        [C]: in function 'xpcall'
        init_worker_by_lua:61: in function <init_worker_by_lua:59>

The keyspace is created, the early tables are created, the consensus after these updates is reached… and then it looks the cluster lock before starting the overall schema creation cannot be obtained.
The issue is systematic.

I’ve been looking for similar issues, analyzing https://github.com/Kong/kong/issues/4226, https://github.com/Kong/kong/issues/4228, https://github.com/Kong/kong/issues/4229 and https://github.com/Kong/kong/issues/4335, but no way to make it work :frowning: .

I tried applying --db-timeout 120 --lock-timeout 120 but no positive result as well (and anyway the error is raised only after a couple of seconds or execution, and not at all after 120 seconds).

I encounter the error specifically on a Datastax DSE6.0; it is working well on DSE 5.

Is there any positive feedbacks from Kong team or Kong’s community regarding the usage of DSE 6.0?
Any hint on what could be the root cause of this issue?

Thanks a lot for your help.

0 Likes

#2

@pamiel Hi,

Do you see any logs in the Cassandra nodes?

0 Likes

#3

Hi @thibaultcha,
Sorry for the late reply, I was running complementary tests. Indeed I realized that my Cassandra cluster was distributed on 2 data centers:

  • 3 nodes on a first DC
  • and 1 node on a second DC

Here are some tests results I got:

  • With a single DC of 3 nodes (so decommissioning the 4th node, i.e. the 2nd DC), the migration runs OK. As in the test I mentioned in my first post, the key configuration data are: cassandra_consistency = “LOCAL_QUORUM”, cassandra_data_centers = “dc1:3”
  • Any other attempt to have 2 DC as stated above fails:
    – If only dc1 is specified in cassandra_data_centers, then the error mentioned in my first post is raised
    – If dc1 and dc2 are specified in cassandra_data_centers (“dc1:3,dc2:1”), the following error is raised
2019/03/28 15:09:38 [info] bootstrapping database...
2019/03/28 15:09:38 [debug] creating 'pat6' keyspace if not existing...
2019/03/28 15:09:38 [debug] successfully created 'pat6' keyspace
2019/03/28 15:09:38 [debug] creating 'schema_meta' table if not existing...
2019/03/28 15:09:39 [debug] successfully created 'schema_meta' table
2019/03/28 15:09:39 [debug] creating 'locks' table if not existing...
2019/03/28 15:09:39 [debug] successfully created 'locks' table
2019/03/28 15:09:39 [verbose] waiting for Cassandra schema consensus (60000ms timeout)...
2019/03/28 15:09:39 [verbose] Cassandra schema consensus: reached
2019/03/28 15:09:40 [info] migrating core on keyspace 'pat6'...
2019/03/28 15:09:40 [debug] running migration: 000_base
2019/03/28 15:09:44 [info] core migrated up to: 000_base (executed)
2019/03/28 15:09:44 [debug] running migration: 001_14_to_15
2019/03/28 15:09:45 [verbose] waiting for Cassandra schema consensus (60000ms timeout)...
2019/03/28 15:09:49 [verbose] Cassandra schema consensus: reached
Error:
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: [Cassandra error] cluster_mutex callback threw an error: /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: [Cassandra error] failed to run migration '001_14_to_15' teardown: ...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:321: [Unavailable exception] Cannot achieve consistency level LOCAL_ONE
stack traceback:
        [C]: in function 'assert'
        ...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:321: in function <...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:264>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/db/init.lua:574: in function 'run_migrations'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: in function </usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/db/init.lua:390: in function 'cluster_mutex'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67: in function 'bootstrap'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
        /usr/local/bin/kong:7: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'assert'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: in function </usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/db/init.lua:390: in function 'cluster_mutex'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67: in function 'bootstrap'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
        /usr/local/bin/kong:7: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: in function 'bootstrap'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
        /usr/local/bin/kong:7: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

This happens whatever the value of cassandra_repl_factor (I tried setting it to 1 and to 3).
Same failure if specifying all 4 IP addresses in the cassandra_contact_points (for other tests, only the 3 IP addresses of dc1 were set)
Same failure if specifying “ALL” to the cassandra_consistency.

PS: I renamed the post title as it looks more related to a multi-DC topology.

0 Likes

#4

Hi @thibaultcha,
Any feedback ? How can I help trying to identify the root case?
Thanks for your help!

0 Likes