I have an issue applying a Kong migration (up or bootstrap) when Kong is backed by a DataStax Enterprise (DSE) Cassandra cluster. The issue happens on versions 0.13.1, 0.14.1, 0.15.0 and 1.0.2.
With a Cassandra cluster of one node there is no issue; with a cluster of multiple nodes the migrations fail, and they fail at different spots on different runs.
For example, on Kong 1.0.2:
root@api-gateway-1:~# kong migrations bootstrap -vv
2019/01/21 08:51:08 [verbose] Kong: 1.0.2
2019/01/21 08:51:08 [debug] ngx_lua: 10013
2019/01/21 08:51:08 [debug] nginx: 1013006
2019/01/21 08:51:08 [debug] Lua: LuaJIT 2.1.0-beta3
2019/01/21 08:51:08 [verbose] reading config file at /etc/kong/kong.conf
2019/01/21 08:51:08 [debug] reading environment variables
2019/01/21 08:51:08 [debug] admin_access_log = "logs/admin_access.log"
2019/01/21 08:51:08 [debug] admin_error_log = "logs/error.log"
2019/01/21 08:51:08 [debug] admin_listen = {"0.0.0.0:8001"}
2019/01/21 08:51:08 [debug] anonymous_reports = true
2019/01/21 08:51:08 [debug] cassandra_consistency = "LOCAL_QUORUM"
2019/01/21 08:51:08 [debug] cassandra_contact_points = {"oltp-cassandra-1.private.george1.do.eng.vixpulse.com","oltp-cassandra-2.private.george1.do.eng.vixpulse.com","oltp-cassandra-3.private.george1.do.eng.vixpulse.com"}
2019/01/21 08:51:08 [debug] cassandra_data_centers = {"OLTP:2"}
2019/01/21 08:51:08 [debug] cassandra_keyspace = "pulse_api_gateway"
2019/01/21 08:51:08 [debug] cassandra_lb_policy = "RequestRoundRobin"
2019/01/21 08:51:08 [debug] cassandra_password = "******"
2019/01/21 08:51:08 [debug] cassandra_port = 9042
2019/01/21 08:51:08 [debug] cassandra_repl_factor = 2
2019/01/21 08:51:08 [debug] cassandra_repl_strategy = "NetworkTopologyStrategy"
2019/01/21 08:51:08 [debug] cassandra_schema_consensus_timeout = 10000
2019/01/21 08:51:08 [debug] cassandra_ssl = false
2019/01/21 08:51:08 [debug] cassandra_ssl_verify = false
2019/01/21 08:51:08 [debug] cassandra_timeout = 5000
2019/01/21 08:51:08 [debug] cassandra_username = "pulse"
2019/01/21 08:51:08 [debug] client_body_buffer_size = "8k"
2019/01/21 08:51:08 [debug] client_max_body_size = "0"
2019/01/21 08:51:08 [debug] client_ssl = false
2019/01/21 08:51:08 [debug] database = "cassandra"
2019/01/21 08:51:08 [debug] db_cache_ttl = 0
2019/01/21 08:51:08 [debug] db_resurrect_ttl = 30
2019/01/21 08:51:08 [debug] db_update_frequency = 300
2019/01/21 08:51:08 [debug] db_update_propagation = 2
2019/01/21 08:51:08 [debug] dns_error_ttl = 1
2019/01/21 08:51:08 [debug] dns_hostsfile = "/etc/hosts"
2019/01/21 08:51:08 [debug] dns_no_sync = false
2019/01/21 08:51:08 [debug] dns_not_found_ttl = 30
2019/01/21 08:51:08 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2019/01/21 08:51:08 [debug] dns_resolver = {}
2019/01/21 08:51:08 [debug] dns_stale_ttl = 4
2019/01/21 08:51:08 [debug] error_default_type = "text/plain"
2019/01/21 08:51:08 [debug] headers = {"server_tokens","latency_tokens"}
2019/01/21 08:51:08 [debug] log_level = "notice"
2019/01/21 08:51:08 [debug] lua_package_cpath = ""
2019/01/21 08:51:08 [debug] lua_package_path = "./?.lua;./?/init.lua;"
2019/01/21 08:51:08 [debug] lua_socket_pool_size = 30
2019/01/21 08:51:08 [debug] lua_ssl_verify_depth = 1
2019/01/21 08:51:08 [debug] mem_cache_size = "128m"
2019/01/21 08:51:08 [debug] nginx_admin_directives = {}
2019/01/21 08:51:08 [debug] nginx_daemon = "on"
2019/01/21 08:51:08 [debug] nginx_http_directives = {}
2019/01/21 08:51:08 [debug] nginx_optimizations = true
2019/01/21 08:51:08 [debug] nginx_proxy_directives = {}
2019/01/21 08:51:08 [debug] nginx_user = "nobody nobody"
2019/01/21 08:51:08 [debug] nginx_worker_processes = "auto"
2019/01/21 08:51:08 [debug] origins = {}
2019/01/21 08:51:08 [debug] pg_database = "kong"
2019/01/21 08:51:08 [debug] pg_host = "127.0.0.1"
2019/01/21 08:51:08 [debug] pg_port = 5432
2019/01/21 08:51:08 [debug] pg_ssl = false
2019/01/21 08:51:08 [debug] pg_ssl_verify = false
2019/01/21 08:51:08 [debug] pg_timeout = 5000
2019/01/21 08:51:08 [debug] pg_user = "kong"
2019/01/21 08:51:08 [debug] plugins = {"bundled"}
2019/01/21 08:51:08 [debug] prefix = "/usr/local/kong/"
2019/01/21 08:51:08 [debug] proxy_access_log = "logs/access.log"
2019/01/21 08:51:08 [debug] proxy_error_log = "logs/error.log"
2019/01/21 08:51:08 [debug] proxy_listen = {"0.0.0.0:8000","0.0.0.0:8443 ssl"}
2019/01/21 08:51:08 [debug] real_ip_header = "X-Forwarded-For"
2019/01/21 08:51:08 [debug] real_ip_recursive = "off"
2019/01/21 08:51:08 [debug] ssl_cipher_suite = "modern"
2019/01/21 08:51:08 [debug] ssl_ciphers = "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"
2019/01/21 08:51:08 [debug] stream_listen = {"off"}
2019/01/21 08:51:08 [debug] trusted_ips = {"172.31.0.0/24","172.31.1.0/24","172.31.2.0/24"}
2019/01/21 08:51:08 [debug] upstream_keepalive = 60
2019/01/21 08:51:08 [verbose] prefix in use: /usr/local/kong
2019/01/21 08:51:08 [warn] 12527#0: *2 [lua] cluster.lua:496: refresh(): [lua-cassandra] found contact point with '0.0.0.0' as rpc_address, using 'oltp-cassandra-1.private.george1.do.eng.vixpulse.com' to contact it instead. If this is incorrect you should avoid using '0.0.0.0' in rpc_address, context: ngx.timer
2019/01/21 08:51:08 [debug] loading subsystems migrations...
2019/01/21 08:51:08 [verbose] retrieving keyspace schema state...
2019/01/21 08:51:08 [verbose] schema state retrieved
2019/01/21 08:51:08 [info] bootstrapping database...
2019/01/21 08:51:08 [debug] creating 'pulse_api_gateway' keyspace if not existing...
2019/01/21 08:51:08 [debug] successfully created 'pulse_api_gateway' keyspace
2019/01/21 08:51:08 [debug] creating 'schema_meta' table if not existing...
2019/01/21 08:51:08 [debug] successfully created 'schema_meta' table
2019/01/21 08:51:08 [debug] creating 'locks' table if not existing...
2019/01/21 08:51:08 [debug] successfully created 'locks' table
2019/01/21 08:51:08 [verbose] waiting for Cassandra schema consensus (60000ms timeout)...
2019/01/21 08:51:09 [verbose] Cassandra schema consensus: reached
2019/01/21 08:51:09 [info] migrating core on keyspace 'pulse_api_gateway'...
2019/01/21 08:51:09 [debug] running migration: 000_base
2019/01/21 08:51:11 [info] core migrated up to: 000_base (executed)
2019/01/21 08:51:11 [debug] running migration: 001_14_to_15
Error:
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: [Cassandra error] cluster_mutex callback threw an error: /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: [Cassandra error] failed to run migration '001_14_to_15' teardown: ...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:299: [Read failure] Operation failed - received 0 responses and 1 failures (UNKNOWN_COLUMN=[/172.31.87.88])
stack traceback:
[C]: in function 'assert'
...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:299: in function <...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:264>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/db/init.lua:570: in function 'run_migrations'
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: in function </usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/db/init.lua:390: in function 'cluster_mutex'
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67: in function 'bootstrap'
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
/usr/local/bin/kong:7: in function 'file_gen'
init_worker_by_lua:48: in function <init_worker_by_lua:46>
[C]: in function 'xpcall'
init_worker_by_lua:55: in function <init_worker_by_lua:53>
stack traceback:
[C]: in function 'assert'
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: in function </usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/db/init.lua:390: in function 'cluster_mutex'
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67: in function 'bootstrap'
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
/usr/local/bin/kong:7: in function 'file_gen'
init_worker_by_lua:48: in function <init_worker_by_lua:46>
[C]: in function 'xpcall'
init_worker_by_lua:55: in function <init_worker_by_lua:53>
stack traceback:
[C]: in function 'error'
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: in function 'bootstrap'
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
/usr/local/bin/kong:7: in function 'file_gen'
init_worker_by_lua:48: in function <init_worker_by_lua:46>
[C]: in function 'xpcall'
init_worker_by_lua:55: in function <init_worker_by_lua:53>
In this case it fails while running the core teardown for 001_14_to_15, but I have seen it get past this point and fail at the oauth migrations, and even further down.
Running a migration on 0.13.1 gives the same type of error, albeit through the older migrations system:
root@api-gateway-1:~# kong migrations up -vv
2019/01/21 08:17:21 [verbose] Kong: 0.13.1
2019/01/21 08:17:21 [debug] ngx_lua: 10011
2019/01/21 08:17:21 [debug] nginx: 1013006
2019/01/21 08:17:21 [debug] Lua: LuaJIT 2.1.0-beta3
2019/01/21 08:17:21 [verbose] reading config file at /etc/kong/kong.conf
2019/01/21 08:17:21 [debug] admin_access_log = "logs/admin_access.log"
2019/01/21 08:17:21 [debug] admin_error_log = "logs/error.log"
2019/01/21 08:17:21 [debug] admin_listen = {"0.0.0.0:8001"}
2019/01/21 08:17:21 [debug] anonymous_reports = true
2019/01/21 08:17:21 [debug] cassandra_consistency = "LOCAL_QUORUM"
2019/01/21 08:17:21 [debug] cassandra_contact_points = {"oltp-cassandra-1.private.george1.do.eng.vixpulse.com","oltp-cassandra-2.private.george1.do.eng.vixpulse.com","oltp-cassandra-3.private.george1.do.eng.vixpulse.com"}
2019/01/21 08:17:21 [debug] cassandra_data_centers = {"OLTP:2"}
2019/01/21 08:17:21 [debug] cassandra_keyspace = "pulse_api_gateway"
2019/01/21 08:17:21 [debug] cassandra_lb_policy = "RoundRobin"
2019/01/21 08:17:21 [debug] cassandra_password = "******"
2019/01/21 08:17:21 [debug] cassandra_port = 9042
2019/01/21 08:17:21 [debug] cassandra_repl_factor = 2
2019/01/21 08:17:21 [debug] cassandra_repl_strategy = "NetworkTopologyStrategy"
2019/01/21 08:17:21 [debug] cassandra_schema_consensus_timeout = 10000
2019/01/21 08:17:21 [debug] cassandra_ssl = false
2019/01/21 08:17:21 [debug] cassandra_ssl_verify = false
2019/01/21 08:17:21 [debug] cassandra_timeout = 5000
2019/01/21 08:17:21 [debug] cassandra_username = "pulse"
2019/01/21 08:17:21 [debug] client_body_buffer_size = "8k"
2019/01/21 08:17:21 [debug] client_max_body_size = "0"
2019/01/21 08:17:21 [debug] client_ssl = false
2019/01/21 08:17:21 [debug] custom_plugins = {}
2019/01/21 08:17:21 [debug] database = "cassandra"
2019/01/21 08:17:21 [debug] db_cache_ttl = 3600
2019/01/21 08:17:21 [debug] db_update_frequency = 300
2019/01/21 08:17:21 [debug] db_update_propagation = 2
2019/01/21 08:17:21 [debug] dns_error_ttl = 1
2019/01/21 08:17:21 [debug] dns_hostsfile = "/etc/hosts"
2019/01/21 08:17:21 [debug] dns_no_sync = false
2019/01/21 08:17:21 [debug] dns_not_found_ttl = 30
2019/01/21 08:17:21 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2019/01/21 08:17:21 [debug] dns_resolver = {}
2019/01/21 08:17:21 [debug] dns_stale_ttl = 4
2019/01/21 08:17:21 [debug] error_default_type = "text/plain"
2019/01/21 08:17:21 [debug] latency_tokens = true
2019/01/21 08:17:21 [debug] log_level = "notice"
2019/01/21 08:17:21 [debug] lua_package_cpath = ""
2019/01/21 08:17:21 [debug] lua_package_path = "./?.lua;./?/init.lua;"
2019/01/21 08:17:21 [debug] lua_socket_pool_size = 30
2019/01/21 08:17:21 [debug] lua_ssl_verify_depth = 1
2019/01/21 08:17:21 [debug] mem_cache_size = "128m"
2019/01/21 08:17:21 [debug] nginx_daemon = "on"
2019/01/21 08:17:21 [debug] nginx_optimizations = true
2019/01/21 08:17:21 [debug] nginx_user = "nobody nobody"
2019/01/21 08:17:21 [debug] nginx_worker_processes = "auto"
2019/01/21 08:17:21 [debug] pg_database = "kong"
2019/01/21 08:17:21 [debug] pg_host = "127.0.0.1"
2019/01/21 08:17:21 [debug] pg_port = 5432
2019/01/21 08:17:21 [debug] pg_ssl = false
2019/01/21 08:17:21 [debug] pg_ssl_verify = false
2019/01/21 08:17:21 [debug] pg_user = "kong"
2019/01/21 08:17:21 [debug] prefix = "/usr/local/kong/"
2019/01/21 08:17:21 [debug] proxy_access_log = "logs/access.log"
2019/01/21 08:17:21 [debug] proxy_error_log = "logs/error.log"
2019/01/21 08:17:21 [debug] proxy_listen = {"0.0.0.0:8000","0.0.0.0:8443 ssl"}
2019/01/21 08:17:21 [debug] real_ip_header = "X-Forwarded-For"
2019/01/21 08:17:21 [debug] real_ip_recursive = "off"
2019/01/21 08:17:21 [debug] server_tokens = true
2019/01/21 08:17:21 [debug] ssl_cipher_suite = "modern"
2019/01/21 08:17:21 [debug] ssl_ciphers = "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"
2019/01/21 08:17:21 [debug] trusted_ips = {"172.31.0.0/24","172.31.1.0/24","172.31.2.0/24"}
2019/01/21 08:17:21 [debug] upstream_keepalive = 60
2019/01/21 08:17:21 [verbose] prefix in use: /usr/local/kong
2019/01/21 08:17:21 [verbose] running datastore migrations
2019/01/21 08:17:21 [verbose] Running Core Migrations
2019/01/21 08:17:21 [info] migrating core for keyspace pulse_api_gateway
2019/01/21 08:17:21 [info] could not switch to pulse_api_gateway keyspace ([Invalid] Keyspace 'pulse_api_gateway' does not exist), attempting to create it
2019/01/21 08:17:21 [info] successfully created pulse_api_gateway keyspace
2019/01/21 08:17:21 [info] core migrated up to: 2015-01-12-175310_skeleton
2019/01/21 08:17:22 [info] core migrated up to: 2015-01-12-175310_init_schema
2019/01/21 08:17:22 [info] core migrated up to: 2015-11-23-817313_nodes
2019/01/21 08:17:23 [info] core migrated up to: 2016-02-25-160900_remove_null_consumer_id
2019/01/21 08:17:23 [info] core migrated up to: 2016-02-29-121813_remove_ttls
2019/01/21 08:17:23 [info] core migrated up to: 2016-09-05-212515_retries_step_1
2019/01/21 08:17:24 [info] core migrated up to: 2016-09-05-212515_retries_step_2
2019/01/21 08:17:25 [info] core migrated up to: 2016-09-16-141423_upstreams
2019/01/21 08:17:25 [info] core migrated up to: 2016-12-14-172100_move_ssl_certs_to_core
2019/01/21 08:17:25 [info] core migrated up to: 2016-11-11-151900_new_apis_router_1
2019/01/21 08:17:26 [info] core migrated up to: 2016-11-11-151900_new_apis_router_2
2019/01/21 08:17:27 [info] core migrated up to: 2016-11-11-151900_new_apis_router_3
2019/01/21 08:17:27 [info] core migrated up to: 2017-01-24-132600_upstream_timeouts
Error:
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:34: [cassandra error] Error during migration 2017-01-24-132600_upstream_timeouts_2: [Read failure] Operation failed - received 0 responses and 1 failures (UNKNOWN_COLUMN=[/172.31.87.88])
stack traceback:
[C]: in function 'assert'
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:34: in function 'cmd_exec'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
/usr/local/bin/kong:7: in function 'file_gen'
init_worker_by_lua:48: in function <init_worker_by_lua:46>
[C]: in function 'xpcall'
init_worker_by_lua:55: in function <init_worker_by_lua:53>
It does not matter what options I use in kong.conf; it always fails. I have tried different combinations:
cassandra_consistency: default value, LOCAL_QUORUM, QUORUM, ALL
cassandra_repl_strategy: NetworkTopologyStrategy and SimpleStrategy
cassandra_repl_factor: 1 or 2 with 3 nodes
cassandra_data_centers: OLTP:1 or OLTP:2 with 3 nodes
cassandra_lb_policy: all options
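Expressed as kong.conf lines, a typical attempt looked like this (values taken from the verbose output above; every combination of the options listed failed the same way):

```ini
database = cassandra
cassandra_contact_points = oltp-cassandra-1.private.george1.do.eng.vixpulse.com,oltp-cassandra-2.private.george1.do.eng.vixpulse.com,oltp-cassandra-3.private.george1.do.eng.vixpulse.com
cassandra_keyspace = pulse_api_gateway
cassandra_consistency = LOCAL_QUORUM
cassandra_repl_strategy = NetworkTopologyStrategy
cassandra_repl_factor = 2
cassandra_data_centers = OLTP:2
cassandra_lb_policy = RequestRoundRobin
cassandra_schema_consensus_timeout = 10000
```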
This appears to be a Cassandra schema consensus issue rather than a problem with the Kong migrations themselves.
If I change the migration code to wait for schema consensus between steps, everything works fine. Is anyone else having issues running against a Cassandra cluster of more than one node, or is this specifically a DSE schema consensus issue? I had originally created a PR to get migrations and Kong working with DSE, but I suspect this is a bigger issue if it affects generic Cassandra clusters too. The changes below were just a test to see whether waiting for consensus after each up and teardown step helped, and it did, but I am not sure that is the best approach.
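One way to see whether this is consensus-related is to check schema agreement across the cluster while the migration runs: if nodes report different schema versions, the next migration statement can hit a node that has not yet seen the new column, which matches the UNKNOWN_COLUMN read failure above. A quick check, assuming cqlsh access to any node:

```sql
-- This node's own schema version; during schema changes the versions
-- diverge across nodes until gossip propagates the new schema.
SELECT schema_version FROM system.local;

-- The schema versions this node has observed for its peers; all rows
-- should match system.local once consensus is reached.
SELECT peer, schema_version FROM system.peers;
```

nodetool describecluster shows the same information aggregated under "Schema versions".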
For reference, the code changes I made are:
VERSION: 0.15.0 & 1.0.2
FILE: /usr/local/share/lua/5.1/kong/db/init.lua
FUNCTION: function DB:run_migrations(migrations, options)
LOCATION: inside the loop for _, mig in ipairs(t.migrations) do, at the end of the if run_up then block AND at the end of the if run_teardown and strategy_migration.teardown then block
ADD: the same code that already runs at the end of the function to wait for consensus after up migrations. I reused connector:post_run_up_migrations() because it is exposed and currently only waits for consensus; if that function ever does more after an up migration, it might be worth exposing a wait_for_consensus function directly. I.e.
function DB:run_migrations(migrations, options)
....
for _, t in ipairs(migrations) do
....
for _, mig in ipairs(t.migrations) do
....
if run_up then
....
--START
ok, err = self.connector:post_run_up_migrations()
if not ok then
self.connector:close()
return nil, prefix_err(self, err)
end
--END
end
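The corresponding addition at the end of the teardown branch is the same (a sketch; connector:post_run_up_migrations() is reused here purely because, as noted above, it currently only waits for schema consensus):

```lua
if run_teardown and strategy_migration.teardown then
  ....
  --START
  -- wait for schema consensus before moving on to the next migration,
  -- so no node is still missing columns the teardown just touched
  ok, err = self.connector:post_run_up_migrations()
  if not ok then
    self.connector:close()
    return nil, prefix_err(self, err)
  end
  --END
end
```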
VERSION: 0.13.1
& 0.14.1
FILE: /usr/local/share/lua/5.1/kong/dao/factory.lua
FUNCTION: local function migrate(self, identifier, migrations_modules, cur_migrations, on_migrate, on_success)
LOCATION: at the end of the loop beginning with for _, migration in ipairs(to_run) do
ADD: the same lines used in run_migrations to wait for schema consensus, i.e.
local function migrate(self, identifier, migrations_modules, cur_migrations, on_migrate, on_success)
....
for _, migration in ipairs(to_run) do
....
--START
if self.db.name == "cassandra" then
log("waiting for Cassandra schema consensus (%dms timeout)...",
self.db.cluster.max_schema_consensus_wait)
local ok, err = self.db:wait_for_schema_consensus()
if not ok then
return ret_error_string(self.db.name, nil,
"failed to wait for schema consensus: " .. err)
end
log("Cassandra schema consensus: reached")
end
--END
end