Kong migrations against a Cassandra cluster fail


#1

I have an issue applying a Kong migration (up or bootstrap) when Kong is backed by a DataStax Enterprise (DSE) Cassandra cluster. The issue occurs on Kong versions 0.13.1, 0.14.1, 0.15.0, and 1.0.2.

With a single-node Cassandra cluster there is no issue; with a multi-node cluster the migrations fail, and they fail at varying points.

For example, on Kong 1.0.2:

root@api-gateway-1:~# kong migrations bootstrap -vv
2019/01/21 08:51:08 [verbose] Kong: 1.0.2
2019/01/21 08:51:08 [debug] ngx_lua: 10013
2019/01/21 08:51:08 [debug] nginx: 1013006
2019/01/21 08:51:08 [debug] Lua: LuaJIT 2.1.0-beta3
2019/01/21 08:51:08 [verbose] reading config file at /etc/kong/kong.conf
2019/01/21 08:51:08 [debug] reading environment variables
2019/01/21 08:51:08 [debug] admin_access_log = "logs/admin_access.log"
2019/01/21 08:51:08 [debug] admin_error_log = "logs/error.log"
2019/01/21 08:51:08 [debug] admin_listen = {"0.0.0.0:8001"}
2019/01/21 08:51:08 [debug] anonymous_reports = true
2019/01/21 08:51:08 [debug] cassandra_consistency = "LOCAL_QUORUM"
2019/01/21 08:51:08 [debug] cassandra_contact_points = {"oltp-cassandra-1.private.george1.do.eng.vixpulse.com","oltp-cassandra-2.private.george1.do.eng.vixpulse.com","oltp-cassandra-3.private.george1.do.eng.vixpulse.com"}
2019/01/21 08:51:08 [debug] cassandra_data_centers = {"OLTP:2"}
2019/01/21 08:51:08 [debug] cassandra_keyspace = "pulse_api_gateway"
2019/01/21 08:51:08 [debug] cassandra_lb_policy = "RequestRoundRobin"
2019/01/21 08:51:08 [debug] cassandra_password = "******"
2019/01/21 08:51:08 [debug] cassandra_port = 9042
2019/01/21 08:51:08 [debug] cassandra_repl_factor = 2
2019/01/21 08:51:08 [debug] cassandra_repl_strategy = "NetworkTopologyStrategy"
2019/01/21 08:51:08 [debug] cassandra_schema_consensus_timeout = 10000
2019/01/21 08:51:08 [debug] cassandra_ssl = false
2019/01/21 08:51:08 [debug] cassandra_ssl_verify = false
2019/01/21 08:51:08 [debug] cassandra_timeout = 5000
2019/01/21 08:51:08 [debug] cassandra_username = "pulse"
2019/01/21 08:51:08 [debug] client_body_buffer_size = "8k"
2019/01/21 08:51:08 [debug] client_max_body_size = "0"
2019/01/21 08:51:08 [debug] client_ssl = false
2019/01/21 08:51:08 [debug] database = "cassandra"
2019/01/21 08:51:08 [debug] db_cache_ttl = 0
2019/01/21 08:51:08 [debug] db_resurrect_ttl = 30
2019/01/21 08:51:08 [debug] db_update_frequency = 300
2019/01/21 08:51:08 [debug] db_update_propagation = 2
2019/01/21 08:51:08 [debug] dns_error_ttl = 1
2019/01/21 08:51:08 [debug] dns_hostsfile = "/etc/hosts"
2019/01/21 08:51:08 [debug] dns_no_sync = false
2019/01/21 08:51:08 [debug] dns_not_found_ttl = 30
2019/01/21 08:51:08 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2019/01/21 08:51:08 [debug] dns_resolver = {}
2019/01/21 08:51:08 [debug] dns_stale_ttl = 4
2019/01/21 08:51:08 [debug] error_default_type = "text/plain"
2019/01/21 08:51:08 [debug] headers = {"server_tokens","latency_tokens"}
2019/01/21 08:51:08 [debug] log_level = "notice"
2019/01/21 08:51:08 [debug] lua_package_cpath = ""
2019/01/21 08:51:08 [debug] lua_package_path = "./?.lua;./?/init.lua;"
2019/01/21 08:51:08 [debug] lua_socket_pool_size = 30
2019/01/21 08:51:08 [debug] lua_ssl_verify_depth = 1
2019/01/21 08:51:08 [debug] mem_cache_size = "128m"
2019/01/21 08:51:08 [debug] nginx_admin_directives = {}
2019/01/21 08:51:08 [debug] nginx_daemon = "on"
2019/01/21 08:51:08 [debug] nginx_http_directives = {}
2019/01/21 08:51:08 [debug] nginx_optimizations = true
2019/01/21 08:51:08 [debug] nginx_proxy_directives = {}
2019/01/21 08:51:08 [debug] nginx_user = "nobody nobody"
2019/01/21 08:51:08 [debug] nginx_worker_processes = "auto"
2019/01/21 08:51:08 [debug] origins = {}
2019/01/21 08:51:08 [debug] pg_database = "kong"
2019/01/21 08:51:08 [debug] pg_host = "127.0.0.1"
2019/01/21 08:51:08 [debug] pg_port = 5432
2019/01/21 08:51:08 [debug] pg_ssl = false
2019/01/21 08:51:08 [debug] pg_ssl_verify = false
2019/01/21 08:51:08 [debug] pg_timeout = 5000
2019/01/21 08:51:08 [debug] pg_user = "kong"
2019/01/21 08:51:08 [debug] plugins = {"bundled"}
2019/01/21 08:51:08 [debug] prefix = "/usr/local/kong/"
2019/01/21 08:51:08 [debug] proxy_access_log = "logs/access.log"
2019/01/21 08:51:08 [debug] proxy_error_log = "logs/error.log"
2019/01/21 08:51:08 [debug] proxy_listen = {"0.0.0.0:8000","0.0.0.0:8443 ssl"}
2019/01/21 08:51:08 [debug] real_ip_header = "X-Forwarded-For"
2019/01/21 08:51:08 [debug] real_ip_recursive = "off"
2019/01/21 08:51:08 [debug] ssl_cipher_suite = "modern"
2019/01/21 08:51:08 [debug] ssl_ciphers = "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"
2019/01/21 08:51:08 [debug] stream_listen = {"off"}
2019/01/21 08:51:08 [debug] trusted_ips = {"172.31.0.0/24","172.31.1.0/24","172.31.2.0/24"}
2019/01/21 08:51:08 [debug] upstream_keepalive = 60
2019/01/21 08:51:08 [verbose] prefix in use: /usr/local/kong
2019/01/21 08:51:08 [warn] 12527#0: *2 [lua] cluster.lua:496: refresh(): [lua-cassandra] found contact point with '0.0.0.0' as rpc_address, using 'oltp-cassandra-1.private.george1.do.eng.vixpulse.com' to contact it instead. If this is incorrect you should avoid using '0.0.0.0' in rpc_address, context: ngx.timer
2019/01/21 08:51:08 [debug] loading subsystems migrations...
2019/01/21 08:51:08 [verbose] retrieving keyspace schema state...
2019/01/21 08:51:08 [verbose] schema state retrieved
2019/01/21 08:51:08 [info] bootstrapping database...
2019/01/21 08:51:08 [debug] creating 'pulse_api_gateway' keyspace if not existing...
2019/01/21 08:51:08 [debug] successfully created 'pulse_api_gateway' keyspace
2019/01/21 08:51:08 [debug] creating 'schema_meta' table if not existing...
2019/01/21 08:51:08 [debug] successfully created 'schema_meta' table
2019/01/21 08:51:08 [debug] creating 'locks' table if not existing...
2019/01/21 08:51:08 [debug] successfully created 'locks' table
2019/01/21 08:51:08 [verbose] waiting for Cassandra schema consensus (60000ms timeout)...
2019/01/21 08:51:09 [verbose] Cassandra schema consensus: reached
2019/01/21 08:51:09 [info] migrating core on keyspace 'pulse_api_gateway'...
2019/01/21 08:51:09 [debug] running migration: 000_base
2019/01/21 08:51:11 [info] core migrated up to: 000_base (executed)
2019/01/21 08:51:11 [debug] running migration: 001_14_to_15
Error:
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: [Cassandra error] cluster_mutex callback threw an error: /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: [Cassandra error] failed to run migration '001_14_to_15' teardown: ...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:299: [Read failure] Operation failed - received 0 responses and 1 failures (UNKNOWN_COLUMN=[/172.31.87.88])
stack traceback:
        [C]: in function 'assert'
        ...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:299: in function <...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:264>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/db/init.lua:570: in function 'run_migrations'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: in function </usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/db/init.lua:390: in function 'cluster_mutex'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67: in function 'bootstrap'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
        /usr/local/bin/kong:7: in function 'file_gen'
        init_worker_by_lua:48: in function <init_worker_by_lua:46>
        [C]: in function 'xpcall'
        init_worker_by_lua:55: in function <init_worker_by_lua:53>
stack traceback:
        [C]: in function 'assert'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: in function </usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/db/init.lua:390: in function 'cluster_mutex'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67: in function 'bootstrap'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
        /usr/local/bin/kong:7: in function 'file_gen'
        init_worker_by_lua:48: in function <init_worker_by_lua:46>
        [C]: in function 'xpcall'
        init_worker_by_lua:55: in function <init_worker_by_lua:53>
stack traceback:
        [C]: in function 'error'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: in function 'bootstrap'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
        /usr/local/bin/kong:7: in function 'file_gen'
        init_worker_by_lua:48: in function <init_worker_by_lua:46>
        [C]: in function 'xpcall'
        init_worker_by_lua:55: in function <init_worker_by_lua:53>

In this case it fails running the core 001_14_to_15 teardown, but I have seen it get past this point and fail at the oauth2 migrations, or even further down.

Running a migration on 0.13.1 gives the same type of error, although that version uses a different migrations system:

root@api-gateway-1:~# kong migrations up -vv
2019/01/21 08:17:21 [verbose] Kong: 0.13.1
2019/01/21 08:17:21 [debug] ngx_lua: 10011
2019/01/21 08:17:21 [debug] nginx: 1013006
2019/01/21 08:17:21 [debug] Lua: LuaJIT 2.1.0-beta3
2019/01/21 08:17:21 [verbose] reading config file at /etc/kong/kong.conf
2019/01/21 08:17:21 [debug] admin_access_log = "logs/admin_access.log"
2019/01/21 08:17:21 [debug] admin_error_log = "logs/error.log"
2019/01/21 08:17:21 [debug] admin_listen = {"0.0.0.0:8001"}
2019/01/21 08:17:21 [debug] anonymous_reports = true
2019/01/21 08:17:21 [debug] cassandra_consistency = "LOCAL_QUORUM"
2019/01/21 08:17:21 [debug] cassandra_contact_points = {"oltp-cassandra-1.private.george1.do.eng.vixpulse.com","oltp-cassandra-2.private.george1.do.eng.vixpulse.com","oltp-cassandra-3.private.george1.do.eng.vixpulse.com"}
2019/01/21 08:17:21 [debug] cassandra_data_centers = {"OLTP:2"}
2019/01/21 08:17:21 [debug] cassandra_keyspace = "pulse_api_gateway"
2019/01/21 08:17:21 [debug] cassandra_lb_policy = "RoundRobin"
2019/01/21 08:17:21 [debug] cassandra_password = "******"
2019/01/21 08:17:21 [debug] cassandra_port = 9042
2019/01/21 08:17:21 [debug] cassandra_repl_factor = 2
2019/01/21 08:17:21 [debug] cassandra_repl_strategy = "NetworkTopologyStrategy"
2019/01/21 08:17:21 [debug] cassandra_schema_consensus_timeout = 10000
2019/01/21 08:17:21 [debug] cassandra_ssl = false
2019/01/21 08:17:21 [debug] cassandra_ssl_verify = false
2019/01/21 08:17:21 [debug] cassandra_timeout = 5000
2019/01/21 08:17:21 [debug] cassandra_username = "pulse"
2019/01/21 08:17:21 [debug] client_body_buffer_size = "8k"
2019/01/21 08:17:21 [debug] client_max_body_size = "0"
2019/01/21 08:17:21 [debug] client_ssl = false
2019/01/21 08:17:21 [debug] custom_plugins = {}
2019/01/21 08:17:21 [debug] database = "cassandra"
2019/01/21 08:17:21 [debug] db_cache_ttl = 3600
2019/01/21 08:17:21 [debug] db_update_frequency = 300
2019/01/21 08:17:21 [debug] db_update_propagation = 2
2019/01/21 08:17:21 [debug] dns_error_ttl = 1
2019/01/21 08:17:21 [debug] dns_hostsfile = "/etc/hosts"
2019/01/21 08:17:21 [debug] dns_no_sync = false
2019/01/21 08:17:21 [debug] dns_not_found_ttl = 30
2019/01/21 08:17:21 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2019/01/21 08:17:21 [debug] dns_resolver = {}
2019/01/21 08:17:21 [debug] dns_stale_ttl = 4
2019/01/21 08:17:21 [debug] error_default_type = "text/plain"
2019/01/21 08:17:21 [debug] latency_tokens = true
2019/01/21 08:17:21 [debug] log_level = "notice"
2019/01/21 08:17:21 [debug] lua_package_cpath = ""
2019/01/21 08:17:21 [debug] lua_package_path = "./?.lua;./?/init.lua;"
2019/01/21 08:17:21 [debug] lua_socket_pool_size = 30
2019/01/21 08:17:21 [debug] lua_ssl_verify_depth = 1
2019/01/21 08:17:21 [debug] mem_cache_size = "128m"
2019/01/21 08:17:21 [debug] nginx_daemon = "on"
2019/01/21 08:17:21 [debug] nginx_optimizations = true
2019/01/21 08:17:21 [debug] nginx_user = "nobody nobody"
2019/01/21 08:17:21 [debug] nginx_worker_processes = "auto"
2019/01/21 08:17:21 [debug] pg_database = "kong"
2019/01/21 08:17:21 [debug] pg_host = "127.0.0.1"
2019/01/21 08:17:21 [debug] pg_port = 5432
2019/01/21 08:17:21 [debug] pg_ssl = false
2019/01/21 08:17:21 [debug] pg_ssl_verify = false
2019/01/21 08:17:21 [debug] pg_user = "kong"
2019/01/21 08:17:21 [debug] prefix = "/usr/local/kong/"
2019/01/21 08:17:21 [debug] proxy_access_log = "logs/access.log"
2019/01/21 08:17:21 [debug] proxy_error_log = "logs/error.log"
2019/01/21 08:17:21 [debug] proxy_listen = {"0.0.0.0:8000","0.0.0.0:8443 ssl"}
2019/01/21 08:17:21 [debug] real_ip_header = "X-Forwarded-For"
2019/01/21 08:17:21 [debug] real_ip_recursive = "off"
2019/01/21 08:17:21 [debug] server_tokens = true
2019/01/21 08:17:21 [debug] ssl_cipher_suite = "modern"
2019/01/21 08:17:21 [debug] ssl_ciphers = "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"
2019/01/21 08:17:21 [debug] trusted_ips = {"172.31.0.0/24","172.31.1.0/24","172.31.2.0/24"}
2019/01/21 08:17:21 [debug] upstream_keepalive = 60
2019/01/21 08:17:21 [verbose] prefix in use: /usr/local/kong
2019/01/21 08:17:21 [verbose] running datastore migrations
2019/01/21 08:17:21 [verbose] Running Core Migrations
2019/01/21 08:17:21 [info] migrating core for keyspace pulse_api_gateway
2019/01/21 08:17:21 [info] could not switch to pulse_api_gateway keyspace ([Invalid] Keyspace 'pulse_api_gateway' does not exist), attempting to create it
2019/01/21 08:17:21 [info] successfully created pulse_api_gateway keyspace
2019/01/21 08:17:21 [info] core migrated up to: 2015-01-12-175310_skeleton
2019/01/21 08:17:22 [info] core migrated up to: 2015-01-12-175310_init_schema
2019/01/21 08:17:22 [info] core migrated up to: 2015-11-23-817313_nodes
2019/01/21 08:17:23 [info] core migrated up to: 2016-02-25-160900_remove_null_consumer_id
2019/01/21 08:17:23 [info] core migrated up to: 2016-02-29-121813_remove_ttls
2019/01/21 08:17:23 [info] core migrated up to: 2016-09-05-212515_retries_step_1
2019/01/21 08:17:24 [info] core migrated up to: 2016-09-05-212515_retries_step_2
2019/01/21 08:17:25 [info] core migrated up to: 2016-09-16-141423_upstreams
2019/01/21 08:17:25 [info] core migrated up to: 2016-12-14-172100_move_ssl_certs_to_core
2019/01/21 08:17:25 [info] core migrated up to: 2016-11-11-151900_new_apis_router_1
2019/01/21 08:17:26 [info] core migrated up to: 2016-11-11-151900_new_apis_router_2
2019/01/21 08:17:27 [info] core migrated up to: 2016-11-11-151900_new_apis_router_3
2019/01/21 08:17:27 [info] core migrated up to: 2017-01-24-132600_upstream_timeouts
Error:
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:34: [cassandra error] Error during migration 2017-01-24-132600_upstream_timeouts_2: [Read failure] Operation failed - received 0 responses and 1 failures (UNKNOWN_COLUMN=[/172.31.87.88])
stack traceback:
        [C]: in function 'assert'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:34: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
        /usr/local/bin/kong:7: in function 'file_gen'
        init_worker_by_lua:48: in function <init_worker_by_lua:46>
        [C]: in function 'xpcall'
        init_worker_by_lua:55: in function <init_worker_by_lua:53>

It does not matter what options I use in kong.conf; it always fails. I have tried different options:
cassandra_consistency: default value, LOCAL_QUORUM, QUORUM, ALL
cassandra_repl_strategy: NetworkTopologyStrategy and SimpleStrategy
cassandra_repl_factor: 1 or 2 with 3 nodes
cassandra_data_centers: OLTP:1 or OLTP:2 with 3 nodes
cassandra_lb_policy: all options

This appears to be a Cassandra schema consensus issue rather than a problem with the Kong migrations themselves.

If I change the migration code to call the wait-for-schema-consensus logic, everything works fine. Is anyone else having issues running against a Cassandra cluster of more than one node, or is this specifically a DSE schema consensus issue? I had originally created a PR to get migrations and Kong working with DSE, but I suspect this is a bigger problem if it affects generic Cassandra clusters too. My changes below were just a test to see whether calling the consensus wait after each up and teardown step worked; it did, but I am not sure that is the best approach.

For reference, the code changes I have made are:

VERSION: 0.15.0 & 1.0.2
FILE: /usr/local/share/lua/5.1/kong/db/init.lua
FUNCTION: function DB:run_migrations(migrations, options)
LOCATION: inside the loop for _, mig in ipairs(t.migrations) do, at the end of the if run_up then block AND at the end of the if run_teardown and strategy_migration.teardown then block
ADD: the same code that sits at the end of the function and waits for consensus (there it runs only once, after the up migrations), since that connector call is exposed and all it does is wait for consensus. If that end-of-function code ever does more than wait, it might be worth exposing the wait_for_consensus function directly. I.e.:

function DB:run_migrations(migrations, options)
  ....
  for _, t in ipairs(migrations) do
    ....
    for _, mig in ipairs(t.migrations) do
      ....
      if run_up then
        ....
        --START
        ok, err = self.connector:post_run_up_migrations()
        if not ok then
          self.connector:close()
          return nil, prefix_err(self, err)
        end
        --END
      end

VERSION: 0.13.1 & 0.14.1
FILE: /usr/local/share/lua/5.1/kong/dao/factory.lua
FUNCTION: local function migrate(self, identifier, migrations_modules, cur_migrations, on_migrate, on_success)
LOCATION: at the end of the loop beginning with for _, migration in ipairs(to_run) do
ADD: the same lines used in the run_migrations function to wait for schema consensus, i.e.:

local function migrate(self, identifier, migrations_modules, cur_migrations, on_migrate, on_success)
  ....
  for _, migration in ipairs(to_run) do
    ....
    --START
    if self.db.name == "cassandra" then
      log("waiting for Cassandra schema consensus (%dms timeout)...",
          self.db.cluster.max_schema_consensus_wait)

      local ok, err = self.db:wait_for_schema_consensus()
      if not ok then
        return ret_error_string(self.db.name, nil,
                                "failed to wait for schema consensus: " .. err)
      end

      log("Cassandra schema consensus: reached")
    end
    --END
  end

#2

I run Apache Cassandra and have not had any such issues: 2 DCs, 3 C* nodes per DC. Have you considered trying the Apache version and seeing if your errors go away? For reference, my relevant Kong settings:

        - name: KONG_CASSANDRA_USERNAME
          value: dbusername
        - name: KONG_CASSANDRA_PASSWORD
          value: dbpassword
        - name: KONG_CASSANDRA_CONSISTENCY
          value: LOCAL_QUORUM
        - name: KONG_CASSANDRA_LB_POLICY
          value: RequestDCAwareRoundRobin
        - name: KONG_CASSANDRA_LOCAL_DATACENTER
          value: DC1
        - name: KONG_DB_UPDATE_FREQUENCY
          value: '5'
        - name: KONG_DB_UPDATE_PROPAGATION
          value: '5'
        - name: KONG_DB_CACHE_TTL
          value: '0'

Try setting long timeouts. Also, try full replication on your C* keyspace; that fixed many issues for me.
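For the full-replication suggestion, this is roughly the kind of cqlsh statement involved (a sketch only: the keyspace name is taken from the thread above, while the credentials, DC name 'OLTP', and factor 3 are assumptions that must match your own topology):

```shell
# Illustrative only: raise the keyspace replication factor to the node
# count of the DC so every node holds every row ("full replication").
cqlsh -u pulse -p '<password>' -e "
  ALTER KEYSPACE pulse_api_gateway
  WITH replication = {'class': 'NetworkTopologyStrategy', 'OLTP': 3};"
```

After altering replication on an existing keyspace, a repair is normally needed so the data is actually redistributed.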


#3

Thanks Jeremy, so it looks like it might be DSE-specific! I have tried long timeouts on db_update_frequency and db_update_propagation, but it makes no difference; I suspect they are only used for cache updates, not migrations.


#4

More info. The place where 0.15.0/1.0.2 fails most often is the oauth2 001_14_to_15 migration, during the teardown step of a bootstrap. In this case the up migration alters the oauth2_credentials table and adds the redirect_uris field; the teardown then defines a function that selects redirect_uris from oauth2_credentials, and that select is where it fails.
So calling the consensus wait after each up and each teardown fixes this, by making sure the schema changes have propagated and the Cassandra client knows about them. Alternatively a sleep works just as well, if a little crude.
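The consensus wait being added here is essentially a poll-until-agreement loop. A minimal, self-contained sketch of that pattern in POSIX sh (names are made up for illustration; check_schema_versions is a stub standing in for a real cluster query, e.g. counting the distinct version hashes reported by nodetool describecluster):

```shell
attempt=0

# Stub: pretend the cluster converges to one schema version on the
# third poll. A real implementation would query the cluster instead.
check_schema_versions() {
  attempt=$((attempt + 1))
  if [ "$attempt" -ge 3 ]; then
    distinct_versions=1
  else
    distinct_versions=2
  fi
}

# Poll until every node reports the same schema version, or give up
# after $1 polls (real code would also sleep between polls).
wait_for_consensus() {
  timeout_polls=$1
  i=0
  while [ "$i" -lt "$timeout_polls" ]; do
    check_schema_versions
    if [ "$distinct_versions" -eq 1 ]; then
      echo "consensus reached"
      return 0
    fi
    i=$((i + 1))
  done
  echo "consensus timeout"
  return 1
}

wait_for_consensus 10   # prints "consensus reached" on the third poll
```

This is exactly what Kong's own wait_for_schema_consensus amounts to; the fix being discussed is only about *when* it is invoked (after every migration step rather than once at the end).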

Another change, in lieu of the above, was to modify the bootstrap function in the ...kong/cmd/utils/migrations.lua file so that it first runs db:run_migrations with run_up = true, followed by db:run_migrations with run_teardown = true, i.e. it does the up and the finish as separate steps rather than as one. This also works, since the consensus wait is called when each stage completes. This is probably a better fix than the one above.

In the older version 0.13.1 the failure occurs in the 2017-01-24-132600_upstream_timeouts_2 migration. It fails during the select * from apis query, because the previous stage, 2017-01-24-132600_upstream_timeouts, has altered the apis table and added the timeout fields. Again, this is a consensus issue. It appears 0.14.1 fixed this particular case by calling wait_for_schema_consensus directly in the upstream_timeouts_2 migration.
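As a manual diagnostic, standard Cassandra/DSE tooling can show whether the cluster has actually reached schema agreement before (re)running migrations; run on any node:

```shell
# If the "Schema versions" section lists more than one version hash,
# the nodes still disagree on the schema, and a migration step that
# reads a just-added column can hit the UNKNOWN_COLUMN read failure.
nodetool describecluster
```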


#5

Issue created at https://github.com/Kong/kong/issues/4229, and a PR has been opened to fix it.