Jan Lindström <jan.lindstrom@mariadb.com> writes:
I have reached to situation where I need help fixing remaining test failures and mentoring is current status good enough for 10.2 Beta. I need help on test failures on parts where I do not follow why test is failing or server crashes I do not know how to fix.
First of all current merge result is on bb-10.2-jan with commit ca4f194460da469e29624d930e6d0027349d047c
Higher level of status is on https://jira.mariadb.org/browse/MDEV-10024
Note that mtr thinks currently that innodb_plugin compiled using static linking is xtradb (but not have_xtradb as xtradb is not compiled). Furthermore galera is not compiled (waiting for codership).
Here are current issues where I need help and possible persons who could help:
(1) Parallel replication lock wait/deadlock handling on InnoDB 5.7 with refactored lock0lock.cc does not work https://jira.mariadb.org/browse/MDEV-10550 (Monty, Kristian)
From a quick look at the patch, this seems to be because of code removed from innobase_kill_query(). Specifically:
@@ -4799,65 +5511,11 @@ innobase_kill_query( DBUG_ENTER("innobase_kill_query"); DBUG_ASSERT(hton == innodb_hton_ptr);
- if (!wsrep_thd_is_BF(trx->mysql_thd, FALSE) && - trx->abort_type == TRX_SERVER_ABORT) { - ut_ad(!lock_mutex_own()); - lock_mutex_enter();
When parallel replication discovers a case where a transaction T1 goes to wait on a row lock held by a later (in the replication stream) transaction T2, T2 will be killed to avoid a deadlock. In this situation, we are already holding lock mutex. So this code is there in 10.1 to prevent re-taking the mutex, code that is removed here in your patch.
+ if (trx != NULL) { + /* Cancel a pending lock request if there are any */ + lock_trx_handle_wait(trx);
+lock_trx_handle_wait( +/*=================*/ + trx_t* trx) /*!< in/out: trx lock state */ { dberr_t err; - ulint heap_no;
- ut_ad(!dict_index_is_clust(index)); - ut_ad(!dict_index_is_online_ddl(index)); - ut_ad(block->frame == page_align(rec)); - ut_ad(page_rec_is_user_rec(rec) || page_rec_is_supremum(rec)); - ut_ad(rec_offs_validate(rec, index, offsets)); - ut_ad(mode == LOCK_X || mode == LOCK_S); + lock_mutex_enter();
But in the new code here, the protection on lock_mutex_enter is no longer there, lock_mutex_enter() is done unconditionally. The code which sets the flag TRX_SERVER_ABORT seems to be still there, though:
+void +lock_report_waiters_to_mysql( +/*=======================*/ + struct thd_wait_reports* waitee_buf_ptr, /*!< in: set of trxs */ + THD* mysql_thd, /*!< in: THD */ + trx_id_t victim_trx_id) /*!< in: Trx selected + as deadlock victim, if + any */ +{ + struct thd_wait_reports* p; + struct thd_wait_reports* q; + ulint i; + + p = waitee_buf_ptr; + while (p) { + i = 0; + while (i < p->used) { + trx_t *w_trx = p->waitees[i]; + /* There is no need to report waits to a trx already + selected as a victim. */ + if (w_trx->id != victim_trx_id) { + /* If thd_report_wait_for() decides to kill the + transaction, then we will get a call back into + innobase_kill_query. We mark this by setting + current_lock_mutex_owner, so we can avoid trying + to recursively take lock_sys->mutex. */ + w_trx->abort_type = TRX_REPLICATION_ABORT; + thd_report_wait_for(mysql_thd, w_trx->mysql_thd); + w_trx->abort_type = TRX_SERVER_ABORT;
abort_type is set to TRX_REPLICATION_ABORT around thd_report_wait_for(), during which innobase_kill_query() may be invoked with lock mutex already held. So it would appear that a first attempt should be to restore the protection around lock_mutex_enter()... Note that, if w_trx->abort_type is not functional in your current patch, the parallel replication code was originally working without this, you can see in the snippet above the comment about current_lock_mutex_owner, which was (mistakenly) not removed when this code was changed to use Galera stuff. So you can check git history and go back to the original code (using current_lock_mutex_owner) if necessary, perhaps. Or maybe w_trx->abort_type is fine, I'm not familiar with that change.
(2) mysqld: /home/jan/mysql/10.2/sql/handler.cc:2692: int handler::ha_index_first(uchar*): Assertion `table_share->tmp_table != NO_TMP_TABLE || m_lock_type != 2' failed. https://jira.mariadb.org/browse/MDEV-10549 (Serg)
(3) Some of the debug sync waits do not work InnoDB 5.7 https://jira.mariadb.org/browse/MDEV-10548 (Monty, Serg)
(4) Test multi_update_innodb fails with InnoDB 5.7 https://jira.mariadb.org/browse/MDEV-10550 (Petrunia)
(5) Test innodb.defrag_mdl-9155 hangs on InnoDB 5.7 https://jira.mariadb.org/browse/MDEV-10551 (Jan, Serg)
(6) IS tables are not found on 10.2 InnoDB 5.7 https://jira.mariadb.org/browse/MDEV-10200 (Jan)
Hope this helps, - Kristian.