Re: [Maria-developers] Status of InnoDB 5.7 merge to 10.2
Jan Lindström
I have reached to situation where I need help fixing remaining test failures and mentoring is current status good enough for 10.2 Beta. I need help on test failures on parts where I do not follow why test is failing or server crashes I do not know how to fix.
First of all current merge result is on bb-10.2-jan with commit ca4f194460da469e29624d930e6d0027349d047c
Higher level of status is on https://jira.mariadb.org/browse/MDEV-10024
Note that mtr thinks currently that innodb_plugin compiled using static linking is xtradb (but not have_xtradb as xtradb is not compiled). Furthermore galera is not compiled (waiting for codership).
Here are current issues where I need help and possible persons who could help:
(1) Parallel replication lock wait/deadlock handling on InnoDB 5.7 with refactored lock0lock.cc does not work https://jira.mariadb.org/browse/MDEV-10550 (Monty, Kristian)
From a quick look at the patch, this seems to be because of code removed from innobase_kill_query(). Specifically:
@@ -4799,65 +5511,11 @@ innobase_kill_query( DBUG_ENTER("innobase_kill_query"); DBUG_ASSERT(hton == innodb_hton_ptr);
- if (!wsrep_thd_is_BF(trx->mysql_thd, FALSE) && - trx->abort_type == TRX_SERVER_ABORT) { - ut_ad(!lock_mutex_own()); - lock_mutex_enter();
When parallel replication discovers a case where a transaction T1 goes to wait on a row lock held by a later (in the replication stream) transaction T2, T2 will be killed to avoid a deadlock. In this situation, we are already holding lock mutex. So this code is there in 10.1 to prevent re-taking the mutex, code that is removed here in your patch.
+ if (trx != NULL) { + /* Cancel a pending lock request if there are any */ + lock_trx_handle_wait(trx);
+lock_trx_handle_wait( +/*=================*/ + trx_t* trx) /*!< in/out: trx lock state */ { dberr_t err; - ulint heap_no;
- ut_ad(!dict_index_is_clust(index)); - ut_ad(!dict_index_is_online_ddl(index)); - ut_ad(block->frame == page_align(rec)); - ut_ad(page_rec_is_user_rec(rec) || page_rec_is_supremum(rec)); - ut_ad(rec_offs_validate(rec, index, offsets)); - ut_ad(mode == LOCK_X || mode == LOCK_S); + lock_mutex_enter();
But in the new code here, the protection on lock_mutex_enter is no longer there, lock_mutex_enter() is done unconditionally. The code which sets the flag TRX_SERVER_ABORT seems to be still there, though:
+void +lock_report_waiters_to_mysql( +/*=======================*/ + struct thd_wait_reports* waitee_buf_ptr, /*!< in: set of trxs */ + THD* mysql_thd, /*!< in: THD */ + trx_id_t victim_trx_id) /*!< in: Trx selected + as deadlock victim, if + any */ +{ + struct thd_wait_reports* p; + struct thd_wait_reports* q; + ulint i; + + p = waitee_buf_ptr; + while (p) { + i = 0; + while (i < p->used) { + trx_t *w_trx = p->waitees[i]; + /* There is no need to report waits to a trx already + selected as a victim. */ + if (w_trx->id != victim_trx_id) { + /* If thd_report_wait_for() decides to kill the + transaction, then we will get a call back into + innobase_kill_query. We mark this by setting + current_lock_mutex_owner, so we can avoid trying + to recursively take lock_sys->mutex. */ + w_trx->abort_type = TRX_REPLICATION_ABORT; + thd_report_wait_for(mysql_thd, w_trx->mysql_thd); + w_trx->abort_type = TRX_SERVER_ABORT;
abort_type is set to TRX_REPLICATION_ABORT around thd_report_wait_for(), during which innobase_kill_query() may be invoked with lock mutex already held. So it would appear that a first attempt should be to restore the protection around lock_mutex_enter()... Note that, if w_trx->abort_type is not functional in your current patch, the parallel replication code was originally working without this, you can see in the snippet above the comment about current_lock_mutex_owner, which was (mistakenly) not removed when this code was changed to use Galera stuff. So you can check git history and go back to the original code (using current_lock_mutex_owner) if necessary, perhaps. Or maybe w_trx->abort_type is fine, I'm not familiar with that change.
(2) mysqld: /home/jan/mysql/10.2/sql/handler.cc:2692: int handler::ha_index_first(uchar*): Assertion `table_share->tmp_table != NO_TMP_TABLE || m_lock_type != 2' failed. https://jira.mariadb.org/browse/MDEV-10549 (Serg)
(3) Some of the debug sync waits do not work InnoDB 5.7 https://jira.mariadb.org/browse/MDEV-10548 (Monty, Serg)
(4) Test multi_update_innodb fails with InnoDB 5.7 https://jira.mariadb.org/browse/MDEV-10550 (Petrunia)
(5) Test innodb.defrag_mdl-9155 hangs on InnoDB 5.7 https://jira.mariadb.org/browse/MDEV-10551 (Jan, Serg)
(6) IS tables are not found on 10.2 InnoDB 5.7 https://jira.mariadb.org/browse/MDEV-10200 (Jan)
Hope this helps, - Kristian.
Hi,
On Fri, Aug 12, 2016 at 12:26 PM, Kristian Nielsen wrote: From a quick look at the patch, this seems to be because of code removed
from innobase_kill_query(). Specifically: @@ -4799,65 +5511,11 @@ innobase_kill_query(
DBUG_ENTER("innobase_kill_query");
DBUG_ASSERT(hton == innodb_hton_ptr); - if (!wsrep_thd_is_BF(trx->mysql_thd, FALSE) &&
- trx->abort_type == TRX_SERVER_ABORT) {
- ut_ad(!lock_mutex_own());
- lock_mutex_enter(); Thanks for pointing this to me, I did have this code there some time ago
but it also did not work
out-of-the-box as underlying locking was changed. But now that I had fresh
view on this I managed to solve mutexing problems (not exactly with above
but kindof). However, this is not yet enough to get all rpl tests pass.
There is still problem with two tests i.e. rpl_parallel and
rpl_parallel_optimistic see: https://jira.mariadb.org/browse/MDEV-10550 in
both cases slave refuses to sync with master.
R: Jan
participants (2)
-
Jan Lindström
-
Kristian Nielsen