Re: [Maria-developers] Status of InnoDB 5.7 merge to 10.2

12 Aug 2016


Jan Lindström <jan.lindstrom@mariadb.com> writes:
...
I have reached a situation where I need help fixing the remaining test
failures, and guidance on whether the current status is good enough for 10.2
Beta. I need help with test failures where I do not follow why the test is
failing, and with server crashes that I do not know how to fix.
First of all, the current merge result is on bb-10.2-jan at
commit ca4f194460da469e29624d930e6d0027349d047c
A higher-level status is at https://jira.mariadb.org/browse/MDEV-10024
Note that mtr currently thinks that innodb_plugin compiled with static
linking is xtradb (but not have_xtradb, as xtradb is not compiled).
Furthermore, galera is not compiled (waiting for codership).
Here are current issues where I need help and possible persons who could
help:
(1) Parallel replication lock wait/deadlock handling on InnoDB 5.7 with
refactored lock0lock.cc does not work
https://jira.mariadb.org/browse/MDEV-10550 (Monty, Kristian)
...
From a quick look at the patch, this seems to be because of code removed
from innobase_kill_query(). Specifically:
...
@@ -4799,65 +5511,11 @@ innobase_kill_query(
  DBUG_ENTER("innobase_kill_query");
  DBUG_ASSERT(hton == innodb_hton_ptr);
...
-		if (!wsrep_thd_is_BF(trx->mysql_thd, FALSE) &&
-		    trx->abort_type == TRX_SERVER_ABORT) {
-			ut_ad(!lock_mutex_own());
-			lock_mutex_enter();
When parallel replication discovers that a transaction T1 is about to wait
on a row lock held by a later (in the replication stream) transaction T2, T2
is killed to avoid a deadlock. In this situation, we are already holding the
lock mutex. The check above is there in 10.1 to prevent re-taking the mutex,
and it is removed in your patch.
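
To make the re-entrancy problem concrete, here is a minimal stand-alone C++
sketch of what the 10.1 guard does. It is not the actual InnoDB code: the
names trx_model, lock_mutex_model and kill_query_model are made up for
illustration, the wsrep_thd_is_BF() part of the condition is left out, and
the `held` flag is only meaningful in a single-threaded model.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins for the names in the diff; not the real definitions.
enum abort_type_t { TRX_SERVER_ABORT, TRX_REPLICATION_ABORT };

struct trx_model {
    abort_type_t abort_type = TRX_SERVER_ABORT;
    bool         lock_wait_cancelled = false;
};

// Models a non-recursive lock mutex: the owner must not enter() it again.
struct lock_mutex_model {
    std::mutex m;
    bool       held = false;  // single-threaded bookkeeping for this sketch only

    void enter() { m.lock(); held = true; }
    void exit()  { held = false; m.unlock(); }
};

// Models the guarded path of the 10.1 innobase_kill_query(): take the mutex
// only when the kill did not come from lock code that already holds it.
void kill_query_model(lock_mutex_model& lm, trx_model& trx)
{
    const bool take_mutex = (trx.abort_type == TRX_SERVER_ABORT);

    if (take_mutex) {
        assert(!lm.held);     // same idea as ut_ad(!lock_mutex_own())
        lm.enter();
    }

    assert(lm.held);          // either we took it, or the caller already had it
    trx.lock_wait_cancelled = true;  // stands in for cancelling the lock wait

    if (take_mutex) {
        lm.exit();
    }
}
```

With abort_type left at TRX_SERVER_ABORT the function takes and releases the
mutex itself. With TRX_REPLICATION_ABORT it assumes the caller holds the
mutex and leaves it held on return.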
...
+	if (trx != NULL) {
+		/* Cancel a pending lock request if there are any */
+		lock_trx_handle_wait(trx);
...
+lock_trx_handle_wait(
+/*=================*/
+	trx_t*	trx)	/*!< in/out: trx lock state */
 {
  dberr_t	err;
-	ulint	heap_no;
-	ut_ad(!dict_index_is_clust(index));
-	ut_ad(!dict_index_is_online_ddl(index));
-	ut_ad(block->frame == page_align(rec));
-	ut_ad(page_rec_is_user_rec(rec) || page_rec_is_supremum(rec));
-	ut_ad(rec_offs_validate(rec, index, offsets));
-	ut_ad(mode == LOCK_X || mode == LOCK_S);
+	lock_mutex_enter();
But in the new code here, that protection around lock_mutex_enter() is no
longer there; lock_mutex_enter() is called unconditionally.

The code which sets the flag TRX_SERVER_ABORT seems to be still there,
though:
...
+void
+lock_report_waiters_to_mysql(
+/*=======================*/
+	struct thd_wait_reports*	waitee_buf_ptr,	/*!< in: set of trxs */
+	THD*				mysql_thd,	/*!< in: THD */
+	trx_id_t			victim_trx_id)	/*!< in: Trx selected
+							as deadlock victim, if
+							any */
+{
+	struct thd_wait_reports*	p;
+	struct thd_wait_reports*	q;
+	ulint				i;
+
+	p = waitee_buf_ptr;
+	while (p) {
+		i = 0;
+		while (i < p->used) {
+			trx_t *w_trx = p->waitees[i];
+			/*  There is no need to report waits to a trx already
+			selected as a victim. */
+			if (w_trx->id != victim_trx_id) {
+				/* If thd_report_wait_for() decides to kill the
+				transaction, then we will get a call back into
+				innobase_kill_query. We mark this by setting
+				current_lock_mutex_owner, so we can avoid trying
+				to recursively take lock_sys->mutex. */
+				w_trx->abort_type = TRX_REPLICATION_ABORT;
+				thd_report_wait_for(mysql_thd, w_trx->mysql_thd);
+				w_trx->abort_type = TRX_SERVER_ABORT;
abort_type is set to TRX_REPLICATION_ABORT around thd_report_wait_for(),
during which innobase_kill_query() may be invoked with the lock mutex already
held.
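
As a rough illustration of that flow, here is a small stand-alone C++ model
of the flag being toggled around the callback. The names report_waiter_model
and kill_sees_replication_abort are hypothetical, and the std::function
parameter only stands in for thd_report_wait_for(), which may call back into
the kill path before it returns.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for the names in the diff; not the real definitions.
enum abort_type_t { TRX_SERVER_ABORT, TRX_REPLICATION_ABORT };

struct trx_model {
    abort_type_t abort_type = TRX_SERVER_ABORT;
};

// Models the loop body of lock_report_waiters_to_mysql(). The caller is
// assumed to hold the lock mutex for the whole call.
void report_waiter_model(trx_model& w_trx,
                         const std::function<void(trx_model&)>& report_wait)
{
    assert(w_trx.abort_type == TRX_SERVER_ABORT);

    // Mark the trx so that a kill arriving during the callback knows the
    // lock mutex is already held and must not be taken again.
    w_trx.abort_type = TRX_REPLICATION_ABORT;
    report_wait(w_trx);
    w_trx.abort_type = TRX_SERVER_ABORT;
}

// Usage: a callback that "kills" immediately observes the marker, and the
// marker is reset once the report returns.
bool kill_sees_replication_abort()
{
    trx_model w_trx;
    bool      seen = false;

    report_waiter_model(w_trx, [&seen](trx_model& t) {
        seen = (t.abort_type == TRX_REPLICATION_ABORT);
    });

    return seen && w_trx.abort_type == TRX_SERVER_ABORT;
}
```

The point of the model is only that the kill path can tell the two cases
apart by looking at abort_type, which is what the removed check relied on.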

So it would appear that a first attempt should be to restore the protection
around lock_mutex_enter()...

Note that if w_trx->abort_type is not functional in your current patch, the
parallel replication code originally worked without it. You can see this in
the comment about current_lock_mutex_owner in the snippet above, which was
(mistakenly) not removed when this code was changed to use Galera stuff. So
if necessary, you can check the git history and go back to the original code
(using current_lock_mutex_owner). Or maybe w_trx->abort_type is fine; I'm
not familiar with that change.
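
For reference, the owner-tracking idea can be sketched like this. This is
only my illustration of the general technique in stand-alone C++, not the
original current_lock_mutex_owner code from the git history. The class
owned_mutex_model and the function handle_wait_model are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical model: a non-recursive mutex that also records which thread
// currently holds it.
class owned_mutex_model {
public:
    void enter()
    {
        m_.lock();
        owner_.store(std::this_thread::get_id());
    }

    void exit()
    {
        owner_.store(std::thread::id());  // default id means "no owner"
        m_.unlock();
    }

    // True only if the calling thread is the one that entered the mutex.
    bool owned_by_me() const
    {
        return owner_.load() == std::this_thread::get_id();
    }

private:
    std::mutex                   m_;
    std::atomic<std::thread::id> owner_{std::thread::id()};
};

// Models a kill/handle-wait path that may be reached with or without the
// mutex held. Returns true if it had to take the mutex itself.
bool handle_wait_model(owned_mutex_model& lm)
{
    const bool take_mutex = !lm.owned_by_me();

    if (take_mutex) {
        lm.enter();
    }

    assert(lm.owned_by_me());
    // ... the pending lock request would be cancelled here ...

    if (take_mutex) {
        lm.exit();
    }
    return take_mutex;
}
```

The difference from the abort_type approach is that the decision is based on
who holds the mutex rather than on a flag stored in the trx.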
...
(2) mysqld: /home/jan/mysql/10.2/sql/handler.cc:2692: int
handler::ha_index_first(uchar*): Assertion `table_share->tmp_table !=
NO_TMP_TABLE || m_lock_type != 2' failed.
https://jira.mariadb.org/browse/MDEV-10549 (Serg)
(3) Some of the debug sync waits do not work on InnoDB 5.7
https://jira.mariadb.org/browse/MDEV-10548 (Monty, Serg)
(4) Test multi_update_innodb fails with InnoDB 5.7
https://jira.mariadb.org/browse/MDEV-10550 (Petrunia)
(5) Test innodb.defrag_mdl-9155 hangs on InnoDB 5.7
https://jira.mariadb.org/browse/MDEV-10551 (Jan, Serg)
(6) IS tables are not found on 10.2 InnoDB 5.7
https://jira.mariadb.org/browse/MDEV-10200 (Jan)
Hope this helps,

 - Kristian.

    

Kristian Nielsen