Marko Mäkelä <marko.makela@mariadb.com> writes:
On Sun, Aug 6, 2023 at 6:59 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
If so, seems there's a small mistake in the code. The current condition will always be true for each new value of victim, so trx_pos will always be set != 0. The first condition should probably have been "if (trx == victim)" ?
As far as I remember, the original reason for the weight calculation was that the deadlock resolution would roll back the transaction that has done the least amount of work, expressed as the "weight" of the transaction. The weight is computed based on whether thd_has_edited_nontrans_tables() holds (the transaction has modified any non-transactional tables, such as MyISAM tables), on the number of undo log records written so far (trx_t::undo_no), and the number of InnoDB locks held by the transaction.
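A rough sketch of that weight computation, to make the comparison concrete (a simplified illustration, not the exact InnoDB source; the function name and the exact bit used for the non-transactional flag are assumptions):

    #include <cstdint>
    #include <cstddef>

    // Simplified illustration of the deadlock "weight": more undo log records
    // or more locks held means more work done, hence a heavier transaction.
    // The deadlock victim is the transaction with the smallest weight.  A
    // transaction that modified non-transactional tables gets a high-order
    // bit set so that it is effectively never chosen.
    static uint64_t example_trx_weight(uint64_t undo_no, size_t n_locks,
                                       bool edited_nontrans_tables)
    {
      uint64_t weight= undo_no + n_locks;
      if (edited_nontrans_tables)
        weight|= 1ULL << 63;
      return weight;
    }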
Yes. Consider this P-shaped path, with trx=T1 and cycle=T3:

    T1 -> T2 -> T3 -> T2

    T1  undo_no=100  lock.trx_locks length= 4  TRX_WEIGHT(T1)=104
    T2  undo_no=101  lock.trx_locks length=10  TRX_WEIGHT(T2)=111
    T3  undo_no=102  lock.trx_locks length= 2  TRX_WEIGHT(T3)=104

In the first loop iteration:
 - next=T2
 - 111 < ~0ULL, so we assign victim=T2 and victim_weight=111
 - next==victim, so we assign trx_pos=1 (this is wrong, I think).

In the second iteration:
 - next=T3
 - 104 < 111, so we assign victim=T3 and victim_weight=104
 - next==victim, so we assign trx_pos=2 (wrong)
 - next==cycle, so we end the loop.

At the end of the loop, trx_pos != 0 and trx_weight==victim_weight (==104), so
we assign victim=T1. The result is that we select T1 as the deadlock victim,
even though this does not break the deadlock.

I think it's just a small mistake in this code:

    if (next == victim)
      trx_pos= l;

which should have been this:

    if (next == trx)
      trx_pos= l;

My question is whether, instead of fixing it, we should just remove it along
with this code:

    if (trx_pos && trx_weight == victim_weight)
    {
      victim= trx;
      victim_pos= trx_pos;

I'm thinking it will be rare to have the same value of
(undo_no + LENGTH(lock.trx_locks)) for two transactions, which is the only way
this code can trigger. But I'm not familiar with the original reason for this
piece of the code, that's why I asked.
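To make the walkthrough concrete, here is a minimal, self-contained simulation of the selection loop as described above (the struct and field names only loosely mirror the InnoDB code; this is a sketch, not the actual lock0lock.cc source):

    #include <cstdint>
    #include <cstdio>

    struct trx_t
    {
      const char *name;
      uint64_t weight;   // stands in for TRX_WEIGHT(): undo_no + lock count
      trx_t *wait_trx;   // the transaction this one is waiting for
    };

    int main()
    {
      trx_t T3{"T3", 104, nullptr}, T2{"T2", 111, &T3}, T1{"T1", 104, &T2};
      T3.wait_trx= &T2;                  // the cycle: T2 -> T3 -> T2
      trx_t *trx= &T1, *cycle= &T3;      // the P-shaped path: T1 -> T2 -> T3 -> T2

      trx_t *victim= nullptr;
      uint64_t victim_weight= ~0ULL;
      unsigned victim_pos= 0, trx_pos= 0, l= 0;
      for (trx_t *next= cycle;;)
      {
        next= next->wait_trx;
        l++;
        if (next->weight < victim_weight)
        {
          victim_weight= next->weight;
          victim= next;
          victim_pos= l;
        }
        if (next == victim)              // the suspected bug: should be next == trx?
          trx_pos= l;
        if (next == cycle)
          break;
      }
      if (trx_pos && trx->weight == victim_weight)
      {
        victim= trx;                     // T1 is picked even though it is outside the cycle
        victim_pos= trx_pos;
      }
      std::printf("victim=%s pos=%u\n", victim->name, victim_pos);  // prints: victim=T1 pos=2
      return 0;
    }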
would not be broken. I thought about coming up with a test case to show this, but it looks like it's not easy to get the same TRX_WEIGHT for trx and victim. If so, maybe this part of the code can just be removed, or what do you think?
Can you check it once again?
My problem is that I'm not sufficiently familiar with InnoDB internals to know how to arrange the values of lock.trx_locks and undo_no, in particular, to make the condition (trx_weight == victim_weight) be true in a situation where we have a P-shaped path. It would be fun to try, and I might try it, but it will take some time.
In your current 10.6 patch, we may have victim==nullptr for a large portion of the loop, and therefore, we might choose something else than the actual preferred victim.
No, victim_weight is initialized to ~0ULL, so the condition (next_weight < victim_weight) will be true and victim will be initialized =next (and =cycle) on the first iteration. Thus victim!=nullptr on every iteration except the first.
I think that thd_deadlock_victim_preference() needs to be invoked on an array of THD* and return a pointer to the victim THD, or nullptr if there is no preference.
I do not think that is necessary. The fundamental issue is in-order parallel replication, where we have a series of transactions T1, T2, T3, ... that we know a priori _must_ commit in this order. In case of a deadlock between them, we must avoid T1 being chosen as a deadlock victim and immediately causing the same deadlock again, repeatedly. This is the bug seen here. But as long as we prevent T1 being chosen as the victim, replication will be able to proceed. Choosing T3 is better, but T2 also works. Thus, we do not need InnoDB to choose the _best_ deadlock victim. We just need to avoid it choosing the _worst_ victim, and the thd_deadlock_victim_preference() facility should ensure that.

That said, the current thd_deadlock_victim_preference() mechanism is not ideal. The dependencies involved are split between InnoDB and the replication layer; neither can solve it fully on their own. Similar problems must exist e.g. for deadlocks involving cross-engine transactions. Maybe a full solution requires server-wide deadlock detection.

(For in-order parallel replication itself, the deadlock detection problem is much simpler, because we _know_ that any T_(i+1) will eventually have to wait for T_i. So a deadlock will occur if and only if there is some T_j that waits for T_k, with j < k. But this is not true when considering transactions other than the in-order sequence within one replication domain_id.)

The thd_deadlock_victim_preference() is really just a comparison of thd->rgi_slave->gtid_sub_id. Maybe it would be simpler and more efficient to expose that to InnoDB, and simply use the gtid_sub_id as the TRX_WEIGHT, if present (with a few bit operations to make the comparison work out, similar to the (1ULL << 63) in the current code)? It exposes internal replication detail to InnoDB, but sometimes that is necessary to get the simplest and most efficient code...
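If it helps, here is one possible shape of that idea (a sketch only: the function name, the bit layout, and the policy of where replication weights sit relative to ordinary ones are my own assumptions, not existing MariaDB code):

    #include <cstdint>

    // Hypothetical: derive the deadlock weight from gtid_sub_id for in-order
    // parallel replication workers, so that an earlier transaction (smaller
    // sub_id) always compares as heavier and is therefore never preferred as
    // the victim over a later transaction in the same commit order.
    static uint64_t deadlock_weight(uint64_t trx_weight,    // undo_no + lock count
                                    uint64_t gtid_sub_id)   // 0 if not a replication worker
    {
      if (gtid_sub_id == 0)
        return trx_weight;                 // ordinary transaction: unchanged
      // Invert the sub_id so that a smaller sub_id yields a larger weight;
      // the high bit keeps these weights separate from ordinary TRX_WEIGHTs.
      return (1ULL << 62) | (~gtid_sub_id & ((1ULL << 62) - 1));
    }

Whether replication workers should also outweigh all ordinary transactions (the high bit here) is a separate policy question; the bit is only there to keep the two kinds of weights from colliding.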
I think for parallel replication, we have to always do the lock wait reporting to avoid this. Since we are anyway going to do a lock wait, an expensive operation, I think it is reasonable to accept the small overhead of a lock wait report to the server layer.
Thank you for the explanation. Can you please also post it to https://jira.mariadb.org/browse/MDEV-24948 for future reference?
Ok, will do.
It would also be useful if you could check the following changes in MySQL 8.0 that are linked from MDEV-24948: https://github.com/mysql/mysql-server/commit/30ead1f6966128cbcd32c7b6029ea21... https://github.com/mysql/mysql-server/commit/3859219875b62154b921e8c6078c751...
Ok, I will check it.
I pushed an initial merge of this to 10.6, but I'll refine it once I fully understand the code around trx_pos.
Do you have an idea how to stress test this at a larger scale than is feasible within the regression test suite?
We can set --slave-transaction-retries=1 and run optimistic parallel replication on a workload with many row lock conflicts. If a transaction gets a temporary error (like a deadlock), it will roll back and wait for all prior transactions to commit before re-trying. Thus, on retry, all other transactions will be preferred as deadlock victims. If more than one retry is required, it indicates a bug in the deadlock victim selection.

The test rpl_parallel_deadlock_victim.test is kind of a stress test, but using error injection to check against the wrong victim being selected.

 - Kristian.
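For reference, a sketch of replica settings for such a stress run (only --slave-transaction-retries=1 is from the description above; the parallel mode and worker count are illustrative assumptions):

    [mysqld]
    slave-transaction-retries = 1      # a second deadlock on retry indicates a bug
    slave-parallel-mode       = optimistic
    slave-parallel-threads    = 16     # any worker count > 1 that creates enough conflicts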