Sergei Golubchik <vuvova@gmail.com> writes:
This is what I've written in the patch:
I've tried to answer all points in detail, also to help myself better understand all the issues:
+  The storage engine can report such conflicting locks using this call. This
+  will allow parallel replication to detect such conflicts and resolve the
+  deadlock (by killing the second transaction to release the locks that the
+  first is waiting for, and then later re-try the second killed transaction).
What will happen if the engine doesn't? Deadlocks will time out. Which is annoying, but it only affects the performance, not the correctness.
No, unfortunately that is not so. Let's say we have T1 and T2. T1 goes to wait on a row lock held by T2. Then T2 completes its prepare phase, it goes to the group commit queue and waits for T1 to complete and commit for both of them. T1 gets lock wait timeout, so T1 will roll back and try again. But T1 will just go to wait on the same row lock again that is held by T2. So T1 will retry over and over until slave_transaction_retries is exceeded, at which point replication will abort with an error. There is no lock wait timeout when T2 waits in the group commit queue.
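The retry livelock described above can be made concrete with a toy, single-threaded model (the constant name and the function are illustrative, not server code): T2 never releases the row lock, because its own wait in the group commit queue has no timeout, so each of T1's retries hits the same lock until the retry budget is exhausted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the slave_transaction_retries server variable. */
#define SLAVE_TRANSACTION_RETRIES 10

static int run_t1_with_retries(void)
{
    /* T2 holds the row lock and waits in the group commit queue for T1;
       it never releases the lock, and that wait has no timeout. */
    bool t2_holds_row_lock = true;
    int retries = 0;

    while (retries < SLAVE_TRANSACTION_RETRIES) {
        if (!t2_holds_row_lock)
            return 0;          /* T1 would acquire the lock and commit */
        retries++;             /* lock wait timeout: roll back and retry */
    }
    return 1;                  /* retries exhausted: replication aborts */
}
```

The point of the model is that no number of retries helps: the condition T1 is waiting for can only be satisfied after T1 itself commits.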
But can it cause the first transaction to be killed, not the second? Then it's worse - it'll violate the strict commit ordering.
No, this can never happen. Commit order is enforced by the wait_for_commit facility. T2 will not be allowed to commit before T1. If T1 is killed and retried, T2 will continue waiting for T1 to commit the second time. And if T1 fails, it will call wakeup_subsequent_commits() with an error, so that T2 can abort its commit.
And, again, the engine needs to do it for every lock wait. Even while most of them won't cause any deadlocks.
Right. This is only needed for parallel replication worker threads. That is why I introduced thd_need_wait_for(). The storage engine can call this once for a transaction or even for a THD, and save the result. And then it won't need to call thd_report_wait_for() except for the parallel replication worker threads. And for those, lock waits should be rare.
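This caching pattern can be sketched as follows. Everything here is a hypothetical engine-side stub (thd_stub, engine_trx, and the two API stubs stand in for the server's THD and the proposed calls); the shape of the idea is: query thd_need_wait_for() once at transaction start, then consult only the cached flag on every lock wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the server THD and the proposed API calls. */
typedef struct { bool is_parallel_repl_worker; } thd_stub;

static int waits_reported = 0;

static int thd_need_wait_for(const thd_stub *thd)
{ return thd->is_parallel_repl_worker; }

static void thd_report_wait_for(const thd_stub *thd, thd_stub *other_thd)
{ (void)thd; (void)other_thd; waits_reported++; }

/* The engine's transaction caches the answer once at start... */
typedef struct {
    thd_stub *thd;
    bool report_waits;          /* cached thd_need_wait_for() result */
} engine_trx;

static void engine_trx_start(engine_trx *trx, thd_stub *thd)
{
    trx->thd = thd;
    trx->report_waits = thd_need_wait_for(thd) != 0;
}

/* ...so ordinary connections skip the report call on every lock wait. */
static void engine_row_lock_wait(engine_trx *waiter, engine_trx *holder)
{
    if (waiter->report_waits)
        thd_report_wait_for(waiter->thd, holder->thd);
    /* ...enqueue the actual row lock wait here... */
}
```

Only transactions belonging to parallel replication workers ever reach thd_report_wait_for(); for everything else the cost is one cached boolean test per lock wait.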
+  The storage engine should not report false positives. That is, it should not
+  report any lock waits that do not actually require one transaction to wait
+  for the other. Nor should it report waits for locks that will be released
+  before the commit of the other transaction.
What will happen if the engine reports false positives? Lock waits that don't actually require one transaction to wait for another? What if it reports waits for locks that will be released before the commit of the other transaction?
If it reports that T1 will wait for T2, then we will kill T2 and retry it. This will cause unnecessary work if there is no actual deadlock, so mainly a performance issue.

A deeper problem is caused by the async kill processing. If T1 is reported as waiting for T2, we will kill T2 in the background thread. But if there is no such wait, or if the wait is resolved due to T2 releasing locks before commit, then T1 and T2 could both complete, and their worker threads move on to new transactions. Then the background thread will be trying to kill a T2 which no longer exists, which will access incorrect rpl_group_info objects. This is the part of the patch I am most unhappy about :-/

But we started to discuss on IRC how to possibly avoid the async kill, instead making it possible to call THD::awake() directly inside thd_report_wait_for(). This would be good, as it would solve the deeper problem. In that case, a false positive would cause only performance issues, I think, though I should think the problem through carefully to be sure.
+void thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
You kill the second thread in the background. What if you convey the ordering to the engine and let it know about a deadlock? Like, if returning 1 from thd_report_wait_for() means a deadlock, then the engine will kill the transaction just normally, as for any normal deadlock case.
The problem is that T2 could be executing anywhere in the server, including server code or another storage engine, where InnoDB does not really have any possibility to kill it. This is different from a normal InnoDB deadlock, which only happens if both T1 and T2 are doing an actual row lock wait inside the storage engine itself. But in this case, only T1 is waiting, T2 is continuing whatever processing it is doing, eventually it will end up in the group commit queue in binlog or in wait_for_prior_commit(), and wait for T1 to commit first.
+int thd_need_wait_for(const MYSQL_THD thd);
I'd suggest to remove it. If you think it's useful, you can call it from your thd_report_wait_for() and spare storage engine the knowledge of yet another API function.
See discussion above. The intention is to avoid lots of calls to thd_report_wait_for() from the storage engine for non-replication threads. So calling it from inside thd_report_wait_for() makes no sense. I don't understand the benefit of reducing the functionality available in the API when that functionality is completely optional?

But on the other hand, I do not feel strongly about this function. I do not even use it in the patch, as InnoDB does a rather expensive graph traversal of all waiters anyway in its deadlock detector, so it can just be omitted. I included this function because without it, there seems to be no way for a storage engine to avoid introducing O(N**2) overhead for N transactions contending for a row lock. That seemed unacceptable to me.
+int thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd);
Hmm, really? Couldn't you simply disable gap locks in some other already existing way? Like decrease the transaction isolation level or tell the engine that it's RBR?
No, that is not safe. Suppose we have two transactions:

  T1: INSERT INTO t1 VALUES (1);
  T2: INSERT INTO t2 SELECT * FROM t1;

T1 and T2 are not run in parallel, because they cannot group commit together on the master. However, on the slave, as soon as T1 _starts_ to commit, T2 is allowed to start. But T2 cannot see the changes done by T1 until T1 _completes_ its commit. Thus it is critical that gap locks are disabled _only_ among transactions within the same group commit, not between different group commits.

In fact there is a bug in the patch you got regarding this, as it did not check the commit_id before relaxing gap locks. I currently have the following extra patch in my tree to fix this:

=== modified file 'sql/sql_class.cc'
--- sql/sql_class.cc	2014-06-10 08:13:15 +0000
+++ sql/sql_class.cc	2014-06-27 11:47:53 +0000
@@ -4277,6 +4277,8 @@ thd_need_ordering_with(const MYSQL_THD t
     return 1;
   if (rgi->current_gtid.domain_id != other_rgi->current_gtid.domain_id)
     return 1;
+  if (rgi->commit_id != other_rgi->commit_id)
+    return 1;
   /*
     These two threads are doing parallel replication within the same
     replication domain. Their commit order is already fixed, so we do not need

On the other hand, if we were to omit this part of the patch completely and just keep the gap locks always, then things would still work, because thd_report_wait_for() would detect a deadlock and do a retry to resolve it. But we would get extra rollbacks. Maybe such extra rollbacks would be rare, though.
+int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2); +
Okay, I see why this one is needed.
On the other hand... It seems like it's the only function that you realy need. Because
thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
can be replaced with
if (thd_deadlock_victim_preference(thd, other_thd) < 0) goto deadlock;
And
thd_need_wait_for(const MYSQL_THD thd);
can be removed (replaced with noop). And
thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd);
can be replaced with
thd_deadlock_victim_preference(thd, other_thd) != 0
I don't understand the purpose of using the same function for these? They do completely different things, so using a single function will only cause confusion, it seems to me:

1. thd_report_wait_for() is used to allow the _replication code_ to detect a deadlock related to ordering.
2. thd_deadlock_victim_preference() is only used when the _storage engine_ detects a deadlock, replication related or not.
3. thd_need_ordering_with() is used by the storage engine to decide on lock wait semantics.
Although in that case thd_deadlock_victim_preference() had better be renamed, because it won't be a generic "victim preference", but specifically a victim preference that depends on the fixed transaction commit ordering. A better name could be, for example,
thd_get_commit_order(thd, other_thd)
  -1 - thd first
   0 - unspecified
  +1 - other_thd first
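The proposed convention could look roughly like this sketch. The order_info struct and the sub_id field are hypothetical stand-ins for however the replication code numbers transactions within a domain's fixed commit order; the real function would take two MYSQL_THD arguments.

```c
#include <assert.h>

/* Hypothetical per-THD ordering info for the sketch. */
typedef struct {
    int in_parallel_repl;   /* transaction runs in a parallel repl worker */
    unsigned long sub_id;   /* position in the fixed commit order */
} order_info;

static int thd_get_commit_order(const order_info *thd,
                                const order_info *other_thd)
{
    if (!thd->in_parallel_repl || !other_thd->in_parallel_repl)
        return 0;                       /*  0: order unspecified */
    if (thd->sub_id < other_thd->sub_id)
        return -1;                      /* -1: thd commits first */
    if (thd->sub_id > other_thd->sub_id)
        return 1;                       /* +1: other_thd commits first */
    return 0;
}
```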
How can the storage engine know the difference between T1 and T2 being in the same group commit, or in different group commits? How can the storage engine inquire about need for gap locks, without accidentally reporting a lock wait to the upper layer?
I think it'll be much simpler with only one function.
Why? I think it's quite the opposite...
And also it becomes clear what happens if the engine doesn't use it. This function is used to enquire about the commit order; if it is not used, the commit order cannot be guaranteed.
No, that is not the case. On the contrary, the storage engine is free not to use this; commit order will be guaranteed in any case by wait_for_commit().

Hm. So my idea was to make the new API simple and generic: the upper layer handles most of the detail about commit ordering or whatever, and the storage engine only needs to think about simple stuff like lock waits or deadlock victims, both of which it likely already needs to deal with anyway. But if I understand correctly, another idea could be to make the new API specific to this particular problem of enforced commit order in parallel replication. So the storage engine would be able to know of such ordering, and should then handle any necessary actions on this by itself?
And you won't need background killing for slave threads
The need for background killing comes because it does not work to call THD::awake() from the context where the InnoDB deadlock detector calls thd_report_wait_for(). The problem is that innobase_kill_query() does a recursive mutex lock on lock_sys->mutex.

We discussed on IRC how to fix this so that background kill is not necessary. One way would be to document that hton->kill_query must be safe to call from within thd_report_wait_for(). Maybe that is the simplest way. It should not be hard to fix the InnoDB code to make this work. For example, InnoDB could just set a flag in the transaction that it already has lock_sys->mutex held, thus avoiding the recursive mutex lock.

You also had a different suggestion:
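The "set a flag" idea can be sketched like this. The names (lock_sys_enter, innobase_kill_query_sketch, etc.) are illustrative, not InnoDB's actual code, and the flag would be per-thread (or per-transaction) in a real implementation; the point is only that the kill path checks the flag instead of blindly re-locking a non-recursive mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Illustrative stand-ins for lock_sys->mutex and its ownership flag. */
static pthread_mutex_t lock_sys_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool lock_sys_mutex_owned = false;   /* per-thread in real code */

static void lock_sys_enter(void)
{
    pthread_mutex_lock(&lock_sys_mutex);
    lock_sys_mutex_owned = true;
}

static void lock_sys_exit(void)
{
    lock_sys_mutex_owned = false;
    pthread_mutex_unlock(&lock_sys_mutex);
}

/* hton->kill_query path: only take the mutex if not already held. */
static int innobase_kill_query_sketch(void)
{
    bool need_lock = !lock_sys_mutex_owned;
    if (need_lock)
        lock_sys_enter();
    /* ... cancel the victim transaction's lock wait here ... */
    if (need_lock)
        lock_sys_exit();
    return 0;
}

/* thd_report_wait_for() can then kill synchronously from the deadlock
   detector, which already holds lock_sys->mutex: */
static int report_wait_and_kill(void)
{
    lock_sys_enter();                        /* detector context */
    int err = innobase_kill_query_sketch();  /* no recursive lock attempt */
    lock_sys_exit();
    return err;
}
```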
here's another idea. To have thd_get_commit_order() as I thought originally *and* allow innodb to kill other threads, that is, add a function for plugins to invoke THD::awake. This function should not call into the engine that invoked it. That is, supposedly the engine knows how to kill threads stuck in that engine, to kill other threads it should use this new helper.
In that case innodb can detect the deadlock and can kill the other thread and it won't break on lock mutex
The underlying logic of this killer helper would be "an engine knows how to kill a transaction waiting in this engine, so we wouldn't bother doing that. but to kill a transaction waiting elsewhere the engine needs help"
It seems better to me if all the details of enforcing commit order and killing worker threads in case of deadlocks could be kept inside the replication code. Otherwise every storage engine would potentially need to re-implement this. I think that is what I tried to do with my proposed storage engine API: cleanly separate the storage engine's responsibilities, which are about row-level locks and internal deadlock victims, from the replication code's responsibilities.

Wouldn't it be simpler to do it like this?

1. InnoDB deadlock detector reports lock waits to whatever function serves the thd_report_wait_for() purpose.
2. We document that thd_report_wait_for() may need to call hton->kill_query().
3. We fix InnoDB to handle (2) correctly (assuming this is possible to do in a simple way).
And anyway, this absolutely must be in a service, not in the global plugin.h.
Why? My understanding was that we do not support ABI stability for storage engines. (And I think we should absolutely not provide such support for storage engines. Storage engines are all about performance and flexibility; ABI stability would make this harder to provide, without sufficient other benefits to justify it. ABI stability seems mainly suited for e.g. authentication plugins.)
See? In this patch you've removed a function from the global plugin API
Right, the thd_rpl_is_parallel(). That should never have been allowed into the code in the first place; it was a temporary hack only. :-(
header; strictly speaking, you must increase the API version (the major number of it), making all older plugins incompatible with the new server. This is a bit extreme, taking into account that no plugin actually uses thd_rpl_is_parallel(). To avoid this, please only add new APIs as services. See libservices/HOWTO.
I can do that, if you want. It just seems a needless complication to me... Are there any performance penalties for using a service over a normal function call (for built-in storage engines)? I seem to recall there is not? - Kristian.