On Mon, 29 Mar 2010 00:02:09 +0200, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
> Alex Yurchenko <alexey.yurchenko@codership.com> writes:
>> On Thu, 18 Mar 2010 15:18:40 +0100, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
>> Hm, how is it different from how it is done currently in MariaDB? Does txn_commit() have to follow the same order as txn_prepare()? If not, then the commit ordering imposed by redundancy service should not be a problem.
> Ok, I checked, and indeed there is no requirement that prepare is done in the same order as commit.
> In fact, there seems to be no requirement on the ordering of commits among different engines and the binlog at all in the server itself!
> (Since the XA/2pc in MySQL assumes every engine ensures durability by itself, there is no requirement for any ordering. In case of a crash, each engine will be able to recover every transaction successfully prepared, so it is just a matter of deciding which of them to commit and which to roll back.)
> So I agree, there is no problem with the redundancy service imposing some order, with the purpose of enabling recovery even without a durability guarantee from each individual engine.
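To make the recovery step concrete, here is a minimal sketch of the "decide which to commit and which to roll back" pass. It assumes the binlog acts as the coordinator log, so a prepared transaction is committed only if its id is found there. The function name, the engine names and the data shapes are hypothetical, not server code.

```python
def resolve_prepared(prepared_by_engine, binlog_xids):
    """Decide the fate of every prepared transaction after a crash.

    prepared_by_engine: dict of engine name -> set of prepared transaction ids
    binlog_xids: set of transaction ids found in the binlog
                 (assumption: the binlog is the coordinator log)
    Returns a dict of engine name -> {"commit": [...], "rollback": [...]}.
    """
    decisions = {}
    for engine, prepared in prepared_by_engine.items():
        decisions[engine] = {
            # Logged by the coordinator: the transaction must be committed.
            "commit": sorted(x for x in prepared if x in binlog_xids),
            # Prepared but never logged: safe to roll back everywhere.
            "rollback": sorted(x for x in prepared if x not in binlog_xids),
        }
    return decisions
```

Note that no ordering between engines is involved: each engine only has to be able to recover its own prepared transactions.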
> ----
> Now, InnoDB _does_ have a requirement to commit in the same order as the binlog (due to InnoBackup; if the commit order is not the same, the snapshot made by the backup may not correspond to any position in the binlog, which breaks restore).
> The way this is implemented in InnoDB is by taking a global mutex in InnoDB prepare(), which is only released in InnoDB commit().
> This is a really bad way to do things :-(. It means that only one (InnoDB) transaction at a time can be running the code between prepare() and commit(). Since this is where the binlog is written (and the point where the redundancy service makes the transaction durable in our discussions), this makes it impossible to do group commit!
The way I understood the above is that the global mutex is taken in InnoDB prepare() solely to synchronize binlog and InnoDB commits. Is that so? If it is, then it is precisely the thing we want to achieve, but instead of locking a global mutex in InnoDB prepare() we'll be doing it in redundancy_service->pre_commit(), as discussed earlier:

    innodb->prepare();
    if (redundancy_service->pre_commit() == SUCCESS) // locks commit_order mtx
    {
        innodb->commit();
        redundancy_service->post_commit(); // unlocks commit_order mtx
    }
    ...

This way the global lock in innodb->prepare() can be naturally removed without any additional provisions. Am I missing something?

On the other hand, if we can reduce the amount of commit ordering operations to the absolute minimum, as you suggest below, it would only benefit performance. I'm just not sure about names. Essentially this means splitting commit() into 2 parts: the one that absolutely must be run under commit_order mutex protection and another that can be run outside of the critical section. I guess in that setup all actual IO can easily go into the 2nd part.
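The pre_commit()/post_commit() bracket can be tried out with a small threaded sketch. Everything here is a hypothetical stand-in: the RedundancyService and Engine classes, the sequence number returned by pre_commit() (the pseudocode above returns a status instead), and the omission of the failure path.

```python
import threading


class RedundancyService:
    """Hypothetical stand-in for the redundancy service."""

    def __init__(self):
        self._commit_order = threading.Lock()
        self._next_seqno = 1
        self.log = []  # seqnos in the order the service ordered them

    def pre_commit(self):
        self._commit_order.acquire()  # locks commit_order mtx
        seqno = self._next_seqno
        self._next_seqno += 1
        self.log.append(seqno)
        return seqno

    def post_commit(self):
        self._commit_order.release()  # unlocks commit_order mtx


class Engine:
    """Hypothetical engine whose prepare() takes no global lock."""

    def __init__(self):
        self.committed = []

    def prepare(self, trx):
        pass  # may run concurrently with other transactions

    def commit(self, trx, seqno):
        self.committed.append(seqno)


def run_transaction(engine, service, trx):
    engine.prepare(trx)
    seqno = service.pre_commit()
    try:
        engine.commit(trx, seqno)
    finally:
        service.post_commit()


def demo(n=8):
    engine, service = Engine(), RedundancyService()
    threads = [threading.Thread(target=run_transaction, args=(engine, service, i))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return engine.committed, service.log
```

Whatever the thread scheduling, the engine's commit order matches the service's order, because both are recorded while the same lock is held. The cost is that the whole of commit() sits inside the critical section, which is what splitting commit() into two parts would avoid.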
> Again, I think a good solution to this is to have an (optional) storage engine callback fix_commit_order(). This will be called after successful prepare(), but before commit(). It should do the minimum amount of work necessary to make sure that transactions are committed in the order that fix_commit_order() is called. The upper layer (/redundancy service) will call fix_commit_order() for all transaction participants under a global mutex, ensuring correct order.
>
>     lock(global_commit_order_mutex)
>     fix_binlog_or_redundancy_service_commit_order()
>     for (each storage engine)
>         engine->fix_commit_order()
>     unlock(global_commit_order_mutex)
>
> (If the same commit order is not needed, fix_commit_order() can be NULL, and if all fix_commit_order() are NULL there is no need to take the mutex.)
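A minimal sketch of the optional callback, assuming the binlog or redundancy service is simply the first participant in the list. Participant, fix_order_for_transaction and the use of None for a NULL callback are hypothetical names chosen for illustration.

```python
import threading

global_commit_order_mutex = threading.Lock()


class Participant:
    """A transaction participant; fix_commit_order=None plays the role of NULL."""

    def __init__(self, name, fix_commit_order=None):
        self.name = name
        self.fix_commit_order = fix_commit_order


def fix_order_for_transaction(trx_id, participants):
    """Call fix_commit_order() on every participant that provides one.

    Returns True if the global mutex had to be taken, False otherwise.
    """
    callbacks = [p.fix_commit_order for p in participants
                 if p.fix_commit_order is not None]
    if not callbacks:
        return False  # nobody needs a fixed order: skip the mutex entirely
    with global_commit_order_mutex:
        for callback in callbacks:
            callback(trx_id)
    return True
```

Because all callbacks for one transaction run under the same mutex, every participant sees transactions in the same relative order.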
What I'd like to correct here is that ordering is needed at least in the redundancy service. You need a global trx ID. And I believe storage engines won't be able to do without it either - otherwise we'll need to deal with holes in the commit sequence during recovery.

Also, I'd suggest moving the global_commit_order_mutex into what goes by "fix_binlog_or_redundancy_service_commit_order()" in the above pseudocode (the name is misleading - the redundancy service determines the order, it does not have to fix it). Locking it outside may seriously reduce concurrency.

> Then InnoDB does not need to hold a global mutex across prepare() / commit(). In fact all it needs to do in fix_commit_order() is to put the transaction into a sorted list of pending commits. Then each transaction in commit() needs only wait until it is first in this list, which is _much_ better than hanging in prepare() waiting for _all_ transactions to commit!
> (There are other implementations possible also, of course.)
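One possible shape of the "sorted list of pending commits" idea, as a sketch only. The OrderedCommitEngine class, the ticket counter and the condition variable are assumptions made for illustration, not InnoDB code.

```python
import threading
from collections import deque


class OrderedCommitEngine:
    """Hypothetical engine that commits in the order fix_commit_order() was called."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0
        self._pending = deque()  # tickets of pending commits, ascending
        self._ticket_of = {}     # trx id -> ticket
        self.commit_log = []     # trx ids in actual commit order

    def fix_commit_order(self, trx):
        # Cheap step: take a ticket and queue it. The caller is assumed to
        # serialize these calls (the upper layer's commit order mutex).
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            self._ticket_of[trx] = ticket
            self._pending.append(ticket)

    def commit(self, trx):
        with self._cond:
            ticket = self._ticket_of.pop(trx)
            # Wait only until this transaction is first in the pending list.
            self._cond.wait_for(lambda: self._pending[0] == ticket)
            self.commit_log.append(trx)  # the part that must be ordered
            self._pending.popleft()
            self._cond.notify_all()
        # Any part of commit that needs no ordering could run here.


def demo():
    engine = OrderedCommitEngine()
    for trx in ("c", "a", "b"):  # order fixed by the upper layer
        engine.fix_commit_order(trx)
    threads = [threading.Thread(target=engine.commit, args=(trx,))
               for trx in ("a", "b", "c")]  # commit() called in another order
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return engine.commit_log
```

Here the waiting happens in commit(), and only for transactions that are actually ahead in the list, so prepare() and the binlog write stay outside any engine-wide lock.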
> - Kristian.
Regards,
Alex

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011