Alex Yurchenko <alexey.yurchenko@codership.com> writes:
On Thu, 18 Mar 2010 15:18:40 +0100, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Hm, how is it different from how it is done currently in MariaDB? Does txn_commit() have to follow the same order as txn_prepare()? If not, then the commit ordering imposed by redundancy service should not be a problem.
Ok, I checked, and indeed there is no requirement that prepare is done in the same order as commit. In fact, there seems to be no requirement on ordering of commit among different engines and binlog at all in the server itself!

(Since the XA/2pc in MySQL assumes every engine ensures durability by itself, there is no requirement for any ordering. In case of a crash, each engine will be able to recover every transaction successfully prepared, so it is just a matter of deciding which of them to commit and which to roll back.)

So I agree, there is no problem with the redundancy service imposing some order, with the purpose of enabling recovery even without a durability guarantee by each individual engine.

----

Now, InnoDB _does_ have a requirement to commit in the same order as the binlog (due to InnoBackup; if the commit order is not the same, the snapshot made by the backup may not correspond to any position in the binlog, which breaks restore).

The way this is implemented in InnoDB is by taking a global mutex in InnoDB prepare(), which is only released in InnoDB commit().

This is a really bad way to do things :-(. It means that only one (InnoDB) transaction at a time can be running the code between prepare() and commit(). Since this is where the binlog is written (and the point where the redundancy service makes the transaction durable in our discussions), this makes it impossible to do group commit!

Again, I think a good solution to this is to have an (optional) storage engine callback fix_commit_order(). This will be called after successful prepare(), but before commit(). It should do the minimum amount of work necessary to make sure that transactions are committed in the order in which fix_commit_order() is called. The upper layer (/redundancy service) will call fix_commit_order() for all transaction participants under a global mutex, ensuring correct order:
    lock(global_commit_order_mutex)
    fix_binlog_or_redundancy_service_commit_order()
    for (each storage engine)
        engine->fix_commit_order()
    unlock(global_commit_order_mutex)

(If the same commit order is not needed, fix_commit_order() can be NULL, and if all fix_commit_order() are NULL there is no need to take the mutex.)

Then InnoDB does not need to hold a global mutex across prepare() / commit(). In fact all it needs to do in fix_commit_order() is to put the transaction into a sorted list of pending commits. Then each transaction in commit() needs only to wait until it is first in this list, which is _much_ better than hanging in prepare() waiting for _all_ transactions to commit! (Other implementations are possible also, of course.)

 - Kristian.
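To make the idea concrete, here is a minimal sketch in Python of one possible shape for it. This is only an illustration, not server or InnoDB code: the `Engine` class, `fix_order()` and `run_demo()` are hypothetical names, and the sorted list of pending commits is simplified to a ticket counter (a ticket is the transaction's position in the list). It does not handle a transaction that fixes its order and then never commits.

```python
import threading


class Engine:
    """Hypothetical engine that commits in fix_commit_order() call order."""

    def __init__(self, name):
        self.name = name
        self._cond = threading.Condition()
        self._next_ticket = 0    # next position to hand out
        self._next_commit = 0    # position that is allowed to commit now
        self.committed = []      # transaction ids in actual commit order

    def fix_commit_order(self, txn_id):
        # Minimal work: only record the transaction's position in the order.
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            return ticket

    def commit(self, txn_id, ticket):
        with self._cond:
            # Wait only until this transaction is first among pending commits.
            self._cond.wait_for(lambda: self._next_commit == ticket)
            self.committed.append(txn_id)
            self._next_commit += 1
            self._cond.notify_all()


global_commit_order_mutex = threading.Lock()


def fix_order(txn_id, engines):
    """Upper layer: fix the order in every participant under one mutex."""
    with global_commit_order_mutex:
        # The binlog / redundancy service would fix its own order here too.
        return {e.name: e.fix_commit_order(txn_id) for e in engines}


def run_demo(n=5):
    engines = [Engine("a"), Engine("b")]
    # Order is fixed for transactions 0..n-1, in that sequence.
    tickets = {txn: fix_order(txn, engines) for txn in range(n)}

    def commit_all(txn):
        for e in engines:
            e.commit(txn, tickets[txn][e.name])

    # Start the commits in reverse to show that the fixed order decides.
    threads = [threading.Thread(target=commit_all, args=(txn,))
               for txn in reversed(range(n))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [e.committed for e in engines]
```

Calling `run_demo()` returns the commit order seen by each engine; both follow the order in which `fix_order()` was called, regardless of which thread reaches commit() first. The global mutex is held only for the cheap ordering step, so the work between prepare() and commit() can overlap across transactions.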