Sergei Golubchik <serg@askmonty.org> writes:
Now, WL#132 - Transaction coordinator plugin
Wouldn't it be simpler to create only the group_log_xid() interface, with no log_and_order() or log_xid()? The TC plugin gets the list in group_log_xid(); it can reorder the list any way it wants, call prepare_ordered() and commit_ordered() as needed, and so on. In this interpretation, group_log_xid() can meet all the use cases. And there's no need to create a multitude of methods that one needs to get familiar with before implementing a TC plugin.
I do not see how this would work. The group_log_xid() interface as specified here does not allow the TC to reorder transactions; on the contrary, the commit order has already been decided by the ordering of transactions in the passed list.
But there is no need for multiple interfaces, just one: the log_and_order() interface. That is my main idea with MWL#132: to generalise the TC interface so that something like Galera is able to change commit order as it needs.
So there is only one plugin API, log_and_order(). The other interfaces (log_xid() and group_log_xid()) are not plugin APIs, they are just helper classes that one can use to implement some simpler types of TC plugins. I thought they could be useful to provide somehow, but maybe it just confuses the issue. Instead, they could just be examples, or maybe only something we use internally in mysqld to implement TC_LOG_MMAP and TC_LOG_BINLOG. (And as you suggest, maybe we do not need log_xid() at all; we could just rewrite TC_LOG_MMAP to use group_log_xid().)
Does that make my intentions clearer?
----
So, to elaborate on the log_and_order() interface:
I think it is a nice generalisation. It is easy to implement group_log_xid() in the log_and_order() framework; it is essentially the algorithm from MWL#116. But log_and_order() is more general, since it allows changing the commit order. This is not possible in group_log_xid(), since it is called only when the commit order has already been decided.
This is how I understand Galera works:
Galera first runs transactions in complete isolation on each node, buffering row events just like the binlog. Only during commit is the transaction replicated to other nodes. A global transaction ID is assigned to the transaction; this ID is a monotonic sequence which thus specifies the commit ordering relative to all other transactions in the cluster. The events for the transaction are then shipped to all other nodes.
A separate thread (or threads) applies transaction events received from other nodes in global transaction ID order (similar to the slave SQL thread). The commit of a local transaction is delayed until all other transactions with earlier global transaction IDs have been applied.
Galera uses optimistic concurrency control, assuming transactions can run independently, and aborting one if there turns out to be a conflict after all. They use the certification-based replication method to handle such conflicts. As I understand it, the idea is to have each node check for conflicts between transactions individually, but using a deterministic algorithm that ensures that all nodes will make the same decision about which transaction to roll back and which to keep. (Galera keeps track of primary key values of all modified rows for this purpose.)
(I hope I got this right; we should ask the Galera people for more details.)
So this is where log_and_order() comes in. Galera would install a TC plugin, and would receive a call into log_and_order() when a transaction commits. It can then replicate the transaction across the cluster and assign the global transaction ID. It can then synchronise among threads to invoke prepare_ordered() in correct global transaction ID order, and afterwards commit_ordered() in the same order. Then when it returns from log_and_order(), the commit order has been correctly decided (or it can roll back a conflicting transaction by returning an error from log_and_order()).
So it seems to be a good fit with Galera (though it still has to be shown to work in practice).
Something like group_log_xid(list_of_transactions) does not really work here, I think. Galera may need to reorder a local transaction with another transaction that has not even started yet when group_log_xid() is called, so even allowing the passed-in list to be reordered seems insufficient.
Also the old log_xid() interface seems insufficient, as it provides no way for Galera to control the order that transactions commit in after returning from log_xid(). Hm, maybe it could wait for unlog() from transaction 1 before returning from log_xid() from transaction 2, but that seems not optimal (and would prevent any kind of group commit).
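To make the Galera case a bit more concrete, here is a rough sketch of how a log_and_order()-style call could make committing threads wait for their turn in global transaction ID order. It is only an illustration under assumptions, not the MWL#132 API: the class name, the seqno and certified parameters, and the callback signatures are all made up for the example, and the IDs are assumed to be dense and to start at 1.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch, not the MWL#132 API: a TC that makes each committing
// thread wait for its turn in global transaction ID (seqno) order.
class OrderedCommitSketch {
public:
  // seqno is assumed to come from the replication layer (dense, starting at 1).
  // certified == false models a transaction that lost certification.
  // Returns true if the transaction committed, false if it must roll back.
  bool log_and_order(uint64_t seqno, bool certified,
                     const std::function<void()> &prepare_ordered,
                     const std::function<void()> &commit_ordered) {
    std::unique_lock<std::mutex> lk(mutex_);
    assert(seqno >= next_);  // a seqno must never be used twice
    cond_.wait(lk, [&] { return next_ == seqno; });
    if (certified) {
      // Both calls run serially, in seqno order, under the mutex.
      prepare_ordered();
      commit_ordered();
    }
    ++next_;  // a rolled-back transaction still consumes its slot
    cond_.notify_all();
    return certified;
  }

private:
  std::mutex mutex_;
  std::condition_variable cond_;
  uint64_t next_ = 1;
};

// Starts three threads in the "wrong" order and records the order in which
// commit_ordered() actually ran.
std::vector<uint64_t> demo_commit_order() {
  OrderedCommitSketch tc;
  std::vector<uint64_t> order;  // only touched inside commit_ordered()
  std::vector<std::thread> threads;
  for (uint64_t seqno : {3, 1, 2}) {
    threads.emplace_back([&tc, &order, seqno] {
      tc.log_and_order(seqno, true, [] {},
                       [&order, seqno] { order.push_back(seqno); });
    });
  }
  for (auto &t : threads) t.join();
  return order;
}
```

Whatever order the threads arrive in, demo_commit_order() records the commits as 1, 2, 3. Note that the transaction with ID 3 can enter log_and_order() before transactions 1 and 2 have even started, which is exactly the case a pre-built list passed to group_log_xid() cannot express.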
I still see no real value in keeping or supporting the log_xid() interface.
I think we can implement only one interface, group_log_xid(), and that's enough.
The central idea in group_log_xid() is the mechanism whereby transactions can queue up while TC is busy making previous transactions durable. So when TC becomes ready, we have a whole list of waiting transactions that can share the next fsync(). This is really an implementation of group commit, not a fully general interface. But it is general enough that it could probably be useful for other binlog-like implementations also. Same for log_xid() more or less. But I agree there is no need to have them as interfaces in the server. They can just serve as examples on how things can be implemented.
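The queueing mechanism can be sketched roughly as below. This is a simplified illustration and not the code from MWL#116: the class, the Waiter struct and the counters are invented for the example, the "log" is just a vector, and the fsync() is simulated by incrementing a counter.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch of group commit: transactions queue up while the TC is
// busy, and the next leader logs everything that is waiting in one go.
class GroupCommitSketch {
public:
  // Called by each committing thread; returns once the xid has been logged.
  void commit(uint64_t xid) {
    Waiter me{xid, false};
    std::unique_lock<std::mutex> lk(mutex_);
    queue_.push_back(&me);
    while (!me.done) {
      if (busy_) {  // somebody else is logging; wait in the queue
        cond_.wait(lk);
        continue;
      }
      busy_ = true;  // become leader for everything queued so far
      std::vector<Waiter *> group;
      group.swap(queue_);
      lk.unlock();  // new transactions can queue up while we log
      group_log_xid(group);
      lk.lock();
      for (Waiter *w : group) w->done = true;
      busy_ = false;
      cond_.notify_all();
    }
  }

  // Only meaningful once all committing threads have finished.
  std::size_t logged() const { return log_.size(); }
  std::size_t fsyncs() const { return fsyncs_; }

private:
  struct Waiter {
    uint64_t xid;
    bool done;
  };

  // Runs in the leader only; one simulated fsync() covers the whole group.
  void group_log_xid(const std::vector<Waiter *> &group) {
    assert(!group.empty());
    for (const Waiter *w : group) log_.push_back(w->xid);
    ++fsyncs_;
  }

  std::mutex mutex_;
  std::condition_variable cond_;
  std::vector<Waiter *> queue_;
  bool busy_ = false;
  std::vector<uint64_t> log_;
  std::size_t fsyncs_ = 0;
};

struct GroupCommitStats {
  std::size_t logged;
  std::size_t fsyncs;
};

// Commits n_threads transactions concurrently and reports the totals.
GroupCommitStats demo_group_commit(unsigned n_threads) {
  GroupCommitSketch tc;
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < n_threads; ++i)
    threads.emplace_back([&tc, i] { tc.commit(i + 1); });
  for (auto &t : threads) t.join();
  return {tc.logged(), tc.fsyncs()};
}
```

With eight concurrent committers, all eight xids end up in the log, but the number of simulated fsync() calls is anywhere between one and eight depending on how the threads happen to overlap.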
A TC based on this interface overrides group_log_xid() and xid_log_after() instead of log_and_order(), and again does not need to deal with any {prepare,commit}_ordered().
Why do you need xid_log_after() here?
I think the original motivation was that group_log_xid() handles many transactions in one thread, so it cannot call my_error() on each transaction individually. After all, it is possible for some transactions to fail while others succeed. So xid_log_after() runs in each individual thread once group_log_xid() is done, and can call my_error() for any deferred error. But it seems in any case appropriate to have a part of TC logging that runs in parallel, giving the TC the opportunity to reduce the amount of work done in the critical code path under the global LOCK_group_commit mutex. Just like the serialised prepare_ordered() and commit_ordered() calls have parallel counterparts prepare() and commit().
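As a small illustration of the deferred-error idea: the serial part only records which transactions failed, and each thread reports its own failure afterwards. The struct, the signatures and the failing_xid parameter below are assumptions made for the example (the real methods would look different), and the per-thread error list merely stands in for my_error().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-transaction entry shared between the two phases.
struct TxnSketch {
  uint64_t xid;
  bool log_failed;
};

// Serial part: one thread handles the whole group, so it cannot report
// errors to the individual connections; it only records them.
// failing_xid simulates a write that fails for one transaction.
void group_log_xid(std::vector<TxnSketch *> &group, uint64_t failing_xid) {
  for (TxnSketch *t : group) t->log_failed = (t->xid == failing_xid);
}

// Parallel part: runs in each transaction's own thread afterwards and
// reports any deferred error there. Returns non-zero on error.
int xid_log_after(const TxnSketch &txn,
                  std::vector<std::string> &thread_errors) {
  if (!txn.log_failed) return 0;
  thread_errors.push_back("xid " + std::to_string(txn.xid) +
                          " could not be logged");
  return 1;
}

// Logs three transactions as one group, with xid 2 failing, and returns the
// result each transaction would see from its own xid_log_after() call.
std::vector<int> demo_deferred_errors() {
  TxnSketch a{1, false}, b{2, false}, c{3, false};
  std::vector<TxnSketch *> group{&a, &b, &c};
  group_log_xid(group, 2);
  std::vector<int> results;
  for (TxnSketch *t : group) {
    std::vector<std::string> thread_errors;  // one list per "thread"
    results.push_back(xid_log_after(*t, thread_errors));
    assert(thread_errors.size() == static_cast<std::size_t>(results.back()));
  }
  return results;
}
```

Here only the second transaction sees an error; the other two succeed even though all three were logged by the same group_log_xid() call.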
If need_prepare_ordered or need_commit_ordered is passed as FALSE, then the corresponding call need not be done. It is safe to do it anyway; however, omitting it avoids the need to take a global mutex.
Why would this ever be needed? (I mean need_prepare_ordered or need_commit_ordered being FALSE.)
This is for engines that do not install prepare_ordered() and/or commit_ordered() methods (or that disable them due to user configuration, in case that provides better performance when consistent commit order is not needed). If these calls are not needed, then log_and_order() can take fewer locks, avoiding LOCK_prepare_ordered and/or LOCK_commit_ordered. Well, we already discussed changing LOCK_prepare_ordered to be the queue lock, and removing LOCK_commit_ordered completely. That may leave nothing to be saved, so I would just remove this. (The only remaining case I can come up with is TC_LOG_MMAP; unless both prepare_ordered() and commit_ordered() are installed, it need not do any queueing at all, as there is no concept of commit order inside it. But this is somewhat of a corner case.)
In current MariaDB, we have two different TC implementations (as well as a "dummy" empty implementation; I do not know whether it is used).
The code in mysqld.cc is
  tc_log= (total_ha_2pc > 1 ? (opt_bin_log  ?
                               (TC_LOG *) &mysql_bin_log :
                               (TC_LOG *) &tc_log_mmap) :
           (TC_LOG *) &tc_log_dummy);
so, tc_log_dummy is used when there's at most one xa-capable engine. But MySQL does not use 2pc for a transaction unless it has at least two xa-capable participants. In other words, tc_log_dummy is never used.
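Spelled out as a stand-alone function, the selection reads as follows. The enum and the function name are made up for this illustration; only total_ha_2pc and opt_bin_log come from the mysqld.cc snippet.

```cpp
// Hypothetical restatement of the tc_log selection in mysqld.cc.
enum class TcLogKind { Dummy, Mmap, Binlog };

constexpr TcLogKind choose_tc_log(unsigned total_ha_2pc, bool opt_bin_log) {
  return total_ha_2pc > 1
             ? (opt_bin_log ? TcLogKind::Binlog : TcLogKind::Mmap)
             : TcLogKind::Dummy;
}

// With zero or one XA-capable engine the dummy is selected, whether or not
// the binlog is enabled.
static_assert(choose_tc_log(0, false) == TcLogKind::Dummy, "no 2pc engine");
static_assert(choose_tc_log(1, true) == TcLogKind::Dummy, "one 2pc engine");
// With two or more, the binlog acts as TC if enabled, else the mmap log.
static_assert(choose_tc_log(2, true) == TcLogKind::Binlog, "binlog as TC");
static_assert(choose_tc_log(2, false) == TcLogKind::Mmap, "mmap as TC");
```

So the dummy is selected exactly in the configurations where, per the remark above, no transaction would go through 2pc in the first place.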
Ok, thanks for the info.

 - Kristian.