Re: [Maria-developers] Architecture review of MWL#132 Transaction coordinator plugin
Hi, Kristian! Now, WL#132 - Transaction coordinator plugin
============= High-Level Specification
...
In current MariaDB, we have two different TC implementations (as well as a "dummy" empty implementation; I do not know whether it is used).
The code in mysqld.cc is

  tc_log= (total_ha_2pc > 1 ? (opt_bin_log ?
                               (TC_LOG *) &mysql_bin_log :
                               (TC_LOG *) &tc_log_mmap) :
           (TC_LOG *) &tc_log_dummy);

so, tc_log_dummy is used when there's at most one xa-capable engine. But MySQL does not use 2pc for a transaction unless it has at least two xa-capable participants. In other words, tc_log_dummy is never used.
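The selection rule in that expression can be restated as a small standalone function. This is only a sketch: TcKind and choose_tc() are hypothetical names invented for illustration, standing in for the three TC_LOG objects in the snippet above.

```cpp
// Hypothetical stand-ins for mysql_bin_log, tc_log_mmap and tc_log_dummy.
enum class TcKind { Binlog, Mmap, Dummy };

// Mirrors: total_ha_2pc > 1 ? (opt_bin_log ? binlog : mmap) : dummy
TcKind choose_tc(unsigned total_ha_2pc, bool opt_bin_log)
{
  if (total_ha_2pc > 1)
    return opt_bin_log ? TcKind::Binlog : TcKind::Mmap;
  // At most one 2PC-capable participant: no real coordinator is selected.
  return TcKind::Dummy;
}
```

With two or more 2PC-capable participants the binary log is chosen when it is enabled, and the mmap-based log otherwise.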
Binary log
----------
The binary log also implements a "fake" storage engine, mainly to hook into the commit (and prepare) phase of transaction processing. This is mainly used for statements in non-transactional engines, which are "committed" and written to the binary log outside of the TC and log_xid() framework.
No, this is used to make the number of xa-capable transaction participants more than one and to force MySQL to use 2PC.
TC interface subclasses
-----------------------
MWL#116 has two different algorithms for handling commit order and invoking prepare_ordered() and commit_ordered() handler methods:
- One used with TC_MMAP, which needs no correspondence between engines and TC. This uses the existing log_xid() interface.
- One used with the binary log TC, which ensures same commit order in engines and binary log, and which uses a new single-threaded group_log_xid() TC interface to efficiently do group commit.
In the prototype patch for MWL#116, these two methods are mixed with each other in the function ha_commit_trans(), and the logic is quite complex. Using the log_and_order() TC generalisation provides a nice cleanup of this.
We implement two subclasses of the TC interface:
- One class TC_LOG_unordered for the method used with TC_MMAP. This implements the old log_xid() interface.
- One class TC_LOG_group_commit for the method used for the binary log. This implements the new group_log_xid() interface.
Each subclass implements the corresponding algorithm for invoking prepare_ordered() and commit_ordered(), using the same mechanisms as in MWL#116, but implemented in a cleaner way. The ha_commit_trans() function then has no details about prepare_ordered() or commit_ordered(); it just calls into tc_log->log_and_order(), which handles the necessary details.
Thus a simple TC plugin similar to the binary log or TC_MMAP can implement one of the simple interfaces log_xid() or group_log_xid(), without having to worry about prepare_ordered() and commit_ordered(). But a plugin like Galera that needs to do more can implement the more general interface.
I still see no real value in keeping or supporting the log_xid() interface. I think we can implement just one interface - group_log_xid() - and that's enough.
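For reference, the layering proposed in the specification quoted above (one general log_and_order() entry point, with a log_xid()-style helper class built on top) might look roughly like this. It is a single-threaded sketch under stated assumptions: Trx, the *_sketch class names and MemoryTC are hypothetical, the run_*_ordered() helpers only record that they ran, and the mutexes required by the low-level design are left out.

```cpp
#include <string>
#include <vector>

struct Trx {                        // hypothetical stand-in for THD + xid
  unsigned long xid;
  std::vector<std::string> trace;   // records which steps ran, in order
};

class TC_LOG_sketch {               // the one general plugin interface
public:
  virtual ~TC_LOG_sketch() = default;
  // Non-zero return: commit. Zero return: roll back.
  virtual int log_and_order(Trx &trx) = 0;
protected:
  // Stand-ins for the server helpers that call into the engines.
  void run_prepare_ordered(Trx &t) { t.trace.push_back("prepare_ordered"); }
  void run_commit_ordered(Trx &t) { t.trace.push_back("commit_ordered"); }
};

// Helper base class: a simple TC only overrides log_xid().
class TC_LOG_unordered_sketch : public TC_LOG_sketch {
public:
  int log_and_order(Trx &trx) override {
    run_prepare_ordered(trx);
    int cookie = log_xid(trx);      // make the commit decision durable
    if (cookie)
      run_commit_ordered(trx);      // only after the TC has logged the xid
    return cookie;
  }
protected:
  virtual int log_xid(Trx &trx) = 0;
};

// A trivial TC that "logs" in memory and always succeeds.
class MemoryTC : public TC_LOG_unordered_sketch {
protected:
  int log_xid(Trx &trx) override {
    trx.trace.push_back("log_xid");
    return 1;
  }
};
```

A plugin that needs full control over ordering would derive from the general class directly and implement log_and_order() itself.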
============= Low-Level Design
...
log_and_order()

Requests a decision to commit (non-zero return) or roll back (zero return) the transaction. At this point, the transaction has been successfully prepared in all engines.
The method must call run_prepare_ordered() in such a way that calls in different threads happen in the order that the transactions are committed. This call must be protected by the global LOCK_prepare_ordered mutex.
The method must then call run_commit_ordered(), protected by LOCK_commit_ordered, again so that different threads are called in the order that transactions are committed.
The idea with prepare_ordered() is to call it as early as possible after commit order has been decided, for example to release locks early. In particular, a transaction can still be rolled back after prepare_ordered() (for example in case of a crash). In contrast, commit_ordered() may only be called after the transaction is durably committed in the TC.
If need_prepare_ordered or need_commit_ordered is passed as FALSE, then the corresponding call need not be done. It is safe to do it anyway; however, omitting it avoids the need to take a global mutex.
Why would this ever be needed ? (I mean need_prepare_ordered or need_commit_ordered being FALSE) ...
A TC based on this interface overrides group_log_xid() and xid_log_after() instead of log_and_order(), and again does not need to deal with any {prepare,commit}_ordered().
Why do you need xid_log_after here ?

General comment:

Wouldn't it be simpler to create only group_log_xid() interface, no log_and_order() or log_xid() ? The tc plugin gets the list in group_log_xid() - it can reorder the list any way it wants, call prepare_ordered() and commit_ordered() as needed and so on. In this interpretation, group_log_xid() can meet all the use cases. And there's no need to create a multitude of methods that one needs to get familiar with before implementing a TC plugin.

Regards,
Sergei

P.S. Minor detail - there could be helper functions like iterate_the_list_and_call_prepare_ordered(), that the plugin can use.
Sergei Golubchik <serg@askmonty.org> writes:
Now, WL#132 - Transaction coordinator plugin
Wouldn't it be simpler to create only group_log_xid() interface, no log_and_order() or log_xid() ? The tc plugin gets the list in group_log_xid() - it can reorder the list any way it wants, call prepare_ordered() and commit_ordered() as needed and so on. In this interpretation, group_log_xid() can meet all the use cases. And there's no need to create a multitude of methods that one needs to get familiar with before implementing a TC plugin.
I do not see how this would work. The group_log_xid() interface as specified here does not allow the TC to reorder transactions; on the contrary, the commit order has already been decided by the ordering of transactions in the passed list.

But there is no need for multiple interfaces, just one: the log_and_order() interface. That is my main idea with MWL#132: to generalise the TC interface so that something like Galera is able to change commit order as it needs.

So there is only one plugin API, log_and_order(). The other interfaces (log_xid() and group_log_xid()) are not plugin APIs, they are just helper classes that one can use to implement some simpler types of TC plugins.

I thought they could be useful to provide somehow, but maybe it just confuses the issue. Instead, they could just be examples, or maybe only something we use internally in mysqld to implement TC_LOG_MMAP and TC_LOG_BINLOG. (And as you suggest, maybe we do not need log_xid() at all, we could just rewrite TC_LOG_MMAP to use group_log_xid()).

Does that make my intentions clearer?

----

So, to elaborate on the log_and_order() interface:

I think it is a nice generalisation. It is easy to implement group_log_xid() in the log_and_order() framework; it is essentially the algorithm from MWL#116. But log_and_order() is more general, since it allows the TC to change commit order. This is not possible in group_log_xid(), since it is called only when commit order has already been decided.

This is how I understand Galera works:

Galera first runs transactions in complete isolation on each node, buffering row events just like the binlog. Only during commit is the transaction replicated to other nodes. A global transaction ID is assigned to the transaction; this ID is a monotonic sequence which thus specifies the commit ordering relative to all other transactions in the cluster. The events for the transaction are then shipped to all other nodes.
A separate thread (or threads) applies transaction events received from other nodes in global transaction ID order (similar to the slave SQL thread). The commit of a local transaction is delayed until all other transactions with earlier global transaction ID have been applied.

Galera uses optimistic concurrency control, assuming transactions can run independently, and aborting one if there turns out to be a conflict after all. They use the certification based replication method to handle such conflicts. As I understand it, the idea is to have each node check for conflicts between transactions individually, but using a deterministic algorithm that ensures that all nodes will make the same decision about which transaction to roll back and which to keep. (Galera keeps track of primary key values of all modified rows for this purpose).

(I hope I got this right, we should ask the Galera people for more details).

So this is where log_and_order() comes in. Galera would install a TC plugin, and would receive a call into log_and_order() when a transaction commits. It can then replicate the transaction across the cluster and assign global transaction ID. It can then synchronise among threads to invoke prepare_ordered() in correct global transaction ID order, and afterwards commit_ordered() in the same order. Then when it returns from log_and_order(), the commit order has been correctly decided (or it can roll back a conflicting transaction by returning error from log_and_order()).

So it seems to be a good fit with Galera (though it still has to be shown to work in practice).

Something like group_log_xid(list_of_transactions) does not really work here I think. Galera may need to reorder a local transaction with another transaction that has not even started yet when group_log_xid() is called, so even allowing to reorder the passed-in list seems insufficient.
Also the old log_xid() interface seems insufficient, as it provides no way for Galera to control the order that transactions commit in after returning from log_xid(). Hm, maybe it could wait for unlog() from transaction 1 before returning from log_xid() from transaction 2, but that seems not optimal (and would prevent any kind of group commit).
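The synchronisation step described above (each committing thread waits until all transactions with an earlier global transaction ID have passed, then runs its ordered hook) can be sketched with a mutex and a condition variable. This is an illustration only, not Galera's implementation: CommitSequencer and run_in_order() are hypothetical names, and it assumes sequence numbers start at 1 and have no gaps.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

class CommitSequencer {
public:
  // Blocks until every transaction with a smaller seqno has gone through,
  // runs hook() while holding the lock (so hooks are serialised), and then
  // admits the next seqno.
  template <class Hook>
  void run_in_order(std::uint64_t seqno, Hook hook)
  {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [&] { return next_ == seqno; });
    hook();            // e.g. the prepare_ordered() or commit_ordered() call
    ++next_;
    cv_.notify_all();  // wake waiters so the next seqno can proceed
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::uint64_t next_ = 1;  // assumption: first global seqno is 1
};
```

A log_and_order() implementation along these lines would call run_in_order() once with the prepare_ordered() hook and, after the transaction is durable, once more on a second sequencer with the commit_ordered() hook. A thread arriving with seqno 3 simply blocks until 1 and 2 have passed, even if those transactions had not started when it arrived.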
I still see no real value in keeping or supporting the log_xid() interface.
I think we can implement just one interface - group_log_xid() - and that's enough.
The central idea in group_log_xid() is the mechanism whereby transactions can queue up while TC is busy making previous transactions durable. So when TC becomes ready, we have a whole list of waiting transactions that can share the next fsync(). This is really an implementation of group commit, not a fully general interface. But it is general enough that it could probably be useful for other binlog-like implementations also. Same for log_xid() more or less. But I agree there is no need to have them as interfaces in the server. They can just serve as examples of how things can be implemented.
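The queueing mechanism described above can be sketched as follows. This is a simplified model, not the MWL#116 code: GroupCommitQueue and its members are hypothetical names, the fsync() is only counted rather than performed, and it assumes that only one thread at a time runs group_log_xid().

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

class GroupCommitQueue {
public:
  // Called by each committing thread while the TC is busy.
  void enqueue(unsigned long xid)
  {
    std::lock_guard<std::mutex> g(mu_);
    waiting_.push_back(xid);
  }

  // Called when the TC becomes ready: takes the whole queue in arrival
  // order and makes it durable with a single (simulated) sync.
  std::vector<unsigned long> group_log_xid()
  {
    std::vector<unsigned long> batch;
    {
      std::lock_guard<std::mutex> g(mu_);
      batch.swap(waiting_);
    }
    if (!batch.empty())
      ++sync_count_;   // stand-in for one write + fsync() for the batch
    return batch;
  }

  std::size_t sync_count() const { return sync_count_; }

private:
  std::mutex mu_;
  std::vector<unsigned long> waiting_;
  std::size_t sync_count_ = 0;  // touched only by the single leader thread
};
```

Three transactions that queue up during one sync are then made durable by one further sync instead of three, and their commit order is fixed by their position in the returned list.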
A TC based on this interface overrides group_log_xid() and xid_log_after() instead of log_and_order(), and again does not need to deal with any {prepare,commit}_ordered().
Why do you need xid_log_after here ?
I think the original motivation was that group_log_xid() handles many transactions in one thread, so it cannot call my_error() on each transaction individually. After all, it is possible for some transactions to fail while others succeed.

So xid_log_after() runs in each individual thread once group_log_xid() is done, and can call my_error() for any deferred error.

But it seems in any case appropriate to have a part of TC logging that runs in parallel, giving the TC the opportunity to reduce the amount of work done in the critical code path under the global LOCK_group_commit mutex. Just like the serialised prepare_ordered() and commit_ordered() calls have parallel counterparts prepare() and commit().
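The split between the two calls can be illustrated like this. It is a sketch under stated assumptions: QueuedTrx and the *_sketch functions are hypothetical, the write_ok callback stands in for the actual TC write, and pushing a string onto a vector stands in for my_error().

```cpp
#include <string>
#include <vector>

struct QueuedTrx {
  unsigned long xid;
  int deferred_error;   // 0 = ok; set by the leader, reported by the owner
};

// Runs once, in a single thread, for the whole batch. It records failures
// per transaction instead of reporting them.
void group_log_xid_sketch(std::vector<QueuedTrx *> &batch,
                          bool (*write_ok)(unsigned long xid))
{
  for (QueuedTrx *trx : batch)
    trx->deferred_error = write_ok(trx->xid) ? 0 : 1;
}

// Runs afterwards in each transaction's own thread, outside the serialised
// section. Returns non-zero to commit, zero to roll back.
int xid_log_after_sketch(const QueuedTrx &trx,
                         std::vector<std::string> &errors)
{
  if (trx.deferred_error) {
    errors.push_back("xid " + std::to_string(trx.xid) + ": TC logging failed");
    return 0;
  }
  return 1;
}
```

One transaction in a batch can fail while the others succeed, and each thread learns and reports only its own outcome.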
If need_prepare_ordered or need_commit_ordered is passed as FALSE, then the corresponding call need not be done. It is safe to do it anyway; however, omitting it avoids the need to take a global mutex.
Why would this ever be needed ? (I mean need_prepare_ordered or need_commit_ordered being FALSE)
This is for engines that do not install prepare_ordered() and/or commit_ordered() methods (or that disable them due to user configuration, in case that provides better performance when consistent commit order is not needed).

If these calls are not needed, then log_and_order() can take fewer locks, avoiding LOCK_prepare_ordered and/or LOCK_commit_ordered.

Well, we already discussed changing LOCK_prepare_ordered to be the queue lock, and removing LOCK_commit_ordered completely. That may leave nothing to be saved, so I would just remove this.

(The only remaining case I can come up with is TC_LOG_MMAP; unless both prepare_ordered() and commit_ordered() are installed, it need not do any queueing at all, as there is no concept of commit order inside it. But this is somewhat of a corner case).
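For illustration, the optimisation under discussion would look roughly like the following. It is a sketch only: the two std::mutex objects stand in for the global mutexes named in the design, make_durable() is a hypothetical placeholder for the TC write, and the vectors merely record which ordered steps ran. It shows how the flags let a call skip a global mutex; it does not show the cross-thread ordering between the two phases.

```cpp
#include <mutex>
#include <vector>

std::mutex LOCK_prepare_ordered;   // stand-ins for the global mutexes
std::mutex LOCK_commit_ordered;
std::vector<int> prepare_order;    // xids in the order the hooks ran
std::vector<int> commit_order;

// Hypothetical placeholder for writing and syncing the TC log.
bool make_durable(int /*xid*/) { return true; }

// Non-zero return: commit. Zero return: roll back.
int log_and_order_sketch(int xid, bool need_prepare_ordered,
                         bool need_commit_ordered)
{
  if (need_prepare_ordered) {
    std::lock_guard<std::mutex> g(LOCK_prepare_ordered);
    prepare_order.push_back(xid);  // stands in for run_prepare_ordered()
  }
  if (!make_durable(xid))
    return 0;
  if (need_commit_ordered) {
    std::lock_guard<std::mutex> g(LOCK_commit_ordered);
    commit_order.push_back(xid);   // stands in for run_commit_ordered()
  }
  return 1;
}
```

When a flag is FALSE the corresponding mutex is never taken, which is the only saving on offer here.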
In current MariaDB, we have two different TC implementations (as well as a "dummy" empty implementation; I do not know whether it is used).
The code in mysqld.cc is
tc_log= (total_ha_2pc > 1 ? (opt_bin_log ? (TC_LOG *) &mysql_bin_log : (TC_LOG *) &tc_log_mmap) : (TC_LOG *) &tc_log_dummy);
so, tc_log_dummy is used when there's at most one xa-capable engine. But MySQL does not use 2pc for a transaction unless it has at least two xa-capable participants. In other words, tc_log_dummy is never used.
Ok, thanks for info. - Kristian.
Hi, Kristian! On Sep 07, Kristian Nielsen wrote:
Sergei Golubchik <serg@askmonty.org> writes:
Now, WL#132 - Transaction coordinator plugin
Wouldn't it be simpler to create only group_log_xid() interface, no log_and_order() or log_xid() ? The tc plugin gets the list in group_log_xid() - it can reorder the list any way it wants, call prepare_ordered() and commit_ordered() as needed and so on. In this interpretation, group_log_xid() can meet all the use cases. And there's no need to create a multitude of methods that one needs to get familiar with before implementing a TC plugin.
I do not see how this would work. The group_log_xid() interface as specified here does not allow the TC to reorder transactions, on the contrary the commit order has already been decided by the ordering of transactions in the passed list.
Eh. Above I wrote that group_log_xid() calls prepare_ordered() and commit_ordered() - so it's not the same group_log_xid() that you had in WL#116. Perhaps it's the same as the log_and_order() method, and then we agree that it's all that is needed.
Something like group_log_xid(list_of_transactions) does not really work here I think. Galera may need to reorder a local transaction with another transaction that has not even started yet when group_log_xid() is called, so even allowing to reorder the passed-in list seems insufficient.
Why, no. Galera may remove some thd's from the list and put them into an internal tc plugin buffer. That is, it commits only transactions that it knows when and how to commit. Next time log_and_order() is called, Galera will add delayed transactions back to the queue, reorder, remove "not ready" transactions again, and so on. I mean, log_and_order() seems sufficient, although using it in this scenario won't be trivial - but that is to be expected, the scenario itself is complex.
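The buffering scheme described above can be modelled in a few lines. This is a deliberately simplified sketch, not how Galera works: DeferringTC is a hypothetical class, transactions are identified only by a global sequence number, and a transaction is assumed "ready" once every lower sequence number has been committed.

```cpp
#include <cstddef>
#include <set>
#include <vector>

class DeferringTC {
public:
  // Accepts newly arrived transactions and returns, in commit order, the
  // ones that can be committed now. The rest stay in the internal buffer
  // until a later call.
  std::vector<unsigned long>
  log_and_order(const std::vector<unsigned long> &incoming)
  {
    delayed_.insert(incoming.begin(), incoming.end());  // re-sorts the queue
    std::vector<unsigned long> ready;
    while (!delayed_.empty() && *delayed_.begin() == next_) {
      ready.push_back(next_);
      delayed_.erase(delayed_.begin());
      ++next_;
    }
    return ready;
  }

  std::size_t delayed_count() const { return delayed_.size(); }

private:
  std::set<unsigned long> delayed_;  // "not ready" transactions, sorted
  unsigned long next_ = 1;           // assumption: seqnos start at 1, no gaps
};
```

Transactions 3 and 2 arriving first are both held back; once transaction 1 arrives in a later call, all three are released in order.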
A TC based on this interface overrides group_log_xid() and xid_log_after() instead of log_and_order(), and again does not need to deal with any {prepare,commit}_ordered().
Why do you need xid_log_after here ?
I think the original motivation was that group_log_xid() handles many transactions in one thread, so it cannot call my_error() on each transaction individually. After all, it is possible for some transactions to fail while others succeed.
So xid_log_after() runs in each individual thread once group_log_xid() is done, and can call my_error() for any deferred error.
But it seems in any case appropriate to have a part of TC logging that runs in parallel, giving the TC the opportunity to reduce the amount of work done in the critical code path under the global LOCK_group_commit mutex. Just like the serialised prepare_ordered() and commit_ordered() calls have parallel counterparts prepare() and commit().
I'm not completely convinced, but it doesn't matter anyway - if the only interface function is log_and_order() - then there's no need for xid_log_after().
If need_prepare_ordered or need_commit_ordered is passed as FALSE, then the corresponding call need not be done. It is safe to do it anyway; however, omitting it avoids the need to take a global mutex.
Why would this ever be needed ? (I mean need_prepare_ordered or need_commit_ordered being FALSE)
This is for engines that do not install prepare_ordered() and/or commit_ordered() methods (or that disable them due to user configuration, in case that provides better performance when consistent commit order is not needed).
If these calls are not needed, then log_and_order() can take fewer locks, avoiding LOCK_prepare_ordered and/or LOCK_commit_ordered.
Uhm, I don't know if this case is worth optimizing for.
Well, we already discussed changing LOCK_prepare_ordered to be the queue lock, and removing LOCK_commit_ordered completely. That may leave nothing to be saved, so I would just remove this.
Regards, Sergei
participants (2): Kristian Nielsen, Sergei Golubchik