
On Thu, 18 Mar 2010 15:18:40 +0100, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Alex, I think this discussion is getting really interesting, and I understand your points much better now, thanks for your many comments!
Glad to hear that. It'd be very sad otherwise ;)
Ok, so I think this means that the redundancy service will generate a new RS History ID when it starts a new RS History. Exactly when and how this happens will then depend on the particular plugin.
That's precisely the way to put it: "starts a new RS History". When a Galera node starts with an undefined initial state, it generates a new history UUID. With that, it will never join a cluster without taking a state snapshot first. Likewise, if it is the first node of the cluster, no other node will join it without a state snapshot. This works remarkably well in preventing state inconsistency due to operator errors.
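Roughly, the rule could be sketched like this (illustrative pseudocode only, not the actual Galera code; all of the names are made up):

  /* On startup: an undefined initial state starts a new RS History. */
  if (state_uuid_is_undefined(node))
  {
    node->history_uuid = uuid_generate_random();  /* new RS History ID */
    node->last_seqno   = 0;
  }

  /* On joining: a node whose history UUID differs from the cluster's
     must take a state snapshot before it is allowed in. */
  if (!uuid_equal(node->history_uuid, cluster->history_uuid))
    request_state_snapshot_transfer(node);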
Well, this is what the redundancy service is for: when you crash, you can recover, isn't it? My understanding is that you do want the redundancy service to have the change, so that you can take it from there and commit the uncommitted. This is how Galera works, and I believe this is how semi-sync in MySQL works (there's hardly any other way for synchronous replication to work).
Aha, now I see what you mean!
So the redundancy service would provide the durability, using its transaction log or whatever. This is quite different from current MySQL, where every engine, as well as the binlog code, is individually responsible for durability, and then consequently the 2-phase commit is needed to ensure that everyone has the same idea of what should be durable.
So this is actually quite an appealing idea! With this system, each storage engine (and secondary replication plugins like binlog) would no longer need to ensure that transactions are durable, just that they are able to recover after a crash into a consistent state.
Yes, the idea of this model is that the main purpose of redundancy is durability, which, depending on the plugin, can be of a much higher degree than a flush to disk (e.g. binlog to a remote machine with point-in-time recovery ability). There is a subtle moment here however: an asynchronous redundancy plugin won't give you that durability. So, again, there are several ways to go about that: some applications may be willing to trade some durability (the few last transactions) for speed. Or you can still have an option to ensure durability at the engine level. Maybe something else.
This could translate concretely into (for example) setting innodb_flush_log_at_trx_commit=0 without risking losing transactions. So no fflush() per commit in the engine (and I think therefore no need for group commit).
So only the (primary) redundancy service would need to fflush() (or whatever) to achieve durability.
Exactly. That's our default setting for Galera.
So this is quite interesting! The whole issue of coordinating the state between engines, master binlog thread, and slave threads, in the case of crash recovery, is quite a mess in current MySQL, and this indeed sounds much cleaner.
So let me list some potential issues with this that spring to my mind, and see how they could be resolved:
1. On crash recovery, the engines must be able to tell where in the RS History they have recovered themselves to, so that they can correctly be supplied with the remaining transactions that they must apply.
The best way to do this is probably to extend the storage engine API (and the engines) with a facility for this, so that the engines can log it in their transaction log. However, it seems simple enough to emulate it for engines that do not implement such extensions. It should be sufficient to create a hidden table inside the engine with a single "transaction_id" column, and just insert a new row with the ID just before committing in the engine.
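Something like this, perhaps (pure pseudocode in the style of the snippets further down; HIDDEN_TXN_ID_TABLE and the engine calls are invented for illustration):

  /* Emulation for an engine without the extension: a hidden one-column
     table records the local ID of the last committed transaction. */
  void commit_with_id_emulation(THD *thd, Engine *engine,
                                uint64_t local_txn_id)
  {
    /* The INSERT is part of the same engine transaction, so it commits
       (or rolls back) atomically with the user's changes. */
    engine->insert_row(thd, HIDDEN_TXN_ID_TABLE, local_txn_id);
    engine->txn_commit(thd);
  }

  /* On crash recovery, ask the engine for the highest ID present in the
     hidden table; transactions after that point must be re-applied. */
  uint64_t recovered_position(Engine *engine)
  {
    return engine->read_max_value(HIDDEN_TXN_ID_TABLE);
  }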
Now, engine commit must be expected to fail from time to time. So it seems hard to supply the engine with the global transaction ID (as that will not be generated until engine commit_prepare has succeeded).
But we don't actually need the _global_, linearly ordered ID for this. It should be sufficient to use the local transaction ID generated at the start of the transaction, as long as the redundancy service records the mapping between this and the global transaction IDs. And of course if we choose to roll back later then there will be no matching global transaction ID (which is fine).
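Conceptually, recovery would then be (again only a sketch; the rs_* and engine functions are invented):

  /* The redundancy service durably records the pair
     (local_txn_id -> global_txn_id) for every committed transaction. */
  uint64_t local_pos  = engine->recovered_local_txn_id(); /* from engine  */
  uint64_t global_pos = rs_lookup_global_id(local_pos);   /* from mapping */

  /* Re-apply to the engine everything it is missing. */
  rs_replay_from(global_pos + 1, engine);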
This is an interesting option indeed. However: 1) the mapping itself should be durable, so every plugin must design a way to recover it in the case of a crash; 2) the global transaction ID had better be an integral part of the server state, otherwise it will complicate state snapshot transfer. E.g. it will be insufficient to just use mysqldump or copy db files; you'll need to carry the global transaction ID along. The above issues are of course resolvable one way or another. It is just not obvious what is easier: fix the engine or implement a workaround (if you consider it in its entirety).

This does not end here. If we want the flexibility to define Redundancy Sets (like limiting the scope of replication to certain tables, or having several RS's per server), we'll need to be able to maintain several global transaction IDs and map them to RS's somehow, as well as store the description of the Redundancy Sets themselves. Probably in a table like (rs_name, table_list, plugin_stack, history_uuid, global_trx_ID); see the sketch below. The simplest application: graceful server restart. (Notice that since each RS has a separate plugin stack, there are no complications on the plugin side.)

The above consideration removes the global transaction ID per engine, but brings in a separate table that should implicitly participate in every transaction. I see too many variants here which I can't really assess. I'd like to hear what the DB guys would say.
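For illustration only, such a descriptor could look roughly like this (a hypothetical layout, nothing more):

  /* One record per Redundancy Set.  Updating global_trx_id makes the
     descriptor an implicit participant in every transaction of the RS. */
  struct rs_descriptor
  {
    char      rs_name[64];     /* unique name of the Redundancy Set    */
    char     *table_list;      /* tables whose changes belong to it    */
    char     *plugin_stack;    /* ordered redundancy plugins to call   */
    uuid_t    history_uuid;    /* ID of the current RS History         */
    uint64_t  global_trx_id;   /* last committed position in that RS   */
  };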
But true, it is essential that storage engine(s) and redundancy service(s) commit in the same order. It is however not yet clear to me that asking the redundancy service first is sufficient for this, nor that it is necessary.

As far as I understand, any (semi)synchronous replication requires that.
Right ... I have to study (semi)synchronous replication techniques a bit more; I'm not too familiar with them.
Ok, I might be too rash here. The (semi)synchronous guarantee is that you don't send OK to the client until the change is replicated to (at least some) other nodes. So technically you can do replication after the local commit. Still, there are 3 pros to calling the redundancy service before commit, as I mentioned before: 1) it is required for consistent ordering in the multi-master case; 2) it gives you durability with synchronous plugins; 3) if you go with asynchronous replication, why not start doing it really asynchronously, that is, ASAP? Note that if you call the redundancy service last, you still have to hold commit_order_lock for the duration of the call, so you don't win anything this way.
Hm. I'm not well versed in MySQL engine interfaces, so I might be missing something here. But why can't we have something like this instead:
  engine->txn_prepare(thd);

  if (redundancy_service->pre_commit(thd) == SUCCESS)
  {
    engine->txn_commit(thd);
    redundancy_service->post_commit(thd);
  }
  else
  {
    engine->txn_rollback(thd);
  }
The pre_commit() call locks commit_order_lock in the right order. post_commit() releases commit_order_lock and the resources that might be allocated by the transaction in the redundancy service. Or, if txn_commit() is too heavy an operation, then maybe like this:
  engine->txn_prepare(thd);

  if (redundancy_service->pre_commit(thd) == SUCCESS)
  {
    engine->txn_fix_order(thd);
    redundancy_service->post_commit(thd);
    engine->txn_commit(thd);
  }
  else
  {
    engine->txn_rollback(thd);
  }
There are (at least) two issues of interest:
1. Can engines commit in a different order from the redundancy service (and if they can, does it matter)?
2. Can we support group commit (or obtain equivalent performance in some other way)?
To give durability, the redundancy service will typically need to fflush() when it commits and generates the global transaction ID. So having group commit means that while such an fflush() for transaction T1 is running, other transactions T2, T3, ..., Tn are free to start (though not complete, of course) committing in the redundancy service, so that they can queue up; once the T1 fflush() is done, a single fflush() will be sufficient to commit all of T2, ..., Tn.
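Just to illustrate the queueing, here is a minimal leader/follower sketch of group commit inside the redundancy service, assuming POSIX threads, IDs assigned in log-append order, and an invented log_fsync() helper (not a proposal for the actual interface):

  #include <pthread.h>
  #include <stdint.h>

  static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  flushed  = PTHREAD_COND_INITIALIZER;
  static uint64_t last_queued  = 0;  /* highest txn in the log buffer */
  static uint64_t last_flushed = 0;  /* highest txn durable on disk   */
  static int      flush_in_progress = 0;

  /* Called after the transaction was appended to the log buffer. */
  void redundancy_commit_durably(uint64_t my_id)
  {
    pthread_mutex_lock(&log_lock);
    if (my_id > last_queued)
      last_queued = my_id;

    while (last_flushed < my_id)
    {
      if (!flush_in_progress)
      {
        /* Become the group leader: flush everything queued so far,
           covering our own transaction and any that piled up behind T1. */
        uint64_t flush_up_to = last_queued;
        flush_in_progress = 1;
        pthread_mutex_unlock(&log_lock);

        log_fsync();               /* one flush commits T2, ..., Tn */

        pthread_mutex_lock(&log_lock);
        last_flushed = flush_up_to;
        flush_in_progress = 0;
        pthread_cond_broadcast(&flushed);
      }
      else
      {
        /* A flush is already running; wait for it and re-check. */
        pthread_cond_wait(&flushed, &log_lock);
      }
    }
    pthread_mutex_unlock(&log_lock);
  }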
So this means that T2 must be able to complete engine->txn_prepare() while T1 is still running inside redundancy_service->pre_commit(). There cannot be a global lock around all of the above code.
So this in turn means that T3 will be able to complete engine->txn_prepare() _before_ T2, even though T3 will commit _after_ T2 (at least in terms of redundancy service).
So the question is if T3 can complete engine->txn_prepare() before T2, yet complete txn_commit() after T2. I do not see why not, but I do not know if this is actually the case; I will need to check.
Hm, how is this different from how it is done currently in MariaDB? Does txn_commit() have to follow the same order as txn_prepare()? If not, then the commit ordering imposed by the redundancy service should not be a problem.
By the way, is it even necessary to have redundancy_service->post_commit()? It seems to me that it probably is not needed?
(Though something like pre_commit() and post_commit() will probably be needed to support the SQL-level 2-phase commit/XA. This is the SQL syntax the user can use to do two-phase commit between multiple systems used in the application).
During local transaction execution some resources will inevitably be allocated for the transaction by the redundancy service (row modifications, etc.), there might be some buffers shared between the engine and the redundancy service (opaque to the redundancy service indeed), and finally, in the above code redundancy_service->pre_commit() locks commit_order_lock. The purpose of redundancy_service->post_commit() is to release those resources and the lock.

Moreover, you will need redundancy_service->post_rollback() to release resources in the case of transaction rollback - I didn't show it in the pseudocode for simplicity.

Regards,
Alex

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011