Alex, I think this discussion is getting really interesting, and I understand your points much better now. Thanks for your many comments!

Alex Yurchenko <alexey.yurchenko@codership.com> writes:
On Wed, 17 Mar 2010 10:48:50 +0100, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
So what is not clear to me is how the IDs get assigned to an RS, and to an RS History. Is it assigned by the DBA in server configuration? Is it assigned automatically by the server? Or assigned somehow by the redundancy service?
Ok, that's a good question. The way I see it is
<snip>
underwent different sequences of changes. So here comes the RS History ID. It has to be globally unique and is generated by the redundancy service (because it is the one that knows what's going on), just like the global transaction sequence number. E.g. in Galera it is a time-based UUID.
Ok, so I think this means that the redundancy service will generate a new RS History ID when it starts a new RS History. Exactly when and how this happens will then depend on the particular plugin.
Suppose the redundancy service is invoked and it logs (or whatever) the changeset, and then the machine crashes before the engine(s) have time to commit. When the server comes back up, how do we avoid a state where the redundancy service (and hence the slaves) have the change, but the master does not?
The ability to implement reliable crash-recovery is something I see requested a lot, and I think it is really important.
Well, this is what the redundancy service is for: when you crash, you can recover, can't you? My understanding is that you do want the redundancy service to have the change, so that you can take it from there and commit the uncommitted. This is how Galera works, and I believe this is how semi-sync in MySQL works (there's hardly any other way for synchronous replication to work).
Aha, now I see what you mean! So the redundancy service would provide the durability, using its transaction log or whatever. This is quite different from current MySQL, where every engine, as well as the binlog code, is individually responsible for durability, and consequently the 2-phase commit is needed to ensure that everyone has the same idea of what should be durable.

So this is actually quite an appealing idea! With this system, each storage engine (and secondary replication plugins like the binlog) would no longer need to ensure that transactions are durable, just that they are able to recover after a crash into a consistent state. Concretely, this could translate into (for example) setting innodb_flush_log_at_trx_commit=0 without risking losing transactions. So no fflush() per commit in the engine (and therefore, I think, no need for group commit there). Only the (primary) redundancy service would need to fflush() (or whatever) to achieve durability.

So this is quite interesting! The whole issue of coordinating the state between engines, the master binlog thread, and the slave threads during crash recovery is quite a mess in current MySQL, and this indeed sounds much cleaner. So let me list some potential issues with this that spring to my mind, and see how they could be resolved:

1. On crash recovery, the engines must be able to tell where in the RS History they have recovered themselves to, so that they can correctly be supplied with the remaining transactions that they must apply. The best way to do this is probably to extend the storage engine API (and the engines) with a facility for this, so that the engines can log it in their transaction log. However, it seems simple enough to emulate it for engines that do not implement such an extension: it should be sufficient to create a hidden table inside the engine with a single "transaction_id" column, and just insert a new row with the ID just before committing in the engine.
Now, engine commit must be expected to fail from time to time, so it seems hard to supply the engine with the global transaction ID (as that will not be generated until engine commit_prepare has succeeded). But we don't actually need the _global_, linearly ordered ID for this. The local transaction ID generated at the start of the transaction should be sufficient, as long as the redundancy service records the mapping between it and the global transaction IDs. And of course, if we choose to roll back later, there will simply be no matching global transaction ID (which is fine). So this sounds doable.

2. Multi-engine transactions. If a transaction spans multiple engines, during crash recovery each engine may recover to a different point in time. In this case there will be transactions that need to be replayed against engine E1 but not against E2. For row-based events this seems doable: it should be possible to filter the events to include only those for tables in the relevant engine. For statement-based events, however, this is a problem. E.g. a multi-table update against multiple engines will be hard to apply to just one engine ...

Anyway, I think this is not a showstopper. It seems ok to say that statement-based replication of multi-engine transactions will not guarantee reliable crash recovery, especially since InnoDB is currently the only transactional engine that supports statement-based replication ... and non-transactional engines are in any case impossible to crash-recover reliably, of course.

So this is definitely an interesting idea, quite different from how current MySQL replication/binlogging works.
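To make the recovery idea in points 1 and 2 concrete, here is a minimal sketch (in Python, with entirely made-up names; nothing here is an actual server or Galera API) of a redundancy service that records the local-to-global transaction ID mapping at commit time, and on crash recovery hands an engine only the events it is missing, filtered to that engine's tables:

```python
# Illustrative sketch only: a toy "redundancy service" that assigns
# global transaction IDs at commit, remembers the mapping from the
# engine-local transaction ID (known at txn start) to the global ID,
# and can replay missing events per engine after a crash.

class RedundancyService:
    def __init__(self):
        self.next_global_id = 1
        self.local_to_global = {}   # local txn id -> global txn id
        self.log = []               # (global_id, engine, row_event)

    def commit(self, local_id, engine, row_event):
        """Assign the next global ID and log the (row-based) event."""
        gid = self.next_global_id
        self.next_global_id += 1
        self.local_to_global[local_id] = gid
        self.log.append((gid, engine, row_event))
        return gid

    def events_to_replay(self, engine, recovered_to):
        """Events an engine is missing after crash recovery: everything
        with a larger global ID, restricted to that engine's tables."""
        return [(gid, ev) for gid, eng, ev in self.log
                if eng == engine and gid > recovered_to]
```

A rolled-back transaction simply never calls commit(), so its local ID ends up with no matching global ID, which is exactly the "fine" case described above.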
My thinking is that while the engines need to have a guarantee that the binlog write was successful, the binlog (or any other redundancy service) does not need the same guarantee from the engines. So instead of full-blown 2PC you need a simple if/else clause:
if (redundancy_service->pre_commit() == SUCCESS)
{
  /* commit */
}
else
{
  /* rollback */
}
Yes. Until now, I did not consider the possibility that crashed engines could be recovered from the local log in the redundancy service.
Of course, this means that there can only be one redundancy service plugin at a time, doesn't it?
It sounds like that in the general case. But realistically, there'd be few plugins that have to decide on the global transaction ID; the others can be told. E.g. unlike a multi-master replication plugin, the binlog plugin can easily log events in the order it is told to. So you can effectively stack redundancy plugins, putting the one that decides on the global transaction ID on top; the others just follow the defined order. So you could have the following stack:
multi-master synchronous plugin (for HA)
    |
master-slave asynchronous plugin (for fanning out)
    |
binlog plugin (for local recovery after crash)
Yes, agree.
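The stacking idea can be sketched like this (Python, purely illustrative names and interfaces, not an actual plugin API): only the top plugin assigns the global transaction ID; the plugins below just log in the order they are handed transactions.

```python
# Toy sketch of stacked redundancy plugins. The top plugin decides the
# global transaction ID (e.g. a synchronous multi-master plugin); the
# followers (e.g. a binlog plugin) log events in the order given.

class TopPlugin:
    """Decides the global transaction ID."""
    def __init__(self):
        self.next_id = 1

    def pre_commit(self, txn):
        txn["global_id"] = self.next_id
        self.next_id += 1
        return True

class FollowerPlugin:
    """Just logs events in the order it is told to."""
    def __init__(self):
        self.logged = []

    def pre_commit(self, txn):
        self.logged.append(txn["global_id"])
        return True

def commit_through_stack(stack, txn):
    """Run pre_commit() top-down; any failure aborts the commit."""
    for plugin in stack:
        if not plugin.pre_commit(txn):
            return False
    return True
```

Because the followers never make ordering decisions, adding or removing one does not change the replication stream, which is what makes the stack composable.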
But true, it is essential that the storage engine(s) and redundancy service(s) commit in the same order. It is however not yet clear to me that asking the redundancy service first is sufficient for this, nor that it is necessary.
As far as I understand, any (semi)synchronous replication requires that.
Right ... I have to study (semi)synchronous replication techniques a bit more, I'm not too familiar with them.
Hm. I'm not well versed in MySQL engine interfaces, so I might be missing something here. But why can't we have something like this instead:
engine->txn_prepare(thd);

if (redundancy_service->pre_commit(thd) == SUCCESS)
{
  engine->txn_commit(thd);
  redundancy_service->post_commit(thd);
}
else
{
  engine->txn_rollback(thd);
}
The pre_commit() call locks commit_order_lock in the right order. post_commit() releases commit_order_lock and the resources that might be allocated by the transaction in the redundancy service. Or, if txn_commit() is too heavy an operation, then maybe like this:
engine->txn_prepare(thd);

if (redundancy_service->pre_commit(thd) == SUCCESS)
{
  engine->txn_fix_order(thd);
  redundancy_service->post_commit(thd);
  engine->txn_commit(thd);
}
else
{
  engine->txn_rollback(thd);
}
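Here is a runnable sketch of the first variant (in Python; the engine and service are stand-ins I made up, not real server interfaces): pre_commit() takes a global commit_order_lock right after assigning the transaction its place in the stream, and post_commit() releases it, so engine commits are forced into the same order as the replication events.

```python
# Illustrative sketch of the pre_commit()/post_commit() protocol.
# Holding commit_order_lock from pre_commit() to post_commit() forces
# engine commits into global-transaction-ID order.

import threading

class RedundancyService:
    def __init__(self):
        self.commit_order_lock = threading.Lock()
        self.next_global_id = 1
        self.order = []   # global IDs in the order post_commit() ran

    def pre_commit(self, txn):
        # Take the commit-order lock, then decide the transaction's
        # place in the replication stream.
        self.commit_order_lock.acquire()
        txn["global_id"] = self.next_global_id
        self.next_global_id += 1
        return True

    def post_commit(self, txn):
        self.order.append(txn["global_id"])
        self.commit_order_lock.release()

class Engine:
    def __init__(self):
        self.committed = []
    def txn_prepare(self, txn):
        pass
    def txn_commit(self, txn):
        self.committed.append(txn["global_id"])
    def txn_rollback(self, txn):
        pass

def commit(engine, rs, txn):
    engine.txn_prepare(txn)
    if rs.pre_commit(txn):
        engine.txn_commit(txn)   # runs under commit_order_lock
        rs.post_commit(txn)
    else:
        engine.txn_rollback(txn)
```

The obvious cost is visible in the sketch: the engine commit happens while commit_order_lock is held, which serializes commits; that is exactly the concern behind the second, txn_fix_order() variant.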
There are (at least) two issues of interest:

1. Can engines commit in a different order from the redundancy service (and if they can, does it matter)?

2. Can we support group commit (or obtain equivalent performance in some other way)?

To provide durability, the redundancy service will typically need to fflush() when it commits and generates the global transaction ID. So having group commit means that while such an fflush() for transaction T1 is running, other transactions T2, T3, ..., Tn are free to start (though not complete, of course) committing in the redundancy service. They can queue up, and once the T1 fflush() is done, a single fflush() will be sufficient to commit all of T2, ..., Tn.

So this means that T2 must be able to complete engine->txn_prepare() while T1 is still running inside redundancy_service->pre_commit(). There cannot be a global lock around all of the above code. This in turn means that T3 will be able to complete engine->txn_prepare() _before_ T2, even though T3 will commit _after_ T2 (at least in terms of the redundancy service). So the question is whether T3 can complete engine->txn_prepare() before T2, yet complete txn_commit() after T2. I do not see why not, but I do not know if this is actually the case; I will need to check.

By the way, is it even necessary to have redundancy_service->post_commit()? It seems to me that it probably is not needed? (Though something like pre_commit() and post_commit() will probably be needed to support the SQL-level 2-phase commit/XA. This is the SQL syntax the user can use to do two-phase commit between multiple systems used in the application.)

 - Kristian.
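P.S. The group-commit payoff described above can be sketched very simply (Python, hypothetical names, single-threaded for clarity): while one flush is in progress, later transactions queue up, and one subsequent flush makes the whole queued batch durable. Counting flushes shows the saving.

```python
# Toy sketch of group commit in the redundancy service's log: one
# fflush()-equivalent makes every queued transaction durable at once,
# so n queued transactions cost one flush instead of n.

class GroupCommitLog:
    def __init__(self):
        self.queue = []      # transactions waiting for the next flush
        self.flushes = 0     # how many (expensive) flushes we did
        self.durable = []    # transactions made durable, in order

    def enqueue(self, txn):
        """Called by a committing transaction; cheap, no flush yet."""
        self.queue.append(txn)

    def flush(self):
        """One flush commits the whole queued batch at once."""
        self.durable.extend(self.queue)
        self.queue.clear()
        self.flushes += 1
```

With T2, T3, T4 queuing up while T1's flush runs, the four transactions cost two flushes total instead of four, which is the whole point of allowing txn_prepare() to proceed concurrently.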