Hello Kristian,

Thanks for the replies to both emails. I think at this point my biggest interest is learning how to ensure that in Maria 10.0, we can choose to not fsync our log on commit. If I can do this without having to implement commit_ordered, I would like to.

I think what you are stating is that if the handlerton does some bookkeeping of transactions that have been prepared, then when commit_checkpoint_request is called, we can use this bookkeeping to properly call commit_checkpoint_notify_ha when all the prepared transactions finish committing. This bookkeeping would need to be done anyway, so it might as well be done in our handlerton so that complexity is reduced for engines that implement commit_ordered. If this is accurate, then this sounds reasonable and I look forward to this working for 10.0.

I understand and agree with your point about unnecessary stalls when the binary log is rotating. I also (now) realize that MySQL 5.6 serializes commits, which is probably a performance issue for us. I have started a thread on the MySQL internals list and hope to learn more on that thread.

Your understanding of my explanation about TokuDB's commits is accurate.

At this point, here is what I hope we can do for 10.0:
- not implement commit_ordered
- do some bookkeeping in our handlerton to be able to implement commit_checkpoint_request

I hope this leads to reduced fsyncs for our engine.

Thank you
-Zardosht

On Thu, Feb 21, 2013 at 9:36 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Zardosht Kasheff <zardosht@gmail.com> writes:
> Reading the email, I think this is what is happening. You depend on commit_ordered to order the transactions in the engine, and when the binary log is going to rotate, you call commit_checkpoint_request on the last transaction in that order. When that returns, we know all transactions in the binlog have been committed to disk and the binary log may be rotated.
>
> Is this accurate?
Close, but not quite.
We do not wait for anything before rotating the binlog, as that would unnecessarily stall subsequent commits. But we do ask the storage engines to let us know when all transactions in the previous log file have been durably committed. Until then, we need to scan two binlog files in case of crash recovery, the old one and the new one. Once the storage engines tell us that everything is durable, we write a marker in the new log that the old log is no longer needed.
The implementation and API are quite asynchronous in this respect.
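The recovery-side consequence of this asynchronous protocol can be sketched as follows. This is a hypothetical illustration, not the actual MariaDB binlog code: until the new binlog file carries the "everything older is durable" marker, crash recovery must scan both the new file and the one before it.

```cpp
#include <string>
#include <vector>

// Hypothetical model of a binlog file for illustration only.
struct BinlogFile {
    std::string name;
    bool has_checkpoint_marker;  // "all transactions in older files are durable"
};

// Return the files crash recovery must scan, newest first: walk backwards
// from the newest file and stop at the first one carrying a checkpoint
// marker, since everything older is already durable in the engines.
std::vector<std::string> files_to_scan(const std::vector<BinlogFile>& logs) {
    std::vector<std::string> out;
    for (auto it = logs.rbegin(); it != logs.rend(); ++it) {
        out.push_back(it->name);
        if (it->has_checkpoint_marker)
            break;  // older files no longer needed for recovery
    }
    return out;
}
```

Until the engines report durability and the marker is written, the scan covers two files; once the marker lands in the new file, the old one drops out of recovery entirely.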
> If so, then perhaps the ordering is adding an unnecessary constraint.
Yes, I think you are right. You have to understand, when I implemented this, I did not really worry about storage engines that do not implement commit_ordered(), because the intention is that all up-to-date engines will want to do this anyway. So it looks easy to make this particular feature work without commit_ordered(); I just did not consider it before.
> How would the following work:
> - when the binary log is to be rotated, wait for all transactions that are in the process of committing to commit.
I do not want to do this, as it introduces unnecessary stalling.
> - call each handlerton to ensure all committed transactions are durable. For TokuDB, this would mean fsyncing our recovery log. In
We can still do this.
The contract around commit_checkpoint_request() is that the storage engine must not reply until all transactions that have returned from commit_ordered() have become durable. If you do not implement commit_ordered(), this is hard to track in the engine, because commit() may not yet have been called for a transaction that still needs to become durable.
But instead, you can look at all transactions that have returned from prepare(). Any transaction that has reached commit_ordered() will first have done prepare(). Or even just all transactions that have started at all! So just wait until every transaction that has been prepared has durably committed (or been durably rolled back). At that point, invoke commit_checkpoint_notify_ha(). It does not matter if this takes a long time. Any delay has no worse consequences than having to scan a bit more of the binlog if we crash.
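A minimal sketch of this bookkeeping, with hypothetical names rather than the real TokuDB or MariaDB code (the callback stands in for commit_checkpoint_notify_ha()): hand each transaction a ticket at prepare() time, and answer a checkpoint request once every ticket issued before the request has been retired by a durable commit or rollback.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <set>
#include <vector>

// Hypothetical tracker, for illustration only. Since every transaction that
// might be in the binlog has gone through prepare(), waiting for all
// currently-prepared transactions is sufficient to answer the request.
class CheckpointTracker {
public:
    using NotifyFn = std::function<void()>;  // stand-in for commit_checkpoint_notify_ha()

    // Called from prepare(): register the transaction, return its ticket.
    uint64_t on_prepare() {
        std::lock_guard<std::mutex> g(mu_);
        uint64_t id = next_id_++;
        outstanding_.insert(id);
        return id;
    }

    // Called once the transaction is durably committed or rolled back.
    void on_durable(uint64_t id) {
        std::lock_guard<std::mutex> g(mu_);
        outstanding_.erase(id);
        fire_ready_locked();
    }

    // Called from commit_checkpoint_request(): record the high-water mark;
    // notify once every ticket issued before this point has been retired.
    void on_checkpoint_request(NotifyFn notify) {
        std::lock_guard<std::mutex> g(mu_);
        pending_.push_back({next_id_, std::move(notify)});
        fire_ready_locked();  // nothing may be outstanding at all
    }

private:
    struct Pending { uint64_t barrier; NotifyFn notify; };

    void fire_ready_locked() {
        // Oldest still-outstanding prepared transaction (or "infinity").
        uint64_t oldest = outstanding_.empty() ? UINT64_MAX : *outstanding_.begin();
        while (!pending_.empty() && pending_.front().barrier <= oldest) {
            pending_.front().notify();
            pending_.erase(pending_.begin());
        }
    }

    std::mutex mu_;
    uint64_t next_id_ = 1;
    std::set<uint64_t> outstanding_;   // tickets of prepared, not-yet-durable txns
    std::vector<Pending> pending_;     // checkpoint requests awaiting an answer
};
```

The point of the barrier is exactly the relaxed contract above: a request only waits for transactions prepared before it arrived, and transactions prepared afterwards never delay it.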
For example, maybe you can just wait for your next checkpoint to complete, and invoke commit_checkpoint_notify_ha() at that time, assuming checkpoint makes transactions durable.
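That simpler strategy can be sketched too, again with hypothetical names and the callback standing in for commit_checkpoint_notify_ha(), under the assumption that a checkpoint makes every transaction prepared before it started durable: queue the requests, and answer them after the first checkpoint that began after they arrived.

```cpp
#include <functional>
#include <iterator>
#include <mutex>
#include <vector>

// Hypothetical helper, for illustration only: piggyback checkpoint-request
// answers on the engine's own periodic checkpoints.
class CheckpointNotifier {
public:
    using NotifyFn = std::function<void()>;  // stand-in for commit_checkpoint_notify_ha()

    // Server calls commit_checkpoint_request(); we just remember it.
    void on_checkpoint_request(NotifyFn notify) {
        std::lock_guard<std::mutex> g(mu_);
        pending_.push_back(std::move(notify));
    }

    // Engine begins a checkpoint: only requests already queued are covered,
    // since this checkpoint makes everything prepared before *now* durable.
    void on_checkpoint_start() {
        std::lock_guard<std::mutex> g(mu_);
        in_progress_.insert(in_progress_.end(),
                            std::make_move_iterator(pending_.begin()),
                            std::make_move_iterator(pending_.end()));
        pending_.clear();
    }

    // Checkpoint finished: answer the requests it covered.
    void on_checkpoint_complete() {
        std::vector<NotifyFn> ready;
        {
            std::lock_guard<std::mutex> g(mu_);
            ready.swap(in_progress_);
        }
        for (auto& fn : ready) fn();
    }

private:
    std::mutex mu_;
    std::vector<NotifyFn> pending_;      // requests not yet covered by a checkpoint
    std::vector<NotifyFn> in_progress_;  // requests the running checkpoint covers
};
```

Snapshotting the queue at checkpoint start matters: a request that arrives mid-checkpoint may cover transactions the running checkpoint does not, so it waits for the next one.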
We do not have to change anything in the MariaDB code for this to work, just update the comments defining the contract between server and storage engine. It is just a matter of ensuring that commit_checkpoint_notify_ha() is only called once every transaction that might have been written to the binlog before commit_checkpoint_request() was called has been made durable.
Does this sound reasonable?
> MySQL 5.6, we intend to use the flush logs command to do this.
Yes, MySQL 5.6 does not allow new commits to proceed while waiting for the old binlog to be rotated.
- Kristian.