[Maria-developers] understanding commit and prepare APIs in MariaDB 5.5
Hello all,

I am investigating what our storage engine, TokuDB, needs to do to achieve its best performance in MariaDB 5.5 with respect to preparing and committing transactions, and with respect to binary logging.

As I understand it, group commit was added to the binary log in MariaDB 5.3, and it is also in MariaDB 5.5. Additionally, new APIs prepare_ordered and commit_ordered have been added. I am trying to understand how these fit together.

From reading http://kristiannielsen.livejournal.com/12408.html and comments in handler.h, here is what I THINK I know:

- In MariaDB 5.5, we must fsync our recovery log on handlerton->prepare and handlerton->commit.
- If we want to ensure that the order in which we commit transactions is the same order in which they appear committed in the binary log, then we should perform our commit in two phases, commit_ordered and commit, where commit_ordered is serialized under a global mutex that ensures the correct order.
- InnoDB must do this for hot backup to work.

Here is my big question: given that TokuDB is not designed to work with InnoDB's hot backup solution, is there ANY reason why we should care about the commit order of our transactions? Are there any performance reasons related to group commit?

From everything I read, I think the answer is no, but I would appreciate confirmation. I think there are no performance gains to be made by implementing commit_ordered, and given that we fsync on commit, there should be no issues with syncing our engine up with the binary log after a crash.

Also, I assume the fsync after commit is necessary. Only in 10.0 will some new functionality be available that allows us to not fsync on commit.

Is my understanding correct?

Thanks
-Zardosht
Zardosht Kasheff <zardosht@gmail.com> writes:
> I am investigating what our storage engine, TokuDB, needs to do to achieve its best performance in MariaDB 5.5 with respect to preparing and committing transactions, and with respect to binary logging.
Great!
> - In MariaDB 5.5, we must fsync our recovery log on handlerton->prepare and handlerton->commit
Yes. Without this, binlog and engine may become out-of-sync with each other after a crash.
> - If we want to ensure that the order in which we commit transactions is the same order in which they appear committed in the binary log, then we should perform our commit in two phases, commit_ordered and commit, where commit_ordered is serialized under a global mutex that ensures the correct order
Yes, I think you have the right picture. A detail is that serialisation is achieved by running commit_ordered() for many transactions in a simple loop in one thread. The end result is much the same, but the single-thread approach performs significantly better (I remember blogging about this at some point).
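Roughly, the server-side loop looks something like this - a simplified sketch only, not the actual MariaDB code, with illustrative names (trx_entry, run_commit_ordered) and assuming the server's THD and handlerton types:

  /* One "leader" thread walks the queue of transactions that are ready
     to commit, in binlog commit order, and runs the ordered part of
     commit for each; the other threads just wait to be woken up. */
  struct trx_entry
  {
    THD *thd;
    trx_entry *next;                      /* queued in binlog commit order */
  };

  static void run_commit_ordered(handlerton *hton, trx_entry *queue)
  {
    for (trx_entry *e= queue; e != NULL; e= e->next)
    {
      /* Serialised by construction: one thread, queue order. No global
         mutex per transaction, and far fewer context switches. */
      hton->commit_ordered(hton, e->thd, /* all= */ true);
    }
    /* Each waiting thread then wakes up and finishes the slow,
       parallelisable part in its own commit() call. */
  }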
> Here is my big question: given that TokuDB is not designed to work with InnoDB's hot backup solution, is there ANY reason why we should care about the commit order of our transactions? Are there any performance reasons related to group commit?
Well, I think the answer you are looking for is "no". I took a lot of care when designing this stuff to preserve backwards compatibility in the API, and to make the commit_ordered extensions optional. Things should work fine (and if not, let us know and we will fix it).

One reason to implement it is this:

https://kb.askmonty.org/en/enhancements-for-start-transaction-with-consisten...

For example, MariaDB 5.5 has completely non-blocking mysqldump --master-data backup, but only if the storage engine implements commit_ordered(), similar to InnoDB hot backup. But yes, a storage engine may not care about this, and that is fine.

On the other hand, if there is any reason for TokuDB _not_ to implement commit_ordered, I would like to know; then maybe I could fix it. After working and thinking a lot about these issues, I have realised that there are a lot of fundamental benefits to having synchronised commit order between binlog and engines. I think you are right that in 5.5, implementing commit_ordered gives only a small performance improvement, if any. But in future versions there might be significant performance gains from implementing commit_ordered, as well as usability gains.

Maybe you could give me a short overview of how commit works in TokuDB? I imagine most transactional engines have some sort of serialisation around commit to reserve a slot in the transaction log or whatever (or does TokuDB have multiple independent redo logs?). If such serialisation is done in the commit_ordered() call, then it can run in the same thread for multiple transactions, which reduces the need for context switches.

So the intention was that it should be trivial for a transactional engine to support commit_ordered(). I modified both PBXT and InnoDB for this, and it was just a few lines of code to split out the relevant parts of the existing commit() method (see the sketch below). So if TokuDB is harder, I would like to know.

Of course, there is not much I can do about MySQL 5.6 deciding to implement a very similar algorithm with a completely incompatible API, causing much grief for 3rd party storage engines :-(
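To make the split concrete, here is the kind of refactoring I mean, for a hypothetical engine (all the trx and log_* helpers are made up for illustration; they are not from InnoDB or PBXT):

  /* Before: everything in commit(). */
  static int my_commit(handlerton *hton, THD *thd, bool all)
  {
    my_trx *trx= trx_from_thd(thd);
    log_reserve_commit_slot(trx);   /* serialised part: fixes commit order */
    log_write_commit_record(trx);
    log_fsync();                    /* slow part */
    release_row_locks(trx);
    return 0;
  }

  /* After: the order-fixing part moves into commit_ordered(), which the
     server calls in binlog commit order; the rest stays in commit() and
     runs in parallel. */
  static void my_commit_ordered(handlerton *hton, THD *thd, bool all)
  {
    my_trx *trx= trx_from_thd(thd);
    log_reserve_commit_slot(trx);   /* now implicitly in binlog order */
    log_write_commit_record(trx);
  }

  static int my_commit(handlerton *hton, THD *thd, bool all)
  {
    my_trx *trx= trx_from_thd(thd);
    log_fsync();                    /* group-fsync friendly */
    release_row_locks(trx);
    return 0;
  }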
> From everything I read, I think the answer is no, but I would appreciate confirmation. I think there are no performance gains to be made by implementing commit_ordered, and given that we fsync on commit, there should be no issues with syncing our engine up with the binary log after a crash.
Yes, unless you can gain a bit of performance at high concurrency by piggybacking your own serialisation needs on the commit_ordered() thread, thereby avoiding some context switches. Which is by no means certain.
> Also, I assume the fsync after commit is necessary. Only in 10.0 will
Yes, at least theoretically. It is only necessary to fsync the very last commit in the binlog before binlog rotation. However, with no way to know when such rotation is about to occur, you are left with no option but to fsync always :-(
> some new functionality be available that allows us to not fsync on commit.
Yes, so this is a huge performance gain, of course. And to use this in 10.0, you will need to implement commit_ordered(). Let me explain why; maybe this will be useful to you.

So if we crash, we have to do binlog recovery at next startup to be sure that we have the exact same set of committed transactions in binlog and in engines. To do this we scan the binlog, commit any prepared transactions in engines that are in the binlog, and roll back the rest.

Now the issue is - how far back in the binlog should we look? We cannot keep binlogs around forever and scan them all at every crash, after all.

So we need some way to know from a storage engine when a commit has been fsync()ed to disk and become durable. At that point, we know we no longer need to scan for that particular commit in the binlog.

The method used in 5.5 and below is horribly crude: when commit() returns, we know that this particular commit has become durable.

In MariaDB 10.0, this is refined. When we rotate the binlog, we ask the storage engine to notify us when all commits in the old binlog file have become durable. When they have, we no longer need to scan the old binlog file at next crash recovery.

By assuming consistent commit order, this is easy. We just call commit_checkpoint_request(); the storage engine waits until the latest commit has become durable, then replies with commit_checkpoint_notify_ha() (a rough sketch of the engine side is below). I did not want to add the overhead of such notification for _every_ commit, only for the occasional binlog rotation. Since commits are ordered, we need only ask for the last commit in the binlog; then we know all the prior commits are durable as well.

So if you want to benefit from no fsync in the commit() method in MariaDB 10.0, then you need to implement commit_ordered(). And then you might as well do it for 5.5 also; the commit_ordered() API is the same.

If this is a problem for you, let me know. Now that I think about it, I believe we can perhaps relax the requirement of commit_checkpoint_request() to be that all currently prepared transactions have become durably committed; then there is no need to implement commit_ordered.

But the point is, when implementing new performance improvements, we will generally assume that commit_ordered is implemented, and not spend effort on older engines with no commit_ordered() (except to ensure they still work).
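On the engine side, the protocol could be handled roughly like this (a simplified, illustrative sketch: it assumes a single redo log with a monotonic LSN, and all the helper names are made up). The important property is that commit_checkpoint_request() does not block; the reply comes later, from a background thread:

  static void my_commit_checkpoint_request(handlerton *hton, void *cookie)
  {
    /* Commits are ordered, so the current end-of-log LSN covers every
       commit that can already be in the binlog at this point. */
    lsn_t target= log_get_end_lsn();
    pending_checkpoint_add(target, cookie);   /* remember the request */
  }

  /* In the background flush thread, after each log fsync: */
  static void after_log_flush(handlerton *hton, lsn_t flushed_lsn)
  {
    void *cookie;
    while ((cookie= pending_checkpoint_pop_up_to(flushed_lsn)) != NULL)
      commit_checkpoint_notify_ha(hton, cookie);  /* asynchronous reply */
  }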
> Is my understanding correct?
Basically yes. Which makes me happy, as that means my existing communications on this subject have at least been somewhat adequate :-) Hopefully my additional details were useful to you as well.
> Thanks -Zardosht
BTW, thanks for taking up these discussions, I really appreciate your thoughts and feedback.

 - Kristian.
Kristian, thank you for the detailed reply. My MariaDB 5.5 questions have been answered. So in this email, I will move on and explain a bit about what TokuDB does during commit. In another email, I will continue the discussion on 10.0.

Here is a very high level overview.

In TokuDB, all work to modify a table is done by sending messages into a tree, and those messages trickle down in buffers. A nice explanation is in http://www.tokutek.com/wp-content/uploads/2012/11/Fractal-Tree-Technology-an..., the key part beginning with the slide "Now, What's a B-Tree? & a Fractal Tree Index".

Additionally, checkpoints occur in two phases. The first phase is short; it locks out many client operations and sweeps through memory, marking all the data as needing a checkpoint. This defines the state of the checkpoint. So no new work may be done on any data in memory until we can ensure that its checkpointed state will be saved to disk. The second phase is long and does the work of writing dirty data to disk.

On commit we do the following (sketched in pseudocode at the end of this mail):
- grab a read lock to prevent the short phase of checkpoint, called "begin checkpoint", from starting
- log that the transaction has committed in our recovery log
- for every write done provisionally under this transaction, send a message into the dictionary that states the transaction has committed. This is a simplification and we have some optimizations, but algorithmically, this is accurate. This is done so that when the message makes it to the appropriate entry in a leaf node, the system knows the value that was once provisionally inserted with this transaction is now committed. We do a similar thing with aborts to let the system know that the value should be rolled back.
- remove the transaction from our list of live transactions
- unlock whatever is necessary
- release the read lock that prevents begin checkpoint
- fsync the log

So I have been thinking about what it would take to implement a commit_ordered, and while it is likely possible (because I believe anything is possible given the motivation), I don't see an elegant way. In the above steps, everything that happens between the grab and release of the begin-checkpoint read lock has to be there. We cannot release the begin-checkpoint read lock in between, or we may have bugs during recovery. We cannot do all of this work in commit_ordered, because the stage that sends messages for every write may be expensive, thereby hurting concurrency. So, for such a thing to work, we would have to find a way to grab the read lock in commit_ordered once for each transaction (and because the lock is fair, we can't just regrab the lock on the same thread), write to the recovery log, then perform everything else under commit, then release the read lock. It can probably be done, but it is messy. If unnecessary, I prefer not to do it.
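As promised above, here is the commit path in rough pseudocode (simplified, with made-up names; the real code has more optimizations):

  void tokudb_commit(TXN *txn)
  {
    checkpoint_begin_lock.read_lock();   // keep "begin checkpoint" out

    log_commit(txn);                     // record the commit in the recovery log

    for (WRITE *w : txn->provisional_writes)
      send_committed_message(w);         // may be expensive; runs in parallel
                                         // across committing threads

    live_txns.remove(txn);               // no longer a live transaction
    release_locks(txn);
    checkpoint_begin_lock.read_unlock(); // checkpoint may start again
    log_fsync();                         // make the commit durable
  }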
In this email, I will continue the discussion on 10.0, as I think my questions for 5.5 have been resolved.

Reading the email, I think this is what is happening. You depend on commit_ordered to order the transactions in the engine, and when the binary log is going to rotate, you call commit_checkpoint_request on the last transaction in that order. When that returns, we know all transactions in the binlog have been committed to disk and the binary log may be rotated.

Is this accurate? If so, then perhaps the ordering is adding an unnecessary constraint. How would the following work:
- when the binary log is to be rotated, wait for all transactions that are in the process of committing to commit.
- call each handlerton to ensure all committed transactions are durable. For TokuDB, this would mean fsyncing our recovery log. In MySQL 5.6, we intend to use the flush logs command to do this.
- after the last two steps have completed, the binary log may be rotated.

This allows storage engines to get the benefit of not having to fsync on every commit while also not having to implement commit ordering. How does this compare to existing design ideas?

Thanks
-Zardosht
Zardosht Kasheff <zardosht@gmail.com> writes:
> Reading the email, I think this is what is happening. You depend on commit_ordered to order the transactions in the engine, and when the binary log is going to rotate, you call commit_checkpoint_request on the last transaction in that order. When that returns, we know all transactions in the binlog have been committed to disk and the binary log may be rotated.
> Is this accurate?
Close, but not quite.

We do not wait for anything before rotating the binlog, as that would unnecessarily stall subsequent commits. But we do ask the storage engines to let us know when all transactions in the previous log file have been durably committed. Until then, we need to scan two binlog files in case of crash recovery, the old one and the new one. Once the storage engines tell us that everything is durable, we write a marker in the new log that the old log is no longer needed.

The implementation and API are quite asynchronous in this respect.
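Very roughly, the server-side flow is something like this (an illustrative sketch with made-up names, much simplified from the real code):

  void binlog_rotate_done(binlog_file *old_file)
  {
    /* Do not wait for anything; just ask each engine to tell us later. */
    for (handlerton *hton : engines_with_transactions_in(old_file))
      hton->commit_checkpoint_request(hton, make_cookie(old_file, hton));
  }

  /* Called by an engine, possibly much later, from any thread: */
  void commit_checkpoint_notify_ha(handlerton *hton, void *cookie)
  {
    binlog_file *old_file= file_from_cookie(cookie);
    if (all_engines_replied(old_file))
      write_binlog_checkpoint_event(old_file);  /* marker: the old file is
                                                   no longer needed for
                                                   crash recovery */
  }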
> If so, then perhaps the ordering is adding an unnecessary constraint.
Yes, I think you are right. You have to understand, when I implemented this, I did not really worry about storage engines that do not implement commit_ordered(), because the intention is that all up-to-date engines will want to do this anyway. So it looks easy to make this particular feature work without commit_ordered(); I just did not consider it before.
> How would the following work:
> - when the binary log is to be rotated, wait for all transactions that are in the process of committing to commit.
I do not want to do this, as it introduces unnecessary stalling.
> - call each handlerton to ensure all committed transactions are durable. For TokuDB, this would mean fsyncing our recovery log. In
We can still do this.

The contract around commit_checkpoint_request() is that the storage engine must not reply until all transactions that have returned from commit_ordered() have become durable. If you do not implement commit_ordered(), then this is hard in the engine, because commit() may not have been called yet for one of your transactions to become durable.

But instead, you can look at all transactions that have returned from prepare(). Any transaction that has reached commit_ordered() will first have done prepare(). Or even just all transactions that have started at all! So just wait until every transaction that has been prepared has durably committed (or been durably rolled back). At that point, invoke commit_checkpoint_notify_ha(). It does not matter if it takes a long time before this happens. Any delay has no worse consequences than having to scan a bit more of the binlog if we crash.

For example, maybe you can just wait for your next checkpoint to complete, and invoke commit_checkpoint_notify_ha() at that time, assuming the checkpoint makes transactions durable (a sketch of this bookkeeping is at the end of this mail).

We do not have to change anything in the MariaDB code for this to work, just update the comments defining the contract between server and storage engine. It is just a matter of ensuring that commit_checkpoint_notify_ha() is only called after every transaction has been made durable that might have been written to the binlog before commit_checkpoint_request() was called.

Does this sound reasonable?
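Here is the sketch I mentioned (illustrative only; a real engine would use its own primitives, and a refinement would snapshot the set of prepared transactions at request time, so that later prepares cannot delay the reply indefinitely):

  #include <mutex>
  #include <vector>

  static std::mutex mtx;
  static unsigned long active_prepared= 0;  /* prepared, not yet durable */
  static std::vector<void*> parked_cookies; /* requests waiting for zero */

  static void trx_prepared()
  {
    std::lock_guard<std::mutex> g(mtx);
    active_prepared++;                      /* call at end of prepare() */
  }

  static void my_commit_checkpoint_request(handlerton *hton, void *cookie)
  {
    std::lock_guard<std::mutex> g(mtx);
    if (active_prepared == 0)
      commit_checkpoint_notify_ha(hton, cookie);  /* nothing outstanding */
    else
      parked_cookies.push_back(cookie);           /* reply later */
  }

  /* Call when a prepared transaction has become durably committed or
     durably rolled back (eg. after the commit fsync, or after an engine
     checkpoint that makes it durable): */
  static void prepared_trx_now_durable(handlerton *hton)
  {
    std::lock_guard<std::mutex> g(mtx);
    if (--active_prepared == 0)
    {
      for (void *c : parked_cookies)
        commit_checkpoint_notify_ha(hton, c);
      parked_cookies.clear();
    }
  }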
> MySQL 5.6, we intend to use the flush logs command to do this.
Yes, MySQL 5.6 does not allow new commits to proceed while waiting for the old binlog to be rotated.

 - Kristian.
Hello Kristian,

Thanks for the replies to both emails. I think at this point my biggest interest is learning how to ensure that in MariaDB 10.0, we can choose not to fsync our log on commit. If I can do this without having to implement commit_ordered, I would like to. I think what you are stating is that if the handlerton does some bookkeeping of transactions that have been prepared, then when commit_checkpoint_request is called, we can use this bookkeeping to properly call commit_checkpoint_notify_ha when all the prepared transactions finish committing. This bookkeeping would need to be done anyway, so it might as well be done in our handlerton so that complexity is reduced for engines that implement commit_ordered. If this is accurate, then this sounds reasonable and I look forward to it working for 10.0.

I understand and agree with your point about unnecessary stalls when the binary log is rotating. I also (now) realize that MySQL 5.6 serializes commits, which is probably a performance issue for us. I have started a thread on the MySQL internals list and hope to learn more on that thread.

Your understanding of my explanation of TokuDB's commits is accurate. At this point, here is what I hope we can do for 10.0:
- not implement commit_ordered
- do some bookkeeping in our handlerton to be able to implement commit_checkpoint_request

I hope this leads to reduced fsyncs for our engine.

Thank you
-Zardosht
Zardosht Kasheff <zardosht@gmail.com> writes:
> to implement commit_ordered, I would like to. I think what you are stating is that if the handlerton does some bookkeeping of transactions that have been prepared, then when commit_checkpoint_request is called, we can use this bookkeeping to properly call commit_checkpoint_notify_ha when all the prepared transactions finish committing. This bookkeeping would need to be done
Yes, exactly.
> anyway, so it might as well be done in our handlerton so that complexity is reduced for engines that implement commit_ordered. If this is accurate, then this sounds reasonable and I look forward to it working for 10.0.
Ok, that sounds great!
> I also (now) realize that MySQL 5.6 serializes commits, which is probably a performance issue for us. I have started a thread on the MySQL internals list and hope to learn more on that thread.
Yes, it will be interesting to follow. Hopefully some solution will arise.
> accurate. At this point, here is what I hope we can do for 10.0:
> - not implement commit_ordered
> - do some bookkeeping in our handlerton to be able to implement commit_checkpoint_request
Sounds reasonable.

 - Kristian.
Zardosht Kasheff <zardosht@gmail.com> writes:
> Here is a very high level overview.
Thanks for the detailed explanation!

I think an easy way to implement commit_ordered() is that all you do is increment a counter and assign it to the transaction. Then in commit(), each transaction waits for the previous commit to write to the recovery log before writing itself (so the order becomes correct). That should be a very small modification to your code. But there may be extra context switching, unless you are clever with your write lock access to the recovery log.

Maybe there is a different possibility. If I understand correctly, this is the situation:

- You need part of commit to run in parallel in multiple threads for good performance ("send a message into the dictionary for every ...").
- The first phase of checkpointing needs to stall all commits while it runs (but it is short).
- When a checkpoint is waiting to start, you also need to stall new commits, to prevent starvation of checkpointing.

Is this accurate? In fact, such a situation is exactly why I did the split into commit_ordered() and commit(): so that a storage engine has the freedom to choose which part of commit should run serially, and which in parallel.

It seems to me that the problem here is that you are using a simple read lock to handle the stalling and avoid starvation, and your read lock implementation does not allow taking the read lock in one thread and releasing it in another (which is reasonable). Maybe it can be solved simply by using a different mechanism? Like, keep a counter of threads running inside commit. When a checkpoint is about to start, set a flag, "checkpoint pending", then wait for the counter to drop to zero. When a new thread wants to commit, wait for the "checkpoint pending" flag to clear, stalling the commit until the checkpoint has completed. (A sketch of this is below.)

Note that it is not a problem to do the wait for checkpoint completion inside commit_ordered(). Yes, this is single-threaded, but all other committing threads would have to wait anyway, so in fact doing the wait in just the one thread will reduce context switches and speed things up. But you could do the wait for the checkpoint to complete in eg. prepare() instead if you want.

But maybe I am missing something? Not knowing your implementation, I cannot know, of course, whether this naive second idea is infeasible for some reason...
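Here is a sketch of that counter mechanism (illustrative only, using standard primitives; your engine would of course use its own):

  #include <mutex>
  #include <condition_variable>

  static std::mutex m;
  static std::condition_variable cond;
  static int commits_in_flight= 0;
  static bool checkpoint_pending= false;

  /* Unlike a fair rwlock, this has no thread ownership, so the enter
     could run in the commit_ordered() thread and the leave in the user
     thread doing commit(). */
  static void commit_enter()              /* replaces read_lock() */
  {
    std::unique_lock<std::mutex> lk(m);
    /* Stall new commits while a checkpoint is pending, so the
       checkpoint is not starved: */
    cond.wait(lk, []{ return !checkpoint_pending; });
    commits_in_flight++;
  }

  static void commit_leave()              /* replaces read_unlock() */
  {
    std::lock_guard<std::mutex> lk(m);
    if (--commits_in_flight == 0)
      cond.notify_all();                  /* checkpoint may proceed */
  }

  static void checkpoint_begin_phase()
  {
    std::unique_lock<std::mutex> lk(m);
    checkpoint_pending= true;
    cond.wait(lk, []{ return commits_in_flight == 0; });
    /* ... short phase: mark all dirty data for the checkpoint ... */
    checkpoint_pending= false;
    cond.notify_all();
  }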
> may be expensive, thereby hurting concurrency. So, for such a thing to work, we would have to find a way to grab the read lock in commit_ordered once for each transaction (and because the lock is fair, we can't just regrab the lock on the same thread), write to the recovery log, then perform everything else under commit, then release the read lock. It can probably be done, but it is messy. If unnecessary, I prefer not to do it.
Yes, I understand, deep surgery on synchronisation primitives in the core of an engine is not trivial stuff...

In MariaDB, it is not necessary. The commit_ordered() call is optional, though it gives you some benefits (and likely more benefits in future versions). If necessary, we can try to make eg. the removal of the commit fsync() work without commit_ordered(), so you get some of the benefits regardless. And it sounds like my first suggestion should be an easy way to implement commit_ordered(), though it might require benchmarking to check that it does not hurt performance. If you try any solution, feel free to send me the patch for review and suggestions.

But what can you do in MySQL 5.6? In 5.6, effectively what you get is commit_ordered() only, no commit() (their call of commit() with HA_IGNORE_DURABILITY set is essentially the same as commit_ordered()). So you do not get to decide which code runs serially and which in parallel; everything in commit() runs serially. Total breakage of the storage engine API, and their developers do not even understand this when it is pointed out to them :-( I vaguely remember some option in 5.6 that would disable the serialisation of commit(); maybe you can recommend that your users enable that ...

 - Kristian (painfully aware of writing too long emails).