[Maria-developers] a question about group commit
Hi,
As we know, there are three steps in committing an XA transaction: 1. the prepare step, 2. writing the binary log, 3. the commit step in the engines. All of these steps need an fsync(). The group commit strategy can make a group of transactions durable with one fsync() in step 2 and in step 3, which can lead to a dramatic performance improvement.
But in step 1, each transaction still does its own fsync(). So why not make several transactions durable with one fsync() in the prepare step, just as in steps 2 and 3? I think this could improve performance further.
Thanks
2014-08-04
nanyi607rao
"nanyi607rao" <nanyi607rao@gmail.com> writes:
As we know, there are three steps in committing an XA transaction: 1. the prepare step, 2. writing the binary log, 3. the commit step in the engines.
All of these steps need an fsync(). The group commit strategy can make a group of transactions durable with one fsync() in step 2 and in step 3, which can lead to a dramatic performance improvement.
But in step 1, each transaction still does its own fsync(). So why not make several transactions durable with one fsync() in the prepare step, just as in steps 2 and 3? I think this could improve performance further.
Actually, this is already implemented.
Further, in MariaDB 10.0, there is no fsync() needed in step 3. This is because in case of a crash, XA crash recovery can repeat step 3 using the information saved in steps 1 and 2. So in 10.0, we only need one shared fsync in step 1 plus one shared fsync in step 2.
If you look in the innodb/xtradb code, you can see this. The prepare step calls trx_prepare_for_mysql() in trx/trx0trx.cc. This calls trx_prepare() which goes to trx_flush_log_if_needed_low() and calls log_write_up_to() in log/log0log.cc. And in log_write_up_to(), you will see the group commit logic. The transaction will wait for any previous fsync to complete; then if it still needs the fsync(), it will fsync not just itself, but also any other transactions that are waiting for fsync.
There is some description of the removal of fsync() in step 3 here:
http://kristiannielsen.livejournal.com/16382.html
However, the group commit in step 1 has been in the InnoDB code for many years, as far as I know.
Hope this helps,
- Kristian.
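The shared-fsync behaviour described for log_write_up_to() can be illustrated with a small, single-threaded sketch. This is not the InnoDB code: the `RedoLog` class, its fields, and the integer LSNs are hypothetical names invented for illustration, and the step of waiting for a flush already in progress is left out because nothing runs concurrently here.

```python
class RedoLog:
    """Toy redo log (hypothetical); LSNs are simply increasing integers."""

    def __init__(self):
        self.written_lsn = 0   # highest LSN appended to the log so far
        self.flushed_lsn = 0   # highest LSN known to be durable
        self.fsync_count = 0   # number of simulated fsync() calls

    def append(self):
        """Append one record (e.g. a prepare record) and return its LSN."""
        self.written_lsn += 1
        return self.written_lsn

    def flush_up_to(self, lsn):
        """Make the log durable up to `lsn`; return True if we had to fsync."""
        if self.flushed_lsn >= lsn:
            # An earlier fsync already covered this record: nothing to do.
            return False
        # One fsync makes everything written so far durable, so it also
        # covers every other transaction currently waiting for a flush.
        self.fsync_count += 1
        self.flushed_lsn = self.written_lsn
        return True


log = RedoLog()
# Three transactions write their prepare records before any flush happens.
lsns = [log.append() for _ in range(3)]
# Each one then asks for durability; only the first actually pays for it.
did_fsync = [log.flush_up_to(lsn) for lsn in lsns]
print(did_fsync, log.fsync_count)
```

In this toy run the first caller does the fsync and the other two find their records already durable, which is the sharing effect described above.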
Yeah, I got it. InnoDB/XtraDB group commit does indeed reduce the number of fsync() calls in the prepare step, but it may still take more than one fsync() in the prepare step to make durable a group of transactions that share one binlog group commit.
What I think is that we only need to make sure that a group of transactions being written to the binlog has already been flushed to the InnoDB/XtraDB redo log. So how about not flushing the redo log in prepare(), and instead letting the leader thread flush the InnoDB/XtraDB redo log up to the latest LSN just before it begins to write the follower transactions and itself to the binlog? That needs only one fsync() to flush the redo log for a group of transactions that have completed the prepare step.
Thanks
"nanyi607rao" <nanyi607rao@gmail.com> writes:
Yeah, I got it. InnoDB/XtraDB group commit does indeed reduce the number of fsync() calls in the prepare step, but it may still take more than one fsync() in the prepare step to make durable a group of transactions that share one binlog group commit.
Ah, yes, now I see, I had forgotten about this. Yes, you are right, it is very likely that the transactions that are group committed together in the binlog will need more than one fsync in the prepare step.
What I think is that we only need to make sure that a group of transactions being written to the binlog has already been flushed to the InnoDB/XtraDB redo log. So how about not flushing the redo log in prepare(), and instead letting the leader thread flush the InnoDB/XtraDB redo log up to the latest LSN just before it begins to write the follower transactions and itself to the binlog? That needs only one fsync() to flush the redo log for a group of transactions that have completed the prepare step.
I think it is an interesting idea. It should not be too hard to implement a prototype, and then it would be interesting to run some benchmarks and see what effect it can have.
It is hard to predict if it will be a win in all cases - one can imagine some specific scenarios where a slowdown could occur, but such cases may or may not be likely to turn up in practise, it's hard to tell without testing. But it seems likely that it could improve performance in many or even most cases.
(This problem was actually something that bothered me a bit when I originally implemented group commit, but I did not so far think much on how to avoid it. So I am happy to see this suggestion.)
Thanks,
- Kristian.
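To make the difference between the two schemes concrete, here is a rough sketch that only counts redo-log fsyncs for one binlog group. It is a toy model, not MariaDB code: `RedoLog`, `flush_in_prepare` and `flush_by_leader` are hypothetical names, and the first function assumes the unfavourable case where the prepares arrive one after another so that no prepare-step fsync can be shared.

```python
class RedoLog:
    """Toy redo log (hypothetical); LSNs are simply increasing integers."""

    def __init__(self):
        self.written_lsn = 0   # highest LSN appended so far
        self.flushed_lsn = 0   # highest LSN known to be durable
        self.fsync_count = 0   # number of simulated fsync() calls

    def append(self):
        self.written_lsn += 1
        return self.written_lsn

    def flush_up_to(self, lsn):
        # Only fsync if an earlier flush has not already covered `lsn`.
        if self.flushed_lsn < lsn:
            self.fsync_count += 1
            self.flushed_lsn = self.written_lsn


def flush_in_prepare(n_trx):
    """Existing scheme: every prepare flushes the redo log itself.

    Assumes the prepares are serialized, so each one finds its own
    record not yet durable and has to do an fsync of its own.
    """
    log = RedoLog()
    for _ in range(n_trx):
        lsn = log.append()      # write the prepare record
        log.flush_up_to(lsn)    # flush inside the prepare step
    return log.fsync_count


def flush_by_leader(n_trx):
    """Proposed scheme: prepares only append, the leader flushes once."""
    log = RedoLog()
    for _ in range(n_trx):
        log.append()            # prepare record written, not yet durable
    # Just before writing the group to the binlog, the leader flushes
    # the redo log up to the latest LSN on behalf of the whole group.
    log.flush_up_to(log.written_lsn)
    return log.fsync_count


redo_fsyncs_existing = flush_in_prepare(4)
redo_fsyncs_proposed = flush_by_leader(4)
print(redo_fsyncs_existing, redo_fsyncs_proposed)
```

The sketch says nothing about latency or about the slowdown scenarios mentioned above; as noted in the reply, only a prototype and benchmarks would show the real effect.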
participants (2)
- Kristian Nielsen
- nanyi607rao