[Maria-developers] commit_checkpoint_request() vs. thd_get_durability_property() (in relation to MDEV-11937)
Monty asked me to fix MDEV-11937. This particular one is a performance regression in InnoDB commit. But there is a wider problem that I thought I should explain, so it can be perhaps avoided in the future. MariaDB and MySQL use different mechanisms for storage engines to avoid having to fsync during commit when binlog is enabled. In MariaDB, storage engines implement the commit_checkpoint_request() handlerton method. In MySQL, storage engines call thd_get_durability_property() to check if they can avoid fsync. The problem is that somehow the thd_get_durability_property() function was introduced into MariaDB code, but it is completely non-functional. So now there is code in InnoDB, TokuDB and RocksDB that calls this function and does not work correctly. This lead to performance regression due to extra fsync() calls. This seems to me a serious problem. Now new code can be merged and compile fine, where in reality it is wrong. There really should not be two separate and different mechanisms for the same thing, and certainly not with one of them non-functional. The "expected" approach would be to remove thd_get_durability_property() and update storage engines to use the corresponding MariaDB APIs (commit_ordered() and commit_checkpoint_request()). This should not be hard. A simple commit_checkpoint_request() implementation can just fsync all transactions immediately (similar to what MySQL does). A more detailed implementation can avoid any extra fsyncs, and just asynchroneously notify the upper layer with commit_checkpoint_notify_ha() later when such fsync happens normally (this is what InnoDB does). See comments in handler.h for details. Or was the intention to eventually replace the whole MariaDB binlog group commit implementation with the MySQL one, to make MariaDB less divergent? This would require a number of changes to MariaDB binlog and replication. The binlog recovery code should be replaced (MySQL does not have the ability to recover from more than one binlog). The binlog group commit code must be replaced as well, and the commit_ordered() mechanism removed. I think this would also require a re-design of the MariaDB in-order parallel replication. MySQL has some optional mechanism for in-order, but my understanding is that it is not sufficient to do optimistic parallel replication. I am not intimately familiar with the MySQL code though. I hope this helps. The wider problem behind MDEV-11937 is one of policy more than one of code bugs, so not too much else I can do to address it. ----------------------------------------------------------------------- Incidentally, I noticed some code in InnoDB trx0trx.cc: /* We set the HA_IGNORE_DURABILITY during prepare phase of binlog group commit to not flush redo log for every transaction here. So that we can flush prepared records of transactions to redo log in a group right before writing them to binary log during flush stage of binlog group commit. */ So the idea is to do group prepare with the same group of transactions that will later group commit to the binlog. In MariaDB, this concept does not exist. Storage engine prepares are allowed to run in parallel and in any order compared to binlog commit. So the InnoDB group prepare can include more transactions than participate in the binlog commit (but also less, of course). IIRC, the important thing is to ensure that all transactions are durably prepared in storage engines before being written to the binlog. In MariaDB, there is MYSQL_LOG_BIN::group_commit_queue that holds the list of transactions to be group committed to the binlog. A similar mechanism in MariaDB might use the group_commit_queue as the set of transactions to send to storage engines for group prepare. But if we wait for a prepare fsync() after building this list, some transactions that could prepare during this fsync may be unnecesarily delayed to the next binlog commit. The MySQL 5.7 code grabs the list of transactions to binlog group commit, and then flushes the log in _all_ storage engines, unconditionally. That just seems horrybly wrong, but I can't see any other way to read the code (from MYSQL_BIN_LOG::process_flush_stage_queue()): ha_flush_logs(NULL, true); It even does so while holding LOCK_log :-( I guess the MySQL idea is that there is only one storage engine anyway, InnoDB. - Kristian.
Hi, On Mon, Aug 7, 2017 at 2:09 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
The problem is that somehow the thd_get_durability_property() function was introduced into MariaDB code, but it is completely non-functional. So now there is code in InnoDB, TokuDB and RocksDB that calls this function and does not work correctly. This lead to performance regression due to extra fsync() calls.
This function was introduced while I was doing InnoDB 5.7.9 merge, this was huge merge but in my understanding all server changes i.e. also storage API changes were reviewed. But, as patch was significant there was error. That function should not be introduced.
Or was the intention to eventually replace the whole MariaDB binlog group commit implementation with the MySQL one, to make MariaDB less divergent? This would require a number of changes to MariaDB binlog and replication.
No this was not the intention. R: Jan
Kristian, Thank you for your analysis. I suggest that we remove the unused declarations and code as soon as possible. Would you do that already under MDEV-11937? I think that future code merges from MySQL should be based on individual commits rather than snapshots. For merging the changes from 5.7.14 to 5.7.18, I already did this, and I rewrote or omitted some commits; see MDEV-11751. One particular problem with snapshots is that changes or additions to test files were often omitted. When merging changes commit by commit ("git format-patch" followed by "git am"), some manual work is needed to import the affected test files. The Oracle policy should be to only withhold test files for security bugs; we should take full advantage of the public tests. Best regards, Marko On Mon, Aug 7, 2017 at 2:25 PM, Jan Lindström <jan.lindstrom@mariadb.com> wrote:
Hi,
On Mon, Aug 7, 2017 at 2:09 PM, Kristian Nielsen <knielsen@knielsen-hq.org
wrote:
The problem is that somehow the thd_get_durability_property() function was introduced into MariaDB code, but it is completely non-functional. So now there is code in InnoDB, TokuDB and RocksDB that calls this function and does not work correctly. This lead to performance regression due to extra fsync() calls.
This function was introduced while I was doing InnoDB 5.7.9 merge, this was huge merge but in my understanding all server changes i.e. also storage API changes were reviewed. But, as patch was significant there was error. That function should not be introduced.
Or was the intention to eventually replace the whole MariaDB binlog group commit implementation with the MySQL one, to make MariaDB less divergent? This would require a number of changes to MariaDB binlog and replication.
No this was not the intention.
R: Jan
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp
-- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
Marko Mäkelä <marko.makela@mariadb.com> writes:
Thank you for your analysis. I suggest that we remove the unused declarations and code as soon as possible. Would you do that already under MDEV-11937?
I started on this, but then noticed that several other storage engines reference the APIs also, and thought I should not remove it without discussion first. Another example is the added argument binlog_group_flush to the flush_logs handlerton method. This is the binlog-based group prepare, which is not in MariaDB. The InnoDB code seems to always pass binlog_group_flush as true, isn't this wrong? It will skip log sync in srv_flush_log_at_trx_commit==0, not sure what bad consequences that will have though, if any. My opinion would be that storage engines should be adapted to the mariadb APIs before being pushed into any main tree. It really should not take much effort to do so, apart from taking the time to understand the code properly first. For InnoDB, it should be already done, so new code just needs to be adapted to the different APIs. (BTW, I hope it is clear that I am not blaming any individual bug on any individual engineer here. I know full well how MariaDB development happens and what policies lead to these consequences). - Kristian.
I think that future code merges from MySQL should be based on individual commits rather than snapshots. For merging the changes from 5.7.14 to 5.7.18, I already did this, and I rewrote or omitted some commits; see MDEV-11751.
One particular problem with snapshots is that changes or additions to test files were often omitted. When merging changes commit by commit ("git format-patch" followed by "git am"), some manual work is needed to import the affected test files. The Oracle policy should be to only withhold test files for security bugs; we should take full advantage of the public tests.
Best regards,
Marko
On Mon, Aug 7, 2017 at 2:25 PM, Jan Lindström <jan.lindstrom@mariadb.com> wrote:
Hi,
On Mon, Aug 7, 2017 at 2:09 PM, Kristian Nielsen <knielsen@knielsen-hq.org
wrote:
The problem is that somehow the thd_get_durability_property() function was introduced into MariaDB code, but it is completely non-functional. So now there is code in InnoDB, TokuDB and RocksDB that calls this function and does not work correctly. This lead to performance regression due to extra fsync() calls.
This function was introduced while I was doing InnoDB 5.7.9 merge, this was huge merge but in my understanding all server changes i.e. also storage API changes were reviewed. But, as patch was significant there was error. That function should not be introduced.
Or was the intention to eventually replace the whole MariaDB binlog group commit implementation with the MySQL one, to make MariaDB less divergent? This would require a number of changes to MariaDB binlog and replication.
No this was not the intention.
R: Jan
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp
participants (3)
-
Jan Lindström
-
Kristian Nielsen
-
Marko Mäkelä