[Maria-developers] MDEV-15740 (InnoDB lost Durability due to incorrect fix of MDEV-11937)
Hi Kristian, Could you please take a look at this InnoDB regression in MariaDB 10.2 and later: https://jira.mariadb.org/browse/MDEV-15740 Because of this bug, we can no longer write tests that would ensure that certain changes are present in the redo log, before killing and restarting the server. This already hurt me a bit when testing MDEV-15562 (instant DROP COLUMN) in 10.4. I had to add extra sleeps to test cases in order to actually test redo log roll-forward followed by rollback. -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
Marko Mäkelä <marko.makela@mariadb.com> writes:
Could you please take a look at this InnoDB regression in MariaDB 10.2 and later: https://jira.mariadb.org/browse/MDEV-15740
Because of this bug, we can no longer write tests that would ensure that certain changes are present in the redo log, before killing and
But what is the bug here? This is a Galera-only issue, right? It is not a bug that commit does not flush the log to disk when trx->active_commit_ordered, it's an important performance optimisation. The transaction is already durable at this point (it was synced in prepare, and will be recovered by the transaction coordinator during crash recovery, eg. from the binlog). It's the same in MySQL, just using a different mechanism (thd_requested_durability(trx->mysql_thd) == HA_IGNORE_DURABILITY)). Is the problem just about getting the test cases to work? That's easy to do, for example with a DBUG_EXECUTE_IF() to add a log flush to disk at some appropriate place. Or is the problem that Galera somehow disables crash recovery from the binlog, and requires an fsync during every commit to be crash safe, not just for test-purposes? In that case, isn't the problem the same in MySQL? Then it sounds like the problem is that Galera in MySQL does something to avoid HA_IGNORE_DURABILITY being set, and this "something" was not merged correctly to MariaDB to work within the commit_ordered() framework? Galera needs to either respect the commit_ordered() semantics (see eg. comments in sql/handler.h). Or it needs to take over the transaction coordinator role and control commit_ordered() itself. The latter is the correct approach (but is not done currently as far as I am aware). In general, Galera has never been properly integrated with the transaction coordinator / commit_ordered() framework in MariaDB, and thus we keep getting issues like this popping up. But without knowing how Galera intends this to work, it's hard to say what the real problem is...
- || thd_requested_durability(trx->mysql_thd) - == HA_IGNORE_DURABILITY) { + || (srv_flush_log_at_trx_commit == 1 && trx->active_commit_ordered)) {
This looks particularly strange because changing innodb_flush_log_at_trx_commit from 1 to 2 "fixes" the bug due to the condition srv_flush_log_at_trx_commit == 1 not being true anymore. But 1 is supposed to be more durable and slow compared to 2. Now, with the current code 2 is more durable!
It could be that this could instead use srv_flush_log_at_trx_commit>=1 to also avoid an unnecessary log flush (without fsync) to disk. I didn't check now, as I think this is not the question at hand here?
Related to this, I discussed an idea with Andrei Elkin: Actually there should be no need to flush the InnoDB redo log at transaction commit if logical logging is enabled. On recovery, we could retrieve the latest committed logical log position from InnoDB, and then replay the logical log from that point of time. We would only have to wait for InnoDB redo log write before discarding the head of the logical log. For this, we could implement some API call like "flush logs up to the LSN corresponding to this binlog position".
It's an interesting idea, I had the same one 8 years ago ;-) http://worklog.askmonty.org/worklog/Server-RawIdeaBin/?tid=164 (I think there may be a jira corresponding to this MWL#164, but not sure how to find it). It would be really nice to see this implemented. There are a few complications though, eg. multi-engine transactions. And it still requires that storage engines (/Galera) respect the commit_ordered() semantics. If you can describe in more detail what the problem is, I can look a bit deeper? Hope this helps, - Kristian.
Hi Kristian, Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Marko Mäkelä <marko.makela@mariadb.com> writes:
Could you please take a look at this InnoDB regression in MariaDB 10.2 and later: https://jira.mariadb.org/browse/MDEV-15740
Because of this bug, we can no longer write tests that would ensure that certain changes are present in the redo log, before killing and
But what is the bug here? This is a Galera-only issue, right?
I was under the impression that it affects normal InnoDB too. But indeed, it looks like you could be right that it is only affecting Galera. I am sorry for blaming you; I think that the test innodb.innodb_force_recovery proves that the redo log will be flushed correctly in 10.2 as well as 10.4. I also tested with a variant that relies on the commit of DML (INSERT) instead of DDL (DROP TABLE) for flushing the redo log. I have now reassigned MDEV-15740 as a Galera bug.
It is not a bug that commit does not flush the log to disk when trx->active_commit_ordered, it's an important performance optimisation. The transaction is already durable at this point (it was synced in prepare, and will be recovered by the transaction coordinator during crash recovery, eg. from the binlog).
It's the same in MySQL, just using a different mechanism (thd_requested_durability(trx->mysql_thd) == HA_IGNORE_DURABILITY)).
Thanks, this makes sense now.
Is the problem just about getting the test cases to work? That's easy to do, for example with a DBUG_EXECUTE_IF() to add a log flush to disk at some appropriate place.
I prefer to have tests that run on non-debug binaries, and innodb_flush_log_at_trx_commit=1 should trigger a redo log write (which will also persist any incomplete transactions from concurrent sessions). For those cases where we need debug instrumentation, I prefer DEBUG_SYNC. For example, if I have to kill the server at a specific point, I prefer to do it externally, like this: connection other; SET DEBUG_SYNC = 'some_point SIGNAL hung WAIT_FOR ever'; … connection default; SET DEBUG_SYNC = 'now WAIT_FOR hung'; --source include/kill_mysqld.inc disconnect other; --source include/start_mysqld.inc In this way, the test will not cause any memory leak errors in Valgrind or ASAN.
Or is the problem that Galera somehow disables crash recovery from the binlog, and requires an fsync during every commit to be crash safe, not just for test-purposes? In that case, isn't the problem the same in MySQL? Then it sounds like the problem is that Galera in MySQL does something to avoid HA_IGNORE_DURABILITY being set, and this "something" was not merged correctly to MariaDB to work within the commit_ordered() framework?
Galera needs to either respect the commit_ordered() semantics (see eg. comments in sql/handler.h). Or it needs to take over the transaction coordinator role and control commit_ordered() itself. The latter is the correct approach (but is not done currently as far as I am aware).
In general, Galera has never been properly integrated with the transaction coordinator / commit_ordered() framework in MariaDB, and thus we keep getting issues like this popping up. But without knowing how Galera intends this to work, it's hard to say what the real problem is...
I have not participated in Galera development, so I cannot say either.
- || thd_requested_durability(trx->mysql_thd) - == HA_IGNORE_DURABILITY) { + || (srv_flush_log_at_trx_commit == 1 && trx->active_commit_ordered)) {
This looks particularly strange because changing innodb_flush_log_at_trx_commit from 1 to 2 "fixes" the bug due to the condition srv_flush_log_at_trx_commit == 1 not being true anymore. But 1 is supposed to be more durable and slow compared to 2. Now, with the current code 2 is more durable!
It could be that this could instead use srv_flush_log_at_trx_commit>=1 to also avoid an unnecessary log flush (without fsync) to disk. I didn't check now, as I think this is not the question at hand here?
Yes, it could be changed to >=1 as part of the Galera fix.
Related to this, I discussed an idea with Andrei Elkin: Actually there should be no need to flush the InnoDB redo log at transaction commit if logical logging is enabled. On recovery, we could retrieve the latest committed logical log position from InnoDB, and then replay the logical log from that point of time. We would only have to wait for InnoDB redo log write before discarding the head of the logical log. For this, we could implement some API call like "flush logs up to the LSN corresponding to this binlog position".
It's an interesting idea, I had the same one 8 years ago ;-)
http://worklog.askmonty.org/worklog/Server-RawIdeaBin/?tid=164
(I think there may be a jira corresponding to this MWL#164, but not sure how to find it).
It would be really nice to see this implemented. There are a few complications though, eg. multi-engine transactions. And it still requires that storage engines (/Galera) respect the commit_ordered() semantics.
Yes, I believe that multi-engine transactions would still need a XA 2PC mechanism. In that case, it seems to me that the binlog would have to continue to do extra fsync() at all of XA PREPARE, XA COMMIT, XA ROLLBACK.
If you can describe in more detail what the problem is, I can look a bit deeper?
I think that I must carefully debug the issue that I observed in 10.4 to see what exactly is wrong there. I was making a false assumption, but luckily as a result of this, we got MDEV-15740 properly recategorized as a Galera bug. Thanks for the help! Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
Marko Mäkelä <marko.makela@mariadb.com> writes:
Hi Kristian,
I was under the impression that it affects normal InnoDB too. But indeed, it looks like you could be right that it is only affecting Galera.
Right. The idea is that one of these two cases apply: 1. There is 2-phase commit (eg. between InnoDB and binlog). The innodb redo log is synced to disk in prepare, commit_ordered() is called where trx->active_commit_ordered is set, and we skip syncing to disk in commit (because we ca recover from the prepared transaction). 2. There is no 2-phase commit (eg. binlog disabled). Prepare / commit_ordered() is not called, trx->active_commit_ordered is not set, log is synced to disk during commit. It sounds like Galera somehow disables the recovery step of (1) (otherwise it would not see the transaction lost). Maybe that is the problem. Or if Galera really doesn't want to recover lost transactions from binlog, maybe it shouldn't do the prepare step (which is costly due to fsync()), not the commit_ordered(). That will break the synchronisation between commit order in binlog and innodb, but I think Galera is already breaking that severely. But yes, something for Galera people to look at. Thanks, - Kristian.
participants (2)
-
Kristian Nielsen
-
Marko Mäkelä