Re: [Maria-developers] MDEV-15740 (InnoDB lost Durability due to incorrect fix of MDEV-11937)

2 Nov 2018

      Hi Kristian,
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
...
Marko Mäkelä <marko.makela@mariadb.com> writes:
...
Could you please take a look at this InnoDB regression in MariaDB 10.2
and later:
https://jira.mariadb.org/browse/MDEV-15740
Because of this bug, we can no longer write tests that would ensure
that certain changes are present in the redo log, before killing and
But what is the bug here? This is a Galera-only issue, right?
I was under the impression that it affects normal InnoDB too.
But indeed, it looks like you could be right that it is only affecting Galera.

I am sorry for blaming you; I think that the test innodb.innodb_force_recovery
proves that the redo log will be flushed correctly in 10.2 as well as 10.4.
I also tested with a variant that relies on the commit of DML (INSERT)
instead of DDL (DROP TABLE) for flushing the redo log.
I have now reassigned MDEV-15740 as a Galera bug.
...
It is not a bug that commit does not flush the log to disk when
trx->active_commit_ordered, it's an important performance optimisation. The
transaction is already durable at this point (it was synced in prepare, and
will be recovered by the transaction coordinator during crash recovery, eg.
from the binlog).
It's the same in MySQL, just using a different mechanism
(thd_requested_durability(trx->mysql_thd) == HA_IGNORE_DURABILITY)).
Thanks, this makes sense now.
...
Is the problem just about getting the test cases to work? That's easy to do,
for example with a DBUG_EXECUTE_IF() to add a log flush to disk at some
appropriate place.
I prefer to have tests that run on non-debug binaries, and
innodb_flush_log_at_trx_commit=1 should trigger a redo log write
(which will also persist any incomplete transactions from concurrent sessions).

For those cases where we need debug instrumentation, I prefer DEBUG_SYNC.
For example, if I have to kill the server at a specific point, I prefer to do it
externally, like this:

connection other;
SET DEBUG_SYNC = 'some_point SIGNAL hung WAIT_FOR ever';
…
connection default;
SET DEBUG_SYNC = 'now WAIT_FOR hung';
--source include/kill_mysqld.inc
disconnect other;
--source include/start_mysqld.inc

In this way, the test will not cause any memory leak errors in Valgrind or ASAN.
...
Or is the problem that Galera somehow disables crash recovery from the
binlog, and requires an fsync during every commit to be crash safe, not just
for test-purposes? In that case, isn't the problem the same in MySQL? Then
it sounds like the problem is that Galera in MySQL does something to avoid
HA_IGNORE_DURABILITY being set, and this "something" was not merged
correctly to MariaDB to work within the commit_ordered() framework?
Galera needs to either respect the commit_ordered() semantics (see eg.
comments in sql/handler.h). Or it needs to take over the transaction
coordinator role and control commit_ordered() itself. The latter is the
correct approach (but is not done currently as far as I am aware).
In general, Galera has never been properly integrated with the transaction
coordinator / commit_ordered() framework in MariaDB, and thus we keep
getting issues like this popping up. But without knowing how Galera intends
this to work, it's hard to say what the real problem is...
I have not participated in Galera development, so I cannot say either.
...
...
-           || thd_requested_durability(trx->mysql_thd)
-              == HA_IGNORE_DURABILITY) {
+           || (srv_flush_log_at_trx_commit == 1 && trx->active_commit_ordered)) {
This looks particularly strange because changing
innodb_flush_log_at_trx_commit from 1 to 2 "fixes" the bug due to the
condition srv_flush_log_at_trx_commit == 1 not being true anymore. But 1
is supposed to be more durable and slow compared to 2. Now, with the
current code 2 is more durable!
It could be that this could instead use srv_flush_log_at_trx_commit>=1 to
also avoid an unnecessary log flush (without fsync) to disk. I didn't check
now, as I think this is not the question at hand here?
Yes, it could be changed to >=1 as part of the Galera fix.
...
...
Related to this, I discussed an idea with Andrei Elkin: Actually there
should be no need to flush the InnoDB redo log at transaction commit if
logical logging is enabled. On recovery, we could retrieve the latest
committed logical log position from InnoDB, and then replay the logical
log from that point of time. We would only have to wait for InnoDB redo
log write before discarding the head of the logical log. For this, we
could implement some API call like "flush logs up to the LSN corresponding
to this binlog position".
It's an interesting idea, I had the same one 8 years ago ;-)
http://worklog.askmonty.org/worklog/Server-RawIdeaBin/?tid=164
(I think there may be a jira corresponding to this MWL#164, but not sure how
to find it).
It would be really nice to see this implemented. There are a few
complications though, eg. multi-engine transactions. And it still requires
that storage engines (/Galera) respect the commit_ordered() semantics.
Yes, I believe that multi-engine transactions would still need a XA
2PC mechanism.
In that case, it seems to me that the binlog would have to continue to
do extra fsync() at all of XA PREPARE, XA COMMIT, XA ROLLBACK.
...
If you can describe in more detail what the problem is, I can look a bit
deeper?
I think that I must carefully debug the issue that I observed in 10.4
to see what exactly is wrong there.
I was making a false assumption, but luckily as a result of this, we
got MDEV-15740 properly recategorized as a Galera bug.

Thanks for the help!

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation