Marko Mäkelä <marko.makela@mariadb.com> writes:
> Could you please take a look at this InnoDB regression in MariaDB 10.2 and later: https://jira.mariadb.org/browse/MDEV-15740
>
> Because of this bug, we can no longer write tests that would ensure that certain changes are present in the redo log, before killing and restarting the server.
But what is the bug here? This is a Galera-only issue, right?

It is not a bug that commit does not flush the log to disk when trx->active_commit_ordered; it's an important performance optimisation. The transaction is already durable at this point (it was synced in prepare, and will be recovered by the transaction coordinator during crash recovery, eg. from the binlog). It's the same in MySQL, just using a different mechanism (thd_requested_durability(trx->mysql_thd) == HA_IGNORE_DURABILITY).

Is the problem just about getting the test cases to work? That's easy to do, for example with a DBUG_EXECUTE_IF() to add a log flush to disk at some appropriate place (rough sketch below).

Or is the problem that Galera somehow disables crash recovery from the binlog, and requires an fsync during every commit to be crash safe, not just for test purposes? In that case, isn't the problem the same in MySQL? Then it sounds like the problem is that Galera in MySQL does something to avoid HA_IGNORE_DURABILITY being set, and this "something" was not merged correctly to MariaDB to work within the commit_ordered() framework?

Galera needs to either respect the commit_ordered() semantics (see eg. the comments in sql/handler.h), or take over the transaction coordinator role and control commit_ordered() itself. The latter is the correct approach (but is not done currently, as far as I am aware). In general, Galera has never been properly integrated with the transaction coordinator / commit_ordered() framework in MariaDB, and thus we keep getting issues like this popping up.

But without knowing how Galera intends this to work, it's hard to say what the real problem is...
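For example, something along these lines at the point where InnoDB currently skips the flush (completely untested sketch; the debug keyword name is made up, and trx->commit_lsn / log_write_up_to() would have to be checked against the actual 10.2 code):

	/* Untested sketch: let a test force the redo log to disk even for a
	commit_ordered transaction, so the test can kill the server and then
	check that the changes made it to the redo log. */
	DBUG_EXECUTE_IF("commit_ordered_flush_redo",
			log_write_up_to(trx->commit_lsn,
					true /* flush to disk */););

A test would then just do SET debug_dbug="+d,commit_ordered_flush_redo" around the statements it wants to see flushed.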
> -	    || thd_requested_durability(trx->mysql_thd)
> -	    == HA_IGNORE_DURABILITY) {
> +	    || (srv_flush_log_at_trx_commit == 1 && trx->active_commit_ordered)) {
> This looks particularly strange, because changing innodb_flush_log_at_trx_commit from 1 to 2 "fixes" the bug, due to the condition srv_flush_log_at_trx_commit == 1 no longer being true. But 1 is supposed to be more durable and slower than 2. Now, with the current code, 2 is more durable!
Maybe this could instead use srv_flush_log_at_trx_commit >= 1, to also avoid an unnecessary log flush (without fsync) to disk. I didn't check now, as I think this is not the question at hand here?
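I.e. something like this (again untested, just to show what I mean):

	/* untested: skip the redo log write for binlog-ordered commits also
	when innodb_flush_log_at_trx_commit=2, not only when it is exactly 1 */
	    || (srv_flush_log_at_trx_commit >= 1
	        && trx->active_commit_ordered)) {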
> Related to this, I discussed an idea with Andrei Elkin: actually there should be no need to flush the InnoDB redo log at transaction commit if logical logging is enabled. On recovery, we could retrieve the latest committed logical log position from InnoDB and then replay the logical log from that point onwards. We would only have to wait for the InnoDB redo log write before discarding the head of the logical log. For this, we could implement some API call like "flush logs up to the LSN corresponding to this binlog position".
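For illustration, such a call could look roughly like this (sketch only, all names invented; nothing like this exists in the server today, and it would presumably be a handlerton hook so that other engines could implement it too):

	/* Invented prototype, for illustration only: ask the engine to make
	everything it has committed up to the given binlog position durable,
	eg. by flushing the redo log up to the corresponding LSN. The binlog
	would only discard its head once this call has returned. */
	int (*flush_logs_up_to_binlog_pos)(handlerton *hton,
					   const char *binlog_file,
					   my_off_t binlog_pos);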
It's an interesting idea, I had the same one 8 years ago ;-)

    http://worklog.askmonty.org/worklog/Server-RawIdeaBin/?tid=164

(I think there may be a jira corresponding to this MWL#164, but not sure how to find it). It would be really nice to see this implemented. There are a few complications though, eg. multi-engine transactions. And it still requires that storage engines (/Galera) respect the commit_ordered() semantics.

If you can describe in more detail what the problem is, I can look a bit deeper?

Hope this helps,

 - Kristian.