Marko Mäkelä <marko.makela@mariadb.com> writes:
recent binlog files. We want to know when scanning the old binlog file is no longer necessary; then we will log a CHECKPOINT_EVENT recording this fact. But we don't want to stall waiting for everything to be flushed to disk immediately; we can just log the CHECKPOINT_EVENT later when appropriate.
Is this the original purpose and motivation of the handlerton::commit_checkpoint_request?
Yes.
I think that for InnoDB to guess the last binlog-committed transaction would require adding a thread-local variable, which could cause trouble with the connection thread pool. An added function parameter would be much simpler.
I agree that a function parameter seems simpler. We're requesting to be notified when a specific commit is durable in the engine, better specify _which_ commit than for InnoDB to guess, as you say :-). And an engine is free to ignore the parameter, and just make sure all existing transactions at that point are committed (eg. current lsn), as my original InnoDB implementation did. We need to be careful about lifetime. The transaction may no longer exist as a THD or trx_t (?) in memory. But innobase_commit_ordered() could return the corresponding LSN, as was suggested in another mail, and then commit_checkpoint_request() could pass that value. Something like that seems indeed a better API.
There seems to be some lock-free operations, I wonder if that makes sense, this is not timing-critical code (it's called once per binlog rotation).
It was a long time ago when I worked on this. The InnoDB log_sys.mutex certainly was a bottleneck before some InnoDB log related data structures were refactored to use C++11 std::atomic.
I was thinking that the log_flush_request list does not need anything fancy, and could just use a normal mutex, it is so rarely accessed. And any more critical code (eg. using log_sys.mutex) could just do an unsafe check for NULL log_flush_request list, at the worst skipping a binlog checkpoint notification until the next flush. But all that becomes more complicated when RESET MASTER will hang until the checkpoint notification arrives. Then a lost checkpoint notification becomes a problem on an idle server. That's why I'm unhappy about the way RESET MASTER is done now (even though I implemented it, IIRC). It really was my intention that skipping a binlog checkpoint from time to time should be allowed to simplify things in the engine.
The 10.4 innobase_mysql_log_notify() follows the anti-pattern https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces#do...
The comment that claims that it is safe might not hold for architectures that use a weak memory model, such as typical implementations of ARM, POWER, and RISC-V (RVWMO). I believe that the https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_orderi... that is employed by the 10.5 innodb_log_flush_request() and log_flush_notify() should be safer.
Right. As long as RESET MASTER doesn't tolerate delayed notifications, all of this becomes a concern :-(
If the last write to the log was an InnoDB internal one, it could be done by log_write_up_to(lsn, false). The log_sys.flushed_to_disk_lsn would only be advanced (to somewhere closer to log_sys.lsn) if there was a subsequent call to log_write_up_to(lsn, true).
What I found in the test case I added to MDEV-25611 is that InnoDB flushes the redo log every second (assuming I read the code right). This is the function srv_sync_log_buffer_in_background(). Hm, but actually this can be controlled by the option srv_flush_log_at_timeout, so my testcase could be improved to set this to a high value instead of disabling srv_sync_log_buffer_in_background() in the code.
Right, I did not think of multi-engine transactions. Perhaps there could be a special single-engine transaction mode? If the binlog is
Yes. single-engine transaction is surely the important usecase to optimise. It's nice if multi-engine transactions still work, but if they require multiple fsync still, I think that's perfectly fine, not something to allocate a lot of resources to optimise for.
One more thing: The post https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/ is missing the fact that when the "fast part" of an InnoDB transaction state change has finished, the LSN of the transaction COMMIT (or XA PREPARE or XA ROLLBACK or XA COMMIT) will be determined. It is not possible to reorder transactions after that point. In fact, after the completion of the "fast part", other transactions in the system will observe the state change of the transaction, even before it becomes durable.
Thanks. Yes, this sounds like what I would expect.
Another thing: Could replication make use of the log write completion mechanism https://jira.mariadb.org/browse/MDEV-24341 that was added to MariaDB Server 10.6.0?
Ah, this is interesting, thanks for the pointer. I will need to look into it more to fully understand, but it does seem like something that might be of use. Not just for user threads writing to the binlog, but maybe even more so for parallel replication worker threads. Because of the wait_for_prior_commit() requirements for in-order parallel replication, a _lot_ of THDs can be waiting for binlog commit, and being able to suspend these THDs and let a worker thread do something else in-between could potentially be very beneficial, I think. I also had the idea to use fibers/coroutines as is mentined in the MDEV description, but if that can be avoided, so much the better. - Kristian.