On Fri, May 12, 2023 at 8:32 PM Kristian Nielsen
The transaction of interest is the last one that called commit_ordered() prior to this call of commit_checkpoint_request(). I guess binlog code could save it (it's trivial as commit_ordered() calls are serialised), or InnoDB could do it itself.
I think that for InnoDB to guess the last binlog-committed transaction would require adding a thread-local variable, which could cause trouble with the connection thread pool. An added function parameter would be much simpler.
But let's take it from what we actually want to achive. We have created a new binlog file, and now crash recovery will have to scan the two most recent binlog files. We want to know when scanning the old binlog file is no longer necessary; then we will log a CHECKPOINT_EVENT recording this fact. But we don't want to stall waiting for everything to be flushed to disk immediately; we can just log the CHECKPOINT_EVENT later when appropriate.
Is this the original purpose and motivation of the handlerton::commit_checkpoint_request?
There seems to be some lock-free operations, I wonder if that makes sense, this is not timing-critical code (it's called once per binlog rotation).
It was a long time ago when I worked on this. The InnoDB log_sys.mutex certainly was a bottleneck before some InnoDB log related data structures were refactored to use C++11 std::atomic. I might have originally misunderstood the source code comment in sql/handler.h that was added in MDEV-232. The 10.4 innobase_mysql_log_notify() follows the anti-pattern https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces#do... The comment that claims that it is safe might not hold for architectures that use a weak memory model, such as typical implementations of ARM, POWER, and RISC-V (RVWMO). I believe that the https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_orderi... that is employed by the 10.5 innodb_log_flush_request() and log_flush_notify() should be safer.
There's also some complex logic to avoid missing a commit_checkpoint_notify_ha(). I can see how that is important and could occur in a completely idle server. But I think it can be done simpler, by simply scheduling a new check of this every second or so as long as there are pending checkpoint requests. This is all asynchronous, no point in trying to report the checkpoint immediately.
According to a comment in https://jira.mariadb.org/browse/MDEV-25611 there is an https://rr-project.org/ trace of a hang available to Andrei Elkin. It is also pretty easy to reproduce the hang by using some debug instrumentation that will make InnoDB to try harder to avoid issuing log checkpoints (and therefore trigger another call to log_flush_notify() which will therefore "rescue" the hang). It would be very interesting to know what happened concurrently with or after the last call to innodb_log_flush_request().
If InnoDB is running non-durably, I think there is no point in checking the log at all. It doesn't help that we scan the required log files if the transactions to be recoved from InnoDB were not prepared durably and are unrecoverable anyway. So in non-durable mode InnoDB could just call commit_checkpoint_notify_ha() immediately?
I do not think that we should spend too much effort to optimize an unsafe, corruption-prone mode of operation. We should be safe by default (https://jira.mariadb.org/browse/MDEV-16589) and optimize that case.
I notice that RESET MASTER also does a commit_checkpoint_request() and waits for the notification. I can understand why I did it this way, but in hindsight it doesn't seem like a good idea. Binlog checkpoint is async, so we should not wait on it. RESET MASTER surely cannot be time critical, so probably it would be better to just ask InnoDB to flush everything to its log, skipping the checkpoint mechanism. Or just do nothing, does it make any sense to recover the binlog into a consistent state with InnoDB when we just deleted all of the binlog?!?
If RESET MASTER can rewind the binlog position, then it might make sense to ensure that the new binlog position is durably written to the persistent InnoDB transaction metadata storage.
Anyway, these are comments on the current code. I still didn't fully understand what the problem is with the current API wrt. InnoDB?
Based on your description, it does not feel like a big design problem. I would very much like to see an analysis and fix of the MDEV-25611 hang. It could be something fairly trivial, or a sign of a more serious problem in the design.
I have an old worklog where this idea was explored in some detail. It is not trivial, there are many corner cases where replaying binlog events will not reliably recover the state (eg. statement-based binlogging). People have learned to live with this for replication, but now that will be extended to the binlog. One "interesting" corner case is a multi-engine transaction that ended up durable in one engine but needs replay in another engine. Maybe all this is tolerable and we can require row-mode binlogging or something, the gain will be huge for some workloads.
Right, I did not think of multi-engine transactions. Perhaps there could be a special single-engine transaction mode? If the binlog is guaranteed to be durably written ahead of the engine log, then we should be able to skip the MySQLXid mechanism as well. However, I do not know how much that might end up saving.
store the latest committed binlog file name and offset. I do not know where exactly that information is being used.
(I think the information with binlog file name and offset is used to provision a new slave from an innobackup or LVM/BTRFS snapshot?).
Yes, something like that. See https://jira.mariadb.org/browse/MDEV-22351 and https://jira.mariadb.org/browse/MDEV-21611. One more thing: The post https://knielsen-hq.org/w/fixing-mysql-group-commit-part-3/ is missing the fact that when the "fast part" of an InnoDB transaction state change has finished, the LSN of the transaction COMMIT (or XA PREPARE or XA ROLLBACK or XA COMMIT) will be determined. It is not possible to reorder transactions after that point. In fact, after the completion of the "fast part", other transactions in the system will observe the state change of the transaction, even before it becomes durable. Another thing: Could replication make use of the log write completion mechanism https://jira.mariadb.org/browse/MDEV-24341 that was added to MariaDB Server 10.6.0? -- Marko Mäkelä, Lead Developer InnoDB MariaDB plc