Marko Mäkelä <marko.makela@mariadb.com> writes:
Do you know how MyRocks implements ACID? There is a function handlerton::commit_checkpoint_request.
Sorry, I do not. Maybe Sergey Petrunya does.
That function does not currently take any parameter (such as THD) to identify the transaction of interest, and it cannot indicate that the most recent state change of the transaction has already been durably written.
The transaction of interest is the last one that called commit_ordered() prior to this call of commit_checkpoint_request(). I guess the binlog code could save it (it is trivial, as commit_ordered() calls are serialised), or InnoDB could do it itself. But let's start from what we actually want to achieve.

We have created a new binlog file, and crash recovery will now have to scan the two most recent binlog files. We want to know when scanning the old binlog file is no longer necessary; at that point we log a CHECKPOINT_EVENT recording this fact. But we don't want to stall waiting for everything to be flushed to disk immediately; we can just log the CHECKPOINT_EVENT later, when appropriate. So we call commit_checkpoint_request() into all storage engines to receive a notification when all currently (binlog-)committed transactions have become durable. Once an engine has synced past its corresponding LSN, it replies with commit_checkpoint_notify_ha(), and when all engines have done so, the CHECKPOINT_EVENT is binlogged.

In 10.4, innobase_checkpoint_request() saves the current LSN: lsn = log_get_lsn(). An extra check is made whether already lsn <= log_get_flush_lsn(); if so, commit_checkpoint_notify_ha() is called immediately. Otherwise we check again whenever the log is flushed, until the recorded LSN has been flushed.

The code in 11.1 is different, though it seems to be doing something similar. There seem to be some lock-free operations; I wonder if that makes sense, as this is not timing-critical code (it is called once per binlog rotation). There is also some complex logic to avoid missing a commit_checkpoint_notify_ha(). I can see how that is important, and how a missed notification could occur in a completely idle server. But I think it can be done more simply, by scheduling a new check every second or so as long as there are pending checkpoint requests. This is all asynchronous; there is no point in trying to report the checkpoint immediately.

If InnoDB is running non-durably, I think there is no point in checking the log at all. It doesn't help to scan the required binlog files if the transactions to be recovered in InnoDB were not durably prepared and are unrecoverable anyway. So in non-durable mode InnoDB could just call commit_checkpoint_notify_ha() immediately?

I notice that RESET MASTER also does a commit_checkpoint_request() and waits for the notification. I can understand why I did it this way, but in hindsight it doesn't seem like a good idea. Binlog checkpointing is asynchronous, so we should not wait on it. RESET MASTER surely cannot be time-critical, so it would probably be better to just ask InnoDB to flush everything to its log, skipping the checkpoint mechanism. Or just do nothing: does it make any sense to recover the binlog into a state consistent with InnoDB when we just deleted all of the binlog?!?

Anyway, these are comments on the current code. I still don't fully understand what the problem is with the current API wrt. InnoDB. But I don't see a big problem with changing it either, if InnoDB needs something different. All that is needed from the binlog side is a notification, at some point, that an old binlog file is no longer needed for recovery. If we can find a simpler way to do this, then that's good. Removing the checkpointing from RESET MASTER completely might be a good start.
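To make the notification side concrete, a sketch of what the flush-completion hook could look like (approximate names; the pending list is simplified, and commit_checkpoint_notify_ha() takes an extra handlerton argument in some server versions):

    /* Called whenever the InnoDB log has been flushed up to flush_lsn
       (or from a periodic task while requests are pending).  Complete
       every checkpoint request whose LSN is now durable. */
    static void log_flush_notify(lsn_t flush_lsn)
    {
      mysql_mutex_lock(&pending_mutex);
      for (auto it= pending.begin(); it != pending.end(); )
      {
        if (it->lsn <= flush_lsn)
        {
          /* The binlog can write its CHECKPOINT_EVENT once all
             engines have replied. */
          commit_checkpoint_notify_ha(it->cookie);
          it= pending.erase(it);
        }
        else
          ++it;
      }
      mysql_mutex_unlock(&pending_mutex);
    }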
Where should a notification be initiated if all changes have already been written at the time handlerton::commit_checkpoint_request is called?
Then commit_checkpoint_notify_ha() could be called immediately before returning from commit_checkpoint_request().
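So the request side might reduce to something like this (again approximate names; srv_flush_log_at_trx_commit == 0 corresponds to the non-durable mode discussed above):

    static void innodb_checkpoint_request(void *cookie)
    {
      const lsn_t lsn= log_get_lsn();
      if (!srv_flush_log_at_trx_commit   /* non-durable: nothing to wait for */
          || lsn <= log_get_flush_lsn()) /* already durable */
      {
        commit_checkpoint_notify_ha(cookie);
        return;
      }
      mysql_mutex_lock(&pending_mutex);
      pending.push_back({lsn, cookie});  /* completed later by
                                            log_flush_notify() above */
      mysql_mutex_unlock(&pending_mutex);
    }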
Since the binlog writing is essentially sequential in nature, it is more efficient to do it in a single thread for all waiting threads, and this is how the MariaDB binlog group commit is done.
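In simplified form, the pattern is: the first committer to arrive becomes the group leader and performs the sequential writes and the single sync on behalf of everyone queued behind it. A toy model, not the actual MYSQL_BIN_LOG code:

    std::mutex queue_mutex;
    std::vector<Trx*> queue;              /* hypothetical Trx type */

    void binlog_group_commit(Trx *trx)
    {
      bool leader;
      {
        std::lock_guard<std::mutex> g(queue_mutex);
        leader= queue.empty();            /* first in queue leads */
        queue.push_back(trx);
      }
      if (!leader)
      {
        trx->wait_until_done();           /* follower just waits */
        return;
      }
      std::vector<Trx*> group;
      {
        std::lock_guard<std::mutex> g(queue_mutex);
        group.swap(queue);                /* take the whole group */
      }
      for (Trx *t : group)
        binlog_write(t);                  /* sequential appends */
      binlog_fdatasync();                 /* one sync for the group */
      for (Trx *t : group)
        t->signal_done();
    }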
Yes, optimizing that would require a file format change: replacing the append-only last file with something else, such as one or more preallocated files, possibly arranged as a ring buffer.
Agree, pre-allocated would be more efficient.
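For illustration, pre-allocating a log file on POSIX systems could look like this (a sketch; not something MariaDB currently does):

    #include <fcntl.h>
    #include <unistd.h>

    /* Reserve the blocks of a future binlog file up front, so that
       appends never extend the file size (one reason fdatasync() can
       be cheaper: no file-size metadata to flush). */
    int create_preallocated_binlog(const char *path, off_t size)
    {
      int fd= open(path, O_CREAT | O_WRONLY, 0660);
      if (fd < 0)
        return -1;
      if (posix_fallocate(fd, 0, size))   /* returns 0 on success */
      {
        close(fd);
        return -1;
      }
      return fd;
    }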
One related idea that has been floating around is to use the InnoDB log as a "doublewrite buffer" for short enough binlog event groups. The maximum mini-transaction size (dictated by the InnoDB recovery code) is something like 4 megabytes. On an InnoDB log checkpoint, any buffered binlog event groups would have to be durably written to the binlog. Likewise, if a binlog event group is too large to be buffered, it would have to be written and fdatasync()ed in advance.
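Roughly, the write path could look like this (hypothetical helper names, only to illustrate the shape of the idea):

    /* ~4MB: mini-transaction size limit dictated by InnoDB recovery */
    static const size_t MAX_MTR_PAYLOAD= 4 << 20;

    void binlog_event_group_commit(const uchar *buf, size_t len, mtr_t *mtr)
    {
      if (len <= MAX_MTR_PAYLOAD)
      {
        /* Small group: buffer it in the InnoDB redo log; it becomes
           durable together with the transaction and is copied to the
           real binlog lazily, at the latest on a log checkpoint. */
        mtr_log_binlog_group(mtr, buf, len);   /* hypothetical */
      }
      else
      {
        /* Too large to buffer: write it to the binlog up front and
           make it durable before the transaction may commit. */
        binlog_write_direct(buf, len);         /* hypothetical */
        binlog_fdatasync();
      }
    }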
Ah, yes, this is an interesting idea actually.
Yes, it is an interesting idea. What do you see as the mechanism to recover a transaction inside InnoDB in case of a crash just after writing to the binlog?
It is simple but potentially costly: transactions that were recovered in any state other than COMMIT or XA PREPARE will be rolled back, and then everything will be replayed from the binlog.
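In outline (approximate names; the real state checks are more involved):

    void recover_after_crash()
    {
      /* 1. Roll back everything InnoDB did not recover as committed
            or as a prepared XA participant. */
      for (trx_t *trx : recovered_transactions())     /* hypothetical */
        if (trx->state != TRX_STATE_COMMITTED &&
            trx->state != TRX_STATE_PREPARED)
          trx_rollback(trx);

      /* 2. Re-apply the tail of the binlog, starting from the binlog
            file name and offset that InnoDB recorded with its last
            committed transaction. */
      binlog_position pos= innodb_last_binlog_position(); /* hypothetical */
      replay_binlog_from(pos);
    }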
I have an old worklog where this idea was explored in some detail. It is not trivial; there are many corner cases where replaying binlog events will not reliably recover the state (e.g. statement-based binlogging). People have learned to live with this for replication, but now it would be extended to crash recovery as well. One "interesting" corner case is a multi-engine transaction that ended up durable in one engine but needs replay in another engine. Maybe all this is tolerable, and we could require row-mode binlogging or something; the gain would be huge for some workloads.
InnoDB does store the latest committed binlog file name and offset. I do not know where exactly that information is being used.
(I think the binlog file name and offset information is used to provision a new slave from an innobackup or LVM/BTRFS snapshot?)

 - Kristian.