Redo log and tablespace flushing
I have a question that may seem somewhat obscure, but what I really want to know is how the disk flushing and crash safety work. Do tablespace commits get explicitly flushed during normal runtime operation? If we have a write that successfully commits to the redo log and to the binlog, but the tablespace loses, say, 5 seconds' worth of commits in an unclean shutdown, would crash recovery deal with it? Is replaying the redo log followed by binlog-based recovery sufficient to put the tablespace(s) into a consistent state even if the redo and binary logs are, in terms of on-disk state, a few seconds ahead of the tablespaces? In other words, provided that write ordering is preserved (ordering as guided by flush calls), can I do the equivalent of LD_PRELOAD=libeatmydata on the tablespace operations safely, as long as the redo and binary logs are fsync()-ed reliably?
Gordan Bobic via discuss writes:
Do tablespace commits get explicitly flushed during normal runtime operation?
Not commits, no. Only the redo log (and binlog) is fsync'ed per commit, as controlled by --innodb-flush-log-at-trx-commit and --sync-binlog.
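(For concreteness, a minimal my.cnf sketch of the per-commit durability knobs Kristian names; the values shown are the fully durable settings, purely for illustration:)

    [mysqld]
    # fsync() the InnoDB redo log at every transaction commit
    innodb_flush_log_at_trx_commit = 1
    # fsync() the binary log after every transaction commit
    sync_binlog = 1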
If we have a write that successfully commits to the redo log and to the binlog, but the tablespace loses, say, 5 seconds' worth of commits in an unclean shutdown, would crash recovery deal with it?
Yes.
Is replaying the redo log followed by binlog-based recovery sufficient to put the tablespace(s) into a consistent state even if the redo and binary logs are, in terms of on-disk state, a few seconds ahead of the tablespaces?
Yes, this is precisely the purpose of the redo log - and why it's also called a write-ahead log.
In other words, provided that write ordering is preserved (ordering as guided by flush calls), can I do the equivalent of LD_PRELOAD=libeatmydata on the tablespace operations safely, as long as the redo and binary logs are fsync()-ed reliably?
No. The redo log is of finite size, and cycles. InnoDB regularly does a checkpoint, to ensure that all tablespace data up to a certain point has been durably written. At that point, the redo log corresponding to earlier changes is no longer needed, and can be overwritten by new log data. Crash recovery only needs to replay the log from the last checkpoint. If libeatmydata or other incorrect fsync behaviour leaves a checkpoint corrupted, then crash recovery can fail.

- Kristian.
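(To make the cycling concrete, a toy Python model of a fixed-size write-ahead log with a checkpoint; all names and sizes here are invented for illustration, and this is not InnoDB's actual implementation:)

    # Toy model of a cyclic write-ahead log (NOT InnoDB's real code).
    # The LSN grows forever; only the last LOG_SIZE bytes exist physically,
    # so new records overwrite old ones. A checkpoint records the LSN up to
    # which all data-page changes are durable; recovery replays from there.
    LOG_SIZE = 1024  # illustrative physical log capacity in bytes

    class ToyWAL:
        def __init__(self):
            self.lsn = 0             # logical end of the log, ever-growing
            self.checkpoint_lsn = 0  # everything below this is durable

        def append(self, record: bytes):
            # Records older than the checkpoint may be overwritten, so the
            # live region [checkpoint_lsn, lsn) must fit in LOG_SIZE.
            if self.lsn + len(record) - self.checkpoint_lsn > LOG_SIZE:
                raise RuntimeError("log full: checkpoint must advance first")
            self.lsn += len(record)

        def checkpoint(self, flushed_up_to: int):
            # Called only after data pages up to flushed_up_to were really
            # fdatasync()ed. If that sync was a lie (libeatmydata), this
            # advance is a lie too, and recovery from it can fail.
            self.checkpoint_lsn = flushed_up_to

        def recovery_start(self) -> int:
            return self.checkpoint_lsn  # crash recovery replays from here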
Hi Kristian,
Thank you for your excellent reply. I thought that some additional details might be worth mentioning.
On Sat, Nov 18, 2023 at 8:03 PM Kristian Nielsen via discuss wrote:
The redo log is of finite size, and cycles. InnoDB regularly does a checkpoint, to ensure that all tablespace data up to a certain point has been durably written.
The writes to each data file must be made durable by fdatasync() or fsync() before the log checkpoint can be advanced. There are two checkpoint headers near the start of ib_logfile0, for remembering the last two checkpoint LSNs and the corresponding log file offsets. Recovery or mariadb-backup --backup will choose the larger checkpoint LSN as the starting point.

One more aspect of crash recovery is the InnoDB doublewrite buffer, which protects against torn page writes. When it is enabled, any data page write is first made to the doublewrite buffer (128 pages in the InnoDB system tablespace) and, upon write completion, to the final destination. That way, if the process is killed during the "main" write, it should be possible to find an intact version of the page in the doublewrite buffer. This buffer is not used by mariadb-backup; it simply retries reading pages when it encounters a checksum mismatch.

Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
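(A sketch of reading the two checkpoint LSNs with Python; it assumes the 10.8+ layout mentioned later in this thread, i.e. 64-byte checkpoint blocks at offsets 0x1000 and 0x2000 that begin with a 64-bit big-endian LSN. The field position is an assumption; the authoritative parsing is the Perl code in mysql-test/suite/innodb/include/no_checkpoint_end.inc for your version:)

    # Sketch: read both InnoDB checkpoint LSNs from ib_logfile0.
    # Assumes MariaDB 10.8+ block offsets (0x1000, 0x2000) and that each
    # block begins with the big-endian 64-bit checkpoint LSN (assumption).
    import struct, sys

    def read_checkpoint_lsns(path):
        lsns = []
        with open(path, "rb") as f:
            for offset in (0x1000, 0x2000):
                f.seek(offset)
                (lsn,) = struct.unpack(">Q", f.read(8))
                lsns.append(lsn)
        return lsns

    if __name__ == "__main__":
        a, b = read_checkpoint_lsns(sys.argv[1])
        print("checkpoint LSNs:", a, b)
        print("recovery starts from the larger:", max(a, b))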
Thanks for this. Is there a way to force replay of the entire redo log on an unclean shutdown even if the checkpoint in the redo log says it was flushed to tablespace?

I'm exploring the idea of running datadir on storage that preserves write ordering but runs with the equivalent of nobarrier. It will still flush in the background every X seconds where X is configurable, so I am hoping to use the redo log to keep my data crash-safe even though I am lying about tablespace write flushes, because write ordering will be preserved despite running with the equivalent of nobarrier.
On Sun, Nov 19, 2023 at 1:03 AM Gordan Bobic wrote:
Thanks for this. Is there a way to force replay of the entire redo log on an unclean shutdown even if the checkpoint in the redo log says it was flushed to tablespace?
You can overwrite the newer checkpoint block, so that recovery is forced to use the older one. Before MariaDB 10.8, the two checkpoint blocks are 512 (0x200) bytes, starting at ib_logfile0 offsets 0x200 and 0x600. Starting with 10.8, the checkpoint blocks are 64 bytes, starting at ib_logfile0 offsets 0x1000 and 0x2000. Obviously, do not try this on any important data; experiment only on a copy of the data.

It is possible that recovery will fail in various ways if the section of the log between the older checkpoint and the logical end of the log has been overwritten. The InnoDB WAL file is cyclic: checkpoints "truncate" the head, and the tail (new log records) is not supposed to overwrite the head. If you are moving the head backwards by discarding the latest checkpoint, there will be no guarantee that no overwrite took place.

Another way to experiment would be to run mariadb-backup --backup while a server is executing a write-heavy workload. When you --prepare the backup, it will start from the LSN of the checkpoint that was the latest when the backup started. When the backup finishes, the server's log file may already be several checkpoints ahead of the backup.
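(A sketch of the experiment Marko describes, again for a copy of the data only; the 10.8+ offsets and the LSN-first field layout are assumptions carried over from above, so verify them against no_checkpoint_end.inc first:)

    # Sketch: zero the NEWER checkpoint block so that recovery is forced to
    # start from the older checkpoint. Run this only on a COPY of the data.
    # Assumptions: MariaDB 10.8+ layout, 64-byte blocks at 0x1000/0x2000,
    # big-endian 64-bit checkpoint LSN as the first field of each block.
    import struct

    OFFSETS = (0x1000, 0x2000)
    BLOCK_SIZE = 64

    def zero_newer_checkpoint(path):
        with open(path, "r+b") as f:
            lsns = []
            for off in OFFSETS:
                f.seek(off)
                (lsn,) = struct.unpack(">Q", f.read(8))
                lsns.append(lsn)
            newer = OFFSETS[lsns.index(max(lsns))]
            f.seek(newer)
            f.write(b"\x00" * BLOCK_SIZE)  # recovery now sees only the older one

    # zero_newer_checkpoint("/copy-of-datadir/ib_logfile0")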
I'm exploring the idea of running datadir on storage that preserves write ordering but runs with the equivalent of nobarrier. It will still flush in the background every X seconds where X is configurable, so I am hoping to use the redo log to keep my data crash-safe even though I am lying about tablespace write flushes, because write ordering will be preserved despite running with the equivalent of nobarrier.
I can't comment much on that. It could be a good idea to execute some kind of "pull the plug" testing during a write workload. Perhaps that could be arranged more easily in a virtualized environment.

Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
On Sun, Nov 19, 2023 at 3:42 PM Marko Mäkelä wrote:
Thanks for this. Is there a way to force replay of the entire redo log on an unclean shutdown even if the checkpoint in the redo log says it was flushed to tablespace?
You can overwrite the newer checkpoint block, so that recovery is forced to use the older one. Before MariaDB 10.8, the two checkpoint blocks are 512 (0x200) bytes, starting at ib_logfile0 offsets 0x200 and 0x600. Starting with 10.8, the checkpoint blocks are 64 bytes, starting at ib_logfile0 offsets 0x1000 and 0x2000. Obviously, do not try this on any important data; experiment only on a copy of the data. It is possible that recovery will fail in various ways if the section of the log between the older checkpoint and the logical end of the log has been overwritten. The InnoDB WAL file is cyclic: checkpoints "truncate" the head, and the tail (new log records) is not supposed to overwrite the head. If you are moving the head backwards by discarding the latest checkpoint, there will be no guarantee that no overwrite took place.
Another way to experiment would be to run mariadb-backup --backup while a server is executing a write-heavy workload. When you --prepare the backup, it will start from the LSN of the checkpoint that was the latest when the backup started. When the backup finishes, the server's log file may already be several checkpoints ahead of the backup.
I think what I'm looking for is an option to ignore checkpoints, scan the entire redo log, and replay everything from the lowest to the highest available LSN. From what you are saying, if I zero out bytes 512-1023 and bytes 1536-2047, will that force a full log scan/replay? Did I understand that correctly?
I'm exploring the idea of running datadir on storage that preserves write ordering but runs with the equivalent of nobarrier. It will still flush in the background every X seconds where X is configurable, so I am hoping to use the redo log to keep my data crash-safe even though I am lying about tablespace write flushes, because write ordering will be preserved despite running with the equivalent of nobarrier.
I can't comment much on that. It could be a good idea to execute some kind of "pull the plug" testing during a write workload. Perhaps that could be arranged more easily in a virtualized environment.
Yes, obviously this would need some extreme testing, that goes without saying. I just wanted to make sure my idea wasn't outright ridiculous before I went down this particular rabbit hole.
Gordan Bobic via discuss writes:
Thanks for this. Is there a way to force replay of the entire redo log on an unclean shutdown even if the checkpoint in the redo log says it was flushed to tablespace?
This won't help you if the part of the redo log you need was overwritten by new records due to the cyclic nature of the redo log.
I'm exploring the idea of running datadir on storage that preserves write ordering but runs with the equivalent of nobarrier. It will still flush in the background every X seconds where X is configurable, so I am hoping to use the redo log to keep my data crash-safe even though I am lying about tablespace write flushes, because write ordering will be preserved despite running with the equivalent of nobarrier.
If write-ordering is preserved (but it has to be preserved between log writes and data writes as well), then you will be crash-safe, because the situation will be the same as if a fully durable system crashed X seconds ago. You will lose the last X seconds of commits, but data will be consistent, similar to --innodb-flush-log-at-trx-commit=2 (or 0).

What goal are you trying to achieve here? Some performance gains, or the ability to use main storage with some non-standard write semantics?

You can configure InnoDB to have a huge redo log, and perhaps there are also some options to reduce the frequency of checkpoints. That should in practice avoid the problem of needed redo log being overwritten. But it's obviously not something that InnoDB was designed to support.

- Kristian.
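(A sketch of that kind of configuration; innodb_log_file_size is the real knob for redo log capacity, and the other two are examples of settings that influence how eagerly dirty pages are flushed. The values are illustrative, not recommendations:)

    [mysqld]
    # A large redo log means the checkpoint needs to advance less often,
    # so old log records survive longer before being overwritten.
    innodb_log_file_size = 32G
    # Example flushing knobs (defaults shown; tune with care):
    innodb_max_dirty_pages_pct = 90
    innodb_adaptive_flushing_lwm = 10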
On Sun, Nov 19, 2023 at 5:23 PM Kristian Nielsen wrote:
Gordan Bobic via discuss writes:
Thanks for this. Is there a way to force replay of the entire redo log on an unclean shutdown even if the checkpoint in the redo log says it was flushed to tablespace?
This won't help you if the part of the redo log you need was overwritten by new records due to the cyclic nature of the redo log.
I am working here on the assumption that the time taken to overwrite the circular buffer (typically sized to absorb the peak daily hour of writes) is going to be vastly greater than the window of writes that could be lost to lying about tablespace sync (~5 seconds). So unless something spectacularly anomalous happens, I think there should be plenty of margin for error there.
I'm exploring the idea of running datadir on storage that preserves write ordering but runs with the equivalent of nobarrier. It will still flush in the background every X seconds where X is configurable, so I am hoping to use the redo log to keep my data crash-safe even though I am lying about tablespace write flushes, because write ordering will be preserved despite running with the equivalent of nobarrier.
If write-ordering is preserved (but it has to be preserved between log writes and data writes as well), then you will be crash-safe, because the situation will be the same as if a fully durable system crashed X seconds ago. You will lose the last X seconds of commits, but data will be consistent, similar to --innodb-flush-log-at-trx-commit=2 (or 0).
Yes, I already do this on the slaves (innodb_flush_log_at_trx_commit=1, sync_master_info=1) with storage that preserves write ordering but lies about having committed to stable storage. That part of the setup is pretty bulletproof. Slaves just restart replicating from a point a few seconds before and everything is consistent.
What goal are you trying to achieve here? Some performance gains, or the ability to use main storage with some non-standard write semantics?
The objective is to gain a bit of performance on the master node where being a few seconds behind after a dirty shutdown is not a palatable option.
You can configure InnoDB to have a huge redo log and perhaps there are also some options to reduce the frequency of checkpoints.
Well, the traditional rule of thumb has been to size the redo log to absorb the daily peak hour of writes. I prefer to tune it so that the checkpoint age never gets too close to the limit (the log size). Unfortunately, the latter is impossible with the redo log checkpointing changes in 10.5+ (it never flushes anything at all until it reaches the high water mark), but that's a topic for a different thread.
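(For reference, checkpoint age can be watched against the log capacity from SQL; this assumes the Innodb_checkpoint_age status variable present in MariaDB builds:)

    -- Bytes of redo log written since the last checkpoint
    SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint_age';
    -- The capacity it must stay comfortably below
    SELECT @@innodb_log_file_size;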
That should in practice avoid the problem with needed redo log being overwritten.
That would still leave the edge case of a few seconds after it does eventually write the checkpoint, would it not? I am effectively looking at a case of "never write a checkpoint".
But it's obviously not something that InnoDB was designed to support.
I don't go down rabbit holes like this because it's easy and everybody does it. :-)
Gordan Bobic via discuss writes:
That would still leave the edge case of a few seconds after it does eventually write the checkpoint, would it not? I am effectively looking at a case of "never write a checkpoint".
Yes. I'm thinking that Marko's suggestion to clear the newest checkpoint (not both of them) would eliminate this edge case (with all of the caveats that Marko mentioned).
The objective is to gain a bit of performance on the master node where being a few seconds behind after a dirty shutdown is not a palatable option.
But doesn't InnoDB already handle this by itself? That's the whole point of the write-ahead log and checkpointing. Database operations only need to wait for the durable write of redo log records. Durable write of buffer pool pages happens in the background, nothing needs to wait for it (except checkpointing, which shouldn't stall things as long as the redo log is sufficiently large). So where do you gain the performance with this idea?

Don't get me wrong, I think it's cool to push performance to the limits (and beyond). I'm just curious what the mechanism would be that would make this increase performance over what the write-ahead log / checkpointing already achieves. What is the storage you're using, how does it improve the page flushing internally over what InnoDB itself can achieve?

- Kristian.
On Sun, Nov 19, 2023 at 11:36 PM Kristian Nielsen wrote:
Gordan Bobic via discuss writes:
That would still leave the edge case of a few seconds after it does eventually write the checkpoint, would it not? I am effectively looking at a case of "never write a checkpoint".
Yes. I'm thinking that Marko's suggestion to clear the newest checkpoint (not both of them) would eliminate this edge case (with all of the caveats that Marko mentioned).
Which one is the more recent one? The first or second? If establishing which is more recent requires reading it, how do I parse these blocks and what am I looking for?
The objective is to gain a bit of performance on the master node where being a few seconds behind after a dirty shutdown is not a palatable option.
But doesn't InnoDB already handle this by itself? That's the whole point of the write-ahead log and checkpointing. Database operations only need to wait for the durable write of redo log records. Durable write of buffer pool pages happens in the background, nothing needs to wait for it (except checkpointing, which shouldn't stall things as long as the redo log is sufficiently large). So where do you gain the performance with this idea?
By removing the flushing overhead from the tablespace path and offloading that overhead to the storage back end. The difference may be relatively small, but it seems there is some improvement that could be had by just doing a heavily write-cached offload to the order-preserving back end.
Don't get me wrong, I think it's cool to push performance to the limits (and beyond). I'm just curious what the mechanism would be that would make this increase performance over what the write-ahead log / checkpointing already achieves. What is the storage you're using, how does it improve the page flushing internally over what InnoDB itself can achieve?
ZFS - it preserves write ordering (based on the flushing calls it receives), and if we can run datadir with sync=disabled and only ib_logfile* and binlogs on a path with sync=standard, it should provide some improvement; e.g. with those writes performed asynchronously in the background, we could turn up the compression without anything having to wait the extra time for these flushes and potentially causing a stall.

I'll admit this is a somewhat obscure case of relying on side effects, where I still need InnoDB to emit all of the fsync() calls but am actively proposing to lie to InnoDB about it, in the hope that redo log replay will save me after a dirty shutdown, because data loss on the back end is limited to seconds while the WAL is sized to absorb tens of minutes of writes.
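(A sketch of the dataset split being described; zfs create/set and the sync property are standard ZFS, while the pool/dataset names are hypothetical. zfs_txg_timeout is the background flush interval, the "X seconds" mentioned above:)

    # Hypothetical pool/dataset layout for the split described above.
    zfs create tank/mysql
    zfs create tank/mysql/data   # InnoDB tablespaces (datadir)
    zfs create tank/mysql/log    # ib_logfile0 and binlogs
    # Datadir: ignore fsync()/fdatasync(); ordered txg commits still flush
    # in the background every zfs_txg_timeout seconds.
    zfs set sync=disabled tank/mysql/data
    # Logs: honour fsync(), keeping redo log and binlog durable per commit.
    zfs set sync=standard tank/mysql/log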
Hi Gordan and Kristian,
On Mon, Nov 20, 2023 at 12:48 PM Gordan Bobic via discuss wrote:
Which one is the more recent one? The first or second? If establishing which is more recent requires reading it, how do I parse these blocks and what am I looking for?
You should look for the 64-bit big-endian unsigned checkpoint LSN. In the file mysql-test/suite/innodb/include/no_checkpoint_end.inc in the source code repository that corresponds to the version that you use, you should find some Perl code for this.
ZFS - it preserves write ordering (based on the flushing calls it receives), and if we can run datadir with sync=disabled and only ib_logfile* and binlogs on a path with sync=standard, it should provide some improvement
I see. Is there any alternative system call that could be used to guarantee write ordering? That is, a lighter-weight variant of fdatasync()? I think that we'd only want a strict fdatasync() on the redo log files when the user cares about innodb_flush_log_at_trx_commit=1. If there were a lighter-weight write-ordering system call, and if fdatasync() made all previous ordered writes persistent, then this could gain some performance in the page flushing, but maybe not so much in the end.

Does your storage stack (including the file system implementation in the kernel) support FUA?

Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
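(For completeness, the two durability primitives discussed in this thread, as exposed to Python on Linux; a trivial sketch with an invented file name:)

    # fdatasync() flushes file data (plus only the metadata needed to read
    # it back); fsync() also flushes remaining metadata such as timestamps.
    import os

    fd = os.open("testfile", os.O_WRONLY | os.O_CREAT, 0o600)
    os.write(fd, b"redo log record\n")
    os.fdatasync(fd)  # the per-commit durability primitive for the redo log
    os.fsync(fd)      # the heavier variant
    os.close(fd)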
Participants (3): Gordan Bobic, Kristian Nielsen, Marko Mäkelä