On Sun, Nov 19, 2023 at 5:23 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Gordan Bobic via discuss <discuss@lists.mariadb.org> writes:
Thanks for this. Is there a way to force replay of the entire redo log on an unclean shutdown even if the checkpoint in the redo log says it was flushed to tablespace?
This won't help you if the part of the redo log you need was overwritten by new records due to the cyclic nature of the redo log.
I am working on the assumption that the time taken to overwrite the circular redo log (typically sized to absorb the peak daily hour of writes) is going to be vastly greater than the window of writes that could be lost by lying about tablespace sync (~5 seconds). So unless something spectacularly anomalous happens, there should be plenty of margin for error.
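To put a rough number on that margin, here is a back-of-the-envelope sketch. The one-hour and 5-second figures are just the illustrative values from the paragraph above, and the function name is made up for this example; none of it is measured from a real server.

```python
# Rough margin estimate; every number here is an assumed, illustrative value.
REDO_LOG_COVERS_SECONDS = 60 * 60   # redo log sized for one peak hour of writes
UNSYNCED_WINDOW_SECONDS = 5         # writes the storage may still hold unflushed


def overwrite_margin(log_covers_s: float, unsynced_window_s: float) -> float:
    """Return how many times longer the log takes to wrap than the unsynced window."""
    if unsynced_window_s <= 0:
        raise ValueError("unsynced window must be positive")
    return log_covers_s / unsynced_window_s


margin = overwrite_margin(REDO_LOG_COVERS_SECONDS, UNSYNCED_WINDOW_SECONDS)
print(f"redo log wraps {margin:.0f}x slower than the unsynced window")
```

Since the log is sized for the peak hour, this is the worst case; at lower write rates the log takes longer to wrap and the margin only grows.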
I'm exploring the idea of running datadir on storage that preserves write ordering but runs with the equivalent of nobarrier. It will still flush in the background every X seconds where X is configurable, so I am hoping to use the redo log to keep my data crash-safe even though I am lying about tablespace write flushes, because write ordering will be preserved despite running with the equivalent of nobarrier.
If write-ordering is preserved (but it has to be preserved between log writes and data writes as well), then you will be crash-safe, because the situation will be the same as if a fully durable system had crashed X seconds ago. You will lose the last X seconds of commits, but data will be consistent, similar to --innodb-flush-log-at-trx-commit=2 (or 0).
Yes, I already do this on the slaves (innodb_flush_log_at_trx_commit=1, sync_master_info=1) with storage that preserves write ordering but lies about having committed to stable storage. That part of the setup is pretty bulletproof. Slaves just restart replicating from a point a few seconds before and everything is consistent.
What goal are you trying to achieve here? Some performance gains, or the ability to use main storage with some non-standard write semantics?
The objective is to gain a bit of performance on the master node where being a few seconds behind after a dirty shutdown is not a palatable option.
You can configure InnoDB to have a huge redo log and perhaps there are also some options to reduce the frequency of checkpoints.
Well, the traditional rule of thumb has been to size the redo log to absorb the daily peak hour of writes. I prefer to size it so that checkpoint age never gets too close to the limit (log size). Unfortunately, the latter option is impossible with the redo log checkpointing changes in 10.5+ (it never flushes anything at all until it reaches the high water mark), but that's for a different conversation thread.
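For illustration, this is roughly the check I have in mind when I say "never gets too close to the limit". It is only a sketch: the function names and the 25% threshold are hypothetical, and the checkpoint age and log size are passed in as plain numbers rather than read from the server.

```python
GIB = 1024 ** 3


def checkpoint_headroom(checkpoint_age: int, log_size: int) -> float:
    """Fraction of the redo log still free before checkpoint age hits the limit."""
    if log_size <= 0:
        raise ValueError("log_size must be positive")
    if not 0 <= checkpoint_age <= log_size:
        raise ValueError("checkpoint_age must be between 0 and log_size")
    return 1.0 - checkpoint_age / log_size


def too_close(checkpoint_age: int, log_size: int, min_headroom: float = 0.25) -> bool:
    """True if the remaining headroom is below the (assumed) comfort threshold."""
    return checkpoint_headroom(checkpoint_age, log_size) < min_headroom


# Example with made-up values: 6 GiB of checkpoint age in an 8 GiB redo log.
print(checkpoint_headroom(6 * GIB, 8 * GIB))
print(too_close(7 * GIB, 8 * GIB))
```

If a check like this trips during the peak hour, the log is undersized for the approach; if it never trips, there is room to spare.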
That should in practice avoid the problem with needed redo log being overwritten.
That would still leave the edge case of a few seconds after it does eventually write the checkpoint, would it not? I am effectively looking at a case of "never write a checkpoint".
But it's obviously not something that InnoDB was designed to support.
I don't go down rabbit holes like this because it's easy and everybody does it. :-)