On Sun, Nov 19, 2023 at 11:36 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Gordan Bobic via discuss <discuss@lists.mariadb.org> writes:
That would still leave the edge case of a few seconds after it does eventually write the checkpoint, would it not? I am effectively looking at a case of "never write a checkpoint".
Yes. I'm thinking that Marko's suggestion to clear the newest checkpoint (not both of them) would eliminate this edge case (with all of the caveats that Marko mentioned).
Which one is the more recent one? The first or second? If establishing which is more recent requires reading it, how do I parse these blocks and what am I looking for?
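One way to answer the "which is more recent" question is to read the LSN stored in each checkpoint block and compare them. The following is only a rough sketch under stated assumptions: the block offsets (512 and 1536) and the position of the 8-byte big-endian LSN within each block (byte 8) are what I believe the older redo log format uses, but they are not confirmed anywhere in this thread and differ between server versions, so they must be checked against the source for the version in use. The function names are hypothetical, and the sketch does not validate block checksums.

```python
import struct

# ASSUMPTIONS, not confirmed in this thread: verify against the server source
# for the exact version before trusting any of these numbers.
ASSUMED_CHECKPOINT_OFFSETS = (512, 1536)  # byte offsets of the two checkpoint blocks
ASSUMED_LSN_FIELD_OFFSET = 8              # offset of the 8-byte LSN inside a block


def read_checkpoint_lsns(path,
                         block_offsets=ASSUMED_CHECKPOINT_OFFSETS,
                         lsn_field_offset=ASSUMED_LSN_FIELD_OFFSET):
    """Return the LSN stored in each checkpoint block, in block order."""
    lsns = []
    with open(path, "rb") as f:
        for block_offset in block_offsets:
            f.seek(block_offset + lsn_field_offset)
            raw = f.read(8)
            if len(raw) != 8:
                raise ValueError("file too short for block at %d" % block_offset)
            # InnoDB stores multi-byte integers big-endian.
            lsns.append(struct.unpack(">Q", raw)[0])
    return lsns


def newest_checkpoint_index(lsns):
    """Index of the block with the highest LSN (first one wins on a tie)."""
    return max(range(len(lsns)), key=lambda i: lsns[i])
```

Run it against a copy of ib_logfile0 taken while the server is shut down; index 0 means the first block is newer, index 1 the second.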
The objective is to gain a bit of performance on the master node where being a few seconds behind after a dirty shutdown is not a palatable option.
But doesn't InnoDB already handle this by itself? That's the whole point of the write-ahead log and checkpointing. Database operations only need to wait for the durable write of redo log records. Durable write of buffer pool pages happens in the background, nothing needs to wait for it (except checkpointing, which shouldn't stall things as long as the redo log is sufficiently large). So where do you gain the performance with this idea?
By removing the flushing overhead from the tablespace path and offloading that overhead to the storage back end. The difference may be relatively small, but it seems there is some improvement that could be had by just doing a heavily write-cached offload to the order-preserving back end.
Don't get me wrong, I think it's cool to push performance to the limits (and beyond). I'm just curious what the mechanism would be that would make this increase performance over what the write-ahead log / checkpointing already achieves. What is the storage you're using, and how does it improve the page flushing internally over what InnoDB itself can achieve?
ZFS - it preserves write ordering (based on the flushing calls it receives). If we can run the datadir with sync=disabled and keep only ib_logfile* and the binlogs on a path with sync=standard, that should provide some improvement: with the tablespace writes performed asynchronously in the background, we could turn up the compression without anything having to wait the extra time for those flushes and potentially stalling. I'll admit this is a somewhat obscure case of relying on side effects: I still need InnoDB to emit all of the fsync() calls, but I am actively proposing to lie to InnoDB about them, in the hope that redo log replay will save me after a dirty shutdown, because data loss on the back end is limited to seconds while the WAL is sized to absorb tens of minutes of writes.
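The "seconds lost versus tens of minutes of WAL" argument can be put into numbers with a back-of-the-envelope check. This is a sketch only: the function name, the safety factor and the example figures are hypothetical, and the redo write rate has to be measured on the real workload. It is also a necessary condition rather than a sufficient one, since recovery still has to start from a checkpoint older than the lost writes, which is the edge case discussed earlier in the thread.

```python
def redo_covers_loss_window(redo_bytes, redo_write_rate, loss_window_s,
                            safety_factor=2.0):
    """Estimate how many seconds of writes the redo log can hold and whether
    that exceeds the back end's worst-case loss window by a safety margin.

    redo_bytes      -- total redo log capacity in bytes
    redo_write_rate -- sustained redo generation in bytes per second (measured)
    loss_window_s   -- worst-case seconds of writes the back end may lose
    safety_factor   -- hypothetical margin; pick one that suits the workload
    """
    if redo_write_rate <= 0:
        raise ValueError("redo_write_rate must be positive")
    coverage_s = redo_bytes / redo_write_rate
    return coverage_s, coverage_s >= loss_window_s * safety_factor


# Made-up example figures: 8 GiB of redo log, 8 MiB/s of redo, 5 s loss window.
coverage_s, ok = redo_covers_loss_window(8 * 1024**3, 8 * 1024**2, 5)
```

With those made-up figures the log holds 1024 seconds of writes, comfortably more than the 10 seconds required after applying the margin.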