On Wed, Apr 10, 2024 at 2:16 PM Marko Mäkelä <marko.makela@mariadb.com> wrote:
Hi Gordan,
On Mon, Apr 8, 2024 at 9:41 AM Gordan Bobic via discuss <discuss@lists.mariadb.org> wrote:
I've mentioned this here a few times before, but it looks like the latest 10.5.x update to 10.5.24 actually made things worse than in 10.5.23.
I have attached two screenshots of transaction log checkpoint age over a 5 minute oltp_write_only sysbench test with 10.5.23 and 10.5.24.
10.5.23 starts off with the all-or-nothing flushing behaviour, but it dampens out after the first minute or so. 10.5.24 exhibits it permanently. I ran multiple separate tests, and the behaviour is very consistent for each version.
10.5.23 almost (not quite, but almost) seems to have restored things to the state they were in with 10.5.8.
Thank you for the report.
A nitpick on the terminology: there is no such thing as "the transaction log" in InnoDB. There is the write-ahead log or redo log (ib_logfile0), which is the primary mechanism for making changes durable. Each write transaction is associated with a number of undo log pages, which, just like any other persistent pages, are covered by the write-ahead log. I suspect that your observation is related to page flushing rather than to writes of the log file.
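If you want to check which of the two dominates, one rough way (a sketch only; adjust the list if a particular counter is not exposed by your build) is to sample the relevant global status counters before and after the run:

  SHOW GLOBAL STATUS
  WHERE Variable_name IN
        ('Innodb_os_log_written',            -- bytes written to ib_logfile0
         'Innodb_buffer_pool_pages_flushed', -- dirty pages written back
         'Innodb_buffer_pool_pages_dirty',   -- dirty pages currently in the buffer pool
         'Innodb_checkpoint_age');           -- the LSN distance your graphs plot

If Innodb_buffer_pool_pages_flushed grows in bursts while Innodb_os_log_written grows smoothly, the sawtooth is coming from page flushing, not from log writes.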
I see that MariaDB Server 10.5.24 includes two commits related to MDEV-26055: https://github.com/MariaDB/server/commit/9a545eb67ca8a666b87fc0aa8d22fe0a01c... and its parent. The reason why these fixes were ported from 10.6 was that a regression test mentioned in https://jira.mariadb.org/browse/MDEV-32681 was failing due to taking an extreme amount of time.
I also tested the latest 10.6.17 and that exhibits the same broken behaviour.
This is interesting. Can you try to collect more metrics, something like the oltp_*.png graphs in https://jira.mariadb.org/browse/MDEV-23855? Based on just one metric (such as transaction throughput or I/O rate), it is difficult to draw any accurate conclusions.
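For example (only a sketch; most of these counters are in information_schema.INNODB_METRICS, and some need to be enabled first), something like the following, sampled once per second during the run, would already say a lot more than the checkpoint age alone:

  SET GLOBAL innodb_monitor_enable = 'all';  -- enable the optional counters

  -- sample once per second while sysbench is running
  SELECT NAME, COUNT
    FROM information_schema.INNODB_METRICS
   WHERE NAME IN ('buffer_flush_adaptive_total_pages',
                  'buffer_flush_sync_total_pages',
                  'buffer_flush_background_total_pages',
                  'log_lsn_checkpoint_age',
                  'log_waits');

Graphing those next to the throughput would show whether the drops coincide with bursts of synchronous flushing.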
Please note that the 10.6.17 release is affected by a performance regression https://jira.mariadb.org/browse/MDEV-33508. It would make sense to apply that fix or to test the latest 10.6 development snapshot.
Is a fix for this behaviour expected any time soon?
As soon as we know what is going on, things should be simple.
As a starting point, can we agree that the sawtooth pattern in the checkpoint age shown in the attached screenshot is not sane behaviour under a steady but not overwhelming stream of writes?
The config I tested with is as follows:
I see that your innodb_log_file_size is only a quarter of the innodb_buffer_pool_size. Increasing it could reduce the need to write out dirty pages.
Can you elaborate on what innodb_log_file_size has to do with innodb_buffer_pool_size? AFAIK:
- Buffer pool is sized to cache the active working data set.
- Redo log is (or was, back when it worked correctly) sized to absorb the peak burst of writes and prevent write stalls caused by it filling up. Any bigger than that, and all it achieves is slowing down startup after a dirty shutdown for no good reason.
These two things are not directly related, are they?
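A rough way to check how much of the log a peak burst actually needs (sketch only; this assumes the Innodb_checkpoint_age status counter is exposed by the build) is to compare the observed checkpoint age against the configured capacity while the burst is running:

  SELECT @@innodb_log_file_size AS log_capacity_bytes,
         (SELECT VARIABLE_VALUE
            FROM information_schema.GLOBAL_STATUS
           WHERE VARIABLE_NAME = 'Innodb_checkpoint_age') AS checkpoint_age_bytes;

If the age stays well below the capacity even at the peak, growing the log further only buys longer crash recovery.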
I don't remember what exactly innodb_log_write_ahead_size was supposed to do. The log writing was simplified and made compatible with O_DIRECT and physical block sizes up to 4096 bytes, in MDEV-14425, which is available in MariaDB Server 10.11.
innodb_log_write_ahead_size is supposed to ensure that redo log writes are padded out to that size. In my case it is set to line up with the InnoDB page size because the log is on ZFS (which doesn't support O_DIRECT yet), and I want to make sure that writes are aligned to the ZFS recordsize to avoid read-modify-write overhead. But I don't think that is particularly relevant to the main point of this thread.
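For illustration only (hypothetical 16 KiB example; the matching recordsize is set on the ZFS dataset with zfs set recordsize=16K <dataset>, outside the server), the pairing amounts to:

  -- check that the redo log padding matches the page size / ZFS recordsize
  -- (the value itself lives in my.cnf as innodb_log_write_ahead_size = 16384)
  SELECT @@innodb_log_write_ahead_size = @@innodb_page_size AS aligned;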