Transaction Log Flushing got worse again in 10.5.23 -> 10.5.24
I've mentioned this here a few times before, but it looks like the latest 10.5.x update to 10.5.24 actually made things worse than 10.5.23. I have attached two screenshots of transaction log checkpoint age over a 5-minute oltp_write_only sysbench test with 10.5.23 and 10.5.24. 10.5.23 starts off with the all-or-nothing flushing behaviour, but it dampens out after the first minute or so. 10.5.24 exhibits it permanently. I ran multiple separate tests and the behaviour is very consistent for each version.

10.5.23 almost (not quite, but almost) seems to have had things fixed back to the state they were in in 10.5.8. It's pretty bad that whatever causes this undesirable flushing behaviour seems to have been merged as of 10.5.9 (10.5.8 seems to be the newest version that doesn't exhibit it). It's much more concerning that nobody seems to care that the last good major version (10.4) is going EOL soon, with apparently no newer long-term tested version that is not broken. I also tested the latest 10.6.17 and it exhibits the same broken behaviour. Is a fix for this behaviour expected any time soon?

On high-throughput production systems that can have prolonged bursts of large writes hitting at any time - if a burst hits at the top of a checkpoint age that the database made no effort to flush, there is no space left in the transaction log to absorb the impact, and that leads to stalls. Combined with disk I/O exhaustion, these stalls can be so bad that for a period of tens of seconds or even minutes even the database telemetry gathering starts to fail.

I also attached a screenshot of a 10-minute period where 10.5.8 and 10.5.9 were sysbenched back-to-back for 5 minutes each, with an update/restart in the middle.

How extensively is the commit that introduced this behaviour built upon in subsequent changes? How much of a rabbit hole am I heading down if I try to cherry-unpick it?
The config I tested with is as follows:

```
[mysqld]
innodb_buffer_pool_dump_at_shutdown = 0
innodb_buffer_pool_load_at_startup  = 0
innodb_buffer_pool_size             = 1G
innodb_fast_shutdown                = 0     # in case an upgrade introduces a redo log format change
innodb_io_capacity                  = 2000
innodb_io_capacity_max              = 10000
innodb_log_file_size                = 256M
innodb_log_write_ahead_size         = 16K
innodb_monitor_enable               = all
innodb_use_native_aio               = 0     # native aio is slow on ZFS
innodb_use_atomic_writes            = 0     # native aio is slow on ZFS
```
Hi Gordan,

On Mon, Apr 8, 2024 at 9:41 AM Gordan Bobic via discuss <discuss@lists.mariadb.org> wrote:
I've mentioned this here a few times before, but it looks like the latest 10.5.x update to 10.5.24 actually made things worse than in 10.5.23.
I have attached two screenshots of transaction log checkpoint age over a 5 minute oltp_write_only sysbench test with 10.5.23 and 10.5.24.
10.5.23 starts off with the flushing all or nothing behaviour but it dampens out after the first minute or so. 10.5.24 exhibits it permanently. I ran multiple separate tests and the behaviour is very consistent for each version.
10.5.23 almost (not quite but almost) seems to have had things fixed to a state where things were in 10.5.8.
Thank you for the report.

A nitpick on the terminology: there is no such thing as "the transaction log" in InnoDB. There is the write-ahead log or redo log (ib_logfile0), which is the primary mechanism for making changes durable. Each write transaction is associated with a number of undo log pages, which, just like any other persistent pages, are covered by the write-ahead log. I suspect that your observation is related to page flushing, and not to writes of the log file.

I see that MariaDB Server 10.5.24 includes two commits related to MDEV-26055: https://github.com/MariaDB/server/commit/9a545eb67ca8a666b87fc0aa8d22fe0a01c... and its parent. The reason why these fixes were ported from 10.6 was that a regression test mentioned in https://jira.mariadb.org/browse/MDEV-32681 was failing due to taking an extreme amount of time.
I also tested the latest 10.6.17 and that exhibits the same broken behaviour.
This is interesting. Can you try to collect more metrics, something like the oltp_*.png graphs in https://jira.mariadb.org/browse/MDEV-23855? Based on just one metric (such as transaction throughput or I/O rate), it is difficult to draw any accurate conclusions.

Please note that the 10.6.17 release is affected by a performance regression: https://jira.mariadb.org/browse/MDEV-33508. It would make sense to apply that fix or to test the latest 10.6 development snapshot.
Is a fix for this behaviour expected any time soon?
As soon as we know what is going on, things should be simple.
The config I tested with is as follows:
I see that your innodb_log_file_size is only a quarter of the innodb_buffer_pool_size. Increasing it could reduce the need to write out dirty pages.

I don't remember what exactly innodb_log_write_ahead_size was supposed to do. The log writing was simplified and made compatible with O_DIRECT and physical block sizes up to 4096 bytes in MDEV-14425, which is available in MariaDB Server 10.11.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
On Wed, Apr 10, 2024 at 2:16 PM Marko Mäkelä <marko.makela@mariadb.com> wrote:
Is a fix for this behaviour expected any time soon?
As soon as we know what is going on, things should be simple.
As a starting point, can we agree that the sawtooth pattern on the checkpoint age as per the attached screenshot is not sane behaviour when exposed to a steady but not overwhelming stream of writes?
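The sawtooth pattern being described can be reproduced in a toy model (all numbers made up for illustration; this is not MariaDB's actual flushing algorithm): a flusher that only activates near a high-water mark oscillates permanently under a steady write load, while proportional background flushing settles at a stable checkpoint age.

```python
# Toy model of checkpoint age under two flushing policies. All numbers are
# made up for illustration; this is NOT MariaDB's actual flushing algorithm.

LOG_CAPACITY = 100_000   # hypothetical redo log capacity (arbitrary units)
WRITE_RATE = 900         # redo generated per tick by a steady workload
MAX_FLUSH = 3000         # maximum checkpoint-age reduction per tick

def simulate(policy, ticks=500):
    """Return the checkpoint age over time under the given flushing policy."""
    age, history = 0, []
    for _ in range(ticks):
        age += WRITE_RATE
        age -= min(age, policy(age))
        history.append(age)
    return history

def all_or_nothing(age):
    # Flush at full speed only above 75% of capacity; otherwise do nothing.
    return MAX_FLUSH if age > 0.75 * LOG_CAPACITY else 0

def proportional(age):
    # Background flushing in proportion to how full the log is.
    return int(MAX_FLUSH * age / LOG_CAPACITY)

def spread(history):
    """Peak-to-trough variation after warm-up (last half of the run)."""
    tail = history[len(history) // 2:]
    return max(tail) - min(tail)

spiky = simulate(all_or_nothing)    # sawtooth near the high-water mark
steady = simulate(proportional)     # settles at a stable equilibrium age
assert spread(spiky) > spread(steady)
```

The all-or-nothing policy rides up to the threshold, flushes hard, falls back, and repeats forever; the proportional policy converges and keeps a large reserve of log space free to absorb a burst.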
The config I tested with is as follows:
I see that your innodb_log_file_size is only a quarter of the innodb_buffer_pool_size. Increasing it could reduce the need to write out dirty pages.
Can you elaborate on what innodb_log_file_size has to do with innodb_buffer_pool_size? AFAIK:

- Buffer pool is sized to cache the active working data set.
- Redo log is (or was, back when it worked correctly) sized to absorb the peak burst of writes and prevent write stalls caused by it filling up. Any bigger than that, and all it achieves is slowing down startup after a dirty shutdown for no good reason.

These two things are not directly related, are they?
I don't remember what exactly innodb_log_write_ahead_size was supposed to do. The log writing was simplified and made compatible with O_DIRECT and physical block sizes up to 4096 bytes, in MDEV-14425, which is available in MariaDB Server 10.11.
innodb_log_write_ahead_size is supposed to ensure that redo log writes are padded out to that size. In my case I set it to line up with the InnoDB page size because the log is on ZFS (which doesn't support O_DIRECT yet), and I want to make sure that writes are aligned to the ZFS recordsize to avoid read-modify-write overhead. But I don't think that is particularly relevant to the main point of this thread.
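The intended effect can be sketched as follows (a hypothetical illustration, not the server's actual implementation): rounding each log write up to an aligned boundary means the filesystem never has to read back a partially written block to complete it.

```python
# Hypothetical illustration of write-ahead padding (not MariaDB's actual
# code): round each log write up to the next multiple of
# innodb_log_write_ahead_size so writes stay aligned to the filesystem
# block/record size and avoid read-modify-write cycles.

WRITE_AHEAD_SIZE = 16 * 1024  # 16K, matching the config in this thread

def padded_write(buf: bytes) -> bytes:
    """Pad the log buffer with zeroes up to the write-ahead boundary."""
    remainder = len(buf) % WRITE_AHEAD_SIZE
    if remainder:
        buf += b"\0" * (WRITE_AHEAD_SIZE - remainder)
    return buf

assert len(padded_write(b"x" * 5000)) == WRITE_AHEAD_SIZE  # 5000 -> one 16K block
assert len(padded_write(b"x" * 16384)) == 16384            # already aligned: unchanged
```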
Hi Gordan,

On Wed, Apr 10, 2024 at 2:37 PM Gordan Bobic <gordan.bobic@gmail.com> wrote:
As a starting point, can we agree that the sawtooth pattern on the checkpoint age as per the attached screenshot is not sane behaviour when exposed to a steady but not overwhelming stream of writes?
Yes and no (see below). I am sorry that I did not realize that the graphs that you attached in your previous message were actually displaying the checkpoint age and not for example the system throughput (transactions per second). The legend was very small and not part of the file name.
I see that your innodb_log_file_size is only a quarter of the innodb_buffer_pool_size. Increasing it could reduce the need to write out dirty pages.
Can you elaborate on what innodb_log_file_size has to do with innodb_buffer_pool_size? AFAIK:

- Buffer pool is sized to cache the active working data set.
- Redo log is (or was, back when it worked correctly) sized to absorb the peak burst of writes and prevent write stalls caused by it filling up. Any bigger than that, and all it achieves is slowing down startup after a dirty shutdown for no good reason.
These two things are not directly related, are they?
The relation that I had in mind is that a larger innodb_log_file_size can help reduce write amplification. If the same pages are modified over and over again and they fit in the buffer pool, then using a small log file size would force rewrites of those pages from the buffer pool to data files; in https://smalldatum.blogspot.com/2022/10/reasons-for-writeback-with-update-in... this is called the "LSN trigger."

I improved the recovery speed in 10.6, and more in 10.11, in https://jira.mariadb.org/browse/MDEV-29911. Before that fix, in my test, crash recovery could take 3 times the time that it took to generate those log records in the first place.
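The "LSN trigger" effect can be made concrete with a toy simulation (assumed numbers, not InnoDB's real checkpointing logic): when a hot set of pages fits in the buffer pool, a larger redo log means fewer forced checkpoints, so repeated modifications to the same page merge into fewer page writebacks.

```python
import random

# Toy model of the "LSN trigger" (assumed numbers, not InnoDB's real logic):
# every update dirties one hot page and appends ~200 bytes of redo. When the
# redo generated since the last checkpoint would exceed the log capacity, a
# checkpoint writes back all dirty pages. A larger log => fewer checkpoints
# => repeated updates to the same page merge into fewer page writes.

def page_writes(log_capacity, updates=100_000, hot_pages=1_000, redo_per_update=200):
    rng = random.Random(42)          # fixed seed for reproducibility
    dirty, redo_since_ckpt, writes = set(), 0, 0
    for _ in range(updates):
        if redo_since_ckpt + redo_per_update > log_capacity:
            writes += len(dirty)     # checkpoint: flush every dirty page
            dirty.clear()
            redo_since_ckpt = 0
        dirty.add(rng.randrange(hot_pages))
        redo_since_ckpt += redo_per_update
    return writes + len(dirty)       # final flush at shutdown

small = page_writes(log_capacity=256 * 1024)      # small log (scaled-down units)
large = page_writes(log_capacity=4 * 256 * 1024)  # 4x larger log
assert small > large   # the larger log needs fewer page writes for the same work
```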
I don't remember what exactly innodb_log_write_ahead_size was supposed to do. The log writing was simplified and made compatible with O_DIRECT and physical block sizes up to 4096 bytes, in MDEV-14425, which is available in MariaDB Server 10.11.
innodb_log_write_ahead_size is supposed to ensure that redo log is padded out to that size. In my case it is done to line it up with InnoDB page size because it is on ZFS (which doesn't support O_DIRECT yet), and I want to make sure that writes are aligned to ZFS recordsize to evade read-modify-write overhead. But I don't think that is particularly relevant to the main point of this thread.
Right: unlike the Subject hints, this thread seems to be about data page flushing and not log flushing.

In any case, the last log block will typically be filled partially and will be overwritten as further log records are appended to the buffer. When implementing MDEV-14425, I experimented with an option to pad the last log block with dummy records, so that the next write would be guaranteed to start at the next block, but I could not get the logic right. Possibly, some additional logic to determine the ZFS recordsize may be necessary in 10.11. If it can be configured per file, I think that it would be best to use the minimum ZFS recordsize and disable compression on ib_logfile0.

What you seem to be looking for is a steady stream of data pages flushing in the background, to keep the checkpoint age within some bounds. The default value of the parameter innodb_max_dirty_pages_pct is 90, which means that up to 90% of the buffer pool is allowed to be filled with dirty pages. When this ratio is exceeded, the buf_flush_page_cleaner thread is supposed to write innodb_io_capacity pages per second.

In addition to this, there are two "adaptive flushing" mechanisms that could help. The "idle flushing" is connected to SET GLOBAL innodb_max_dirty_pages_pct_low, which is disabled (0) by default. If you set innodb_max_dirty_pages_pct_low to a positive value below innodb_max_dirty_pages_pct, then the buf_flush_page_cleaner thread should write innodb_io_capacity pages per second whenever there are no transactions being committed and no pending reads into the buffer pool.

There is also SET GLOBAL innodb_adaptive_flushing (ON by default), which will cause buf_flush_page_cleaner() to write out page_cleaner_flush_pages_recommendation() pages per second. I am guessing that this could be the source of the regression that you observed. That function was changed in the following: https://github.com/MariaDB/server/commit/9a545eb67ca8a666b87fc0aa8d22fe0a01c...
As you see, the calculation is rather convoluted, and something had been off by a factor of 100. I would suggest that you add some logging to page_cleaner_flush_pages_recommendation(), which I believe should be invoked at most once per second, and try to figure it out from there.

Another possibility is that "goto set_idle" in the buf_flush_page_cleaner() thread is being invoked when it should not be, and the thread enters an untimed wait on a condition variable (waiting for a call to buf_pool_t::page_cleaner_wakeup() from another thread).

Best regards,

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
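The general shape of such an adaptive heuristic can be sketched as follows (a simplified illustration with made-up constants - not the actual page_cleaner_flush_pages_recommendation() formula): the recommended flush rate grows with how far the checkpoint age has advanced into the log, so flushing ramps up smoothly instead of all at once.

```python
# Simplified sketch of an adaptive flushing heuristic (made-up constants;
# NOT the actual page_cleaner_flush_pages_recommendation() formula). The
# recommendation scales innodb_io_capacity by how far the checkpoint age
# has advanced toward an assumed flush trigger point in the redo log.

IO_CAPACITY = 2000          # innodb_io_capacity from the config in this thread
IO_CAPACITY_MAX = 10000     # innodb_io_capacity_max
LOG_SIZE = 256 * 1024**2    # innodb_log_file_size = 256M
ASYNC_LIMIT = LOG_SIZE * 7 // 8   # assumed trigger point, not InnoDB's value

def flush_recommendation(checkpoint_age: int) -> int:
    """Pages per second to flush, ramping from 0 up to io_capacity_max."""
    if checkpoint_age <= 0:
        return 0
    fraction = checkpoint_age / ASYNC_LIMIT
    # Quadratic ramp: gentle while the log is mostly empty, aggressive near full.
    return min(IO_CAPACITY_MAX, int(IO_CAPACITY * (fraction + fraction ** 2)))

# The recommendation rises monotonically with checkpoint age ...
ages = [LOG_SIZE // 16, LOG_SIZE // 4, LOG_SIZE // 2, ASYNC_LIMIT]
rates = [flush_recommendation(a) for a in ages]
assert rates == sorted(rates)
# ... and is capped at io_capacity_max once the log is nearly full.
assert flush_recommendation(2 * LOG_SIZE) == IO_CAPACITY_MAX
```

A bug anywhere in such a ramp (for example, a factor-of-100 error in the fraction) would turn the smooth curve back into the observed all-or-nothing behaviour.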
Hello all,

I've opened an issue here: https://jira.mariadb.org/browse/MDEV-33720

I just hope someone in the community can help me figure out what's wrong with this. In a 3-node Galera cluster, one of the nodes is a replication slave of a remote master, while the two others are simply synced with the first. When I restart the mariadb service on those two nodes, it simply doesn't restart at all and there are no logs shown anywhere. It sits there for hours:

/usr/sbin/mariadbd --user=mysql --wsrep_recover --disable-log-error

If I restart the first node (the slave of the master cluster), there is no issue; it restarts as expected. Whether I stop the slave replication doesn't make any difference. It used to work fine, so I'm not sure what changed since updating to 10.6.16, then .17, or whether it's related. We have two clusters, and the other cluster, with the same configuration, is not suffering from this, but it's the master.

Is there any way to investigate this further?
Hi again,

I'm looking forward to some update on the page flushing.

On Wed, Apr 10, 2024 at 2:37 PM Gordan Bobic <gordan.bobic@gmail.com> wrote:
On Wed, Apr 10, 2024 at 2:16 PM Marko Mäkelä <marko.makela@mariadb.com> wrote:
I don't remember what exactly innodb_log_write_ahead_size was supposed to do. The log writing was simplified and made compatible with O_DIRECT and physical block sizes up to 4096 bytes, in MDEV-14425, which is available in MariaDB Server 10.11.
innodb_log_write_ahead_size is supposed to ensure that redo log is padded out to that size.
Based on https://jira.mariadb.org/browse/MDEV-33894 and https://smalldatum.blogspot.com/2024/04/sysbench-on-less-small-server-mariad... this needs to be addressed in MariaDB Server 10.11.

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
participants (3)
- Cédric Counotte
- Gordan Bobic
- Marko Mäkelä