Hi Gordan,

On Wed, Apr 10, 2024 at 2:37 PM Gordan Bobic <gordan.bobic@gmail.com> wrote:
> As a starting point, can we agree that the sawtooth pattern on the checkpoint age as per the attached screenshot is not sane behaviour when exposed to a steady but not overwhelming stream of writes?
Yes and no (see below). I am sorry; I did not realize that the graphs you attached in your previous message were actually displaying the checkpoint age and not, for example, the system throughput (transactions per second). The legend was very small and not part of the file name.
>> I see that your innodb_log_file_size is only a quarter of the innodb_buffer_pool_size. Increasing it could reduce the need to write out dirty pages.
> Can you elaborate on what innodb_log_file_size has to do with the innodb_buffer_pool_size? AFAIK:
> - Buffer pool is sized to cache the active working data set.
> - Redo log is (or was, back when it worked correctly) sized to absorb the peak burst of writes and prevent the write stalls caused by it filling up. Any bigger than that, and all it achieves is slowing down startup after a dirty shutdown, for no good reason.
>
> These two things are not directly related, are they?
The relation that I had in mind is that a larger innodb_log_file_size can help reduce write amplification. If the same pages are modified over and over again and they fit in the buffer pool, then a small log file would force repeated rewrites of those pages from the buffer pool to the data files; in https://smalldatum.blogspot.com/2022/10/reasons-for-writeback-with-update-in... this is called the "LSN trigger."

I improved the recovery speed in 10.6 and further in 10.11, in https://jira.mariadb.org/browse/MDEV-29911. Before that fix, in my test, crash recovery could take 3 times as long as it had taken to generate those log records in the first place.
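To see whether this "LSN trigger" is what is driving your page writes, a rough check from the SQL layer could look something like the following. This is only a sketch; it assumes the Innodb_checkpoint_age status variable is present in your build, and the same figure can be derived from "Log sequence number" minus "Last checkpoint at" in SHOW ENGINE INNODB STATUS.

    -- how close the checkpoint age gets to the redo log capacity
    SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint%';
    -- the configured redo log and buffer pool sizes, for comparison
    SELECT @@innodb_log_file_size, @@innodb_buffer_pool_size;

If the checkpoint age keeps climbing toward the maximum and then dropping, as in your sawtooth graph, then it is the log capacity that is forcing the flushing.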
I don't remember what exactly innodb_log_write_ahead_size was supposed to do. The log writing was simplified and made compatible with O_DIRECT and physical block sizes up to 4096 bytes, in MDEV-14425, which is available in MariaDB Server 10.11.
> innodb_log_write_ahead_size is supposed to ensure that the redo log is padded out to that size. In my case it is set to line writes up with the InnoDB page size, because the log is on ZFS (which doesn't support O_DIRECT yet) and I want writes aligned to the ZFS recordsize to avoid read-modify-write overhead. But I don't think that is particularly relevant to the main point of this thread.
Right; despite what the Subject suggests, this thread seems to be about data page flushing and not log flushing.

In any case, the last log block will typically be partially filled and will be overwritten as further log records are appended to the buffer. When implementing MDEV-14425, I experimented with an option to pad the last log block with dummy records, so that the next write would be guaranteed to start at the next block, but I could not get the logic right. Possibly, some additional logic to determine the ZFS recordsize may be necessary in 10.11. If it can be configured per file, I think that it would be best to use the minimum ZFS recordsize and to disable compression on ib_logfile0.

What you seem to be looking for is a steady stream of data page flushing in the background, to keep the checkpoint age within some bounds. The default value of the parameter innodb_max_dirty_pages_pct is 90, which means that up to 90% of the buffer pool is allowed to be filled with dirty pages. When this ratio is exceeded, the buf_flush_page_cleaner thread is supposed to write innodb_io_capacity pages per second.

In addition to this, there are two "adaptive flushing" mechanisms that could help. The "idle flushing" is connected to SET GLOBAL innodb_max_dirty_pages_pct_low, which is disabled (0) by default. If you set innodb_max_dirty_pages_pct_low to a positive value below innodb_max_dirty_pages_pct, then the buf_flush_page_cleaner thread should write innodb_io_capacity pages per second whenever there are no transactions being committed and no pending reads into the buffer pool. (There is an example at the end of this message.)

There is also SET GLOBAL innodb_adaptive_flushing (ON by default), which will cause buf_flush_page_cleaner() to write out page_cleaner_flush_pages_recommendation() pages per second. I am guessing that this could be the source of the regression that you observed. That function was changed in the following: https://github.com/MariaDB/server/commit/9a545eb67ca8a666b87fc0aa8d22fe0a01c... As you can see, the calculation is rather convoluted, and something had been off by a factor of 100.

I would suggest that you add some logging to page_cleaner_flush_pages_recommendation(), which I believe should be invoked at most once per second, and try to figure it out from there; a less invasive starting point is to sample the flushing counters from SQL, as sketched at the end of this message. Another possibility is that "goto set_idle" in the buf_flush_page_cleaner() thread is being invoked when it should not be, so that the thread enters an untimed wait on a condition variable (waiting for a call to buf_pool_t::page_cleaner_wakeup() from another thread).

Best regards,

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
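For illustration, the idle flushing mentioned above could be enabled with something like the following. The values here are placeholders rather than tuning advice; pick thresholds that fit your workload:

    -- flush at innodb_io_capacity pages/s while the server is idle,
    -- until the dirty page ratio falls below the low-water mark
    SET GLOBAL innodb_max_dirty_pages_pct = 75;
    SET GLOBAL innodb_max_dirty_pages_pct_low = 25;
    SET GLOBAL innodb_io_capacity = 1000;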
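And, short of patching page_cleaner_flush_pages_recommendation(), the page cleaner's actual behaviour can be observed from the SQL layer. A sketch, assuming the usual InnoDB status variables:

    -- dirty page count right now
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
    -- cumulative flush count: sample once per second and take the
    -- difference between consecutive readings to get pages/s flushed
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_flushed';

If the flushing rate collapses to zero while the checkpoint age is still growing, that would point at the untimed wait mentioned above.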