Hi Gordan,

On Wed, Apr 10, 2024 at 2:37 PM Gordan Bobic <gordan.bobic@gmail.com> wrote:
> As a starting point, can we agree that the sawtooth pattern on the checkpoint age as per the attached screenshot is not sane behaviour when exposed to a steady but not overwhelming stream of writes?
Yes and no (see below). I am sorry; I did not realize that the graphs you attached in your previous message were actually displaying the checkpoint age and not, for example, the system throughput (transactions per second). The legend was very small and not part of the file name.
>> I see that your innodb_log_file_size is only a quarter of the innodb_buffer_pool_size. Increasing it could reduce the need to write out dirty pages.
> Can you elaborate on what innodb_log_file_size has to do with the innodb_buffer_pool_size? AFAIK:
> - Buffer pool is sized to cache the active working data set.
> - Redo log is (or was, back when it worked correctly) sized to absorb the peak burst of writes and prevent the write stalls caused by it filling up. Any bigger than that, and all it achieves is slowing down startup after a dirty shutdown, for no good reason.
>
> These two things are not directly related, are they?
The relation that I had in mind is that a larger innodb_log_file_size can help reduce write amplification. If the same pages are modified over and over again and they fit in the buffer pool, then a small log file would force repeated rewrites of those pages from the buffer pool to the data files; in https://smalldatum.blogspot.com/2022/10/reasons-for-writeback-with-update-in... this is called the "LSN trigger."

I improved the recovery speed in 10.6 and further in 10.11, in https://jira.mariadb.org/browse/MDEV-29911. Before that fix, in my test, crash recovery could take 3 times as long as it had taken to generate those log records in the first place.
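To see whether this "LSN trigger" is what is driving your page writes, a rough check from the SQL layer could look something like the following. This is only a sketch; it assumes the Innodb_checkpoint_age status variable is present in your build, and the same figure can be derived from "Log sequence number" minus "Last checkpoint at" in SHOW ENGINE INNODB STATUS.

    -- how close the checkpoint age gets to the redo log capacity
    SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint%';
    -- the configured redo log and buffer pool sizes, for comparison
    SELECT @@innodb_log_file_size, @@innodb_buffer_pool_size;

If the checkpoint age keeps climbing toward the maximum and then dropping, as in your sawtooth graph, then it is the log capacity that is forcing the flushing.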
I don't remember what exactly innodb_log_write_ahead_size was supposed to do. The log writing was simplified and made compatible with O_DIRECT and physical block sizes up to 4096 bytes, in MDEV-14425, which is available in MariaDB Server 10.11.
> innodb_log_write_ahead_size is supposed to ensure that the redo log is padded out to that size. In my case it is set to line writes up with the InnoDB page size, because the log is on ZFS (which doesn't support O_DIRECT yet) and I want writes aligned to the ZFS recordsize to avoid read-modify-write overhead. But I don't think that is particularly relevant to the main point of this thread.
Right; despite what the Subject suggests, this thread seems to be about data page flushing and not log flushing.

In any case, the last log block will typically be partially filled and will be overwritten as further log records are appended to the buffer. When implementing MDEV-14425, I experimented with an option to pad the last log block with dummy records, so that the next write would be guaranteed to start at the next block, but I could not get the logic right. Possibly, some additional logic to determine the ZFS recordsize may be necessary in 10.11. If it can be configured per file, I think that it would be best to use the minimum ZFS recordsize and to disable compression on ib_logfile0.

What you seem to be looking for is a steady stream of data page flushing in the background, to keep the checkpoint age within some bounds. The default value of the parameter innodb_max_dirty_pages_pct is 90, which means that up to 90% of the buffer pool is allowed to be filled with dirty pages. When this ratio is exceeded, the buf_flush_page_cleaner thread is supposed to write innodb_io_capacity pages per second.

In addition to this, there are two "adaptive flushing" mechanisms that could help. The "idle flushing" is connected to SET GLOBAL innodb_max_dirty_pages_pct_low, which is disabled (0) by default. If you set innodb_max_dirty_pages_pct_low to a positive value below innodb_max_dirty_pages_pct, then the buf_flush_page_cleaner thread should write innodb_io_capacity pages per second whenever there are no transactions being committed and no pending reads into the buffer pool. (There is an example at the end of this message.)

There is also SET GLOBAL innodb_adaptive_flushing (ON by default), which will cause buf_flush_page_cleaner() to write out page_cleaner_flush_pages_recommendation() pages per second. I am guessing that this could be the source of the regression that you observed. That function was changed in the following: https://github.com/MariaDB/server/commit/9a545eb67ca8a666b87fc0aa8d22fe0a01c... As you can see, the calculation is rather convoluted, and something had been off by a factor of 100.

I would suggest that you add some logging to page_cleaner_flush_pages_recommendation(), which I believe should be invoked at most once per second, and try to figure it out from there; a less invasive starting point is to sample the flushing counters from SQL, as sketched at the end of this message. Another possibility is that "goto set_idle" in the buf_flush_page_cleaner() thread is being invoked when it should not be, so that the thread enters an untimed wait on a condition variable (waiting for a call to buf_pool_t::page_cleaner_wakeup() from another thread).

Best regards,

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
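For illustration, the idle flushing mentioned above could be enabled with something like the following. The values here are placeholders rather than tuning advice; pick thresholds that fit your workload:

    -- flush at innodb_io_capacity pages/s while the server is idle,
    -- until the dirty page ratio falls below the low-water mark
    SET GLOBAL innodb_max_dirty_pages_pct = 75;
    SET GLOBAL innodb_max_dirty_pages_pct_low = 25;
    SET GLOBAL innodb_io_capacity = 1000;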
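And, short of patching page_cleaner_flush_pages_recommendation(), the page cleaner's actual behaviour can be observed from the SQL layer. A sketch, assuming the usual InnoDB status variables:

    -- dirty page count right now
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
    -- cumulative flush count: sample once per second and take the
    -- difference between consecutive readings to get pages/s flushed
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_flushed';

If the flushing rate collapses to zero while the checkpoint age is still growing, that would point at the untimed wait mentioned above.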