Hi all,

First, sorry for a long quote before my response inline.

On Sat, Sep 7, 2019 at 11:22 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
> Jure Sah <dustwolfy@gmail.com> writes:
>> It would appear that on a typical webserver, the majority of disk i/o
>> belongs to the MariaDB database process. It would appear it's mostly
>> waiting on fsync(), which to my understanding is executed at least
>> once per commit.
> Group commit can amortise this if there are multiple commits in
> parallel, but essentially yes.
>> I've also noticed in the documentation that the options to control
>> fsync usage are even more limited than in the MySQL server. They are
>> also very strongly argued against. Considering the point that InnoDB
>> is considered to be in an inconsistent state in any event, so long as
>> the server is not cleanly stopped, is there really justification for
>> such strong opposition here?
> Usually you can just set --sync-binlog=0
> --innodb-flush-log-at-trx-commit=2 and most fsync should be gone.
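For reference, the equivalent my.cnf fragment would look something like the sketch below (only these two options shown; whether this trade-off is acceptable depends on how much recent data you can afford to lose on a power failure):

```ini
[mysqld]
# 0 = never fsync() the binary log at commit; leave flushing to the OS.
sync_binlog = 0
# 2 = write the redo log at each commit, but fsync() it only about once
# per second. An OS crash or power loss can lose roughly the last
# second of committed transactions; a mere server process crash cannot.
innodb_flush_log_at_trx_commit = 2
```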
>> I understand that this is extensively researched in the documentation
>> and it has to do with the recovery of data in case of an unexpected
>> server reboot.
> InnoDB should recover safely in any case. But if binlog is enabled,
> the binlog is likely to become out of sync with the data inside innodb
> (in case of kernel crash/power outage; a MariaDB process crash by
> itself is not enough to break recovery). So if the server is a
> replication master, slaves might need to be rebuilt.
Exactly. No matter which value of innodb_flush_log_at_trx_commit you
use, InnoDB should recover from a crash. In the event the database
server is killed, you can lose a few of the latest committed
transactions in the worst case. The fsync() operations at transaction
commit make the transaction state change (and any preceding changes)
durable by ensuring that all the operations up to the user transaction
commit will have been written to the write-ahead log (the InnoDB redo
log).

When the binlog is enabled, for recovery to work properly, the binlog's
record of transaction commits must never be behind the storage engines.
The current default setting of sync_binlog=0 is actually unsafe because
of this. There is some ongoing development work in this area. If the
binlog is enabled, it should actually be unnecessary to persist the
storage engine log, because it should be possible to replay any
not-committed-in-engine transactions from the binlog. We merely have to
guarantee that the binlog is never behind the storage engine log, that
is, any write to the InnoDB redo log must be preceded by an fsync() of
the binlog. I think that this practically requires sync_binlog=1. Work
on this is being tracked in https://jira.mariadb.org/browse/MDEV-18959
(Engine transaction recovery through persistent binlog). For
transactions that involve at most one transactional storage engine, we
should even be able to get rid of the multi-phase commit mechanism.

But InnoDB's use of fsync() on data files feels like overkill. I
believe that we only need some 'write barriers', that is, some
interlocking between log writes and page writes to prevent some writes
from being reordered. I think that we need the following:

(1) Ensure strict write-ahead logging: guarantee that writes to the log
are completed before the corresponding data pages are written. fsync()
on the log file does this, but it is overkill for this purpose.
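To make point (1) concrete, here is a toy sketch of the required ordering in plain POSIX-style I/O (the file names and record contents are made up for illustration; real InnoDB obviously does not work at this level):

```python
import os
import tempfile

# Hypothetical stand-ins for the redo log and a tablespace data file.
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "redo_log")
data_path = os.path.join(workdir, "datafile")

log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT)
data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT)

# 1. First, write the log record that describes the page change.
os.write(log_fd, b"redo: update page 42\n")

# 2. Make the log durable before touching the data file. fsync() does
#    the job, but it also waits for a full flush of data and metadata;
#    a cheaper 'write barrier' (ordering only, no wait) would suffice
#    if the kernel offered one.
os.fsync(log_fd)

# 3. Only now may the data page itself be written. Per point (2),
#    background flushing needs no fsync() of the data file here.
os.write(data_fd, b"page 42: new contents\n")

os.close(log_fd)
os.close(data_fd)

print("log durable before page write:", os.path.getsize(log_path) > 0)
```

The point of the sketch is that the only hard requirement is the ordering between steps 1–2 and step 3, not the durability wait that fsync() imposes.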
(2) For most page writes ('background flushing'), we do not care when
exactly they complete, and we do not need or want fsync().

(3) On a significant transaction state change (COMMIT, or a transition
of XA PREPARE to XA ROLLBACK), we must ensure that the log record for
that change (and any preceding log) will be durably written to the log
file(s). This probably calls for fsync().

(4) On a log checkpoint (logically discarding the start of the
write-ahead log), we must ensure that all the pages referred to by the
to-be-discarded section of the log will have been written to disk. This
could be the only case where we might actually need fsync() on data
files. Preferably it should be done asynchronously. A log checkpoint
must also fsync() the log file(s) to make the checkpoint metadata
durable.

Also, the InnoDB doublewrite buffer is a work-around for the operating
system kernel not supporting atomic writes of data pages (even though,
to my understanding, that should be technically doable on any
journal-based or copy-on-write file system).

This week, there was a Linux Plumbers Conference with a Databases
Microconference
https://linuxplumbersconf.org/event/4/page/34-accepted-microconferences#db
where both atomic writes and fsync() were discussed. I hope that Sergei
Golubchik or Daniel Black can report more on that. In any case, it
might take years before these improvements become available in Linux
distributions; with luck, something will happen soon and be backported
to the kernels in stable or long-term-support distributions.

With best regards,

Marko

-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation