[Maria-discuss] fsync alternative

Jure Sah

7 Sep 2019 7 Sep '19

6:27 p.m.

Hi, It would appear that on a typical webserver, the majority of disk i/o belongs to the MariaDB database process. It would appear it's mostly waiting on fsync(), which to my understanding is executed at least once per commit. I understand that this is extensively researched in the documentation and it has to do with the recovery of data in case of an unexpected server reboot. The thing is, this has many performance issues because it caps database performance at whatever the speed of the underlying physical storage is, even if the changed data fits into the available RAM. It also results in the circumstance where if anything is impacting performance of the storage system on the server, this will break MariaDB and cause the service to go offline (time out), even if there is sufficient RAM in the machine to continue operating as normal from page cache. In a typical scenario you have a website which is writing things like session and page cache data which expires within an hour and would be no great loss if missing from a backup. Especially if the volume of the missing data would be 30 seconds (kernel default for committing dirty pages to disk) and mind you this setting is configurable. Temporary tables cannot be used in this case, because they are deleted as soon as the session ends, which is too soon. Is there a different solution that could be used here? I've also noticed in the documentation that the options to control fsync usage are even more limited than in the MySQL server. They are also very strongly argued against. Considering the point that InnoDB is considered to be in an inconsistent state in any event, so long as the server is not cleanly stopped, is there really justification for such strong opposition here? Disabling fsync boosts performance of a typical MySQL server by something like a factor of 3. LP, Jure

Attachments:

signature.asc (application/pgp-signature — 181 bytes)

Show replies by date

Roberto Spadim

7 Sep 7 Sep

6:50 p.m.

Maybe buy battery based raid card Em sáb, 7 de set de 2019 às 15:27, Jure Sah <dustwolfy@gmail.com> escreveu:

...

Hi,

It would appear that on a typical webserver, the majority of disk i/o belongs to the MariaDB database process. It would appear it's mostly waiting on fsync(), which to my understanding is executed at least once per commit.

I understand that this is extensively researched in the documentation and it has to do with the recovery of data in case of an unexpected server reboot.

The thing is, this has many performance issues because it caps database performance at whatever the speed of the underlying physical storage is, even if the changed data fits into the available RAM. It also results in the circumstance where if anything is impacting performance of the storage system on the server, this will break MariaDB and cause the service to go offline (time out), even if there is sufficient RAM in the machine to continue operating as normal from page cache.

In a typical scenario you have a website which is writing things like session and page cache data which expires within an hour and would be no great loss if missing from a backup. Especially if the volume of the missing data would be 30 seconds (kernel default for committing dirty pages to disk) and mind you this setting is configurable.

Temporary tables cannot be used in this case, because they are deleted as soon as the session ends, which is too soon.

Is there a different solution that could be used here?

I've also noticed in the documentation that the options to control fsync usage are even more limited than in the MySQL server. They are also very strongly argued against. Considering the point that InnoDB is considered to be in an inconsistent state in any event, so long as the server is not cleanly stopped, is there really justification for such strong opposition here?

Disabling fsync boosts performance of a typical MySQL server by something like a factor of 3.

LP, Jure

_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp

-- Roberto Spadim SPAEmpresarial - Software ERP/Scada Eng. Automação e Controle, Eng. Financeira

Kristian Nielsen

8:22 p.m.

Jure Sah <dustwolfy@gmail.com> writes:

...

It would appear that on a typical webserver, the majority of disk i/o belongs to the MariaDB database process. It would appear it's mostly waiting on fsync(), which to my understanding is executed at least once per commit.

Group commit can amortise this if there are multiple commits in parallel, but essentially yes.

...

I've also noticed in the documentation that the options to control fsync usage are even more limited than in the MySQL server. They are also very strongly argued against. Considering the point that InnoDB is considered to be in an inconsistent state in any event, so long as the server is not cleanly stopped, is there really justification for such strong opposition here?

Usually you can just set --sync-binlog=0 --innodb-flush-log-at-trx-commit=2 and most fsync should be gone.

...

I understand that this is extensively researched in the documentation and it has to do with the recovery of data in case of an unexpected server reboot.

InnoDB should recover safely in any case. But if binlog is enabled, the binlog is likely to become out of sync with the data inside innodb (in case of kernel crash/power outage. Mariadb process crash by itself is not enough to break recovery). So if the server is a replication master, slaves might need to be rebuilt. Whatever is argued in one place or another, the better approach is to read docs on what each option actually does, and make your own trade-off, in this case between performance and recoverability. Which is exactly what you did, concluding that running without fsync is the right choice in your setup. Hope this helps, - Kristian.

Marko Mäkelä

13 Sep 13 Sep

6:34 a.m.

Hi all, First, sorry for a long quote before my response inline: On Sat, Sep 7, 2019 at 11:22 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:

...

Jure Sah <dustwolfy@gmail.com> writes:

...
It would appear that on a typical webserver, the majority of disk i/o belongs to the MariaDB database process. It would appear it's mostly waiting on fsync(), which to my understanding is executed at least once per commit.

Group commit can amortise this if there are multiple commits in parallel, but essentially yes.

...
I've also noticed in the documentation that the options to control fsync usage are even more limited than in the MySQL server. They are also very strongly argued against. Considering the point that InnoDB is considered to be in an inconsistent state in any event, so long as the server is not cleanly stopped, is there really justification for such strong opposition here?

Usually you can just set --sync-binlog=0 --innodb-flush-log-at-trx-commit=2 and most fsync should be gone.

...
I understand that this is extensively researched in the documentation and it has to do with the recovery of data in case of an unexpected server reboot.

InnoDB should recover safely in any case. But if binlog is enabled, the binlog is likely to become out of sync with the data inside innodb (in case of kernel crash/power outage. Mariadb process crash by itself is not enough to break recovery). So if the server is a replication master, slaves might need to be rebuilt.

Exactly. No matter which value of innodb_flush_log_at_trx_commit you use, InnoDB should recover from a crash. In the event the database server is killed, you can lose a few of the latest committed transactions in the worst case. The fsync() operations at transaction commit are responsible for making durable the transaction state change (and any preceding changes) by ensuring that all the operations up to the user transaction commit will have been written to the write-ahead log (the InnoDB redo log). When the binlog is enabled, for recovery to work properly, the binlog's record of transaction commits should never be behind storage engines. The current default setting of sync_binlog=0 is actually unsafe because of this. There is some ongoing development work around this area. If the binlog is enabled, it should actually be unnecessary to persist the storage engine log, because it should be possible to replay any not-committed-in-engine transactions from the binlog. We must merely guarantee that the binlog is never behind the storage engine log, that is, any write to the InnoDB redo log must be preceded by a fsync() of the binlog. I think that this practically requires sync_binlog=1. Work on this is being tracked in https://jira.mariadb.org/browse/MDEV-18959 (Engine transaction recovery through persistent binlog). For transactions that involve at most one transactional storage engine, we should even be able to get rid of the multi-phase commit mechanism. But, InnoDB’s use of fsync() on data files feels like an overkill. I believe that we only need some 'write barriers', that is, some interlocking between log and page writes to prevent some writes from being reordered. I think that we need the following: (1) Ensure strict write-ahead logging: Guarantee that writes to log are completed before the corresponding data pages are written. fsync() on the log file does this, but it is overkill for this. (2) For most page writes ('background flushing'), we do not care when exactly they completed, and we do not need or want fsync(). (3) On significant transaction state change (COMMIT, or a transition of XA PREPARE to XA ROLLBACK), we must ensure that the log record for that (and any preceding log) will be durably written to the log file(s). This probably calls for fsync(). (4) On log checkpoint (logically discarding the start of the write-ahead log), we must ensure that all the pages that were referred to be to-be-discarded section of log will have been written to disk. This could be the only case where we might actually need fsync() on data files. Preferably it should be done asynchronously. Log checkpoint must also fsync() the log file(s) to make the checkpoint metadata durable. Also, the InnoDB doublewrite buffer is a work-around for the operating system kernel not supporting atomic writes of data pages (even though to my understanding, it should be technically doable on any journal-based or copy-on-write file system). This week, there was a Linux Plumbers Conference, with a Databases Microconference https://linuxplumbersconf.org/event/4/page/34-accepted-microconferences#db where both the atomic writes and fsync() were discussed. I hope that Sergei Golubchik or Daniel Black can report more on that. In any case, it might take years before these improvements become available in Linux distributions; with luck, something will happen soon and be backported to the kernels in stable or long-term-support distributions. With best regards, Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation

Kristian Nielsen

7:08 p.m.

Marko Mäkelä <marko.makela@mariadb.com> writes:

...

There is some ongoing development work around this area. If the binlog is enabled, it should actually be unnecessary to persist the storage engine log, because it should be possible to replay any not-committed-in-engine transactions from the binlog. We must merely

Nice to hear that this is being worked on. There is an old worklog MWL#164 with some analysis of potential issues to be solved. http://worklog.askmonty.org/worklog/Server-RawIdeaBin/?tid=164 It becomes tricky in some corner cases, for example cross-engine transactions where one engine has the changes persisted after a crash and the other does not. But the impact of a robust implementation of this could be huge, double-fsync-per-commit is _really_ expensive. Hopefully the corner cases can be solved or handled with some kind of fall-back.

...

But, InnoDB’s use of fsync() on data files feels like an overkill. I believe that we only need some 'write barriers', that is, some

This is also quite interesting. My (admittedly limited) understanding is that disks in fact have write-barrier functionality, and that journalling file systems in fact use that. The problem seems to be how to expose that to userspace. I wonder if there are any existing or proposed interfaces to allow userspace to specify write barriers between writes. - Kristian.

Daniel Black

16 Sep 16 Sep

2:25 a.m.

On Fri, 13 Sep 2019 09:34:48 +0300 Marko Mäkelä <marko.makela@mariadb.com> wrote:

...

But, InnoDB’s use of fsync() on data files feels like an overkill. I believe that we only need some 'write barriers', that is, some interlocking between log and page writes to prevent some writes from being reordered. I think that we need the following: (1) Ensure strict write-ahead logging: Guarantee that writes to log are completed before the corresponding data pages are written. fsync() on the log file does this, but it is overkill for this. (2) For most page writes ('background flushing'), we do not care when exactly they completed, and we do not need or want fsync(). (3) On significant transaction state change (COMMIT, or a transition of XA PREPARE to XA ROLLBACK), we must ensure that the log record for that (and any preceding log) will be durably written to the log file(s). This probably calls for fsync(). (4) On log checkpoint (logically discarding the start of the write-ahead log), we must ensure that all the pages that were referred to be to-be-discarded section of log will have been written to disk. This could be the only case where we might actually need fsync() on data files. Preferably it should be done asynchronously. Log checkpoint must also fsync() the log file(s) to make the checkpoint metadata durable.

Also, the InnoDB doublewrite buffer is a work-around for the operating system kernel not supporting atomic writes of data pages (even though to my understanding, it should be technically doable on any journal-based or copy-on-write file system).

This week, there was a Linux Plumbers Conference, with a Databases Microconference https://linuxplumbersconf.org/event/4/page/34-accepted-microconferences#db where both the atomic writes and fsync() were discussed. I hope that Sergei Golubchik or Daniel Black can report more on that. In any case, it might take years before these improvements become available in Linux distributions; with luck, something will happen soon and be backported to the kernels in stable or long-term-support distributions.

Atomic writes, the direction we where pointed to is the XFS CoW (reflink=1) implementation. I haven't found the kernel interfaces required to use this. SQLite uses a f2fs ioctl extension. An alternative talked about using NVDIMM space attached as a helper space for filesystems to use however all filesystems would be required to implement support. Of course users have to have NVDIMM first. At one stage Seagate had hard disks with NVDIMM on them, unsure if this continues to be the case. If devices ever supported atomic writes this would require implementing these changes all from filesystem the way down (though pseudo block devies like device mapper (aka LVM) and crypt layers) to block layers would be required otherwise. On write barriers this seemed like an enhancement to io_uring as I followed the discussion but I'm waiting for the video to relook to see if I missed anything. I can't see much of a chance of a backport (io_uring was first added in 5.1 kernel) here however Ubuntu 20.04 LTS might happen if it gets defined/implemented/tested quickly enough. On Fri, 13 Sep 2019 21:08:13 +0200 Kristian Nielsen <knielsen@knielsen-hq.org> wrote:

...

...
But, InnoDB’s use of fsync() on data files feels like an overkill. I believe that we only need some 'write barriers', that is, some

This is also quite interesting. My (admittedly limited) understanding is that disks in fact have write-barrier functionality,

From a major disk vendor in the LPC Database Microconference session, SCSI had ordering as an option, however it was never implemented by any vendor. Without this existing in hardware I think the discussion went along the lines that it needs to wait until the hardware queue is fully flushed. (lots of hardware specification acronyms where mentioned in quick succession)

Jure Sah

27 Sep 27 Sep

9:22 a.m.

Apologies for late response. Thanks to everybody for your responses. On 16. 09. 19 04:25, Daniel Black wrote:

...

If devices ever supported atomic writes this would require implementing these changes all from filesystem the way down (though pseudo block devies like device mapper (aka LVM) and crypt layers) to block layers would be required otherwise.

So it is my understanding that --innodb-flush-log-at-trx-commit=2 is an acceptable workaround for reducing the amount of fsyncs to one per second, on suitably crash resistant hardware (and assuming for the moment, slave databases are not used). That is until there is a proper solution for atomic writes on the filesystem level. Realistically, even if this were only solved on ext4, xfs, or a filesystem advertised as recommended for high-performance MariaDB deployments, the issue like mine would be rendered nonexistent. And it's just down to carrying this idea over to the kernel people.

...

From a major disk vendor in the LPC Database Microconference session, SCSI had ordering as an option, however it was never implemented by any vendor.

Without this existing in hardware I think the discussion went along the lines that it needs to wait until the hardware queue is fully flushed. (lots of hardware specification acronyms where mentioned in quick succession)

My understanding of the correspondence on this subject on the kernel mailing list to date is that for a long time, this was intentionally not implemented because of the performance penalties inherent with this being done right (as fsync effectively negates all cache and storage vendors like their cache). This basically makes it all the more obvious that what is actually needed is a high-performance atomic write mechanism separate from fsync and then fsync can be properly implemented down to the hardware level. LP, Jure

Jure Sah

11:14 a.m.

On 27. 09. 19 11:22, Jure Sah wrote:

...

That is until there is a proper solution for atomic writes on the filesystem level.

So apparently O_ATOMIC has been a thing since 2017 or such, in XFS. Any plans to support it in MariaDB as an alternative to fsync() and double buffering? I realize the feature is probably experimental, however it can be programatically detected and I think something like MariaDB supporting it would encourage further development. LP, Jure

2115

Age (days ago)

2135

Last active (days ago)

List overview

7 comments

6 participants

participants (6)

Daniel Black
Jure Sah
Jure Sah
Kristian Nielsen
Marko Mäkelä
Roberto Spadim