[Maria-discuss] Is disabling doublewrite safe on ZFS?
Hi all, as per the subject: is disabling doublewrite safe on ZFS (and/or other CoW filesystems such as BTRFS)?

Background information: ZFS is a CoW/transactional filesystem, meaning that writes are atomic: they either fully commit or are rolled back to the latest "stable" version. This leads many people to claim not only that disabling doublewrite is safe when InnoDB runs on top of ZFS storage, but even that it is the *right* thing to do to increase InnoDB write performance. The reason is that when the ZFS recordsize is set to the same value as the InnoDB page size, no partial page write can happen. Some evidence: http://assets.en.oreilly.com/1/event/21/Optimizing%20MySQL%20Performance%20w...

However, I am not fully committed (pun intended!) to this idea. While I surely appreciate ZFS write atomicity, and how it *does* protect from a system-wide crash (i.e. power loss), I fear that an InnoDB/MariaDB crash *can* lead to partial page writes. If, for example, the mysqld process crashes (or is killed) while copying an internal buffer during a write() call, I imagine the filesystem will receive wrong/partial data, which it will happily write to the main storage pool (as it knows nothing of internal data consistency from InnoDB's point of view).

I understand that this failure scenario should be *really* rare, as the critical operation (the buffer copy from mysqld to the system pagecache/ARC via write()) is extremely fast compared to the real data flush to stable storage (meaning that the "vulnerable time window" is very small). However, that remains different from 100% safety. Moreover, it really backfired in the past: https://www.percona.com/blog/2015/06/17/update-on-the-innodb-double-write-bu...

From my understanding, disabling doublewrite is really 100% safe only when enabling atomic writes on *supported hardware* (https://mariadb.com/kb/en/library/atomic-write-support/).

Am I missing something? Am I over-thinking it, maybe? Thanks.

-- Danti Gionatan Supporto Tecnico Assyoma S.r.l. - www.assyoma.it email: g.danti@assyoma.it - info@assyoma.it GPG public key ID: FF5F32A8
There is at least one case I know of where you do not need the doublewrite buffer, and you do not even need a CoW filesystem: a combination of an OS guarantee that sector-sized writes are atomic and a matching InnoDB page size. If you have disks with 4K sectors (quite common), you choose innodb-page-size=4K, use innodb-flush-neighbors=0, and use Windows as your OS (because it guarantees that single-sector sized/aligned writes are atomic, as per https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writ...), then you can safely disable innodb-doublewrite. You do not need "supported hardware" for that.

As for Linux, I think Marko tested what happens when the process is killed, and sure enough, it can be killed in the middle of a larger write and leave partially written data. I suspect that O_DIRECT and sector-sized writes might be atomic (as in the Windows example), but I did not find any written confirmation of that. Someone with a better understanding of the kernel and filesystems could prove or disprove this suspicion.
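For illustration, a minimal sketch (untested; the path and the 4K sector size are assumptions) of the kind of single-sector, unbuffered, write-through write described above might look like this in C. FILE_FLAG_NO_BUFFERING requires the buffer, byte count and file offset to be sector-aligned; the atomicity itself comes from the documented Windows guarantee, not from anything in the snippet.

#include <windows.h>
#include <malloc.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const DWORD sector = 4096;   /* assumed 4Kn disk; a real deployment would query the actual sector size */
    HANDLE h = CreateFileA("D:\\data\\page.bin",   /* hypothetical path */
                           GENERIC_WRITE, 0, NULL, OPEN_ALWAYS,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                           NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* FILE_FLAG_NO_BUFFERING requires a sector-aligned buffer and byte count. */
    void *buf = _aligned_malloc(sector, sector);
    memset(buf, 0, sector);

    DWORD written = 0;
    BOOL ok = WriteFile(h, buf, sector, &written, NULL);   /* one sector, at offset 0 */
    printf("ok=%d written=%lu\n", ok, (unsigned long) written);

    _aligned_free(buf);
    CloseHandle(h);
    return ok ? 0 : 1;
}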
On 14-08-2018 19:58 Vladislav Vaintroub wrote:
There is at least one case I know of where you do not need the doublewrite buffer, and you do not even need a CoW filesystem.
A combination of an OS guarantee that sector-sized writes are atomic and a matching InnoDB page size. If you have disks with 4K sectors (quite common), you choose innodb-page-size=4K, use innodb-flush-neighbors=0, and use Windows as your OS (because it guarantees that single-sector sized/aligned writes are atomic, as per https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writ... [1]), then you can safely disable innodb-doublewrite. You do not need "supported hardware" for that.
Hi, let's suppose mysqld crashes during the copy from its internal buffer to the OS write cache, ending with only part of the data being transferred (i.e. 2K of data on a 4Kn disk). When using direct writes (or FILE_FLAG_WRITE_THROUGH) the partial data will be rejected by the underlying disk, throwing an I/O error. But what about non-O_DIRECT/FILE_FLAG_WRITE_THROUGH writes?
As for Linux, I think Marko tested what happens when the process is killed, and sure enough, it can be killed in the middle of a larger write and leave partially written data. I suspect that O_DIRECT and sector-sized writes might be atomic (as in the Windows example), but I did not find any written confirmation of that. Someone with a better understanding of the kernel and filesystems could prove or disprove this suspicion.
Yes, O_DIRECT + a single sector-aligned write *should* be atomic, assuming the disk rejects the partial write. However, this really is a hardware-specific condition.

Back to ZFS: the entire record *will* be written atomically. As a first approximation, when recordsize == InnoDB page size, doublewrite should not be needed. However, as stated above, what will happen if the mysqld process is killed at the wrong moment? I fear something like this:

- InnoDB page size and ZFS recordsize are both 16K;
- InnoDB calls write() to copy 16K of internal data to the OS pagecache (ZFS does not support O_DIRECT, by the way);
- mysqld crashes at the worst possible moment, so only half of the InnoDB data (8K) was transferred by write();
- ZFS receives the partial 8K of data, but it does *not* know it is partial (i.e. it "sees" a normal 8K write);
- some seconds later, the partial data is committed to stable storage;
- when mysqld restarts, InnoDB complains about a partial page write.

This brings another question: how will InnoDB behave after detecting a partial page write? Will it shut itself down? Thanks.

-- Danti Gionatan Supporto Tecnico Assyoma S.r.l. - www.assyoma.it email: g.danti@assyoma.it - info@assyoma.it GPG public key ID: FF5F32A8
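On the last question: InnoDB pages carry checksums, so a torn page would normally be detected when the page is read back after a restart. As a rough illustration only (this is not InnoDB's real page format or checksum algorithm), a page-level check of this shape is what makes a half-written 16K page recognisable:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 16384

/* Toy checksum, not CRC32 and not InnoDB's algorithm. */
static uint32_t page_sum(const unsigned char *p, size_t len)
{
    uint32_t s = 0;
    for (size_t i = 0; i < len; i++)
        s = s * 31 + p[i];
    return s;
}

/* Assumed toy layout: payload followed by a 4-byte checksum at the end of the page. */
static int page_is_torn(const unsigned char *page)
{
    uint32_t stored;
    memcpy(&stored, page + PAGE_SIZE - 4, 4);
    return stored != page_sum(page, PAGE_SIZE - 4);
}

int main(void)
{
    static unsigned char page[PAGE_SIZE];

    /* Build a consistent "old" page with a valid checksum. */
    memset(page, 0xAB, PAGE_SIZE - 4);
    uint32_t c = page_sum(page, PAGE_SIZE - 4);
    memcpy(page + PAGE_SIZE - 4, &c, 4);

    /* Simulate a torn write: only the first 8K of the new page image arrived. */
    memset(page, 0x00, PAGE_SIZE / 2);

    printf("torn=%d\n", page_is_torn(page));   /* prints torn=1 */
    return 0;
}

In the simulated torn write, only the first half of the new page image replaces the old data, so the stored checksum no longer matches and the page is reported as torn.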
Hi, just a small correction: ZoL does not support O_DIRECT, but FreeBSD ZFS does. Probably other distributions also do. Regards, Federico Razzoli
On 14/08/2018 22:17, Federico Razzoli wrote:
Hi,
Just a small correction: ZoL does not support O_DIRECT, but FreeBSD ZFS does. Probably other distributions also do.
Regards, Federico Razzoli
Hi, I just tried on FreeBSD 11.x a small C program with O_DIRECT support [1] and it really seems O_DIRECT is ignored: writes go into the ARC and are served from it when the data is read back. ZFS compression for the dataset is off.

This does not surprise me: O_DIRECT implies zero memory copies and/or DMA from main memory to the disks themselves. While with a standard filesystem this should be possible, with CoW+checksum (and anything else that transforms data in flight, e.g. compression) it becomes very difficult.

Back to the main point... anyone with some insight on doublewrite and ZFS?

# Before running the test program:
ARC Size:                       0.09%    1.14 MiB
        Target Size: (Adaptive) 100.00%  1.20 GiB
        Min Size (Hard Limit):  12.50%   153.30 MiB
        Max Size (High Water):  8:1      1.20 GiB

# After running it:
ARC Size:                       48.65%   596.61 MiB
        Target Size: (Adaptive) 100.00%  1.20 GiB
        Min Size (Hard Limit):  12.50%   153.30 MiB
        Max Size (High Water):  8:1      1.20 GiB

# Reading the just-written file shows the data is served by the ARC (i.e. too fast to be coming from the disk):
root@freebsd:~ # dd if=/tank/test.img of=/dev/null bs=1M
512+0 records in
512+0 records out
536870912 bytes transferred in 0.188852 secs (2842809718 bytes/sec)

[1] Test program:
root@freebsd:~ # cat test.c

#define _GNU_SOURCE
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCKSIZE 128*1024

int main()
{
    void *buffer;
    int i = 0;
    int w = 0;

    buffer = malloc(BLOCKSIZE);
    buffer = memset(buffer, 48, BLOCKSIZE);

    /* O_DIRECT requested; mode added so O_CREAT is well defined */
    int f = open("/tank/test.img", O_CREAT|O_TRUNC|O_WRONLY|O_DIRECT, 0644);

    /* 512 MiB in total: 4096 writes of 128 KiB each */
    for (i = 0; i < 512*8; i++) {
        w = write(f, buffer, BLOCKSIZE);
    }

    close(f);
    free(buffer);
    return 0;
}

-- Danti Gionatan Supporto Tecnico Assyoma S.r.l. - www.assyoma.it email: g.danti@assyoma.it - info@assyoma.it GPG public key ID: FF5F32A8
Actually, I don't remember why I was convinced about that, and I couldn't find a resource supporting the idea. One exception is if you use block devices, in which case it seems that writes are not cached on FreeBSD: https://lists.freebsd.org/pipermail/freebsd-fs/2013-July/017602.html But sorry for the digression - this is a very special case; I too would like to understand more about the safety of disabling doublewrite in other cases. Federico
I think it is up to the OS kernel how to handle an interrupt request while a system call is in progress. If the kernel reacts to signals/exceptions by interrupting the write() call in the middle of copying data from your buffer to the page cache, nothing would help. And what "in the middle" means is also unclear; there would be some kind of granularity (the page size in the pagecache, maybe). I do not know what different kernels do in such cases, but this is the level where ZFS is not involved at all.
On 16/08/2018 23:41, Vladislav Vaintroub wrote:
I think it is up to the OS kernel how to handle an interrupt request while a system call is in progress. If the kernel reacts to signals/exceptions by interrupting the write() call in the middle of copying data from your buffer to the page cache, nothing would help. And what "in the middle" means is also unclear; there would be some kind of granularity (the page size in the pagecache, maybe). I do not know what different kernels do in such cases, but this is the level where ZFS is not involved at all.
Well, the manual [1] seems quite explicit about the possibility of write() being interrupted by a signal (and, by extension, by a process crash): "If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has been written, the call succeeds, and returns the number of bytes written." With such a premise, do you think disabling doublewrite would still be safe? Thanks. [1] https://linux.die.net/man/2/write -- Danti Gionatan Supporto Tecnico Assyoma S.r.l. - www.assyoma.it email: g.danti@assyoma.it - info@assyoma.it GPG public key ID: FF5F32A8
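To make the quoted semantics concrete, the canonical userspace answer is a retry loop around write(); the following is a generic sketch, not MariaDB code. Note that even such a loop cannot help when the process is killed between the short write and the retry, which is exactly the window discussed here.

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on EINTR and on short writes. */
ssize_t write_full(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before any byte was written: retry */
            return -1;             /* real I/O error */
        }
        done += (size_t) n;        /* short write: continue from where it stopped */
    }
    return (ssize_t) done;
}

int main(void)
{
    static const char msg[] = "partial writes are real\n";
    return write_full(STDOUT_FILENO, msg, sizeof msg - 1) < 0;
}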
Hi all, anyone with some suggestion/insight on the matter? Thanks. -- Danti Gionatan Supporto Tecnico Assyoma S.r.l. - www.assyoma.it email: g.danti@assyoma.it - info@assyoma.it GPG public key ID: FF5F32A8
Hi all, anyone with some suggestion/insight on the matter?
While I can't comment on the intricacies or internals of MySQL being (un)able to recover after a crash without the doublewrite buffer, if you skim through the changelogs between versions (be that upstream Oracle or downstream in Maria/Percona), nearly every second (even minor) version has some sort of data-loss/corruption/segfault type of bug. Just as an example, a recent one that comes to mind: https://jira.mariadb.org/browse/MDEV-15764
From my experience: I've been switching off doublewrite on MySQL (even on XFS, and now on ZFS because of compression) for years, and even in the few accidental powerloss/total-crash cases I haven't seen corruption caused by an unexpected reboot (a possible write lost mid-flight). The times MySQL has been unable to start, it was just because of internal issues (which you solve by having slaves and backups).
My point being - ZFS in principle is the same as the "atomic write hardware" (i.e. either the block write succeeds fully or not at all), so if you can turn off doublewrite on those fancy FusionIO cards, I don't see a reason why you can't do the same on ZFS. Even if there are some edge cases where it could become "unsafe", most of the time you still run with better performance, and considering SSD wear levelling, with doublewrite the hardware could fail (reach end of life) ~two times sooner ;)

p.s. sorry for the mail being about general thoughts rather than the particular technical aspects

rr
On 20/08/2018 15:10, Reinis Rozitis wrote:
Hi all, anyone with some suggestion/insight on the matter?
While I can't comment on the intricacies or internals of MySQL being (un)able to recover after a crash without the doublewrite buffer, if you skim through the changelogs between versions (be that upstream Oracle or downstream in Maria/Percona), nearly every second (even minor) version has some sort of data-loss/corruption/segfault type of bug. Just as an example, a recent one that comes to mind: https://jira.mariadb.org/browse/MDEV-15764
D'oh! [1]
From my experience: I've been switching off doublewrite on MySQL (even on XFS, and now on ZFS because of compression) for years, and even in the few accidental powerloss/total-crash cases I haven't seen corruption caused by an unexpected reboot (a possible write lost mid-flight). The times MySQL has been unable to start, it was just because of internal issues (which you solve by having slaves and backups).
Uhm, powerloss and segfault/segkill (ie: process crash) are quite different. The first means *any* activity is stopped (ie: filesystem has no means to write anything), while the latter means *mysqld* stops writing but the filesystem can write the partial data received.
My point being - ZFS in principle is the same as the "atomic write hardware" (i.e. either the block write succeeds fully or not at all), so if you can turn off doublewrite on those fancy FusionIO cards, I don't see a reason why you can't do the same on ZFS.
What I mean is that while a ZFS write is an all-or-nothing affair, it can still write partial data from the application (mysqld) point of view. All it needs is partial data from the application itself (i.e. a crashing mysqld) - garbage in, garbage out. Doublewrite *should* catch that (i.e. on restart, mysqld would read the doublewrite buffer, detect it as corrupt and discard it, without touching any previous data in the main database files).

My understanding (which *can* be wrong) is that MariaDB "atomic write support" is a means to inform the underlying device of the entire write process and to keep the new data in a "spare" location until the application itself (mysqld) commits the *entire*, verified write, enabling the hardware device to atomically swap/remap the affected data locations. In this case, a failed mysqld process will never reach the "commit phase", leaving the old data untouched.
Even if there are some edge cases where it could become "unsafe", most of the time you still run with better performance, and considering SSD wear levelling, with doublewrite the hardware could fail (reach end of life) ~two times sooner ;)
Good point, this surely is a factor to evaluate.
p.s. sorry for the mail being about general thoughts rather than the particular technical aspects
They are greatly appreciated! Thanks. [1] https://en.wikipedia.org/wiki/D%27oh! -- Danti Gionatan Supporto Tecnico Assyoma S.r.l. - www.assyoma.it email: g.danti@assyoma.it - info@assyoma.it GPG public key ID: FF5F32A8
What I mean is that while a ZFS write is an all-or-nothing affair, it can still write partial data from the application (mysqld) point of view. All it needs is partial data from the application itself (i.e. a crashing mysqld) - garbage in, garbage out. Doublewrite *should* catch that (i.e. on restart, mysqld would read the doublewrite buffer, detect it as corrupt and discard it, without touching any previous data in the main database files).
My understanding (which *can* be wrong) is that MariaDB "atomic write support" is a means to inform the underlying device of the entire write process and to keep the new data in a "spare" location until the application itself (mysqld) commits the *entire*, verified write, enabling the hardware device to atomically swap/remap the affected data locations. In this case, a failed mysqld process will never reach the "commit phase", leaving the old data untouched.
I reread my previous mail and I have been a bit unclear - what I wanted to say is that turning off doublewrite feels (at least to me) safe enough (compared to other disasters you could get into), but it's not 100% safe (commenting on the initial "From my understanding, disabling doublewrite is really 100% safe only when enabling atomic writes").

For the application (mysql) itself to perform any sort of checksum (after a crash and reboot) and have a possibility to recover, you have to write the data twice anyway (and keep the first copy until the second write is acknowledged) - hence the doublewrite buffer.

The atomic-write hardware bits, the same as ZFS [1], just ensure that any write either lands on the metal (which also tends to lie - write caches on drives/controllers, silent errors, bit rot etc.) or not at all; they do not help if the application itself issues a 4KB write when it actually needed to write 8KB. It has been covered in some recent Percona blog comments [2] (by some authorities in the mysql world).

So to answer your initial mail - it's not 100% safe ;)

[1] https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsAndZILs
[2] https://www.percona.com/blog/2017/12/07/hands-look-zfs-with-mysql/#comment-1...

rr
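For completeness, a minimal sketch of the "write it twice" idea follows. It is not InnoDB's implementation (InnoDB batches pages into a shared doublewrite area inside the system tablespace) and the demo file paths are made up; it only shows why a torn page in the data file remains recoverable as long as an intact copy was made durable somewhere else first.

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 16384

/* Write one page durably to a doublewrite area first, then in place. */
int write_page_with_doublewrite(int dblwr_fd, int data_fd,
                                off_t page_offset, const char *page)
{
    /* 1. Put an intact copy of the page in the doublewrite area and flush it. */
    if (pwrite(dblwr_fd, page, PAGE_SIZE, 0) != PAGE_SIZE) return -1;
    if (fsync(dblwr_fd) != 0) return -1;

    /* 2. Only now overwrite the page in place in the data file. If this write
     *    is torn by a crash, recovery can restore the page from step 1. */
    if (pwrite(data_fd, page, PAGE_SIZE, page_offset) != PAGE_SIZE) return -1;
    return fsync(data_fd) == 0 ? 0 : -1;
}

int main(void)
{
    static char page[PAGE_SIZE];
    memset(page, 'x', PAGE_SIZE);

    /* Hypothetical demo files; a real engine uses its own tablespace layout. */
    int dblwr = open("/tmp/dblwr.demo", O_CREAT | O_RDWR, 0644);
    int data  = open("/tmp/data.demo",  O_CREAT | O_RDWR, 0644);
    if (dblwr < 0 || data < 0) return 1;

    int rc = write_page_with_doublewrite(dblwr, data, 0, page);
    close(dblwr);
    close(data);
    return rc != 0;
}

On recovery, a page found torn in the data file (detected via its checksum) can then be restored from the doublewrite copy before the log is applied.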
On 20/08/2018 21:40, Reinis Rozitis wrote:
I reread my previous mail and I have been a bit unclear - what I wanted to say is that turning off doublewrite feels (at least to me) safe enough (compared to other disasters you could get into), but it's not 100% safe (commenting on the initial "From my understanding, disabling doublewrite is really 100% safe only when enabling atomic writes").
Ok, understood.
For the application (mysql) itself to perform any sort of checksum (after a crash and reboot) and have a possibility to recover, you have to write the data twice anyway (and keep the first copy until the second write is acknowledged) - hence the doublewrite buffer.
Exactly.
The atomic-write hardware bits, the same as ZFS [1], just ensure that any write either lands on the metal (which also tends to lie - write caches on drives/controllers, silent errors, bit rot etc.) or not at all; they do not help if the application itself issues a 4KB write when it actually needed to write 8KB.
What puzzles me is that ext4 with data=journal *should* also guarantee atomic writes, yet disabling doublewrite led to InnoDB corruption in case of a mysqld crash/kill.
It has been covered in some recent Percona blog comments [2] (by some authorities in mysql world)
[2] https://www.percona.com/blog/2017/12/07/hands-look-zfs-with-mysql/#comment-1...
Thanks for the link. The problem I have with that blog is that, in the past, it gave incorrect information on doublewrite safety guarantees [1]. Even in the comments section of the link you posted there are users warning against disabling checksums (which is another can of worms, for sure). Well, it seems I just need to do some in-house testing... :p I'll report back any interesting findings. Thanks. [1] https://www.percona.com/blog/2015/06/17/update-on-the-innodb-double-write-bu... -- Danti Gionatan Supporto Tecnico Assyoma S.r.l. - www.assyoma.it email: g.danti@assyoma.it - info@assyoma.it GPG public key ID: FF5F32A8
2018-08-21 11:16 GMT+03:00 Gionatan Danti <g.danti@assyoma.it>:
On 20/08/2018 21:40, Reinis Rozitis wrote:
The atomic-write hardware bits, the same as ZFS [1], just ensure that any write either lands on the metal (which also tends to lie - write caches on drives/controllers, silent errors, bit rot etc.) or not at all; they do not help if the application itself issues a 4KB write when it actually needed to write 8KB.
What puzzles me is that ext4 with data=journal *should* also guarantee atomic writes, yet disabling doublewrite led to InnoDB corruption in case of a mysqld crash/kill.
I believe that the Linux kernel can interrupt any write at 4096-byte boundaries when a signal is delivered to the process. I am curious: Where was it claimed that data=journal guarantees atomic writes (other than [1])? I would expect it to only guarantee that anything that was written to the journal will be durable. Whether the actual write request was honored in full is a separate matter.
It has been covered in some recent Percona blog comments [2] (by some authorities in mysql world)
[2] https://www.percona.com/blog/2017/12/07/hands-look-zfs-with-mysql/#comment-1...
Thanks for the link. The problem I have with that blog is that, in the past, it gave incorrect information on doublewrite safety guarantees [1]. Even in the comments section of the link you posted there are users warning against disabling checksums (which is another can of worms, for sure).
Well, it seems I just need to do some in-house testing... :p I'll report back any interesting findings.
[1] https://www.percona.com/blog/2015/06/17/update-on-the-innodb-double-write-bu...
Please report back any findings, whether or not you consider them to be interesting. I believe that it is technically possible for a copy-on-write filesystem like ZFS to support atomic writes, but for that to be possible in practice, the interfaces inside the kernel must be implemented in an appropriate way. Disclaimer: I have no knowledge of the implementation details of any kernel. -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
On 21/08/2018 11:52, Marko Mäkelä wrote:
I believe that the Linux kernel can interrupt any write at 4096-byte boundaries when a signal is delivered to the process. I am curious: Where was it claimed that data=journal guarantees atomic writes (other than [1])? I would expect it to only guarantee that anything that was written to the journal will be durable. Whether the actual write request was honored in full is a separate matter.
Sure, ext4 + data=journal only has "atomic writes" in the sense that what was written in a journal transaction/commit will be completely committed into the main filesystem. But from the application point of view, this could very well be a partial write. This is exactly the point I am stressing: durable writes do *not* mean atomicity in the true sense (i.e. from the application standpoint). In this regard, I would imagine ZFS behaves similarly: at TXG commit, anything buffered in RAM (and replicated by the ZIL) will be committed to the main filesystem, but if the application write itself was incomplete (due to an application crash) *and* the application-side doublewrite was disabled, bad things could happen...
Please report back any findings, whether or not you consider them to be interesting.
I believe that it is technically possible for a copy-on-write filesystem like ZFS to support atomic writes, but for that to be possible in practice, the interfaces inside the kernel must be implemented in an appropriate way. Disclaimer: I have no knowledge of the implementation details of any kernel.
I would expect (and I can be wrong!) that "atomic writes" in the MySQL/MariaDB context mean more than durable writes; rather, I expect them to be a means to communicate the application's consistency model to the lower layer (i.e. the storage device). Something like "buffer all writes and atomically write them into the main filesystem only when I (MariaDB) *explicitly* tell you to do so". In that case, a crashed MariaDB will *never* commit partial data to the main database files.

I wrote a test program [1] which spawns a child that repeatedly writes a buffer to a backing file, and kills it (-9) from the parent process after a random delay. It seems *very* difficult to cause any sort of partial write, both on ext4 (even with no data journal!) and on ZFS. You basically have to interrupt the write() call at a very precise moment, and good luck doing that, especially when writing small data chunks.

So it really seems that a doublewrite-less MariaDB would be safe from corruption unless extraordinary bad luck (i.e. a mysqld crash at a *really* unlucky moment) hits. I plan to do some more tests with a "real" MariaDB installation being crashed in the middle of intense writes. I'll update you when done.

Test setup:
- CentOS 7 x86-64 VM on a KVM host
- 1 GB RAM
- 8 GB disk
- ext4 (data=ordered) and ZFS (compression=off, xattr=sa, recordsize=16k) filesystems created on top of a ~400 MB file under /dev/shm (basically a ramdisk), mounted on /mnt/
- varying buffer sizes (16k, 128k and 4m)

Results...

# ext4 16k
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
   1000 16      /mnt/append.txt
   1000 ec6affcd48d0f33be5cb211f99453b73  /mnt/append.txt

# ext4 128k
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
   1000 128     /mnt/append.txt
   1000 8f607cfdd2c87d6a7eedb657dafbd836  /mnt/append.txt

# ext4 4m  <-- PARTIAL WRITES DETECTED
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
      1 1624    /mnt/append.txt
      1 2892    /mnt/append.txt
    998 4096    /mnt/append.txt
      1 5ab53863a602f93aaef0c7578bb2f91d  /mnt/append.txt
      1 c67e09d43084ce17cef2f844482bf9a9  /mnt/append.txt
    998 d5e9dca290ea8d856183557a31d5eb72  /mnt/append.txt

ext4 summary: partial writes detected only when the buffer size == 4m.

# zfs 16k (compression=off, xattr=sa, recordsize=16k)
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
   1000 16      /mnt/append.txt
   1000 ec6affcd48d0f33be5cb211f99453b73  /mnt/append.txt

# zfs 128k (compression=off, xattr=sa, recordsize=16k)
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
      4 0       /mnt/append.txt
    996 128     /mnt/append.txt
   1000 8f607cfdd2c87d6a7eedb657dafbd836  /mnt/append.txt

# zfs 4m (compression=off, xattr=sa, recordsize=16k)
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
    353 0       /mnt/append.txt
    647 4096    /mnt/append.txt
   1000 d5e9dca290ea8d856183557a31d5eb72  /mnt/append.txt

ZFS summary: no partial writes detected, although the apparent file size was sometimes wrong (it may be a lazy metadata update; the md5sum was always correct).

I hope the above data is interesting. If I did something wrong, please let me know.

---
[1] Test program

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <signal.h>

#define MAX_COUNT 1
#define MAX_WAIT 1000
#define BUF_SIZE 16*1024 /* or 128*1024 or 4*1024*1024 */

void ChildProcess(void);
void ParentProcess(pid_t);

int main(void)
{
    pid_t pid;
    int i;

    for (i = 0; i < MAX_COUNT; i++) {
        pid = fork();
        if (pid == 0)
            ChildProcess();
        else
            ParentProcess(pid);
    }
    return 0;
}

/* The child keeps rewriting the same BUF_SIZE block at offset 0 with O_SYNC,
   until it is killed by the parent. */
void ChildProcess(void)
{
    int fd;
    int res;
    char *str;

    str = (char *) malloc(BUF_SIZE);
    memset(str, 48, BUF_SIZE);

    fd = open("/mnt/append.txt", O_SYNC | O_WRONLY | O_TRUNC | O_CREAT, 0644);
    while (1) {
        lseek(fd, 0, SEEK_SET);
        res = write(fd, str, BUF_SIZE);
    }
    close(fd);
}

/* The parent waits a random number of microseconds, then SIGKILLs the child. */
void ParentProcess(pid_t pid)
{
    struct timeval tv;
    int res = 0;
    int rnd = 0;

    res = gettimeofday(&tv, NULL);
    srand(tv.tv_usec);
    rnd = random() % MAX_WAIT;
    usleep(rnd);
    kill(pid, 9);
}

-- Danti Gionatan Supporto Tecnico Assyoma S.r.l. - www.assyoma.it email: g.danti@assyoma.it - info@assyoma.it GPG public key ID: FF5F32A8
On 24-08-2018 17:25 Gionatan Danti wrote:
So it really seems that a doublewrite-less MariaDB would be safe from corruption unless extraordinary bad luck (i.e. a mysqld crash at a *really* unlucky moment) hits.
Hi all, I have a follow-up on my tests. It seems that a write() of up to 16 MB is unkillable/unstoppable when done directly on top of a ZFS filesystem. I *think* this is a deliberate result of how the ARC accepts writes for buffering; it does not seem a coincidence that the current maximum recordsize of a ZFS filesystem is 16 MB. On the other hand, layering an ext4 filesystem on top of a ZVOL does *not* avoid partial writes. Similarly, I *think* this is due to the Linux pagecache itself accepting an interrupted write stream.

In short, it seems a performance vs correctness tradeoff: while the pagecache is way faster (for reads/writes that hit), the ARC seems to greatly favor correctness by avoiding interrupted writes. These are only *speculations* on my part, but they are backed by my (empirical) test results. If they are correct, disabling doublewrite would be safe when mysqld runs directly on top of a ZFS filesystem, while it has a (small) probability of corruption when running inside a virtual machine and/or through another filesystem layer.

Regards.

-- Danti Gionatan Supporto Tecnico Assyoma S.r.l. - www.assyoma.it email: g.danti@assyoma.it - info@assyoma.it GPG public key ID: FF5F32A8
participants (5)
- Federico Razzoli
- Gionatan Danti
- Marko Mäkelä
- Reinis Rozitis
- Vladislav Vaintroub