On 21/08/2018 11:52, Marko Mäkelä wrote:
> I believe that the Linux kernel can interrupt any write at 4096-byte boundaries when a signal is delivered to the process. I am curious: Where was it claimed that data=journal guarantees atomic writes (other than [1])? I would expect it to only guarantee that anything that was written to the journal will be durable. Whether the actual write request was honored in full is a separate matter.
Sure, ext4 + data=journal only has "atomic writes" in the sense that whatever was written in a journal transaction/commit will be completely committed to the main filesystem. But from the application's point of view, this could very well be a partial write. This is exactly the point I am stressing: durable writes do *not* mean atomicity in the true sense (i.e. from the application's standpoint). In this regard, I would expect ZFS to behave similarly: at TXG commit, anything buffered in RAM (and replicated by the ZIL) would be committed to the main filesystem, but if the application write itself was incomplete (due to an application crash) *and* the application-side doublewrite buffer was disabled, bad things could happen...
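To make the distinction concrete: a write() call can succeed and still report fewer bytes than were requested, independently of how durable the bytes that did land are. The following is only a minimal sketch under that assumption; `write_all` is a hypothetical helper name, not something taken from MariaDB, ext4 or ZFS. It retries until the whole buffer is written, so short writes are detected and completed. It cannot help when the process is killed in the middle of the call, which is the case a doublewrite buffer is meant to cover.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * write_all: hypothetical helper, for illustration only.
 * Keeps calling write() until all `len` bytes are written.
 * Returns `len` on success, -1 on error (errno set by write()).
 */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);

    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any byte was written: retry */
            return -1;      /* real error: caller must treat the page as torn */
        }
        done += (size_t)n;  /* possibly a short write: loop for the remainder */
    }
    return (ssize_t)done;
}
```

With a 16k page, a caller would treat any return value other than 16384 as a failed (and possibly partial) write.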
> Please report back any findings, whether or not you consider them to be interesting.
>
> I believe that it is technically possible for a copy-on-write filesystem like ZFS to support atomic writes, but for that to be possible in practice, the interfaces inside the kernel must be implemented in an appropriate way. Disclaimer: I have no knowledge of the implementation details of any kernel.
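As a user-space analogy of the copy-on-write idea (and only an analogy: this says nothing about how ZFS or any kernel implements it), the usual way to get an all-or-nothing update of a whole file on POSIX systems is to write a new copy, flush it, and then rename() it over the old one. The sketch below assumes a hypothetical helper called `atomic_replace`; it replaces a whole file, so it does not map directly onto in-place page updates inside a large database file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * atomic_replace: hypothetical helper, for illustration only.
 * Writes `len` bytes to "<path>.tmp", flushes them, then renames the
 * temporary file over `path`. A reader sees either the old file or the
 * new one, never a mix. (For full durability the containing directory
 * should be fsync()ed as well; omitted here for brevity.)
 * Returns 0 on success, -1 on error.
 */
int atomic_replace(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t done = 0;
    while (done < len) {
        ssize_t w = write(fd, p + done, len - done);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp);
            return -1;
        }
        done += (size_t)w;
    }

    /* The new copy must be on stable storage before it becomes visible. */
    if (fsync(fd) != 0) { close(fd); unlink(tmp); return -1; }
    if (close(fd) != 0) { unlink(tmp); return -1; }

    /* rename() atomically switches `path` to the new contents. */
    if (rename(tmp, path) != 0) { unlink(tmp); return -1; }
    return 0;
}
```

If the process is killed at any point before rename(), the original file is untouched and only a stale ".tmp" file is left behind.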
I would expect (and I can be wrong!) that "atomic writes" in the MySQL/MariaDB context means more than durable writes; rather, I expect them to be a means to communicate the application's consistency model to the lower layer (i.e. the storage device). Something similar to "buffer all writes and atomically write them into the main filesystem only when I (MariaDB) *explicitly* tell you to do that". In this case, a crashed MariaDB would *never* commit partial data to the main database files.

I wrote a test program[1] which spawns a child that keeps rewriting a buffer at the start of a backing file, while the parent process kills it (-9) at a random time. It seems *very* difficult to cause any sort of partial write, both on ext4 (even without data journaling!) and on ZFS. You basically have to interrupt the write() call at a very precise moment, and good luck doing that, especially when writing small chunks of data. So it really seems that a doublewrite-less MariaDB would be safe from corruption unless extraordinarily bad luck (i.e. mysqld crashing at a *really small* wrong moment) hits.

I plan to do some more tests with a "real" MariaDB installation being crashed in the middle of intense writes. I'll update you when done.

Test setup:
- CentOS 7 x86-64 VM on KVM host
- 1 GB RAM
- 8 GB disk
- ext4 (data=ordered) and ZFS (compression=off, xattr=sa, recordsize=16k) filesystems, each created on top of a ~400 MB file under /dev/shm (basically a RAM disk), mounted on /mnt/
- varying buffer size (16k, 128k and 4m)

Results...
# ext4 16k
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
   1000 16      /mnt/append.txt
   1000 ec6affcd48d0f33be5cb211f99453b73  /mnt/append.txt

# ext4 128k
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
   1000 128     /mnt/append.txt
   1000 8f607cfdd2c87d6a7eedb657dafbd836  /mnt/append.txt

# ext4 4m <-- PARTIAL WRITES DETECTED
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
      1 1624    /mnt/append.txt
      1 2892    /mnt/append.txt
    998 4096    /mnt/append.txt
      1 5ab53863a602f93aaef0c7578bb2f91d  /mnt/append.txt
      1 c67e09d43084ce17cef2f844482bf9a9  /mnt/append.txt
    998 d5e9dca290ea8d856183557a31d5eb72  /mnt/append.txt

ext4 summary: partial writes detected only when buffer size == 4m

# zfs 16k (compression=off, xattr=sa, recordsize=16k)
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
   1000 16      /mnt/append.txt
   1000 ec6affcd48d0f33be5cb211f99453b73  /mnt/append.txt

# zfs 128k (compression=off, xattr=sa, recordsize=16k)
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
      4 0       /mnt/append.txt
    996 128     /mnt/append.txt
   1000 8f607cfdd2c87d6a7eedb657dafbd836  /mnt/append.txt

# zfs 4m (compression=off, xattr=sa, recordsize=16k)
[root@localhost test]# gcc test.c; rm -f /mnt/append.txt; for i in `seq 1 1000`; do ./a.out; du -k --apparent-size /mnt/append.txt; md5sum /mnt/append.txt; done | sort | uniq -c
    353 0       /mnt/append.txt
    647 4096    /mnt/append.txt
   1000 d5e9dca290ea8d856183557a31d5eb72  /mnt/append.txt

ZFS summary: no partial writes detected, although the apparent file size was sometimes wrong (this may be a lazy metadata update; the md5sum was always correct).

I hope the above data is interesting. If I did something wrong, please let me know.

---

[1] Test program

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>   /* gettimeofday() */
#include <signal.h>

#define MAX_COUNT 1
#define MAX_WAIT 1000
#define BUF_SIZE (16*1024) // or (128*1024) or (4*1024*1024)

void ChildProcess(void);
void ParentProcess(pid_t);

int main(void) {
    pid_t pid;
    int i;
    for (i = 0; i < MAX_COUNT; i++) {
        pid = fork();
        if (pid < 0) {
            /* never pass -1 to kill(): it would signal every process */
            perror("fork");
            return 1;
        }
        if (pid == 0)
            ChildProcess();
        else
            ParentProcess(pid);
    }
    return 0;
}

/* Child: rewrite the same BUF_SIZE bytes at offset 0 until killed. */
void ChildProcess(void) {
    int fd;
    int res;
    char *str;
    str = (char *) malloc(BUF_SIZE);
    memset(str, 48, BUF_SIZE);   /* 48 == ASCII '0' */
    /* O_CREAT requires an explicit mode argument */
    fd = open("/mnt/append.txt", O_SYNC | O_WRONLY | O_TRUNC | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        _exit(1);
    }
    while (1) {
        lseek(fd, 0, SEEK_SET);
        res = write(fd, str, BUF_SIZE);
        (void) res;   /* result intentionally ignored: the child just keeps writing */
    }
    close(fd);   /* never reached: the child is killed by the parent */
}

/* Parent: sleep for a random time (< MAX_WAIT microseconds), then SIGKILL the child. */
void ParentProcess(pid_t pid) {
    struct timeval tv;
    int rnd = 0;
    gettimeofday(&tv, NULL);
    srandom(tv.tv_usec);   /* random() is seeded by srandom(); srand() only seeds rand() */
    rnd = random() % MAX_WAIT;
    usleep(rnd);
    kill(pid, 9);
}

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8