Hi,

Just a small correction: ZoL (ZFS on Linux) does not support O_DIRECT, but FreeBSD ZFS does. Other distributions probably do as well.

Regards,
Federico Razzoli

On Tuesday, 14 August 2018, 20:13:59 GMT+1, Gionatan Danti <g.danti@assyoma.it> wrote:

On 14-08-2018 19:58 Vladislav Vaintroub wrote:
There is at least one case I know of where you do not need the doublewrite buffer, and you do not even need a CoW filesystem.
It is a combination of an OS guarantee that sector-sized writes are atomic and an InnoDB page size that matches the sector size. If you have disks with 4K sectors (quite common), choose innodb-page-size=4K, set innodb-flush-neighbors=0, and use Windows as your OS (because it guarantees that single-sector sized and aligned writes are atomic, as per https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writ... [1]), then you can safely disable innodb-doublewrite. You do not need "supported hardware" for that.
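To make the list of conditions above easier to check at a glance, here is a minimal sketch in Python. The function name and its parameters are hypothetical and are not part of MariaDB or any other tool; the logic only restates the claim in the message (sector size equal to page size, no neighbor flushing, and an OS that documents atomic single-sector writes) and does not verify any of it against real hardware.

```python
def doublewrite_can_be_disabled(sector_size: int, page_size: int,
                                flush_neighbors: int, os_name: str) -> bool:
    """Hypothetical helper: True only if every condition from the message holds."""
    # One InnoDB page must fit exactly in one disk sector.
    single_sector_pages = page_size == sector_size
    # innodb-flush-neighbors=0, as required in the message.
    no_neighbor_flushing = flush_neighbors == 0
    # Assumption taken from the message: only Windows is named as
    # documenting atomic single-sector aligned writes.
    os_guarantees_atomic_sector_writes = os_name.lower() == "windows"
    return (single_sector_pages
            and no_neighbor_flushing
            and os_guarantees_atomic_sector_writes)


# 4K sectors, 4K pages, no neighbor flushing, Windows: all conditions met.
print(doublewrite_can_be_disabled(4096, 4096, 0, "Windows"))
# Default 16K pages on 4K sectors: a page spans four sectors, so not met.
print(doublewrite_can_be_disabled(4096, 16384, 0, "Windows"))
```

If any single condition fails, the sketch returns False, which matches the message: the argument depends on all of the conditions holding together.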
Hi, let's suppose mysqld crashes during the copy from its internal buffer to the OS write cache, so that only part of the data is transferred (i.e. 2K of data on a 4Kn disk). With direct writes (or FILE_FLAG_WRITE_THROUGH), the partial data will be rejected by the underlying disk, which throws an I/O error. But what about writes without O_DIRECT/FILE_FLAG_WRITE_THROUGH?
As for Linux, I think Marko tested what happens when the process gets killed, and sure enough, it can be killed in the middle of a larger write and leave partially written data. I suspect that O_DIRECT and sector-sized writes might be atomic (as in the Windows example), but I did not find any written confirmation of that. Someone with a better understanding of the kernel and filesystems could prove or disprove this suspicion.
Yes, O_DIRECT + a single sector-aligned write *should* be atomic, supposing the disk rejects the partial write. However, this really is a hardware-specific condition.

Back to ZFS: the entire record *will* be written atomically. As a first approximation, when recordsize == InnoDB page size, doublewrite should not be needed. However, as stated above, what will happen if the mysqld process is killed at the wrong moment? I fear something like:

- InnoDB pagesize and ZFS recordsize are both at 16K;
- InnoDB calls write() to copy 16K of internal data to the OS pagecache (ZFS does not support O_DIRECT, by the way);
- mysqld crashes at the worst possible moment, so only 1/2 of the InnoDB internal data (8K) was written by write();
- ZFS received the partial 8K data, but it does *not* know these are partial data only (i.e. it "sees" a normal 8K write);
- some seconds later, the partial data are committed to stable storage;
- when mysqld restarts, InnoDB complains about a partial page write.

This brings up another question: how will InnoDB behave after detecting a partial page write? Will it shut itself down?

Thanks.

--
Danti Gionatan
Technical Support
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
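
For illustration only, the feared sequence above can be sketched as a small Python simulation. Everything here is an assumption made for the example: the toy page layout (the same version stamp at the start and the end of a 16K page) and the names make_page, page_is_intact and torn_write are hypothetical, and InnoDB's real on-disk format and checksum logic are different. The sketch only shows why a filesystem that atomically stores whatever bytes it received cannot, by itself, protect against a page that was only half copied by the application.

```python
PAGE_SIZE = 16 * 1024  # 16K, as in the scenario above
STAMP = 8              # bytes of version stamp at each end of the toy page


def make_page(version: int, payload: bytes) -> bytes:
    """Build a toy page: stamp + zero-padded payload + the same stamp."""
    room = PAGE_SIZE - 2 * STAMP
    if len(payload) > room:
        raise ValueError("payload too large for one page")
    stamp = version.to_bytes(STAMP, "big")
    return stamp + payload.ljust(room, b"\0") + stamp


def page_is_intact(page: bytes) -> bool:
    """A page is intact if it is full-sized and both stamps agree."""
    return len(page) == PAGE_SIZE and page[:STAMP] == page[-STAMP:]


def torn_write(old_page: bytes, new_page: bytes, bytes_written: int) -> bytes:
    """Simulate a write() cut short: new bytes first, old bytes after."""
    return new_page[:bytes_written] + old_page[bytes_written:]


old = make_page(1, b"old row data")
new = make_page(2, b"new row data")

# The process dies after copying only the first 8K of the new page.
on_disk = torn_write(old, new, PAGE_SIZE // 2)

print(page_is_intact(old))      # True
print(page_is_intact(new))      # True
print(page_is_intact(on_disk))  # False: the stamps disagree
```

From the filesystem's point of view the torn page is a perfectly valid 16K record; only the application-level check notices that the two halves come from different versions.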