On 14-08-2018 19:58, Vladislav Vaintroub wrote:
> There is at least one case I know where you do not need doublewrite
> buffer. And you even do not need CoW filesystem.
>
> A combination of an OS guarantee that sector-sized writes are
> atomic, and a matching InnoDB page size. If you have
> disks with 4K sectors (quite common), and you chose
> innodb-page-size=4K, and use innodb-flush-neighbors=0 , and use
> Windows as your OS (because this one provides guarantees that
> single-sector sized/aligned writes are atomic as per
> https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writefile
> [1]), then you can safely disable innodb-doublewrite. You do not need
> "supported hardware" for that.
Hi,
let's suppose mysqld crashes during the copy from its internal buffer to
the OS write cache, so that only part of the data is transferred (i.e.
2K of data on a 4Kn disk). With direct writes (or
FILE_FLAG_WRITE_THROUGH), the partial data will be rejected by the
underlying disk, which throws an I/O error. But what about
non-O_DIRECT/FILE_FLAG_WRITE_THROUGH writes?
> As for Linux, I think Marko tested what happens when process is
> getting killed, and sure enough, it can be killed in the middle of a
> larger write, and have partially written data. I suspect that O_DIRECT
> and sector-sized writes might be atomic ( as in Windows example), but
> I did not find any written confirmation for that. Someone with better
> understanding of kernel and filesystems could prove or disprove this
> suspicion.
Yes, an O_DIRECT, single-sector, aligned write *should* be atomic,
assuming the disk rejects the partial write. However, this really is a
hardware-specific condition. Back to ZFS: the entire record *will* be
written atomically. As a first approximation, when recordsize == InnoDB
page size, doublewrite should not be needed. However, as stated above,
what will happen if the mysqld process is killed at the wrong moment?
I fear something like this:
- InnoDB page size and ZFS recordsize are both 16K;
- InnoDB calls write() to copy 16K of internal data to the OS page cache
(ZFS does not support O_DIRECT, by the way);
- mysqld crashes at the worst possible moment, so only half of the InnoDB
internal data (8K) was written by write();
- ZFS receives the partial 8K of data, but it does *not* know the data is
partial (i.e. it "sees" a normal 8K write);
- some seconds later, the partial data is committed to stable storage;
- when mysqld restarts, InnoDB complains about a partial page write.
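To make the scenario above concrete, here is a toy Python sketch of it. It is only an illustration under my own assumptions: the page layout (a payload followed by a CRC32 trailer) is hypothetical and is *not* InnoDB's real on-disk format, and the helper names make_page, page_is_intact and torn_write are made up for this example. The point is simply that the storage layer can write every byte it was given atomically and the resulting page can still be torn.

```python
import struct
import zlib

PAGE_SIZE = 16 * 1024   # same 16K page size / recordsize as in the scenario
CRC_SIZE = 4            # toy trailer: CRC32 of the payload (hypothetical layout)


def make_page(fill: int) -> bytes:
    """Build a toy page: a payload of one repeated byte plus a CRC32 trailer."""
    payload = bytes([fill]) * (PAGE_SIZE - CRC_SIZE)
    return payload + struct.pack("<I", zlib.crc32(payload))


def page_is_intact(page: bytes) -> bool:
    """Return True if the page has the right size and its trailer CRC matches."""
    if len(page) != PAGE_SIZE:
        return False
    payload, trailer = page[:-CRC_SIZE], page[-CRC_SIZE:]
    return struct.unpack("<I", trailer)[0] == zlib.crc32(payload)


def torn_write(old_page: bytes, new_page: bytes, bytes_written: int) -> bytes:
    """Model a write() that was cut short after bytes_written bytes.

    The first bytes_written bytes come from the new page; the rest of the
    on-disk page still holds the old contents.
    """
    return new_page[:bytes_written] + old_page[bytes_written:]


old = make_page(0xAA)                          # page version already on disk
new = make_page(0xBB)                          # page version being flushed
torn = torn_write(old, new, PAGE_SIZE // 2)    # process dies after 8K

print("old intact: ", page_is_intact(old))
print("new intact: ", page_is_intact(new))
print("torn intact:", page_is_intact(torn))
```

In this model the filesystem did nothing wrong: it stored exactly the 8K it received. The inconsistency is only visible to whoever knows where the page boundaries and checksums are, which is why the question of how InnoDB reacts matters.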
This brings up another question: how will InnoDB behave after detecting a
partial page write? Will it shut itself down?