Toshikuni Fukaya <toshikuni-fukaya@cybozu.co.jp> writes:
I made a patch to accelerate CREATE TABLE on the innodb plugin. To zero table spaces, I used fallocate instead of normal writes and sync.
Thanks for your work. Today I took a closer look at the patch and the deeper issues. I think we first need to better understand:

1. What the inefficiencies of the current code are.
2. What the effect of the patch is, and why it speeds things up.

I mean, understand in terms of exactly which I/O operations are performed on the disk in the different cases. Mostly for the ext4 and XFS file systems, though maybe there are others of interest.

The current code does a series of 1MB writes at the end of the file to extend it. I think these writes are done with O_DIRECT if O_DIRECT is enabled. I do not see any fsync() or fdatasync() call at the end. Did you run your benchmarks with or without O_DIRECT? By how much is the tablespace extended on each call to fil_extend_space_to_desired_size()?

One possible inefficiency is that each 1MB O_DIRECT write flushes to disk both the data written and the new size of the file. I did not find a conclusive answer one way or the other; maybe it depends on the file system? A 1MB sequential write to a non-SSD hard disk costs about the same as one random I/O, so this alone could double the time needed.

Another potential inefficiency is that the existing code first writes zeros to each data page. But then what happens when the page is first needed? I assume it is not read; rather, a new page is initialised and written. So if fallocate() on the given system can just mark the blocks allocated, then we can save the initial write of zeros and just write the initial page later. On the other hand, that initial write will then need to update metadata saying that the disk blocks are now in use. So you also need to benchmark the cost of both creating the table and afterwards filling it up with data, in a situation where I/O is the bottleneck, not CPU. This needs to be done a bit carefully to ensure that the tablespace pages are actually written (initially only the redo log is written).
So possible reasons for the speedup from the patch include:

- Less syncing of metadata to disk while extending the file.
- Saving the initial write of each disk page.

I want to understand which of these are in play, if any, and whether there are other effects. Apart from researching documentation and so on, one way to understand this better is to run the benchmarks inside a virtual machine like kvm and run strace on the kvm process in the host. This shows all I/O operations going to the disk.

On the other hand, using fallocate to extend the file in one go gives the file system more information than writing it in pieces, so this could potentially be a better method. But it touches a core part of I/O performance, so we need to understand it.

I also have some smaller comments on the patch itself, given inline below. But we should probably resolve the above general questions first:
Original MariaDB 5.5 trunk: 64.9 seconds (avg. of 5 runs)
Patched MariaDB 5.5 trunk:  36.0 seconds (avg. of 5 runs)
The patch would have to be made against MariaDB 10.0. But I checked, and the code looks much the same in 10.0 as in 5.5, so it should not be too hard.
+ ibool fallenback = TRUE;
+ fallenback = FALSE;
+ goto extend_after;
+extend_after:
+ fil_node_complete_io(node, fil_system, fallenback ? OS_FILE_WRITE : OS_FILE_READ);
I think the two different methods of extending the file should be done in separate functions called from here; then the complexity with the `fallenback' flag and the goto is not needed.

Also, I think there should be an fdatasync() after fallocate(). Or what do you think, is it not needed, and why not? What is the typical size that a tablespace is extended by? The os_file_set_size() does have an fdatasync().

 - Kristian.