Hi Kristian,
(2013/02/08 23:52), Kristian Nielsen wrote:
Toshikuni Fukaya <toshikuni-fukaya@cybozu.co.jp> writes:
I made a patch to accelerate CREATE TABLE on the innodb plugin. To zero table spaces, I used fallocate instead of normal writes and sync.
Thanks for your work. Today I took a closer look at the patch and the deeper issues.
I think we first need to understand better:
1. What are the inefficiencies of the current code.
2. What is the effect of the patch, why does it speed things up.
I think fallocate is more lightweight than the write system call, because fallocate changes only metadata instead of actual data: it allocates extents and flags them as 'unwritten'. (You can confirm this with the filefrag command and its -v option.) The write system call, on the other hand, writes actual data to the disk, so its I/O to the physical disk is more expensive.
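To make the contrast concrete, here is a minimal sketch; it is not code from the patch. It assumes posix_fallocate() as a portable stand-in for fallocate (note that glibc may emulate it with real writes on file systems without native support), and the 16KB zero buffer is an arbitrary choice. All function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Grow the file by reserving space only; no data buffer is handed to the
   kernel. Returns 0 on success or an error number (posix_fallocate does
   not set errno). */
int extend_with_fallocate(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    return posix_fallocate(fd, offset, len);
}

/* Grow the file by writing zero-filled buffers, i.e. real data I/O.
   Returns 0 on success or an errno value. */
int extend_with_writes(int fd, off_t offset, off_t len)
{
    static const char zeros[16384];  /* static storage is zero-initialized */

    assert(offset >= 0 && len > 0);
    while (len > 0) {
        size_t chunk = len < (off_t) sizeof zeros ? (size_t) len : sizeof zeros;
        ssize_t n = pwrite(fd, zeros, chunk, offset);

        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted, retry the same chunk */
            return errno;
        }
        if (n == 0)
            return EIO;              /* no progress; avoid looping forever */
        offset += n;
        len -= n;
    }
    return 0;
}

/* Current file size in bytes, or -1 on error. */
long long file_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? (long long) st.st_size : -1;
}
```

Both functions leave the file at the same size and the new range reads back as zeros; the difference is only in how much data has to travel to the disk.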
(snip.)
The current code does a series of 1MB writes at the end of the file to extend the file. I think these writes are done with O_DIRECT if O_DIRECT is enabled. I do not see any fsync() or fdatasync() call at the end.
Is this true? At the end of fil_extend_space_to_desired_size, fil_flush is called. It seems that this function calls fsync indirectly.
Did you run your benchmarks with O_DIRECT or without? By how much is the tablespace extended on each call to fil_extend_space_to_desired_size()?
I did that with O_DIRECT. I observed fil_extend_space_to_desired_size using printf; it shows that this function typically writes 16KB to 64KB, because the created tablespaces are smaller than 400KB.
One possible inefficiency is if each 1MB O_DIRECT write flushes to disk both the data written and the new size of the file. I did not find a conclusive answer to this one way or the other; maybe it depends on the file system? A 1MB sequential write to a non-SSD hard disk costs around the same as one random I/O, so this alone could double the time needed.
I found that the writes are very small, as described above, so this causes a lot of random disk access.
(snip.)
So possible reasons for the speedup from the patch include:
- Less syncing of metadata to disk during extending the file.
- Saving initial write of disk page.
I want to understand which of these are in play, if any, and if there are other effects.
I think both. fallocate reduces both the writes and the syncs.
(snip.)
I have also some smaller comments on the patch itself, given inline below. But we should resolve the above general problems first, probably:
Original MariaDB 5.5 trunk: 64.9 seconds (5 times avg.)
Patched MariaDB 5.5 trunk: 36.0 seconds (5 times avg.)
The patch would have to be made against MariaDB 10.0. But I looked, and the code is much the same in 10.0 and 5.5, so that should not be too hard.
Ok, I will try it.
+ ibool fallenback = TRUE;
+ fallenback = FALSE;
+ goto extend_after;
+extend_after:
+ fil_node_complete_io(node, fil_system, fallenback ? OS_FILE_WRITE : OS_FILE_READ);
I think the two different methods of extending the file should be done in separate functions called from here. Then the complexity with the `fallenback' flag and the goto is not needed.
I think that is better, too. Thanks.
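A rough sketch of what such a split could look like, under assumptions: the names below are hypothetical and are not InnoDB functions, posix_fallocate() stands in for the preallocation call, and plain file descriptors replace fil_node_t. The caller learns which path ran from the return value, so neither the flag nor the goto is needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Reports which method extended the file (hypothetical type). */
typedef enum {
    EXTEND_FAILED = 0,
    EXTEND_PREALLOCATED,   /* space reserved, no page data written */
    EXTEND_ZERO_WRITTEN    /* fallback path: zero-filled data written */
} extend_result;

/* Method 1: reserve [old_size, new_size). Returns 0 or an error number. */
int extend_by_preallocation(int fd, off_t old_size, off_t new_size)
{
    return posix_fallocate(fd, old_size, new_size - old_size);
}

/* Method 2: append zero-filled buffers. Returns 0 or an errno value. */
int extend_by_zero_writes(int fd, off_t old_size, off_t new_size)
{
    static const char zeros[16384];  /* zero-initialized static buffer */
    off_t pos = old_size;

    while (pos < new_size) {
        off_t left = new_size - pos;
        size_t chunk = left < (off_t) sizeof zeros ? (size_t) left : sizeof zeros;
        ssize_t n = pwrite(fd, zeros, chunk, pos);

        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted, retry */
            return errno;
        }
        if (n == 0)
            return EIO;              /* no progress */
        pos += n;
    }
    return 0;
}

/* Dispatcher: try preallocation first, fall back to zero writes. */
extend_result extend_file(int fd, off_t old_size, off_t new_size)
{
    assert(fd >= 0);
    if (new_size <= old_size)
        return EXTEND_FAILED;        /* nothing to extend */
    if (extend_by_preallocation(fd, old_size, new_size) == 0)
        return EXTEND_PREALLOCATED;
    if (extend_by_zero_writes(fd, old_size, new_size) == 0)
        return EXTEND_ZERO_WRITTEN;
    return EXTEND_FAILED;
}
```

In the real function, the caller could then choose the argument to fil_node_complete_io() from the returned value instead of from `fallenback'.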
Also, I think there should be an fdatasync() after fallocate(). Or what do you think, is it not needed, and why not? What is the typical size that a tablespace is extended with? The os_file_set_size() does have an fdatasync().
Since fallocate modifies only metadata, I think the modification is protected from corruption by the filesystem journal. My patch has no sync in os_file_set_size or in fil_extend_space_to_desired_size.
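For comparison, the conservative variant with a sync after the allocation would be small. This is only a sketch under assumptions: the function name is hypothetical, posix_fallocate() stands in for fallocate, and whether the fdatasync() is actually required depends on the file system and its journaling mode, which is the open question here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve [offset, offset + len) and then ask the kernel to make the
   change durable. Returns 0 on success or an error number. */
int preallocate_and_sync(int fd, off_t offset, off_t len)
{
    int err;

    assert(fd >= 0 && offset >= 0 && len > 0);

    /* posix_fallocate returns the error number directly. */
    err = posix_fallocate(fd, offset, len);
    if (err != 0)
        return err;

    /* fdatasync also flushes metadata needed to read the data back,
       which includes a changed file size. */
    if (fdatasync(fd) != 0)
        return errno;

    return 0;
}
```

The extra cost is one sync per extension, so it would be worth measuring against the unsynced numbers above before deciding.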
- Kristian.
Regards, Toshikuni Fukaya