[Maria-developers] accelerating CREATE TABLE
Hi,
I made a patch to accelerate CREATE TABLE on the InnoDB plugin. To zero tablespaces, I used fallocate instead of normal writes and sync. I compared MariaDB 5.5 trunk with the patched version using our own application, which creates more than 1000 tables in series, on an ext4 filesystem on Ubuntu 12.04. The results are shown below.
Original MariaDB 5.5 trunk: 64.9 seconds (5 times avg.)
Patched MariaDB 5.5 trunk: 36.0 seconds (5 times avg.)
Any comments or suggestions?
Thanks, Toshikuni Fukaya
Can anyone provide results for XFS? The ext-2/3/4 variants that I used in the past were prone to stalls from sequential writes.
On Tue, Jan 22, 2013 at 3:03 AM, Toshikuni Fukaya <toshikuni-fukaya@cybozu.co.jp> wrote:
Hi,
I made a patch to accelerate CREATE TABLE on the innodb plugin. To zero table spaces, I used fallocate instead of normal writes and sync. I compared MariaDB 5.5 trunk and the patched one by our own application, which creates > 1000 tables in series, with ext4 filesystem on Ubuntu 12.04. Results are shown below.
Original MariaDB 5.5 trunk: 64.9 seconds (5 times avg.) Patched MariaDB 5.5 trunk: 36.0 seconds (5 times avg.)
Any comments or suggestions?
Thanks, Toshikuni Fukaya
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp
-- Mark Callaghan mdcallag@gmail.com
Hi Mark,
I tried benchmarking this patch on XFS. In this test, it showed results similar to those on ext4:
Original MariaDB 5.5 trunk: 62.0 seconds (5 times avg.)
Patched MariaDB 5.5 trunk: 32.6 seconds (5 times avg.)
BTW, these results are the total time for the CREATE TABLE statements, so it is unclear whether any stalls happened.
Thanks, Toshikuni Fukaya
(2013/01/22 23:09), MARK CALLAGHAN wrote:
Can anyone provide results for XFS? The ext-2/3/4 variants that I used in the past were prone to stalls from sequential writes.
Hi,
I want to know whether this patch is correct and, if so, could I ask you to merge it into MariaDB? Could you review it? If there is anything I need to prepare for the review, please tell me. Any comments are welcome.
Thanks, Toshikuni Fukaya
Toshikuni Fukaya <toshikuni-fukaya@cybozu.co.jp> writes:
I want to know whether this patch is correct, and if so, could I ask you to merge this to Maria?
Sorry that we have not had time to get back to you. There is a lot going on at the moment, but the whole point of MariaDB is to be a place where the community can work together on the codebase. So it is important that we be responsive.
I will try to review the patch this week. I am not familiar with this area of the code, so it would help me if you could provide some additional information.
- You mention CREATE TABLE, but I assume the patch actually speeds up creation of a new tablespace, right? Is it used for the creation of all tablespaces, or only for tablespaces used for tables when --innodb-file-per-table=1?
- I think InnoDB has the option to auto-extend tablespaces, as well as create new ones. Does this patch handle both creating new and autoextension? Or only the former?
- Do you consider the patch complete? Or do you have ideas for how it could be extended, perhaps in a future version?
- Is there any other interaction with other parts of the code in InnoDB (or the server) that you are aware of?
- Did you get any useful feedback (and if so, what) when you tried posting this to the internals@mysql.com mailing list? Or just the extremely arrogant "we do not like the patch and we will not say why" that was replied on the public list?
Anyway, thanks a lot for your efforts so far, and I will try to get back to you at the end of the week.
- Kristian.
Hi Kristian,
Toshikuni Fukaya <toshikuni-fukaya@cybozu.co.jp> writes:
I want to know whether this patch is correct, and if so, could I ask you to merge this to Maria?
Sorry that we have not had time to get back to you. There is a lot going on at the moment, but the whole point of MariaDB is to be a place where the community can work together on the codebase. So it is important that we be responsive.
I will try to review the patch this week. I am not familiar with this area of the code, so it would help me if you could provide some additional information.
- You mention CREATE TABLE, but I assume the patch actually speeds up creation of a new tablespace, right? Is it used for the creation of all
You are right, this affects creation of a new tablespace.
tablespaces, or only for tablespaces used for tables when --innodb-file-per-table=1?
Yes, I set innodb-file-per-table=1 in my.cnf.
- I think InnoDB has the option to auto-extend tablespaces, as well as create new ones. Does this patch handle both creating new and autoextension? Or only the former?
It handles both. I mainly focused on CREATE TABLE, but I think InnoDB creates indexes when the CREATE TABLE statement defines keys, and autoextension then happens during index creation. So I modified that path as well.
- Do you consider the patch complete? Or do you have ideas for how it could be extended, perhaps in a future version?
Currently I don't have any ideas for improving this, but I think that in a future version it should be possible to turn the feature off for people who do not want to use it.
- Is there any other interaction with other parts of the code in InnoDB (or the server) that you are aware of?
I relied on the assumption that fallocate does not need an fsync, since the metadata is protected by the filesystem journal. But I am not confident that this is true. I wonder whether this patch could prevent InnoDB from committing schema changes correctly.
- Did you get any useful feedback (and if so, what) when you tried posting this to the internals@mysql.com mailing list? Or just the extremely arrogant "we do not like the patch and we will not say why" that was replied on the public list?
No, I only got a comment rejecting the patch, with no reason given. I wanted to know whether my idea is correct, or to get some suggestions.
Anyway, thanks a lot for your efforts so far, and I will try to get back to you at the end of the week.
- Kristian.
Thanks a lot for your attention, Toshikuni Fukaya
Toshikuni Fukaya <toshikuni-fukaya@cybozu.co.jp> writes:
I made a patch to accelerate CREATE TABLE on the innodb plugin. To zero table spaces, I used fallocate instead of normal writes and sync.
Thanks for your work. Today I took a closer look at the patch and the deeper issues.
I think we first need to understand better:
1. What are the inefficiencies of the current code?
2. What is the effect of the patch, and why does it speed things up?
I mean, understand in terms of exactly what I/O operations are performed on the disk in the different cases. Mostly for the ext4 and XFS file systems, though maybe there are others of interest.
The current code does a series of 1MB writes at the end of the file to extend the file. I think these writes are done with O_DIRECT if O_DIRECT is enabled. I do not see any fsync() or fdatasync() call at the end.
Did you run your benchmarks with O_DIRECT or without? How much is the tablespace extended with for every call to fil_extend_space_to_desired_size()?
One possible inefficiency is if each 1MB O_DIRECT write flushes to disk both the data written and also the new size of the file. I did not find a conclusive answer to this one way or the other; maybe it depends on the file system? A 1MB sequential write to a non-SSD hard disk costs around the same as one random I/O, so this alone could double the time needed.
Another potential inefficiency is that the existing code first writes zeros to each data page. But then what happens when the page is first needed? I assume it is not read; rather, a new page is initialised and written. So if fallocate() on the given system can just mark the block allocated, then we can save the initial write of zeros, just writing the initial page later.
On the other hand, that initial write will then need to update metadata saying that the disk blocks are now in use. So you need to also benchmark the cost of both creating the table and then afterwards filling it up with data, in a situation where the I/O is the bottleneck, not CPU. This needs to be done a bit carefully to ensure that the tablespace pages are actually written (initially only the redo log is written).
So possible reasons for the speedup from the patch include:
- Less syncing of metadata to disk during extending the file.
- Saving initial write of disk page.
I want to understand which of these are in play, if any, and if there are other effects.
Apart from researching documentation and so on, one way to understand this better is to run benchmarks inside a virtual machine like kvm, and run strace on the kvm process in the host. This shows all I/O operations to the disk.
On the other hand, using fallocate to extend the file in one go gives the file system more information than writing it in pieces. So this could potentially be a better method. But it touches a core part of I/O performance, so we need to understand it.
I have also some smaller comments on the patch itself, given inline below. But we should resolve the above general problems first, probably:
Original MariaDB 5.5 trunk: 64.9 seconds (5 times avg.) Patched MariaDB 5.5 trunk: 36.0 seconds (5 times avg.)
The patch would have to be made against MariaDB 10.0. But I looked, the code looks much the same in 10.0 and 5.5, so should not be too hard.
+ ibool fallenback = TRUE;
+ fallenback = FALSE;
+ goto extend_after;
+extend_after:
+ fil_node_complete_io(node, fil_system, fallenback ? OS_FILE_WRITE : OS_FILE_READ);
I think the two different methods of extending the file should be done in separate functions called from here. Then the complexity with the `fallenback' flag and the goto is not needed.
Also, I think there should be an fdatasync() after fallocate(). Or what do you think, is it not needed, and why not? What is the typical size that a tablespace is extended with? The os_file_set_size() does have an fdatasync().
- Kristian.
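The structure suggested in this review (two separate functions for the two methods, called from one place, with no flag and no goto) could look roughly like the sketch below. All three function names are hypothetical and this is not the patch itself. It uses posix_fallocate() as a stand-in for whichever fallocate call the patch makes, and it includes the fdatasync() after the allocation as the conservative choice, since whether that sync is needed is still an open question in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Method 1 (hypothetical name): reserve the new range without writing
 * data, then flush so that the new file size is durable. */
static int extend_by_fallocate(int fd, off_t cur_size, off_t new_size)
{
    if (posix_fallocate(fd, cur_size, new_size - cur_size) != 0)
        return -1;
    return fdatasync(fd);
}

/* Method 2 (hypothetical name): write zero-filled blocks, then flush.
 * Simplified: any failed write, including EINTR, is treated as an error. */
static int extend_by_writes(int fd, off_t cur_size, off_t new_size)
{
    static const char zeros[64 * 1024];   /* static storage is zeroed */
    off_t off = cur_size;

    while (off < new_size) {
        size_t want = sizeof zeros;
        if (new_size - off < (off_t)sizeof zeros)
            want = (size_t)(new_size - off);
        ssize_t n = pwrite(fd, zeros, want, off);
        if (n <= 0)
            return -1;
        off += n;
    }
    return fdatasync(fd);
}

/* Caller: try the cheap method first and fall back to plain writes. */
int extend_file(int fd, off_t cur_size, off_t new_size)
{
    if (new_size <= cur_size)
        return 0;                          /* nothing to extend */
    if (extend_by_fallocate(fd, cur_size, new_size) == 0)
        return 0;
    return extend_by_writes(fd, cur_size, new_size);
}
```

With this split, the caller no longer needs to remember which path was taken, and each method can decide for itself whether and how to sync.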
Hi Kristian, (2013/02/08 23:52), Kristian Nielsen wrote:
Toshikuni Fukaya <toshikuni-fukaya@cybozu.co.jp> writes:
I made a patch to accelerate CREATE TABLE on the innodb plugin. To zero table spaces, I used fallocate instead of normal writes and sync.
Thanks for your work. Today I took a closer look at the patch and the deeper issues.
I think we need to understand first better
1. What are the inefficiencies of the current code.
2. What is the effect of the patch, why does it speed things up.
I think fallocate is more lightweight than the write system call, because fallocate changes only metadata rather than actual data: it allocates extents and flags them as 'unwritten'. (You can confirm this with the filefrag command and its -v option.) The write system call, on the other hand, writes actual data to the disk, so its I/O to the physical disk is more expensive.
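A minimal sketch of the metadata-only extension described above, with a small helper to look at how much disk space backs the file afterwards. The names reserve_range and allocated_bytes are hypothetical, and posix_fallocate() is used here as the portable call; the patch may use a different one. On a filesystem without native support, the C library may emulate posix_fallocate() with real writes, in which case nothing is saved.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: reserve the range [cur_size, new_size) without
 * writing any data. Returns 0 on success or an errno-style code
 * (posix_fallocate reports errors through its return value). */
int reserve_range(int fd, off_t cur_size, off_t new_size)
{
    if (new_size <= cur_size)
        return 0;               /* nothing to reserve */
    return posix_fallocate(fd, cur_size, new_size - cur_size);
}

/* Hypothetical helper: disk space currently backing the file, in bytes.
 * On Linux, st_blocks counts 512-byte units. Returns -1 on error. */
long long allocated_bytes(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_blocks * 512;
}
```

Comparing allocated_bytes() before and after reserve_range() gives a coarse view of the allocation; filefrag -v additionally shows whether the new extents are flagged as unwritten.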
(snip.)
The current code does a series of 1MB writes at the end of the file to extend the file. I think these writes are done with O_DIRECT if O_DIRECT is enabled. I do not see any fsync() or fdatasync() call at the end.
Is this true? At the end of fil_extend_space_to_desired_size, fil_flush is called. It seems that this function calls fsync indirectly.
Did you run your benchmarks with O_DIRECT or without? How much is the tablespace extended with for every call to fil_extend_space_to_desired_size()?
I ran them with O_DIRECT. I observed fil_extend_space_to_desired_size using printf, and it shows that this function typically writes 16KB to 64KB per call. This is because the created tablespaces are smaller than 400KB.
One possible inefficiency is if each 1MB O_DIRECT write flushes to disk both the data written and also the new size of the file. I did not find conclusive answer to this one way or the other, maybe it depends on the file system? An 1MB sequential write to a non-SSD harddisk costs around the same as one random I/O, so this alone could double the time needed.
I found that the writes are very small, as described above. So this causes a lot of random access to the disk.
(snip.)
So possible reasons for the speedup from the patch include:
- Less syncing of metadata to disk during extending the file.
- Saving initial write of disk page.
I want to understand which of these are in play, if any, and if there are other effects.
I think both. fallocate reduces both the writes and the syncs.
(snip.)
I have also some smaller comments on the patch itself, given inline below. But we should resolve the above general problems first, probably:
Original MariaDB 5.5 trunk: 64.9 seconds (5 times avg.) Patched MariaDB 5.5 trunk: 36.0 seconds (5 times avg.)
The patch would have to be made against MariaDB 10.0. But I looked, the code looks much the same in 10.0 and 5.5, so should not be too hard.
Ok, I will try it.
+ ibool fallenback = TRUE;
+ fallenback = FALSE;
+ goto extend_after;
+extend_after:
+ fil_node_complete_io(node, fil_system, fallenback ? OS_FILE_WRITE : OS_FILE_READ);
I think the two different methods of extending the file should be done in separate functions called from here. Then the complexity with the `fallenback' flag and the goto is not needed.
I think it is better, too. Thanks.
Also, I think there should be an fdatasync() after fallocate(). Or what do you think, is it not needed, and why not? What is the typical size that a tablespace is extended with? The os_file_set_size() does have an fdatasync().
Since fallocate modifies only metadata, I think this modification is protected from corruption by the filesystem journal. There is no sync in my patch for os_file_set_size, nor for fil_extend_space_to_desired_size.
- Kristian.
Regards, Toshikuni Fukaya
Toshikuni Fukaya <toshikuni-fukaya@cybozu.co.jp> writes:
Hi Kristian,
I am sorry, I really wanted to look more into this, but I have not been able to find the time to do so, being really busy with global transaction ID, Debian packaging, and other stuff. I will try to find someone else to look more into your patch. - Kristian.
participants (3)
- Kristian Nielsen
- MARK CALLAGHAN
- Toshikuni Fukaya