[Maria-discuss] InnoDB: slow insert (too many fsyncs ?)
Hello,

I am trying to optimize the insertion of 124 GB of data into InnoDB and it takes much more time than I expected. For the moment this is only a test before migrating prod, using a USB HDD. Prod has much faster storage, but I prefer to check whether there's anything wrong in my MariaDB setup.

The source DB is read at less than 1 MB/s with less than 20% CPU usage (<10% for mysqld and <10% for the client app doing the migration).

I know that several unsafe settings can be used to import data, but I won't be able to use them for the migration of prod. This is because the migration will be done transparently in the background, the old DB being accessed for not-yet-migrated data. For the same reason, indexes can't be disabled.

Anyway, it was so slow that, for the moment, I only did a full test with unsafe settings:
  innodb_flush_log_at_trx_commit = 0
  innodb_flush_method = nosync
  innodb_doublewrite = 0
  sync_frm = 0

It took exactly 19.8 hours to import all the data. See attached file for all parameters and detailed statistics. (Sorry about the 'show status' outputs: I don't know what I expected to get when I ran 'flush status' while 21G had already been processed. So 'show status.20941778944' is the output just before the flush, and 'show status' is the output after everything finished.)

The schema of the tables can be found here: http://git.erp5.org/gitweb/neoppod.git/blob/HEAD:/neo/storage/database/mysql...

1. fsync

I am not sure I read 'show status' correctly, but I'd say it's actually 6 fsyncs per second, and this matches what I see with strace. Why not 0, given that I disabled them?

There's also less than 1 COMMIT per second, so even without disabling fsyncs I was surprised to see so many in the middle of transactions.

2. Write amplification

During my first tests, I saw with iotop that mysqld wrote at 10 MB/s. Which means a factor of 10. With doublewrite disabled, it's rather a factor of 5.

These ratios are probably underestimated. For better interpretation of the statistics, I should add that:
- the source DB is uncompressed, but its file format is very compact
- most of the imported data ends up in the 'data.value' column, and most values are zlib-compressed, with deduplication as well
The resulting ibdata is only 70 GB.

Is this write amplification only caused by indexes & doublewrite?

Thanks,
Julien
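For reference, the unsafe settings above would go in my.cnf roughly as follows (shown only so the test configuration is explicit; not something to use on prod):

    [mysqld]
    # Write and flush the redo log about once per second instead of at every commit
    innodb_flush_log_at_trx_commit = 0
    # Skip fsync of data files entirely (debug/test value, no crash safety)
    innodb_flush_method = nosync
    # Disable the doublewrite buffer (torn pages cannot be repaired after a crash)
    innodb_doublewrite = 0
    # Do not sync .frm files to disk when tables are created
    sync_frm = 0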
Hi,

You should try to enable innodb_file_per_table (and recreate the table), to avoid filling the ibdata file.

Try also to increase the innodb_log_buffer_size to 16M (your Innodb_log_waits is increasing).

How much memory do you have on your system? The innodb_buffer_pool_size of 1G seems to be a little bit low.

Try also to use the Barracuda file_format instead of Antelope: innodb_file_format = Barracuda & innodb_file_format_max = Barracuda.

For such a large amount of data to import, you could also try to increase the innodb_log_file_size.

On prod, depending on the speed of your disks (what is your disk / CPU setup on prod?), you can also increase the innodb_io_capacity.

Depending on your usage, you could also benefit from switching to TokuDB (fast insert and much better compression).

HTH,
  Jocelyn Fournier
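Put together, these suggestions would look roughly like this in my.cnf (values here are only illustrative starting points; the right ones depend on the available RAM and disks):

    [mysqld]
    # One .ibd file per table instead of growing the shared ibdata file
    innodb_file_per_table = 1
    # Larger log buffer so transactions stop waiting on log flushes (Innodb_log_waits)
    innodb_log_buffer_size = 16M
    # As much RAM as this machine can spare
    innodb_buffer_pool_size = 4G
    # Barracuda enables the newer row formats (COMPRESSED, DYNAMIC)
    innodb_file_format = Barracuda
    innodb_file_format_max = Barracuda
    # Bigger redo logs defer dirty page writeback during the import
    innodb_log_file_size = 1G
    # Raise only if the production disks can sustain more background IO
    innodb_io_capacity = 1000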
For the fsync rate, my memory is vague, but I think InnoDB respects the write-ahead-log rule even with innodb_flush_method = nosync, and this means that before a dirty page is written back, the log must be forced for any changes to that page. If you run strace you can determine which files get the fsync, and if it is only the log then my memory is correct. I haven't used innodb_flush_method = nosync for a very long time (maybe 6 years ago), and when I did I hacked it (probably to not sync the redo log, as suggested above). I almost always use innodb_flush_method=O_DIRECT and then configure enough background write threads.

For write amplification, you are computing the rate as InnoDB-write-rate / InnoDB-read-rate. But I think your InnoDB-read-rate is the scans of the PK index while the write rate includes IO for the PK and secondary indexes. This is still write-amp, but deserves an asterisk to explain that point. Write-amp from the doublewrite buffer can be much cheaper than write-amp from dirty page writeback, where "cheaper" means fewer seeks.

If you want to make the load go faster, then creating the secondary indexes after the table has been loaded might help.

Another interesting question is whether to load in parallel or one table at a time. Sometimes, loading one table at a time helps because you are more likely to keep the working set in memory during the load. This can be very important when secondary indexes are maintained during the load, as you can do too much random read-modify-write for secondary indexes, although the InnoDB change buffer helps a bit with that. The choice here isn't all-parallel vs 1-at-a-time. Doing a few at a time gives some parallel benefit, but also gives each table load a better chance at keeping secondary indexes in memory during the load.

It might help to increase innodb_max_dirty_pages_pct to 90 (you used 75). I don't have current experience with it, but you might also want to increase innodb_max_dirty_pages_pct_lwm. The goal is to not write back dirty pages too soon. Making your InnoDB log files as large as possible will also help defer writing back dirty pages and reduce your page write back rate (and write-amp) in a good way.

| innodb_log_file_size | 524288000
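As an illustration of loading first and indexing afterwards (a sketch with invented table and index names, not the actual NEO schema):

    -- Create the table with only the primary key, so the load maintains a single index
    CREATE TABLE big_table (
      id BIGINT UNSIGNED NOT NULL,
      k BIGINT UNSIGNED NOT NULL,
      value LONGBLOB,
      PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    -- ... bulk INSERTs run here ...

    -- Build the secondary index once, after the data is in place
    ALTER TABLE big_table ADD INDEX k_idx (k);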
-- Mark Callaghan mdcallag@gmail.com
Thank you for all your answers. They really helped me speed up my setup.

Indeed, innodb_buffer_pool_size is important in my case, as you suggested, but I don't really know why. Unless direct IO is used, innodb_buffer_pool_size is redundant with the kernel FS cache. So if MySQL does not compete with other software, it would only save syscalls, not IO access. In fact, I didn't even plan to configure innodb_buffer_pool_size at all. For my use, the FS cache would have been fast enough.

So why is innodb_buffer_pool_size so important when importing so much data? This is only intuition, but the only explanation I see is that it buffers index updates: IOW, secondary indexes would not be updated on disk immediately on commit (not important for durability if this can be recovered after a crash). If I'm right, are there any other settings to make MariaDB even more aggressive about this?

Here are results for different values of innodb_buffer_pool_size:
  128M (default) -> 67 hours
  1G -> 25 hours
  2G -> 13.4 hours
  4G -> 12.1 hours
All other settings being identical. Attached files are full information for 2G & 4G (it looks like for 4G, I showed status before MariaDB finished background tasks).

I think the CPU is fast enough to process everything in less than 5 hours. By switching from the USB HDD to an SSD, I should reach this goal :)

With only 1 client, I tried innodb_buffer_pool_instances=1 but that didn't change anything.
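One way to check that intuition is to watch the change buffer and buffer pool counters during the import (a sketch using standard InnoDB status output; how much index maintenance gets deferred depends on the version and on innodb_change_buffering):

    -- Reads served from the buffer pool vs. reads that had to hit the disk
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';

    -- Which secondary-index operations may be deferred (default is 'all')
    SHOW GLOBAL VARIABLES LIKE 'innodb_change_buffering';

    -- The "INSERT BUFFER AND ADAPTIVE HASH INDEX" section reports the change
    -- buffer size and how many entries were merged in the background
    SHOW ENGINE INNODB STATUS\G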
On 04/26/15 21:39, jocelyn fournier wrote:

> You should try to enable innodb_file_per_table (and recreate the table), to avoid filling the ibdata file.
Unless Barracuda is used, I don't think this affects performance. I tested and it was not faster.
> Try also to increase the innodb_log_buffer_size to 16M (your Innodb_log_waits is increasing).
Ah, this helped. Given that a non-negligible number of transactions exceed 16M, I even increased it to 128M.
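For reference, the counter that motivated this change can be checked directly (a standard status variable; the 128M value is just what happened to fit my transactions):

    -- Number of times a transaction had to wait because the redo log buffer was full
    SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';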
> How much memory do you have on your system? The innodb_buffer_pool_size of 1G seems to be a little bit low.
12G RAM
> Depending on your usage, you could also benefit from switching to TokuDB (fast insert and much better compression).
It was not easy to test on this machine, whose gcc is too old (gcc >= 4.7 is required). So I used another MySQL server inside a Debian chroot.

Here are results for different values of tokudb_cache_size:
  1G -> 24.8 hours
  2G -> 12.6 hours
  4G -> 13.4 hours
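For reference, enabling TokuDB for such a test would look roughly like this in my.cnf (a sketch; the plugin name assumes a MariaDB build that ships ha_tokudb, and the tables must be created with ENGINE=TokuDB):

    [mysqld]
    # Load the TokuDB storage engine plugin
    plugin_load_add = ha_tokudb
    # TokuDB's equivalent of the InnoDB buffer pool size
    tokudb_cache_size = 2G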
> Try also to use the Barracuda file_format instead of Antelope: innodb_file_format = Barracuda & innodb_file_format_max = Barracuda.
This means changing the CREATE TABLE statements in the software, and I prefer to spend time evaluating TokuDB.
> For such a large amount of data to import, you could also try to increase the innodb_log_file_size.
I did one test with 1G instead of 500M (and innodb_buffer_pool_size=2G) and it didn't change anything.

On 04/27/15 18:09, MARK CALLAGHAN wrote:
> For the fsync rate, my memory is vague, but I think InnoDB respects the write-ahead-log rule even with innodb_flush_method = nosync, and this means that before a dirty page is written back, the log must be forced for any changes to that page. If you run strace you can determine which files get the fsync, and if it is only the log then my memory is correct.
Since my first email, I didn't do any other test with unsafe settings, but IIRC, there were also fsyncs to ibdata. Isn't it visible in 'show status'?

  | Innodb_data_fsyncs   | 334552 |
  | Innodb_os_log_fsyncs | 94311  |
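As Mark suggested, strace can attribute each fsync to a file (a sketch; -y needs a reasonably recent strace, and the command must run as a user allowed to trace mysqld):

    # Print every fsync/fdatasync of the running server, with the target file resolved from its fd
    strace -f -tt -y -e trace=fsync,fdatasync -p $(pidof mysqld)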
> I almost always use innodb_flush_method=O_DIRECT and then configure enough background write threads.
Exceptionally, for the import, I could try O_DIRECT. During normal operation, we never use this option because, with several services on the same machine, it's too difficult to find the best value of innodb_buffer_pool_size, and we prefer to let the kernel cache IO for all processes.
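For the record, trying O_DIRECT just for the import would look roughly like this (a sketch; the thread counts are illustrative guesses to tune, not measured values):

    [mysqld]
    # Bypass the FS cache so the buffer pool is the only cache for data files
    innodb_flush_method = O_DIRECT
    # More background IO threads, since the FS cache no longer absorbs writes
    innodb_write_io_threads = 8
    innodb_read_io_threads = 8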
> For write amplification, you are computing the rate as InnoDB-write-rate / InnoDB-read-rate.
No, I computed the read rate of the source DB (which is not a SQL DB) vs the write rate of mysqld. But you're right that reading the source DB does not read any secondary index.
> But I think your InnoDB-read-rate is the scans of the PK index while the write rate includes IO for the PK and secondary indexes. This is still write-amp, but deserves an asterisk to explain that point. Write-amp from the doublewrite buffer can be much cheaper than write-amp from dirty page writeback, where "cheaper" means fewer seeks.
>
> If you want to make the load go faster, then creating the secondary indexes after the table has been loaded might help.
>
> Another interesting question is whether to load in parallel or one table at a time. Sometimes, loading one table at a time helps because you are more likely to keep the working set in memory during the load. This can be very important when secondary indexes are maintained during the load, as you can do too much random read-modify-write for secondary indexes, although the InnoDB change buffer helps a bit with that. The choice here isn't all-parallel vs 1-at-a-time. Doing a few at a time gives some parallel benefit, but also gives each table load a better chance at keeping secondary indexes in memory during the load.
Good to know, but that would make the import algorithm too complex. I think we've reached a point where buying more RAM is cheaper.
> It might help to increase innodb_max_dirty_pages_pct to 90 (you used 75). I don't have current experience with it, but you might also want to increase innodb_max_dirty_pages_pct_lwm. The goal is to not write back dirty pages too soon.
I tried innodb_buffer_pool_size=2G and innodb_max_dirty_pages_pct=90 with:
  - innodb_max_dirty_pages_pct_lwm=0
  - innodb_max_dirty_pages_pct_lwm=75
I can't tell if it was faster or not. The difference in speed is too small.
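For completeness, the combination tested above corresponds to something like this (a sketch of the my.cnf lines involved):

    [mysqld]
    innodb_buffer_pool_size = 2G
    # Allow up to 90% of the buffer pool to be dirty before aggressive flushing starts
    innodb_max_dirty_pages_pct = 90
    # Low-water mark at which pre-flushing starts (0 disables the early flushing heuristic)
    innodb_max_dirty_pages_pct_lwm = 75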
> Making your InnoDB log files as large as possible will also help defer writing back dirty pages and reduce your page write back rate (and write-amp) in a good way.
>
> | innodb_log_file_size | 524288000
Answered above.
participants (3)
- jocelyn fournier
- Julien Muchembled
- MARK CALLAGHAN