[Maria-discuss] Very slow insert with load data infile
Hello,

MariaDB is getting slower and slower while I'm inserting a massive amount of data. I'm trying to insert 166 507 066 rows (12 GB) using LOAD DATA INFILE into an empty table. I split my file into 13 parts of equal size because loading it in one shot took too long. After more than 100M rows were inserted, it became extremely slow. Here are the times for each LOAD DATA INFILE:

LOAD DATA INFILE '/tmp/hash/xaa' : 2 min 49 sec
LOAD DATA INFILE '/tmp/hash/xab' : 2 min 49 sec
LOAD DATA INFILE '/tmp/hash/xac' : 3 min 48 sec
LOAD DATA INFILE '/tmp/hash/xad' : 3 min 48 sec
LOAD DATA INFILE '/tmp/hash/xae' : 3 min 49 sec
LOAD DATA INFILE '/tmp/hash/xaf' : 5 min 59 sec
LOAD DATA INFILE '/tmp/hash/xag' : 10 min 50 sec
LOAD DATA INFILE '/tmp/hash/xah' : 20 min 7 sec
LOAD DATA INFILE '/tmp/hash/xai' : 47 min 7 sec
LOAD DATA INFILE '/tmp/hash/xaj' : 4 hours 1 min 9.34 sec (ouch!)
LOAD DATA INFILE '/tmp/hash/xak' : still running...

Right now the task is running, but MariaDB is using only 2% CPU and reading from the disk at 2500 KB/sec (which is slow given the disk's performance). There is plenty of free disk space. If the insertion speed keeps degrading at this rate, it will take weeks to insert the 3 remaining files! It's only 166M rows, so I don't think this is normal. If I drop the UNIQUE constraints the insertion speed is fine, but I need to be sure there are no duplicates, and I also need an index.

Can someone explain to me what is happening internally and why it is so slow? Do you think it would be faster if I just created a plain index (not a unique one)? The problem is that I specifically delegated the uniqueness checking to MySQL. If there is no solution, do you know a database or key/value store with better performance for this use case (create an index and remove duplicates)? I can rearrange my table structure (at the cost of more disk space) to match a key/value layout.

Here is my current table structure:

CREATE TABLE `wallets` (
  `p1` varbinary(20) DEFAULT NULL,
  `p2` varbinary(20) DEFAULT NULL,
  `data` varbinary(32) DEFAULT NULL,
  UNIQUE KEY `p1` (`p1`),
  UNIQUE KEY `p2` (`p2`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=FIXED;

Computer specs: 16 GB RAM, 4 cores at 3.1 GHz, 2x1 TB disks (with 90% of the disk space unused).

The configuration variables (tuned specifically for this insertion job):

key_buffer_size = 6G              # MyISAM: used when inserting into an empty table
bulk_insert_buffer_size = 6G      # MyISAM/Aria: inserts into a non-empty table; default 8M
myisam_sort_buffer_size = 6G      # MyISAM: used for sorting, but only for CREATE INDEX, ALTER TABLE or REPAIR TABLE
read_buffer_size = 6G             # MyISAM/Aria/MERGE: allocated for each table scan, ORDER BY, nested query caching, bulk insert into partitions
sort_buffer_size = 4M             # ALL: allocated each time a session needs to do a sort
myisam_max_sort_file_size = 128G  # MyISAM: tmp file max size

I'm using MariaDB 5.5.31.

Best regards,
On 10.02.2014 13:45, Pierre wrote:
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
The reason is the permanently updated keys; this is expected in general for large inserts into a new table with

UNIQUE KEY `p1` (`p1`),
UNIQUE KEY `p2` (`p2`),

* remove the keys
* insert the data
* add the keys

https://dev.mysql.com/doc/refman/5.5/en/insert-speed.html
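In SQL terms, that sequence would look roughly like this (a sketch only; it reuses the `wallets` table and the chunk files from the first message, and whether re-adding the UNIQUE keys at the end is fast is exactly what the rest of this thread is about):

ALTER TABLE wallets DROP KEY p1, DROP KEY p2;          -- remove the keys
LOAD DATA INFILE '/tmp/hash/xaa' INTO TABLE wallets;   -- insert the data
-- ... repeat for the remaining chunk files ...
ALTER TABLE wallets ADD UNIQUE KEY p1 (p1),
                    ADD UNIQUE KEY p2 (p2);            -- add the keys back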
Using this technique I have the same issue. It has now been running for several hours, I'm at about 40%, and looking at SHOW FULL PROCESSLIST it's getting slower and slower. It will never finish. I think there is a bug here.

Firstly, regarding memory usage, it doesn't correctly use the buffers I set; I think it's only using key_buffer_size. myisam_sort_buffer_size and bulk_insert_buffer_size are not used for this task. So what's happening? When the RAM is full, MariaDB does random access on the disk to sort and filter gigabytes of data! That's why my CPU was only a few percent used: the process was in I/O wait most of the time. So what I am saying here is: MariaDB can't create UNIQUE keys/indexes if the rows don't fit in RAM.

However, if I try to create a standard (i.e. non-unique) index, it works well and is done in less than an hour:

ALTER IGNORE TABLE mytable ADD INDEX (c1), ADD INDEX(c2);
Query OK, 349086532 rows affected (44 min 25.21 sec)
Records: 349086532  Duplicates: 0  Warnings: 0

In this second case, MariaDB does a good job of using myisam_sort_buffer_size. I think it does something like an external/merge sort: splitting the rows into parts that fit in RAM, sorting them, merging them and creating the index. It was at 100% CPU most of the time; when it was not, it was because MySQL was loading rows into RAM from the hard disk (and not doing random access on the disk as it does when creating a unique index). So why does a UNIQUE index not behave the same way?

It's easy to reproduce the bug: just create a binary file of 2 or 3 times the size of RAM, then LOAD DATA INFILE it and try to create a UNIQUE index on it. It will never end.
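For reference, the reproduction can even be scripted purely in SQL (a hypothetical sketch; the table and column names are made up and it assumes enough free disk space). Each INSERT ... SELECT doubles the row count, so repeating it until the data file reaches 2-3x RAM sets up the test:

CREATE TABLE repro (k VARBINARY(20)) ENGINE=MyISAM;
-- seed a couple of random 20-byte values
INSERT INTO repro VALUES (UNHEX(SHA1(RAND()))), (UNHEX(SHA1(RAND())));
-- repeat this until SHOW TABLE STATUS reports Data_length of 2-3x RAM
INSERT INTO repro SELECT UNHEX(SHA1(RAND())) FROM repro;
-- a plain index is built by sort and finishes
ALTER TABLE repro ADD INDEX (k);
-- a UNIQUE index goes through the key cache one key at a time and crawls
ALTER TABLE repro ADD UNIQUE (k);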
On 10.02.2014 13:45, Pierre wrote:
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
he reason are the permamently updated keys in general for large inserts on a new table
UNIQUE KEY `p1` (`p1`), UNIQUE KEY `p2` (`p2`),
* remove keyes * insert data * add kyes
Hi,

This is not a bug, but how B-tree indexes work. For them to be efficient they must fit in RAM. There are buffering mechanisms that can be used for secondary indexes in some cases, because the write can be done without a read, but ONLY when the index is not unique. If it is unique, then an index dive is necessary, and a B-tree traversal can take multiple random IOs to get to the leaf node.

Faster IO can help, so you can look into flash storage.

Consider fronting your lookups with memcache (a hash index) or a bloom filter (there are many implementations on the net) to reduce lookups.

--Justin

On Wed, Feb 12, 2014 at 12:04 AM, Pierre <pierz@hotmail.it> wrote:
Using this technique I have the same Issue. It's now running for severals hours, I'm at a 40% and looking at show full processlist, it's getting slower and slower. It will never finish.
I think there is a bug here.
Firstly, regardly the memory usage, It doesn't correctly use the buffer I did set, I think it's only using the key_buffer_size. myisam_sort_buffer_size or bulk_insert_buffer_size are not used on this task.
So what's happening ? When the RAM is full, mariadb is doing random access on the disk to sort and filter Go of data ! That's why my cpu was only used a few percent, the process was in i/o wait most of the time.
So what I am saying here is : mariadb can't crate UNIQUE keys/index if the rows doesn't fit in RAM.
However if I try to create a standard index (ie non unique), it works well and it's done in less than an hour.
ALTER IGNORE TABLE mytable ADD INDEX (c1), ADD INDEX(c2); Query OK, 349086532 rows affected (44 min 25.21 sec) Records: 349086532 Duplicates: 0 Warnings: 0
In this second usage case, Maria is doing a good work by using the myisam_sort_buffer_size. I think it's doing something like an external/merge sort, spliting the rows in part that fit in RAM, sorting them, merging them and creating index.
It was 100% cpu most of the time, when It was not it was because mysql was loading the rows in RAM from hard disk (and not doing a random access on the hard disk like in create unique index). So why UNIQUE index is not behaving the same way ?
It's easy to reproduce the bug, just create a binary file of 2 or 3 x size of RAM, then load data infile and try to create a UNIQUE index on it. It will never end.
On 10.02.2014 13:45, Pierre wrote:
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
he reason are the permamently updated keys in general for large inserts on a new table
UNIQUE KEY `p1` (`p1`), UNIQUE KEY `p2` (`p2`),
* remove keyes * insert data * add kyes
OK, I understand, but this doesn't resolve the initialization problem; I'm sure there is a bug, or at least something that could be improved a lot. I can't use the UNIQUE constraint when I have too many keys to fit in RAM.

Because even if I have memcache or a bloom filter in front, I still need to create my initial dataset with the unique constraint, and even if my initial dataset is unique, I can't insert it into my table because of the bug explained before.

If MySQL can do the sort so quickly (create index), it should be trivial to reuse this sorted data, eliminate the duplicates and create the UNIQUE index.
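A sketch of that idea using only the pieces already shown in this thread (this verifies that the data is duplicate-free rather than removing duplicates; whether that is enough depends on how later inserts are handled):

-- 1. build plain (non-unique) indexes the fast way - the 44-minute ALTER above
ALTER TABLE wallets ADD INDEX (p1), ADD INDEX (p2);
-- 2. use them to check for duplicates on each key column
SELECT p1, COUNT(*) AS n FROM wallets GROUP BY p1 HAVING n > 1 LIMIT 10;
SELECT p2, COUNT(*) AS n FROM wallets GROUP BY p2 HAVING n > 1 LIMIT 10;
-- 3. if both queries return nothing, the dataset already satisfies the constraint

On 12/02/2014 09:24, Justin Swanhart wrote: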
Hi,
This is not a bug, but how b tree indexes work. For them to be efficient they must fit in ram. There are buffering mechanisms that can be used for secondary indexes in some cases, because the write can be done without a read, but ONLY when the index is not unique. It if it unique, then the index dive is necessary and a btree traversal can take multiple random IO to get to the leaf node.
Faster IO can help, so you can look into flash storage.
Consider fronting your lookups with memcache (a hash index) or a bloom filter (there are many implementations on the net) to reduce lookups.
--Justin
On 12/02/2014 09:04, Pierre wrote:
Using this technique I have the same Issue. It's now running for severals hours, I'm at a 40% and looking at show full processlist, it's getting slower and slower. It will never finish.
I think there is a bug here.
Firstly, regardly the memory usage, It doesn't correctly use the buffer I did set, I think it's only using the key_buffer_size. myisam_sort_buffer_size or bulk_insert_buffer_size are not used on this task.
So what's happening ? When the RAM is full, mariadb is doing random access on the disk to sort and filter Go of data ! That's why my cpu was only used a few percent, the process was in i/o wait most of the time.
So what I am saying here is : mariadb can't crate UNIQUE keys/index if the rows doesn't fit in RAM.
However if I try to create a standard index (ie non unique), it works well and it's done in less than an hour.
ALTER IGNORE TABLE mytable ADD INDEX (c1), ADD INDEX(c2); Query OK, 349086532 rows affected (44 min 25.21 sec) Records: 349086532 Duplicates: 0 Warnings: 0
In this second usage case, Maria is doing a good work by using the myisam_sort_buffer_size. I think it's doing something like an external/merge sort, spliting the rows in part that fit in RAM, sorting them, merging them and creating index.
It was 100% cpu most of the time, when It was not it was because mysql was loading the rows in RAM from hard disk (and not doing a random access on the hard disk like in create unique index). So why UNIQUE index is not behaving the same way ?
It's easy to reproduce the bug, just create a binary file of 2 or 3 x size of RAM, then load data infile and try to create a UNIQUE index on it. It will never end.
On 10.02.2014 13:45, Pierre wrote:
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
he reason are the permamently updated keys in general for large inserts on a new table
UNIQUE KEY `p1` (`p1`), UNIQUE KEY `p2` (`p2`),
* remove keyes * insert data * add kyes
Load the data THEN add the index. This will do the unique check once instead of on every insertion.

On a bloom filter miss, select from the table and insert if the value isn't there. If this is multithreaded, use InnoDB or TokuDB and SELECT ... FOR UPDATE to prevent races.

Sent from my iPhone
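A rough sketch of that pattern (assuming an InnoDB copy of the table; @candidate, @p2 and @data are placeholders for the application's values, and the transaction plus the locking read is what serializes concurrent writers of the same value):

START TRANSACTION;
-- locking read: a concurrent transaction working on the same value waits here
SELECT p1 FROM wallets WHERE p1 = @candidate FOR UPDATE;
-- if the SELECT returned no row, the value is new and can be inserted
INSERT INTO wallets (p1, p2, data) VALUES (@candidate, @p2, @data);
COMMIT;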
On Feb 12, 2014, at 12:49 AM, Pierre <pierz@hotmail.it> wrote:
Ok I understand, but this doesn't resolve the initialization problem, I'm sure there is a bug, or something which can be improved a lot. I can't use the UNIQUE constraint when I have to much key which doesn't fit in RAM.
Because even If have memcache/bloom filter in front, I still need to create my initial dataset with the unique constraint,and even if my initial dataset is unique, I can't insert them in my table because of the bug explained before.
If mysql can do a sort so quickly (create index), it should be trivial to reuse this sorted data, eliminate the duplicate and create the UNIQUE index.
On 12/02/2014 09:24, Justin Swanhart wrote:
Hi,
This is not a bug, but how b tree indexes work. For them to be efficient they must fit in ram. There are buffering mechanisms that can be used for secondary indexes in some cases, because the write can be done without a read, but ONLY when the index is not unique. It if it unique, then the index dive is necessary and a btree traversal can take multiple random IO to get to the leaf node.
Faster IO can help, so you can look into flash storage.
Consider fronting your lookups with memcache (a hash index) or a bloom filter (there are many implementations on the net) to reduce lookups.
--Justin
On 12/02/2014 09:04, Pierre wrote:
Using this technique I have the same Issue. It's now running for severals hours, I'm at a 40% and looking at show full processlist, it's getting slower and slower. It will never finish.
I think there is a bug here.
Firstly, regardly the memory usage, It doesn't correctly use the buffer I did set, I think it's only using the key_buffer_size. myisam_sort_buffer_size or bulk_insert_buffer_size are not used on this task.
So what's happening ? When the RAM is full, mariadb is doing random access on the disk to sort and filter Go of data ! That's why my cpu was only used a few percent, the process was in i/o wait most of the time.
So what I am saying here is : mariadb can't crate UNIQUE keys/index if the rows doesn't fit in RAM.
However if I try to create a standard index (ie non unique), it works well and it's done in less than an hour.
ALTER IGNORE TABLE mytable ADD INDEX (c1), ADD INDEX(c2); Query OK, 349086532 rows affected (44 min 25.21 sec) Records: 349086532 Duplicates: 0 Warnings: 0
In this second usage case, Maria is doing a good work by using the myisam_sort_buffer_size. I think it's doing something like an external/merge sort, spliting the rows in part that fit in RAM, sorting them, merging them and creating index.
It was 100% cpu most of the time, when It was not it was because mysql was loading the rows in RAM from hard disk (and not doing a random access on the hard disk like in create unique index). So why UNIQUE index is not behaving the same way ?
It's easy to reproduce the bug, just create a binary file of 2 or 3 x size of RAM, then load data infile and try to create a UNIQUE index on it. It will never end.
On 10.02.2014 13:45, Pierre wrote:
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
he reason are the permamently updated keys in general for large inserts on a new table
UNIQUE KEY `p1` (`p1`), UNIQUE KEY `p2` (`p2`),
* remove keyes * insert data * add kyes
You misunderstand, I already did this. Read the thread from the beginning:

https://lists.launchpad.net/maria-discuss/msg01338.html <= load data into an empty table with the indexes in place
https://lists.launchpad.net/maria-discuss/msg01361.html <= load the data into an empty table THEN add the indexes (what you are suggesting)

On 12/02/2014 11:03, Justin Swanhart wrote:
Load the data THEN add the index. This will do the unique check once instead of on every insertion.
On bloom filter miss, select from the table and insert if it isn't there. If this is multithreaded use innodb or tokudb and select .... for update, to prevent race.
Sent from my iPhone
On Feb 12, 2014, at 12:49 AM, Pierre <pierz@hotmail.it> wrote:
Ok I understand, but this doesn't resolve the initialization problem, I'm sure there is a bug, or something which can be improved a lot. I can't use the UNIQUE constraint when I have to much key which doesn't fit in RAM.
Because even If have memcache/bloom filter in front, I still need to create my initial dataset with the unique constraint,and even if my initial dataset is unique, I can't insert them in my table because of the bug explained before.
If mysql can do a sort so quickly (create index), it should be trivial to reuse this sorted data, eliminate the duplicate and create the UNIQUE index.
On 12/02/2014 09:24, Justin Swanhart wrote:
Hi,
This is not a bug, but how b tree indexes work. For them to be efficient they must fit in ram. There are buffering mechanisms that can be used for secondary indexes in some cases, because the write can be done without a read, but ONLY when the index is not unique. It if it unique, then the index dive is necessary and a btree traversal can take multiple random IO to get to the leaf node.
Faster IO can help, so you can look into flash storage.
Consider fronting your lookups with memcache (a hash index) or a bloom filter (there are many implementations on the net) to reduce lookups.
--Justin
On 12/02/2014 09:04, Pierre wrote:
Using this technique I have the same Issue. It's now running for severals hours, I'm at a 40% and looking at show full processlist, it's getting slower and slower. It will never finish.
I think there is a bug here.
Firstly, regardly the memory usage, It doesn't correctly use the buffer I did set, I think it's only using the key_buffer_size. myisam_sort_buffer_size or bulk_insert_buffer_size are not used on this task.
So what's happening ? When the RAM is full, mariadb is doing random access on the disk to sort and filter Go of data ! That's why my cpu was only used a few percent, the process was in i/o wait most of the time.
So what I am saying here is : mariadb can't crate UNIQUE keys/index if the rows doesn't fit in RAM.
However if I try to create a standard index (ie non unique), it works well and it's done in less than an hour.
ALTER IGNORE TABLE mytable ADD INDEX (c1), ADD INDEX(c2); Query OK, 349086532 rows affected (44 min 25.21 sec) Records: 349086532 Duplicates: 0 Warnings: 0
In this second usage case, Maria is doing a good work by using the myisam_sort_buffer_size. I think it's doing something like an external/merge sort, spliting the rows in part that fit in RAM, sorting them, merging them and creating index.
It was 100% cpu most of the time, when It was not it was because mysql was loading the rows in RAM from hard disk (and not doing a random access on the hard disk like in create unique index). So why UNIQUE index is not behaving the same way ?
It's easy to reproduce the bug, just create a binary file of 2 or 3 x size of RAM, then load data infile and try to create a UNIQUE index on it. It will never end.
On 10.02.2014 13:45, Pierre wrote:
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
he reason are the permamently updated keys in general for large inserts on a new table
UNIQUE KEY `p1` (`p1`), UNIQUE KEY `p2` (`p2`),
* remove keyes * insert data * add kyes
Your keys *must* fit in RAM, period. Your key buffers must be large enough.
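To put numbers on "large enough", the index size can be read from the table status and compared with the key buffer (a sketch; the 12G below is only an example value and has to fit in physical RAM):

SHOW TABLE STATUS LIKE 'wallets';               -- Index_length = size of the .MYI index
SHOW GLOBAL VARIABLES LIKE 'key_buffer_size';
-- if Index_length is larger than key_buffer_size, grow the buffer:
SET GLOBAL key_buffer_size = 12 * 1024 * 1024 * 1024;

On 12.02.2014 11:10, Pierre wrote: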
You don't undestand, I already did this. Read the thread since the beginning :
https://lists.launchpad.net/maria-discuss/msg01338.html <= load data in empty table with index https://lists.launchpad.net/maria-discuss/msg01361.html <= Load the data in empty table THEN add the index (what you are suggesting).
On 12/02/2014 11:03, Justin Swanhart wrote:
Load the data THEN add the index. This will do the unique check once instead of on every insertion.
On bloom filter miss, select from the table and insert if it isn't there. If this is multithreaded use innodb or tokudb and select .... for update, to prevent race.
Sent from my iPhone
On Feb 12, 2014, at 12:49 AM, Pierre <pierz@hotmail.it> wrote:
Ok I understand, but this doesn't resolve the initialization problem, I'm sure there is a bug, or something which can be improved a lot. I can't use the UNIQUE constraint when I have to much key which doesn't fit in RAM.
Because even If have memcache/bloom filter in front, I still need to create my initial dataset with the unique constraint,and even if my initial dataset is unique, I can't insert them in my table because of the bug explained before.
If mysql can do a sort so quickly (create index), it should be trivial to reuse this sorted data, eliminate the duplicate and create the UNIQUE index.
On 12/02/2014 09:24, Justin Swanhart wrote:
Hi,
This is not a bug, but how b tree indexes work. For them to be efficient they must fit in ram. There are buffering mechanisms that can be used for secondary indexes in some cases, because the write can be done without a read, but ONLY when the index is not unique. It if it unique, then the index dive is necessary and a btree traversal can take multiple random IO to get to the leaf node.
Faster IO can help, so you can look into flash storage.
Consider fronting your lookups with memcache (a hash index) or a bloom filter (there are many implementations on the net) to reduce lookups.
--Justin
On 12/02/2014 09:04, Pierre wrote:
Using this technique I have the same Issue. It's now running for severals hours, I'm at a 40% and looking at show full processlist, it's getting slower and slower. It will never finish.
I think there is a bug here.
Firstly, regardly the memory usage, It doesn't correctly use the buffer I did set, I think it's only using the key_buffer_size. myisam_sort_buffer_size or bulk_insert_buffer_size are not used on this task.
So what's happening ? When the RAM is full, mariadb is doing random access on the disk to sort and filter Go of data ! That's why my cpu was only used a few percent, the process was in i/o wait most of the time.
So what I am saying here is : mariadb can't crate UNIQUE keys/index if the rows doesn't fit in RAM.
However if I try to create a standard index (ie non unique), it works well and it's done in less than an hour.
ALTER IGNORE TABLE mytable ADD INDEX (c1), ADD INDEX(c2); Query OK, 349086532 rows affected (44 min 25.21 sec) Records: 349086532 Duplicates: 0 Warnings: 0
In this second usage case, Maria is doing a good work by using the myisam_sort_buffer_size. I think it's doing something like an external/merge sort, spliting the rows in part that fit in RAM, sorting them, merging them and creating index.
It was 100% cpu most of the time, when It was not it was because mysql was loading the rows in RAM from hard disk (and not doing a random access on the hard disk like in create unique index). So why UNIQUE index is not behaving the same way ?
It's easy to reproduce the bug, just create a binary file of 2 or 3 x size of RAM, then load data infile and try to create a UNIQUE index on it. It will never end.
On 10.02.2014 13:45, Pierre wrote:
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
he reason are the permamently updated keys in general for large inserts on a new table
UNIQUE KEY `p1` (`p1`), UNIQUE KEY `p2` (`p2`),
* remove keyes * insert data * add kyes
Yes, that's what I'm saying: you can't create a UNIQUE index on a table whose keys don't fit in RAM / the key buffer. (If it's a basic non-unique INDEX, it's not a problem and this doesn't apply.) If it's not a bug but a feature, then it should be documented.

On 12/02/2014 11:15, Reindl Harald wrote:
your keys *must* fit in the RAM, period your key-buffers must be large enough
Am 12.02.2014 11:10, schrieb Pierre:
You don't undestand, I already did this. Read the thread since the beginning :
https://lists.launchpad.net/maria-discuss/msg01338.html <= load data in empty table with index https://lists.launchpad.net/maria-discuss/msg01361.html <= Load the data in empty table THEN add the index (what you are suggesting).
On 12/02/2014 11:03, Justin Swanhart wrote:
Load the data THEN add the index. This will do the unique check once instead of on every insertion.
On bloom filter miss, select from the table and insert if it isn't there. If this is multithreaded use innodb or tokudb and select .... for update, to prevent race.
Sent from my iPhone
On Feb 12, 2014, at 12:49 AM, Pierre <pierz@hotmail.it> wrote:
Ok I understand, but this doesn't resolve the initialization problem, I'm sure there is a bug, or something which can be improved a lot. I can't use the UNIQUE constraint when I have to much key which doesn't fit in RAM.
Because even If have memcache/bloom filter in front, I still need to create my initial dataset with the unique constraint,and even if my initial dataset is unique, I can't insert them in my table because of the bug explained before.
If mysql can do a sort so quickly (create index), it should be trivial to reuse this sorted data, eliminate the duplicate and create the UNIQUE index.
On 12/02/2014 09:24, Justin Swanhart wrote:
Hi,
This is not a bug, but how b tree indexes work. For them to be efficient they must fit in ram. There are buffering mechanisms that can be used for secondary indexes in some cases, because the write can be done without a read, but ONLY when the index is not unique. It if it unique, then the index dive is necessary and a btree traversal can take multiple random IO to get to the leaf node.
Faster IO can help, so you can look into flash storage.
Consider fronting your lookups with memcache (a hash index) or a bloom filter (there are many implementations on the net) to reduce lookups.
--Justin
On 12/02/2014 09:04, Pierre wrote:
Using this technique I have the same Issue. It's now running for severals hours, I'm at a 40% and looking at show full processlist, it's getting slower and slower. It will never finish.
I think there is a bug here.
Firstly, regardly the memory usage, It doesn't correctly use the buffer I did set, I think it's only using the key_buffer_size. myisam_sort_buffer_size or bulk_insert_buffer_size are not used on this task.
So what's happening ? When the RAM is full, mariadb is doing random access on the disk to sort and filter Go of data ! That's why my cpu was only used a few percent, the process was in i/o wait most of the time.
So what I am saying here is : mariadb can't crate UNIQUE keys/index if the rows doesn't fit in RAM.
However if I try to create a standard index (ie non unique), it works well and it's done in less than an hour.
ALTER IGNORE TABLE mytable ADD INDEX (c1), ADD INDEX(c2); Query OK, 349086532 rows affected (44 min 25.21 sec) Records: 349086532 Duplicates: 0 Warnings: 0
In this second usage case, Maria is doing a good work by using the myisam_sort_buffer_size. I think it's doing something like an external/merge sort, spliting the rows in part that fit in RAM, sorting them, merging them and creating index.
It was 100% cpu most of the time, when It was not it was because mysql was loading the rows in RAM from hard disk (and not doing a random access on the hard disk like in create unique index). So why UNIQUE index is not behaving the same way ?
It's easy to reproduce the bug, just create a binary file of 2 or 3 x size of RAM, then load data infile and try to create a UNIQUE index on it. It will never end.
On 10.02.2014 13:45, Pierre wrote:
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
he reason are the permamently updated keys in general for large inserts on a new table
UNIQUE KEY `p1` (`p1`), UNIQUE KEY `p2` (`p2`),
* remove keyes * insert data * add kyes
What data type are the constraints over? A unique index over a raw md5 is much better than one over a varchar(255) or even a varchar(32).

Sent from my iPhone
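For comparison, the difference being pointed at (a sketch; `t_hex` and `t_bin` are made-up example tables - the wallets table in this thread already uses fixed-width varbinary(20) keys):

-- 40-character hex string: 40+ bytes per key entry, charset/collation rules apply
CREATE TABLE t_hex (hash_hex VARCHAR(40) NOT NULL, UNIQUE KEY (hash_hex)) ENGINE=MyISAM;
-- raw 20-byte digest: half the key size, plain byte-by-byte comparison
CREATE TABLE t_bin (hash_bin BINARY(20) NOT NULL, UNIQUE KEY (hash_bin)) ENGINE=MyISAM;
INSERT INTO t_bin VALUES (UNHEX(SHA1('example')));   -- store the raw bytes
SELECT HEX(hash_bin) FROM t_bin;                     -- display as hex when needed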
On Feb 12, 2014, at 12:49 AM, Pierre <pierz@hotmail.it> wrote:
Ok I understand, but this doesn't resolve the initialization problem, I'm sure there is a bug, or something which can be improved a lot. I can't use the UNIQUE constraint when I have to much key which doesn't fit in RAM.
Because even If have memcache/bloom filter in front, I still need to create my initial dataset with the unique constraint,and even if my initial dataset is unique, I can't insert them in my table because of the bug explained before.
If mysql can do a sort so quickly (create index), it should be trivial to reuse this sorted data, eliminate the duplicate and create the UNIQUE index.
On 12/02/2014 09:24, Justin Swanhart wrote:
Hi,
This is not a bug, but how b tree indexes work. For them to be efficient they must fit in ram. There are buffering mechanisms that can be used for secondary indexes in some cases, because the write can be done without a read, but ONLY when the index is not unique. It if it unique, then the index dive is necessary and a btree traversal can take multiple random IO to get to the leaf node.
Faster IO can help, so you can look into flash storage.
Consider fronting your lookups with memcache (a hash index) or a bloom filter (there are many implementations on the net) to reduce lookups.
--Justin
On 12/02/2014 09:04, Pierre wrote:
Using this technique I have the same Issue. It's now running for severals hours, I'm at a 40% and looking at show full processlist, it's getting slower and slower. It will never finish.
I think there is a bug here.
Firstly, regardly the memory usage, It doesn't correctly use the buffer I did set, I think it's only using the key_buffer_size. myisam_sort_buffer_size or bulk_insert_buffer_size are not used on this task.
So what's happening ? When the RAM is full, mariadb is doing random access on the disk to sort and filter Go of data ! That's why my cpu was only used a few percent, the process was in i/o wait most of the time.
So what I am saying here is : mariadb can't crate UNIQUE keys/index if the rows doesn't fit in RAM.
However if I try to create a standard index (ie non unique), it works well and it's done in less than an hour.
ALTER IGNORE TABLE mytable ADD INDEX (c1), ADD INDEX(c2); Query OK, 349086532 rows affected (44 min 25.21 sec) Records: 349086532 Duplicates: 0 Warnings: 0
In this second usage case, Maria is doing a good work by using the myisam_sort_buffer_size. I think it's doing something like an external/merge sort, spliting the rows in part that fit in RAM, sorting them, merging them and creating index.
It was 100% cpu most of the time, when It was not it was because mysql was loading the rows in RAM from hard disk (and not doing a random access on the hard disk like in create unique index). So why UNIQUE index is not behaving the same way ?
It's easy to reproduce the bug, just create a binary file of 2 or 3 x size of RAM, then load data infile and try to create a UNIQUE index on it. It will never end.
On 10.02.2014 13:45, Pierre wrote:
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
he reason are the permamently updated keys in general for large inserts on a new table
UNIQUE KEY `p1` (`p1`), UNIQUE KEY `p2` (`p2`),
* remove keyes * insert data * add kyes
Hi Pierre,

there are quite a few MySQL bug reports in the MySQL bug database that can affect your use case, like:
http://bugs.mysql.com/bug.php?id=5731
http://bugs.mysql.com/bug.php?id=29446
http://bugs.mysql.com/bug.php?id=59925
http://bugs.mysql.com/bug.php?id=62570
http://bugs.mysql.com/bug.php?id=62827
http://bugs.mysql.com/bug.php?id=45702

BUG#62827 is probably the most interesting. Even though some of them are fixed, it is very likely that they're not fixed completely. I seem to remember 32-bit variables on both the key cache and the "repair by sort" code paths.

Could you check the actual values of myisam_sort_buffer_size and key_buffer_size, just to make sure they weren't cut down at startup, at least?

I'd suggest reporting a bug in Jira: https://mariadb.atlassian.net - I believe it is something worth checking at least.

Regards,
Sergey
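Checking those values is a one-liner (a sketch):

SHOW GLOBAL VARIABLES
 WHERE Variable_name IN ('key_buffer_size', 'myisam_sort_buffer_size',
                         'bulk_insert_buffer_size', 'myisam_max_sort_file_size');

On Mon, Feb 10, 2014 at 01:45:34PM +0100, Pierre wrote: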
Hello,
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
Here are the time for each load data infile :
LOAD DATA INFILE '/tmp/hash/xaa' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xab' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xac' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xad' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xae' : 3min 49sec LOAD DATA INFILE '/tmp/hash/xaf' : 5min 59sec LOAD DATA INFILE '/tmp/hash/xag' : 10min 50sec LOAD DATA INFILE '/tmp/hash/xah' : 20min 7sec LOAD DATA INFILE '/tmp/hash/xai' : 47min 7sec LOAD DATA INFILE '/tmp/hash/xaj' : 4 hours 1 min 9.34 sec ( ouch !) LOAD DATA INFILE '/tmp/hash/xak' : still running...
Now the task is running but mariadb is only using 2% cpu and read to the disk at 2500 Kb/sec (which is slow regarding the disk performance). There is a lot of disk space. If insertion speed continue to slow down at this rate, it will take weeks to insert the 3 remainings files ! It's only 166M rows so I think it's not normal. If I disable the "unique" constraint insertion speed is great but I need to be sure there is no duplicate and I also need an index.
Can someone explain me what is happening internaly and why it is so slow ? Do you think if I just create an index (not a unique index) it will be faster ? The problem is I specially delegated this task of unique checking to mysql. If no solution, do you know a database or a key/value store with better for performance for this use case ? (create an index and remove duplicate ?) I can rearrange my table structure (at a cost of more disk space usage) to match a key/value structure.
Here is my current table structure : CREATE TABLE `wallets` ( `p1` varbinary(20) DEFAULT NULL, `p2` varbinary(20) DEFAULT NULL, `data` varbinary(32) DEFAULT NULL, UNIQUE KEY `p1` (`p1`), UNIQUE KEY `p2` (`p2`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=FIXED;
Computer specs: 16go RAM and 4x3.1 ghz, 2x1To (with 90% unused disk space)
The configuration variables (I tweaked specially for this insertion job) :
key_buffer_size = 6G # MyISAM: used with insert to an empty table bulk_insert_buffer_size = 6G # MyISAM/Aria: insert to a non empty table; default 8M myisam_sort_buffer_size = 6G # MyISAM: used for sort but only when : "CREATE INDEX", "ALTER TABLE" or "REPAIR TABLE" read_buffer_size = 6G # MyISAM/Aria/MERGE: allocated for each table scan, for order by, nested query caching, bulk insert into partitions sort_buffer_size = 4M # ALL: allocated each time a session need to do a sort myisam_max_sort_file_size = 128G # MyISAM: tmp file max size
I'm using mariadb 5.5.31.
Best regards,
Yep, myisam_sort_buffer_size and key_buffer_size were both set to 6G and SHOW VARIABLES confirmed it. Moreover, when I created the non-unique indexes both buffers were used (mysqld was using 12G of RAM); however, when I created the UNIQUE indexes only the key buffer was used (mysqld was using 6G of RAM). I'll file a bug report.

On 12/02/2014 13:23, Sergey Vojtovich wrote:
Hi Pierre,
there are quite a few MySQL bug reports in MySQL bug database that can affect your use case. Like: http://bugs.mysql.com/bug.php?id=5731 http://bugs.mysql.com/bug.php?id=29446 http://bugs.mysql.com/bug.php?id=59925 http://bugs.mysql.com/bug.php?id=62570 http://bugs.mysql.com/bug.php?id=62827 http://bugs.mysql.com/bug.php?id=45702
BUG#62827 is probably the most interesting. Even though some of them are fixed, it is very likely that they're not fixed completely. I tend to remember 32-bit variables on both key cache and "repair by sort" ways.
Could you check actual value of myisam_sort_buffer_size and key_buffer_size just to make sure it wasn't cut at startup at least?
I'd suggest to report a bug in jira: https://mariadb.atlassian.net I believe it is somthing worth checking at least.
Regards, Sergey
On Mon, Feb 10, 2014 at 01:45:34PM +0100, Pierre wrote:
Hello,
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
Here are the time for each load data infile :
LOAD DATA INFILE '/tmp/hash/xaa' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xab' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xac' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xad' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xae' : 3min 49sec LOAD DATA INFILE '/tmp/hash/xaf' : 5min 59sec LOAD DATA INFILE '/tmp/hash/xag' : 10min 50sec LOAD DATA INFILE '/tmp/hash/xah' : 20min 7sec LOAD DATA INFILE '/tmp/hash/xai' : 47min 7sec LOAD DATA INFILE '/tmp/hash/xaj' : 4 hours 1 min 9.34 sec ( ouch !) LOAD DATA INFILE '/tmp/hash/xak' : still running...
Now the task is running but mariadb is only using 2% cpu and read to the disk at 2500 Kb/sec (which is slow regarding the disk performance). There is a lot of disk space. If insertion speed continue to slow down at this rate, it will take weeks to insert the 3 remainings files ! It's only 166M rows so I think it's not normal. If I disable the "unique" constraint insertion speed is great but I need to be sure there is no duplicate and I also need an index.
Can someone explain me what is happening internaly and why it is so slow ? Do you think if I just create an index (not a unique index) it will be faster ? The problem is I specially delegated this task of unique checking to mysql. If no solution, do you know a database or a key/value store with better for performance for this use case ? (create an index and remove duplicate ?) I can rearrange my table structure (at a cost of more disk space usage) to match a key/value structure.
Here is my current table structure : CREATE TABLE `wallets` ( `p1` varbinary(20) DEFAULT NULL, `p2` varbinary(20) DEFAULT NULL, `data` varbinary(32) DEFAULT NULL, UNIQUE KEY `p1` (`p1`), UNIQUE KEY `p2` (`p2`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=FIXED;
Computer specs: 16go RAM and 4x3.1 ghz, 2x1To (with 90% unused disk space)
The configuration variables (I tweaked specially for this insertion job) :
key_buffer_size = 6G # MyISAM: used with insert to an empty table bulk_insert_buffer_size = 6G # MyISAM/Aria: insert to a non empty table; default 8M myisam_sort_buffer_size = 6G # MyISAM: used for sort but only when : "CREATE INDEX", "ALTER TABLE" or "REPAIR TABLE" read_buffer_size = 6G # MyISAM/Aria/MERGE: allocated for each table scan, for order by, nested query caching, bulk insert into partitions sort_buffer_size = 4M # ALL: allocated each time a session need to do a sort myisam_max_sort_file_size = 128G # MyISAM: tmp file max size
I'm using mariadb 5.5.31.
Best regards,
Hi, Pierre! Okay, there were quite a few replies already here. I'll chime in, but only to confirm that there's no easy solution, unfortunately :( On Feb 10, Pierre wrote:
Hello,
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
Here are the time for each load data infile :
LOAD DATA INFILE '/tmp/hash/xaa' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xab' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xac' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xad' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xae' : 3min 49sec LOAD DATA INFILE '/tmp/hash/xaf' : 5min 59sec LOAD DATA INFILE '/tmp/hash/xag' : 10min 50sec LOAD DATA INFILE '/tmp/hash/xah' : 20min 7sec LOAD DATA INFILE '/tmp/hash/xai' : 47min 7sec LOAD DATA INFILE '/tmp/hash/xaj' : 4 hours 1 min 9.34 sec ( ouch !) LOAD DATA INFILE '/tmp/hash/xak' : still running...
Can someone explain me what is happening internaly and why it is so slow ?
Yes. That's because indexes are implemented as B-trees, and a B-tree has an insertion cost of O(log N). As you insert more data, insertions become slower. Furthermore, a B-tree organizes data in pages. MariaDB tries to keep all pages in memory, but as soon as the index becomes larger than the key cache size, every key insertion means at least one disk read and one write (assuming your key values are distributed more or less randomly).
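A rough order-of-magnitude estimate of what that costs on spinning disks (my own back-of-envelope, assuming ~100-150 random IOs per second):

  ~166.5M rows x 2 unique keys      ≈ 333M key insertions
  once the key cache overflows      ≈ 2+ random IOs per insertion
  even 100M uncached insertions x 2 ≈ 200M IOs / ~100-150 IOs per second ≈ 2-3 weeks

which matches the "it will take weeks" observation at the start of the thread.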
Do you think if I just create an index (not a unique index) it will be faster ?
Not really. Inserting values into a B-tree one by one will always be slow once your B-tree becomes big enough.

But you don't need to insert values into a B-tree one by one. MyISAM has a much faster (like, two orders of magnitude) way of building B-trees by creating the tree all at once. This is used when you insert data into an empty table, or when you add an index to an existing table, or when you enable indexes. That is:

1. LOAD DATA INFILE 'everything, not in chunks' (a convenient way would be to load from a pipe, and cat all your chunks into it).
2. ALTER TABLE ... ADD KEY
3. ALTER TABLE ... DISABLE KEYS, insert the data, ALTER TABLE ... ENABLE KEYS

But unfortunately, all this only works for non-unique keys, where MariaDB doesn't need to expect a unique constraint error and won't need to delete rows while creating indexes. For unique indexes, values will be inserted into a B-tree one by one.

Now, there's another trick you can use. Create the table. Then disable all indexes - not with ALTER TABLE ... DISABLE KEYS (which only disables non-unique indexes) - but from the command line with myisamchk:

$ myisamchk --keys-used=0 /path/to/table.MYI

Then insert your data, and run REPAIR TABLE or REPAIR TABLE ... QUICK. This will rebuild all indexes the fast way. The second variant (with QUICK) will not copy your (presumably huge) data - it will only rebuild the indexes. But it cannot remove duplicates either.

And it's also worth noting that MariaDB 10.0 rebuilds both UNIQUE and non-UNIQUE indexes the fast way when you do ALTER TABLE ... ADD KEY.

Regards,
Sergei
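Putting the myisamchk trick together for the table in this thread, the whole load would look roughly like this (a sketch; the .MYI path depends on the datadir, and the table must not be open in the server while myisamchk runs):

FLUSH TABLES;
-- from a shell, while the table is not in use:
--   myisamchk --keys-used=0 /path/to/wallets.MYI
LOAD DATA INFILE '/tmp/hash/xaa' INTO TABLE wallets;
-- ... repeat for the remaining chunk files ...
REPAIR TABLE wallets QUICK;   -- rebuilds all indexes, UNIQUE ones included, by sort
-- (as noted above, this does not remove duplicate key values)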
Hi guys, one doubt... a B-tree has O(log N) insertion; what about using TokuDB?

2014-02-12 12:45 GMT-02:00 Sergei Golubchik <serg@mariadb.org>:
Hi, Pierre!
Okay, there were quite a few replies already here. I'll chime in, but only to confirm that there's no easy solution, unfortunately :(
On Feb 10, Pierre wrote:
Hello,
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
Here are the time for each load data infile :
LOAD DATA INFILE '/tmp/hash/xaa' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xab' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xac' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xad' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xae' : 3min 49sec LOAD DATA INFILE '/tmp/hash/xaf' : 5min 59sec LOAD DATA INFILE '/tmp/hash/xag' : 10min 50sec LOAD DATA INFILE '/tmp/hash/xah' : 20min 7sec LOAD DATA INFILE '/tmp/hash/xai' : 47min 7sec LOAD DATA INFILE '/tmp/hash/xaj' : 4 hours 1 min 9.34 sec ( ouch !) LOAD DATA INFILE '/tmp/hash/xak' : still running...
Can someone explain me what is happening internaly and why it is so slow ?
Yes. That's because indexes are implemented as B-trees, and B-tree has a cost of insertion of O(log N). As you insert more data, insertions become slower. Furthermore, B-tree organizes data in pages. MariaDB tries to keep all pages in memory, but as soon as index becomes larger than keycache size, every key insertion means at least one disk read and one write (assuming your key values are distributed normally and randomly).
Do you think if I just create an index (not a unique index) it will be faster ?
Not really. Inserting values into a B-tree one-by-one will always be slow when your B-tree becomes big enough.
But you don't need to insert values into a B-tree one by one. MyISAM has a much faster (like, two orders of magnitude) way of building b-trees by creating a tree all at once.
This is used when you insert data into the empty table. Or when you add an index to the existing table. Or when you enable indexes. That is:
1. LOAD DATA INFILE 'everything, not in chunks' (a convenient way would be to load from a pipe, and cat all your chunks into it). 2. ALTER TABLE ... ADD KEY 3. ALTER TABLE DISABLE KEYS. insert the data. ALTER TABLE ENABLE KEYS
But unfortunately, all this only works for non-unique keys, when MariaDB doesn't need to expect a unique constraint error and won't need to delete rows when creating indexes. For unique indexes values will be inserted into a b-tree one by one.
Now, there's another trick you can use. Create a table. Then disable all indexes - not with ALTER TABLE ENABLE KEYS (which only disables non-unique indexes) - but from a command-line with myisamchk:
$ myisamchk --keys-used=0 /path/to/table.MYI
then insert your data. And REPAIR TABLE or REPAIR TABLE ... QUICK. This will rebuild all indexes, the fast way. The second variant (with QUICK) will not copy your (presumably, huge) data - it will only rebuild indexes. But it cannot remove duplicates either.
And also it's worth noting that MariaDB-10.0 rebuilds both UNIQUE and non-UNIQUE indexes the fast way when you do ALTER TABLE ... ADD KEY.
Regards, Sergei
-- Roberto Spadim SPAEmpresarial Eng. Automação e Controle
And also it's worth noting that MariaDB-10.0 rebuilds both UNIQUE and non-UNIQUE indexes the fast way when you do ALTER TABLE ... ADD KEY.
Well, this is good news! I'm going to give it a try.

On 12/02/2014 15:45, Sergei Golubchik wrote:
Hi, Pierre!
Okay, there were quite a few replies already here. I'll chime in, but only to confirm that there's no easy solution, unfortunately :(
On Feb 10, Pierre wrote:
Hello,
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
Here are the time for each load data infile :
LOAD DATA INFILE '/tmp/hash/xaa' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xab' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xac' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xad' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xae' : 3min 49sec LOAD DATA INFILE '/tmp/hash/xaf' : 5min 59sec LOAD DATA INFILE '/tmp/hash/xag' : 10min 50sec LOAD DATA INFILE '/tmp/hash/xah' : 20min 7sec LOAD DATA INFILE '/tmp/hash/xai' : 47min 7sec LOAD DATA INFILE '/tmp/hash/xaj' : 4 hours 1 min 9.34 sec ( ouch !) LOAD DATA INFILE '/tmp/hash/xak' : still running...
Can someone explain me what is happening internaly and why it is so slow ?
Yes. That's because indexes are implemented as B-trees, and B-tree has a cost of insertion of O(log N). As you insert more data, insertions become slower. Furthermore, B-tree organizes data in pages. MariaDB tries to keep all pages in memory, but as soon as index becomes larger than keycache size, every key insertion means at least one disk read and one write (assuming your key values are distributed normally and randomly).
Do you think if I just create an index (not a unique index) it will be faster ?
Not really. Inserting values into a B-tree one-by-one will always be slow when your B-tree becomes big enough.
But you don't need to insert values into a B-tree one by one. MyISAM has a much faster (like, two orders of magnitude) way of building b-trees by creating a tree all at once.
This is used when you insert data into the empty table. Or when you add an index to the existing table. Or when you enable indexes. That is:
1. LOAD DATA INFILE 'everything, not in chunks' (a convenient way would be to load from a pipe, and cat all your chunks into it). 2. ALTER TABLE ... ADD KEY 3. ALTER TABLE DISABLE KEYS. insert the data. ALTER TABLE ENABLE KEYS
But unfortunately, all this only works for non-unique keys, when MariaDB doesn't need to expect a unique constraint error and won't need to delete rows when creating indexes. For unique indexes values will be inserted into a b-tree one by one.
Now, there's another trick you can use. Create a table. Then disable all indexes - not with ALTER TABLE ENABLE KEYS (which only disables non-unique indexes) - but from a command-line with myisamchk:
$ myisamchk --keys-used=0 /path/to/table.MYI
then insert your data. And REPAIR TABLE or REPAIR TABLE ... QUICK. This will rebuild all indexes, the fast way. The second variant (with QUICK) will not copy your (presumably, huge) data - it will only rebuild indexes. But it cannot remove duplicates either.
And also it's worth noting that MariaDB-10.0 rebuilds both UNIQUE and non-UNIQUE indexes the fast way when you do ALTER TABLE ... ADD KEY.
Regards, Sergei
Hi Sergei, Pierre, On Wed, Feb 12, 2014 at 03:45:06PM +0100, Sergei Golubchik wrote:
Hi, Pierre!
Okay, there were quite a few replies already here. I'll chime in, but only to confirm that there's no easy solution, unfortunately :(
On Feb 10, Pierre wrote:
Hello,
Mariadb is getting slower and slower when I'm inserting a massive amout of data. I'm trying to insert 166 507 066 rows (12go) using load data infile '' into an empty table. I splitted my file in 13 parts of the same size because it was too long to insert in one shot. When I inserted more than 100M rows, it starts to be extremely slow.
Here are the time for each load data infile :
LOAD DATA INFILE '/tmp/hash/xaa' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xab' : 2min 49sec LOAD DATA INFILE '/tmp/hash/xac' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xad' : 3min 48sec LOAD DATA INFILE '/tmp/hash/xae' : 3min 49sec LOAD DATA INFILE '/tmp/hash/xaf' : 5min 59sec LOAD DATA INFILE '/tmp/hash/xag' : 10min 50sec LOAD DATA INFILE '/tmp/hash/xah' : 20min 7sec LOAD DATA INFILE '/tmp/hash/xai' : 47min 7sec LOAD DATA INFILE '/tmp/hash/xaj' : 4 hours 1 min 9.34 sec ( ouch !) LOAD DATA INFILE '/tmp/hash/xak' : still running...
Can someone explain me what is happening internaly and why it is so slow ?
Yes. That's because indexes are implemented as B-trees, and B-tree has a cost of insertion of O(log N). As you insert more data, insertions become slower. Furthermore, B-tree organizes data in pages. MariaDB tries to keep all pages in memory, but as soon as index becomes larger than keycache size, every key insertion means at least one disk read and one write (assuming your key values are distributed normally and randomly).

Pierre noted that key_buffer_size is 6 GB. It should fit about 95M rows (or ~7.5 parts). According to my calculations the resulting index size will be about 11 GB.
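The arithmetic behind those figures, roughly (my own sketch, assuming ~33 bytes per key entry: 20-byte key plus record pointer plus B-tree overhead):

  166,507,066 rows x 2 unique keys x ~33 bytes ≈ 11 GB of index
  6 GB key buffer / (2 x ~33 bytes)            ≈ 95M rows cached
  95M / 166.5M x 13 parts                      ≈ 7.5 parts before the cache overflows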
Do you think if I just create an index (not a unique index) it will be faster ?
Not really. Inserting values into a B-tree one-by-one will always be slow when your B-tree becomes big enough.
But you don't need to insert values into a B-tree one by one. MyISAM has a much faster (like, two orders of magnitude) way of building b-trees by creating a tree all at once.
This is used when you insert data into the empty table. Or when you add an index to the existing table. Or when you enable indexes. That is:
1. LOAD DATA INFILE 'everything, not in chunks' (a convenient way would be to load from a pipe, and cat all your chunks into it). 2. ALTER TABLE ... ADD KEY 3. ALTER TABLE DISABLE KEYS. insert the data. ALTER TABLE ENABLE KEYS
But unfortunately, all this only works for non-unique keys, when MariaDB doesn't need to expect a unique constraint error and won't need to delete rows when creating indexes. For unique indexes values will be inserted into a b-tree one by one.

Indeed, it goes through key cache. Just curious why is that needed for p.2 - it should never have to delete rows. AFAIR it should just drop temporary table in case of failure. I guess it was fixed this way in 10.0. And p.1 can just truncate data file...

Said the above, I believe just increasing key_buffer_size may greatly improve load performance. Note that performance is starting to go down heavily exactly after the 8-th part.
Now, there's another trick you can use. Create a table. Then disable all indexes - not with ALTER TABLE ENABLE KEYS (which only disables non-unique indexes) - but from a command-line with myisamchk:
$ myisamchk --keys-used=0 /path/to/table.MYI
then insert your data. And REPAIR TABLE or REPAIR TABLE ... QUICK. This will rebuild all indexes, the fast way. The second variant (with QUICK) will not copy your (presumably, huge) data - it will only rebuild indexes. But it cannot remove duplicates either.
And also it's worth noting that MariaDB-10.0 rebuilds both UNIQUE and non-UNIQUE indexes the fast way when you do ALTER TABLE ... ADD KEY.
Thanks, Sergey
Hi, Sergey! On Feb 12, Sergey Vojtovich wrote:
This is used when you insert data into the empty table. Or when you add an index to the existing table. Or when you enable indexes. That is:
1. LOAD DATA INFILE 'everything, not in chunks' (a convenient way would be to load from a pipe, and cat all your chunks into it). 2. ALTER TABLE ... ADD KEY 3. ALTER TABLE DISABLE KEYS. insert the data. ALTER TABLE ENABLE KEYS
But unfortunately, all this only works for non-unique keys, when MariaDB doesn't need to expect a unique constraint error and won't need to delete rows when creating indexes. For unique indexes values will be inserted into a b-tree one by one.
Indeed, it goes through key cache. Just curious why is that needed for p.2 - it should never have to delete rows. AFAIR it should just drop temporary table in case of failure. I guess it was fixed this way in 10.0.
And p.1 can just truncate data file...
For repair-by-sort, ALTER TABLE needs to insert all rows first, and only then rebuild all indexes. With UNIQUE indexes a conflict is possible, and it will need to abort the operation. The old assumption was that it's better to detect a conflict as soon as possible - while rows are being inserted - and abort sooner, as compared to copying everything and only noticing a conflict later, during repair-by-sort. The second approach would end up doing much more work before the operation is aborted.

There is some logic in that reasoning. But still, when a UNIQUE index is added, conflicts are typically rare, and even when we do need to abort the operation, doing more work with repair-by-sort is usually faster than doing less work key by key. Furthermore - what's even more important - most often ALTER TABLE does not add a unique index at all, but performs some unrelated operation on the table. In that case one cannot expect any conflicts in the existing unique indexes.

So, in 10.0 we changed ALTER TABLE to use repair-by-sort for unique indexes too.

Regards,
Sergei
participants (6): Justin Swanhart, Pierre, Reindl Harald, Roberto Spadim, Sergei Golubchik, Sergey Vojtovich