[Maria-developers] MDEV-520: consider parallel replication patch from taobao patches
Hi xiaobin, I read through your patch found here: http://mysql.taobao.org/index.php/RLPR_for_MariaDB I am still a bit confused about the status of this patch. It would be useful if you could explain a bit about the status of the patch: - How have you tested the patch. - How are you currently using the patch? Or is it still just development/test? - What are your current plans for the patch? (You mentioned that you have some changes you are implementing at the moment). Anyway, I think I have a fair understanding of the overall idea, which seems very interesting! I will try to summarise here in a bit more detail, as well as give a few general comments on the patch. Please correct me if I misunderstood anything. ---- The basic idea is to build a set (implemented as a hash table) of every unique key in every row-event in each transaction received from the master. If two transactions have no overlapping keys, they can be applied in parallel (foreign_key_checks is set to OFF to make this work). There is a single thread that processes each event received in sequence. It checks the key set of the new transaction against the key sets of all transactions queued for each of N worker threads: - If there is no conflict with any already queued transaction, the new transaction can be queued to any worker thread (the one with fewest queued transaction is chosen). - If there is a conflict with just one thread, then the transaction can be queued for that thread - it will then execute in correct sequence with the conflicting transactions. - If there is a conflict with multiple threads, wait until the conflicting transactions have been processed. No new transactions are queued while waiting. Then there are N (default 16) threads that apply the queued transactions. Queued transactions are processsed in parallel within one thread, but transaction in different queues are processed in parallel with each other. If a transaction is received which is not using row events (statement-based replication or DDL), then we wait until all queued transactions are processed. Only after that is the new transaction queued, so that such events are processed correctly (but not in parallel, of course). Likewise, if the worker threads get too far behind, we wait with queuing any more transactions. ---- I think this is a very interesting approach, and I am happy that someone have actually tried implementing it. The biggest potential issue seems to be the performance of the thread that checks the transactions for conflicts and queues them for the appropriate thread: - Main issue will be huge transactions, if the size of hash tables of key values become too big for available memory. Maybe this could be solved by having some limit above which transactions are not tried in parallel. - There will be a lot of hash lookups of key values, all done in a single thread. This could become a limit I suppose - testing will be needed to understand this fully. But I think there will be many workloads where this will be useful. Overall, it seems to me this technique is very promising. It seems to be better at getting maximum parallelism on the slave than any of the other approaches that have been considered. As to the patch itself, it seems a bit immature in the present form, but I understood you are still working on it. Let us discuss what the plans could be for finishing it and get it ready for inclusion in MariaDB. The main criteria is that it must be reasonably complete - it must not crash or silently corrupt data in any usage. However, reasonable restrictions for what is supported is fine. (For example, we could say that LOAD DATA is not supported and give an error, but not crash or silently ignore such statement). But we can discuss this in more details. ---- Finally, a couple general comments on the code: 1. get_pk_value() does not seem to handle collations, eg. case insensitivity? It needs to understand that eg. "abab" and "ABAB" can conflict. I think this should be solved by copying in values with non-binary collations using the appropriate strxfrm() variant. 2. remote_query() retries the query in case of deadlock - but that seems wrrong, doesn't it need to retry the whole _transaction_, not just the query? 3. What about other events like User_var_event ? Will they be included in a later patch? 4. There is a my_sleep(10000) in the code when a transaction conflicts with two or more workers. It would be better to replace this with a condition variable so we can wake up as soon as the conflict is resolved. Same for some other sleeps like in wait_for_all_dml_done(). 5. Shouldn't there be some locking in dispatch_transaction()? To protect the hash lookups when checking conflicts against worker threads doing pk_hash_minus() ? 6. get_pk_value(). There seems to be a buffer overflow - st_hash_item::key is fixed size, but there seems no check against overrunning it. One solution is to just truncate if data does not fit - this can lead to false positives where potential parallelism is not exploited, but that is still safe and should be ok. Thanks! - Kristian.
Hi, Kristian Thank you very much for looking into the patch and comments. You have got a deep understanding of the patch. Let's discuss the issues one by one. About Bugs: 1. There indeed is a possible invalid memory access in get_pk_value(). I have changed the definition of st_hash_item::key to char[1024+2*NAME_CHAR_LEN+2], and when building the hash key, if (item->key_len + pack_length >= 1024) break; This can guarantee that, even if the total length of the primary key is bigger than 1024, at least one key_part of the key can be recorded into the hash_key.(As the max length of one key_part is 1000 in MySQL). As you mentioned in last mail, this can lead to false positives where potential parallelism is not exploited, but that is still safe and should be ok. 2. The problem of case insensitivity is a bug. I will modify it in next version. Simply we can test the definition in the table schema, and decide whethere change the string to lower case before adding to pk_hash. 3. There is lock protection in dispatch_transaction(). It is pk_hash_lock in the function "check_pk_exist_key" 4. When deadlock occurs, the whole transaction needs to retry. In current implement, a whole transaction is packed into one "Query", so retrying a query is equal to one transaction. 5. We handle the events including Query_event, Table_map_event, Write_row_event, Update_row_event, Delete_row_event, Xid_event. For all other events, in current implement, I just run it locally (that is in Transfer), but not in the remote_slave. This will lose some information when master use statment-based-binlog. If the master use row-based-binlog, I have not found the serious affects in data consistency. As I told you before, I am changing the patch to make the patched mysqld as a "pure" slave. In the pure slave mode, all other events will be treated like a DDL statement.That means an User_var_event will wait for all the worker queues to be empty, and call ev->apply_event. Is this strategy suitable? Pleast point out if there are potential problems. About some strategies 1. For simplify , I used some sleep(1000000) in the patch, a condition variable will be better, it will be modified in future, but not the highest priority. 2. Huge transactions may lead to the memory usage and cpu load, thank you for pointing it out, I never think of it before. I think we can deal with it in this way: If a transaction contains too many events, such as bigger than a centain number, we can treat it like a DDL statement. Because it will be executed after all the worker queues to be empty, there is no "order problem" here, so we do not need to construct the pk_hash. Please give me some suggestion for this issue. 3. There are N(default as 16) threads for parallel execution. But the No. 0 thread is special. It is used for the transactions whose affecting rows do not have any primary key or unique key, that means all modification on no-primary-rows will be execute serially. About uses and plans I tested it using some simple cases, valgrind used during the cases. The important method for testing is using the productive data. Before a project use it, a necessary step is to build a slave(and a Transfer) in the test environment, data consistency will be checked after running for some days. In my plan, the next modification is to change it to the "pure slave", that is to run every event using the ev->apply_event directly locally. After it is done, the test-case suit can be used for test. I will ping you after I upload the new version of the patch. Thanks Xiaobin ________________________________________ 发件人: Kristian Nielsen [knielsen@knielsen-hq.org] 发送时间: 2012年10月3日 19:26 到: 丁奇; maria-developers@lists.launchpad.net Cc: Sergei Golubchik 主题: MDEV-520: consider parallel replication patch from taobao patches Hi xiaobin, I read through your patch found here: http://mysql.taobao.org/index.php/RLPR_for_MariaDB I am still a bit confused about the status of this patch. It would be useful if you could explain a bit about the status of the patch: - How have you tested the patch. - How are you currently using the patch? Or is it still just development/test? - What are your current plans for the patch? (You mentioned that you have some changes you are implementing at the moment). Anyway, I think I have a fair understanding of the overall idea, which seems very interesting! I will try to summarise here in a bit more detail, as well as give a few general comments on the patch. Please correct me if I misunderstood anything. ---- The basic idea is to build a set (implemented as a hash table) of every unique key in every row-event in each transaction received from the master. If two transactions have no overlapping keys, they can be applied in parallel (foreign_key_checks is set to OFF to make this work). There is a single thread that processes each event received in sequence. It checks the key set of the new transaction against the key sets of all transactions queued for each of N worker threads: - If there is no conflict with any already queued transaction, the new transaction can be queued to any worker thread (the one with fewest queued transaction is chosen). - If there is a conflict with just one thread, then the transaction can be queued for that thread - it will then execute in correct sequence with the conflicting transactions. - If there is a conflict with multiple threads, wait until the conflicting transactions have been processed. No new transactions are queued while waiting. Then there are N (default 16) threads that apply the queued transactions. Queued transactions are processsed in parallel within one thread, but transaction in different queues are processed in parallel with each other. If a transaction is received which is not using row events (statement-based replication or DDL), then we wait until all queued transactions are processed. Only after that is the new transaction queued, so that such events are processed correctly (but not in parallel, of course). Likewise, if the worker threads get too far behind, we wait with queuing any more transactions. ---- I think this is a very interesting approach, and I am happy that someone have actually tried implementing it. The biggest potential issue seems to be the performance of the thread that checks the transactions for conflicts and queues them for the appropriate thread: - Main issue will be huge transactions, if the size of hash tables of key values become too big for available memory. Maybe this could be solved by having some limit above which transactions are not tried in parallel. - There will be a lot of hash lookups of key values, all done in a single thread. This could become a limit I suppose - testing will be needed to understand this fully. But I think there will be many workloads where this will be useful. Overall, it seems to me this technique is very promising. It seems to be better at getting maximum parallelism on the slave than any of the other approaches that have been considered. As to the patch itself, it seems a bit immature in the present form, but I understood you are still working on it. Let us discuss what the plans could be for finishing it and get it ready for inclusion in MariaDB. The main criteria is that it must be reasonably complete - it must not crash or silently corrupt data in any usage. However, reasonable restrictions for what is supported is fine. (For example, we could say that LOAD DATA is not supported and give an error, but not crash or silently ignore such statement). But we can discuss this in more details. ---- Finally, a couple general comments on the code: 1. get_pk_value() does not seem to handle collations, eg. case insensitivity? It needs to understand that eg. "abab" and "ABAB" can conflict. I think this should be solved by copying in values with non-binary collations using the appropriate strxfrm() variant. 2. remote_query() retries the query in case of deadlock - but that seems wrrong, doesn't it need to retry the whole _transaction_, not just the query? 3. What about other events like User_var_event ? Will they be included in a later patch? 4. There is a my_sleep(10000) in the code when a transaction conflicts with two or more workers. It would be better to replace this with a condition variable so we can wake up as soon as the conflict is resolved. Same for some other sleeps like in wait_for_all_dml_done(). 5. Shouldn't there be some locking in dispatch_transaction()? To protect the hash lookups when checking conflicts against worker threads doing pk_hash_minus() ? 6. get_pk_value(). There seems to be a buffer overflow - st_hash_item::key is fixed size, but there seems no check against overrunning it. One solution is to just truncate if data does not fit - this can lead to false positives where potential parallelism is not exploited, but that is still safe and should be ok. Thanks! - Kristian. ________________________________ This email (including any attachments) is confidential and may be legally privileged. If you received this email in error, please delete it immediately and do not copy it or use it for any purpose or disclose its contents to any other person. Thank you. 本电邮(包括任何附件)可能含有机密资料并受法律保护。如您不是正确的收件人,请您立即删除本邮件。请不要将本电邮进行复制并用作任何其他用途、或透露本邮件之内容。谢谢。
Hi,Kristian I just reposted the patch file in the last setion in our page. http://mysql.taobao.org/index.php/RLPR_for_MariaDB#Source_code This version shows how I want to change it to a "pure slave" A dummy_rli is added, and all the remote_slave_xxx variables are not neecssary now. My plan next as: 1 Change the case insensitivity bug that you have mentioned before. 2 Run mysql-test suits and pass all the testes. Best Regards, xiaobin ________________________________________ 发件人: 丁奇 发送时间: 2012年10月4日 23:24 到: Kristian Nielsen; maria-developers@lists.launchpad.net Cc: Sergei Golubchik 主题: 答复: MDEV-520: consider parallel replication patch from taobao patches Hi, Kristian Thank you very much for looking into the patch and comments. You have got a deep understanding of the patch. Let's discuss the issues one by one. About Bugs: 1. There indeed is a possible invalid memory access in get_pk_value(). I have changed the definition of st_hash_item::key to char[1024+2*NAME_CHAR_LEN+2], and when building the hash key, if (item->key_len + pack_length >= 1024) break; This can guarantee that, even if the total length of the primary key is bigger than 1024, at least one key_part of the key can be recorded into the hash_key.(As the max length of one key_part is 1000 in MySQL). As you mentioned in last mail, this can lead to false positives where potential parallelism is not exploited, but that is still safe and should be ok. 2. The problem of case insensitivity is a bug. I will modify it in next version. Simply we can test the definition in the table schema, and decide whethere change the string to lower case before adding to pk_hash. 3. There is lock protection in dispatch_transaction(). It is pk_hash_lock in the function "check_pk_exist_key" 4. When deadlock occurs, the whole transaction needs to retry. In current implement, a whole transaction is packed into one "Query", so retrying a query is equal to one transaction. 5. We handle the events including Query_event, Table_map_event, Write_row_event, Update_row_event, Delete_row_event, Xid_event. For all other events, in current implement, I just run it locally (that is in Transfer), but not in the remote_slave. This will lose some information when master use statment-based-binlog. If the master use row-based-binlog, I have not found the serious affects in data consistency. As I told you before, I am changing the patch to make the patched mysqld as a "pure" slave. In the pure slave mode, all other events will be treated like a DDL statement.That means an User_var_event will wait for all the worker queues to be empty, and call ev->apply_event. Is this strategy suitable? Pleast point out if there are potential problems. About some strategies 1. For simplify , I used some sleep(1000000) in the patch, a condition variable will be better, it will be modified in future, but not the highest priority. 2. Huge transactions may lead to the memory usage and cpu load, thank you for pointing it out, I never think of it before. I think we can deal with it in this way: If a transaction contains too many events, such as bigger than a centain number, we can treat it like a DDL statement. Because it will be executed after all the worker queues to be empty, there is no "order problem" here, so we do not need to construct the pk_hash. Please give me some suggestion for this issue. 3. There are N(default as 16) threads for parallel execution. But the No. 0 thread is special. It is used for the transactions whose affecting rows do not have any primary key or unique key, that means all modification on no-primary-rows will be execute serially. About uses and plans I tested it using some simple cases, valgrind used during the cases. The important method for testing is using the productive data. Before a project use it, a necessary step is to build a slave(and a Transfer) in the test environment, data consistency will be checked after running for some days. In my plan, the next modification is to change it to the "pure slave", that is to run every event using the ev->apply_event directly locally. After it is done, the test-case suit can be used for test. I will ping you after I upload the new version of the patch. Thanks Xiaobin ________________________________________ 发件人: Kristian Nielsen [knielsen@knielsen-hq.org] 发送时间: 2012年10月3日 19:26 到: 丁奇; maria-developers@lists.launchpad.net Cc: Sergei Golubchik 主题: MDEV-520: consider parallel replication patch from taobao patches Hi xiaobin, I read through your patch found here: http://mysql.taobao.org/index.php/RLPR_for_MariaDB I am still a bit confused about the status of this patch. It would be useful if you could explain a bit about the status of the patch: - How have you tested the patch. - How are you currently using the patch? Or is it still just development/test? - What are your current plans for the patch? (You mentioned that you have some changes you are implementing at the moment). Anyway, I think I have a fair understanding of the overall idea, which seems very interesting! I will try to summarise here in a bit more detail, as well as give a few general comments on the patch. Please correct me if I misunderstood anything. ---- The basic idea is to build a set (implemented as a hash table) of every unique key in every row-event in each transaction received from the master. If two transactions have no overlapping keys, they can be applied in parallel (foreign_key_checks is set to OFF to make this work). There is a single thread that processes each event received in sequence. It checks the key set of the new transaction against the key sets of all transactions queued for each of N worker threads: - If there is no conflict with any already queued transaction, the new transaction can be queued to any worker thread (the one with fewest queued transaction is chosen). - If there is a conflict with just one thread, then the transaction can be queued for that thread - it will then execute in correct sequence with the conflicting transactions. - If there is a conflict with multiple threads, wait until the conflicting transactions have been processed. No new transactions are queued while waiting. Then there are N (default 16) threads that apply the queued transactions. Queued transactions are processsed in parallel within one thread, but transaction in different queues are processed in parallel with each other. If a transaction is received which is not using row events (statement-based replication or DDL), then we wait until all queued transactions are processed. Only after that is the new transaction queued, so that such events are processed correctly (but not in parallel, of course). Likewise, if the worker threads get too far behind, we wait with queuing any more transactions. ---- I think this is a very interesting approach, and I am happy that someone have actually tried implementing it. The biggest potential issue seems to be the performance of the thread that checks the transactions for conflicts and queues them for the appropriate thread: - Main issue will be huge transactions, if the size of hash tables of key values become too big for available memory. Maybe this could be solved by having some limit above which transactions are not tried in parallel. - There will be a lot of hash lookups of key values, all done in a single thread. This could become a limit I suppose - testing will be needed to understand this fully. But I think there will be many workloads where this will be useful. Overall, it seems to me this technique is very promising. It seems to be better at getting maximum parallelism on the slave than any of the other approaches that have been considered. As to the patch itself, it seems a bit immature in the present form, but I understood you are still working on it. Let us discuss what the plans could be for finishing it and get it ready for inclusion in MariaDB. The main criteria is that it must be reasonably complete - it must not crash or silently corrupt data in any usage. However, reasonable restrictions for what is supported is fine. (For example, we could say that LOAD DATA is not supported and give an error, but not crash or silently ignore such statement). But we can discuss this in more details. ---- Finally, a couple general comments on the code: 1. get_pk_value() does not seem to handle collations, eg. case insensitivity? It needs to understand that eg. "abab" and "ABAB" can conflict. I think this should be solved by copying in values with non-binary collations using the appropriate strxfrm() variant. 2. remote_query() retries the query in case of deadlock - but that seems wrrong, doesn't it need to retry the whole _transaction_, not just the query? 3. What about other events like User_var_event ? Will they be included in a later patch? 4. There is a my_sleep(10000) in the code when a transaction conflicts with two or more workers. It would be better to replace this with a condition variable so we can wake up as soon as the conflict is resolved. Same for some other sleeps like in wait_for_all_dml_done(). 5. Shouldn't there be some locking in dispatch_transaction()? To protect the hash lookups when checking conflicts against worker threads doing pk_hash_minus() ? 6. get_pk_value(). There seems to be a buffer overflow - st_hash_item::key is fixed size, but there seems no check against overrunning it. One solution is to just truncate if data does not fit - this can lead to false positives where potential parallelism is not exploited, but that is still safe and should be ok. Thanks! - Kristian. ________________________________ This email (including any attachments) is confidential and may be legally privileged. If you received this email in error, please delete it immediately and do not copy it or use it for any purpose or disclose its contents to any other person. Thank you. 本电邮(包括任何附件)可能含有机密资料并受法律保护。如您不是正确的收件人,请您立即删除本邮件。请不要将本电邮进行复制并用作任何其他用途、或透露本邮件之内容。谢谢。
Hi 丁奇,
Thanks for your answers, you have a good understanding of the potential issues
and how to solve them.
(I've replied to individual items below, but I mostly agree with your
answers).
I have thought a bit more about the overall idea, and I quite like it. In a
way, this is the natural way to do parallel slave: analyse the changes for
conflicts, and run in parallel any that do not conflict. So I think it is
great that you went and actually tried this and got some real code for it.
I will mention a couple more challenges that will need to be overcome. But
I hope you will first try to complete your plans as you explained in your
mail. This will allow us to see if it works in practise (I think it will), and
then we can work together to handle the possible challenges (which I am sure
can be overcome).
The challenges are mainly to get this working with other replication features
that are already in MariaDB 10.0 or are planned.
----
The first challenge is multi-source replication, which is now in 10.0. This is
the patch by Lixun Peng, which you may already know.
With multi-source replication, we already have multiple threads, one SQL
thread per master connection. In your parallel replication patch, we have also
multiple threads.
So now we could have 16 threads (parallel replication) for each master
connection (multi-source). Then the thread handling starts to be a bit
complex. Then there are a couple of other ideas for parallel replication that
we might want to do later (eg. that work for statement-based binlog or for
DDL), they will require other threads and even more complex thread handling.
I think the way to solve this is to start with your plan, where you just have
16 (by default) threads. And then later we can extend this so that we have a
general pool of replication threads, and some general mechanism of
distributing work to the appropriate thread. (I would be happy to help getting
this working).
Then eventually we will have just one pool of threads which are shared between
multi-source replication, your parallel replication, and whatever else might
be implemented in the future.
----
The second challenge is the commit order, and global transaction ID.
With your patch, transactions on the slave can be committed in a different
order than on the master (because they are run in parallel). This means that
the order in the binlog on the slave (slave-bin.XXXXXX) will be different from
on the master (master-bin.XXXXXX).
This makes it harder if the old master is removed, and one of the slaves
should become the new master. Because the different slaves will have
transactions in different order, and it will be hard to know which
transactions from a new master to apply and which have already been applied.
The MySQL 5.6 implementation of global transaction ID has a way to handle
this, but it is very complex and has some problems. We plan to do another
design (MDEV-26, https://mariadb.atlassian.net/browse/MDEV-26) which requires
that transactions are committed in the same order on the slave as they are on
the master.
Besides, if transactions are committed in different order on the slave, then
some applications may have problems if they require that SELECT sees
transactions in the same order on all slaves (but other applications will not
have problems with this, it depends on the application).
I think this can be solved by implementing a configuration option that enables
or disables that commits in parallel replication happens in the same order as
on the master. If enabled, then each thread will apply the transaction in
parallel as normal, but wait (on a condition variable) for the previous
transaction to commit before committing itself. But if disabled, then we do as
your patch now. Then the user can choose whether to keep the order, which is
safe for all applications and works with global transaction ID, but will be
somewhat less parallel. Or if the user wants to do everything at full
parallelism, but then get commits in different order on the slave.
----
丁奇
1. There indeed is a possible invalid memory access in get_pk_value(). I have changed the definition of st_hash_item::key to char[1024+2*NAME_CHAR_LEN+2], and when building the hash key, if (item->key_len + pack_length >= 1024) break; This can guarantee that, even if the total length of the primary key is bigger than 1024, at least one key_part of the key can be recorded into the hash_key.(As the max length of one key_part is 1000 in MySQL).
Ok, sounds good.
2. The problem of case insensitivity is a bug. I will modify it in next version. Simply we can test the definition in the table schema, and decide whethere change the string to lower case before adding to pk_hash.
Yes, agree.
4. When deadlock occurs, the whole transaction needs to retry. In current implement, a whole transaction is packed into one "Query", so retrying a query is equal to one transaction.
Ah, yes you are right, missed that. Thanks for the explanation.
As I told you before, I am changing the patch to make the patched mysqld as a "pure" slave. In the pure slave mode, all other events will be treated like a DDL statement.That means an User_var_event will wait for all the worker queues to be empty, and call ev->apply_event. Is this strategy suitable? Pleast point out if there are potential problems.
Yes, I think this will work fine. There are lots of details with other events, but they can be handled like DDL. Then most things should just work. There will probably be some details and corner cases to work out, but I think it should be manageable, we can deal with it at the appropriate time.
About some strategies 1. For simplify , I used some sleep(1000000) in the patch, a condition variable will be better, it will be modified in future, but not the highest priority.
Yes, agree. It is a good idea to start with a simple thread scheduling strategy, if that works then we can improve things later.
2. Huge transactions may lead to the memory usage and cpu load, thank you for pointing it out, I never think of it before. I think we can deal with it in this way: If a transaction contains too many events, such as bigger than a centain number, we can treat it like a DDL statement. Because it will be executed after all the worker queues to be empty, there is no "order problem" here, so we do not need to construct the pk_hash. Please give me some suggestion for this issue.
Yes, I agree. If transaction is too big, we can fall back to doing it serially.
My plan next as: 1 Change the case insensitivity bug that you have mentioned before. 2 Run mysql-test suits and pass all the testes.
Sounds good! Looking forward to seeing the result. - Kristian.
Hi, Kristian
(I was writting to you when received you mail, it's good news that you get well from ill)
After I run the mysql-test-run, I find that when consider into other engines like myisam, other binlog formats (mixed, statement) and more complex scence (temporary table, load data statements), I found it is necessary to re-structure code and deal the details more carefully.
So I have changed the structure of the patch these two days. The main concept is not changed.
Changes are:
1) fix the case insensitivity problem
2) fix the invalid memory access when the key is not long enough.
3) Change the strategy in mixed cases:
A transaction is buffer as a whole first, and decide the way of applying based different case -- If a transaction contains one or more statement-format queries, it is treated like an DDL.
This may occur when the master's binlog format is mixed.
My aim is that, if the master fulfills the conditions that we think we can use multiple-thread to increase performce, we improve it . If not, we just confirm the correctness.
Please fetch the new version in the following URL.
http://mysql.taobao.org/index.php/RLPR_for_MariaDB#Source_code
If consider the general cases, there maybe many bugs in the patch.
There are some errors in the result of ./mtr, I am dealing with them one by one, but some of them are some complex that I need your help (Such as rpl.rpl_deadlock_innodb).
You mentioned the GTIDs in 5.6. I am so agree with you.
I used a inner queue to record the order of completed transacton. The reason is that in Version 5.5 and older versions, log_file_name and log_file_pos can not show the adjoining relation.
If the GTIDs (or a similar mechanism from MariaDB) introduced, the logic of getting the "slowest" worker position can become more simple.
Best Regards,
Xiaobin
_____________________
___________________
发件人: Kristian Nielsen [knielsen@knielsen-hq.org]
发送时间: 2012年10月12日 19:25
到: 丁奇
Cc: maria-developers@lists.launchpad.net
主题: Re: [Maria-developers] 答复: MDEV-520: consider parallel replication patch from taobao patches
Hi 丁奇,
Thanks for your answers, you have a good understanding of the potential issues
and how to solve them.
(I've replied to individual items below, but I mostly agree with your
answers).
I have thought a bit more about the overall idea, and I quite like it. In a
way, this is the natural way to do parallel slave: analyse the changes for
conflicts, and run in parallel any that do not conflict. So I think it is
great that you went and actually tried this and got some real code for it.
I will mention a couple more challenges that will need to be overcome. But
I hope you will first try to complete your plans as you explained in your
mail. This will allow us to see if it works in practise (I think it will), and
then we can work together to handle the possible challenges (which I am sure
can be overcome).
The challenges are mainly to get this working with other replication features
that are already in MariaDB 10.0 or are planned.
----
The first challenge is multi-source replication, which is now in 10.0. This is
the patch by Lixun Peng, which you may already know.
With multi-source replication, we already have multiple threads, one SQL
thread per master connection. In your parallel replication patch, we have also
multiple threads.
So now we could have 16 threads (parallel replication) for each master
connection (multi-source). Then the thread handling starts to be a bit
complex. Then there are a couple of other ideas for parallel replication that
we might want to do later (eg. that work for statement-based binlog or for
DDL), they will require other threads and even more complex thread handling.
I think the way to solve this is to start with your plan, where you just have
16 (by default) threads. And then later we can extend this so that we have a
general pool of replication threads, and some general mechanism of
distributing work to the appropriate thread. (I would be happy to help getting
this working).
Then eventually we will have just one pool of threads which are shared between
multi-source replication, your parallel replication, and whatever else might
be implemented in the future.
----
The second challenge is the commit order, and global transaction ID.
With your patch, transactions on the slave can be committed in a different
order than on the master (because they are run in parallel). This means that
the order in the binlog on the slave (slave-bin.XXXXXX) will be different from
on the master (master-bin.XXXXXX).
This makes it harder if the old master is removed, and one of the slaves
should become the new master. Because the different slaves will have
transactions in different order, and it will be hard to know which
transactions from a new master to apply and which have already been applied.
The MySQL 5.6 implementation of global transaction ID has a way to handle
this, but it is very complex and has some problems. We plan to do another
design (MDEV-26, https://mariadb.atlassian.net/browse/MDEV-26) which requires
that transactions are committed in the same order on the slave as they are on
the master.
Besides, if transactions are committed in different order on the slave, then
some applications may have problems if they require that SELECT sees
transactions in the same order on all slaves (but other applications will not
have problems with this, it depends on the application).
I think this can be solved by implementing a configuration option that enables
or disables that commits in parallel replication happens in the same order as
on the master. If enabled, then each thread will apply the transaction in
parallel as normal, but wait (on a condition variable) for the previous
transaction to commit before committing itself. But if disabled, then we do as
your patch now. Then the user can choose whether to keep the order, which is
safe for all applications and works with global transaction ID, but will be
somewhat less parallel. Or if the user wants to do everything at full
parallelism, but then get commits in different order on the slave.
----
丁奇
1. There indeed is a possible invalid memory access in get_pk_value(). I have changed the definition of st_hash_item::key to char[1024+2*NAME_CHAR_LEN+2], and when building the hash key, if (item->key_len + pack_length >= 1024) break; This can guarantee that, even if the total length of the primary key is bigger than 1024, at least one key_part of the key can be recorded into the hash_key.(As the max length of one key_part is 1000 in MySQL).
Ok, sounds good.
2. The problem of case insensitivity is a bug. I will modify it in next version. Simply we can test the definition in the table schema, and decide whethere change the string to lower case before adding to pk_hash.
Yes, agree.
4. When deadlock occurs, the whole transaction needs to retry. In current implement, a whole transaction is packed into one "Query", so retrying a query is equal to one transaction.
Ah, yes you are right, missed that. Thanks for the explanation.
As I told you before, I am changing the patch to make the patched mysqld as a "pure" slave. In the pure slave mode, all other events will be treated like a DDL statement.That means an User_var_event will wait for all the worker queues to be empty, and call ev->apply_event. Is this strategy suitable? Pleast point out if there are potential problems.
Yes, I think this will work fine. There are lots of details with other events, but they can be handled like DDL. Then most things should just work. There will probably be some details and corner cases to work out, but I think it should be manageable, we can deal with it at the appropriate time.
About some strategies 1. For simplify , I used some sleep(1000000) in the patch, a condition variable will be better, it will be modified in future, but not the highest priority.
Yes, agree. It is a good idea to start with a simple thread scheduling strategy, if that works then we can improve things later.
2. Huge transactions may lead to the memory usage and cpu load, thank you for pointing it out, I never think of it before. I think we can deal with it in this way: If a transaction contains too many events, such as bigger than a centain number, we can treat it like a DDL statement. Because it will be executed after all the worker queues to be empty, there is no "order problem" here, so we do not need to construct the pk_hash. Please give me some suggestion for this issue.
Yes, I agree. If transaction is too big, we can fall back to doing it serially.
My plan next as: 1 Change the case insensitivity bug that you have mentioned before. 2 Run mysql-test suits and pass all the testes.
Sounds good! Looking forward to seeing the result. - Kristian. ________________________________ This email (including any attachments) is confidential and may be legally privileged. If you received this email in error, please delete it immediately and do not copy it or use it for any purpose or disclose its contents to any other person. Thank you. 本电邮(包括任何附件)可能含有机密资料并受法律保护。如您不是正确的收件人,请您立即删除本邮件。请不要将本电邮进行复制并用作任何其他用途、或透露本邮件之内容。谢谢。
丁奇
So I have changed the structure of the patch these two days. The main concept is not changed. Changes are: 1) fix the case insensitivity problem 2) fix the invalid memory access when the key is not long enough.
Great!
3) Change the strategy in mixed cases: A transaction is buffer as a whole first, and decide the way of applying based different case -- If a transaction contains one or more statement-format queries, it is treated like an DDL. This may occur when the master's binlog format is mixed.
Right, this will be necessary.
My aim is that, if the master fulfills the conditions that we think we can use multiple-thread to increase performce, we improve it . If not, we just confirm the correctness.
Yes, agree.
Please fetch the new version in the following URL.
http://mysql.taobao.org/index.php/RLPR_for_MariaDB#Source_code
If consider the general cases, there maybe many bugs in the patch. There are some errors in the result of ./mtr, I am dealing with them one by one, but some of them are some complex that I need your help (Such as rpl.rpl_deadlock_innodb).
Ok. Let me know when you are done with the simple ones, and I will try to find time to look into the complex ones. - Kristian.
Hi, Kristian,
Sorry for a huge delay for this mail. It is a so busy month in this Nov.
In the attachment is that result of "mysql-test-run.pl --force --parallel=auto"
There are 52 fails in the result.
Failing test(s): rpl.rpl_dual_pos_advance rpl.rpl_slave_skip rpl.rpl_row_basic_3innodb rpl.rpl_skip_replication rpl.rpl_slave_grp_exec rpl.rpl_deadlock_innodb rpl.rpl_begin_commit_rollback rpl.rpl_stm_EE_err2 rpl.rpl_row_basic_2myisam rpl.rpl_row_loaddata_concurrent rpl.rpl_row_binlog_max_cache_size rpl.rpl_stm_conflicts rpl.rpl_row_conflicts rpl.rpl_stm_loaddata_concurrent rpl.rpl_row_idempotency rpl.rpl_row_sp002_innodb rpl.rpl_loaddata rpl.rpl_circular_for_4_hosts funcs_1.innodb_func_view rpl.rpl_rotate_logs funcs_1.memory_func_view funcs_1.myisam_func_view plugins.unix_socket
The failing tests can be sort into some categories as follow:
1) rpl.rpl_slave_skip
rpl.rpl_row_binlog_max_cache_size
rpl.rpl_begin_commit_rollback
rpl.rpl_skip_replication
rpl.slave_skip_errors_basic
In the multi-thread mode, SQL_SLAVE_SKIP_COUNTER may lead to different result in slave. Test-cases above are affected by that. Though it is by designed, we should get some strategres about this issue.
2) rpl.rpl_dual_pos_advance
it almost because the reason in 2). In this case, "start slave until ..." is used.
3) rpl.rpl_stm_conflicts
wait_for_slave_sql_error_and_skip
rpl_row_conflicts
rpl_stm_loaddata_concurrent
wait_for_slave_sql_error_and_skip
rpl.rpl_rotate_logs,main.not_embedded_server
rpl.rpl_stm_EE_err2
rpl.rpl_row_loaddata_concurrent
rpl.rpl_loaddata
These cases failed because the slave is waiting for the error number 1062, which is ignore in Parallel mode.
4) rpl.rpl_row_basic_3innodb
rpl_row_basic_2myisam
These fails because the assertion of "Counter for COM_COMMIT", in our mode, the transactions are not executed in the original SQL_THREAD
5) rpl.rpl_row_idempotency, rpl.rpl_row_sp002_innodb fail because we disable foreign key checkes
6) rpl.rpl_slave_grp_exec
rpl.rpl_row_annotate
These fails are by designed. In Parallel mode, we do not ensure the order of insertions, if the order do not affect the data consistency.
So in some test results, rpl.rpl_row_annotate are reported successful, when it happend that the order of execution are the same with that of binary log.
7) sys_vars.slave_skip_errors_basic
main.not_embedded_server
main.mysqld--help
I change the default value of SLAVE_SKIP_ERRORS from OFF to 1032,1062, the test-result should be modified.
8) funcs_1.innodb_func_view
funcs_1.memory_func_view
funcs_1.myisam_func_view
plugins.unix_socket
These fails are the same with that in 10.0-base, not introduced by this patch.
Finally, there are some test-cases that are too complex and need your help.
rpl.rpl_deadlock_innodb
rpl.rpl_circular_for_4_hosts
Can you review the patch again and give me more suggestion? Thank you.
Best regards
Xiaobin
________________________________________
发件人: Kristian Nielsen [knielsen@knielsen-hq.org]
发送时间: 2012年10月15日 14:49
到: 丁奇
Cc: maria-developers@lists.launchpad.net
主题: Re: 答复: [Maria-developers] 答复: MDEV-520: consider parallel replication patch from taobao patches
丁奇
So I have changed the structure of the patch these two days. The main concept is not changed. Changes are: 1) fix the case insensitivity problem 2) fix the invalid memory access when the key is not long enough.
Great!
3) Change the strategy in mixed cases: A transaction is buffer as a whole first, and decide the way of applying based different case -- If a transaction contains one or more statement-format queries, it is treated like an DDL. This may occur when the master's binlog format is mixed.
Right, this will be necessary.
My aim is that, if the master fulfills the conditions that we think we can use multiple-thread to increase performce, we improve it . If not, we just confirm the correctness.
Yes, agree.
Please fetch the new version in the following URL.
http://mysql.taobao.org/index.php/RLPR_for_MariaDB#Source_code
If consider the general cases, there maybe many bugs in the patch. There are some errors in the result of ./mtr, I am dealing with them one by one, but some of them are some complex that I need your help (Such as rpl.rpl_deadlock_innodb).
Ok. Let me know when you are done with the simple ones, and I will try to find time to look into the complex ones. - Kristian. ________________________________ This email (including any attachments) is confidential and may be legally privileged. If you received this email in error, please delete it immediately and do not copy it or use it for any purpose or disclose its contents to any other person. Thank you. 本电邮(包括任何附件)可能含有机密资料并受法律保护。如您不是正确的收件人,请您立即删除本邮件。请不要将本电邮进行复制并用作任何其他用途、或透露本邮件之内容。谢谢。
丁奇
Finally, there are some test-cases that are too complex and need your help. rpl.rpl_deadlock_innodb
I debugged this one. When I run the test, it crashes inside sql_plugin.cc. The reason is this code in Transfer_worker::run(): thd= new THD; ... thd->variables= rli->sql_thd->variables; I assume the intention is that slave workers should have the same session variables as the "normal" sql thread. However, it does not work to copy thd->variables like this. The reason it crashes here is that it copies thd->variables.table_plugin without using plugin_lock(). And debug builds have some extra code to help catch missing lock() / incorrect refcounting (it is turned into a double free). There may be other problems with copying thd->variables directly that I'm not aware of. But maybe it would be better if we did not try to copy thd->variables from the sql thread? I tried the following patch, and then rpl_deadlock_innodb no longer crashes: === modified file 'sql/slave.cc' --- sql/slave.cc 2012-12-05 14:05:37 +0000 +++ sql/slave.cc 2012-12-05 14:13:29 +0000 @@ -2889,13 +2889,11 @@ int Transfer_worker::run() thd= new THD; pthread_detach_this_thread(); thd->thread_stack= (char*) &i; - thd->variables= rli->sql_thd->variables; + init_slave_thread(thd, SLAVE_THD_SQL); thd->variables.option_bits|= OPTION_NO_FOREIGN_KEY_CHECKS; thd->variables.dynamic_variables_ptr= NULL; - thd->security_ctx= rli->sql_thd->security_ctx; thd->lex->thd= thd; thd->lex->derived_tables= 0; - thd->system_thread = SYSTEM_THREAD_SLAVE_SQL; dummy_rli= new Relay_log_info(FALSE); dummy_rli->no_storage= TRUE; Basically, initialise thd->variables from scratch like sql thread does, not copy from the thd of the sql thread. What do you think? BTW, the rpl_deadlock_innodb test fails differently with my patch, it times out waiting for error 1205. I did not investigate yet, maybe it is now similar to some of the other errors in other tests that you already explained. I will continue looking at the other questions in your mail. BTW, are you keeping the code in revision control somewhere (bzr, git, whatever)? It might be easier to work together on some tree where we can commit changes, rather than mail patches back and forth? Later, - Kristian.
Hi, Kristian
It's great. I just test all the test-case again after modified as you said, the result looks normal.
What's your cmake parameter? I did not crah it in my envieronment, in fact if it crashed, I should resolved it before sending it to you.
I am now working in my local , base on the bzr : http://bazaar.launchpad.net/~maria-captains/maria/10.0-base/
Can you suggest a suitable place on launchpad? or a branch under ~maria-captains/maria/ ?
Xiaobin
发件人: Kristian Nielsen [knielsen@knielsen-hq.org]
发送时间: 2012年12月5日 22:31
到: 丁奇
Cc: maria-developers@lists.launchpad.net
主题: Re: [Maria-developers] 答复: 答复: 答复: MDEV-520: consider parallel replication patch from taobao patches
丁奇
Finally, there are some test-cases that are too complex and need your help. rpl.rpl_deadlock_innodb
I debugged this one. When I run the test, it crashes inside sql_plugin.cc. The reason is this code in Transfer_worker::run(): thd= new THD; ... thd->variables= rli->sql_thd->variables; I assume the intention is that slave workers should have the same session variables as the "normal" sql thread. However, it does not work to copy thd->variables like this. The reason it crashes here is that it copies thd->variables.table_plugin without using plugin_lock(). And debug builds have some extra code to help catch missing lock() / incorrect refcounting (it is turned into a double free). There may be other problems with copying thd->variables directly that I'm not aware of. But maybe it would be better if we did not try to copy thd->variables from the sql thread? I tried the following patch, and then rpl_deadlock_innodb no longer crashes: === modified file 'sql/slave.cc' --- sql/slave.cc 2012-12-05 14:05:37 +0000 +++ sql/slave.cc 2012-12-05 14:13:29 +0000 @@ -2889,13 +2889,11 @@ int Transfer_worker::run() thd= new THD; pthread_detach_this_thread(); thd->thread_stack= (char*) &i; - thd->variables= rli->sql_thd->variables; + init_slave_thread(thd, SLAVE_THD_SQL); thd->variables.option_bits|= OPTION_NO_FOREIGN_KEY_CHECKS; thd->variables.dynamic_variables_ptr= NULL; - thd->security_ctx= rli->sql_thd->security_ctx; thd->lex->thd= thd; thd->lex->derived_tables= 0; - thd->system_thread = SYSTEM_THREAD_SLAVE_SQL; dummy_rli= new Relay_log_info(FALSE); dummy_rli->no_storage= TRUE; Basically, initialise thd->variables from scratch like sql thread does, not copy from the thd of the sql thread. What do you think? BTW, the rpl_deadlock_innodb test fails differently with my patch, it times out waiting for error 1205. I did not investigate yet, maybe it is now similar to some of the other errors in other tests that you already explained. I will continue looking at the other questions in your mail. BTW, are you keeping the code in revision control somewhere (bzr, git, whatever)? It might be easier to work together on some tree where we can commit changes, rather than mail patches back and forth? Later, - Kristian. ________________________________ This email (including any attachments) is confidential and may be legally privileged. If you received this email in error, please delete it immediately and do not copy it or use it for any purpose or disclose its contents to any other person. Thank you. 本电邮(包括任何附件)可能含有机密资料并受法律保护。如您不是正确的收件人,请您立即删除本邮件。请不要将本电邮进行复制并用作任何其他用途、或透露本邮件之内容。谢谢。
丁奇
It's great. I just test all the test-case again after modified as you said, the result looks normal.
Ok, good!
What's your cmake parameter? I did not crah it in my envieronment, in fact if it crashed, I should resolved it before sending it to you.
cmake -DCMAKE_BUILD_TYPE=Debug In the debug build, additional instrumentation is added to the code which catches some errors, as in this case. I will continue and look deeper in the rpl_deadlock_innodb failure and in the other issues.
I am now working in my local , base on the bzr : http://bazaar.launchpad.net/~maria-captains/maria/10.0-base/ Can you suggest a suitable place on launchpad? or a branch under ~maria-captains/maria/ ?
For now, I think we can just put it as personal branches, ie. you could push to lp:~dingqi.lxb/maria/10.0-base-parallel-replication and I can push to a branch of my own, and we can pull from each other if necessary. - Kristian.
Kristian Nielsen
I will continue and look deeper in the rpl_deadlock_innodb failure and in the other issues.
Ok, I debugged the problem in rpl.rpl_deadlock_innodb where I get this failure: CURRENT_TEST: rpl.rpl_deadlock_innodb mysqltest: In included file "./include/wait_for_slave_param.inc": included from ./include/wait_for_slave_sql_error.inc at line 41: included from ./extra/rpl_tests/rpl_deadlock.test at line 84: included from /home/knielsen/my/10.0/work-10.0-mdev520/mysql-test/suite/rpl/t/rpl_deadlock_innodb.test at line 6: At line 115: Timeout in include/wait_for_slave_param.inc This test case tries to replicate a transaction, but the slave is blocked by row locks held by a user transaction. So the slave transaction gets a "Lock wait timeout exceeded" error, and retries the transaction, this repeats until @@global.slave_transaction_retries is exceeded. The test case waits for the maximum number of retries to happen and the slave to stop with an error. However, this does not happen in your parallel-replication tree. Instead the slave loops endlessly retrying (and timing out) the transaction. The transaction is retried the correct number of times, and then an error is returned from execute_single_transaction(). But somehow this error is not caught correctly, and the slave is not stopped. Instead, execute_single_transaction() gets called again with the *same* transaction, and it fails again, and so on endlessly. Until the test case itself times out and gives up waiting for the slave to stop with an error. I did not so far find exactly where the error check is missing, but it must be somewhere up in the call chain of execute_single_transaction(). It needs to catch the error somewhere and stop the slave and set the error code and message for SHOW SLAVE STATUS. I hope you can sort it out from there, else ask again. By the way, while debugging I found something else that may be an error also. I was replicating three CREATE TABLE statements in sequence: CREATE TABLE t1 (a INT NOT NULL, KEY(a)) ENGINE=InnoDB; CREATE TABLE t2 (a INT) ENGINE=InnoDB; CREATE TABLE t3 (a INT NOT NULL, KEY(a)) ENGINE=InnoDB; It looks as if those are executed as a single transaction (a single call to execute_single_transaction()). Is this on purpose? My guess is your code may not correctly handle event groups that are not bracketed by BEGIN ... END. Basically, if there is no BEGIN ... END, then the event is an event group by itself, however these events form a grop with the following event(s) and do not constitute a group by themselves: INTVAR_EVENT RAND_EVENT USER_VAR_EVENT TABLE_MAP_EVENT ANNOTATE_ROWS_EVENT. ---- The logic around event execution failure and retry and so on (in the normal slave code) is quite tricky, and it seems likely that there will be other issues to deal with :-/. Hopefully the above can get you a bit further. For the long-term, I will try to get hold of Monty and discuss with him how to improve the slave SQL thread code. We already have multiple SQL threads for multi-slave, now your patch has multiple threads for your parallel replication, and we may get even more threads for other features. I am hoping we could make a general refactoring to support properly multiple threads, where all of the event apply and error handling code can be cleaned up and re-used by all the different features. Then your job will become a bit easier. - Kristian.
丁奇
These cases failed because the slave is waiting for the error number 1062, which is ignore in Parallel mode. 4) rpl.rpl_row_basic_3innodb rpl_row_basic_2myisam
Finally, there are some test-cases that are too complex and need your help.
rpl.rpl_circular_for_4_hosts
Hm, I looked at this. For me, it fails like this: CURRENT_TEST: rpl.rpl_circular_for_4_hosts mysqltest: In included file "./include/wait_for_slave_param.inc": included from ./include/wait_for_slave_sql_error.inc at line 41: included from /home/knielsen/my/10.0/work-10.0-mdev520/mysql-test/suite/rpl/t/rpl_circular_for_4_hosts.test at line 92: At line 115: Timeout in include/wait_for_slave_param.inc The test sets up circular replication S1->S2->S3->S4->S1. Then it injects a duplicate primary key error, and waits for server S3 to stop on that error (1062 / ER_DUP_ENTRY). But since with your patch that error is ignored, the wait times out. So this seems to be the same as the other tests you mentioned in (4). (But let me know if I missed something?). ---- I next want to try one thing that I may have mentioned before: Implement an option so that threads wait with COMMIT until the previous event group has committed. I will try to implement this and let you know how it goes. One nice thing about such an option is that then parallel replication is completely transparent to the application. Transactions can never be observed in a different order on a slave than on the master. So it could perhaps even be enabled by default. This is something I value a lot, to implement improvements that improve things for users in the default configuration, without DBAs having to learn of new features (and their possible limitations) and explicitly enable them. Of course, such option should also help remove a lot of the test failures you mentioned. I say "an option", as I imagine we would in any case keep the possibility of the current behaviour in your patch, where commits can happen out-of-order. Since enforcing commit order could decrease performance in some cases. And we surely would want to compare the performance between the two. I actually think in many cases performance will not suffer by enforcing commits in-order. The main time spent in a transaction in InnoDB is in the data changes, and these can run in parallel as before. The commit step just marks the transaction as committed and writes the commit record to the redo log. There only slow part of commit is the fsync() of binlog and redo log to disk. However MariaDB has group commit, so if transaction B waits for transaction A with COMMIT, then we could just group commit A and B together with a single fsync(), this may even improve performance compared to committing them separately. In fact, as I'm writing this, I get the idea to do this waiting inside the group commit. This could be quite elegant, let me try this. Of course, there are some cases where enforcing commit order will reduce performance: - If we have a large transaction, like bulk load of lots of data, then other transactions will be delayed in their commit. This means that application will see a larger replication delay for those transactions. - If more threads are waiting for previous transactions to COMMIT, more threads will need to be configured to keep the same amount of parallelism. - If a transaction is waiting instead of committing, then it can cause more conflicts with later transactions, reducing parallelism. Still, with mostly small transactions that mostly do not conflict, I think ordered commits will not reduce performance much. It would be interesting to try, at least. - Kristian.
Hi, Kristian
Yes, rpl.rpl_circular_for_4_hosts now looks like the known issues for us. Thank you.
I like the idea of ordered commit, it will indeed get less difference, if the performance does not increase much. Or we can have it as an option parameter for DBAs, great idea. Looking forwrard it.
I will look into the issues that you mentioned in previous mails next week, and put my code into bzr, then I will ping you.
Keep in touch
Best regards
Xiaobin
________________________________________
发件人: Kristian Nielsen [knielsen@knielsen-hq.org]
发送时间: 2012年12月7日 19:21
到: 丁奇
Cc: maria-developers@lists.launchpad.net
主题: Re: [Maria-developers] 答复: 答复: 答复: MDEV-520: consider parallel replication patch from taobao patches
丁奇
These cases failed because the slave is waiting for the error number 1062, which is ignore in Parallel mode. 4) rpl.rpl_row_basic_3innodb rpl_row_basic_2myisam
Finally, there are some test-cases that are too complex and need your help.
rpl.rpl_circular_for_4_hosts
Hm, I looked at this. For me, it fails like this: CURRENT_TEST: rpl.rpl_circular_for_4_hosts mysqltest: In included file "./include/wait_for_slave_param.inc": included from ./include/wait_for_slave_sql_error.inc at line 41: included from /home/knielsen/my/10.0/work-10.0-mdev520/mysql-test/suite/rpl/t/rpl_circular_for_4_hosts.test at line 92: At line 115: Timeout in include/wait_for_slave_param.inc The test sets up circular replication S1->S2->S3->S4->S1. Then it injects a duplicate primary key error, and waits for server S3 to stop on that error (1062 / ER_DUP_ENTRY). But since with your patch that error is ignored, the wait times out. So this seems to be the same as the other tests you mentioned in (4). (But let me know if I missed something?). ---- I next want to try one thing that I may have mentioned before: Implement an option so that threads wait with COMMIT until the previous event group has committed. I will try to implement this and let you know how it goes. One nice thing about such an option is that then parallel replication is completely transparent to the application. Transactions can never be observed in a different order on a slave than on the master. So it could perhaps even be enabled by default. This is something I value a lot, to implement improvements that improve things for users in the default configuration, without DBAs having to learn of new features (and their possible limitations) and explicitly enable them. Of course, such option should also help remove a lot of the test failures you mentioned. I say "an option", as I imagine we would in any case keep the possibility of the current behaviour in your patch, where commits can happen out-of-order. Since enforcing commit order could decrease performance in some cases. And we surely would want to compare the performance between the two. I actually think in many cases performance will not suffer by enforcing commits in-order. The main time spent in a transaction in InnoDB is in the data changes, and these can run in parallel as before. The commit step just marks the transaction as committed and writes the commit record to the redo log. There only slow part of commit is the fsync() of binlog and redo log to disk. However MariaDB has group commit, so if transaction B waits for transaction A with COMMIT, then we could just group commit A and B together with a single fsync(), this may even improve performance compared to committing them separately. In fact, as I'm writing this, I get the idea to do this waiting inside the group commit. This could be quite elegant, let me try this. Of course, there are some cases where enforcing commit order will reduce performance: - If we have a large transaction, like bulk load of lots of data, then other transactions will be delayed in their commit. This means that application will see a larger replication delay for those transactions. - If more threads are waiting for previous transactions to COMMIT, more threads will need to be configured to keep the same amount of parallelism. - If a transaction is waiting instead of committing, then it can cause more conflicts with later transactions, reducing parallelism. Still, with mostly small transactions that mostly do not conflict, I think ordered commits will not reduce performance much. It would be interesting to try, at least. - Kristian. ________________________________ This email (including any attachments) is confidential and may be legally privileged. If you received this email in error, please delete it immediately and do not copy it or use it for any purpose or disclose its contents to any other person. Thank you. 本电邮(包括任何附件)可能含有机密资料并受法律保护。如您不是正确的收件人,请您立即删除本邮件。请不要将本电邮进行复制并用作任何其他用途、或透露本邮件之内容。谢谢。
Kristian Nielsen
I next want to try one thing that I may have mentioned before: Implement an option so that threads wait with COMMIT until the previous event group has committed. I will try to implement this and let you know how it goes.
I implemented a first version of this, attached below. This is work in progress, there will be more work to do. But it seems to mostly work, so I wanted that you have the chance to look at it and see what my idea is and tell me your opinion. The patch makes later transactions wait for earlier transactions to commit before committing themselves. The waiting probably looks more complex than necessary, but this is because I prepared for integration with group commit. For now transactions will be committed one at a time, in order. Later I want to implement that if transaction T2 waits for T1, then T1 and T2 will be committed together. This could maybe give a nice speedup if --sync-binlog=1. The waiting stuff in sql_class.cc is prepared for this. There were three new test failures: rpl.rpl_semi_sync, rpl.rpl_row_001, and rpl.rpl_row_inexist_tbl. I will look into that later, but wanted to send you the patch as early as possible. If you want, you can try to test it for performance against no ordering. Do let me know if there are problems, I will try to fix it. - Kristian.
Kristian Nielsen
The patch makes later transactions wait for earlier transactions to commit before committing themselves.
Hm, I now realise a problem with this approach :-/ The problem is with MyISAM tables, or other tables that use table-level locking instead of row-level locking. If transactions T1 and T2 both update the same table mytable, then it can happen that T2 runs first and gets a table lock on table mytable. T1 then has to wait for T2 to commit, after which the table lock is released. But T2 will never commit, it will wait for T1 to commit first with my patch. Thus we have a deadlock. One way to fix this could be to extend your algorithm for detecting conflicts between different transactions. If some table T uses table-level locking, then put into the hash just the table name, without any primary key values. Then any two such transactions will be considered conflicting, and we avoid the deadlock. No loss of parallelism should occur, as two updates to the same MyISAM table can anyway not run in parallel. So there seems to be a solution, but it is somewhat more complex that I had hoped. I will ask Monty to help if/how the code can know in general that a table uses table-level locking. I also worry that there can be other problems that can cause such a deadlock. This would be bad, as it would require the user/DBA to discover this and manually kill the stuck thread. On the other hand, it would be really nice to get in-order commit, this will make it much easier to integrate the parallel replication into MariaDB. For example, I do not know how to get this parallel replication working properly with global transaction ID without in-order commit. - Kristian.
On Thu, 10 Jan 2013 14:24:32 +0100, Kristian Nielsen
Kristian Nielsen
writes: The patch makes later transactions wait for earlier transactions to commit before committing themselves.
Hm, I now realise a problem with this approach :-/
The problem is with MyISAM tables, or other tables that use table-level locking instead of row-level locking.
If transactions T1 and T2 both update the same table mytable, then it can happen that T2 runs first and gets a table lock on table mytable. T1 then has to wait for T2 to commit, after which the table lock is released. But T2 will never commit, it will wait for T1 to commit first with my patch. Thus we have a deadlock.
One way to fix this could be to extend your algorithm for detecting conflicts between different transactions. If some table T uses table-level locking, then put into the hash just the table name, without any primary key values. Then any two such transactions will be considered conflicting, and we avoid the deadlock. No loss of parallelism should occur, as two updates to the same MyISAM table can anyway not run in parallel.
But multiple inserts can occur in parallel if rows are fixed-length. This is an important feature for a particular purpose I use daily (event logging getting replicated from data collectors to data aggregators). Having a way to get past the single-thread bottleneck on a pure-insert workload would be extremely useful. I have to use a non-transactional table engine for the replicated table because of a quirk with data commits straddling replication stopping/starting and table renaming/re-creation. Gordan
Gordan Bobic
But multiple inserts can occur in parallel if rows are fixed-length. This
True. Probably there should be an option to allow out-of-order updates to allow MyISAM parallel replication. This will not be crash-safe (eg. for global transaction ID), but then MyISAM cannot be crash-safe for that anyway. - Kristian.
Hi!
"Kristian" == Kristian Nielsen
writes:
Kristian> Kristian Nielsen
The patch makes later transactions wait for earlier transactions to commit before committing themselves.
Kristian> Hm, I now realise a problem with this approach :-/ Kristian> The problem is with MyISAM tables, or other tables that use table-level Kristian> locking instead of row-level locking. Kristian> If transactions T1 and T2 both update the same table mytable, then it can Kristian> happen that T2 runs first and gets a table lock on table mytable. T1 then has Kristian> to wait for T2 to commit, after which the table lock is released. But T2 will Kristian> never commit, it will wait for T1 to commit first with my patch. Thus we have Kristian> a deadlock. The simplest way to solve things like this is that for transactions that are using table handlers with table level locking, you start the transactions also in order; As soon as the table locks are taken (after open_-and-lock-tables) you can start the next transaction. Kristian> One way to fix this could be to extend your algorithm for detecting conflicts Kristian> between different transactions. If some table T uses table-level locking, then Kristian> put into the hash just the table name, without any primary key values. Then Kristian> any two such transactions will be considered conflicting, and we avoid the Kristian> deadlock. No loss of parallelism should occur, as two updates to the same Kristian> MyISAM table can anyway not run in parallel. Note that for Aria tables, multiple insert can be run at the same time. Regards, Monty
Kristian Nielsen
Hm, I now realise a problem with this approach :-/
The problem is with MyISAM tables, or other tables that use table-level locking instead of row-level locking.
If transactions T1 and T2 both update the same table mytable, then it can happen that T2 runs first and gets a table lock on table mytable. T1 then has to wait for T2 to commit, after which the table lock is released. But T2 will never commit, it will wait for T1 to commit first with my patch. Thus we have a deadlock.
Hm, no, actually I was wrong. There seems to be no deadlock with MyISAM tables. I am not sure yet, but I think what is happening is that MyISAM table locks are released during the "statement" commit. But the wait-for-prior-commit occurs only at the "transaction" commit, which happens after. This is as it should be, and it means that T2 can release table locks before waiting for T1, so T1 can complete and signal T2 to continue. So no deadlock. So this looks like it is not a problem after all. Of course, with MyISAM we will not really get ordered-commit semantics. It is not possible with MyISAM to run statements in parallel out-of-order without also seeing the updates out-of-order; this requires a transactional MVCC engine. But this is expected with MyISAM. So the problem I see in the test rpl_row_001 must be something else, still looking... - Kristian.
Kristian Nielsen
The problem is with MyISAM tables, or other tables that use table-level locking instead of row-level locking.
What about just run those txns serially rather than attempting to do in parallel? As soon as you see a MyISAM operation, have a thread that waits for all replication up to that point to be committed, and then do the MyISAM operation. (as there's no reason why other replication couldn't be applied at the same time... after all, MyISAM isn't crash safe so would never be consistent with InnoDB txns). -- Stewart Smith
Stewart Smith
What about just run those txns serially rather than attempting to do in parallel?
Maybe. Well, first we need to generalise "MyISAM" and "InnoDB" to a general way of putting any storage engine into one or the other category. I assume Monty or Serg can help me figure out how to do this.
As soon as you see a MyISAM operation, have a thread that waits for all replication up to that point to be committed, and then do the MyISAM operation. (as there's no reason why other replication couldn't be applied at the same time... after all, MyISAM isn't crash safe so would never be consistent with InnoDB txns).
Well, consistency is not just about crash safety. If you do out-of-order parallel replication, then the application can see a state on the slave that is impossible to reach on the master. Only the application developer can say if this is ok for the application. In the general case, it will not be. So this needs explicit configuration. But with an MVCC engine like InnoDB, changes only become visible to other transactions after COMMIT. So even if we apply the transactions in parallel in a potentially different order than on the master, if we commit in the same order as on the master, then there is no way for an application to see a state on the slave that did not also exist on the master. Thus in-order parallel replication is correct and can be safely enabled always. MyISAM is not MVCC nor does it have COMMIT, so it seems by default we should do what you suggest: run MyISAM transactions serially. In fact they have to run serially with respect to InnoDB transactions also, to avoid getting the slave into a state that might be invalid to the application. On the other hand, if an application can tolerate out-of-order, then it would be very useful to support also parallel apply for MyISAM. So it could be a configuration to allow out-of-order for MyISAM also. Of course, this will only be possible for updates to different MyISAM tables, due to table-level locking. Possible INSERTs could be allowed out-of-order, as they can sometimes happen in parallel even for MyISAM. - Kristian.
Hi!
"Kristian" == Kristian Nielsen
writes:
Kristian> Stewart Smith
What about just run those txns serially rather than attempting to do in parallel?
Kristian> Maybe. Well, first we need to generalise "MyISAM" and "InnoDB" to a general Kristian> way of putting any storage engine into one or the other category. I assume Kristian> Monty or Serg can help me figure out how to do this.
As soon as you see a MyISAM operation, have a thread that waits for all replication up to that point to be committed, and then do the MyISAM operation. (as there's no reason why other replication couldn't be applied at the same time... after all, MyISAM isn't crash safe so would never be consistent with InnoDB txns).
<cut> Kristian> MyISAM is not MVCC nor does it have COMMIT, so it seems by default we should Kristian> do what you suggest: run MyISAM transactions serially. In fact they have to Kristian> run serially with respect to InnoDB transactions also, to avoid getting the Kristian> slave into a state that might be invalid to the application. Note that MyISAM can mix SELECT and INSERT and these behave as there would be MVCC. Regards, Monty
participants (5)
-
Gordan Bobic
-
Kristian Nielsen
-
Michael Widenius
-
Stewart Smith
-
丁奇