Hi 丁奇,
Thanks for your answers; you have a good understanding of the potential issues
and how to solve them.
(I have replied to individual items below, but I mostly agree with your
answers.)
I have thought a bit more about the overall idea, and I quite like it. In a
way, this is the natural way to do a parallel slave: analyse the changes for
conflicts, and run in parallel any that do not conflict. So I think it is
great that you went ahead, actually tried this, and got some real code for it.
I will mention a couple more challenges that will need to be overcome. But
I hope you will first try to complete your plans as you explained in your
mail. This will allow us to see if it works in practice (I think it will), and
then we can work together to handle the possible challenges (which I am sure
can be overcome).
The challenges are mainly to get this working with other replication features
that are already in MariaDB 10.0 or are planned.
----
The first challenge is multi-source replication, which is now in 10.0. This is
the patch by Lixun Peng, which you may already know.
With multi-source replication, we already have multiple threads, one SQL
thread per master connection. In your parallel replication patch, we also have
multiple threads.
So now we could have 16 threads (parallel replication) for each master
connection (multi-source), and the thread handling starts to become a bit
complex. There are also a couple of other ideas for parallel replication that
we might want to do later (e.g. approaches that work for statement-based
binlog or for DDL); these will require more threads and even more complex
thread handling.
I think the way to solve this is to start with your plan, where you just have
16 (by default) threads. Then later we can extend this so that we have a
general pool of replication threads and some general mechanism for
distributing work to the appropriate thread. (I would be happy to help get
this working.)
Then eventually we will have just one pool of threads which are shared between
multi-source replication, your parallel replication, and whatever else might
be implemented in the future.
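To make that idea a bit more concrete, here is a rough sketch of what such a
shared worker pool could look like. All names are made up for illustration;
this is not existing server code, and the real thing would of course use the
server's own thread and synchronisation wrappers:

    #include <condition_variable>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    /* Hypothetical shared pool: multi-source SQL threads, parallel apply,
       and future features all dispatch work units here instead of each
       owning their own threads. */
    class rpl_worker_pool {
    public:
      explicit rpl_worker_pool(size_t n_threads) {
        for (size_t i= 0; i < n_threads; i++)
          workers.emplace_back([this] { worker_loop(); });
      }
      ~rpl_worker_pool() {
        {
          std::lock_guard<std::mutex> lk(mtx);
          shutdown= true;
        }
        cond.notify_all();
        for (std::thread &t : workers)
          t.join();
      }
      /* Queue one unit of work (for example, one event group to apply). */
      void dispatch(std::function<void()> work) {
        {
          std::lock_guard<std::mutex> lk(mtx);
          work_queue.push(std::move(work));
        }
        cond.notify_one();
      }
    private:
      void worker_loop() {
        for (;;) {
          std::function<void()> work;
          {
            std::unique_lock<std::mutex> lk(mtx);
            cond.wait(lk, [this] { return shutdown || !work_queue.empty(); });
            if (shutdown && work_queue.empty())
              return;
            work= std::move(work_queue.front());
            work_queue.pop();
          }
          work();                         /* apply one unit of work */
        }
      }
      std::vector<std::thread> workers;
      std::queue<std::function<void()>> work_queue;
      std::mutex mtx;
      std::condition_variable cond;
      bool shutdown= false;
    };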
----
The second challenge is the commit order, and global transaction ID.
With your patch, transactions on the slave can be committed in a different
order than on the master (because they are run in parallel). This means that
the order in the binlog on the slave (slave-bin.XXXXXX) will be different from
on the master (master-bin.XXXXXX).
This makes it harder if the old master is removed and one of the slaves
should become the new master: the different slaves will have transactions in
a different order, and it will be hard to know which transactions from the new
master still need to be applied and which have already been applied.
The MySQL 5.6 implementation of global transaction ID has a way to handle
this, but it is very complex and has some problems. We plan to do another
design (MDEV-26, https://mariadb.atlassian.net/browse/MDEV-26) which requires
that transactions are committed in the same order on the slave as they are on
the master.
Besides, if transactions are committed in a different order on the slave,
some applications may have problems if they require that SELECT sees
transactions in the same order on all slaves (other applications will not
have a problem with this; it depends on the application).
I think this can be solved by implementing a configuration option that
controls whether commits in parallel replication happen in the same order as
on the master. If enabled, each thread will apply its transaction in parallel
as normal, but wait (on a condition variable) for the previous transaction to
commit before committing itself. If disabled, we do what your patch does now.
Then the user can choose to keep the order, which is safe for all applications
and works with global transaction ID, but is somewhat less parallel; or to run
at full parallelism, and accept that commits happen in a different order on
the slave.
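Just to illustrate what I mean by the wait, here is a minimal sketch with
made-up names (not real code from the server or from your patch). The
sequence number would simply be the position of the transaction in the master
binlog order, assigned when the transaction is read from the relay log:

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    struct commit_order_control {
      std::mutex mtx;
      std::condition_variable cond;
      uint64_t last_committed_seq= 0;   /* last committed position, in master binlog order */
    };

    /* Each worker applies its transaction in parallel, but calls this just
       before committing, so commits happen in master order. */
    void wait_for_prior_commit(commit_order_control &co, uint64_t my_seq) {
      std::unique_lock<std::mutex> lk(co.mtx);
      co.cond.wait(lk, [&] { return co.last_committed_seq + 1 == my_seq; });
    }

    /* Called after a successful commit, waking workers queued behind us. */
    void signal_commit_done(commit_order_control &co, uint64_t my_seq) {
      {
        std::lock_guard<std::mutex> lk(co.mtx);
        co.last_committed_seq= my_seq;
      }
      co.cond.notify_all();
    }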
----
丁奇 wrote:
1. There indeed is a possible invalid memory access in get_pk_value(). I have changed the definition of st_hash_item::key to char[1024+2*NAME_CHAR_LEN+2], and when building the hash key: if (item->key_len + pack_length >= 1024) break; This guarantees that, even if the total length of the primary key is bigger than 1024, at least one key_part of the key can be recorded into the hash_key (as the max length of one key_part is 1000 in MySQL).
Ok, sounds good.
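Just to check that I understand the fix correctly, I picture the key building
roughly like this (the field names and the helper are simplified and
hypothetical, not your actual code):

    #include <cstddef>
    #include <cstring>

    static const size_t NAME_CHAR_LEN= 64;     /* max identifier length, as in the server */

    struct st_hash_item {
      char key[1024 + 2 * NAME_CHAR_LEN + 2];  /* db + table + packed PK value */
      size_t key_len;                          /* bytes used in key[] so far */
    };

    /* Append one packed key part to the hash key; stop once the 1024-byte
       budget would be exceeded, keeping whatever was already recorded. */
    bool append_key_part(st_hash_item *item, const char *packed, size_t pack_length) {
      if (item->key_len + pack_length >= 1024)  /* the check you describe */
        return false;
      memcpy(item->key + item->key_len, packed, pack_length);
      item->key_len+= pack_length;
      return true;
    }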
2. The problem of case insensitivity is a bug. I will fix it in the next version. We can simply check the column definition in the table schema and decide whether to convert the string to lower case before adding it to the pk_hash.
Yes, agree.
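One small note: the lower-casing will need to be charset-aware (using the
column's CHARSET_INFO), not a plain ASCII tolower. As a rough sketch of the
idea only (ASCII-only here, purely for illustration):

    #include <cctype>
    #include <string>

    /* Normalize a key value before it goes into the pk_hash, so that values
       which compare equal under a case-insensitive collation hash the same. */
    std::string normalize_key_value(const std::string &value, bool case_insensitive) {
      if (!case_insensitive)
        return value;                           /* binary / case-sensitive column */
      std::string lowered= value;
      for (size_t i= 0; i < lowered.size(); i++)
        lowered[i]= static_cast<char>(std::tolower(static_cast<unsigned char>(lowered[i])));
      return lowered;
    }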
4. When a deadlock occurs, the whole transaction needs to be retried. In the current implementation, a whole transaction is packed into one "Query", so retrying a query is the same as retrying one transaction.
Ah, yes you are right, missed that. Thanks for the explanation.
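So the retry can simply re-run the whole packed unit; something like this
sketch (hypothetical names, and in the real server the retry limit would come
from an option such as slave_transaction_retries):

    #include <functional>

    enum apply_result { APPLY_OK, APPLY_DEADLOCK, APPLY_ERROR };

    /* apply_trx() re-applies the whole packed transaction each time it is
       called, since the transaction is one work unit ("Query") in the patch. */
    bool apply_with_retry(const std::function<apply_result()> &apply_trx, int max_retries)
    {
      for (int attempt= 0; attempt <= max_retries; attempt++)
      {
        apply_result res= apply_trx();
        if (res == APPLY_OK)
          return true;                  /* committed */
        if (res != APPLY_DEADLOCK)
          return false;                 /* real error: do not retry */
        /* Deadlock: the engine rolled the victim back, so just run it again. */
      }
      return false;                     /* gave up after max_retries deadlocks */
    }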
As I told you before, I am changing the patch to make the patched mysqld a "pure" slave. In pure slave mode, all other events will be treated like a DDL statement. That means a User_var_event will wait for all the worker queues to be empty, and then call ev->apply_event. Is this strategy suitable? Please point out if there are potential problems.
Yes, I think this will work fine. There are lots of details with other events, but they can be handled like DDL, and then most things should just work. There will probably be some details and corner cases to work out, but I think it should be manageable; we can deal with them at the appropriate time.
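For reference, this is how I picture the serial path (drain all worker
queues, then apply the event alone in the coordinator), as a sketch with
made-up queue and event types rather than the real Log_event classes:

    #include <condition_variable>
    #include <mutex>
    #include <vector>

    struct worker_queue {
      std::mutex mtx;
      std::condition_variable empty_cond;
      size_t pending= 0;                /* events queued but not yet applied;
                                           workers decrement this and signal
                                           empty_cond when they finish one */

      void wait_until_empty() {
        std::unique_lock<std::mutex> lk(mtx);
        empty_cond.wait(lk, [this] { return pending == 0; });
      }
    };

    struct Log_event_stub {             /* stand-in for Log_event */
      virtual int apply_event() = 0;    /* stand-in for ev->apply_event(...) */
      virtual ~Log_event_stub() {}
    };

    /* Serial path for DDL, User_var_event, etc.: only runs after every
       worker has finished its queued work. */
    int apply_serially(std::vector<worker_queue> &queues, Log_event_stub *ev) {
      for (worker_queue &q : queues)
        q.wait_until_empty();           /* barrier: drain all parallel work */
      return ev->apply_event();         /* now safe to apply alone */
    }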
About some strategies: 1. For simplicity, I used some sleep(1000000) calls in the patch; a condition variable would be better. It will be changed in the future, but it is not the highest priority.
Yes, agree. It is a good idea to start with a simple thread scheduling strategy; if that works, then we can improve things later.
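When you do get to it, the change is basically to block on a condition
variable until work arrives instead of sleeping and re-checking; roughly like
this (illustrative types only, the real code would use the server's own
mutex/cond wrappers and wait states):

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    /* The coordinator pushes events, the worker blocks until one arrives,
       instead of the sleep(1000000) polling loop. */
    struct event_queue {
      std::mutex mtx;
      std::condition_variable cond;
      std::queue<int> events;           /* placeholder for queued events */

      void push(int ev) {
        {
          std::lock_guard<std::mutex> lk(mtx);
          events.push(ev);
        }
        cond.notify_one();              /* wake the worker right away */
      }

      int pop() {
        std::unique_lock<std::mutex> lk(mtx);
        /* Old way: while (empty) { unlock; sleep(1000000); lock; } */
        cond.wait(lk, [this] { return !events.empty(); });
        int ev= events.front();
        events.pop();
        return ev;
      }
    };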
2. Huge transactions may lead to high memory usage and CPU load; thank you for pointing it out, I had not thought of it before. I think we can deal with it this way: if a transaction contains too many events, say more than a certain number, we can treat it like a DDL statement. Because it will be executed after all the worker queues are empty, there is no "order problem" here, so we do not need to construct the pk_hash. Please give me some suggestions on this issue.
Yes, I agree. If a transaction is too big, we can fall back to applying it serially.
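And the decision itself can be a simple threshold on the number of events in
the transaction, routing big ones to the serial (DDL-like) path; for example
(the option name and value are made up, just to show the shape):

    #include <cstddef>

    static const size_t opt_parallel_max_trx_events= 5000;   /* hypothetical option */

    enum apply_path { APPLY_PARALLEL, APPLY_SERIAL_LIKE_DDL };

    apply_path choose_apply_path(size_t n_events_in_trx) {
      if (n_events_in_trx > opt_parallel_max_trx_events)
        return APPLY_SERIAL_LIKE_DDL;   /* too big: no pk_hash, run alone, no order problem */
      return APPLY_PARALLEL;            /* normal case: build pk_hash, run in a worker */
    }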
My next plan is: 1. Fix the case insensitivity bug that you mentioned before. 2. Run the mysql-test suites and pass all the tests.
Sounds good! Looking forward to seeing the result. - Kristian.