Re: [Maria-developers] [Commits] Rev 4376: MDEV-6676: Speculative parallel replication in http://bazaar.launchpad.net/~maria-captains/maria/10.0
Intermediate commit. The patch is far from complete, but this small patch was nevertheless sufficient to run sysbench-0.4 OLTP with full parallelisation.

"full parallelisation" - does that mean that X threads on the master make the slave achieve k*X higher throughput?

/Jonas
Jonas Oreland <jonaso@google.com> writes:

Hi Jonas,

I actually was planning to discuss this with you, as it is based on some of the ideas you mentioned earlier on parallel replication...
Intermediate commit. The patch is far from complete, but this small patch was nevertheless sufficient to run sysbench-0.4 OLTP with full parallelisation.

"full parallelisation" - does that mean that X threads on the master make the slave achieve k*X higher throughput?
Hm, actually, it's not related to threads on the _master_ at all. Rather, it is potentially a throughput of k*Y, where Y is the number of worker threads on the _slave_ - up to some limit of scalability, of course.

Suppose in the binlog we have transactions T1, T2, T3, T4. With this patch, we are going to try to replicate _all_ of them in parallel (up to a maximum of Y).

If the transactions are non-conflicting, then great, everything will work fine and we will still commit them in the correct order, so applications will not see any difference.

But suppose e.g. T3 modifies the same row as T1, and T3 manages to touch the row first. In this case, T1 will need to wait for T3. This is detected as a deadlock (because T3 must eventually wait for T1 to commit before it). So we roll back T3, allowing T1 to continue, and later re-try T3.

So it is safe to try to run everything in parallel, at least for transactional events that can be safely rolled back.

The only catch seems to be if there are a lot of potential conflicts in the application load. Then we could end up with too many rollbacks, causing throughput to decrease rather than increase.

The next step is to add some flags to the GTID event on the master, and use those flags to control what to run in parallel on the slave:

- If DDL or non-transactional tables are involved, set a flag to not run this event group in parallel with those that come before or after.
- Remember on the master if a transaction had to do a lock wait on another transaction; in this case it seems likely that a similar wait could be needed on the slave, so do not start this transaction in parallel with any earlier ones.
- Maybe we can have a flag for "large" transactions that modify many rows; we could choose not to run those in parallel with earlier transactions, to avoid the need for expensive rollback of lots of rows.
- Allow the user to set some @@rpl_not_parallel variable, to explicitly annotate transactions that are known to be likely to conflict, and hence not worth trying to run in parallel.

This should be simple to do. Later we could also think about adding checks on the slave to further control what to do in parallel; however, I have not thought much about this.

This patch seems to have a lot of potential to finally get a good solution to the single-threaded slave problem. But testing against real-life workloads will be needed to understand how to balance the speculative parallelisation against avoiding excessive rollbacks.

 - Kristian.
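[Editor's note: to make the conflict scenario concrete, here is a small SQL sketch. The table t1 and its statements are invented for illustration, and @@rpl_not_parallel is only the proposed name from the list above, not an existing variable.]

    -- Two transactions from the master binlog that happen to touch the same row
    -- (hypothetical table t1 with primary key a and integer column b).
    -- T1, earlier in binlog order:
    BEGIN; UPDATE t1 SET b = b + 1 WHERE a = 1; COMMIT;
    -- T3, later in binlog order:
    BEGIN; UPDATE t1 SET b = b * 2 WHERE a = 1; COMMIT;

    -- If a slave worker runs T3's UPDATE before T1's, T1 blocks on T3's row lock,
    -- while T3 must in turn wait for T1 to commit first to preserve commit order.
    -- The cycle is detected as a deadlock: T3 is rolled back, T1 proceeds, and
    -- T3 is retried afterwards.

    -- Proposed annotation from the list above (hypothetical, not implemented):
    SET SESSION rpl_not_parallel = 1;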
ack

/Jonas
Hi Kristian,

There is one thing I have never understood about your parallel apply algorithm. How do you handle the case where the server crashes when some threads have committed but others have not? It seems as if you could have a problem with recovery.

Cheers, Robert Hodges
Why? Transactions are committed *in order*.

/Jonas
Robert Hodges <robert.hodges@continuent.com> writes:
There is one thing I have never understood about your parallel apply algorithm. How do you handle the case where the server crashes when some threads have committed but others have not? It seems as if you could have a problem with recovery.
Transactions are executed in parallel, but they are committed sequentially. So if we crash, there is always a single point before which all transactions are committed, and after which no transactions are.

If using InnoDB and global transaction ID, this point is saved in a crash-safe way, so there are no problems with crash recovery.

If using MyISAM, or if using the old filename/offset position in relay-log.info, then a crash may leave an inconsistency - but this is also the case with non-parallel replication.

A substantial effort was spent in the code to make the sequential commit of the parallelly applied transactions efficient. For example, it is possible for such transactions to be group committed.

 - Kristian.
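[Editor's note: for the InnoDB-plus-GTID case, the crash-safe position is the mysql.gtid_slave_pos table, which is updated inside the same transaction as the replicated changes. A minimal example of inspecting and using it, with standard MariaDB 10.0 syntax:]

    -- The slave's replicated GTID position, updated transactionally with each commit:
    SELECT * FROM mysql.gtid_slave_pos;
    SELECT @@gtid_slave_pos;

    -- Use GTID-based positioning so a restarted slave resumes exactly from that point:
    STOP SLAVE;
    CHANGE MASTER TO master_use_gtid = slave_pos;
    START SLAVE;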
Thanks Kristian. That's a nice optimization that is difficult to do outside the DBMS engine.

Cheers, Robert
Robert Hodges <robert.hodges@continuent.com> writes:
Thanks Kristian. That's a nice optimization that is difficult to do outside the DBMS engine.
Indeed. We might be able to expose it to you, though. Like SET SESSION wait_for_other_commit=<processid>, or something.

 - Kristian.
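[Editor's note: a rough sketch of how such a knob might be used from two client connections. This is purely hypothetical - wait_for_other_commit does not exist - and the connection id and tables are invented.]

    -- connection A:
    SELECT CONNECTION_ID();               -- say it returns 10
    BEGIN;
    INSERT INTO t1 VALUES (1);

    -- connection B (hypothetical variable):
    SET SESSION wait_for_other_commit = 10;
    BEGIN;
    INSERT INTO t2 VALUES (2);
    COMMIT;                               -- would not complete before connection 10 commits

    -- connection A:
    COMMIT;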
Perhaps slightly more useful would be a SET SESSION inherit_transaction=<process_id>.

Use cases:

- Multi-threaded backup with mysqldump --single-transaction, with each thread fetching a table.
- An application that wants to do a number of read/write queries in parallel, all at the same isolation level. A commit or rollback will apply to all transactions.
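[Editor's note: a sketch of how the proposed inherit_transaction variable might look for the parallel-dump use case. The variable is hypothetical (not implemented), and the connection id and table names are invented.]

    -- coordinating connection (connection id 20):
    START TRANSACTION WITH CONSISTENT SNAPSHOT;

    -- each dump worker (hypothetical variable):
    SET SESSION inherit_transaction = 20;
    SELECT * FROM db1.t1;                 -- worker 1, reads from the shared snapshot
    SELECT * FROM db1.t2;                 -- worker 2, same snapshot

    -- A COMMIT or ROLLBACK on connection 20 would end the shared transaction for all workers.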
--
Daniel Black, Engineer @ Open Query (http://openquery.com.au)
Remote expertise & maintenance for MySQL/MariaDB server environments.
Hi Kristian,

That's an intriguing proposal. There are numerous applications, involving tasks like restoring backups in parallel, where being able to ensure all-or-nothing commit across a group of client threads would be incredibly handy. To my knowledge no other RDBMS offers it. It's also a great feature for data warehouse loading. We usually have to fake it in Tungsten with tricks to make transactions idempotent. These work poorly in conventional RDBMSs due to constraint checking.

What would be cool is something like a group transaction that other threads can join so that the commit becomes atomic. Most of the parallel load use cases I can think of don't require you to coordinate things like isolation because they are conflict free anyway--it's just atomic group commit.

I had a quick look in Weikum and Vossen (2002). If you have that handy, there are several interesting discussions about parallel transactions. It sounds as if you have something similar baked in and just need to expose it. (But you probably read W&V closely already before writing any code. :)

Cheers, Robert
Robert Hodges <robert.hodges@continuent.com> writes:
What would be cool is something like a group transaction that other threads can join so that the commit becomes atomic. Most of the parallel load use cases I can think of don't require you to coordinate things like isolation because they are conflict free anyway--it's just atomic group commit.
Hm, I don't quite have this. Given T1 and T2, there is a facility to ensure that T2 will commit no sooner than T1. But there is no facility to prevent T1 from committing before T2. (This is what was needed for parallel replication.)

 - Kristian.
How does Galera handle this in its parallel apply threads? Especially the case below, noted on MDEV-6676:

T1: DELETE FROM t1 WHERE a=1
T2: INSERT INTO t1 SET a=1

--
Daniel Black, Engineer @ Open Query (http://openquery.com.au)
Remote expertise & maintenance for MySQL/MariaDB server environments.
In that case it gets complex for outsiders to figure out how to restart if there's a failure. Let's say my transaction fails after T1 commits but before T2 commits. Then on restart I have to regenerate T2 and rerun it. That could be hard if T2 contains statements that came before T1 in the original, serialized log.
Without a "group" transaction another approach is to keep separate restart points for each worker thread. It works well as long as you can deterministically assign transactions to threads and keep those same threads from getting too far apart. The logic to do this is non-trivial, so it's not an approach for everybody.

Cheers, Robert
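[Editor's note: a generic sketch of what per-worker restart points could look like in SQL - an illustrative schema only, not Tungsten's actual catalog.]

    -- One row per apply worker, updated in the same transaction as the applied changes:
    CREATE TABLE restart_points (
      worker_id  INT PRIMARY KEY,
      last_seqno BIGINT NOT NULL
    ) ENGINE=InnoDB;

    -- Worker 3 records its progress as part of applying transaction 12345:
    UPDATE restart_points SET last_seqno = 12345 WHERE worker_id = 3;

    -- After a crash, each worker resumes from its own last_seqno.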
Robert Hodges <robert.hodges@continuent.com> writes:
In that case it gets complex for outsiders to figure out how to restart if there's a failure. Let's say my transaction fails after T1 commits but before T2 commits. Then on restart I have to regenerate T2 and rerun it. That could be hard if T2 contains statements that came before T1 in the original, serialized log.
Agree, there are different approaches possible.

The facility in MariaDB discussed here is precisely to apply in parallel a serialized log. We apply the transactions in parallel, but the commits happen in the original serial order. This makes the parallel apply transparent to applications, assuming MVCC.

Each transaction inserts in a table its own monotonic transaction number. So after a restart, it is easy to find the point at which to resume - just find the highest number in the table.

One limitation of this approach is that with N worker threads, we can never execute more than N transactions ahead of the slowest transaction in the log, because every thread needs to wait for the prior commit.
Without a "group" transaction another approach is to keep separate restart points for each worker thread. It works well as long as you can deterministically assign transactions to threads and keep those same threads from getting too far apart. The logic to do this is non-trivial, so it's not an approach for everybody.
MariaDB also supports this approach. Here, the user explicitly assigns each transaction to a replication domain, and restart point is kept per-domain (not per-transaction). Different domains commit independently and out-of-order with respect to one another. Within one domain, commit order is strictly serialised.

I like the approach with replication domains. Not only does it avoid the need for non-trivial logic to assign transactions to different threads and restart points. It also ensures that those restart points make sense to the user/DBA.

 - Kristian.
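[Editor's note: as a concrete illustration with existing MariaDB 10.0 GTID syntax; the domain number and the example position string are made up.]

    -- On the master: put an independent stream (e.g. the "lumpy" batch jobs)
    -- into its own replication domain for this session:
    SET SESSION gtid_domain_id = 1;
    -- ... run the batch transactions ...

    -- On the slave: one restart point per domain, e.g. '0-1-1000,1-1-20':
    SELECT @@gtid_slave_pos;
    -- Backed by the crash-safe table:
    SELECT * FROM mysql.gtid_slave_pos;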
On Tue, Sep 16, 2014 at 12:38 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Robert Hodges <robert.hodges@continuent.com> writes:
In that case it gets complex for outsiders to figure out how to restart if there's a failure. Let's say my transaction fails after T1 commits but before T2 commits. Then on restart I have to regenerate T2 and rerun it. That could be hard if T2 contains statements that came before T1 in the original, serialized log.
Agree, there are different approaches possible.
The facility in MariaDB discussed here is precisely to apply in parallel a serialized log. We apply the transactions in parallel, but the commits happen in the original serial order. This makes the parallel apply transparent to applications, assuming MVCC.
Each transaction inserts in a table its own monotonic transaction number. So after a restart, it is easy to find the point at which to resume - just find the highest number in the table.
One limitation of this approach is that with N worker threads, we can never execute more than N transactions ahead of the slowest transaction in the log, because every thread needs to wait for the prior commit.
Right. I thought about that problem a lot in the Tungsten parallel apply design and ended up with an approach that allows workers to diverge by several minutes or longer. This enables Tungsten to maintain good throughput even in the face of lumpy workloads that contain transactions ranging from single inserts to updates involving hundreds of thousands or millions of rows. We did some early performance work in production environments that showed the need for wide divergence to avoid serialization around the "lumps" in the load.
Without a "group" transaction another approach is to keep separate restart points for each worker thread. It works well as long as you can deterministically assign transactions to threads and keep those same threads from getting too far apart. The logic to do this is non-trivial, so it's not an approach for everybody.
MariaDB also supports this approach. Here, the user explicitly assigns each transaction to a replication domain, and restart point is kept per-domain (not per-transaction). Different domains commit independently and out-of-order with respect to one another. Within one domain, commit order is strictly serialised.
I like the approach with replication domains. Not only does it avoid the need for non-trivial logic to assign transactions to different threads and restart points. It also ensures that those restart points make sense to the user/DBA.
So are replication domains "shards"? My definition of a shard in this context is a causally independent stream of transactions, which is effectively a partial order within the fully serialized log. That's an excellent feature. Assuming that's what you have done, how do you handle operations like CREATE USER that are global in effect? (Just point me to docs or your blog if you wrote it up. I would love to learn more.)

Cheers, Robert
Robert Hodges <robert.hodges@continuent.com> writes:
Right. I thought about that problem a lot in the Tungsten parallel apply design and ended up with an approach that allows workers to diverge by several minutes or longer. This enables Tungsten to maintain good throughput even in the face of lumpy workloads that contain transactions
So are replication domains "shards"? My definition of a shard in this context is a causally independent stream of transactions, which is effectively a partial order within the fully serialized log. That's an excellent feature. Assuming that's what you have done, how do you handle operations like CREATE USER that are global in effect?
Yes, it sounds like replication domains are basically the same as shards.

So in MariaDB, I suppose my approach is that we will try to do some amount of parallelisation automatically, and completely transparently to all applications (this is the in-order parallel replication). If that is not sufficient, the user can additionally help by splitting their load into replication domains, e.g. to put the "lumps" in a separate domain, which will allow other transactions to execute ahead.

And when splitting into separate domains, the burden falls on the user/application to ensure that different domains can replicate independently. So for something like CREATE USER or CREATE TABLE and the like, it will be necessary to ensure manually that all slaves have replicated the statement with global effect before doing dependent transactions in a separate domain on the master. One way to ensure this is to run MASTER_GTID_WAIT() on all slaves with the @@LAST_GTID of the statement from the master.
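[Editor's note: a sketch of that manual fence, using existing MariaDB syntax; the user name and the GTID value are made up.]

    -- On the master, in the domain carrying the statement with global effect:
    CREATE USER 'app'@'%' IDENTIFIED BY 'secret';
    SELECT @@last_gtid;                   -- say it returns '0-1-4711'

    -- Run on every slave; once all of them have returned, it is safe to issue
    -- dependent transactions in a separate domain on the master:
    SELECT MASTER_GTID_WAIT('0-1-4711');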
(Just point me to docs or your blog if you wrote it up. I would love to learn more.)
Docs are here:

https://mariadb.com/kb/en/mariadb/documentation/replication-cluster-multi-ma...
https://mariadb.com/kb/en/mariadb/documentation/replication-cluster-multi-ma...

I wrote some stuff on my blog:

http://kristiannielsen.livejournal.com/18435.html
http://kristiannielsen.livejournal.com/16826.html
http://kristiannielsen.livejournal.com/17008.html
http://kristiannielsen.livejournal.com/17238.html
http://kristiannielsen.livejournal.com/18308.html

I notice that I wrote mostly about global transaction ID, and less about the parallel replication. Well, they are strongly interdependent, and e.g. the replication domains are well explained in my writings on GTID, I hope. Though some features of parallel replication can be used even without GTID.

 - Kristian.
I have done some more patches on this, and basic functionality should be there now in this tree:

    lp:~maria-captains/maria/10.0-knielsen

Let me know if you want to experiment with it, and I can provide more details...

 - Kristian.
participants (4)

- Daniel Black
- Jonas Oreland
- Kristian Nielsen
- Robert Hodges