Re: [Maria-developers] [Maria-discuss] Known limitation with TokuDB in Read Free Replication & parallel replication ?
[Moving the discussion to maria-developers@, hope that is ok/makes sense...]

Ok, so here is a proof-of-concept patch for this, which seems to make TokuDB work with optimistic parallel replication.

The core of the patch is this line in lock_request.cc:

    lock_wait_callback(callback_data, m_txnid, conflicts.get(i));

which ends up doing this:

    thd_report_wait_for (requesting_thd, blocking_thd);

All the rest of the patch is just getting the right information around between the different parts of the code.

I put this on top of Jocelyn Fournier's tokudb_rpl.rpl_parallel_optimistic patches, and pushed it on my github:

https://github.com/knielsen/server/tree/toku_opr2

With this patch, the test case passes! So that's promising.

Some things still left to do for this to be a good patch:

- I think the callback needs to trigger also for an already waiting transaction, in case another transaction arrives later to contend for the same lock, but happens to get the lock earlier. I can look into this.
- This patch needs linear time (in number of active transactions) per callback to find the THD from the TXNID, maybe that could be optimised.
- Probably the new callback etc. needs some cleanup to better match TokuDB code organisation and style.
- And testing, of course. I'll definitely need some help there, as I'm not familiar with how to run TokuDB efficiently.

Any thoughts or comments?

- Kristian.
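As an illustration of the callback path and of the TXNID-to-THD lookup mentioned in the to-do list, here is a minimal sketch. It is not the patch itself: `TxnId`, `Thd`, `TxnRegistry` and `report_conflicts` are hypothetical names, and the `ReportWaitFor` function object only stands in for the server's thd_report_wait_for hook. The sketch assumes a hash map is an acceptable way to avoid the linear scan.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real TXNID and THD types live in TokuFT and the server.
using TxnId = std::uint64_t;
struct Thd { int thread_id; };

// Maps active transaction ids to their server threads, so a lookup is O(1)
// on average instead of a scan over all active transactions.
class TxnRegistry {
public:
    void add(TxnId id, Thd* thd) {
        std::lock_guard<std::mutex> g(mu_);
        by_txn_[id] = thd;
    }
    void remove(TxnId id) {
        std::lock_guard<std::mutex> g(mu_);
        by_txn_.erase(id);
    }
    Thd* find(TxnId id) const {
        std::lock_guard<std::mutex> g(mu_);
        auto it = by_txn_.find(id);
        return it == by_txn_.end() ? nullptr : it->second;
    }
private:
    mutable std::mutex mu_;
    std::unordered_map<TxnId, Thd*> by_txn_;
};

// Stand-in for the server-side hook (thd_report_wait_for in the thread).
using ReportWaitFor = std::function<void(Thd* requesting, Thd* blocking)>;

// Called once per blocked lock request with the list of conflicting
// transactions. Returns how many wait-for edges were actually reported.
inline int report_conflicts(const TxnRegistry& reg, TxnId requester,
                            const std::vector<TxnId>& conflicts,
                            const ReportWaitFor& report) {
    Thd* req = reg.find(requester);
    if (req == nullptr) return 0;
    int reported = 0;
    for (TxnId blocker : conflicts) {
        Thd* blk = reg.find(blocker);
        if (blk != nullptr && blk != req) {  // skip unknown transactions and self
            report(req, blk);
            ++reported;
        }
    }
    return reported;
}
```

A transaction would be added to the registry when it begins and removed when it commits or rolls back; the mutex is there because callbacks can arrive from several replication worker threads at once.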
Hello Kristian,
I am running your opt2 branch with a small sysbench oltp test (1 table,
1000 rows, 8 threads). the good news is that the slave stalls due to lock
timeouts are gone. the bad news is that the slave performance is suspect.
when slave in conservative mode with 2 threads, the tokudb wait for
callback is being called (i put in a "printf"), which implies a parallel
lock conflict. I assumed that conservative mode implies parallel execution
of transactions that were group committed together, which I assumed would
imply that these transactions were conflict free. Obviously not the case.
when slave in optimistic mode with 8 threads, i see very high slave query
execution times in processlist.
+----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+----------+
| Id | User        | Host      | db   | Command | Time | State                                         | Info             | Progress |
+----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+----------+
|  6 | root        | localhost | NULL | Query   |    0 | init                                          | show processlist |    0.000 |
| 16 | system user |           | NULL | Connect |  383 | Waiting for master to send event              | NULL             |    0.000 |
| 17 | system user |           | NULL | Connect |    7 | Waiting for prior transaction to commit       | NULL             |    0.000 |
| 18 | system user |           | NULL | Connect |    3 | Waiting for prior transaction to commit       | NULL             |    0.000 |
| 19 | system user |           | NULL | Connect |    3 | Waiting for prior transaction to commit       | NULL             |    0.000 |
| 20 | system user |           | NULL | Connect |    3 | Delete_rows_log_event::find_row(-1)           | NULL             |    0.000 |
| 21 | system user |           | NULL | Connect |    3 | Waiting for prior transaction to commit       | NULL             |    0.000 |
| 22 | system user |           | NULL | Connect |    3 | Waiting for prior transaction to commit       | NULL             |    0.000 |
| 23 | system user |           | NULL | Connect |    7 | Waiting for prior transaction to commit       | NULL             |    0.000 |
| 24 | system user |           | NULL | Connect |    3 | Waiting for prior transaction to commit       | NULL             |    0.000 |
| 25 | system user |           | NULL | Connect |  382 | Waiting for room in worker thread event queue | NULL             |    0.000 |
+----+-------------+-----------+------+---------+------+-----------------------------------------------+------------------+----------+
It appears that there is some MULTIPLE SECOND STALL somewhere. gdb shows
that the threads are either
(1) waiting in the tokudb lock manager, or
(2) waiting in the wait_for_commit::wait_for_prior_commit2 function.
Hello Kristian,
I suspect that the poor slave replication performance for optimistic
replication occurs because TokuDB does not implement the kill_query
handlerton function. kill_handlerton gets called to resolve lock wait for
situations that occur when parallel replicating a small sysbench table.
InnoDB implements kill_query while TokuDB does not implement it.
Rich Prohaska writes:

> I suspect that the poor slave replication performance for optimistic replication occurs because TokuDB does not implement the kill_query handlerton function. kill_handlerton gets called to resolve lock wait for situations that occur when parallel replicating a small sysbench table. InnoDB implements kill_query while TokuDB does not implement it.

Possibly, but I'm not sure it's that important. The kill will be effective as soon as the wait is over.

I'm thinking that it's just because my patch is incomplete - it only handles the case where transaction T1 goes to wait on T2 and T2 is already holding the lock. If the lock is later passed to T3 (while T1 is still waiting), then my patch doesn't handle killing T3. So T1 will need to wait for its lock wait timeout to trigger, and then it will be re-tried - and _then_ T3 will be killed.

At least it looks a bit like that is what is happening in your processlist output. But I'll need to do some tests to be sure. And I think I know how to fix my patch, hopefully I'll have something in a day or two.

> when slave in conservative mode with 2 threads, the tokudb wait for callback is being called (i put in a "printf"), which implies a parallel lock conflict. I assumed that conservative mode implies parallel execution of transactions that were group committed together, which I assumed would imply that these transactions were conflict free. Obviously not the case.

This is interesting. Is there somewhere I can read details of how TokuDB does lock waits? That would help me understand what is going on.

We actually have the same situation in InnoDB in some cases. For example:

    CREATE TABLE t4 (a INT PRIMARY KEY, b INT, KEY b_idx(b)) ENGINE=InnoDB;
    INSERT INTO t4 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);
    UPDATE t4 SET b=NULL WHERE a=6;
    DELETE FROM t4 WHERE b <= 3;

The UPDATE and DELETE may or may not conflict, depending on the order in which they run. So it is possible for them to group commit together on the master, but still conflict on the slave. Maybe something similar is possible in TokuDB?

Another option is that some of the callbacks are false positives. Lock waits should only be reported if they are for locks that will be held until COMMIT. For example in InnoDB, there are shorter-lived locks on the auto-increment counter, and such locks should _not_ be reported.

- Kristian.
Hello Kristian,

On Sun, Aug 14, 2016 at 1:51 PM, Kristian Nielsen wrote:

> Rich Prohaska writes:
>
>> I suspect that the poor slave replication performance for optimistic replication occurs because TokuDB does not implement the kill_query handlerton function. kill_handlerton gets called to resolve lock wait for situations that occur when parallel replicating a small sysbench table. InnoDB implements kill_query while TokuDB does not implement it.
>
> Possibly, but I'm not sure it's that important. The kill will be effective as soon as the wait is over.
>
> I'm thinking that it's just because my patch is incomplete - it only handles the case where transaction T1 goes to wait on T2 and T2 is already holding the lock. If the lock is later passed to T3 (while T1 is still waiting), then my patch doesn't handle killing T3. So T1 will need to wait for its lock wait timeout to trigger, and then it will be re-tried - and _then_ T3 will be killed.
>
> At least it looks a bit like that is what is happening in your processlist output. But I'll need to do some tests to be sure. And I think I know how to fix my patch, hopefully I'll have something in a day or two.

tokudb lock timeouts are resolving the replication stall. unfortunately, the tokudb lock timeout is 4 seconds, so the throughput is almost zero.

>> when slave in conservative mode with 2 threads, the tokudb wait for callback is being called (i put in a "printf"), which implies a parallel lock conflict. I assumed that conservative mode implies parallel execution of transactions that were group committed together, which I assumed would imply that these transactions were conflict free. Obviously not the case.
>
> This is interesting. Is there somewhere I can read details of how TokuDB does lock waits? That would help me understand what is going on.

TokuFT implements pessimistic locking and 2 phase locking algorithms. This wiki describes locking and concurrency in a little more detail: https://github.com/percona/tokudb-engine/wiki/Transactions-and-Concurrency.

> We actually have the same situation in InnoDB in some cases. For example:
>
>     CREATE TABLE t4 (a INT PRIMARY KEY, b INT, KEY b_idx(b)) ENGINE=InnoDB;
>     INSERT INTO t4 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);
>     UPDATE t4 SET b=NULL WHERE a=6;
>     DELETE FROM t4 WHERE b <= 3;
>
> The UPDATE and DELETE may or may not conflict, depending on the order in which they run. So it is possible for them to group commit together on the master, but still conflict on the slave. Maybe something similar is possible in TokuDB?
>
> Another option is that some of the callbacks are false positives. Lock waits should only be reported if they are for locks that will be held until COMMIT. For example in InnoDB, there are shorter-lived locks on the auto-increment counter, and such locks should _not_ be reported.

Yes, I think they are false positives since the thd_report_wait_for API is called but it does NOT call the THD::awake function.

> - Kristian.
Rich Prohaska writes:

> tokudb lock timeouts are resolving the replication stall. unfortunately, the tokudb lock timeout is 4 seconds, so the throughput is almost zero.

Yes. Sorry for not making it clear that my proof-of-concept patch was incomplete...

>>> I suspect that the poor slave replication performance for optimistic replication occurs because TokuDB does not implement the kill_query handlerton function. kill_handlerton gets called to resolve lock wait
>>
>> Possibly, but I'm not sure it's that important. The kill will be effective as soon as the wait is over.

No, you're absolutely right, after testing (and thinking) some more, I realise that indeed the kill_query functionality is important.

A possible scenario is, given transactions T1, T2, and T3 in that order:

T3 acquires a lock on row R3, T2 similarly acquires R2. Now T3 tries to acquire R2, but has to wait for T2 to release it. Later T1 tries to acquire R3, also has to wait.

At this point, we kill T3, since it is holding a lock (R3) needed by an earlier transaction T1. However, T3 will not notice the kill until its own wait (on R2 held by T2) times out. T2 cannot release the lock because it is waiting for T1 to commit first. So we have a deadlock :-/

With InnoDB, the kill causes T3 to wake up immediately and roll back, so that T1 can proceed without much delay.

Ok, so something more is needed here. I see there is a killed_callback() which seems to check for the kill, so I'm hoping that can be used with a suitable wakeup of the offending lock_request (or all requests, perhaps). But as I'm completely new to TokuDB, I still need some more time to read the code and try to understand how everything fits together...

> TokuFT implements pessimistic locking and 2 phase locking algorithms. This wiki describes locking and concurrency in a little more detail: https://github.com/percona/tokudb-engine/wiki/Transactions-and-Concurrency.

Thanks, this was quite helpful.

> Yes, I think they are false positives since the thd_report_wait_for API is called but it does NOT call the THD::awake function.

Ah. Then it's probably normal, caused by the group-commit optimisation. In conservative mode, if two transactions T1 and T2 did not group commit on the master, then they cannot be started in parallel on the slave. But T2 can start as soon as T1 has reached COMMIT. Thus, if T2 happens to conflict with T1, there is a small window where T2 can need to wait on T1 until T1 has completed its commit.

Thanks,

- Kristian.
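The T1/T2/T3 scenario in this message can be checked with a small wait-for graph. The sketch below is only a model: `WaitForGraph` and `build_scenario` are hypothetical names, and transactions are plain integers. It shows that the two row-lock waits alone form no cycle, while adding the commit-order wait (T2 must commit after T1) closes one, which is why a storage engine that sees only its own lock waits cannot detect this deadlock.

```cpp
#include <map>
#include <set>
#include <vector>

// Directed wait-for graph: an edge a -> b means "a cannot proceed until b does".
class WaitForGraph {
public:
    void add_wait(int waiter, int blocker) { edges_[waiter].insert(blocker); }

    // True if some chain of waits starting at `start` leads back to `start`.
    bool in_cycle(int start) const {
        std::set<int> visited;
        std::vector<int> stack;
        stack.push_back(start);
        while (!stack.empty()) {
            int node = stack.back();
            stack.pop_back();
            auto it = edges_.find(node);
            if (it == edges_.end()) continue;
            for (int next : it->second) {
                if (next == start) return true;
                if (visited.insert(next).second) stack.push_back(next);
            }
        }
        return false;
    }

private:
    std::map<int, std::set<int>> edges_;
};

// Builds the scenario from the message above. The row-lock waits are visible
// to the storage engine; the commit-order wait exists only in the server.
inline WaitForGraph build_scenario(bool include_commit_order) {
    WaitForGraph g;
    g.add_wait(3, 2);  // T3 waits for row R2, held by T2
    g.add_wait(1, 3);  // T1 waits for row R3, held by T3
    if (include_commit_order) g.add_wait(2, 1);  // T2 must commit after T1
    return g;
}
```

Calling `build_scenario(false).in_cycle(3)` finds no cycle, whereas `build_scenario(true).in_cycle(3)` does, so only a component that knows both kinds of wait can pick T3 as the victim.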
Hello Kristian,
The simplest kill_query implementation for tokudb would just signal all of
the pending lock request's condition variables. This would cause the
killed callback to be called. A performance refinement, if necessary,
would allow thread A (executing the kill_query function) to identify and
signal a condition variable for a blocked thread B.
Hello Kristian,
See attached snapshot of slave threads and tokudb locks. Thread 15 is
waiting for a tokudb lock held by thread 16, which is waiting for a tokudb
lock held by thread 14. Thread 14 is waiting for a prior transaction to
complete, presumably either thread 15 or 16. So, we have a deadlock that
tokudb cannot detect because the ordering constraint is not available to
tokudb. I assume that the optimistic scheduler killed thread 16, but since
tokudb does not implement the kill_query function, the deadlock is only
resolved when the tokudb lock timer pops.
Hello Kristian,
I have a prototype of the TokuFT code that will cause ALL lock waiters to
call their killed callback here:
https://github.com/prohaska7/tokuft/tree/kill_lockers
participants (2)
- Kristian Nielsen
- Rich Prohaska