Re: [Maria-developers] [Maria-discuss] Known limitation with TokuDB in Read Free Replication & parallel replication ?
Rich Prohaska <prohaska7@gmail.com> writes:
On Tue, Aug 23, 2016 at 1:45 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
In the original parallel replication patch, thd_report_wait_for() did not call back directly into tokudb_kill_query(). The kill happened asynchronously, in a separate background thread. Then there is no problem with TokuDB (or InnoDB) holding locks over the call to thd_report_wait_for().
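To illustrate the asynchronous approach described above, here is a minimal sketch in Python rather than the server's C++. It is not the code from the original patch: `AsyncKiller`, `engine_kill_query`, and `report_wait_for` are hypothetical stand-ins, and `lock_tree_mutex` is an ordinary non-recursive lock standing in for an engine-internal mutex. The point it shows is that the reporting call only enqueues a request, so the engine can safely keep holding its own lock across that call.

```python
import queue
import threading


class AsyncKiller:
    """Hypothetical helper: performs kills on a background thread."""

    def __init__(self, kill_fn):
        self._kill_fn = kill_fn
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def request_kill(self, victim):
        # Only enqueues; never calls back into the storage engine.
        self._requests.put(victim)

    def _run(self):
        while True:
            victim = self._requests.get()
            if victim is None:  # shutdown marker
                break
            self._kill_fn(victim)

    def shutdown(self):
        self._requests.put(None)
        self._worker.join()


def demo():
    lock_tree_mutex = threading.Lock()  # stand-in for an engine mutex
    killed = []

    def engine_kill_query(victim):
        # Like an engine kill callback, this needs the engine mutex.
        with lock_tree_mutex:
            killed.append(victim)

    def report_wait_for(killer, victim):
        # Called by the engine while it already holds lock_tree_mutex.
        killer.request_kill(victim)

    killer = AsyncKiller(engine_kill_query)
    with lock_tree_mutex:
        # Calling engine_kill_query() directly here would self-deadlock,
        # because the lock is not recursive. Enqueueing is safe.
        report_wait_for(killer, "trx-42")
    killer.shutdown()  # the worker runs the kill once the lock is free
    return killed
```

In this sketch the background thread may block briefly on `lock_tree_mutex`, but it always makes progress once the reporting thread releases the lock, which is the property the synchronous callback lacks.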
I am considering re-introducing that original code - this might simplify the TokuDB implementation (and would also simplify the InnoDB implementation). I was never very happy about the way thd_report_wait_for() works currently.
Ok, so I implemented this, available in this branch:

  https://github.com/knielsen/server/commits/toku_opr3

With this code, I am no longer able to reproduce any hangs with the tests I've been running so far.
BTW, the current_lock_tree_mutex logic is broken. The underlying tokuft lock manager has one manager object (and its mutex) and many lock tree objects (each with its own mutex). Since the thd_report_wait_for function is called when holding the lock tree mutex (not the manager mutex), it can be called in parallel; thus there is a race on the current_lock_tree_mutex logic.
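The race described above can be modelled with a small Python sketch. This is a simplified model under stated assumptions, not the TokuDB code: it assumes the workaround kept a single shared slot (here `shared["current_lock_tree_mutex"]`) that each caller wrote while holding only its own lock tree's mutex. The two events exist only to force the unlucky interleaving deterministically; all names other than `current_lock_tree_mutex` are hypothetical.

```python
import threading


def simulate_shared_slot_race():
    # One shared slot, but two different per-tree mutexes "protect" it.
    shared = {"current_lock_tree_mutex": None}
    tree_a_mutex = threading.Lock()
    tree_b_mutex = threading.Lock()
    a_stored = threading.Event()
    b_stored = threading.Event()
    observed = {}

    def waiter_on_tree_a():
        with tree_a_mutex:
            shared["current_lock_tree_mutex"] = "tree_a"
            a_stored.set()
            b_stored.wait()  # force the unlucky interleaving
            # Still holding tree_a_mutex, yet the slot was overwritten.
            observed["tree_a"] = shared["current_lock_tree_mutex"]

    def waiter_on_tree_b():
        a_stored.wait()
        with tree_b_mutex:  # a different mutex, so no exclusion against A
            shared["current_lock_tree_mutex"] = "tree_b"
            b_stored.set()

    threads = [
        threading.Thread(target=waiter_on_tree_a),
        threading.Thread(target=waiter_on_tree_b),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed["tree_a"]
```

In this model the thread working on tree A reads back the value stored by the thread working on tree B, because holding one tree's mutex does nothing to exclude a writer that holds another tree's mutex.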
Yes, and this is the main problem solved by doing the kill asynchronously. From my tests, it looks like this was actually what was causing the problems: crash, assertion, and hangs. (I haven't determined this conclusively, but it seems at least plausible.) It really was always an ugly hack around the locking problem with thd_report_wait_for() (InnoDB had a similar issue), so it seems good to get rid of it.

I still have should_retry_lock_requests disabled in retry_all_lock_requests(); otherwise I get hangs. I haven't investigated this deeply yet.

I also want to go back over the entire set of patches and see what needs cleaning up, and think about how this could go into 10.2 and possibly 10.1.

 - Kristian.