丁奇 <dingqi.lxb@taobao.com> writes:
> Hi, Kristian
>
> OK, I have got the information from JIRA.
> I see that you control the commit order inside the user thread.
> Would it be easier to let the Trans_worker thread hold this logic?
Yes, I think you are right. Of course, the user thread is the one that knows the ordering, but the logic for waiting needs to be in the Trans_worker thread.

In fact this is a bug in my first patch: transaction T3 could end up waiting on the THD of worker thread 1, which has both T1 and T2 queued; it would then wake up too early, when T1 commits rather than when T2 does. I will try to implement the new idea today.
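To make the early-wakeup problem concrete, here is a small single-threaded simulation. It is only a sketch: the `Txn` and `Worker` classes, `commit()` and `wakes_too_early()` are hypothetical names made up for illustration and are not taken from the patch or from the server code. It compares registering T3's wait on the worker (THD) with registering it on the specific transaction T2.

```python
class Txn:
    """Hypothetical per-transaction record with its own wait list."""
    def __init__(self, name):
        self.name = name
        self.committed = False
        self.woken = False      # set when a commit signals this waiter
        self.waiters = []       # transactions waiting for THIS transaction


class Worker:
    """Hypothetical worker (THD) with one wait list shared by all its transactions."""
    def __init__(self):
        self.waiters = []


def commit(worker, txn):
    """Commit txn on worker and wake everything registered on either wait list."""
    txn.committed = True
    for waiter in worker.waiters + txn.waiters:
        waiter.woken = True
    worker.waiters.clear()
    txn.waiters.clear()


def wakes_too_early(wait_on_worker):
    """Return True if T3 is woken before T2 has committed."""
    worker1 = Worker()          # worker thread 1 has T1 and T2 queued
    t1, t2, t3 = Txn("T1"), Txn("T2"), Txn("T3")
    if wait_on_worker:
        worker1.waiters.append(t3)   # wait registered on the worker's THD
    else:
        t2.waiters.append(t3)        # wait registered on transaction T2 itself
    commit(worker1, t1)
    early = t3.woken and not t2.committed
    commit(worker1, t2)
    return early


if __name__ == "__main__":
    print("wait on worker THD, woken early:", wakes_too_early(True))
    print("wait on transaction T2, woken early:", wakes_too_early(False))
```

In this toy model, the wait registered on the worker fires at T1's commit, while the wait registered on T2 fires only when T2 itself commits.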
> After a worker has finished executing a transaction, it should "register the
> transaction and wait" if transactions from other workers should be committed
> ahead of it. After a commit in one worker, the worker that is waiting for the
> next "head of commitee" should be woken up.
Right, I'll need to look into this a bit deeper.

Actually, in my patch the actual wait and wakeup happen inside ha_commit_trans(), and there is a reason for this: eventually I want to do it inside tc_log->log_and_order(), which is called from ha_commit_trans().

Here is how a commit happens:

    InnoDB prepare step
        fsync() InnoDB redo log                    (*A)
    TC_LOG_BINLOG::log_and_order
        Write transaction to binlog
        fsync() binlog                             (*B)
        InnoDB commit_ordered()                    (*C)
        Write commit record to InnoDB redo log
    InnoDB commit step

The steps (*A) and (*B) are slow, typically around 1-10 milliseconds depending on the disk system. So we need many threads to commit in parallel and reach points (*A) and (*B) at the same time, so that we only need to do the fsync() once for many threads. This is group commit.

Thus for in-order parallel replication, we must not do the wait for the previous commit before the (*B) step. If we do, it becomes impossible for two transactions to be at point (*B) at the same time, and group commit is impossible.

On the other hand, point (*C) is where the commit order is determined. So if we do the wait after point (*C), then we cannot enforce that T1 commits before T2.

Therefore, the wait must happen exactly around points (*B) and (*C), inside TC_LOG_BINLOG::log_and_order(). That is why I invented register_wait_for_prior_commit() and so on: so that log_and_order() has somewhere to look up exactly who is waiting for whom. Then if T2 is waiting for T1 to commit, we can do steps (*B) and (*C) for both of them together, achieving both group commit and in-order parallel replication.

Anyway, I just wanted to mention this; I know it will be difficult to understand fully from just this description. This is something that I have been planning to have for years, but I still need to show some real code that actually works. If I manage that, hopefully things will be clearer. (If not, then I need to think again ;-)

Thanks,

 - Kristian.
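
As a rough illustration of the argument about points (*B) and (*C), the following toy model counts binlog fsync() calls in two cases: when each transaction waits for its predecessor before (*B), and when the wait is resolved inside the group. It is a sketch under stated assumptions, not the server implementation: `Binlog`, `order_group()`, `log_and_order_group()` and `serialized_commit()` are hypothetical names, the `waits_for` mapping stands in for whatever register_wait_for_prior_commit() records, and it assumes every prior transaction is either in the same group or already committed.

```python
class Binlog:
    """Toy binlog that only records writes and counts fsync() calls."""
    def __init__(self):
        self.events = []
        self.fsync_count = 0

    def write(self, txn):
        self.events.append(txn)

    def fsync(self):
        self.fsync_count += 1


def order_group(arrived, waits_for):
    """Order a group so each transaction follows the one it waits for.

    Assumption: a prior transaction not in the group has already committed.
    """
    in_group = set(arrived)
    ordered, placed = [], set()

    def place(txn):
        if txn in placed:
            return
        prior = waits_for.get(txn)
        if prior in in_group:
            place(prior)        # the prior commit must come first
        placed.add(txn)
        ordered.append(txn)

    for txn in arrived:
        place(txn)
    return ordered


def log_and_order_group(binlog, arrived, waits_for, commit_order):
    """Wait resolved inside the group: one fsync, ordered commit."""
    group = order_group(arrived, waits_for)
    for txn in group:
        binlog.write(txn)
    binlog.fsync()                  # (*B): one fsync for the whole group
    for txn in group:
        commit_order.append(txn)    # (*C): commit order fixed here, in sequence
    return group


def serialized_commit(binlog, txns, commit_order):
    """Wait placed before (*B): no two transactions reach (*B) together."""
    for txn in txns:
        binlog.write(txn)
        binlog.fsync()              # one fsync per transaction
        commit_order.append(txn)


if __name__ == "__main__":
    waits_for = {"T2": "T1", "T3": "T2"}
    grouped, grouped_order = Binlog(), []
    log_and_order_group(grouped, ["T2", "T3", "T1"], waits_for, grouped_order)
    serial, serial_order = Binlog(), []
    serialized_commit(serial, ["T1", "T2", "T3"], serial_order)
    print("grouped:", grouped_order, "fsyncs =", grouped.fsync_count)
    print("serialized:", serial_order, "fsyncs =", serial.fsync_count)
```

Both variants commit in the order T1, T2, T3; the difference in this model is only how many fsync() calls that order costs.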