Re: [Maria-developers] Reply: in-order commit
丁奇 <dingqi.lxb@taobao.com> writes:
Hi, Kristian Ok. I have got the information from JIRA.
I find you control the commit order inside the user thread.
Will it be easier to let Trans_worker thread hold this logic?
Yes, I think you are right. Of course, the user thread is the one that knows the ordering, but the logic for waiting needs to be in the Trans_worker thread. In fact this is a bug in my first patch: Transaction T3 could wait for the THD of worker thread 1 which has both T1 and T2 queued; then it will wake up too early, when T1 commits rather than when T2 does. I will try to implement the new idea today.
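The early wake-up described above can be illustrated with a small model. This is only a sketch: the function names and the list-based queue are hypothetical and do not come from the patch. It compares a waiter that is released by any commit on the worker's THD with a waiter that is released only when the specific transaction it depends on has committed.

```python
def wake_on_any_commit(txn, wait_for):
    """Buggy rule (assumed model): any commit on the worker's THD wakes the waiter."""
    return True


def wake_on_target_commit(txn, wait_for):
    """Intended rule: the waiter wakes only when `wait_for` itself commits."""
    return txn == wait_for


def commits_seen_before_wakeup(queue, wakeup_rule, wait_for):
    """Commit the worker's queued transactions in order until the waiter wakes.

    Returns the list of transactions committed at the moment of wake-up.
    """
    committed = []
    for txn in queue:
        committed.append(txn)
        if wakeup_rule(txn, wait_for):
            break
    return committed


# Worker thread 1 has T1 and T2 queued; T3 must wait for T2.
worker1_queue = ["T1", "T2"]
early = commits_seen_before_wakeup(worker1_queue, wake_on_any_commit, "T2")
correct = commits_seen_before_wakeup(worker1_queue, wake_on_target_commit, "T2")
print("per-thread wait wakes after:", early)
print("per-transaction wait wakes after:", correct)
```

With the per-thread rule, T3 is released while T2 is still uncommitted, which is the ordering violation in question.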
After they have finished executing one transaction, they "register the transaction and wait" if there are transactions from other workers that should be committed ahead of it. After a commit in one worker, another worker is woken up: the worker that is waiting for the next "head of committee" should be woken up.
Right, I'll need to look into this a bit deeper. Actually, in my patch the actual wait and wakeup happen inside ha_commit_trans(), and there is a reason for this: eventually I want to do it inside tc_log->log_and_order(), which is called from ha_commit_trans().

Here is how a commit happens:

    InnoDB prepare step
    fsync() InnoDB redo log                     (*A)
    TC_LOG_BINLOG::log_and_order
    Write transaction to binlog
    fsync() binlog                              (*B)
    InnoDB commit_ordered()                     (*C)
    Write commit record to InnoDB redo log
    InnoDB commit step

The steps (*A) and (*B) are slow, typically around 1-10 milliseconds depending on the disk system. So we need many threads to commit in parallel and reach points (*A) and (*B) at the same time, so that we only need to do the fsync() once for many threads. This is group commit.

Thus for in-order parallel replication, we must not do the wait for the previous commit before the (*B) step. If we do, it becomes impossible for two transactions to be at point (*B) at the same time, and group commit is impossible.

On the other hand, point (*C) is where the commit order is determined. So if we do the wait after point (*C), then we cannot enforce that T1 commits before T2.

Therefore, the wait must happen exactly around points (*B) and (*C), inside TC_LOG_BINLOG::log_and_order().

That is why I invented all the register_wait_for_prior_commit() and so on: so that log_and_order() has somewhere to look for exactly who is waiting for whom. Then if T2 is waiting for T1 to commit, we can do steps (*B) and (*C) for both of them together, achieving both group commit and in-order parallel replication.

Anyway, I just wanted to mention this; I know it will be difficult to understand fully from just this description. This is something that I have been planning to have for years, but I still need to show some real code that actually works. If I manage that, hopefully things will be clearer. (If not - then I need to think again ;-)

Thanks,

 - Kristian.
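To make the argument about where the wait belongs more concrete, here is a toy model. It is a sketch under stated assumptions, not the real TC_LOG_BINLOG code: the function names, the step tuples, and the `waits_for` mapping are all hypothetical. It contrasts waiting for the prior commit before the binlog fsync (one fsync per transaction) with resolving the wait inside the group, where one fsync covers all transactions and commit_ordered() is then called in the required order.

```python
def order_by_dependency(ready, waits_for):
    """Arrange transactions so each one follows the transaction it waits for.

    Assumes any predecessor that is not in `ready` has already committed.
    """
    ready_set = set(ready)
    successor = {waits_for[t]: t for t in ready if waits_for.get(t) in ready_set}
    ordered = []
    for head in (t for t in ready if waits_for.get(t) not in ready_set):
        t = head
        while t is not None:
            ordered.append(t)
            t = successor.get(t)
    return ordered


def wait_inside_group(ready, waits_for):
    """Wait resolved around (*B)/(*C): one fsync, then ordered commits."""
    ordered = order_by_dependency(ready, waits_for)
    steps = [("write_binlog", t) for t in ordered]
    steps.append(("fsync_binlog", tuple(ordered)))       # (*B), shared by the group
    steps += [("commit_ordered", t) for t in ordered]    # (*C), in commit order
    return steps


def wait_before_fsync(ready, waits_for):
    """Wait placed before (*B): each transaction pays for its own fsync."""
    steps = []
    for t in order_by_dependency(ready, waits_for):
        steps += [("write_binlog", t), ("fsync_binlog", (t,)), ("commit_ordered", t)]
    return steps


def count_fsyncs(steps):
    return sum(1 for name, _ in steps if name == "fsync_binlog")


def commit_order(steps):
    return [arg for name, arg in steps if name == "commit_ordered"]


# T2 waits for T1 and T3 waits for T2; they reach the binlog step out of order.
ready = ["T2", "T3", "T1"]
waits_for = {"T1": None, "T2": "T1", "T3": "T2"}
grouped = wait_inside_group(ready, waits_for)
serial = wait_before_fsync(ready, waits_for)
print("fsyncs, wait inside group:", count_fsyncs(grouped))
print("fsyncs, wait before fsync:", count_fsyncs(serial))
print("commit order:", commit_order(grouped))
```

Both variants keep the commit order; only the first lets several transactions share a single fsync.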
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
丁奇 <dingqi.lxb@taobao.com> writes:
Hi, Kristian Ok. I have got the information from JIRA.
I find you control the commit order inside the user thread.
Will it be easier to let Trans_worker thread hold this logic?
Yes, I think you are right. Of course, the user thread is the one that knows the ordering, but the logic for waiting needs to be in the Trans_worker thread. In fact this is a bug in my first patch: Transaction T3 could wait for the THD of worker thread 1 which has both T1 and T2 queued; then it will wake up too early, when T1 commits rather than when T2 does.
I will try to implement the new idea today.
I fixed two bugs in my earlier patch (sorry for that, I should have tested a bit better :-). This one looks better. It fixes the earlier test failures I saw in mysql-test-run.pl.

I tested it with 20000 inserts; it works and keeps the commit order, with a nice speedup compared to single-threaded insert on the master.

There may still be more problems of course. Also, especially with --sync-binlog=0, I will need to implement my group commit ideas to achieve the best speed with in-order commit.

New patch is attached (replaces earlier patches). I also pushed it to my branch:

    lp:~knielsen/maria/dingqi-parallel-replication

 - Kristian.
Hi,

I implemented the last part of in-order commit, which pushes the wait into the transaction coordinator, so that group commit can work and performance can be good.

On the one hand I am really pleased to get this done; it is something I have been thinking about for 3 years now. On the other hand I realise this is fairly complex stuff, so please be aware that I am 100% open to suggestions for any changes to this or other ideas on how to proceed.

I ran some quick benchmarks. What I did was set up a master with 20000 independent inserts into a table. I ran with sync_binlog=1, innodb_flush_log_at_trx_commit=1, and --log-slave-updates.

The base MariaDB needs 71 seconds to replicate the 20000 transactions. Your original patch needs 12 seconds. With my in-order patch 15 seconds are needed. But with my in-order patch and increasing the number of threads from 16 to 24, just 11 seconds are needed.

    71 seconds    Base
    12 seconds    Original @ 16 threads
    15 seconds    In-order @ 16 threads
    11 seconds    In-order @ 24 threads

So for this quick benchmark, in-order is somewhat slower, but one can compensate for this by increasing the number of threads. This makes sense; with in-order there will be some threads waiting, so more threads are needed to ensure enough non-waiting threads to get full performance.

These are great results I think; in-order appears quite viable performance-wise. There are a number of things that become easier when commits on the slave are guaranteed to be in-order (such as global transaction id).

(And btw, that's 6 times faster replication without the user having to do anything special, which is also *very* nice! I am really looking forward to getting this fully integrated in MariaDB).

On the other hand, it is clear that some workloads will suffer under in-order.
For example something like this:

    UPDATE t1 SET a=5 WHERE id=10;
    UPDATE t1 SET a=4 WHERE id=10;
    UPDATE t1 SET a=3 WHERE id=10;
    UPDATE t1 SET a=2 WHERE id=10;
    UPDATE t1 SET a=1 WHERE id=10;
    UPDATE t1 SET a=1 WHERE id=20;
    UPDATE t1 SET a=2 WHERE id=20;
    UPDATE t1 SET a=3 WHERE id=20;
    UPDATE t1 SET a=4 WHERE id=20;
    UPDATE t1 SET a=5 WHERE id=20;

With out-of-order, all the id=20 updates can run in parallel with all the id=10 updates. But with in-order, the first id=20 update will only commit after all the id=10 updates have run, so the remaining id=20 updates cannot run in parallel. Performance will be slower, unless there are more events deeper down in the binlog which can be run in parallel instead. It is hard to predict how common such cases will be, but I am at least hopeful!

To fix a potential deadlock with MyISAM, I changed conflict detection. Now for non-transactional tables, the hash key will have only the table name (not the PK values). Thus, any two updates of the same MyISAM table will be a conflict. Updates to different tables are ok to run in parallel.

I pushed the changes as usual to lp:~knielsen/maria/dingqi-parallel-replication/ and I attached the full patch.

This completes the in-order experiment for me. Of course the patch still needs more work to be finished; for example, it should probably be possible to enable/disable in-order with an option. But probably first we should discuss more whether in-order is a good idea at all, and in general how to proceed with the integration of the parallel replication feature.

A big benefit of the in-order method is that users will be able to enable it without fear that their applications will break. Unlike for example the MySQL 5.6 multi-threaded slave, there is no need to partition the data into different schemas and audit/rewrite all applications to ensure no cross-schema queries. With in-order things will work exactly as normal; it is invisible to applications.
The only caveat is that row-based replication is required to get a speedup, but everything works correctly even if some statement-based events turn up (and if we combine it with http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184 then we can even do some statement-based parallel replication, also using the in-order stuff). And we can still have out-of-order as an option.

 - Kristian.
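The conflict-detection change for non-transactional tables mentioned in this message can be sketched roughly as follows. This is an illustration only: `conflict_key`, `conflicts`, and the tuple layout are hypothetical and are not taken from the patch. The assumption is simply that two row events conflict when their hash keys are equal, and that the key drops the PK values when the table is non-transactional.

```python
def conflict_key(table, pk_values, transactional):
    """Hypothetical hash key used to detect conflicting row events.

    Transactional tables: table name plus primary-key values.
    Non-transactional tables (e.g. MyISAM): table name only.
    """
    if transactional:
        return (table, tuple(pk_values))
    return (table,)


def conflicts(event_a, event_b):
    """Two events conflict when they map to the same key."""
    return conflict_key(*event_a) == conflict_key(*event_b)


# Events are (table, pk_values, transactional) tuples; table names are made up.
innodb_id10 = ("t1", [10], True)
innodb_id20 = ("t1", [20], True)
myisam_a_row1 = ("m1", [1], False)
myisam_a_row2 = ("m1", [2], False)
myisam_b_row1 = ("m2", [1], False)

print(conflicts(innodb_id10, innodb_id20))      # different rows, transactional
print(conflicts(myisam_a_row1, myisam_a_row2))  # same MyISAM table
print(conflicts(myisam_a_row1, myisam_b_row1))  # different MyISAM tables
```

Under this model, updates to different rows of a transactional table can still be scheduled in parallel, while any two updates to the same non-transactional table are serialised.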