Zardosht Kasheff <zardosht@gmail.com> writes:
> Here is a very high level overview.
Thanks for the detailed explanation!

I think an easy way to implement commit_ordered() is that all you do is increment a counter and assign it to the transaction. Then in commit(), each transaction waits for the previous commit to write to the recovery log before writing itself (so the order becomes correct).

That should be a very small modification to your code. But there may be extra context switching, unless you are clever with your write lock access to the recovery log.

Maybe there is a different possibility. If I understand correctly, this is the situation:

 - You need part of commit to run in parallel in multiple threads for good performance ("send a message into the dictionary for every ...").

 - The first phase of checkpointing needs to stall all commits while it runs (but it is short).

 - When a checkpoint is waiting to start, you also need to stall new commits, to prevent starvation of checkpointing.

Is this accurate?

In fact, such a situation is exactly why I did the split into commit_ordered() and commit(), so that a storage engine has the freedom to choose which part of commit should run serially, and which in parallel.

It seems to me that the problem here is that you are using a simple read lock to handle the stalling and avoid starvation. And your read lock implementation does not allow taking the read lock in one thread and releasing it in another (which is reasonable).

Maybe it can be solved simply by using a different mechanism? Like, keep a counter of threads running inside commit. When a checkpoint is about to start, set a flag, "checkpoint pending", then wait for the counter to drop to zero. When a new thread wants to commit, it waits for the "checkpoint pending" flag to clear, stalling the commit until the checkpoint has completed.

Note that it is not a problem to do the wait for checkpoint completion inside commit_ordered(). Yes, this is single-threaded, but all other committing threads will have to wait anyway, so in fact doing the wait in just the one thread will reduce context switches and speed things up. But you could do the wait for the checkpoint to complete in e.g. prepare() instead if you want.

But maybe I am missing something? Not knowing your implementation, I cannot know of course if this naive second idea is infeasible for some reason...
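To make the counter-plus-flag idea concrete, here is a minimal sketch, written in Python only for brevity. The class CommitGate and all its method names are made up for illustration; they are not part of MariaDB or of any engine, and a real engine would use its own C/C++ mutex and condition primitives. The sketch also assumes that only one checkpoint runs at a time.

```python
import threading


class CommitGate:
    """Hypothetical helper; not part of MariaDB or any storage engine API."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_commits = 0          # threads currently inside commit
        self._checkpoint_pending = False  # set while a checkpoint waits or runs

    def enter_commit(self):
        # New commits stall while a checkpoint is pending, so a steady
        # stream of commits cannot starve the checkpoint.
        with self._cond:
            self._cond.wait_for(lambda: not self._checkpoint_pending)
            self._active_commits += 1

    def leave_commit(self):
        # May run in a different thread than enter_commit(): only the
        # counter changes, no lock ownership is carried between the calls.
        with self._cond:
            self._active_commits -= 1
            if self._active_commits == 0:
                self._cond.notify_all()

    def begin_checkpoint(self):
        with self._cond:
            # Assumption: at most one checkpoint at a time.
            self._cond.wait_for(lambda: not self._checkpoint_pending)
            self._checkpoint_pending = True
            # Wait for commits that are already running to drain.
            self._cond.wait_for(lambda: self._active_commits == 0)

    def end_checkpoint(self):
        with self._cond:
            self._checkpoint_pending = False
            self._cond.notify_all()


# Example: enter in one thread, leave in another, then checkpoint.
gate = CommitGate()
gate.enter_commit()
finisher = threading.Thread(target=gate.leave_commit)
finisher.start()
finisher.join()
gate.begin_checkpoint()   # returns at once, no commit is active
gate.end_checkpoint()
```

The point of the design is that the internal mutex is held only inside each call, so entering in commit_ordered() and leaving at the end of commit(), possibly in another thread, does not run into the ownership restriction of a read lock.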
> may be expensive, thereby hurting concurrency. So, for such a thing to work, we would have to find a way to grab the read lock in commit_ordered once for each transaction (and because the lock is fair, we can't just regrab the lock on the same thread), write to the recovery log, then perform everything else under commit, then release the read lock. It can probably be done, but it is messy. If unnecessary, I prefer to not do it.
Yes, I understand, deep surgery on synchronisation primitives in the core of an engine is not trivial stuff...

In MariaDB, it is not necessary. The commit_ordered() is optional, though it gives you some benefits (and likely more benefits in future versions). If necessary, we can try to make e.g. the removal of commit fsync() work without commit_ordered(), so you get some of the benefits regardless.

And it sounds like my first suggestion should be an easy way to implement commit_ordered(). Though it might require benchmarking to check that it does not hurt performance. If you try any solution, feel free to send me the patch for review and suggestions.

But what can you do in MySQL 5.6? In 5.6, effectively what you get is commit_ordered() only, no commit() (their call of commit() with HA_IGNORE_DURABILITY set is essentially the same as commit_ordered()). So you do not get to decide which code to run serially, and which in parallel. Everything in commit() runs serially. Total breakage of the storage engine API, and their developers do not even understand this when it is pointed out to them :-(

I vaguely remember some option in 5.6 that would disable the serialisation of commit(), maybe you can recommend that your users enable it ...

 - Kristian (painfully aware of writing too long emails).