Sergei Golubchik <serg@askmonty.org> writes:
Hi, Kristian!
Hot and cool, yes, I've used (and coded) them myself :) But I only use them where I believe they actually improve concurrency, and I think this one is not the case.
Very true. So let us forget about the lock-free stuff. I'll remove the atomics and just protect the queue with the already taken LOCK_prepare_ordered mutex. Thanks for persisting ;-) (And I'll try to think of a better name for LOCK_prepare_ordered).
But I think that in many cases one can rewrite broadcast to a set of signals and one condition per thread. It's kind of optimizing the thread library itself - I typically prefer to use the library as designed and leave its optimization to library developers.
Yes, I agree with this general principle. (In this case the issue is a bit different, as it is the semantics of pthread_cond_broadcast() that we want to change, not the implementation we want to optimise. But that is getting off-topic I guess).
But anyway, the fact is that I didn't benchmark and don't know for sure what's better: one broadcast or many signals.
So I could not resist the temptation to actually do the benchmark. It turns out that pthread_cond_broadcast() is slower than individual signals, more so the more cores are available (5 times slower was observed; check my blog if you're interested in the details). However, as I revisited the algorithm, it occurred to me that it is in any case better to wake up threads individually as soon as commit_ordered() has finished. This way, the first threads in the queue are free to continue doing useful work while we are still running commit_ordered() for the last threads. So now the algorithm is something like this:

    thd->ready= false
    lock(LOCK_prepare_ordered)
    old_queue= group_commit_queue
    thd->next= old_queue
    group_commit_queue= thd
    ht->prepare_ordered()
    unlock(LOCK_prepare_ordered)
    if (old_queue == NULL)                  // leader?
      lock(LOCK_group_commit)
      lock(LOCK_prepare_ordered)
      queue= reverse(group_commit_queue)
      group_commit_queue= NULL
      unlock(LOCK_prepare_ordered)
      group_log_xid(queue)
      lock(LOCK_commit_ordered)             // but see below
      unlock(LOCK_group_commit)
      for thd2 in <queue>
        lock(thd2->LOCK_wakeup)
        thd2->ready= true
        signal(thd2->COND_wakeup)
        unlock(thd2->LOCK_wakeup)
      unlock(LOCK_commit_ordered)
    else
      lock(thd->LOCK_wakeup)
      while (!thd->ready)
        wait(COND_wakeup, LOCK_wakeup)
      unlock(thd->LOCK_wakeup)
    cookie= xid_log_after()

You suggest a couple more simplifications:
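To check that the leader/follower scheme above actually hangs together, here is a compilable sketch of it using pthreads. All names (thd_t, commit_trx, the LOCK_* mutexes) mirror the pseudocode but are invented for this sketch; the real prepare_ordered(), group_log_xid() and commit_ordered() calls are marked as comments:

```c
/* Sketch of the leader/follower group commit algorithm above.
   Names are illustrative, not the actual MariaDB code. */
#include <pthread.h>
#include <stdbool.h>
#include <assert.h>

typedef struct thd {
    struct thd *next;
    bool ready;
    pthread_mutex_t LOCK_wakeup;
    pthread_cond_t COND_wakeup;
} thd_t;

static thd_t *group_commit_queue = NULL;
static pthread_mutex_t LOCK_prepare_ordered = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t LOCK_group_commit = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t LOCK_commit_ordered = PTHREAD_MUTEX_INITIALIZER;
static int committed_count = 0;   /* stands in for real commit work */

static thd_t *reverse(thd_t *q)   /* LIFO push order -> commit order */
{
    thd_t *rev = NULL;
    while (q) { thd_t *n = q->next; q->next = rev; rev = q; q = n; }
    return rev;
}

void *commit_trx(void *arg)
{
    thd_t *thd = arg;
    thd->ready = false;                     /* clear BEFORE enqueue */
    pthread_mutex_lock(&LOCK_prepare_ordered);
    thd_t *old_queue = group_commit_queue;
    thd->next = old_queue;
    group_commit_queue = thd;
    /* ht->prepare_ordered() would run here, under the queue lock */
    pthread_mutex_unlock(&LOCK_prepare_ordered);

    if (old_queue == NULL) {                /* queue was empty: leader */
        pthread_mutex_lock(&LOCK_group_commit);
        pthread_mutex_lock(&LOCK_prepare_ordered);
        thd_t *queue = reverse(group_commit_queue);
        group_commit_queue = NULL;
        pthread_mutex_unlock(&LOCK_prepare_ordered);
        /* group_log_xid(queue) -- one shared log write/fsync */
        pthread_mutex_lock(&LOCK_commit_ordered);
        pthread_mutex_unlock(&LOCK_group_commit);
        for (thd_t *thd2 = queue; thd2; ) {
            thd_t *nxt = thd2->next;  /* save before waking thd2 */
            /* ht->commit_ordered(thd2) would run here */
            committed_count++;
            pthread_mutex_lock(&thd2->LOCK_wakeup);
            thd2->ready = true;
            pthread_cond_signal(&thd2->COND_wakeup);
            pthread_mutex_unlock(&thd2->LOCK_wakeup);
            thd2 = nxt;
        }
        pthread_mutex_unlock(&LOCK_commit_ordered);
    } else {                                /* follower: wait for wakeup */
        pthread_mutex_lock(&thd->LOCK_wakeup);
        while (!thd->ready)
            pthread_cond_wait(&thd->COND_wakeup, &thd->LOCK_wakeup);
        pthread_mutex_unlock(&thd->LOCK_wakeup);
    }
    return NULL;
}
```

Note how a follower is woken as soon as its own commit_ordered() slot has passed, while the leader is still working down the rest of the queue; that is exactly the gain over a single broadcast at the end.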
The reason for the LOCK_commit_ordered is that there are two code paths that call commit_ordered(). One is the group commit leader thread that invokes commit_ordered() for all threads that have passed group_log_xid(). The other is when no 2-phase commit is done, ha_commit_one_phase(). It does not actually matter in which order the non-2-phase commit_ordered() calls happen, but I still wanted to provide the storage engine the guarantee that commit_ordered() would not be invoked by multiple threads in parallel. It just seemed cleaner that way to me.
Uhm, I don't know. The engine will have its own shared data structures and locks anyway, so it'll ensure that commits cannot be executed in parallel. I don't see why we need to take care of that.
Right. Hm, originally I thought it was cleaner if the commit_ordered() call was protected by a mutex in the API, thinking that a call that defines commit order should not run in parallel. But now I tend to agree with you... if two commit_ordered() calls run in parallel, it just means there is no enforced ordering between them, and the engine will most likely have to do its own locking anyway, as you say. So maybe we should just omit it...

On the other hand, the algorithm I suggested earlier for START TRANSACTION WITH CONSISTENT SNAPSHOT used LOCK_commit_ordered, and there might be other uses... In the above algorithm, LOCK_commit_ordered also makes the locking a bit more fine-grained, as the next group of queued transactions can start group_log_xid() while the previous group is still running commit_ordered(). So I am not sure. I'd like to think more about it; what do you think?
It would be possible to iterate over the queue to invoke prepare_ordered() in sequence from a single thread, just like group_log_xid() and commit_ordered(). But this would delay the calls until the previous group commit is done and the next one starts.
No, why? You only need one atomic fetch_and_store to copy the queue head to a local variable and reset the queue. Then you can call prepare_ordered or commit_ordered in the queue order without any mutex.
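For the record, the single fetch_and_store Sergei describes could look like this in C11 atomics (a sketch with invented names, not the code the thread eventually settled on, since the atomics were dropped in favour of LOCK_prepare_ordered):

```c
/* Lock-free enqueue plus a single atomic exchange to detach the
   whole queue. Names (trx_t, commit_queue) are invented. */
#include <stdatomic.h>
#include <stddef.h>
#include <assert.h>

typedef struct trx { struct trx *next; int xid; } trx_t;

static _Atomic(trx_t *) commit_queue = NULL;

/* Committing thread: lock-free push onto the shared LIFO queue. */
void enqueue(trx_t *t)
{
    trx_t *old = atomic_load(&commit_queue);
    do {
        t->next = old;   /* 'old' is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak(&commit_queue, &old, t));
}

/* Leader: one fetch_and_store (atomic exchange) detaches the whole
   queue. After this the list is private to the leader and can be
   reversed and walked in queue order without any mutex. */
trx_t *grab_queue(void)
{
    return atomic_exchange(&commit_queue, NULL);
}
```

The push order is LIFO, so the grabbed list still has to be reversed before calling prepare_ordered()/commit_ordered() in arrival order, same as in the mutex-based variant.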
I am not sure if I understood your suggestion correctly. But what I considered with the above statement about "delaying calls to prepare_ordered()" is this: just like the group leader thread runs commit_ordered() in a loop over the queue just after group_log_xid(), we could have it do a similar loop for prepare_ordered() just before group_log_xid(). But I chose to do it earlier, as soon as the transaction is put in the queue and the commit order is thereby defined. There can be quite a "long" time interval between these two events: the time it takes for the previous group_log_xid() (eg. an fsync()), plus sometimes one wants to add extra sleeps in group commit to group more transactions together. So that is the "delay" I was referring to: the time from when a transaction enters the queue until the previous group commit is done and the new leader can obtain LOCK_group_commit and run the next queue. Since my main motivation for prepare_ordered() is to implement the InnoDB early lock release, it seemed important not to delay the call in this way. What do you think? Note that it is essential to release the queue lock after enqueueing, then wait for the LOCK_group_commit, then re-take the queue lock. Otherwise it prevents other threads from grouping up in the queue, which is the whole point.
Still, what will happen if a context switch stops the non-leader thread between enqueue_atomic() and group_commit_ready=FALSE?
Although, to fix that you don't need to chain locks; it would probably be enough to set group_commit_ready=FALSE before the enqueue.
Yes, that is necessary, of course.
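The race Sergei points out is the classic lost wakeup. Since a real race is nondeterministic, here is a deterministic replay of the two interleavings, with the "context switch" written out by hand (all names invented for illustration):

```c
/* Deterministic replay of the lost-wakeup race around
   group_commit_ready. Names are invented for illustration. */
#include <stdbool.h>
#include <assert.h>

typedef struct { bool group_commit_ready; } thd_t;

/* Wrong order: enqueue first, clear the flag second. The leader can
   run in between and its wakeup is overwritten. */
bool wrong_order_loses_wakeup(void)
{
    thd_t thd = { .group_commit_ready = false };
    /* committing thread: enqueue_atomic(&thd) -- now visible to leader */
    /* -- context switch: leader processes the queue and wakes us -- */
    thd.group_commit_ready = true;        /* leader: "you are done" */
    /* -- switch back: we now clear the flag we plan to wait on -- */
    thd.group_commit_ready = false;       /* wakeup lost forever */
    return thd.group_commit_ready;        /* false: we would wait forever */
}

/* Correct order: clear the flag before enqueueing. The leader cannot
   see us before the enqueue, so its wakeup always lands after the
   clear and is never lost. */
bool right_order_keeps_wakeup(void)
{
    thd_t thd = { .group_commit_ready = false };  /* clear first */
    /* committing thread: enqueue_atomic(&thd) */
    /* -- context switch at any later point is now harmless -- */
    thd.group_commit_ready = true;        /* leader: "you are done" */
    return thd.group_commit_ready;        /* true: wakeup observed */
}
```

With a per-thread mutex and condition variable (as in the revised algorithm) the flag update and the signal are additionally made atomic with respect to the waiter's check, which closes the remaining window.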
I mean that even for group_log_xid() you can create a scheme with conditions and mutexes that "bounces execution" to every transaction thread to do the commit in there. So it's not really a "consequence of the group_log_xid()."
Yes, you are right, it is not necessary. What I meant was I think it is a worthwhile optimisation.
Personally, I don't know how bad it is, as a limitation. I think it should be pretty easy for the engine to use THD and store the data there. As for my_error and similar things - well, we can either defer them or use pthread_setspecific to set current_thd and mysys_vars appropriately. Or let my_error report its error, and after the commit move the error from the current_thd to the appropriate THD. Or, really, we'll have an error_reporting service eventually (hopefully soon) and it'll solve the my_error problem.
Anyway, if the engine uses pthread_getspecific or gcc thread local variables (__thread storage class - which is implemented not with pthread_getspecific but with some ELF magic that I didn't bother to read about) it will not work when we migrate the transaction from one thread to another. But, again, I don't know how much we should care about this case.
Ok, good, I also think it should be ok. I will document clearly that this is the case, and as currently specified, prepare_ordered() and commit_ordered() cannot return an error anyway (such errors should be handled in prepare() or commit()).
- If we have consistent commit order, we can think about optimising commit to need only one fsync (for binlog); lost commits in storage engines can then be recovered from the binlog at crash recovery by re-playing against the engine from a particular point in the binlog.
Why do you need consistent commit order for that?
Hm. The case I have in mind is this: ... Now what will we do to recover? I guess we will have to replay A, B, and D but not C and E. So we need to get a _list_ of missing transactions from the engine somehow (and so also need something similar to unlog() against the
Yes, that's how XA recovery works, so I thought we can, kind of, reuse that.
The question in my mind is, when can the engine be allowed to forget that it committed a transaction (eg. when would we "unlog" it)? The best I can think of now is something like this:

 1. For each transaction, at the point where that transaction becomes durable, the engine informs the TC about it.

 2. When the TC has been informed by the engines that all transactions before a certain point in the binlog have become durable, it sets that point as the starting point for crash recovery.

 3. When the TC has stored the starting point for crash recovery durably, it can "unlog" all transactions prior to that point to the engine.

This feels fairly complex to me. But maybe there is a simpler way? In contrast, with consistent commit order, the engine just needs to remember the id of the last committed transaction. It can probably just write it in its transaction log during prepare. Still,
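The heart of step 2 is a low-water mark: the recovery start point is the minimum, over all engines, of the binlog position each engine has made durable. A toy sketch (every name and type here is invented, purely to illustrate the bookkeeping):

```c
/* Toy sketch of the TC-side bookkeeping for step 2 above.
   All names and types are invented for illustration. */
#include <stdint.h>
#include <assert.h>

#define MAX_ENGINES 4

/* For each engine: binlog position up to which all of that engine's
   transactions are known durable (reported in step 1). */
static uint64_t engine_durable_pos[MAX_ENGINES];

/* Step 1: an engine reports that everything up to 'pos' is durable.
   Positions only ever move forward. */
void tc_report_durable(int engine, uint64_t pos)
{
    if (pos > engine_durable_pos[engine])
        engine_durable_pos[engine] = pos;
}

/* Step 2: crash recovery may start replaying the binlog from this
   point; every engine has made everything before it durable. Once
   this value is itself stored durably, step 3 can "unlog" all
   earlier transactions. */
uint64_t tc_recovery_start(void)
{
    uint64_t low = UINT64_MAX;
    for (int i = 0; i < MAX_ENGINES; i++)
        if (engine_durable_pos[i] < low)
            low = engine_durable_pos[i];
    return low;
}
```

The contrast with the consistent-commit-order scheme is visible here: instead of one per-engine low-water mark plus a durably stored recovery point, the engine only needs its single last-committed transaction id.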
The other side of this is whether there is really any performance gain in switching it off. My intention with the design was that there should not be any performance penalty from it.
Exactly. That's why I said "if it'll help" :)
The main performance bottleneck I am introducing is, I think, the serialisation of the commit_ordered() part of commit. Not just for some particular engine implementation, but for the interface. That is not a decision to be taken lightly.

Of course, compared to InnoDB today, it's much better, as it gets rid of the InnoDB prepare_commit_mutex (which spans all the way from the end of prepare() to the end of what is effectively commit_ordered()), and it also avoids taking LOCK_log over all of log_xid() in the binlog. But for something like NDB, I think serialised commit order would really hurt (if it even makes sense...).

Maybe the answer here is that engines can choose whether to support commit_ordered() (NDB-like engines probably will not). If an engine does not, there is just no support for consistent commit order. And if we implement the simple way to recover engines from the binlog without fsync() in prepare() and commit(), then it will only work for engines supporting commit_ordered(). Later we could implement the more complex recovery without the need for commit_ordered() support.

 - Kristian.

(PS. I haven't forgotten any of your comments, I will get to updating the worklog descriptions eventually. I suppose I'm just a bit overwhelmed at the moment ;-)