Hi, Kristian! On Sep 06, Kristian Nielsen wrote:
However, as I revisited the algorithm, it occured to me that it is in any case better to wake up threads individually as soon as commit_ordered() has finished. This way, the first threads in the queue are free to continue doing useful work while we are still running commit_ordered() for the last threads.
So now the algorithm is something like this:
thd->ready= false lock(LOCK_prepare_ordered) old_queue= group_commit_queue thd->next= old_queue group_commit_queue= thd ht->prepare_ordered() unlock(LOCK_prepare_ordered)
if (old_queue == NULL) // leader? lock(LOCK_group_commit)
lock(LOCK_prepare_ordered) queue= reverse(group_commit_queue) group_commit_queue= NULL unlock(LOCK_prepare_ordered)
group_log_xid(queue)
lock(LOCK_commit_ordered) // but see below unlock(LOCK_group_commit) for thd2 in <queue> lock(thd2->LOCK_wakeup) thd2->ready= true signal(thd2->COND_wakeup) unlock(thd2->LOCK_wakeup) unlock(LOCK_commit_ordered) else lock (thd->LOCK_wakeup) while (!thd->ready) wait(COND_wakeup, LOCK_wakeup) unlock (thd->LOCK_wakeup)
cookie= xid_log_after()
Where in this algorithm you call ht->commit_ordered() ?
On the other hand, the algorithm I suggested earlier for START TRANSACTION WITH CONSISTENT SNAPSHOT used the LOCK_commit_ordered, and there might be other uses...
So I am not sure. I'd like to think more about it, or what do you think?
START TRANSACTION WITH CONSISTENT SNAPSHOT is a good reason to keep the mutex.
It would be possible to iterate over the queue to invoke prepare_ordered() in sequence from a single thread, just like group_log_xid() and commit_ordered(). But this would delay the calls until the previous group commit is done and the next one starts
No, why ? You only need one atomic fetch_and_store to copy the queue head to a local variable and reset the queue. Then you can call prepare_ordered or commit_ordered in the queue order without any mutex.
I am not sure if I understood your suggestion correctly. But what I considered with the above statement about "delaying calls to prepare_ordered()" is this:
Just like the group leader thread runs commit_ordered() in a loop over the queue just after group_log_xid(), we could have it do a similar loop for prepare_ordered() just before group_log_xid().
Yes.
But I choose to do it earlier, as soon as the transaction is put in the queue and commit order thereby defined.
There can be quite a "long" time interval between these two events: the time it takes for the previous group_log_xid() (eg. an fsync()), plus sometimes one wants to add extra sleeps in group commit to group more transactions together.
No. The long interval is *inside* the group_log_xid(), while you call prepare_ordered() *before* it. But anyway, the LOCK_prepare_ordered mutex is not going to be contented, so removing it by using a lock-free queue (that's what this second approach is about) will not bring any noticeable benefits.
The main performance bottleneck I am introducing is, I think, the serialisation of the commit_ordered() part of commit. Not just for some particular engine implementation, but for the interface. That is not a decision to be taken lightly.
Of course, compared to InnoDB today, it's much better, as it gets rid of the InnoDB prepare_commit_mutex (which spans all the way from end of prepare() to end of what is effectively commit_ordered()), and also avoids taking LOCK_log over all of log_xid() in the binlog.
But for something like NDB, I think serialised commit order would really hurt (if it even makes sense ...)
Maybe the answer here is that engines can choose to support commit_ordered() or not (and NDB-like engines probably will not). And if not, there is just no support for consistent commit order.
And if we implement the simple way to recover engines from binlog without fsync() in prepare() and commit(), then it will only work for engines supporting commit_ordered(). Later we could implement the more complex recovery without need for commit_ordered() support.
It's reasonable to say that if an engine does not implement commit_ordered() then it needs to take care of its own recovery and fsync both in prepare and commit. Regards, Sergei