Re: [Maria-developers] Architecture review of MWL#116 "Efficient group commit for binary log"

7 Sep 2010

Hi, Kristian!

On Sep 06, Kristian Nielsen wrote:
...
However, as I revisited the algorithm, it occurred to me that it is in
any case better to wake up threads individually as soon as
commit_ordered() has finished. This way, the first threads in the
queue are free to continue doing useful work while we are still
running commit_ordered() for the last threads.
So now the algorithm is something like this:
    thd->ready= false
    lock(LOCK_prepare_ordered)
    old_queue= group_commit_queue
    thd->next= old_queue
    group_commit_queue= thd
    ht->prepare_ordered()
    unlock(LOCK_prepare_ordered)

    if (old_queue == NULL) // leader?
        lock(LOCK_group_commit)

        lock(LOCK_prepare_ordered)
        queue= reverse(group_commit_queue)
        group_commit_queue= NULL
        unlock(LOCK_prepare_ordered)

        group_log_xid(queue)

        lock(LOCK_commit_ordered)  // but see below
        unlock(LOCK_group_commit)
        for thd2 in <queue>
            lock(thd2->LOCK_wakeup)
            thd2->ready= true
            signal(thd2->COND_wakeup)
            unlock(thd2->LOCK_wakeup)
        unlock(LOCK_commit_ordered)
    else
        lock(thd->LOCK_wakeup)
        while (!thd->ready)
            wait(COND_wakeup, LOCK_wakeup)
        unlock(thd->LOCK_wakeup)

    cookie= xid_log_after()

Where in this algorithm do you call ht->commit_ordered()?
...
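For readers who want to experiment with the queue handoff in the pseudocode above, here is a minimal C++ sketch. It is an illustration under assumptions, not the MWL#116 code: `Thd`, `GroupCommit` and `run_group_commit` are hypothetical names, the engine calls are left as comments, and `group_log_xid()` is replaced by appending ids to a vector. Like the pseudocode, it does not call commit_ordered() anywhere.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for THD: only the fields the pseudocode touches.
struct Thd {
    int id = 0;
    Thd* next = nullptr;
    bool ready = false;
    std::mutex lock_wakeup;
    std::condition_variable cond_wakeup;
};

struct GroupCommit {
    std::mutex lock_prepare_ordered, lock_group_commit, lock_commit_ordered;
    Thd* queue = nullptr;     // newest transaction first
    std::vector<int> logged;  // ids in the order the "binlog write" saw them

    static Thd* reverse(Thd* head) {
        Thd* out = nullptr;
        while (head != nullptr) {
            Thd* next = head->next;
            head->next = out;
            out = head;
            head = next;
        }
        return out;
    }

    void commit(Thd* thd) {
        thd->ready = false;
        Thd* old_queue;
        {
            std::lock_guard<std::mutex> g(lock_prepare_ordered);
            old_queue = queue;
            thd->next = old_queue;
            queue = thd;
            // ht->prepare_ordered() would run here, in queue order.
        }
        if (old_queue == nullptr) {  // leader
            std::unique_lock<std::mutex> group(lock_group_commit);
            Thd* q;
            {
                std::lock_guard<std::mutex> g(lock_prepare_ordered);
                q = reverse(queue);
                queue = nullptr;
            }
            // Stand-in for group_log_xid(queue).
            for (Thd* t = q; t != nullptr; t = t->next) logged.push_back(t->id);
            std::lock_guard<std::mutex> ordered(lock_commit_ordered);
            group.unlock();
            for (Thd* t = q; t != nullptr;) {
                Thd* next = t->next;  // read before waking: t may be reused afterwards
                std::lock_guard<std::mutex> w(t->lock_wakeup);
                t->ready = true;
                t->cond_wakeup.notify_one();
                t = next;
            }
        } else {  // follower: sleep until the leader marks us ready
            std::unique_lock<std::mutex> w(thd->lock_wakeup);
            thd->cond_wakeup.wait(w, [thd] { return thd->ready; });
        }
        // cookie= xid_log_after() would follow here.
    }
};

// Runs n concurrent commits and returns the ids in logged order.
std::vector<int> run_group_commit(int n) {
    GroupCommit gc;
    std::vector<Thd> thds(n);
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        thds[i].id = i;
        workers.emplace_back([&gc, &thds, i] { gc.commit(&thds[i]); });
    }
    for (std::thread& w : workers) w.join();
    return gc.logged;
}
```

How the transactions are split into groups depends on thread scheduling, so the only stable property is that every id is logged exactly once.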
On the other hand, the algorithm I suggested earlier for START
TRANSACTION WITH CONSISTENT SNAPSHOT used the LOCK_commit_ordered, and
there might be other uses...
So I am not sure. I'd like to think more about it, or what do you
think?
START TRANSACTION WITH CONSISTENT SNAPSHOT is a good reason to keep the
mutex.
...
It would be possible to iterate over the queue to invoke
prepare_ordered() in sequence from a single thread, just like
group_log_xid() and commit_ordered(). But this would delay the
calls until the previous group commit is done and the next one
starts
No, why? You only need one atomic fetch_and_store to copy the queue
head to a local variable and reset the queue. Then you can call
prepare_ordered or commit_ordered in the queue order without any
mutex.
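One possible reading of this fetch_and_store suggestion, sketched in C++ with std::atomic. The names `Node`, `CommitQueue`, `take_all` and `drain_in_order` are hypothetical, and the compare-and-swap push is an assumption about how entries would get onto such a queue; only the single atomic exchange in `take_all` comes from the text above.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical queue entry; xid stands in for the transaction.
struct Node {
    int xid;
    Node* next;
};

struct CommitQueue {
    std::atomic<Node*> head{nullptr};

    // Lock-free push. Returns true if the queue was empty, i.e. the
    // caller would become the group leader.
    bool push(Node* n) {
        Node* old = head.load(std::memory_order_relaxed);
        do {
            n->next = old;
        } while (!head.compare_exchange_weak(old, n,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
        return old == nullptr;
    }

    // The fetch_and_store step: copy the head and reset the queue in one
    // atomic operation, then reverse the private list into arrival order.
    Node* take_all() {
        Node* list = head.exchange(nullptr, std::memory_order_acquire);
        Node* out = nullptr;
        while (list != nullptr) {
            Node* next = list->next;
            list->next = out;
            out = list;
            list = next;
        }
        return out;
    }
};

// Walks the taken list without any mutex; prepare_ordered() or
// commit_ordered() would be called per entry where the xid is recorded.
std::vector<int> drain_in_order(CommitQueue& q) {
    std::vector<int> order;
    for (Node* n = q.take_all(); n != nullptr; n = n->next) order.push_back(n->xid);
    return order;
}
```

After `take_all` the list is private to the calling thread, which is why the per-entry calls need no lock in this sketch.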
I am not sure if I understood your suggestion correctly. But what I
considered with the above statement about "delaying calls to
prepare_ordered()" is this:
Just like the group leader thread runs commit_ordered() in a loop over
the queue just after group_log_xid(), we could have it do a similar
loop for prepare_ordered() just before group_log_xid().
Yes.
...
But I choose to do it earlier, as soon as the transaction is put in
the queue and commit order thereby defined.
There can be quite a "long" time interval between these two events:
the time it takes for the previous group_log_xid() (e.g. an fsync()),
plus sometimes one wants to add extra sleeps in group commit to group
more transactions together.
No.
The long interval is *inside* the group_log_xid(), while you call
prepare_ordered() *before* it.

But anyway, the LOCK_prepare_ordered mutex is not going to be contended,
so removing it by using a lock-free queue (that's what this second
approach is about) will not bring any noticeable benefits.
...
The main performance bottleneck I am introducing is, I think, the
serialisation of the commit_ordered() part of commit. Not just for
some particular engine implementation, but for the interface. That is
not a decision to be taken lightly.
Of course, compared to InnoDB today, it's much better, as it gets rid
of the InnoDB prepare_commit_mutex (which spans all the way from end
of prepare() to end of what is effectively commit_ordered()), and also
avoids taking LOCK_log over all of log_xid() in the binlog.
But for something like NDB, I think serialised commit order would
really hurt (if it even makes sense ...)
Maybe the answer here is that engines can choose to support
commit_ordered() or not (and NDB-like engines probably will not). And
if not, there is just no support for consistent commit order.
And if we implement the simple way to recover engines from binlog
without fsync() in prepare() and commit(), then it will only work for
engines supporting commit_ordered(). Later we could implement the more
complex recovery without need for commit_ordered() support.
It's reasonable to say that if an engine does not implement
commit_ordered() then it needs to take care of its own recovery and
fsync both in prepare and commit.
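A small sketch of what an optional commit_ordered() hook could look like on the server side. Everything here is hypothetical: `Engine` is a trimmed-down stand-in rather than the real handlerton, and `run_commit_ordered` only models the leader's loop over one group.

```cpp
#include <cassert>
#include <vector>

// Hook signature assumed for this sketch only.
using OrderedHook = void (*)(int xid, std::vector<int>& engine_log);

// Hypothetical engine descriptor, not the real handlerton.
struct Engine {
    const char* name;
    OrderedHook commit_ordered;  // nullptr: the engine opts out
    std::vector<int> log;        // xids in the order the engine committed them
};

// Example hook: an engine that simply records the commit order.
void record_commit(int xid, std::vector<int>& engine_log) {
    engine_log.push_back(xid);
}

// Leader-side loop over one group, in binlog order. Returns false when the
// engine has no hook; such an engine gets no consistent commit order and
// has to handle its own recovery, with fsync in both prepare and commit.
bool run_commit_ordered(Engine& e, const std::vector<int>& xids_in_binlog_order) {
    if (e.commit_ordered == nullptr) return false;
    for (int xid : xids_in_binlog_order) e.commit_ordered(xid, e.log);
    return true;
}
```

The return value is where a server could decide, per engine, whether the simple binlog-based recovery applies.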

Regards,
Sergei

Sergei Golubchik