Michael Widenius <monty@askmonty.org> writes:
What should happen if you kill a replication thread is that replication should stop for that master.
Kristian> This needs more thought, I think ... certainly something does not look right.
After looking at the full code, I think that the logical way things should work is:
'stop' is to be used when you want to nicely take down replication. This means that the current commit groups should be given time to finish.
thd->killed should mean that we should stop ASAP - all uncommitted things should abort.
This is needed in a 'panic shutdown' (like when we are soon out of power) or when trying to kill the replication thread because one notices that something went horribly wrong (like an ALTER TABLE stopping replication).
Ok, so I thought more about this. There are two kinds of threads involved.

One is the normal SQL thread (this is the only thread involved in non-parallel replication). There is a one-to-one correspondence between an SQL thread and the associated master connection. It seems to make sense that KILL CONNECTION on the SQL thread would stop the replication. I do not know whether this is how it works or not in current replication, but whatever it is, we should just leave the behaviour the same.

The other kind of thread is the parallel replication worker thread. We have a pool of those (the size of the pool is specified by --slave-parallel-threads). These worker threads are not associated with any particular master connection. They are assigned to execute one event group at a time for whichever master connection is in need of a new thread.

So a KILL QUERY or KILL CONNECTION on a worker thread should abort the currently executing event - and it will, I assume, using the existing code for killing a running query. This will fail the execution of the query, and thus eventually stop the associated SQL thread and master connection, once the error is propagated out of the worker thread up to the parent SQL thread.

But KILL of a worker thread should _not_ cause the thread to exit, I think. The pool of parallel threads is just a thread pool, and for simplicity in the first version it has a fixed size. A KILL will disconnect the thread from the replication connection it was servicing, but the thread must remain alive, ready to serve another connection if needed.

So the rpt->stop is only about maintaining the thread pool for parallel replication, not about aborting replication for any specific master connection. In rpl_parallel_change_thread_count() we set rpt->stop to ask all existing threads to exit, and afterwards spawn a new set of threads. This can only happen when all replication SQL/IO threads are stopped and no event execution is taking place.

So this code of mine is actually wrong, I think:

    while (!rpt->stop && !thd->killed)

It should just be while (!rpt->stop) { ... }. And thd->killed should not be used at all to control thread termination. Instead, it should be used (and is not currently, this needs to be fixed) in various places during event execution to allow aborting the currently executing event. This is needed around rpt_handle_event() I think, and also around queue_for_group_commit().

For normal stop, this is handled mostly in the SQL thread. It waits with a timeout for the current event group to finish normally, then does a hard kill if needed. This needs to be extended to handle running things in a worker thread, of course. The code is around sql_slave_killed(), I think.

Does that make sense?
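To illustrate the distinction, here is a tiny standalone sketch (plain C++ with std::thread primitives, not the actual server code) of what I mean: the stop flag controls the lifetime of the pool thread itself, while a per-task kill flag only aborts the event group being executed.

    #include <atomic>
    #include <condition_variable>
    #include <mutex>
    #include <queue>

    /* One queued event group; 'killed' plays the role of thd->killed. */
    struct task
    {
      std::atomic<bool> killed{false};
    };

    struct worker
    {
      std::mutex lock;
      std::condition_variable cond;
      std::queue<task *> queue;
      bool stop= false;     /* like rpt->stop: only set when the pool itself is
                               torn down (all SQL/IO threads already stopped);
                               whoever sets it also signals cond */

      void run()
      {
        std::unique_lock<std::mutex> lk(lock);
        while (!stop)       /* not: while (!stop && !killed) */
        {
          if (queue.empty())
          {
            cond.wait(lk);
            continue;
          }
          task *t= queue.front();
          queue.pop();
          lk.unlock();
          execute(t);       /* may abort the event group if t->killed is set,
                               but the worker thread itself stays alive */
          lk.lock();
        }
      }

      void execute(task *t)
      {
        if (t->killed.load())
          return;           /* the error would propagate up to the SQL thread */
        /* ... apply the event group (rpt_handle_event() in the real code) ... */
      }
    };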
+  mysql_mutex_lock(&LOCK_wait_commit);
+  waiting_for_commit= false;
+  mysql_cond_signal(&COND_wait_commit);
+  mysql_mutex_unlock(&LOCK_wait_commit);
+}
In this particular case it looks safe to move the cond signal out of the mutex. I don't see how anyone could miss the signal.
Kristian> Probably. Generally I prefer to write it as above, to not have to make sure
Kristian> that moving it out is safe.
I agree that this is the way to do it in general. However, for cases where we are waiting almost once per query, we need to make things faster, as this extra wakeup will take notable resources!
Ok, no problem, I've changed it. I should benchmark this; it keeps coming up. It's not clear to me which way would be the fastest; it seems it would depend on the implementation.
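For reference, the two variants we are talking about look like this (the same code as in the hunk above, just with the signal moved; whether the second form actually wins anything is exactly what would need benchmarking):

    /* Variant 1: signal while still holding the mutex (what I wrote originally). */
    mysql_mutex_lock(&LOCK_wait_commit);
    waiting_for_commit= false;
    mysql_cond_signal(&COND_wait_commit);
    mysql_mutex_unlock(&LOCK_wait_commit);

    /*
      Variant 2: signal after releasing the mutex (what is now in the patch).
      This is safe here because the waiter re-checks waiting_for_commit under
      the mutex before going to sleep, so the wakeup cannot be lost; it avoids
      waking a thread that would immediately block on the still-held mutex.
    */
    mysql_mutex_lock(&LOCK_wait_commit);
    waiting_for_commit= false;
    mysql_mutex_unlock(&LOCK_wait_commit);
    mysql_cond_signal(&COND_wait_commit);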
Hi!
Part 2 of review of parallel replication
=== added file 'sql/rpl_parallel.cc'
<cut>
+pthread_handler_t
+handle_rpl_parallel_thread(void *arg)
+{
<cut>
+      if (end_of_group)
+      {
+        in_event_group= false;
+
Add comment:
/*
  All events for this group have now been executed and logged (but not
  necessarily synced). Inform the other event groups that are waiting for
  this thread that we are done.
*/
+        rgi->commit_orderer.unregister_wait_for_prior_commit();
+        thd->wait_for_commit_ptr= NULL;
+
+        /*
+          Record that we have finished, so other event groups will no
+          longer attempt to wait for us to commit.
+
+          We can race here with the next transactions, but that is fine, as
+          long as we check that we do not decrease last_committed_sub_id. If
+          this commit is done, then any prior commits will also have been
+          done and also no longer need waiting for.
+        */
+        mysql_mutex_lock(&entry->LOCK_parallel_entry);
+        if (entry->last_committed_sub_id < event_gtid_sub_id)
+        {
+          entry->last_committed_sub_id= event_gtid_sub_id;
+          mysql_cond_broadcast(&entry->COND_parallel_entry);
+        }
+        mysql_mutex_unlock(&entry->LOCK_parallel_entry);
+
Add:
/*
  Tell storage engines that they can stop waiting. See comments in
  sql_class.cc::wakeup_subsequent_commits2() for why we can't do this part
  of the wakeup in unregister_wait_for_prior_commit().
*/
I think there is some misunderstanding here. I added some comments that will hopefully clarify, but the situation is a bit different from how I read this suggestion.

The situation is that we have for example three threads, A, B, and C. We are thread B at this point in the code. Thread B needed to wait for A to commit first. C needs to wait for B to commit.

The call to unregister_wait_for_prior_commit() relates to B waiting for A; it cannot be used in relation to C waiting for B, irrespective of the comments in the function wakeup_subsequent_commits2() ...

The unregister_wait_for_prior_commit() removes B from the list of threads waiting for A. This is usually redundant, because B already removed itself with wait_for_prior_commit() during the commit step. However, if there is for example an error, we might not reach commit, and thus we need to remove the dangling entry in A.

The wakeup_subsequent_commits() signals C that B is done. Again, this is often redundant, because we already signalled C during B's commit step. However, C could have registered to wait for B between B's commit and this place in the code, so it is needed for correctness.

Note that in the cases where this is redundant, no extra mutexes are taken. The unregister_wait_for_prior_commit() and wakeup_subsequent_commits() functions are inlines that do just a single test of one field in the struct.
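As a toy model of the A/B/C situation (standalone and heavily simplified; the real wait_for_commit struct keeps a list of waiters rather than a single pointer, and makes the redundant calls cheap inline tests):

    #include <condition_variable>
    #include <mutex>

    /* Toy stand-in for one event group's commit state (thread B, say). */
    struct commit_state
    {
      std::mutex lock;
      std::condition_variable cond;
      bool committed= false;
      commit_state *waitee= nullptr;   /* the prior event group (A) we wait on */

      /* B, during commit: wait until A has committed, then drop the registration. */
      void wait_for_prior_commit()
      {
        if (!waitee)
          return;
        std::unique_lock<std::mutex> lk(waitee->lock);
        while (!waitee->committed)
          waitee->cond.wait(lk);
        waitee= nullptr;
      }

      /* B, after the event group: normally a no-op; only needed if an error
         made us skip wait_for_prior_commit() and we are still registered on A. */
      void unregister_wait_for_prior_commit()
      {
        waitee= nullptr;
      }

      /* B, when done: mark ourselves committed and wake up C.  Usually C was
         already woken during B's commit, but C may have registered against B
         after that point, so this call is needed for correctness.  (The real
         code makes the redundant case a cheap flag test; this toy always
         takes the mutex.) */
      void wakeup_subsequent_commits()
      {
        std::lock_guard<std::mutex> lk(lock);
        committed= true;
        cond.notify_all();
      }
    };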
As far as I can see, rpt->free is always 0 here. - It's set to 0 at the top of this loop and only set to 1 here.
Can we remove the rpt->free variable?
Yes, it looks like it. I have removed it. It was supposed to mark whether the worker thread was free to be reused for another event group, but this is already marked by the thread being included in rpt->pool->free_list.
Add:
/*
  The following loops wait until each thread is done with its work for each
  event group, and then stop it when the thread has nothing to do.

  Note that this may take some time, as rpl_parallel::do_event() may request
  threads at the same time.
*/
I added this comment instead:

  /*
    Grab each old thread in turn, and signal it to stop.

    Note that since we require all replication threads to be stopped before
    changing the parallel replication worker thread pool, all the threads will
    be already idle and will terminate immediately.
  */

Hope that clarifies things.
+void
+rpl_parallel::wait_for_done()
+{
+  struct rpl_parallel_entry *e;
+  uint32 i;
+
+  for (i= 0; i < domain_hash.records; ++i)
+  {
+    e= (struct rpl_parallel_entry *)my_hash_element(&domain_hash, i);
+    mysql_mutex_lock(&e->LOCK_parallel_entry);
+    while (e->current_sub_id > e->last_committed_sub_id)
+      mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry);
+    mysql_mutex_unlock(&e->LOCK_parallel_entry);
+  }
+}
How should we handle errors in the above case? For example, when we get an error and have to abort things?
I think even if we have to abort event execution in a worker thread, we should still increment e->last_committed_sub_id. So the wait here is ok. But we do need code to be able to force all worker threads to stop.

An important piece of this puzzle is in sql_slave_killed(). I do not understand the full details of this code yet, but I believe the idea is to wait with some timeout for event groups to finish normally, then hard kill if the timeout triggers (I vaguely remember that for InnoDB we can just kill and rollback, while for MyISAM the timeout is necessary to avoid leaving half an update executed). We need to extend sql_slave_killed() to also handle waiting for each worker thread to stop, with the timeout. Once we have that, it is hopefully clear how to do wait_for_done(); maybe it can call sql_slave_killed(), or maybe it is no longer needed (wait_for_done() is just a temporary thing I put in to get things working for the benchmark).

We also need to extend the API around struct wait_for_commit to include error handling. I suppose wait_for_prior_commit() should check for thd->killed when it waits, and be extended to return an error if killed. And then wakeup_subsequent_commits() should also be extended to take an error argument, so that if B is waiting for A, and A gets an error, then A can call wakeup_subsequent_commits() with error=true, which will cause B to wake up but also get an error.

Then if we have a number of event groups processing in parallel, A->B->C->D, and B ends up with an error, then A can complete, but B will signal an error to C (and then to D), so B, C, and D will all fail. Then somewhere, perhaps around the end of apply_event_and_update_pos(), we must check an error flag set for B, C, and D, and signal the error back to the main SQL thread so that it can halt and print the error to the error log, as usual.

Something like that, I think. Some work to do, but should be possible at least.
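Signature-wise, I imagine something like this (only a sketch of the proposal above; none of it exists yet, and the wakeup_error member is just my guess at how the flag could be carried):

    class THD;

    struct wait_for_commit
    {
      /* ... existing members ... */

      /* Set by a prior commit that failed, checked by the waiter. */
      bool wakeup_error;

      /*
        Wait for the prior commit as now, but also watch thd->killed, and
        return non-zero if we are killed or if the transaction we waited on
        signalled an error.
      */
      int wait_for_prior_commit(THD *thd);

      /*
        Wake up all registered waiters as now; with error=true, also set
        wakeup_error in each of them so their wait fails and they roll back.
      */
      void wakeup_subsequent_commits(bool error);
    };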
The other way would be to, on error, set last_committed_sub_id back to its last value.
What do you suggest? I assume setting the value back would be the easiest solution...
Yes (if I understood correctly what you meant by "setting back")... I think we should update things the same way whether the transaction commits successfully or fails, so we get the same wakeup of later transactions, but with the extra error flag added to wait_for_commit so that later transactions can see the failure and thus fail and roll back themselves. And then the error can be propagated up the call stack, as normal. Of course, this becomes more complex due to the need to be able to re-try a transaction in case of certain temporary errors like deadlocks and so on. Hopefully something can be made to work.
+      /*
+        We are already executing something else in this domain. But the two
+        event groups were committed together in the same group commit on the
+        master, so we can still do them in parallel here on the slave.
+
+        However, the commit of this event must wait for the commit of the prior
+        event, to preserve binlog commit order and visibility across all
+        servers in the replication hierarchy.
+      */
+      rpl_parallel_thread *rpt= global_rpl_thread_pool.get_thread(e);
Here we could add a test for whether get_thread() returns 0, which could happen on shutdown or when killing a replication thread.
I do not think we need to do this. I made it so that the worker thread pool can never change size without stopping all replication threads first, so we avoid a lot of these issues. And as I wrote above, killing a worker thread should terminate the event group that it is currently servicing (and the associated SQL thread and master connection), but not remove the thread from the pool. (I can imagine that in a later version we want to implement a smarter worker thread manager, but for the first version I tried to keep things simple, there are plenty of other problems that we need to tackle...)
I think it's relatively safe to use get_thread() as an end-synchronize point, as we should never have anything half-done when we call get_thread(). As get_thread() can take a very long time and is not called in many places, it also sounds like a reasonable thing to do.
Sorry, I do not understand. "End-synchronize point" for what?
+    /*
+      Events like ROTATE and FORMAT_DESCRIPTION. Do not run in worker thread.
+      Same for events not preceeded by GTID (we should not see those normally,
+      but they might be from an old master).
+    */
+    qev->rgi= serial_rgi;
+    rpt_handle_event(qev, NULL);
+    delete_or_keep_event_post_apply(rli, typ, qev->ev);
Why is delete_or_keep_event_post_apply() not part of rpt_handle_event()? You always call these two together...
It is to share code with the non-parallel case. When parallel replication is disabled, everything is done in the SQL thread in slave.cc, like this:

    exec_res= apply_event_and_update_pos(ev, thd, serial_rgi, NULL);
    delete_or_keep_event_post_apply(serial_rgi, typ, ev);

Or did you mean that I could move the call of delete_or_keep_event_post_apply() into the function rpt_handle_event()? That might be possible, though it was not 100% clear to me, as there are a few differences in how it is done in the different cases.

But in any case, that part of the code needs more work, especially with regard to error handling and to updating the last executed event position. Probably it makes sense to revisit this as part of that work.

 - Kristian.