Hi Monty,
I read through your latest patch for the MDEV-4506 task, parallel replication.
Just to re-cap, this is a work-in-progress patch, so these are just some
comments along the way that I hope will be useful.
> -static bool sql_slave_killed(THD* thd, Relay_log_info* rli)
> +static bool sql_slave_killed(rpl_group_info *rgi)
> -static Log_event* next_event(Relay_log_info* rli)
> +static Log_event* next_event(rpl_group_info *rgi)
This does not look right to me. Can you explain the reason to use rgi here
instead of rli? To me, rgi is about what happens in parallel in each worker
thread. But these functions are about what happens in the driver SQL thread,
fetching each event from the relay log and checking if the driver SQL thread
is done and can exit.
It looks from the patch like rgi is always passed the serial_rgi. I think it
would be better if it did not use rgi at all, we could pass in any extra
information needed directly.
Note that serial_rgi should _only_ be used in the case where parallel
replication is not activated and actual event execution happens inside the
main driver SQL thread.
Or did I mis something?
> Log_event::enum_skip_reason
> -Gtid_log_event::do_shall_skip(Relay_log_info *rli)
> +Gtid_log_event::do_shall_skip(rpl_group_info *rgi)
> {
> + Relay_log_info *rli= rgi->rli;
> /*
> An event skipped due to @@skip_replication must not be counted towards the
> number of events to be skipped due to @@sql_slave_skip_counter.
> @@ -6315,10 +6319,13 @@ Gtid_log_event::do_shall_skip(Relay_log_
> if (rli->slave_skip_counter > 0)
> {
> if (!(flags2 & FL_STANDALONE))
> + {
> thd->variables.option_bits|= OPTION_BEGIN;
> - return Log_event::continue_group(rli);
> + rgi->rli->set_flag(Relay_log_info::IN_TRANSACTION);
> + }
> + return Log_event::continue_group(rgi);
> }
> - return Log_event::do_shall_skip(rli);
> + return Log_event::do_shall_skip(rgi);
> }
This looks to me like it cannot work.
shall_skip() is called from apply_event_and_update_pos() which runs in each
worker thread, in parallel. But rli->slave_skip_counter needs to be checked
from the driver SQL thread, in the order that events are read from the relay
log.
Otherwise, if slave_skip_counter is 1 and we have transactions T1 and T2, T2
might run first and we would end up skipping T2 instead of T1 which is
wrong. Besides, it seems better to in as many cases as possible to skip events
by not queuing them for a worker thread in the first place.
I am wondering if this change came as part of a mechanical change from rli to
rgi when is_in_group() was moved from rli to rgi? That is why I had this ToDo
comment:
> - - Relay_log_info::is_in_group(). This needs to be handled correctly in all
> - callers. I think it needs to be split into two, one version in
> - Relay_log_info to be used from next_event() in slave.cc, one to be used in
> - per-transaction stuff.
I think what needs to be done here is that event skip handling needs to be
moved, so that it is done in exec_relay_log_event() rather than in
apply_event_and_update_pos(). Then it can be done always in the driver SQL
thread, both in the parallel and in the non-parallel case.
It is possible that we then may need the flag associated with is_in_group()
both in rli and in rgi, we will see when we do the change.
What do you think?
> @@ -71,6 +66,7 @@ rpt_handle_event(rpl_parallel_thread::qu
>
> /* ToDo: Access to thd, and what about rli, split out a parallel part? */
> mysql_mutex_lock(&rli->data_lock);
> + qev->ev->thd= thd;
Indeed, this seems to be needed.
Wow. I'm surprised that I have been able to get anything to work when this was
missing?!? Or is it because in most cases, THD is passed into Log_event::
functions directly, rather than using Log_event::thd ?
> @@ -1094,9 +1090,9 @@ bool Relay_log_info::is_until_satisfied(
> !replicate_same_server_id)
> DBUG_RETURN(FALSE);
> log_name= group_master_log_name;
> - log_pos= (!ev)? group_master_log_pos :
> - ((thd->variables.option_bits & OPTION_BEGIN || !ev->log_pos) ?
> - group_master_log_pos : ev->log_pos - ev->data_written);
> + log_pos= ((!ev)? group_master_log_pos :
> + (get_flag(IN_TRANSACTION) || !ev->log_pos) ?
> + group_master_log_pos : ev->log_pos - ev->data_written);
Yes, it is clear that we need something like this.
You told me on IRC that you still needed to work on this. Eg. I see you
started to add a part_of_next_statement() but did not implement this. I also
am worried how to make sure that we do not miss any cases here, and agree it
needs to be checked.
I will just mention that for GTID I already added
Log_event::is_part_of_group(), which seems related to your
part_of_next_statement()? Maybe they could use the same mechanism.
One outstanding issue with is_part_of_group is for row-based replication with
multi-table update statements. In such multi-table updates there will be two
(or more) Rows_log_event events, and there is a flag to tell which is the last
one. They need to be considered part of the same statement. I think this is
not yet correctly handled in the GTID code (MDEV-4985). Probably this case
also needs to be handled correctly here?
> - if ((rgi->thd->variables.option_bits & OPTION_BEGIN) && opt_using_transactions)
> + if ((rgi->rli->get_flag(IN_TRANSACTION)) && opt_using_transactions)
There is more to be done here than this. Because this runs in each parallel
worker thread, in parallel.
Wouldn't it be more correct actually to use rgi->thd here, actually? But in
any case, we need to think about how to handle old-style replication position
updates.
Since we already ensure that transactions commit in-order, maybe it is ok to
do the updates to position in parallel, as long as we do not decrement the
position. So if we have transactions T1 and T2 running in parallel, T1 will
commit before T2, but T2 may then race ahead and update replication position
before T1. That is ok (both T1 and T2 are committed), but then we need in T1
to check and see that position is already ahead, so we do not set it back to
T1 (that could cause duplication of T2 in case of slave restart).
And in the GTID case, where the replication position is not used in case of
slave restart, then it would be best not to update the master.info and
relay-log.info file after every transaction (to save the cost of write +
possibly fsync). But it would be best to still update it in-memory so that
SHOW SLAVE STATUS shows something meaningful.
> + Flags for the state of reading the relay log. Note that these are
> + bits
"Note that these are bit masks" is less ambiguous.
> + This is updated when a temporary table is created or drooped by
> + a replication thread.
"drooped" -> "dropped".
> + /**
> + Is the replication inside a group?
> +
> + Replication is inside a group if either:
> + - The IN_TRANSACTION flag is set, meaning we're inside a transaction
> + - The IN_STMT flag is set, meaning we're inside a statement
> +
> + @retval true Replication thread is currently inside a group
> + @retval false Replication thread is currently not inside a group
> + */
> + bool is_in_group() const {
> + return (m_flags & (IN_STMT | IN_TRANSACTION));
> + }
Do I understand correctly that this refers to the last event fetched from the
relay log, not the last event executed? These are different in the case of
parallel replication. I think the comment should be clarified to explicitly
say one way or the other.
Hm, I see now that this is not your comment (you just moved it), so it is ok,
but it would still be good if we could clarify the comment.
> -static bool sql_slave_killed(THD* thd, Relay_log_info* rli)
> +static bool sql_slave_killed(rpl_group_info *rgi)
> {
> /*
> - The transaction should always be binlogged if OPTION_KEEP_LOG is set
> - (it implies that something can not be rolled back). And such case
> - should be regarded similarly as modifing a non-transactional table
> - because retrying of the transaction will lead to an error or inconsistency
> - as well.
> - Example: OPTION_KEEP_LOG is set if a temporary table is created or dropped.
> + The transaction should always be binlogged if OPTION_KEEP_LOG is
> + set (it implies that something can not be rolled back). And such
> + case should be regarded similarly as modifing a
> + non-transactional table because retrying of the transaction will
> + lead to an error or inconsistency as well.
> +
> + Example: OPTION_KEEP_LOG is set if a temporary table is created
> + or dropped.
> */
> if ((thd->transaction.all.modified_non_trans_table ||
> (thd->variables.option_bits & OPTION_KEEP_LOG))
So checking thd->transaction is not meaningful in case of parallel
replication.
However, this is related to stopping the slave, which as we discussed before
is still work-in-progress. I think we can keep this code to handle the
non-parallel case, and then add a similar check in the worker threads (maybe
in a new function sql_worker_killed()). And maybe then add a check here that
all worker threads have stopped. I'm looking into this.
So this should be ok for now, just wanted to mention it.
- Kristian.