Sergei Golubchik <serg@mariadb.org> writes:
Here it is, below.
Thanks for review! I'll give comments inline where applicable, other points I will just implement as you suggested, once the overall API is decided.
* MySQL has a variable binlog_order_commits. we can introduce it too - it allows to cut lots of corners in binlogging code. And most users don't need strict binlog ordering most of the time.
I don't understand. AFAIU, binlog_order_commits mainly ensures that the InnoDB redo log has the same commit sequence as the binlog. This is mainly needed to be able to take a non-blocking XtraBackup (without FLUSH TABLES WITH READ LOCK) and use it as a slave. This does not seem relevant to the case at hand? My guess is, you have in mind a different option, that allows a slave to write its binlog in a different order than what it was on the master? In MySQL, there is no facility to force same binlog order on a slave as on a master, though when using single-treaded replication it will normally be the case anyway. Instead, there are various mechanisms in their parallel replication and in their GTID to deal with out-of-order binlogs between master and slave. Is this what you are suggesting? An option that allows the MariaDB slave to commit in different order than the master? The main problem with that is that it fundamentally breaks the design of MariaDB GTID. Also, it makes it hard to know where in the binlog to restart the slave after a slave crash? There are some interesting possibilities in this direction, but I'd like to understand better what you have in mind before starting to discuss in a random direction.
* I tried to ignore background killing code, because we're discussing whether it'll stay or not. And the discussion of the API is happening in another thread. So I was mostly looking at the general logic and at the binlog re-reading code, that we've already discussed on irc.
Agree.
=== modified file 'sql/sql_admin.cc' --- sql/sql_admin.cc 2014-03-27 20:32:53 +0000 +++ sql/sql_admin.cc 2014-06-10 17:48:32 +0000 @@ -912,7 +912,7 @@ static bool mysql_admin_table(THD* thd, protocol->store(operator_name, system_charset_info); if (result_code) // either mysql_recreate_table or analyze failed { - DBUG_ASSERT(thd->is_error() || thd->killed); + DBUG_ASSERT(thd->is_error());
why?
This is a revert of incorrect fix for MySQL Bug#58933. The original bug was that some code was returning an error if thd->killed, but it was not setting thd->is_error() with send_kill_message() or similar. Rather than fix the real bug, the Bug#58933 patch changed the assertion. I happened upon the real bug during my testing, and had to fix it to make my code work. I somehow noticed the incorrect fix for Bug#58933, and decided to change back the assertion to what it was originally, as the extra condition is no longer necessary.
=== modified file 'storage/heap/hp_write.c' --- storage/heap/hp_write.c 2014-03-26 21:25:38 +0000 +++ storage/heap/hp_write.c 2014-06-10 17:48:32 +0000 @@ -153,10 +153,10 @@ static uchar *next_free_record_pos(HP_SH (info->data_length + info->index_length >= info->max_table_size)) { DBUG_PRINT("error", - ("record file full. records: %u max_records: %lu " + ("record file full. records: %lu max_records: %lu " "data_length: %llu index_length: %llu " "max_table_size: %llu", - info->records, info->max_records, + (unsigned long)info->records, info->max_records,
Eh? Why did you add a cast from ulong to ulong?
Sorry, will fix (I think I got a compiler warning, and fixed it without looking up the type. Generally, I tend to just use %lu and (ulong) cast on values in debug printouts, to avoid problems with eg. %llu not being available everywhere etc. Sloppy, I suppose...)
=== modified file 'sql/sql_class.h' --- sql/sql_class.h 2014-04-25 10:58:31 +0000 +++ sql/sql_class.h 2014-06-10 17:48:32 +0000 @@ -1357,7 +1357,8 @@ enum enum_thread_type SYSTEM_THREAD_NDBCLUSTER_BINLOG= 8, SYSTEM_THREAD_EVENT_SCHEDULER= 16, SYSTEM_THREAD_EVENT_WORKER= 32, - SYSTEM_THREAD_BINLOG_BACKGROUND= 64 + SYSTEM_THREAD_BINLOG_BACKGROUND= 64, + SYSTEM_THREAD_SLAVE_BACKGROUND= 128,
is it your background killer thread?
Yes.
=== modified file 'storage/innobase/lock/lock0lock.cc' --- storage/innobase/lock/lock0lock.cc 2014-02-26 18:22:48 +0000 +++ storage/innobase/lock/lock0lock.cc 2014-06-10 17:48:32 +0000 @@ -1020,6 +1021,28 @@ lock_rec_has_to_wait( return(FALSE); }
+ if ((type_mode & LOCK_GAP || lock_rec_get_gap(lock2)) && + !thd_need_ordering_with(trx->mysql_thd, + lock2->trx->mysql_thd)) { + /* If the upper server layer has already decided on the + commit order between the transaction requesting the + lock and the transaction owning the lock, we do not + need to wait for gap locks. Such ordeering by the upper + server layer happens in parallel replication, where the + commit order is fixed to match the original order on the + master. + + Such gap locks are mainly needed to get serialisability + between transactions so that they will be binlogged in + the correct order so that statement-based replication + will give the correct results. Since the right order + was already determined on the master, we do not need + to enforce it again here (and doing so could lead to + occasional deadlocks). */
Please, clarify this comment. It is not clear whether this if() is required for correctness, for the code to work. Or it's merely an optimization, and these occasional deadlocks will only cause the transactiion to be reexecuted, but the end result will still be the same.
I believe, you meant the latter.
Yes, the latter. I will try to clarify, also give the actual examples with INSERT vs. DELETE and INSERT vs. UPDATE. Or do you think we should omit this part of the patch, not optimise this particular case, and instead just rely on the general fallback of detecting a deadlock and killing+retrying?
=== modified file 'sql/log.cc' --- sql/log.cc 2014-03-23 19:09:38 +0000 +++ sql/log.cc 2014-06-10 17:48:32 +0000 @@ -4067,7 +4068,30 @@ int MYSQL_BIN_LOG::purge_first_log(Relay DBUG_ASSERT(!strcmp(rli->linfo.log_file_name,rli->event_relay_log_name));
mysql_mutex_lock(&LOCK_index); - to_purge_if_included= my_strdup(rli->group_relay_log_name, MYF(0)); + + ir= rli->inuse_relaylog_list; + while (ir) + { + inuse_relaylog *next= ir->next;
are relay logs ordered in this list? That is, all newer relay logs are after the current one (ir), all older relay logs are before?
Yes, they should be, I think.
+ if (!ir->completed || ir->dequeued_count < ir->queued_count) + { + included= false; + break; + } + if (!included && 0 == strcmp(ir->name, rli->group_relay_log_name))
do we ever use this reverse style in the code (const == expr)? I don't remember that
I sometimes use it to make it clearer that it's "compare strings for equality". It was not clear if you wanted me to change it to !strcmp(...) ? Or if you were just asking?
=== modified file 'sql/log_event.cc' --- sql/log_event.cc 2014-04-25 10:58:31 +0000 +++ sql/log_event.cc 2014-06-10 17:48:32 +0000 @@ -4384,18 +4420,21 @@ Default database: '%s'. Query: '%s'", { DBUG_PRINT("info",("error ignored")); clear_all_errors(thd, const_cast<Relay_log_info*>(rli)); - thd->reset_killed(); + if (actual_error == ER_QUERY_INTERRUPTED || + actual_error == ER_CONNECTION_KILLED) + thd->reset_killed();
Why?
This code was clearing thd->killed flag unconditionally, after every executed Query_log_event. This causes a race. Suppose we detect a deadlock that T1 needs to wait for T2 just before the code executing T2 reaches this place. We set the kill flag for T2, but immedialy afterwards this flag gets cleared. Thus the kill of T2 is lost, and we end up in a deadlock. I got this during my stress testing, very occasionally the kill would be lost and my tests would hang. Had a bit of fun tracking it down, IIRC :-) This code runs if either there is no error, or there was an error that was explicitly requested to be ignored. So I changed it to only clear thd->killed if we are actually ignoring a killed error. (I wonder if that thd->reset_killed() is really needed at all? I suspect not, but I don't know how to check for sure. So I prefered to do a minimal change like this to be safe, as this is a change in a GA version).
@@ -7284,28 +7321,34 @@ int Xid_log_event::do_apply_event(rpl_gr bool res; int err; rpl_gtid gtid; - uint64 sub_id; + uint64 sub_id= 0; Relay_log_info const *rli= rgi->rli;
+ mysql_reset_thd_for_next_command(thd);
hmm. mysql_reset_thd_for_next_command() is called before the new sql statement, not at the end of the old one. So, you end up calling it twice. It is not a function to use when you only need to reset the error status. The same, I suppose, applies to the previous chunk, Gtid_log_event::do_apply_event
So my problem here is a lack of knowledge of how things should be done correctly. I think the issue here is that I added code that updates the mysql.gtid_slave_pos table, before doing the actual commit of the Xid_log_event. This table update opens tables, it can auto-commit a transaction, it can throw errors. Thus it does much of what is involved in "a new sql statement". Before I added this call, I got various assertions, like about ok status already being set if I gave an error, stuff like that. So if mysql_reset_thd_for_next_command() is not correct, what should I do instead?
+static void +convert_kill_to_deadlock_error(rpl_group_info *rgi) +{ + THD *thd= rgi->thd; + int err_code= thd->get_stmt_da()->sql_errno(); + + if ((err_code == ER_QUERY_INTERRUPTED || err_code == ER_CONNECTION_KILLED) && + rgi->killed_for_retry) + { + thd->clear_error(); + thd->get_stmt_da()->reset_diagnostics_area();
Why do you always do both clear_error() and reset_diagnostics_area()? The former normally includes the latter.
Just lack of knowledge of the right way to do things, I think. I will replace it with just thd->clear_error().
+ do + { + Log_event_type event_type; + Log_event *ev; + rpl_parallel_thread::queued_event *qev; + + /* The loop is here so we can try again the next relay log file on EOF. */ + for (;;) + { + old_offset= cur_offset; + ev= Log_event::read_log_event(&rlog, 0, + rli->relay_log.description_event_for_exec /* ToDo: this needs fixing */, + opt_slave_sql_verify_checksum);
I don't particularly like that there's a second binlog-reading loop here. It would be better to abstract the existing one, somehow, and reuse it here. But I suppose it would've been too difficult to do :(
Indeed, I dislike this as well, and it was the original reason I did not feel like trying to implement retry in parallel replication (before I found I needed it to handle deadlocks). Retry happens very rarely, and a completely different code path for reading binlog in this case is very prone to introducing new bugs due to insufficient testing. But yes, I also did not find a good way to avoid it :-( At least it turned out to be less code needed than I had feared initially. And most of it is already abstracted in called functions like read_log_event() and open_binlog(). Thanks, - Kristian.