Re: [Maria-developers] ee538938345: MDEV-21117: refine the server binlog-based recovery for semisync

16 Mar 2021

Hi, Andrei!

Few comments/questions below.

I still have troubles understanding the logic of your solution,
so I did not ask any questions about those your replies - as I'd be just
repeating the same thing over and over. After I'll start getting it
finally, I'll go over your reply again.

Meanwhile - minor comments/questions, not related to the main algorithm.

On Mar 04, Andrei Elkin wrote:
...
...
...
To facilitate the failover on the slave side conditions to accept
own events (having been discarded by the above recovery) are
relaxed to let so for the semisync slave that connects to master in
gtid mode. gtid_strict_mode is further recommended to secure from
inadvertent re-applying out of order gtids in general. Non-gtid
mode connected semisync slave would require
--replicate-same-server-id (mind --log-slave-updates must be OFF
then).
Sorry, I failed to understand this paragraph :(
'gtid_strict_mode is further recommended to secure
 from inadvertent re-applying out of order gtids in general.'
attempted to promote GTID connection with master,
which of course should not be of this patch's
business, but in a way it is 'cos the non-gtid's mode
(CHANGE MASTER TO ... master_use_gtid = NO) is covered by the patch
with a limitation of '--log-slave-updates must be OFF'.
The GTID-ON connection is covered, and without any `--replicate-same-server-id'.
This doesn't seem to clarify anything.
I mean the whole paragraph, starting from "To facilitate" and to "then)."

Could you try to rephrase it to make it clearer, please?
...
...
...
+--let $query1 = INSERT INTO t VALUES (20)
+--let $query2 = DELETE FROM t2 WHERE f = 666 /* no such record */
+--source binlog_truncate_active_log.inc
+
+--echo # Case B.
+# The inverted sequence ends up to truncate only $query2
+--let $this_search_pattern = Successfully truncated.*to remove transactions starting from GTID 0-1-10
+--let $query1 = DELETE FROM t2 WHERE f = 0
why not `= 666 /* no such record */` ?
is `= 0` important here?
Yes, there's a fine detail here.
'The inverted sequence ends up' to leave in binlog
$query1 = DELETE FROM t2 WHERE f = 0
which is as ineffective as 666. The patch recognizes the ineffectiveness
through a special value of engines involved counter in Gtid event.
In principle the truncation could
start from it rather than the following $query2.
But that would mean more complicated coding.
So the semantics of the zero is to be good, as we're so used to :-)!
Sorry, I didn't understand anything. What does the value in the WHERE
clause has to do with engines and where the binlog is truncated?
...
...
...
+--let $query2 = INSERT INTO t VALUES (20)
+--source binlog_truncate_active_log.inc
...
...
...
+--connection master2
+SET DEBUG_SYNC= "now WAIT_FOR master1_ready";
+if ($pre_q2)
is $pre_q2 a flag or a query string?
it's used as a flag, it's set as a query string.
It's actually both with the idea execute it when defined.
Legit, right?
legit, but it's not both. Were it both, you'd have

  if ($pre_q2) {
    eval $pre_q2;
  }

but you have, 

  if ($pre_q2) {
    CALL sp_blank_xa;
  }
...
...
...
+{
+  CALL sp_blank_xa;
+}
...

diff --git a/mysql-test/suite/binlog/t/binlog_truncate_multi_engine.test b/mysql-test/suite/binlog/t/binlog_truncate_multi_engine.test
new file mode 100644
index 00000000000..fe153e5703c
--- /dev/null
+++ b/mysql-test/suite/binlog/t/binlog_truncate_multi_engine.test
+#
+# ==== Implementation ====
+#
+# Steps:
+#    0 - Create two tables in innodb and rocksdb engines,
+#
+# In loop for A,B,C cases (below) do 1-5:
there's no loop now (which is good, don't add it please :)
[x] ?
...
...
...
+# The 1st trx binlogs, rotate binlog and hold on before committing at engine
+SET DEBUG_SYNC= "commit_after_release_LOCK_after_binlog_sync SIGNAL master1_ready WAIT_FOR master1_go";
you don't wait for master1_ready anywhere. intentional?
No, actually
SET DEBUG_SYNC= "now WAIT_FOR master1_ready";
must be run by below master2 (sim to how master3 waits for master2).
meaning [x] ?
...
...
...
+--source include/have_binlog_format_row.inc
+--let $rpl_topology=1->2
+--source include/rpl_init.inc
isn't this a normal master-slave.inc topology?
Initially yes.
I am guessing you mean why not to
  --source it.
The test heavily uses 'server_`index`' format to automate
replication direction.
as far as I remember, master-slave.inc also defines server_1 and
server_2 connections.
...
...
...
diff --git a/sql/sql_class.h b/sql/sql_class.h
index 140394fefc1..87fa88d4c89 100644
--- a/sql/sql_class.h
+++ b/sql/sql_class.h
@@ -4632,7 +4632,7 @@ class THD :public Statement,
   LF_PINS *tdc_hash_pins;
   LF_PINS *xid_hash_pins;
   bool fix_xid_hash_pins();
-
+  bool is_1pc_ro_trans;
I don't like how it looks, it's in THD (for every single connection
and more) and it's only used in some exotic cases. It's set in two
places around ha_commit_one_phase(), when it would be better to set
it inside ha_commit_one_phase(), in one place only.
But I cannot suggest how to get rid of it yet as I don't understand
what it's for, I've asked a question about it below, in
Gtid_log_event::Gtid_log_event
The flag is to help sorting out ineffective (no engine data change)
STATEMENT binlog-format logged transaction. I thought it's practical
enough to face and important to handle.
Through an example,
let the table `t' be initially empty, and
let binlog contain
trx1: INSERT INTO t SET   a = 1;   /* slave acked, trx got committed in engine */
trx2: INSERT INTO t SET   a = 2;   /* not acked, trx remains prepared in engine */
trx3: DELETE FROM t WHERE a = 0;   /* ineffective, does not have Xid event */
prior a crash with their states as commented.
The truncate trx should be trx2 as the trx3's status
suggests it's harmless for involving into
truncation.
However we have yet to distinguish it from unsafe (to truncate)
event group ("transaction") that also ends with COMMIT query not Xid event.
Hmm. I see. But thd isn't a universal dumpster for passing
arguments to the function, don't use it like that.

Please, pass this "read-only transaction with no xid" flag to the
Gtid_log_event constructor as an argument. Or recalculate it inside the
constructor, it's not such an expensive thing to do.
...
One rather drastic solution could be to not binlog DELETE altogether.
I did not look into this with much of interest earlier, but how about it?
Perhaps it'd more like a bug fixes than breaking legacy.
No, I suspect it's too drastic
...
...
...
/* Members related to temporary tables. */
 public:
   /* Opened table states. */
diff --git a/sql/slave.cc b/sql/slave.cc
index 28e08e32346..9b3f73e5341 100644
--- a/sql/slave.cc
+++ b/sql/slave.cc
@@ -6944,7 +6945,9 @@ static int queue_event(Master_info* mi,const char* buf, ulong event_len)
   }
   else
   if ((s_id == global_system_variables.server_id &&
-       !mi->rli.replicate_same_server_id) ||
+       (!mi->rli.replicate_same_server_id &&
+        !(semisync_recovery= (rpl_semi_sync_slave_enabled &&
+                              mi->using_gtid != Master_info::USE_GTID_NO)))) ||
How can "semisync recovery" code path reach queue_event()?
This is a failover now, after recovery.
The slave which is ex-master tries to adopt "own" (generated by it) events.
1. don't call it "semisync recovery" then
2. add a comment explaining this case (it's a rare and not obvious one)
3. why does it have to be specially recognized anyway?
...
...
...
event_that_should_be_ignored(buf) ||
       /*
         the following conjunction deals with IGNORE_SERVER_IDS, if set
diff --git a/sql/log_event.h b/sql/log_event.h
index 8a342cb5cd3..1588ab85104 100644
--- a/sql/log_event.h
+++ b/sql/log_event.h
@@ -482,6 +482,16 @@ class String;
 */
 #define LOG_EVENT_IGNORABLE_F 0x80
+/**
+   @def LOG_EVENT_ACCEPT_OWN_F
+
+   Flag sets by the gtid-mode connected semisync slave for
+   the same server_id ("own") events which the slave must not have
+   in its state. Typically such events were never committed by
+   their originator (this server) and discared at its crash recovery
sorry, I failed to understand that
This should've been explained above.
The slave which is ex-master truncated at recovery some of its ("own") transactions.
The flag serves to help recognize that at execution of the replication events.
1. yes, this is clearer, thanks. may be you should rewrite your comment
putting this explanation in it?

2. why does it have to be specially recognized anyway?

...
...
...
3. also it's somewhat an unfounded assumption that only r/w engines
will have the transaction prepared. But we can make it a fact, if we
won't prepare r/o engines at all (in ha_prepare).
[?]
Sorry, I don't get it.
Do you mean that an r/o engine (branch) might need to be counted by
`ha_count_rw()`? But then it would be contributing with binlogged
events of its branch... I can't imagine that.
I mean the following. Say, you have two engines participating in a
transaction, like in  INSERT innodb_table SELECT * FROM rocksdb_table;
One is r/w, the other is r/o. It's a 2pc transaction, the server does
prepare for everyone, then commit for everyone.

I don't think there is any guarantee that prepare for an r/o engine will
be a no-op. It is completely up to the engine what to do in this case,
it's possible that the engine will optimize it to a no-op, but it is not
a certainty.

What I think we should do is not to call prepare() for r/o engines at
all. Just skip straight to commit/rollback on the second phase.
...
...
...
diff --git a/sql/handler.cc b/sql/handler.cc
index 6792e80b8fe..247ab7267bf 100644
--- a/sql/handler.cc
+++ b/sql/handler.cc
@@ -1245,6 +1245,24 @@ int ha_prepare(THD *thd)
   DBUG_RETURN(error);
 }
+/*
+  Returns counted number of
+  read-write recoverable transaction participants.
+*/
+uint ha_count_rw(THD *thd, bool all)
the name doesn't match the implementation, please, rename
[?]
Please help here.
I could not find what specifically you mean.
It
   returns rw_ha_count
which I think as 'read-write recoverable transaction participants.'
I read the name is "a number of rw engines in the ha_info",
it does not mention "recoverable", and I from the name I woulnd't have
guessed. May be ha_count_rw_2pc ?
...
...
...
+{
+  unsigned rw_ha_count= 0;
+  THD_TRANS *trans=all ? &thd->transaction.all : &thd->transaction.stmt;
+
+  for (Ha_trx_info * ha_info= trans->ha_list; ha_info;
+       ha_info= ha_info->next())
+  {
+    if (ha_info->is_trx_read_write() && ha_info->ht()->recover)
+      ++rw_ha_count;
+  }
+  return rw_ha_count;
+}
+
 /**
   Check if we can skip the two-phase commit.
@@ -1960,8 +1982,122 @@ struct xarecover_st
   XID *list;
   HASH *commit_list;
   bool dry_run;
+  MEM_ROOT *mem_root;
+  bool error;
 };
+/**
+  Inserts a new hash member.
+
+  returns a successfully created and inserted @c xid_recovery_member
+           into hash @c hash_arg,
+           or NULL.
+*/
+static xid_recovery_member*
+xid_member_insert(HASH *hash_arg, my_xid xid_arg, MEM_ROOT *ptr_mem_root)
+{
+  xid_recovery_member *member= (xid_recovery_member*)
+    alloc_root(ptr_mem_root, sizeof(xid_recovery_member));
+  if (!member)
+    return NULL;
+
+  member->xid= xid_arg;
+  member->in_engine_prepare= 1;
+  member->decided_to_commit= false;
+
+  return my_hash_insert(hash_arg, (uchar*) member) ? NULL : member;
+}
+
+/*
+  Inserts a new or updates an existing hash member to increment
+  the member's prepare counter.
+
+  returns false  on success,
+           true   otherwise.
+*/
+static bool xid_member_replace(HASH *hash_arg, my_xid xid_arg,
+                               MEM_ROOT *ptr_mem_root)
+{
+  xid_recovery_member* member;
+  if ((member= (xid_recovery_member *)
+       my_hash_search(hash_arg, (uchar *)& xid_arg, sizeof(xid_arg))))
+    member->in_engine_prepare++;
+  else
+    member= xid_member_insert(hash_arg, xid_arg, ptr_mem_root);
+
+  return member == NULL;
+}
+
+/*
+  Hash iterate function to complete with commit or rollback as decided
+  (typically at binlog recovery processing) in member->in_engine_prepare.
+*/
+static my_bool xarecover_do_commit_or_rollback(void *member_arg,
+                                               void *hton_arg)
+{
+  xid_recovery_member *member= (xid_recovery_member*) member_arg;
+  handlerton *hton= (handlerton*) hton_arg;
+  xid_t x;
+  my_bool rc;
+
+  x.set(member->xid);
+  rc= member->decided_to_commit ?  hton->commit_by_xid(hton, &x) :
+    hton->rollback_by_xid(hton, &x);
+
+  DBUG_ASSERT(rc || member->in_engine_prepare > 0);
+
+  if (!rc)
+  {
I don't think you can trust rc on that.
if it's non-zero, it's an error all right.
but it's a bit of a stretch to presume that
nonexisting xid is always an error.
also commit_by_xid returns int, not my_bool.
[?]
I mean, if an engine returns success when you ask it to commit or
rollback an unknown xid - this is not a bug, right? Commit means "make
all changes by this xid persistent", if there are no changes, there is
nothing to do, so doing nothing and reporting success isn't exactly
wrong. Same for rollback.
...
I'd rather to stick with the patch and require from any engine to
return \cite{xa spec}
[XAER_NOTA] The specified XID is not known by the resource manager
What'd you say?
Okay. I'd say I was wrong, if the standard requires a resource manager
to return an error for an unknown XID, then you surely can rely on that
:)

Sorry for confusion.
...
...
...
+
+  return false;
+}
+
+static my_bool xarecover_do_count_in_prepare(void *member_arg,
+                                          void *ptr_count)
+{
+  xid_recovery_member *member= (xid_recovery_member*) member_arg;
+  if (member->in_engine_prepare)
+  {
+    *(uint*) ptr_count += member->in_engine_prepare;
This is a rather weird semantics.
it's kind of a number of transactions times number of engines.
if one transaction wasn't committed in two engines and another
transaction wasn't rolled back in one engine, the counter will be 3.
how are you going to explain this to users?
So `ptr_count` actually is for transaction branches, not transactions.
But the ultimate warning speaks erronously the latter
+      sql_print_warning("Could not complete %u number of transactions in "
                            "engines.", count_in_prepare);
So I'd convert the counter to count the transactions:
(*(uint*) ptr_count) ++
[?]
yes, that'd be easier to understand, thanks
...
...
...
diff --git a/sql/log.cc b/sql/log.cc
index 8073f09ab88..a9808ed8823 100644
--- a/sql/log.cc
+++ b/sql/log.cc
...
+    char buf[21];
+
+    longlong10_to_str(ptr_gtid->seq_no, buf, 10);
+    sql_print_information("Successfully truncated binlog file:%s "
+                          "to pos:%llu to remove transactions starting from "
+                          "GTID %u-%u-%s", file_name, pos,
+                          ptr_gtid->domain_id, ptr_gtid->server_id, buf);
+  }
+  if (!(error= init_io_cache(&cache, file, IO_SIZE, WRITE_CACHE,
+                            (my_off_t) pos, 0, MYF(MY_WME|MY_NABP))))
+  {
+    /*
+      Write Stop_log_event to ensure clean end point for the binary log being
+      truncated.
+    */
I'm not sure about it. The less you touch the corrupted binlog the better.
simply truncating it would be enough, wouldn't it?
Yes, it would.
But I somewhat preferred a tidy-up style.
[?]
If you feel strong to remove it will be done.
I feel it'd be safer not to write anything into the corrupted binlog.
...
...
...
+    Stop_log_event se;
+    se.checksum_alg= cs_alg;
+    if ((error= write_event(&se, NULL, &cache)))
+    {
+      sql_print_error("Failed to write stop event to "
+                      "binary log. Errno:%d",
+                      (cache.error == -1) ? my_errno : error);
+      goto end;
+    }
+    clear_inuse_flag_when_closing(cache.file);
+    if ((error= flush_io_cache(&cache)) ||
+        (error= mysql_file_sync(file, MYF(MY_WME|MY_SYNC_FILESIZE))))
+    {
+      sql_print_error("Faild to write stop event to "
+                      "binary log. Errno:%d",
+                      (cache.error == -1) ? my_errno : error);
+    }
+  }
+  else
+      sql_print_error("Failed to initialize binary log "
+                      "cache for writing stop event. Errno:%d",
+                      (cache.error == -1) ? my_errno : error);
1. I don't see why you need to sync after every operation.
Yep, there's one fsync before Stop event writing, redundant if Stop will
stay in the final patch.
As far as I understand, if Stop is removed, you shouldn't need any syncs
at all, just truncate and close.
...
...
...
+  int  next_binlog_or_round(int& round,
+                            const char *last_log_name,
+                            const char *binlog_checkpoint_name,
+                            LOG_INFO *linfo, MYSQL_BIN_LOG *log);
+  bool is_safe()
what do you mean "safe" ?
Safe to truncate condition is explained elsewhere to mean there are no
non-transactional group of events within the truncated tail.
Earlier mentioned ineffective COMMIT-query ended "transactions" are safe.
may be is_safe_to_truncate() then?

...
...
...
...
+        if (!truncate_validated)
+        {
+          DBUG_ASSERT(round <= 2);
+          /*
+            Either truncate was not set or was reset, else
+            it gets incremented, otherwise it may only set to an earlier
+            offset only at the turn out of the 1st round.
sorry, I cannot parse that :(
The comments are for the assert in a branch that acts when the truncate position
(gtid) is not yet finalized.
Let me annotate under each assert's line.
...
...
+          */
+          DBUG_ASSERT(truncate_gtid.seq_no == 0 ||
'truncate was not set or was reset' ^, that is truncate_gtid.
The 'was reset' is actually not present in the assert - that'd be
reflected with
  truncate_reset_done == true
...
...
+                      last_gtid_coord >= binlog_truncate_coord ||
'it gets incremented' ^
why does that mean "it gets incremented" ?
...
...
...
+                      (round == 2 && truncate_set_in_1st));
And if it's neither of the above there must've been truncate candidate
set in the 1st round.
...
...
...
...
switch (typ)
       {
+      case FORMAT_DESCRIPTION_EVENT:
+        if (round > 1)
+        {
+          if (fdle != fdle_arg)
+            delete fdle;
+          fdle= (Format_description_log_event*) ev;
why is it needed? all binlog files should have the same
Format_description_log_event anyway.
Actually contrary to what was stated (the patch removed that claim) in
the parent revision of this commit, the FD:s of the
recovery binlog files may have different checksums. I.e
set @@global.binlog_checksum = value
can be issued to rotate binlog to a new hot file having different checksum value
while the binlog *CheckPoint* file is behind. So this is possible to
have at recovery:
B_1(CP, checksum = NONE), B_2(hot, checksum = CRC32)
which also manifests a normal use case bug I have not reported any bug (though wanted to).
okay, checksum can differ. But how does it affect
Format_description_log_event ?

Regards,
Sergei
VP of MariaDB Server Engineering
and security@mariadb.org

    

Re: [Maria-developers] ee538938345: MDEV-21117: refine the server binlog-based recovery for semisync

Sergei Golubchik