Re: [Maria-developers] e92037989f7: MDEV-21117: refine the server binlog-based recovery for semisync

31 May 2021

Hi, Andrei!

On May 27, Andrei Elkin wrote:
...
...
...
I had a rather sobering followup, slipped communicating to you
though, sorry.
It's actually turns out not to be easy. A sequence of execution events
E1.prepare(trx), E2.prepare(trx), E1.commit(trx), *crash*
does not make E1 such as Innodb committed persistently in file.  At
recovery in E1 trx may be found in prepared state, unless before the
crash I'd do something like Binlog CheckPoint (BCP) notification request
and wait of it.
I don't understand, what a binlog checkpoint has to do with it?
So E1=Innodb.
In the above case in order to create partially committed trx
  Innodb.commit(trx)
must be followed with flushing trx to disk, which is typically
asynchronous.
I mentioned BCP 'cos it ensures a trx (actually the last in a binlog
file) is flushed.
With --innodb-flush-log-at-trx-commit=1 it should be perfectly
synnchronous, should it not?
...
...
...
In this just a selected engine would be requested, unlike in BCP
event case that is generated when both (in this case) have
persistently committed. (That is we even can't involve BCP event to
wait for, and the involvement would require some simulation as
well).
So I propose to keep the current method leaving the "honest" crash
to RQG that Alice has been training the patch with.
Alice was loading the server with transactions and doing `kill -9`
at some point. Many interesting cases that we discuss (for example,
the one above) happen in a very small time window, so I don't know
whether random killing could provide an adequate test coverage.
No doubt it's technically possible to wait for the flushing and crash,
but still does not look easy.
Consider if we defer this sort of mtr test to MDEV-18959?
But we've spent numerous hours and emails discussing and you've spend
countless days and weeks implementing various tricky corner cases are
are and won't be ever tested.

Here's an idea of how to test it.
We're only interested in crashes during commit. So
* when starting a commit - atomically increment some in_commit counter
* on leaving the commit - decrement it
* on signal (some unused one) - crash the server if in_commit > 0
* after the crash - get the stack traces from the core to see where in the commit did it crash.

I've attached a small patch to illustrate this idea.

Hopefully, it'll cover everything, but if not, then after analyzing
stack traces we'll see where it did not crash, and move atomic
increment/decrements to be around this area only.
...
...
...
...
...

diff --git a/sql/handler.h b/sql/handler.h
--- a/sql/handler.h
+++ b/sql/handler.h
@@ -873,6 +874,15 @@ typedef struct xid_t XID;
 /* The 'buf' has to have space for at least SQL_XIDSIZE bytes. */
 uint get_sql_xid(XID *xid, char *buf);
+/* struct for semisync slave binlog truncate recovery */
+struct xid_recovery_member
+{
+  my_xid xid;
+  uint in_engine_prepare;  // number of engines that have xid prepared
+  bool decided_to_commit;
+  std::pair<uint, my_off_t> binlog_coord; // semisync recovery binlog offset
wouldn't it be clearer to have a struct with named members?
I am not sure if there's a need for naming. But we do need
comparison that std::pair provides for granted.
I could consider (a) new typedef(s) to represent the binlog id and
offset in the file to make it more readable.
You say?
...
in fact, I'm somewhat surprised there's no such struct for binlog coords
already.
I had started that back at HB implementation actually, but that struct
did not attact much of attention.
...
I'd still suggest a structure, like
struct binlog_coords_t {
  uint binlog_num;
  my_off_t offset;
};
Well, with this, like I said about std:pair, we'd also have to have
  bool operator < (const binlog_coords_t)
like

  bool operator <(const binlog_coords_t c)
  { return binlog_num < c.binlog_num || binlog_num == c.binlog_num && offset < c.offset; }
...
I seriously rate refuse to reuse as a deadly sin of programming.
(Unlike normal sins it may be soften by realization though :-)).
Let's waive it away, right :-)?
Hoping on that `72a59e4b068` already defined the following:
...
+/*
+  Compound binlog-id and byte offset of transaction's first event
+  in a sequence (e.g the recovery sequence) of binlog files.
+  Binlog_offset(0,0) is the minimum value to mean
+  the first byte of the first binlog file.
+*/
+typedef std::pair<Binlog_file_id, my_off_t> Binlog_offset;
This is better than writing std::pair<uint, my_off_t> everywhere.
Not as clear as a binlog_coords_t from above, but _perhaps_
it'll do, let's see how it'll be used.

Regards,
Sergei
VP of MariaDB Server Engineering
and security@mariadb.org

    

Re: [Maria-developers] e92037989f7: MDEV-21117: refine the server binlog-based recovery for semisync

Sergei Golubchik