Re: [Maria-developers] e92037989f7: MDEV-21117: refine the server binlog-based recovery for semisync
Hi, Andrei! On May 27, Andrei Elkin wrote:
I had a rather sobering followup, slipped communicating to you though, sorry.
It's actually turns out not to be easy. A sequence of execution events
E1.prepare(trx), E2.prepare(trx), E1.commit(trx), *crash*
does not make E1 such as Innodb committed persistently in file. At recovery in E1 trx may be found in prepared state, unless before the crash I'd do something like Binlog CheckPoint (BCP) notification request and wait of it.
I don't understand, what a binlog checkpoint has to do with it?
So E1=Innodb. In the above case in order to create partially committed trx Innodb.commit(trx) must be followed with flushing trx to disk, which is typically asynchronous.
I mentioned BCP 'cos it ensures a trx (actually the last in a binlog file) is flushed.
With --innodb-flush-log-at-trx-commit=1 it should be perfectly synnchronous, should it not?
In this just a selected engine would be requested, unlike in BCP event case that is generated when both (in this case) have persistently committed. (That is we even can't involve BCP event to wait for, and the involvement would require some simulation as well).
So I propose to keep the current method leaving the "honest" crash to RQG that Alice has been training the patch with.
Alice was loading the server with transactions and doing `kill -9` at some point. Many interesting cases that we discuss (for example, the one above) happen in a very small time window, so I don't know whether random killing could provide an adequate test coverage.
No doubt it's technically possible to wait for the flushing and crash, but still does not look easy. Consider if we defer this sort of mtr test to MDEV-18959?
But we've spent numerous hours and emails discussing and you've spend countless days and weeks implementing various tricky corner cases are are and won't be ever tested. Here's an idea of how to test it. We're only interested in crashes during commit. So * when starting a commit - atomically increment some in_commit counter * on leaving the commit - decrement it * on signal (some unused one) - crash the server if in_commit > 0 * after the crash - get the stack traces from the core to see where in the commit did it crash. I've attached a small patch to illustrate this idea. Hopefully, it'll cover everything, but if not, then after analyzing stack traces we'll see where it did not crash, and move atomic increment/decrements to be around this area only.
diff --git a/sql/handler.h b/sql/handler.h --- a/sql/handler.h +++ b/sql/handler.h @@ -873,6 +874,15 @@ typedef struct xid_t XID; /* The 'buf' has to have space for at least SQL_XIDSIZE bytes. */ uint get_sql_xid(XID *xid, char *buf);
+/* struct for semisync slave binlog truncate recovery */ +struct xid_recovery_member +{ + my_xid xid; + uint in_engine_prepare; // number of engines that have xid prepared + bool decided_to_commit; + std::pair<uint, my_off_t> binlog_coord; // semisync recovery binlog offset
wouldn't it be clearer to have a struct with named members?
I am not sure if there's a need for naming. But we do need comparison that std::pair provides for granted.
I could consider (a) new typedef(s) to represent the binlog id and offset in the file to make it more readable.
You say?
in fact, I'm somewhat surprised there's no such struct for binlog coords already.
I had started that back at HB implementation actually, but that struct did not attact much of attention.
I'd still suggest a structure, like
struct binlog_coords_t { uint binlog_num; my_off_t offset; };
Well, with this, like I said about std:pair, we'd also have to have bool operator < (const binlog_coords_t)
like bool operator <(const binlog_coords_t c) { return binlog_num < c.binlog_num || binlog_num == c.binlog_num && offset < c.offset; }
I seriously rate refuse to reuse as a deadly sin of programming. (Unlike normal sins it may be soften by realization though :-)).
Let's waive it away, right :-)?
Hoping on that `72a59e4b068` already defined the following: ... +/* + Compound binlog-id and byte offset of transaction's first event + in a sequence (e.g the recovery sequence) of binlog files. + Binlog_offset(0,0) is the minimum value to mean + the first byte of the first binlog file. +*/ +typedef std::pair<Binlog_file_id, my_off_t> Binlog_offset;
This is better than writing std::pair<uint, my_off_t> everywhere. Not as clear as a binlog_coords_t from above, but _perhaps_ it'll do, let's see how it'll be used. Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
participants (1)
-
Sergei Golubchik