Hi Andrei,

Here is the next part of my review. First, I want to try to really explain a general problem I see in several places in these patches. Here is one example:
-static XID_cache_element *xid_cache_search(THD *thd, XID *xid)
+XID_cache_element *xid_cache_search(THD *thd, XID *xid)
 {
   DBUG_ASSERT(thd->xid_hash_pins);
   XID_cache_element *element=
     (XID_cache_element*) lf_hash_search(&xid_cache, thd->xid_hash_pins,
                                         xid->key(), xid->key_length());
   if (element)
+  {
+    /*
+      parallel slave XA-"complete" will acquire its actual xid after
+      it's ready for ordered commit. Meanwhile the transaction's handler
+      will use a dummy prepared xid element.
+
+      TODO: lsu = zero.
+    */
+    if ((!(thd->rgi_slave && thd->rgi_slave->is_parallel_exec) ||
+         thd->transaction->xid_state.is_dummy_XA()))
We _really_ need to try as hard as possible to avoid these random conditions all over the codebase, like here:

    thd->rgi_slave && thd->rgi_slave->is_parallel_exec

This xid_cache_search() function is a simple function, with a simple purpose. It looks up an XID, and "acquires" it so it doesn't get removed by another thread in parallel. That's it.

But now you add all this complexity, of "oh, but except if it's a parallel replication thread that does the lookup, or maybe if the thread has a dummy XA state, then something I don't quite follow...". This kind of stuff makes the code impossible to understand and maintain.

The further away from the main replication code we get, the more wrong it is to have these kinds of conditions. This is a low-level function for external XA management, so checking for thd->rgi_slave->is_parallel_exec is _very_ wrong. And it looks completely unnecessary. The callers you add in your patch already check the same condition, for example:
+  if (thd->rgi_slave && thd->rgi_slave->is_parallel_exec)
+  {
+    DBUG_ASSERT(thd->transaction->xid_state.is_dummy_XA());
+
+    auto xs= xid_cache_search(thd, xid);
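So the caller already knows it is a parallel slave applier. To make this concrete, a rough sketch of how the lookup could be split out; xid_cache_search_dummy_XA() is just a name I'm making up, and error/release handling is omitted:

    /*
      XID lookup for a parallel slave applier of XA-"complete".
      The applier works on a dummy prepared xid element until it is
      ready for ordered commit; only then does it acquire the real
      element. (A sketch only.)
    */
    static XID_cache_element *
    xid_cache_search_dummy_XA(THD *thd, XID *xid)
    {
      DBUG_ASSERT(thd->rgi_slave && thd->rgi_slave->is_parallel_exec);
      if (!thd->transaction->xid_state.is_dummy_XA())
        return NULL;                      /* not ready; keep using the dummy */
      return xid_cache_search(thd, xid);  /* the simple, unchanged lookup */
    }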
So you can just add a different function that does the "dummy XA" thing, something like the sketch above, and call that instead. And then the comment on the new function can explain what it does.

*Please* try to understand this issue, and try to avoid the problem in all places in the XA code (and elsewhere). It's really important; the replication code is notoriously difficult to maintain as it is. It's absolutely my main concern about this whole XA work. The XA feature can work well or poorly, it's not a widely used feature, I can live with it. But having the replication code maintainable and understandable is critical for users going forward.

Think of going into a bakery shop. Imagine if I said to the baker: "My name is Kristian, and I'm currently renovating my living room". The baker would then have to decide (or guess) that this means I will be wanting a loaf of bread and 5 cookies. That seems very complicated and error prone if all customers were like that, doesn't it? Isn't it much preferable if each customer just says what she or he wants? Let's make the functions in our code the same way: the caller says what it wants, the function delivers.

Next, on a high level, I looked at MDEV-21777 and tried to understand the plans for fixing that the (GTID) slave is no longer crash safe with XA. I think the idea is the following. Given a user external XA xid X, from X we construct a new xid, say X_PREPARE or X_COMPLETE (for XA PREPARE and XA COMMIT/ROLLBACK, respectively). Then we replicate the XA PREPARE like this:

    XA START X
    ... apply events from master
    XA END X
    XA START X_PREPARE
    INSERT INTO mysql.gtid_slave_pos VALUES (...)
    XA END X_PREPARE
    XA PREPARE X_PREPARE    # This is not binlogged
    # Fsync the X_PREPARE trx inside InnoDB
    XA PREPARE X            # Binlogged as normal
    # Fsync the XA PREPARE X inside InnoDB
    XA COMMIT X_PREPARE

The XA COMMIT X is similar, except there are no events to apply initially from the master, and we use X_COMPLETE instead of X_PREPARE.

Then in case of slave crash, after normal crash recovery (InnoDB and, if enabled, binlog), a slave recovery takes place. For every xid X_PREPARE found prepared inside InnoDB, check if X is also found prepared. If so, commit X_PREPARE, else roll it back. And vice versa: for every xid X_COMPLETE found, roll it back if X is found prepared in InnoDB, else commit it. Is that correctly understood?

One reason I'm asking is that I'm wondering how this will affect the scheduling of XA in parallel replication. Like, in case of duplicate XID, the X_PREPARE or X_COMPLETE could conflict between two transactions, so maybe the wakeup_subsequent_commits can only be done after the XA COMMIT of the X_PREPARE and X_COMPLETE? Which I think would require changes to the current patch.

But then another thought follows from this. If we anyway need to construct new XIDs based on the external user XID, like X -> X_PREPARE, then we either need to "steal" some of the space available for users (64+64 bytes gtrid/bqual), or we could add extra space internally, not available/visible to users. And if we do this, could we append a unique identifier to each external XID (again invisible outside), just a running number? It seems this could simplify a number of cases where we have to handle re-use of the same XID. With a unique number appended, there would be no duplicate XIDs, simplifying the logic of binlog recovery, parallel slave scheduling, and possibly other cases. What do you think?
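To check my understanding of the recovery pass described above, here it is in rough pseudo-code. This is just a sketch; is_x_prepare(), is_x_complete(), base_xid_of() and found_prepared() are helper names I'm making up:

    /*
      Sketch of the slave XA recovery pass, run for each xid Y found
      in prepared state inside InnoDB after normal crash recovery.
    */
    static void xa_slave_recover_one(const XID &Y)
    {
      if (is_x_prepare(Y))
      {
        /* X also prepared: the gtid_slave_pos update can be kept. */
        if (found_prepared(base_xid_of(Y)))
          xa_commit(Y);
        else
          xa_rollback(Y);   /* crashed before XA PREPARE X; redo the group */
      }
      else if (is_x_complete(Y))
      {
        /* X still prepared: the XA COMMIT/ROLLBACK X did not complete. */
        if (found_prepared(base_xid_of(Y)))
          xa_rollback(Y);   /* roll back, so the XA "complete" is redone */
        else
          xa_commit(Y);
      }
    }

Next, my comments from reading more through the patch: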
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index c26263401b8..1ddf520af59 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -2455,6 +2451,28 @@ rpl_parallel_entry::choose_thread(rpl_group_info *rgi, bool *did_enter_cond,
   return thr;
 }
+/**
+  The class implements rpl_parallel_entry::xap_window.
+  See @c handle_xa_prepare_duplicate_xid, its main user.
+*/
+class Sliding_window
+  :public Dynamic_array<std::pair<std::size_t, uint64>>
I still don't understand why you want to use Dynamic_array. From what I can see, you never resize it; it's only allocated once with N elements. Isn't it just adding unnecessary complexity, when a simple array would do exactly the same job (see the sketch below)?
+{
+public:
+  uint32 cxap_lhs, cxap_rhs;
+  Sliding_window(PSI_memory_key psi_key, uint prealloc) :
+    Dynamic_array<std::pair<std::size_t, uint64>>(psi_key, prealloc)
+  {
+    DBUG_ASSERT(prealloc == max_size());
+
+    elements(max_size());
Here, you allocate the number of elements. Why then does the code use max_size() everywhere, and not simply elements()?
+    cxap_lhs= cxap_rhs= 0;
+    for (size_t i= 0; i < max_size(); i++)
+      at(i)= {0, 0};
+  }
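To illustrate what I mean by a simple array, a rough sketch (Xap_window is a made-up name, error handling omitted):

    /* The window as a plain once-allocated circular buffer. */
    struct Xap_window
    {
      std::pair<std::size_t, uint64> *buf;
      uint32 size;
      uint32 cxap_lhs, cxap_rhs;

      Xap_window(PSI_memory_key psi_key, uint32 n)
        : size(n), cxap_lhs(0), cxap_rhs(0)
      {
        buf= (std::pair<std::size_t, uint64> *)
          my_malloc(psi_key, n * sizeof(*buf), MYF(MY_WME));
        for (uint32 i= 0; i < n; i++)
          buf[i]= {0, 0};
      }
      ~Xap_window() { my_free(buf); }
    };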
@@ -2798,6 +2818,107 @@ abandon_worker_thread(THD *thd, rpl_parallel_thread *cur_thread,
   mysql_cond_signal(&cur_thread->COND_rpl_thread);
 }
+/**
+  Check the concurrency status of an input @c xid with ones
+  that are already in progress.
+  Any new @c xid of XA-prepare (@c is_xap is true then) is appended to
+  a sliding window. It's released "naturally" as the window is designed
+  as circular buffer.
Don't you need to also append an XA COMMIT to the window? Otherwise, consider the following sequence:

      1: XA PREPARE 'a'
      2: <unrelated transaction>
      3: ...
    ...
    100: XA COMMIT 'a'
    101: XA PREPARE 'a'

I.e. assume that the XA PREPARE is far back and has slid out of the window (assume less than 100 worker threads); don't you still need to block the XA PREPARE in (101) from starting in parallel with the XA COMMIT in (100)?
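(Roughly what I mean, as a sketch; I'm assuming the window stores (hash, sub_id) pairs as in the patch, and is_xa_complete, sub_id and window_append() are names I'm making up here:)

    /* Record XA COMMIT/ROLLBACK xids in the window too, so that a later
       XA PREPARE of the same xid is also made to wait for a recent
       XA-"complete", not only for a recent XA PREPARE. */
    if (is_xap || is_xa_complete)
      window_append(e->xap_window, xid_hash, sub_id);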
+static bool
+handle_xa_prepare_duplicate_xid(rpl_parallel_entry *e, XID *xid, bool is_xap)
<cut>
+  std::size_t xid_hash=
+    my_hash_sort(&my_charset_bin, xid->key(), xid->key_length());
Why use a hash key here, and not simply compare the actual XIDs? (I think it's fine; the occasional hash collision should not be a problem. But a short comment explaining why a hash is used would be in order, as naively it would seem simpler just to compare the XIDs.)
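Something along these lines, perhaps:

    /*
      Only a hash of the XID is stored in the window, not the full XID,
      keeping the elements small and fixed-size. A hash collision merely
      causes an unnecessary wait (two distinct XIDs treated as
      duplicates), never an incorrect result, so collisions are harmless.
    */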
@@ -3046,6 +3168,18 @@ rpl_parallel::do_event(rpl_group_info *serial_rgi, Log_event *ev,
       new_gco= true;
       force_switch_flag= 0;
       gco= e->current_gco;
+      /*
+        Take care of duplicate xids in multiple XA_prepare_log_event:s,
+        XA-"complete" should not race its XA-prepare parent either.
+        When the current transaction's xid has been seen and its
+        transaction may still be in process this event group gets flagged
+        for synchronization, i.e to wait for prior commits at its
+        execution start.
+      */
+      if ((gtid_flags & Gtid_log_event::FL_PREPARED_XA) &&
+          (is_dup_xac= handle_xa_prepare_duplicate_xid(e, &gtid_ev->xid,
+                                                       gtid_flags &
+                                                       Gtid_log_event::FL_PREPARED_XA)))
+        gtid_flags&= ~Gtid_log_event::FL_ALLOW_PARALLEL;
This won't work in "aggressive" mode, will it? Because FL_ALLOW_PARALLEL is only used in optimistic mode, not aggressive. Besides, you already have another check for is_dup_xac below:
+  rgi->speculation=
+    is_dup_xac && (gco->flags & group_commit_orderer::FORCE_SWITCH) ?
+    rpl_group_info::SPECULATE_WAIT : speculation;
So you can just have a single condition here, right?

    if (is_dup_xac)
      rgi->speculation= rpl_group_info::SPECULATE_WAIT;

But another thing: handle_xa_prepare_duplicate_xid() takes an argument is_xap, but you only ever pass it as true, so it's redundant. I'm guessing that originally you also had something that would make the XA COMMIT wait for the XA PREPARE, but this was then later changed to the use of dummy_element. If so, you should remove the parameter.

If you had used the xa_prepare_log_event() for the dependency PREPARE->COMMIT also, then the XA COMMIT could simply have registered a wait_for_prior_commit() on the precise event group of the XA PREPARE; that seems simple. I'm wondering why you thought it better to have a dummy xid for the XA COMMIT?

I'll need to study the logic in more detail to understand it, but it looks to me that what will happen is that the code will binlog the XA COMMIT (or ROLLBACK) without any checks, and then an error will occur in the storage engine if the xid is not in prepared state in the engine - is that correct? Well, except that xacommit_handlerton() silently ignores any errors from the engines. Does this mean that problems around this will be silently ignored by replication? Or is there some other place where such errors will be caught later? It's not clear to me.

For example, since the mysql.gtid_slave_pos update is not crash safe, we could have a crash lead to double apply of the XA COMMIT. Will this silently succeed, and binlog the XA COMMIT twice? What if the user (incorrectly) did a manual XA COMMIT, and the real XA COMMIT then gets replicated? This is an error case, but can we be sure we will not get an assertion or even a crash somewhere in the code?
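(Coming back to the wait_for_prior_commit() idea above, what I mean is roughly the following sketch. I'm assuming the scheduling code could keep a pointer, here called prepare_rgi, to the rpl_group_info of the matching XA PREPARE when it detects the duplicate:)

    /* Make the XA COMMIT's event group wait for the full commit of
       the event group that did the XA PREPARE (sketch only). */
    rgi->commit_orderer.register_wait_for_prior_commit(
        &prepare_rgi->commit_orderer);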
@@ -3114,7 +3249,14 @@ rpl_parallel::do_event(rpl_group_info *serial_rgi, Log_event *ev,
       if (gtid_flags & Gtid_log_event::FL_DDL)
         force_switch_flag= group_commit_orderer::FORCE_SWITCH;
     }
-    rgi->speculation= speculation;
+    /*
+      When DDL flagged XA sees an ongoing one with "duplicate" xid
+      its execution will have to wait for the xid owner full completion.
+      SPECULATE_WAIT provides that, regardless of the parallel mode.
+    */
+    rgi->speculation=
+      is_dup_xac && (gco->flags & group_commit_orderer::FORCE_SWITCH) ?
+      rpl_group_info::SPECULATE_WAIT : speculation;
This I don't understand at all. How can an XA PREPARE/COMMIT/ROLLBACK be marked as "ddl"?
@@ -3131,7 +3273,7 @@ rpl_parallel::do_event(rpl_group_info *serial_rgi, Log_event *ev,
       Remember the count that marks the end of the previous batch of event
       groups that run in parallel, and allocate a new gco.
     */
-    uint64 count= e->count_queued_event_groups;
+    auto count= e->count_queued_event_groups;
(This is a trivial thing in the big picture, but it's better to avoid unrelated changes in patches, and to keep the explicit type to make the code more readable.)

Let me send this off for now, and then next I'll try to understand the part with the dummy xid and some logic around has_binlog_hton()...

 - Kristian.