Starting on review of XA patches in bb-10.6-MDEV-31949
Hi Andrei,

It's not an easy review of the XA patches in bb-10.6-MDEV-31949. One thing that would have helped is a clear design document that lists all the different concerns and points that need to be considered around recovery, backup, replication, possible conflicts, etc. This could then be referenced by the reviewer to see if a given concern was considered, and if so what the resolution is. But I'll try as best I can with what we have.

As I'm trying to understand the different issues, I found making test cases to check my understanding helped. I have pushed some to the branch knielsen_mdev31949_review. They seem to show some problems in the current patch:

mysql-test/suite/rpl/t/rpl_recovery_xa.test

 - A trx is left in "XA PREPARED" state after crash recovery, even though the binlogged event group is incomplete (it should have been rolled back).
 - A trx is committed after crash recovery, even though it is not binlogged (it should have been rolled back).
 - An incomplete XA event group is output by mysqlbinlog with a "ROLLBACK" at the end, which gives an error (should probably be XA ROLLBACK).

mysql-test/suite/rpl/t/rpl_recovery_xa2.test

 - An XA PREPAREd transaction is lost after recovery when it was binlogged in an earlier binlog file. (I think this might be the issue that's referenced as a ToDo with Xid_list_log_event in the patch?)

mysql-test/suite/mariabackup/slave_provision_xa.test

 - An XA PREPARE does not update the binlog position inside InnoDB. So a slave provisioned with mariabackup from this starting position can fail with a duplicate apply of the XA PREPARE.

Some other initial remarks follow. Since there's a desire to progress with this quickly, I thought it would be useful to send these early, so you can start looking/discussing before I complete building an understanding of the whole design and code picture.
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index 17e512bf34e..a4cdc72e18c 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -17144,8 +17144,16 @@ innobase_commit_by_xid(
 	}

 	if (trx_t* trx = trx_get_trx_by_xid(xid)) {
+		THD *thd = current_thd;
+		if (thd)
+			thd_binlog_pos(thd, &trx->mysql_log_file_name,
+				       &trx->mysql_log_offset);
I guess this is necessary to get the correct binlog position stored inside InnoDB to match the binlogged XA COMMIT event, right? But will it work if called during crash recovery? During crash recovery, we could actually update the position, but only if this trx is the last event group in the old binlog. Maybe that's better handled elsewhere (there are currently other cases where the binlog position in InnoDB can be wrong immediately after restart, until the first new transaction is committed).
+		if (trx->mysql_log_offset > 0)
+			trx->flush_log_later = true;
But I'm guessing this is the motivation: if called from within binlog group commit, you want to disable the fsync in InnoDB. But if called from recovery, or when the binlog is disabled, this condition will be false, and the fsync will be done. If so, that's a _really_ obfuscated way to do it :-/ It surely needs a very clear comment explaining what is going on. And it is very fragile. What if thd_binlog_pos() is changed at some point, to slightly change the conditions under which it returns 0 or not? That could silently break this code.

It would be much better if this could be done similar to how normal commits are done. We have the commit_ordered handlerton which does the fast part of commit-to-memory, and then the slow part which does the rest in the commit handlerton. So we could have commit_by_xid_ordered() and commit_by_xid(). The problem is that commit_by_xid_ordered() needs to release the xid, so it becomes harder to get hold of the transaction (which the commit handlerton just finds from the thd). Maybe commit_by_xid_ordered() would need to return some trx handle.

What part of the commit is skipped by this in InnoDB? Is it _only_ flushing the log up to the LSN of the commit record, or are there more things? How is another storage engine supposed to handle this? This also needs to be documented somewhere, eg. in comments on the handlerton calls.

This also means that --innodb-flush-log-at-trx-commit=3 is not supported for XA COMMIT when it's done from a different session than the XA PREPARE. For example, with --innodb-flush-log-at-trx-commit=3 but --sync-binlog=0, normal transactions are still durable in InnoDB, but external XA transactions will not be. This could be solved if commit_by_xid() was split into a fast and a slow part, similar to normal commit.

Probably you'll say that this works for now, and perhaps that's true. But at least it really _really_ needs some detailed comments explaining what is going on, and what should be kept in mind when modifying/improving this code in the future, otherwise it will be impossible for the next developer to understand.
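To make the suggestion concrete, I imagine something along these lines in the handlerton (just a sketch from my side; the names commit_by_xid_ordered/commit_by_xid_finish and the handle type are invented, they are not existing APIs):

  /* Hypothetical split of commit_by_xid into a fast and a slow part,
     mirroring commit_ordered()/commit(): */

  /* Fast part, called under LOCK_commit_ordered during binlog group
     commit: make the XA transaction committed in memory and release the
     xid, but do no log flush/fsync. Returns an engine-private handle
     for the slow part. */
  void *(*commit_by_xid_ordered)(handlerton *hton, XID *xid);

  /* Slow part, called outside LOCK_commit_ordered: make the commit
     durable according to the engine's settings (eg.
     innodb_flush_log_at_trx_commit) and release the handle. */
  int (*commit_by_xid_finish)(handlerton *hton, void *engine_trx);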
diff --git a/sql/log.cc b/sql/log.cc
index e54d4087d46..7e3dbeaef29 100644
--- a/sql/log.cc
+++ b/sql/log.cc
@@ -9365,18 +9381,84 @@ TC_LOG::run_commit_ordered(THD *thd, bool all)
     all ? thd->transaction->all.ha_list : thd->transaction->stmt.ha_list;

   mysql_mutex_assert_owner(&LOCK_commit_ordered);
-  for (; ha_info; ha_info= ha_info->next())
+
+  if (!ha_info)
Your patch puts a lot of XA/binlog logic into run_commit_ordered(). But that's not appropriate. The run_commit_ordered() function has the purpose solely to call the commit_ordered() handlerton in each engine, and it's also called from TC_LOG_MMAP (that's why it's TC_LOG::, not MYSQL_BIN_LOG::).

Instead, make a new function that handles the part of binlog commit that needs to be done under LOCK_commit_ordered. Be sure to comment that it runs under this highly contended mutex, and needs to do as little work as possible. This function can then call run_commit_ordered(). It can also be used to share a few bits of code that are now duplicated between the opt_optimize_thread_scheduling and !opt_optimize_thread_scheduling cases, like:

  ++num_commits;
  run_commit_ordered(...)
  entry->thd->wakeup_subsequent_commits()
  if (entry->queued_by_other) { ... }

(if it makes sense to do so in the actual code, only if it makes things simpler; that's easier to decide when working with the actual code).

There seem to be 3 new cases:

 - XA PREPARE
 - XA COMMIT from the same session as the XA PREPARE
 - XA COMMIT of a detached XA transaction

The logic to decide this was already done earlier, when the event group was queued for group commit, eg. to decide which end_event to use. For example, there is no need to call xid_cache_search() here, that can be done earlier (and it's also wrong to call my_error() here, as we're in the wrong thread context). So the logic about what to do should be placed higher up in the call stack, ideally where we already know, eg. binlog_commit_by_xid(). And errors such as ER_XAER_NOTA can be thrown already there, and the decision on what to do put concisely as a flag into the struct group_commit_entry, so the logic during binlog group commit becomes as simple as possible.

(It might even make sense to simply store a function pointer to the method that should be called to handle each case: normal, attached XA COMMIT, detached XA COMMIT, XA PREPARE. But again, it's easier when actually working with the code to decide if this is simpler than a couple of flags or a case on an enum value.)
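For example, something like this could be the shape of it (a sketch only; the enum and member names are made up), decided up front by the caller that queues the entry, so the code running under LOCK_commit_ordered only has to switch on it:

  /* Hypothetical addition to struct group_commit_entry: */
  enum xa_entry_kind {
    ENTRY_NORMAL,            // not external XA
    ENTRY_XA_PREPARE,        // XA PREPARE
    ENTRY_XA_COMMIT_SAME,    // XA COMMIT from the session that prepared
    ENTRY_XA_COMMIT_DETACHED // XA COMMIT of a detached XA transaction
  };
  xa_entry_kind xa_kind;     // set eg. in binlog_commit_by_xid() when queuing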
+  if (thd->rgi_slave && thd->rgi_slave->is_parallel_exec)
+  {
+    DBUG_ASSERT(thd->transaction->xid_state.is_dummy_XA());
+
+    auto xs= xid_cache_search(thd, xid);
+    if(!xs)
+    {
+      my_error(ER_XAER_NOTA, MYF(0));
+      return;
+    }
+    else
+    {
+      thd->transaction->xid_state.xid_cache_element= xs;
+    }
+    DBUG_ASSERT(thd->transaction->xid_state.is_recovered());
+  }
+
   plugin_foreach(NULL, commit ? xacommit_handlerton : xarollback_handlerton,
                  MYSQL_STORAGE_ENGINE_PLUGIN, &xaop);
   xid_cache_delete(thd);
Why this check for thd->rgi_slave->is_parallel_exec here, deep in binlog group commit? That seems very wrong, and again there is no comment whatsoever explaining what is going on, so again I have to try to guess... I'm guessing that it's something with calling xid_cache_delete(), which needs to find thd->transaction->xid_state.xid_cache_element? But this could be done much better in a higher-level part of the code that's specific to XA and (parallel) replication, not here deep down in binlog group commit code. And why should this be needed only for parallel replication, not for non-parallel replication or non-replication SQL? I don't understand.

A couple of remarks about the tests I pushed to knielsen_mdev31949_review:

I think the failures in rpl.rpl_recovery_xa I listed are from duplicate XIDs in XA PREPARE ; XA COMMIT ; XA PREPARE. There comes an ambiguity between whether the transaction prepared in the engine at recovery is from the first XA PREPARE (so it should be committed) or the second (so it should be rolled back). This happens when the second XA PREPARE is in the engine but not binlogged.

One way to fix this could actually be to use the binlog position stored inside InnoDB. If the XA COMMIT is before this binlog position, then we know the first transaction became durable in InnoDB, so we know we have to roll back the current prepared transaction in the engine. Otherwise we should commit it. I see the recovery code already uses binlog positions to decide the fate of pending XA transactions in the engines, but I did not yet fully understand what is going on there.

Another way could be to binlog an extra event before the second XA PREPARE with the duplicate XID. This event would mean "this xid is now durably committed in the engine", so we would know not to commit any prepared trx in the engine with the same xid. This event would only be needed in case of an XA PREPARE with a duplicate xid. This though requires the prior XA COMMIT to fsync inside InnoDB first, I think.

Another test failure, in rpl.slave_provision_xa, I think happens because the XA PREPARE does not update the binlog position inside InnoDB. Maybe the way to solve this problem is to have the slave be idempotent, and silently ignore any duplicate XA PREPARE if the xid is already prepared?

About the "ToDo" of using some Xid_list_log_event: I think this is a way during recovery to get the state of the XA PREPARE/COMMITs in earlier binlog files before the last binlog checkpoint, if I understood things correctly? Another way to do this is to delay the binlog checkpoints until all XA PREPAREs in a binlog file have been XA COMMIT'ted (or XA ROLLBACK'ed), so that they will all be scanned during recovery. I actually implemented this in my branch knielsen_mdev32020. The advantage of this is that we may anyway need it to provide the XA PREPARE events to a slave set up from a logical backup (like mysqldump), to avoid purging XA PREPARE events too early. And then it's a bit simpler to simply scan an extra file rather than both scan and read Xid_list_log_event. But I think Xid_list_log_event could work fine as well, just mentioning it for you to consider.

That's it for now, I will continue trying to understand what is going on and complete the review.

 - Kristian.
Hi Andrei,

Here is the next part of my review.

First, I want to try to really explain a general problem I see in several places in these patches; here is one example:
-static XID_cache_element *xid_cache_search(THD *thd, XID *xid)
+XID_cache_element *xid_cache_search(THD *thd, XID *xid)
 {
   DBUG_ASSERT(thd->xid_hash_pins);
   XID_cache_element *element=
     (XID_cache_element*) lf_hash_search(&xid_cache, thd->xid_hash_pins,
                                         xid->key(), xid->key_length());
   if (element)
+  {
+    /*
+      parallel slave XA-"complete" will acquire its actual xid after
+      it's ready for ordered commit. Meanwhile the transaction's handler
+      will use a dummy prepared xid element.
+
+      TODO: lsu = zero.
+    */
+    if ((!(thd->rgi_slave && thd->rgi_slave->is_parallel_exec) ||
+         thd->transaction->xid_state.is_dummy_XA()))
We _really_ need to try as hard as possible to avoid these random conditions all over the codebase, like here:

  thd->rgi_slave && thd->rgi_slave->is_parallel_exec

This xid_cache_search() function is a simple function, with a simple purpose. It looks up an XID, and "acquires" it so it doesn't get removed by another thread in parallel. That's it. But now you add all this complexity, of "oh, but except if it's a parallel replication thread that does the lookup, or maybe if the thread has a dummy XA state, then something I don't quite follow...". This kind of stuff makes the code impossible to understand and maintain.

The further away from the main replication code we get, the more wrong it is to have these kinds of conditions. This is a low-level function for external XA management, so checking for thd->rgi_slave->is_parallel_exec is _very_ wrong. And it looks completely unnecessary. The callers you add in your patch already check the same condition, for example:
+  if (thd->rgi_slave && thd->rgi_slave->is_parallel_exec)
+  {
+    DBUG_ASSERT(thd->transaction->xid_state.is_dummy_XA());
+
+    auto xs= xid_cache_search(thd, xid);
So you can just add a different function that does the "dummy XA" thing, and call that instead. And then the new function can explain in its comments what it does.

*Please* try to understand this issue, and try to avoid the problem in all places in the XA code (and elsewhere). It's really important, the replication code is notoriously difficult to maintain as it is. It's absolutely my main concern about this whole XA work. The XA feature can work well or poorly, it's not a widely used feature, I can live with it. But having the replication code maintainable and understandable is critical for users going forward.

Think of going into a bakery shop. Imagine if I said to the baker: "My name is Kristian, and I'm currently renovating my living room". The baker would then have to decide (or guess) that this means I will be wanting a loaf of bread and 5 cookies. That seems very complicated and error prone if all customers were like that, doesn't it? Isn't it much preferable if each customer just says what she or he wants? Let's make the functions in our code the same way: the caller tells what it wants, the function delivers.

Next, on a high level, I looked at MDEV-21777 and tried to understand the plans for fixing that the (GTID) slave is no longer crash safe with XA. I think the idea is the following. Given a user external XA xid X, from X we construct a new xid, say X_PREPARE or X_COMPLETE (for XA PREPARE respectively XA COMMIT/ROLLBACK). Then we replicate the XA PREPARE like this:

  XA START X
  ... apply events from master
  XA END X
  XA START X_PREPARE
  INSERT INTO mysql.gtid_slave_pos VALUES (...)
  XA END X_PREPARE
  XA PREPARE X_PREPARE   # This is not binlogged
  # Fsync the X_PREPARE trx inside InnoDB
  XA PREPARE X           # Binlogged as normal
  # Fsync the XA PREPARE X inside InnoDB
  XA COMMIT X_PREPARE

The XA COMMIT X is similar, except there are no events to apply initially from the master, and we use X_COMPLETE instead of X_PREPARE.

Then in case of slave crash, after normal crash recovery (InnoDB and, if enabled, binlog), a slave recovery takes place. For every xid X_PREPARE found prepared inside InnoDB, check if X is also found prepared. If so, commit X_PREPARE, else roll it back. And vice versa, for every xid X_COMPLETE found, roll it back if X is found prepared in InnoDB, else commit it. Is that correctly understood?

One reason I'm asking is that I'm wondering how this will affect the scheduling of XA in parallel replication? Like, in case of duplicate XIDs, the X_PREPARE or X_COMPLETE could conflict between two transactions, so maybe the wakeup_subsequent_commits can only be done after the XA COMMIT of the X_PREPARE and X_COMPLETE? Which I think would require changes to the current patch.

But then another thought follows from this. If we anyway need to construct new XIDs based on the external user XID, like X->X_PREPARE, then we either need to "steal" some of the space available for users (64+64 bytes gtrid/qual), or we could add extra space internally, not available/visible to users. And if we do this, could we append a unique identifier to each external XID (again invisible outside), just a running number? It seems this could simplify a number of cases where we have to handle re-use of the same XID. With a unique number appended, there would be no duplicate XIDs, simplifying the logic of binlog recovery, parallel slave scheduling, and possibly other cases. What do you think?

Next, my comments from reading more through the patch:
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index c26263401b8..1ddf520af59 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -2455,6 +2451,28 @@ rpl_parallel_entry::choose_thread(rpl_group_info *rgi, bool *did_enter_cond,
   return thr;
 }

+/**
+  The class implements rpl_parallel_entry::xap_window.
+  See @c handle_xa_prepera_duplicate_xid its main user.
+*/
+class Sliding_window
+  :public Dynamic_array<std::pair<std::size_t, uint64>>
I still don't understand why you want to use Dynamic_array. From what I can see, you never resize it; it's only allocated once with N elements. Isn't it just adding unnecessary complexity, when a simple fixed-size array would do exactly the same job? (See the sketch after the quoted constructor below for what I have in mind.)
+{
+public:
+  uint32 cxap_lhs, cxap_rhs;
+  Sliding_window(PSI_memory_key psi_key, uint prealloc) :
+    Dynamic_array<std::pair<std::size_t, uint64>>(psi_key, prealloc)
+  {
+    DBUG_ASSERT(prealloc == max_size());
+
+    elements(max_size());
Here, you allocate the number of elements. Why then does the code use max_size() everywhere, and not simply elements()?
+    cxap_lhs= cxap_rhs= 0;
+    for (size_t i= 0; i < max_size(); i++)
+      at(i)= {0, 0};
+  }
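To illustrate what I mean by a simple array, something like the following (just a sketch from my side, using the server's uint32/uint64 typedefs; the struct and member names are mine, not matched against what the patch actually needs):

  /* A plain fixed-size circular window of (xid hash, sub_id) pairs. */
  struct Xap_window
  {
    struct Entry { size_t xid_hash; uint64 sub_id; };
    Entry *buf;
    uint32 size;      // fixed at allocation, eg. the number of worker threads
    uint32 next;      // next slot to overwrite

    Xap_window(uint32 size_arg)
      : buf(new Entry[size_arg]()), size(size_arg), next(0) {}
    ~Xap_window() { delete [] buf; }

    /* Remember an XA PREPARE; the oldest entry slides out automatically. */
    void add(size_t xid_hash, uint64 sub_id)
    {
      buf[next]= {xid_hash, sub_id};
      next= (next + 1) % size;
    }

    /* Check whether a possibly conflicting xid is still in the window. */
    bool contains(size_t xid_hash) const
    {
      for (uint32 i= 0; i < size; i++)
        if (buf[i].xid_hash == xid_hash && buf[i].sub_id != 0)
          return true;
      return false;
    }
  };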
@@ -2798,6 +2818,107 @@ abandon_worker_thread(THD *thd, rpl_parallel_thread *cur_thread,
   mysql_cond_signal(&cur_thread->COND_rpl_thread);
 }

+/**
+  Check the concurrency status of an input @c xid with ones
+  that are already in progress.
+  Any new @c xid of XA-prepare (@c is_xap is true then) is appended to
+  a sliding window. It's released "naturally" as the window is designed
+  as circular buffer.
Don't you need to also append an XA COMMIT to the window? Otherwise, consider the following sequence:

  1: XA PREPARE 'a'
  2: <unrelated transaction>
  3: ...
  ...
  100: XA COMMIT 'a'
  101: XA PREPARE 'a'

Ie. assume that the XA PREPARE is far back and has slid out of the window (assume fewer than 100 worker threads). Don't you still need to block the XA PREPARE in (101) from starting in parallel with the XA COMMIT in (100)?
+static bool
+handle_xa_prepare_duplicate_xid(rpl_parallel_entry *e, XID *xid, bool is_xap)

<cut>

+  std::size_t xid_hash=
+    my_hash_sort(&my_charset_bin, xid->key(), xid->key_length());
Why use a hash key here, and not simply compare the actual XIDs? (I think it's fine, the occasional hash collision should not be a problem, just maybe a short comment explaining why a hash is used would be in order, as naively it would seem simpler just to compare the XIDs?).
@@ -3046,6 +3168,18 @@ rpl_parallel::do_event(rpl_group_info *serial_rgi, Log_event *ev,
       new_gco= true;
       force_switch_flag= 0;
       gco= e->current_gco;
+      /*
+        Take care of duplicate xids in multiple XA_prepare_log_event:s,
+        XA-"complete" should not race its XA-prepare parent either.
+        When the current transaction's xid has been seen and its transaction may
+        still be in process this event group gets flagged for synchronization, i.e
+        to wait for prior commits at its execution start.
+      */
+      if ((gtid_flags & Gtid_log_event::FL_PREPARED_XA) &&
+          (is_dup_xac= handle_xa_prepare_duplicate_xid(e, &gtid_ev->xid,
+                                                       gtid_flags &
+                                                       Gtid_log_event::FL_PREPARED_XA)))
+        gtid_flags &= ~Gtid_log_event::FL_ALLOW_PARALLEL;
This won't work in "aggressive" mode, will it? Because FL_ALLOW_PARALLEL is only used in optimistic mode, not aggressive. Besides, you already have another check for is_dup_xac below:
+  rgi->speculation=
+    is_dup_xac && (gco->flags & group_commit_orderer::FORCE_SWITCH) ?
+    rpl_group_info::SPECULATE_WAIT : speculation;
So you can just have a single condition here, right?

  if (is_dup_xac)
    rgi->speculation= rpl_group_info::SPECULATE_WAIT

But another thing: handle_xa_prepare_duplicate_xid() takes an argument is_xap, but you only ever pass it as true, so it's redundant. I'm guessing that originally you also had something that would make the XA COMMIT wait for the XA PREPARE, but this was then later changed to the use of dummy_element. If so, you should remove the parameter.

If you had used the xa_prepare_log_event() for the dependency PREPARE->COMMIT also, then the XA COMMIT could simply have registered a wait_for_prior_commit() on the precise event group of the XA PREPARE, which seems simple. I'm wondering why you thought it better to have a dummy xid for the XA COMMIT?

I'll need to study the logic in more detail to understand it, but it looks to me that what will happen is that the code will binlog the XA COMMIT (or ROLLBACK) without any checks, and then an error will occur in the storage engine if the xid is not in prepared state in the engine - is that correct? Well, except that xacommit_handlerton() silently ignores any errors from the engines. Does this mean that problems around this will silently be ignored by replication? Or is there some other place where such errors will be caught later? It's not clear to me.

For example, since the mysql.gtid_slave_pos update is not crash safe, we could have a crash lead to a double apply of the XA COMMIT; will this silently succeed, and binlog the XA COMMIT twice? What if the user (incorrectly) did a manual XA COMMIT, and the real XA COMMIT then gets replicated? This is an error case, but can we be sure we will not get an assertion or even a crash somewhere in the code?
@@ -3114,7 +3249,14 @@ rpl_parallel::do_event(rpl_group_info *serial_rgi, Log_event *ev,
       if (gtid_flags & Gtid_log_event::FL_DDL)
         force_switch_flag= group_commit_orderer::FORCE_SWITCH;
     }
-    rgi->speculation= speculation;
+    /*
+      When DDL flagged XA sees an ongoing one with "duplicate" xid
+      its execution will have to wait for the xid owner full completion.
+      SPECULATE_WAIT provides that, regardless of the parallel mode.
+    */
+    rgi->speculation=
+      is_dup_xac && (gco->flags & group_commit_orderer::FORCE_SWITCH) ?
+      rpl_group_info::SPECULATE_WAIT : speculation;
This I don't understand at all. How can an XA PREPARE/COMMIT/ROLLBACK be marked as "ddl" ?
@@ -3131,7 +3273,7 @@ rpl_parallel::do_event(rpl_group_info *serial_rgi, Log_event *ev,
       Remember the count that marks the end of the previous batch of event
       groups that run in parallel, and allocate a new gco.
     */
-    uint64 count= e->count_queued_event_groups;
+    auto count= e->count_queued_event_groups;
(This is a trivial thing in the big picture, but better to avoid unrelated changes in patches, and better to leave the explicit type to make the code more readable.)

Let me send this off for now, and then next I'll try to understand the part with the dummy xid and some logic around has_binlog_hton()...

 - Kristian.
Hi Andrei,

Here is the last part of my review of the changes in branch bb-10.6-MDEV-31949:
--- a/sql/log_event_server.cc
+++ b/sql/log_event_server.cc
@@ -3324,6 +3324,11 @@ Gtid_log_event::Gtid_log_event(THD *thd_arg, uint64 seq_no_arg,
     flags2|= thd->lex->sql_command == SQLCOM_XA_PREPARE ?
       FL_PREPARED_XA : FL_COMPLETED_XA;
     xid.set(xid_state.get_xid());
+
+    // multi-engine external completion is unaware of the prepared engine #.
+    extra_engines= thd->transaction->all.ha_list ?
+      ha_count_rw_2pc(thd_arg,
+                      thd_arg->in_multi_stmt_transaction_mode()) - 1 : 0;
I cannot understand the comment. I assume it's about an XA COMMIT of a detached xid (different session than the XA PREPARE, or a replication slave) - right? But what are the consequences of not knowing extra_engines in this case? Is extra_engines used for anything other than the optional truncate of the binlog after a master crash?
@@ -4175,7 +4173,8 @@ int XA_prepare_log_event::do_commit()
   thd->lex->xid= &xid;
   if (!one_phase)
   {
-    if ((res= thd->wait_for_prior_commit()))
+    if (thd->is_current_stmt_binlog_disabled() &&
+        (res= thd->wait_for_prior_commit()))
Why? Is it to allow XA_PREPARE_LOG_EVENT to group commit with other event groups? That makes sense, but then shouldn't the wait_for_prior_commit() be somewhere in handler.cc (or maybe xa.cc)? Look at Xid_log_event::do_commit(): it doesn't have any wait_for_prior_commit(), that is in ha_commit_one_phase(). (The wait-for-prior-commit facility is designed as a server-level feature, not a replication-level one; it could in principle be used outside of replication, even though it is not currently (IIRC).)
diff --git a/sql/sql_array.h b/sql/sql_array.h
index 8610e971016..94c6d86ac05 100644
--- a/sql/sql_array.h
+++ b/sql/sql_array.h
@@ -172,6 +172,8 @@ template <class Elem> class Dynamic_array

   size_t size() const { return array.elements; }

+  size_t max_size() const { return array.max_element; }
+
Not that it hurts that much, but I think you can avoid adding this.
--- a/sql/log.cc
+++ b/sql/log.cc
@@ -85,6 +85,7 @@ const char *log_bin_index= 0;
 const char *log_bin_basename= 0;

 MYSQL_BIN_LOG mysql_bin_log(&sync_binlog_period);
+XID_cache_element *xid_cache_search(THD *thd, XID *xid);
This duplicates the same declaration in sql/xa.h. So probably this can be removed now?
@@ -2186,16 +2202,13 @@ int binlog_commit(THD *thd, bool all, bool ro_1pc)
   }

   if (cache_mngr->trx_cache.empty() &&
-      (thd->transaction->xid_state.get_state_code() != XA_PREPARED ||
-       !(thd->ha_data[binlog_hton->slot].ha_info[1].is_started() &&
-         thd->ha_data[binlog_hton->slot].ha_info[1].is_trx_read_write())))
+      (!(is_preparing_xa(thd) || is_prepared_xa(thd)) ||
+       !is_binlog_read_write(thd)))
This binlog_commit() function started out as the "commit" method for the binlog handlerton. But that is not how it is used anymore; now it is called directly from several different places, including some places specific to external user XA. A large part of the function has become complex conditions like this, where it is basically impossible for the developer (or reviewer) to understand exactly why this set of conditions is the correct one. Again, the function is trying to guess what the caller wants, depending on the state of various thd properties etc.

Instead, the function should be split into multiple functions: one that is called to handle XA PREPARE, one for binlog_commit_by_xid(), etc. Then each version should be able to greatly reduce the number of conditions needed. And if any logic is shared between the different versions, this can be put in smaller functions that are called from each version.
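Something like this is what I have in mind (a sketch only, the names are mine, just to illustrate the split):

  /* Hypothetical dedicated entry points replacing the multi-purpose
     binlog_commit(): */
  int binlog_commit_trx(THD *thd, bool all, bool ro_1pc);  // normal commit
  int binlog_commit_xa_prepare(THD *thd);                  // XA PREPARE
  int binlog_commit_detached_xa(THD *thd, XID *xid, bool commit);
                                                   // XA COMMIT/ROLLBACK by xid
  /* ... with any genuinely shared logic factored into small helpers that
     each of them calls. */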
@@ -11050,10 +11124,30 @@ bool Recovery_context::decide_or_assess(xid_recovery_member *member, int round,
   }
   else if (member->in_engine_prepare < last_gtid_engines)
   {
-    DBUG_ASSERT(member->in_engine_prepare > 0);
+    DBUG_ASSERT(member->in_engine_prepare > 0 ||
+                /* XA event over xid that was completed in engines */
+                (member->xid == 0 &&
+                 member->xa_binlog_state >= xid_recovery_member::XA_PREPARE));
+    if (member->in_engine_prepare > 0 &&
+        member->xa_binlog_state == xid_recovery_member::XA_PREPARE)
+    {
The function decide_or_assess() is hard to understand. I can see that it's called in 3 cases: XID_LOG_EVENT, XA PREPARE, XA COMMIT/ROLLBACK. So again here, I think this should be split into 3 different functions, one for each case (or maybe two, one for internal XID and one for external XA, whatever makes more sense).

I don't understand the comment "XA event over xid that was completed in engines". But I remember seeing something in log_event_server.cc that sets extra_engines to 0 in the external user XA case. So is that the case here, that last_gtid_engines==0 marks this as user XA? That would be much clearer if the logic for internal and external XA was not mixed in the same function.
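For example (just indicative names and signatures from my side, not a worked-out proposal):

  /* Hypothetical split of decide_or_assess(), one entry point per caller,
     so each can drop the conditions that do not apply to it: */
  bool decide_internal_xid(xid_recovery_member *member, int round /*, ... */);
  bool decide_external_xa_prepare(xid_recovery_member *member, int round /*, ... */);
  bool decide_external_xa_complete(xid_recovery_member *member, int round /*, ... */);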
+      /*
+        the first of the two options below can occur when round 1 turns to 2
+        so event sequences across binlog files are like the following
+          g_current(B_0, xid), ... g_last(B^*, xid),
+        where `g`:s are xa event groups associated with the same member via
+        common `xid`. As elsewhere B_0 is the 2nd round's first and,
+        B^* the 1st round binlogs.
+      */
       if (reset_truncate_coord(member->binlog_coord < last_gtid_coord ?
                                member->binlog_coord.second : pos))
I'm trying to understand what is going on here. I think this is referring to the way binlog files are scanned? First we scan the most recent binlog file. At the end of it, we may find that the latest binlog checkpoint is in an earlier file, and then we continue to scan those older files. So for example we may scan, in this order:

  master-bin.000004, master-bin.000002, master-bin.000003

This means that we may see the XA PREPARE and XA COMMIT out-of-order as we go from the first scan to the remaining ones - "round 1 turns to 2". But I don't understand that this is "the first of the two options below":

  member->binlog_coord < last_gtid_coord ?
  member->binlog_coord.second : pos

If last_gtid_coord is in master-bin.000002, but member->binlog_coord is in master-bin.000004 ("round 1 turns to 2"), then I would think we would have

  member->binlog_coord > last_gtid_coord

so it would be the *second* of the options below? What am I missing here?

Also, I would suggest a simplification here. After we do the initial scan in round 1, if we find that we need to scan more files, then reset all the state and hashes, and then do a full scan of all necessary binlog files in order. In my example, that would be:

  master-bin.000004
  <reset state>
  master-bin.000002
  master-bin.000003
  master-bin.000004

This way, the logic becomes simpler, as we don't need logic to handle seeing events out-of-order. This will mean that occasionally we will need to scan one more binlog file, but I think that is a good trade-off. The recovery code is notoriously difficult to test and get correct, because it's an unusual occurrence to need recovery in the first place, and very hard to reproduce any problems. And robust and correct recovery is the whole point of having XA in the first place. So making the code as simple as possible is crucial. The extra cost is small:

 - For one, recovery is a rare event.
 - Re-scanning only costs CPU, the file is already read from storage into OS buffers.
 - Normally, the binlog checkpoint completes shortly after rotation, so the last binlog file that must be re-scanned will be short.

So simply redo the scan from the correct point when a "round 2" is needed, to avoid any logic for handling events seen out-of-order during recovery.
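In pseudo-code, the recovery scan could then be as simple as this (a sketch only; scan_binlog_file(), reset_recovery_state() and next_binlog_file() are made-up placeholders for whatever the real code does):

  // Round 1: scan only the most recent binlog file.
  scan_binlog_file(last_binlog_file);
  if (binlog_checkpoint_file != last_binlog_file)
  {
    // A second round is needed: throw away everything learned so far and
    // re-scan all files from the checkpoint file onwards, in order.
    reset_recovery_state();
    for (file= binlog_checkpoint_file; ; file= next_binlog_file(file))
    {
      scan_binlog_file(file);
      if (file == last_binlog_file)
        break;
    }
  }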
@@ -11182,7 +11311,9 @@ void Recovery_context::process_gtid(int round, Gtid_log_event *gev,
     last_gtid_no2pc= false;
     last_gtid_standalone=
       (gev->flags2 & Gtid_log_event::FL_STANDALONE) ? true : false;
-    if (do_truncate && last_gtid_standalone)
+    /* a standalone XA-COMPLETE is truncation-safe */
+    if (do_truncate && last_gtid_standalone
+        && !(gev->flags2 & Gtid_log_event::FL_COMPLETED_XA))
       update_binlog_unsafe_coord_if_needed(linfo);
I don't understand the comment here. What is "a standalone XA-COMPLETE" ? What does it mean to be "truncation-safe"? Please reword the comment to be comprehensible.
+  /*
+    the user XA events are decided similary to the normal transaction,
+    difference is just in the timing which is Gtid event read now.
+  */
I don't understand this comment. Please reformulate to make it understandable. What do normal transactions do, and what does user XA do? How does the timing differ, and why?
+ sql_print_warning("GTID %u-%u-%s XA transaction state % " + "in binlog file:%s pos:%llu is inconsistent " + "with the former same state; " + "consider filtered binary loggig.", + ctx.last_gtid.domain_id, ctx.last_gtid.server_id, + buf, is_xac ? "'COMPLETE'" : "'PREPARE'", + linfo->log_file_name, ctx.last_gtid_coord.second);
Typo: s/loggig/logging/

Also, the phrase "the former same state" is not understandable (at least to me); what does it mean? Please reformulate.
@@ -909,14 +910,16 @@ ha_commit_checkpoint_request(void *cookie, void (*pre_hook)(void *))
   there's no need to rollback here as all transactions must
   be rolled back already
 */
-void ha_close_connection(THD* thd)
+void ha_close_connection(THD* thd, bool skip_binlog)
I tried to understand what the purpose of this change is. ha_close_connection() is now called from two new places related to replication of XA PREPARE, and in one place with this mysterious skip_binlog=true condition.

My guess is that this is to detach the XA PREPARE from the current THD? So that it becomes visible in XA RECOVER to other threads, and (maybe more importantly) to make the THD available to run another transaction? One of the things InnoDB does in close_connection is trx_disconnect_prepared().

But I think the purpose of the close_connection handlerton call is different - it is to free things at the end of a client connection. So it can do more than detach the current external XA - which is presumably the reason for the need to hack it to not call the specific binlog_close_connection()?

So this shouldn't hijack close_connection to mean "detach_xa_trx". It looks like we need an explicit handlerton call for detaching an XA transaction, which can then be used instead of this hack in ha_close_connection().
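Ie. something like this (the name and signature are invented, just to show what I mean):

  /* Hypothetical new handlerton method: detach the currently attached,
     prepared external XA transaction from the THD, so the connection can
     run new transactions and the xid becomes visible to XA RECOVER from
     other connections. Unlike close_connection, it must not free the
     engine's per-connection state. */
  int (*detach_xa_trx)(handlerton *hton, THD *thd);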
-static my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin,
-                                   void *arg)
+my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin, void *arg)
 {
   handlerton *hton= plugin_hton(plugin);
-  if (hton->recover)
+  if (hton->recover && hton != binlog_hton)
   {
     hton->commit_by_xid(hton, ((struct xahton_st *)arg)->xid);

-int ha_commit_or_rollback_by_xid(XID *xid, bool commit)
+int ha_commit_or_rollback_by_xid(XID *xid, bool commit, THD *thd)
 {

+  if (!skip_binlog)
+  {
+    // when binlog is ON start from its transaction branch
     if (commit)
       binlog_commit_by_xid(binlog_hton, xid);
     else
       binlog_rollback_by_xid(binlog_hton, xid);
I think the purpose of this is to ensure that commit/rollback of XA calls into the binlog code first (to binlog before doing it in the engines). But as I think Serg also remarked already, this should not be needed. The binlog is the (internal) transaction coordinator when it is enabled, and we have a dedicated interface for that, class TC_LOG. This is used elsewhere in handler.cc to avoid special cases for whether the binlog is enabled or not. So it looks like there need to be extra virtual functions in TC_LOG for the XA commit/rollback, and these can then be called on the tc_log object and do what is required. Then some logic in handler.cc can be omitted/simplified.

----

I think with this, I got through all of the code changes in branch bb-10.6-MDEV-31949 as of commit 1bdc9bc8077c79e55cb4dbfc7cc47a193d17d54f from January 20, 2024. So this concludes the current review. I have not looked at the test cases in this round of review.

 - Kristian.