Hi Andrei, Here is the last part of my review of the changes in branch bb-10.6-MDEV-31949:
--- a/sql/log_event_server.cc +++ b/sql/log_event_server.cc @@ -3324,6 +3324,11 @@ Gtid_log_event::Gtid_log_event(THD *thd_arg, uint64 seq_no_arg, flags2|= thd->lex->sql_command == SQLCOM_XA_PREPARE ? FL_PREPARED_XA : FL_COMPLETED_XA; xid.set(xid_state.get_xid()); + + // multi-engine external completion is unaware of the prepared engine #. + extra_engines= thd->transaction->all.ha_list ? + ha_count_rw_2pc(thd_arg, + thd_arg->in_multi_stmt_transaction_mode()) - 1 : 0;
I cannot understand the comment. I assume it's about that an XA COMMIT of a detached xid (different session that the XA PREPARE, or replication slave) - right? But what is the consequences of not knowing extra_engines in this case? Is the extra_engines used for anything than the optional truncate of the binlog after master crash?
@@ -4175,7 +4173,8 @@ int XA_prepare_log_event::do_commit() thd->lex->xid= &xid; if (!one_phase) { - if ((res= thd->wait_for_prior_commit())) + if (thd->is_current_stmt_binlog_disabled() && + (res= thd->wait_for_prior_commit()))
Why? Is it to allow XA_PREPARE_LOG_EVENT to group commit with other event groups? That makes sense, but then shouldn't the wait_for_prior_commit() be somewhere in handler.cc (or maybe xa.cc)? Look at Xid_log_event::do_commit(), it doesn't have any wait_for_prior_commit(), that is in ha_commit_one_phase(). (The wait-for-prior-commit facility is designed as a server-level feature, not a replication-level, it could in principle be used outside of replication, even though it not currently (IIRC).)
diff --git a/sql/sql_array.h b/sql/sql_array.h index 8610e971016..94c6d86ac05 100644 --- a/sql/sql_array.h +++ b/sql/sql_array.h @@ -172,6 +172,8 @@ template <class Elem> class Dynamic_array
size_t size() const { return array.elements; }
+ size_t max_size() const { return array.max_element; } +
Not that it hurts that much, but I think you can avoid adding this.
--- a/sql/log.cc +++ b/sql/log.cc @@ -85,6 +85,7 @@ const char *log_bin_index= 0; const char *log_bin_basename= 0;
MYSQL_BIN_LOG mysql_bin_log(&sync_binlog_period); +XID_cache_element *xid_cache_search(THD *thd, XID *xid);
This duplicates the same declaration in sql/xa.h So probably this can be removed now?
@@ -2186,16 +2202,13 @@ int binlog_commit(THD *thd, bool all, bool ro_1pc) }
if (cache_mngr->trx_cache.empty() && - (thd->transaction->xid_state.get_state_code() != XA_PREPARED || - !(thd->ha_data[binlog_hton->slot].ha_info[1].is_started() && - thd->ha_data[binlog_hton->slot].ha_info[1].is_trx_read_write()))) + (!(is_preparing_xa(thd) || is_prepared_xa(thd)) || + !is_binlog_read_write(thd)))
This binlog_commit function started out as the "commit" method for the binlog handlerton. But that is not how it is used anymore, now it is called directly from several different places, including some places specific to external user XA. A large part of the function has become complex conditions like this that are basically impossible for the developer (or reviewer) to understand exactly why this set of conditions is the correct one. Again, the function is trying to guess what the caller wants, depending on the state of various thd properties etc. Instead, the function should be split into multiple functions. One that is called to handle XA PREPARE; one for binlog_commit_by_xid(), etc. Then each version should be able to greatly reduce the number of conditions needed. And if any logic is shared between the different versions, this can then be put in smaller functions that are called from each version.
@@ -11050,10 +11124,30 @@ bool Recovery_context::decide_or_assess(xid_recovery_member *member, int round, } else if (member->in_engine_prepare < last_gtid_engines) { - DBUG_ASSERT(member->in_engine_prepare > 0); + DBUG_ASSERT(member->in_engine_prepare > 0 || + /* XA event over xid that was completed in engines */ + (member->xid == 0 && + member->xa_binlog_state >= xid_recovery_member::XA_PREPARE)); + if (member->in_engine_prepare > 0 && + member->xa_binlog_state == xid_recovery_member::XA_PREPARE) + {
The function decide_or_assess() is hard to understand. I can see that it's called in 3 cases: XID_LOG_EVENT, XA PREPARE, XA COMMIT/ROLLBACK. So again here, I think this should be split into 3 different functions, one for each case (or maybe two, one for internal XID and one for external XA, whatever makes the more sense). I don't understand the comment "XA event over xid that was completed in engines". But I remember seeing something in log_event_server.cc that sets extra_engines to 0 in the external user XA case. So is that the case here, that last_gtid_engines==0 marks this is user XA? That would be much clearer if the logic for internal and external XA was not mixed in the same function.
+ /* + the first of the two options below can occur when round 1 turns to 2 + so event sequences across binlog files are like the following + g_current(B_0, xid), ... g_last(B^*, xid), + where `g`:s are xa event groups associated with the same member via + common `xid`. As elsewhere B_0 is the 2nd round's first and, + B^* the 1st round binlogs. + */ if (reset_truncate_coord(member->binlog_coord < last_gtid_coord ? member->binlog_coord.second : pos))
I'm trying to understand what is going on here. I think this is refering to the way binlog files are scanned? First we scan the most recent binlog file. At the end of it, we may find that the latest binlog checkpoint is in an earlier file, then we continue to scan those older files. So for example we may scan in order: master-bin.000004, master-bin.000002, master-bin.000003 This means that we may see the XA PREPARE and XA COMMIT out-of-order as we go from the first scan to the remaining - "round 1 turns to 2". But I don't understand that this is "the first of the two options below": member->binlog_coord < last_gtid_coord ? member->binlog_coord.second : pos If last_gtid_coord is in master-bin.000002, but member->binlog_coord is in master-bin.000004 ("round 1 turns to 2"), then I would think we would have member->binlog_coord > last_gtid_coord So it would be the *second* of the options below? What am I missing here? Also, I would suggest a simplification here. After we do the initial scan in round 1, if we find that we need to scan more files, then reset all the state and hashes. And then do a full scan of all necessary binlog files in-order. In my example, that would be: master-bin.000004 <reset state> master-bin.000002 master-bin.000003 master-bin.000004 This way, the logic becomes simpler, as we don't need logic to handle seeing events out-of-order. This will mean that occasionally we will need to scan one more binlog file, but I think that is a good trade-off. The recovery code is notoriously difficult to test and get correct, because it's an unusual occurence to need recovery in the first place, and very hard to reproduce any problems. And robust and correct recovery is the whole point of having XA in the first palce. So making the code as simple as possible is crucial. The extra cost is small: - For one, recovery is a rare event - Re-scanning only costs CPU, the file is already read from storage into OS buffers. - Normally, the binlog checkpoint completes shortly after rotation, so the last binlog file that must be rescanned will be short. So simply redo the scan from the correct point when a "round 2" is needed, to avoid any logic to handle seeing events out-of-order during recovery.
@@ -11182,7 +11311,9 @@ void Recovery_context::process_gtid(int round, Gtid_log_event *gev, last_gtid_no2pc= false; last_gtid_standalone= (gev->flags2 & Gtid_log_event::FL_STANDALONE) ? true : false; - if (do_truncate && last_gtid_standalone) + /* a standalone XA-COMPLETE is truncation-safe */ + if (do_truncate && last_gtid_standalone + && !(gev->flags2 & Gtid_log_event::FL_COMPLETED_XA)) update_binlog_unsafe_coord_if_needed(linfo);
I don't understand the comment here. What is "a standalone XA-COMPLETE" ? What does it mean to be "truncation-safe"? Please reword the comment to be comprehensible.
+ /* + the user XA events are decided similary to the normal transaction, + difference is just in the timing which is Gtid event read now. + */
I don't understand this comment. Please reformulate to make it understandable. What does normal transactions do, and what does user XA do? How does the timing differ, and why?
+ sql_print_warning("GTID %u-%u-%s XA transaction state % " + "in binlog file:%s pos:%llu is inconsistent " + "with the former same state; " + "consider filtered binary loggig.", + ctx.last_gtid.domain_id, ctx.last_gtid.server_id, + buf, is_xac ? "'COMPLETE'" : "'PREPARE'", + linfo->log_file_name, ctx.last_gtid_coord.second);
Typo: s/loggig/logging/ Also, the phrase "the former same state" is not understandable (at least to me), what does it mean? Please reformulate.
@@ -909,14 +910,16 @@ ha_commit_checkpoint_request(void *cookie, void (*pre_hook)(void *)) there's no need to rollback here as all transactions must be rolled back already */ -void ha_close_connection(THD* thd) +void ha_close_connection(THD* thd, bool skip_binlog)
I tried to understand what the purpose of this change is. ha_close_connection() is now called from two new places related to replication of XA PREPARE. And in one place with this mysterious skip_binlog=true condition. My guess is that this is to detach the XA PREPARE from the current THD? So that it becomes visible in XA RECOVER to other threads, and (maybe more important) to make the THD available to run another transaction? One of the things InnoDB does in close_connection is: trx_disconnect_prepared(). But I think the purpose of the close_connection handlerton call is different - it is to free things at the end of a client connection? So it can do more than detach the current external XA - which is presumably the reason for the need to hack it to not call the specific binlog_close_connection()? So this shouldn't hijack the close_connection to mean "detach_xa_trx". It looks like we need an explicit handlerton call for detaching an xa transaction, that can then be used instead of this hack in ha_close_connection().
-static my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin, - void *arg) +my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin, void *arg) { handlerton *hton= plugin_hton(plugin); - if (hton->recover) + if (hton->recover && hton != binlog_hton) { hton->commit_by_xid(hton, ((struct xahton_st *)arg)->xid);
-int ha_commit_or_rollback_by_xid(XID *xid, bool commit) +int ha_commit_or_rollback_by_xid(XID *xid, bool commit, THD *thd) {
+ if (!skip_binlog) + { + // when binlog is ON start from its transaction branch if (commit) binlog_commit_by_xid(binlog_hton, xid); else binlog_rollback_by_xid(binlog_hton, xid);
I think the purpose if this is to ensure that commit/rollback of XA calls into the binlog code first (to binlog before doing it in the engines). But as I think Serg also remarked already, this should not be needed. The binlog is the (internal) transaction coordinator when it is enabled, we have a dedicated interface for that, class TC_LOG. This is used elsewhere in handler.cc to avoid special cases for whether the binlog is enabled or not. So it looks like there needs to be extra virtual functions in TC_LOG for the XA commit/rollback, and these can then be called on the tc_log object and do what is required. Then some logic in handler.cc can be omitted/simplified. ---- I think with this, I got through all of the code changes in branch bb-10.6-MDEV-31949 as of commit 1bdc9bc8077c79e55cb4dbfc7cc47a193d17d54f from January 20, 2024. So this concludes the current review. I have not looked at the test cases in this round of review. - Kristian.