Kristian, Using the chance to forwarding the reply that did not make to the developers list, let me comment on binlog_commit() that I might hastily rate as a big subject.
Hi Andrei,
Here is the last part of my review of the changes in branch bb-10.6-MDEV-31949:
--- a/sql/log_event_server.cc +++ b/sql/log_event_server.cc @@ -3324,6 +3324,11 @@ Gtid_log_event::Gtid_log_event(THD *thd_arg, uint64 seq_no_arg, flags2|= thd->lex->sql_command == SQLCOM_XA_PREPARE ? FL_PREPARED_XA : FL_COMPLETED_XA; xid.set(xid_state.get_xid()); + + // multi-engine external completion is unaware of the prepared engine #. + extra_engines= thd->transaction->all.ha_list ? + ha_count_rw_2pc(thd_arg, + thd_arg->in_multi_stmt_transaction_mode()) - 1 : 0;
I cannot understand the comment. I assume it's about that an XA COMMIT of a detached xid (different session that the XA PREPARE, or replication slave) - right?
Correct.
But what is the consequences of not knowing extra_engines in this case? Is the extra_engines used for anything than the optional truncate of the binlog after master crash?
In the multi-engine case when the # of prepared in Engine is greater than in the value in GTID event then the trx is partially committed, so is decided for to be completed.
@@ -4175,7 +4173,8 @@ int XA_prepare_log_event::do_commit() thd->lex->xid= &xid; if (!one_phase) { - if ((res= thd->wait_for_prior_commit())) + if (thd->is_current_stmt_binlog_disabled() && + (res= thd->wait_for_prior_commit()))
Why?000 Is it to allow XA_PREPARE_LOG_EVENT to group commit with other event groups? That makes sense, but then shouldn't the wait_for_prior_commit() be somewhere in handler.cc (or maybe xa.cc)? Look at Xid_log_event::do_commit(), it doesn't have any wait_for_prior_commit(), that is in ha_commit_one_phase().
Right.
(The wait-for-prior-commit facility is designed as a server-level feature, not a replication-level, it could in principle be used outside of replication, even though it not currently (IIRC).)
I think I understood what you mean which is follow the ha_commit_one_phase() pattern. [todo]
diff --git a/sql/sql_array.h b/sql/sql_array.h index 8610e971016..94c6d86ac05 100644 --- a/sql/sql_array.h +++ b/sql/sql_array.h @@ -172,6 +172,8 @@ template <class Elem> class Dynamic_array
size_t size() const { return array.elements; }
+ size_t max_size() const { return array.max_element; } +
Not that it hurts that much, but I think you can avoid adding this.
Left over of a cleanup from the previous version. [todo]
--- a/sql/log.cc +++ b/sql/log.cc @@ -85,6 +85,7 @@ const char *log_bin_index= 0; const char *log_bin_basename= 0;
MYSQL_BIN_LOG mysql_bin_log(&sync_binlog_period); +XID_cache_element *xid_cache_search(THD *thd, XID *xid);
This duplicates the same declaration in sql/xa.h So probably this can be removed now?
[todo]
@@ -2186,16 +2202,13 @@ int binlog_commit(THD *thd, bool all, bool ro_1pc) }
if (cache_mngr->trx_cache.empty() && - (thd->transaction->xid_state.get_state_code() != XA_PREPARED || - !(thd->ha_data[binlog_hton->slot].ha_info[1].is_started() && - thd->ha_data[binlog_hton->slot].ha_info[1].is_trx_read_write()))) + (!(is_preparing_xa(thd) || is_prepared_xa(thd)) || + !is_binlog_read_write(thd)))
This binlog_commit function started out as the "commit" method for the binlog handlerton. But that is not how it is used anymore, now it is called directly from several different places, including some places specific to external user XA.
A large part of the function has become complex conditions like this that are basically impossible for the developer (or reviewer) to understand exactly why this set of conditions is the correct one. Again, the function is trying to guess what the caller wants, depending on the state of various thd properties etc.
Instead, the function should be split into multiple functions. One that is called to handle XA PREPARE; one for binlog_commit_by_xid(), etc. Then each version should be able to greatly reduce the number of conditions needed.
And if any logic is shared between the different versions, this can then be put in smaller functions that are called from each version.
[my initial response was: Your proposal leads to a significant refactoring, which I add only one hunk. I'd really consider that as a separate exercise I think I've understood. Sure, binlog_prepare() that logs XA-PREPARE can call a common function with binlog_commit() that originally intended to log only normal trx:s. I can look into that. [todo]
@@ -11050,10 +11124,30 @@ bool Recovery_context::decide_or_assess(xid_recovery_member *member, int round, } else if (member->in_engine_prepare < last_gtid_engines) { - DBUG_ASSERT(member->in_engine_prepare > 0); + DBUG_ASSERT(member->in_engine_prepare > 0 || + /* XA event over xid that was completed in engines */ + (member->xid == 0 && + member->xa_binlog_state >= xid_recovery_member::XA_PREPARE)); + if (member->in_engine_prepare > 0 && + member->xa_binlog_state == xid_recovery_member::XA_PREPARE) + {
The function decide_or_assess() is hard to understand. I can see that it's called in 3 cases: XID_LOG_EVENT, XA PREPARE, XA COMMIT/ROLLBACK. So again here, I think this should be split into 3 different functions, one for each case (or maybe two, one for internal XID and one for external XA, whatever makes the more sense).
I don't understand the comment "XA event over xid that was completed in engines". But I remember seeing something in log_event_server.cc that sets extra_engines to 0 in the external user XA case.
So is that the case here, that last_gtid_engines==0 marks this is user XA? That would be much clearer if the logic for internal and external XA was not mixed in the same function.
+ /* + the first of the two options below can occur when round 1 turns to 2 + so event sequences across binlog files are like the following + g_current(B_0, xid), ... g_last(B^*, xid), + where `g`:s are xa event groups associated with the same member via + common `xid`. As elsewhere B_0 is the 2nd round's first and, + B^* the 1st round binlogs. + */ if (reset_truncate_coord(member->binlog_coord < last_gtid_coord ? member->binlog_coord.second : pos))
I'm trying to understand what is going on here. I think this is refering to the way binlog files are scanned? First we scan the most recent binlog file. At the end of it, we may find that the latest binlog checkpoint is in an earlier file, then we continue to scan those older files. So for example we may scan in order:
master-bin.000004, master-bin.000002, master-bin.000003
This means that we may see the XA PREPARE and XA COMMIT out-of-order as we go from the first scan to the remaining - "round 1 turns to 2".
But I don't understand that this is "the first of the two options below":
member->binlog_coord < last_gtid_coord ? member->binlog_coord.second : pos
If last_gtid_coord is in master-bin.000002, but member->binlog_coord is in master-bin.000004 ("round 1 turns to 2"), then I would think we would have
member->binlog_coord > last_gtid_coord
So it would be the *second* of the options below? What am I missing here?
Also, I would suggest a simplification here. After we do the initial scan in round 1, if we find that we need to scan more files, then reset all the state and hashes. And then do a full scan of all necessary binlog files in-order. In my example, that would be:
master-bin.000004 <reset state> master-bin.000002 master-bin.000003 master-bin.000004
This way, the logic becomes simpler, as we don't need logic to handle seeing events out-of-order.
This will mean that occasionally we will need to scan one more binlog file, but I think that is a good trade-off. The recovery code is notoriously difficult to test and get correct, because it's an unusual occurence to need recovery in the first place, and very hard to reproduce any problems. And robust and correct recovery is the whole point of having XA in the first palce. So making the code as simple as possible is crucial.
The extra cost is small:
- For one, recovery is a rare event - Re-scanning only costs CPU, the file is already read from storage into OS buffers. - Normally, the binlog checkpoint completes shortly after rotation, so the last binlog file that must be rescanned will be short.
So simply redo the scan from the correct point when a "round 2" is needed, to avoid any logic to handle seeing events out-of-order during recovery.
@@ -11182,7 +11311,9 @@ void Recovery_context::process_gtid(int round, Gtid_log_event *gev, last_gtid_no2pc= false; last_gtid_standalone= (gev->flags2 & Gtid_log_event::FL_STANDALONE) ? true : false; - if (do_truncate && last_gtid_standalone) + /* a standalone XA-COMPLETE is truncation-safe */ + if (do_truncate && last_gtid_standalone + && !(gev->flags2 & Gtid_log_event::FL_COMPLETED_XA)) update_binlog_unsafe_coord_if_needed(linfo);
I don't understand the comment here. What is "a standalone XA-COMPLETE" ? What does it mean to be "truncation-safe"? Please reword the comment to be comprehensible.
+ /* + the user XA events are decided similary to the normal transaction, + difference is just in the timing which is Gtid event read now. + */
I don't understand this comment. Please reformulate to make it understandable. What does normal transactions do, and what does user XA do? How does the timing differ, and why?
+ sql_print_warning("GTID %u-%u-%s XA transaction state % " + "in binlog file:%s pos:%llu is inconsistent " + "with the former same state; " + "consider filtered binary loggig.", + ctx.last_gtid.domain_id, ctx.last_gtid.server_id, + buf, is_xac ? "'COMPLETE'" : "'PREPARE'", + linfo->log_file_name, ctx.last_gtid_coord.second);
Typo: s/loggig/logging/
Also, the phrase "the former same state" is not understandable (at least to me), what does it mean? Please reformulate.
I also have to be short now in replying to the rest, which I'll expand on tomorrow morning. Now just this.
@@ -909,14 +910,16 @@ ha_commit_checkpoint_request(void *cookie, void (*pre_hook)(void *)) there's no need to rollback here as all transactions must be rolled back already */ -void ha_close_connection(THD* thd) +void ha_close_connection(THD* thd, bool skip_binlog)
I tried to understand what the purpose of this change is. ha_close_connection() is now called from two new places related to replication of XA PREPARE. And in one place with this mysterious skip_binlog=true condition.
My guess is that this is to detach the XA PREPARE from the current THD?
The reason is described in 6cbb63f592 commit name II.1. ``` OTHER CHANGES, to ha_close_connection() are done to keep the binlog hton connection alive for the end of XA-prepare's binlogging on the parallel slave. Other than the binlog hton:s are disconnected from the slave transaction handler in TC_LOG::run_commit_ordered() which makes the prepared transaction engine objects available to binlog-commit-order followers. ``` In other words XAP_k in run_commit_ordered() releases with ha_close_connection(thd, true) all resources that the following XAC_k+i need. The latter does not need XAP_k to release binlog which it still need while being is this stack of TC_LOG::run_commit_ordered MYSQL_BIN_LOG::trx_group_commit_leader MYSQL_BIN_LOG::write_transaction_to_binlog_events MYSQL_BIN_LOG::write_transaction_to_binlog binlog_flush_cache binlog_commit_flush_xa_prepare binlog_commit binlog_prepare prepare_or_error ha_prepare
So that it becomes visible in XA RECOVER to other threads, and (maybe more important) to make the THD available to run another transaction? One of the things InnoDB does in close_connection is: trx_disconnect_prepared().
But I think the purpose of the close_connection handlerton call is different
And you must have meant 'ha_close_connection', right?
- it is to free things at the end of a client connection?
Well, with the ha- prefix, it's to disconnect THD from all transaction branches/htons.
So it can do more than detach the current external XA - which is presumably the reason for the need to hack it to not call the specific binlog_close_connection()?
So this shouldn't hijack the close_connection to mean "detach_xa_trx". It looks like we need an explicit handlerton call for detaching an xa transaction, that can then be used instead of this hack in ha_close_connection().
I need to digest that in the morning. Generally I don't see a problem to conduct hton release flexibly.
-static my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin, - void *arg) +my_bool xacommit_handlerton(THD *unused1, plugin_ref plugin, void *arg) { handlerton *hton= plugin_hton(plugin); - if (hton->recover) + if (hton->recover && hton != binlog_hton) { hton->commit_by_xid(hton, ((struct xahton_st *)arg)->xid);
-int ha_commit_or_rollback_by_xid(XID *xid, bool commit) +int ha_commit_or_rollback_by_xid(XID *xid, bool commit, THD *thd) {
+ if (!skip_binlog) + { + // when binlog is ON start from its transaction branch if (commit) binlog_commit_by_xid(binlog_hton, xid); else binlog_rollback_by_xid(binlog_hton, xid);
I think the purpose if this is to ensure that commit/rollback of XA calls into the binlog code first (to binlog before doing it in the engines).
But as I think Serg also remarked already, this should not be needed. The binlog is the (internal) transaction coordinator when it is enabled, we have a dedicated interface for that, class TC_LOG. This is used elsewhere in handler.cc to avoid special cases for whether the binlog is enabled or not.
So it looks like there needs to be extra virtual functions in TC_LOG for the XA commit/rollback, and these can then be called on the tc_log object and do what is required.
I need to process that tomorrow.
Then some logic in handler.cc can be omitted/simplified.
----
I think with this, I got through all of the code changes in branch bb-10.6-MDEV-31949 as of commit 1bdc9bc8077c79e55cb4dbfc7cc47a193d17d54f from January 20, 2024. So this concludes the current review. I have not looked at the test cases in this round of review.
Thanks again! Cheers, Andrei