Hi Andrei, It's not an easy review of the XA patches in bb-10.6-MDEV-31949. One thing that would have helped is a clear design document that lists all the different concerns and points that need to be considered around recovery, backup, replication, possible conflicts, etc. This could then be referenced by the reviewer to see if a given concern was considered, and if so what the resolution is. But I'll try as best I can with what we have. As I'm trying to understand the different issues, I found making test cases to check my understanding helped. I have pushed some to the branch knielsen_mdev31949_review . They seem to show some problems in the current patch: mysql-test/suite/rpl/t/rpl_recovery_xa.test - A trx is left in "XA PREPARED" state after crash recovery, even though the binlogged event group is incomplete (it should have been rolled back). - A trx is committed after crash recovery, even though it is not binlogged (it should have been rolled back). - An incomplete XA event group is output by mysqlbinlog with a "ROLLBACK" at the end which gives an error (should probably be XA ROLLBACK). mysql-test/suite/rpl/t/rpl_recovery_xa2.test - An XA PREPAREd transaction is lost after recovery when it was binlogged in an earlier binlog file. (I think this might be the issue that's referenced as a ToDo with Xid_list_log_event in patch?) mysql-test/suite/mariabackup/slave_provision_xa.test - An XA PREPARE does not update the binlog position inside InnoDB. So a slave provisioned with mariabackup from this starting position can fail with duplicate apply of XA PREPARE. Some other initial remarks follow. Since there's a desire to progress with this quickly, I thought it would be useful to send these early, so you can start looking/discussions before I complete building an understanding of the whole design and code picture.
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index 17e512bf34e..a4cdc72e18c 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -17144,8 +17144,16 @@ innobase_commit_by_xid( }
if (trx_t* trx = trx_get_trx_by_xid(xid)) { + THD *thd = current_thd; + if (thd) + thd_binlog_pos(thd, &trx->mysql_log_file_name, + &trx->mysql_log_offset);
I guess this is necessary to get the correct binlog position stored inside innodb to match the binlogged XA COMMIT event, right? But will it work if called during crash recovery? During crash recovery, we could actually update the position, but only if this is trx is the last event group in the old binlog. Maybe that's better handled elsewhere (there are currently other cases where the binlog position in InnoDB can be wrong immediately after restart until the first new transaction is committed).
+ if (trx->mysql_log_offset > 0) + trx->flush_log_later = true;
But I'm guessing this is the motivation. If called from within binlog group commit, you want to disable fsync in InnoDB. But if called from recovery, or when binlog is disabled, this condition will be false, and fsync will be done. If so, that's a _really_ obfuscated way to do it :-/ It surely needs a very clear comment explaining what is going on. And it is very fragile. What if thd_binlog_pos() is changed at some point, to slightly change the conditions under which it returns 0 or not? That could silently break this code. It would be much better if this could be done similar to how normal commits are done. We have the commit_ordered handlerton which does the fast part of commit-to-memory. And then the slow part which does the rest in commit handlerton. So we could have commit_by_xid_ordered() and commit_by_xid(). The problem is that commit_by_xid_ordered() needs to release the xid, so it becomes harder to get hold of the transaction (which the commit handlerton just finds from the thd). Maybe commit_by_xid_ordered() would need to return some trx handle. What part of the commit is skipped by this in InnoDB? Is it _only_ flushing the log up to the LSN of the commit record, or are there more things? How is another storage engine supposed to handle this? This also needs to be documented somewhere, eg. in comments on the handlerton calls. This also means that the --innodb-flush-log-at-trx-commit=3 is not supported for XA COMMIT when it's done from a different session that the XA PREPARE. For example, with --innodb-flush-log-at-trx-commit=3 but --sync-binlog=0, normal transactions are still durable in InnoDB. But external XA transactions will not be. This could be solved if the commit_by_xid() was split into a fast and slow part similar to normal commit. Probably you'll say that this works for now, and perhaps that's true. But at least it really _really_ needs some detailed comments explaining what is going on, and what should be kept in mind if modifying/improving this code in the future, otherwise it will be impossible for the next developer to understand.
diff --git a/sql/log.cc b/sql/log.cc index e54d4087d46..7e3dbeaef29 100644 --- a/sql/log.cc +++ b/sql/log.cc
@@ -9365,18 +9381,84 @@ TC_LOG::run_commit_ordered(THD *thd, bool all) all ? thd->transaction->all.ha_list : thd->transaction->stmt.ha_list;
mysql_mutex_assert_owner(&LOCK_commit_ordered); - for (; ha_info; ha_info= ha_info->next()) + + if (!ha_info)
Your patch puts a lot of XA/binlog logic into run_commit_ordered(). But that's not appropriate. The run_commit_ordered() function has the purpose solely to call the commit_ordered() handlerton in each engine, and it's also called from TC_LOG_MMAP (that's why it's TC_LOG::, not MYSQL_BIN_LOG::). Instead make a new function that handles the part of binlog commit that needs to be done under LOCK_commit_ordered. Be sure to comment it that this runs under this highly contended mutex, and needs to do as little work as possible. This function can then call run_commit_ordered(). It can also be used to share a few bits of code that are now duplicated between the opt_optimize_thread_scheduling and !opt_optimize_thread_scheduling cases, like: ++num_commits; run_commit_ordered(...) entry->thd->wakeup_subsequent_commits() if (entry->queued_by_other) { ... } (if it makes sense to do so in the actual code, only if makes things simpler, that's easier to decide when working with the actual code). There sesems to be 3 new cases: - XA PREPARE - XA COMMIT from same session as XA PREPARE - XA COMMIT of detached XA transaction The logic to decide this was already done earlier, when the event group was queued for group commit, eg. to decide which end_event to use. For example there is no need to call xid_cache_search() here, that can be done earlier (and it's also wrong to call my_error() here, as we're in the wrong thread context). So the logic about what to do should be placed higher up in the call stack, ideally where we already know, eg. binlog_commit_by_xid(). And errors such as ER_XAER_NOTA can be thrown already there, and the decision on what to do put consisely as a flag into the struct group_commit_entry, so the logic during binlog group commit becomes as simple as possible. (It might even make sense to simply store a function pointer to the method that should be called to handle this; normal, attached XA COMMIT, detached XA COMMIT, XA PREPARE. But again it's easier when actually working with the code to decide if this is simpler than a couple flags or case on enum value).
+ if (thd->rgi_slave && thd->rgi_slave->is_parallel_exec) + { + DBUG_ASSERT(thd->transaction->xid_state.is_dummy_XA()); + + auto xs= xid_cache_search(thd, xid); + if(!xs) + { + my_error(ER_XAER_NOTA, MYF(0)); + return; + } + else + { + thd->transaction->xid_state.xid_cache_element= xs; + } + DBUG_ASSERT(thd->transaction->xid_state.is_recovered()); + } + plugin_foreach(NULL, commit ? xacommit_handlerton : xarollback_handlerton, MYSQL_STORAGE_ENGINE_PLUGIN, &xaop); xid_cache_delete(thd);
Why this check for thd->rgi_slave->is_parallel_exec here deep in binlog group commit? That seems very wrong, and again no comment whatsoever explaining what is going on, so again I have to try to guess... I'm guessing that it's something with calling xid_cache_delete() which needs to find thd->transaction->xid_state.xid_cache_element ? But this could be done much better in a higher-level part of the code that's specific to xa and (parallel) replication, not here deep down in binlog group commit code. And why should this be needed only for parallel replication, not for non-parellel replication or non-replication SQL? I don't understand. A couple remarks about the tests I pushed to knielsen_mdev31949_review: I think the failures in rpl.rpl_recovery_xa I linked are from duplicate XID in XA PREPARE ; XA COMMIT ; XA PREPARE. There comes an ambiguity between whether the transaction prepared in the engine at recovery is from the first XA PREPARE (so it should be committed) or the second (so it should be rolled back). This happens when the second XA PREPARE is in the engine but not binlogged. One way to fix could actually be to use the binlog position stored inside InnoDB. If the XA COMMIT is before this binlog position, then we know the first transaction became durable in InnoDB, so we know we have to roll back the current prepared transaction in the engine. Otherwise we should commit it. I see the recovery code already uses binlog positions to decide the fate of pending XA transactions in the engines, but I did not yet fully understand what is going on there. Another way could be to binlog an extra event before the second XA PREPARE with the duplicate XID. This event would mean "This xid is now durable committed in the engine", so we would know not to commit any prepared trx in the engine with the same xid. This event would only be needed in case of XA PREPARE with a duplicate xid. This though requires the prior XA COMMIT to fsync inside InnoDB first, I think. Another test failure in rpl.slave_provision_xa I think happens because the XA PREPARE does not update the binlog position inside innodb. Maybe the way to solve this problem is to have the slave be idempotent, and silently ignore any duplicate XA PREPARE if the xid is already prepared? About the "ToDo" of using some Xid_list_log_event. I think this is a way during recovery to get the state of the XA PREPARE/COMMIT's in earlier binlog files before the last binlog checkpoint, if I understood things correctly? Another way to do this is to delay the binlog checkpoints until all XA PREPARE in a binlog file have been XA COMMIT'ted (or XA ROLLBACK'ed), so that they will all be scanned during recovery. I actually implemented this in my branch knielsen_mdev32020. The advantage of this is that we may anyway need this to provide the XA PREPARE events to a slave set up from a logical backup (like mysqldump), to avoid purging XA PREPARE events too early. And then it's a bit simpler to simply scan an extra file rather than both scan and read Xid_list_log_event. But I think Xid_list_log_event could work fine as well, just mentioning it for you to consider. That's it for now, I will continue trying to understand what is going on and complete the review. - Kristian.