[Maria-developers] MWL#116: Efficient group commit, ready for code review
Hi Serg,

Please find attached the patch for code review of MWL#116: Efficient group commit.

I went through all your comments in the architecture review and updated the worklog description and the code accordingly. I hope I got everything; otherwise I hope you have the patience to point out any omissions once again!

The patch is pushed here if you want to see the complete code:

    lp:~maria-captains/maria/mariadb-5.1-mwl116

Once the code passes code review, I think the plan is to include this feature in MariaDB 5.3.

Please let me know if you need anything else from me for the review.

 - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Please find attached the patch for code review of MWL#116: Efficient group commit.
Of course I forgot to attach the patch, trying again. - Kristian.
Hi!
"Kristian" == Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Kristian> Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Please find attached the patch for code review of MWL#116: Efficient group commit.
=== added file 'mysql-test/suite/binlog/t/binlog_ioerr.test'
--- mysql-test/suite/binlog/t/binlog_ioerr.test	1970-01-01 00:00:00 +0000
+++ mysql-test/suite/binlog/t/binlog_ioerr.test	2010-10-01 08:16:44 +0000
<cut>
+# Actually the output from this currently shows a bug.
+# The injected IO error leaves partially written transactions in the binlog in
+# the form of stray "BEGIN" events.
+# These should disappear from the output if binlog error handling is improved.
Do you have a worklog number for the above? If so, please add it to the source too. <cut>
--- mysql-test/t/group_commit_binlog_pos.test	1970-01-01 00:00:00 +0000
+++ mysql-test/t/group_commit_binlog_pos.test	2010-10-28 19:01:32 +0000
<cut>
+# Check that the binlog position reported by InnoDB is the correct one
+# for the end of the second transaction (as can be checked with
+# mysqlbinlog).
+let $MYSQLD_DATADIR= `SELECT @@datadir`;
+--exec grep 'InnoDB: Last MySQL binlog file position' $MYSQLD_DATADIR/../../log/mysqld.1.err | tail -1
Is there another way to get the position without using grep? (grep probably doesn't work on Windows.) As we already have regexps in mysqltest, it should be trivial to do our own simple version of grep there. Let's see if we can get someone to implement it. Another option would be to store the startup position in some user variables (assuming this would be useful for anyone else).
--- sql/handler.cc 2010-10-20 10:58:43 +0000
@@ -1075,7 +1077,7 @@ ha_check_and_coalesce_trx_read_only(THD
 */
 int ha_commit_trans(THD *thd, bool all)
 {
 #ifdef USING_TRANSACTIONS
-  if (ha_info)
+  if (!ha_info)
+  {
+    /* Free resources and perform other cleanup even for 'empty' transactions. */
+    if (is_real_trans)
+      thd->transaction.cleanup();
+    DBUG_RETURN(0);
+  }
Thanks for cleaning up the indentation levels! (It's always better to handle the 'if-not-active-then-exit' cases first than to have them in an else part.) <cut>
+      /*
+        Sic: we know that prepare() is not NULL since otherwise
+        trans->no_2pc would have been set.
+      */
+      if ((err= ht->prepare(ht, thd, all)))
+        my_error(ER_ERROR_DURING_COMMIT, MYF(0), err);
+      status_var_increment(thd->status_var.ha_prepare_count);
+
+      if (err)
+        goto err;
+
The above is better written as:

    status_var_increment(thd->status_var.ha_prepare_count);
    if ((err= ht->prepare(ht, thd, all)))
    {
      my_error(ER_ERROR_DURING_COMMIT, MYF(0), err);
      goto err;
    }
+      if (ht->prepare_ordered)
+        need_prepare_ordered= TRUE;
+      if (ht->commit_ordered)
+        need_commit_ordered= TRUE;
Another way to do this would be:

    need_prepare_ordered|= (ht->prepare_ordered != NULL);
    need_commit_ordered|= (ht->commit_ordered != NULL);

(Note the explicit != NULL; a bare function pointer is not a valid operand of bitwise | in C++.) Less code and faster...
+  }
+  DBUG_EXECUTE_IF("crash_commit_after_prepare", DBUG_ABORT(););
This should be DBUG_SUICIDE() after you merge with latest 5.1. <cut>
+  DBUG_EXECUTE_IF("crash_commit_before_unlog", DBUG_ABORT(););
+  tc_log->unlog(cookie, xid);
The latest 5.1 code tests the return value of unlog(), so you need to check this after the merge.
+
+  DBUG_EXECUTE_IF("crash_commit_after", DBUG_ABORT(););
+  goto end;
+
+  /* Come here if error and we need to rollback. */
+err:
+  if (!error)
+    error= 1;
Add a comment:

    error= 1;                                /* Transaction was rolled back */

I would also remove the test of whether error is already set: we are doing a rollback here, and in this case the only allowed result is 1. <cut>
@@ -1839,7 +1876,13 @@ int ha_start_consistent_snapshot(THD *th
 {
   bool warn= true;
+  /*
+    Holding the LOCK_commit_ordered mutex ensures that for any transaction
+    we either see it committed in all engines, or in none.
+  */
A slightly better comment is:

    /*
      Holding the LOCK_commit_ordered mutex ensures that we get the same
      snapshot for all engines (including the binary log). This allows us
      among other things to do backups with START TRANSACTION WITH
      CONSISTENT SNAPSHOT and have a consistent binlog position.
    */
+  pthread_mutex_lock(&LOCK_commit_ordered);
   plugin_foreach(thd, snapshot_handlerton, MYSQL_STORAGE_ENGINE_PLUGIN, &warn);
+  pthread_mutex_unlock(&LOCK_commit_ordered);
/* Same idea as when one wants to CREATE TABLE in one engine which does not
=== modified file 'sql/handler.h'
--- sql/handler.h	2010-10-20 10:58:43 +0000
+++ sql/handler.h	2010-11-01 14:15:03 +0000
@@ -659,9 +659,100 @@ struct handlerton
     NOTE 'all' is also false in auto-commit mode where 'end of statement'
     and 'real commit' mean the same event.
   */
<cut>
+    Not that like prepare(), commit_ordered() is only called when 2-phase
+    commit takes place. Ie. when no binary log and only a single engine
+    participates in a transaction, one commit() is called, no
+    commit_orderd(). So engines must be prepared for this.
commit_orderd -> commit_ordered()
+
+    The calls to commit_ordered() in multiple parallel transactions is
+    guaranteed to happen in the same order in every participating
+    handler. This can be used to ensure the same commit order among multiple
+    handlers (eg. in table handler and binlog). So if transaction T1 calls
+    into commit_ordered() of handler A before T2, then T1 will also call
+    commit_ordered() of handler B before T2.
+
+    Engines that implement this method should during this call make the
+    transaction visible to other transactions, thereby making the order of
+    transaction commits be defined by the order of commit_ordered() calls.
+
+    The intension is that commit_ordered() should do the minimal amount of
intension -> intention <cut>
+    When 2-phase commit is not used (eg. only one engine (and no binlog) in
+    transaction), prepare() is not called and in such cases prepare_ordered()
+    also is not called.
-> When 2-phase commit is not used (eg. only one engine (and no binlog) in transaction), neither prepare() nor prepare_ordered() is called.
=== modified file 'sql/log.cc'
--- sql/log.cc	2010-10-19 13:58:35 +0000
+++ sql/log.cc	2010-11-01 14:15:03 +0000
@@ -38,6 +38,7 @@
 #endif
 #include <mysql/plugin.h>
+#include "debug_sync.h"
 /* max size of the log message */
 #define MAX_LOG_BUFFER_SIZE 1024
@@ -154,7 +155,7 @@ class binlog_trx_data
 {
 public:
   binlog_trx_data()
     : at_least_one_stmt_committed(0), incident(FALSE), m_pending(0),
-    before_stmt_pos(MY_OFF_T_UNDEF)
+    before_stmt_pos(MY_OFF_T_UNDEF), commit_bin_log_file_pos(0), using_xa(0)
Change using_xa(0) -> using_xa(FALSE), just to be consistent with:
@@ -208,11 +209,13 @@ public:
  ...
     before_stmt_pos= MY_OFF_T_UNDEF;
     incident= FALSE;
     trans_log.end_of_file= max_binlog_cache_size;
+    using_xa= FALSE;
+    commit_bin_log_file_pos= 0;
     DBUG_ASSERT(empty());
   }
@@ -1394,103 +1408,131 @@ static int binlog_close_connection(handl
 }
 /*
-  End a transaction.
+  End a transaction, writing events to the binary log.
   SYNOPSIS
-    binlog_end_trans()
+    binlog_flush_trx_cache()
     thd        The thread whose transaction should be ended
     trx_data   Pointer to the transaction data to use
-    end_ev     The end event to use, or NULL
-    all        True if the entire transaction should be ended, false if
-               only the statement transaction should be ended.
+    end_ev     The end event to use (COMMIT, ROLLBACK, or commit XID)
<cut>
-  If 'end_ev' is NULL, the transaction is a rollback of only
-  transactional tables, so the transaction cache will be truncated
-  to either just before the last opened statement transaction (if
-  'all' is false), or reset completely (if 'all' is true).
+  This can be to commit a transaction, with a COMMIT query event or an XA
+  commit XID event. But it can also be to rollback a transaction with a
+  ROLLBACK query event, used for rolling back transactions which also
+  contain updates to non-transactional tables.
 */
Please add back documentation for 'all', as it is still a parameter. Looking at the code, end_ev can still be NULL; it would be good to document this case too (or never call this with end_ev == NULL). See my comment about this later in the code... <cut>
+static LEX_STRING const write_error_msg=
+  { C_STRING_WITH_LEN("error writing to the binary log") };
+
Please move to the start of the file (so we have all variables declared in one place).
+#ifndef DBUG_OFF
+static ulong opt_binlog_dbug_fsync_sleep= 0;
+#endif
Please also move the above to the start of the file.
 bool MYSQL_BIN_LOG::flush_and_sync()
 {
   int err=0, fd=log_file.file;
@@ -3956,6 +4003,11 @@ bool MYSQL_BIN_LOG::flush_and_sync()
   {
     sync_binlog_counter= 0;
     err=my_sync(fd, MYF(MY_WME));
+#ifndef DBUG_OFF
+    ulong usec_sleep= opt_binlog_dbug_fsync_sleep;
+    if (usec_sleep > 0)
+      my_sleep(usec_sleep);
+#endif
Why not simply do:

    if (opt_binlog_dbug_fsync_sleep > 0)
      my_sleep(opt_binlog_dbug_fsync_sleep);

As this is only for debugging, the above is safe.
@@ -4335,8 +4379,6 @@ bool MYSQL_BIN_LOG::write(Log_event *eve
     if (thd->binlog_flush_pending_rows_event(end_stmt))
       DBUG_RETURN(error);
-    pthread_mutex_lock(&LOCK_log);
-
ok to remove, but....
 #endif /* USING_TRANSACTIONS */
   DBUG_PRINT("info",("event type: %d",event_info->get_type_code()));
+  if (file == &log_file)
+    pthread_mutex_lock(&LOCK_log);
Here we must also check that is_open() is still true, as we did the earlier test without holding the LOCK_log mutex. <cut>
+int MYSQL_BIN_LOG::write_cache(IO_CACHE *cache)
 {
-  Mutex_sentry sentry(lock_log ? &LOCK_log : NULL);
-
Please also remove the class Mutex_sentry, as it's not used anymore. It would however be good to add here:

    safe_mutex_assert_owner(&LOCK_log);
@@ -4778,26 +4812,22 @@ int query_error_code(THD *thd, bool not_
   return error;
 }
-bool MYSQL_BIN_LOG::write_incident(THD *thd, bool lock)
+bool MYSQL_BIN_LOG::write_incident(THD *thd)
 {
   uint error= 0;
   DBUG_ENTER("MYSQL_BIN_LOG::write_incident");
Also add here:
   Incident incident= INCIDENT_LOST_EVENTS;
   Incident_log_event ev(thd, incident, write_error_msg);
-  if (lock)
-    pthread_mutex_lock(&LOCK_log);
+
+  pthread_mutex_lock(&LOCK_log);
You need to add a test that is_open() is true here. <cut>
+bool
+MYSQL_BIN_LOG::write_transaction_to_binlog_events(group_commit_entry *entry)
+{
<cut>
+
+  if (!entry->error)
+    return 0;
Add likely() around the above test, to make it clear why this case is not handled in the following switch.
+
+  switch (entry->error)
+  {
  ...
<cut>
+MYSQL_BIN_LOG::trx_group_commit_leader(group_commit_entry *leader)
+{
<cut>
+
+    /* Skip log_xid for transactions without xid, marked by NULL end_event. */
+    if (!current->end_event)
+      continue;
We talked about ignoring these.
     /*
+      We only bother to write to the binary log if there is anything
+      to write.
     */
+    if (my_b_tell(cache) > 0)
+    {
Can't we ignore transactions with an empty cache at the upper level? I can't come up with any reason why we put these in the queue.
+      current->error= write_transaction(current);
+      if (current->error)
+        current->commit_errno= errno;
You could, if you want, change this to:

    if ((current->error= write_transaction(current)))
      current->commit_errno= errno;

(I find this much faster to read.)
+      write_count++;
+    }
+  trx_data->commit_bin_log_file_pos=
+    log_file.pos_in_file + (log_file.write_pos - log_file.write_buffer);
Please add a macro in my_sys.h:

    #define my_b_write_tell(info) \
      ((info)->pos_in_file + ((info)->write_pos - (info)->write_buffer))

and use this instead. <cut>
+  current= queue;
+  while (current != NULL)
+  {
+    DEBUG_SYNC(leader->thd, "commit_loop_entry_commit_ordered");
+    ++num_commits;
+    if (current->trx_data->using_xa && !current->error)
+      run_commit_ordered(current->thd, current->all);
+
+    /*
+      Careful not to access current->next after waking up the other thread! As
+      it may change immediately after wakeup.
+    */
+    group_commit_entry *next= current->next;
Please move the variable declaration to the start of the loop. <cut>
+int
+MYSQL_BIN_LOG::write_transaction(group_commit_entry *entry)
+{
+  binlog_trx_data *trx_data= entry->trx_data;
+  IO_CACHE *cache= &trx_data->trans_log;
+  /*
+    Log "BEGIN" at the beginning of every transaction. Here, a transaction is
+    either a BEGIN..COMMIT block or a single statement in autocommit mode. The
+    event was constructed in write_transaction_to_binlog(), in the thread
+    running the transaction.
+
+    Now this Query_log_event has artificial log_pos 0. It must be
+    adjusted to reflect the real position in the log. Not doing it
+    would confuse the slave: it would prevent this one from
+    knowing where he is in the master's binlog, which would result
+    in wrong positions being shown to the user, MASTER_POS_WAIT
+    undue waiting etc.
+  */
The above old comment doesn't make sense anymore in this place of the code. Please remove or move it to a better place. (I tried to find a better place, but couldn't find it)
+  if (entry->begin_event->write(&log_file))
+    return ER_ERROR_ON_WRITE;
+
+  DBUG_EXECUTE_IF("crash_before_writing_xid",
+                  {
+                    if ((write_cache(cache)))
+                      DBUG_PRINT("info", ("error writing binlog cache"));
+                    else
+                      flush_and_sync();
+
+                    DBUG_PRINT("info", ("crashing before writing xid"));
+                    abort();
+                  });
+
+  if (write_cache(cache))
+    return ER_ERROR_ON_WRITE;
+
+  if (entry->end_event->write(&log_file))
+    return ER_ERROR_ON_WRITE;
+
+  if (entry->incident_event && entry->incident_event->write(&log_file))
+    return ER_ERROR_ON_WRITE;
+
+  if (cache->error)                             // Error on read
+    return ER_ERROR_ON_READ;
It would be better to test this right after 'write_cache'; there is no point in writing any more if that fails....
+static my_bool mutexes_inited;
+pthread_mutex_t LOCK_prepare_ordered;
+pthread_mutex_t LOCK_commit_ordered;
Move to start of file.
+
+void
+TC_init()
+{
+  my_pthread_mutex_init(&LOCK_prepare_ordered, MY_MUTEX_INIT_SLOW,
+                        "LOCK_prepare_ordered", MYF(0));
+  my_pthread_mutex_init(&LOCK_commit_ordered, MY_MUTEX_INIT_SLOW,
+                        "LOCK_commit_ordered", MYF(0));
+  mutexes_inited= TRUE;
+}
+
+void
+TC_destroy()
+{
+  if (mutexes_inited)
+  {
+    pthread_mutex_destroy(&LOCK_prepare_ordered);
+    pthread_mutex_destroy(&LOCK_commit_ordered);
+    mutexes_inited= FALSE;
+  }
+}
+
+void
+TC_LOG::run_prepare_ordered(THD *thd, bool all)
+{
+  Ha_trx_info *ha_info=
+    all ? thd->transaction.all.ha_list : thd->transaction.stmt.ha_list;
+
+  for (; ha_info; ha_info= ha_info->next())
+  {
+    handlerton *ht= ha_info->ht();
+    if (!ht->prepare_ordered)
+      continue;
+    safe_mutex_assert_owner(&LOCK_prepare_ordered);
Move the safe_mutex_assert_owner() to the start of the function. It's enough to test it once, and the mutex should be held in all cases.
+    ht->prepare_ordered(ht, thd, all);
+  }
+}
+
Add an empty line here (two empty lines between functions).
+void
+TC_LOG::run_commit_ordered(THD *thd, bool all)
+{
+  Ha_trx_info *ha_info=
+    all ? thd->transaction.all.ha_list : thd->transaction.stmt.ha_list;
+
+  for (; ha_info; ha_info= ha_info->next())
+  {
+    handlerton *ht= ha_info->ht();
+    if (!ht->commit_ordered)
+      continue;
+    safe_mutex_assert_owner(&LOCK_commit_ordered);
Move to start of function
+    ht->commit_ordered(ht, thd, all);
+    DEBUG_SYNC(thd, "commit_after_run_commit_ordered");
+  }
+}
+

Add an empty line here (two empty lines between functions).
+int TC_LOG_MMAP::log_and_order(THD *thd, my_xid xid, bool all,
+                               bool need_prepare_ordered,
+                               bool need_commit_ordered)
+{
+  int cookie;
+  struct commit_entry entry;
+  bool is_group_commit_leader;
+  LINT_INIT(is_group_commit_leader);
+
+  if (need_prepare_ordered)
+  {
+    pthread_mutex_lock(&LOCK_prepare_ordered);
+    run_prepare_ordered(thd, all);
+    if (need_commit_ordered)
+    {
+      /*
+        Must put us in queue so we can run_commit_ordered() in same sequence
+        as we did run_prepare_ordered().
+      */
+      thd->clear_wakeup_ready();
+      entry.thd= thd;
+      commit_entry *previous_queue= commit_ordered_queue;
+      entry.next= previous_queue;
+      commit_ordered_queue= &entry;
+      is_group_commit_leader= (previous_queue == NULL);
+    }
+    pthread_mutex_unlock(&LOCK_prepare_ordered);
+  }
+
+  if (xid)
+    cookie= log_one_transaction(xid);
+  else
+    cookie= 0;
It's easier and faster to do:

    cookie= 0;
    if (xid)
      cookie= log_one_transaction(xid);
@@ -5965,30 +6374,66 @@ void TC_LOG_BINLOG::close()
   pthread_cond_destroy(&COND_prep_xids);
 }
+/*
+  Do a binlog log_xid() for a group of transactions, linked through
+  thd->next_commit_ordered.
+*/
-int TC_LOG_BINLOG::log_xid(THD *thd, my_xid xid)
+int
+TC_LOG_BINLOG::log_and_order(THD *thd, my_xid xid, bool all,
+                             bool need_prepare_ordered __attribute__((unused)),
+                             bool need_commit_ordered __attribute__((unused)))
 {
-  DBUG_ENTER("TC_LOG_BINLOG::log");
-  Xid_log_event xle(thd, xid);
-  binlog_trx_data *trx_data=
+  int err;
+  DBUG_ENTER("TC_LOG_BINLOG::log_and_order");
+
+  binlog_trx_data *const trx_data=
     (binlog_trx_data*) thd_get_ha_data(thd, binlog_hton);
-  /*
-    We always commit the entire transaction when writing an XID. Also
-    note that the return value is inverted.
-  */
-  DBUG_RETURN(!binlog_end_trans(thd, trx_data, &xle, TRUE));
+
+  trx_data->using_xa= TRUE;
+  if (xid)
The above looks strange. How can we have XA without an xid? If this is possible, please add a comment before setting 'using_xa'. If it's not possible that xid is 0, add an assert instead.
+  {
+    Xid_log_event xid_event(thd, xid);
+    err= binlog_flush_trx_cache(thd, trx_data, &xid_event, all);
+  }
+  else
+    err= binlog_flush_trx_cache(thd, trx_data, NULL, all);
Shouldn't the above be binlog_truncate_trx_cache()? At least all the other places where we previously called binlog_flush_trx_cache(thd, trx_data, NULL, all) were replaced with binlog_truncate_trx_cache(). If the call is right, this should be explained in the function comment of binlog_flush_trx_cache().
+
+  DBUG_RETURN(!err);
 }
<cut>
+/*
+  Once an XID is committed, it is safe to rotate the binary log, as it can no
+  longer be needed during crash recovery.
+
+  This function is called to mark an XID this way. It needs to decrease the
+  count of pending XIDs, and signal the log rotator thread when it reaches zero.
+*/
+void
+TC_LOG_BINLOG::mark_xid_done()
 {
+  DBUG_ENTER("TC_LOG_BINLOG::mark_xid_done");
   pthread_mutex_lock(&LOCK_prep_xids);
   DBUG_ASSERT(prepared_xids > 0);
   if (--prepared_xids == 0)
   {
     pthread_cond_signal(&COND_prep_xids);
   }
   pthread_mutex_unlock(&LOCK_prep_xids);
   DBUG_VOID_RETURN;
An even faster way to do this is:

    bool signal;
    pthread_mutex_lock(&LOCK_prep_xids);
    DBUG_ASSERT(prepared_xids > 0);
    signal= !--prepared_xids;
    pthread_mutex_unlock(&LOCK_prep_xids);
    if (signal)
      pthread_cond_signal(&COND_prep_xids);
    DBUG_VOID_RETURN;

This is faster because the signalled thread will not wake up only to block again on the mutex we are still holding...
+void TC_LOG_BINLOG::unlog(ulong cookie, my_xid xid)
+{
+  DBUG_ENTER("TC_LOG_BINLOG::unlog");
+  if (xid)
+    mark_xid_done();
+  rotate_and_purge(0);  // as ::write_transaction_to_binlog() did not rotate
Note for when doing the merge: the newest 5.1 code returns the value of rotate_and_purge().
+static ulonglong binlog_status_var_num_commits;
+static ulonglong binlog_status_var_num_group_commits;
Move variables to start of file
+
+
+/*
+  Copy out current values of status variables, for SHOW STATUS or
+  information_schema.global_status.
+
+  This is called only under LOCK_status, so we can fill in a static array.
+*/
+void
+TC_LOG_BINLOG::set_status_variables()
+{
+  ulonglong num_commits, num_group_commits;
+
+  pthread_mutex_lock(&LOCK_commit_ordered);
+  num_commits= this->num_commits;
+  num_group_commits= this->num_group_commits;
+  pthread_mutex_unlock(&LOCK_commit_ordered);
+
+  binlog_status_var_num_commits= num_commits;
+  binlog_status_var_num_group_commits= num_group_commits;
Why not set the global variables directly? (Having to carry local variables across the pthread_mutex_unlock() call has a higher overhead than setting the globals directly.) <cut>
=== modified file 'sql/sql_class.cc'
--- sql/sql_class.cc	2010-08-27 14:12:44 +0000
+++ sql/sql_class.cc	2010-10-28 09:35:46 +0000
+void
+THD::wait_for_wakeup_ready()
+{
+  pthread_mutex_lock(&LOCK_wakeup_ready);
+  while (!wakeup_ready)
+    pthread_cond_wait(&COND_wakeup_ready, &LOCK_wakeup_ready);
+  pthread_mutex_unlock(&LOCK_wakeup_ready);
+}
+
+void
+THD::signal_wakeup_ready()
+{
+  pthread_mutex_lock(&LOCK_wakeup_ready);
+  wakeup_ready= true;
+  pthread_cond_signal(&COND_wakeup_ready);
+  pthread_mutex_unlock(&LOCK_wakeup_ready);
A faster version is:
+  pthread_mutex_lock(&LOCK_wakeup_ready);
+  wakeup_ready= true;
+  pthread_mutex_unlock(&LOCK_wakeup_ready);
+  pthread_cond_signal(&COND_wakeup_ready);
You still need the lock around setting wakeup_ready, to ensure that wait_for_wakeup_ready() doesn't miss the setting of the variable.
=== modified file 'storage/xtradb/handler/ha_innodb.cc'
--- storage/xtradb/handler/ha_innodb.cc	2010-10-19 15:03:26 +0000
<cut>
@@ -11619,11 +11643,6 @@ static MYSQL_SYSVAR_ENUM(adaptive_checkp
   "Enable/Disable flushing along modified age. (none, reflex, [estimate])",
   NULL, innodb_adaptive_checkpoint_update, 2, &adaptive_checkpoint_typelib);
-static MYSQL_SYSVAR_ULONG(enable_unsafe_group_commit, srv_enable_unsafe_group_commit,
-  PLUGIN_VAR_RQCMDARG,
-  "Enable/Disable unsafe group commit when support_xa=OFF and use with binlog or other XA storage engine.",
-  NULL, NULL, 0, 0, 1, 0);
-
We probably need to keep the variable available (and make it do nothing) and mark it deprecated, just to ensure that MariaDB starts with a my.cnf file made for MySQL. <cut>

Regards, Monty
Hi Monty, Thanks for your review! I generally followed your suggestions, except for this:
+  if (entry->begin_event->write(&log_file))
+    return ER_ERROR_ON_WRITE;
+
+  DBUG_EXECUTE_IF("crash_before_writing_xid",
+                  {
+                    if ((write_cache(cache)))
+                      DBUG_PRINT("info", ("error writing binlog cache"));
+                    else
+                      flush_and_sync();
+
+                    DBUG_PRINT("info", ("crashing before writing xid"));
+                    abort();
+                  });
+
+  if (write_cache(cache))
+    return ER_ERROR_ON_WRITE;
+
+  if (entry->end_event->write(&log_file))
+    return ER_ERROR_ON_WRITE;
+
+  if (entry->incident_event && entry->incident_event->write(&log_file))
+    return ER_ERROR_ON_WRITE;
+
+  if (cache->error)                             // Error on read
+    return ER_ERROR_ON_READ;
It would be better to test this right after 'write_cache'; there is no point in writing any more if that fails....
Are you sure? Couldn't this result in a half-written event group (missing COMMIT or incident event) in cases where it wouldn't have happened before? Anyway, this is unchanged from the existing code, and I am reluctant to change it as part of the group commit patch ...

----

Otherwise I fixed everything you pointed out. Below are a few clarifying comments in case you are interested.
+  trx_data->using_xa= TRUE;
+  if (xid)
The above looks strange. How can we have xa without an xid ?
After some investigation, I found this can happen with XA COMMIT ... ONE PHASE, and I added a comment. It was actually a bug: in this case of XA COMMIT ... ONE PHASE it should use a COMMIT query event (not NULL). Fixed; this also removes the case of end_event == NULL that you commented on in other parts of the code.
     /*
+      We only bother to write to the binary log if there is anything
+      to write.
     */
+    if (my_b_tell(cache) > 0)
+    {
Can't we ignore transactions with cache == 0 on the upper level ? There is no reason I can come up with why we put these in the queue.
It can happen for example for this:

    SET sql_log_bin = 0;
    BEGIN;
    INSERT INTO innodb_table VALUES (3);
    INSERT INTO pbxt_table VALUES (4);
    COMMIT;

In this case, there is nothing to write to the binlog. However, the code is still invoked, as the binlog needs to act as a transaction coordinator for 2-phase commit between the two storage engines. (This is, btw., unchanged in the patch compared to the original code.)
The above is better written as:
    status_var_increment(thd->status_var.ha_prepare_count);
    if ((err= ht->prepare(ht, thd, all)))
    {
      my_error(ER_ERROR_DURING_COMMIT, MYF(0), err);
      goto err;
    }
I did it like this, to avoid swapping status_var_increment() and ht->prepare():

    err= ht->prepare(ht, thd, all);
    status_var_increment(thd->status_var.ha_prepare_count);
    if (err)
    {
      my_error(ER_ERROR_DURING_COMMIT, MYF(0), err);
      goto err;
    }
+# Check that the binlog position reported by InnoDB is the correct one
+# for the end of the second transaction (as can be checked with
+# mysqlbinlog).
+let $MYSQLD_DATADIR= `SELECT @@datadir`;
+--exec grep 'InnoDB: Last MySQL binlog file position' $MYSQLD_DATADIR/../../log/mysqld.1.err | tail -1
Is there another way to get the position without using grep?
Another option would be to store the startup position in some user variables (assuming this would be useful for anyone else).
Yes, I think this is a good idea. I added a worklog for it: MWL#191.
(as grep probably doesn't work on Windows)
As we already have regexps in mysqltest, it should be trivial to do our own simple version of grep there. Let's see if we can get someone to implement it.
Actually, there should be several ways to get `grep' and `tail' on Windows, including at least one made by Microsoft and several that are Free Software. I would personally prefer not to re-invent the wheel, but to rely on existing packages for this. We could have a `--source include/have_posix_tools' so we do not fail the test if the POSIX commands are not available. Anyway, in this case MWL#191 is the best solution, I think. I added to the worklog to fix the test to not use grep; until then I will just put `--source include/not_windows.inc' with a reference to MWL#191.
+# Actually the output from this currently shows a bug.
+# The injected IO error leaves partially written transactions in the binlog in
+# the form of stray "BEGIN" events.
+# These should disappear from the output if binlog error handling is improved.
Do you have a worklog number for the above? If so, please add it to the source too.
Done (it is MySQL Bug#37148 and WL#1790).
+int
+MYSQL_BIN_LOG::write_transaction(group_commit_entry *entry)
+{
+  binlog_trx_data *trx_data= entry->trx_data;
+  IO_CACHE *cache= &trx_data->trans_log;
+  /*
+    Log "BEGIN" at the beginning of every transaction. Here, a transaction is
+    either a BEGIN..COMMIT block or a single statement in autocommit mode. The
+    event was constructed in write_transaction_to_binlog(), in the thread
+    running the transaction.
+
+    Now this Query_log_event has artificial log_pos 0. It must be
+    adjusted to reflect the real position in the log. Not doing it
+    would confuse the slave: it would prevent this one from
+    knowing where he is in the master's binlog, which would result
+    in wrong positions being shown to the user, MASTER_POS_WAIT
+    undue waiting etc.
+  */
The above old comment doesn't make sense anymore in this place of the code. Please remove or move it to a better place. (I tried to find a better place, but couldn't find it)
Moved the first part to write_transaction_to_binlog(); the last part I deleted as there is a similar better comment in write_cache(). - Kristian.
participants (2):
- Kristian Nielsen
- Michael Widenius