- developers - lists.mariadb.org

Re: [PATCH] MDEV-21322: Report slave progress to the master
by Kristian Nielsen 05 Jun '24

05 Jun '24

Kristian Nielsen via commits <commits(a)lists.mariadb.org> writes: > From: Brandon Nesterenko <brandon.nesterenko(a)mariadb.com> > > This patch extends the command `SHOW REPLICA HOSTS` with three > columns: Hi Brandon, please find my review of MDEV-21322 below. > This patch extends the command `SHOW REPLICA HOSTS` with three > columns: > > 1) Gtid_State_Sent. This represents that latest GTIDs sent to the > replica in each domain. It will always be populated, regardless of > the semi-sync status (i.e. asynchronous connections will still > update this column with the latest GTID state sent to the replica). > > 2) Gtid_State_Ack. For semi-synchronous connections (only), this > column represents the last GTID in each domain that the replica has > acknowledged. Use "Pos" instead of "State", eg. Gtid_Pos_Sent and Gtid_Pos_Ack. To be consistent with eg. gtid_binlog_pos, which has last GTID per domain_id; vs. gtid_binlog_state, which has GTID per every (domain_id, server_id) combination. > @@ -2272,6 +2292,30 @@ send_event_to_slave(binlog_send_info *info, Log_event_type event_type, > return "Failed to run hook 'after_send_event'"; > } > > + if (info->thd->slave_info) > + { > + strncpy(info->thd->slave_info->gtid_state_sent.log_file, > + info->log_file_name + info->dirlen, > + strlen(info->log_file_name) - info->dirlen); > + info->thd->slave_info->gtid_state_sent.log_pos= pos; > + } Isn't there missing synchronisation here against concurrent access from another thread? I guess we need LOCK_thd_data around the strncpy(). Maybe we can avoid taking LOCK_thd_data except in the uncommon case where the gile name changes. It would be best to avoid a mutex lock/unlock for _every_ event sent to a slave. The position doesn't need locking I think, you can use an atomic std::memory_order_relaxed mostly for documentation if you like. > @@ -125,8 +114,11 @@ int THD::register_slave(uchar *packet, size_t packet_length) > if (check_access(this, PRIV_COM_REGISTER_SLAVE, any_db.str, NULL,NULL,0,0)) > return 1; > if (!(si= (Slave_info*)my_malloc(key_memory_SLAVE_INFO, sizeof(Slave_info), > - MYF(MY_WME)))) > + MYF(MY_WME|MY_ZEROFILL)))) > return 1; > + memset(si->gtid_state_sent.log_file, '\0', FN_REFLEN); > + memset(si->gtid_state_ack.log_file, '\0', FN_REFLEN); I think it's good to add the MY_ZEROFILL here, but then the two memset() are redundant and can be removed. > @@ -1235,6 +1251,14 @@ int Repl_semi_sync_master::update_sync_header(THD* thd, unsigned char *packet, > *need_sync= sync; > > l_end: > + if (is_on()) > + { > + thd->slave_info->sync_status= > + sync ? thd->slave_info->sync_status= > + Slave_info::SYNC_STATE_SEMI_SYNC_ACTIVE > + : thd->slave_info->sync_status= > + Slave_info::SYNC_STATE_SEMI_SYNC_STALE; > + } There's something wierd here, double assignment of thd->slave_info->sync_status, probably you just meant: thd->slave_info->sync_status= sync ? Slave_info::SYNC_STATE_SEMI_SYNC_ACTIVE : Slave_info::SYNC_STATE_SEMI_SYNC_STALE; More importantly, update_sync_header() is called just _before_ writing the event on the socket to the connected slave. Is that really the right place to do so? Shouldn't the status be updated only _after_ the event has been sent? Or does it not matter, because the actual TCP sending is asynchroneous anyway? In that case, should add a comment explaining why updating the status here is ok. > --- a/mysql-test/suite/rpl/r/rpl_mixed_ddl_dml.result > +++ b/mysql-test/suite/rpl/r/rpl_mixed_ddl_dml.result > @@ -11,8 +11,8 @@ n > 2002 > connection master; > show slave hosts; > -Server_id Host Port Master_id > -2 127.0.0.1 9999 1 > +Server_id Host Port Master_id Gtid_State_Sent Gtid_State_Ack Sync_Status > +2 127.0.0.1 9999 1 0-1-2 Asynchronous Just a remark that this output could be non-deterministic unless the test guarantees that the state will be stable at this point. So please check that the test will produce stable output; or if you already did so, then great! > +--echo # state, and Gtid_State_Sent/Ack should start, and only Gtid_State_Sent Typo, s/should start/should start empty/ ? > +--connection server_2 > +--source include/stop_slave.inc > +set global rpl_semi_sync_slave_enabled = 1; > +--source include/start_slave.inc > + > +--connection server_1 > +let $status_var= Rpl_semi_sync_master_clients; > +let $status_var_value= 1; > +source include/wait_for_status_var.inc; > + > +--replace_result $SLAVE_MYPORT SLAVE_PORT $SERVER_MYPORT_3 SLAVE2_PORT $DEFAULT_MASTER_PORT DEFAULT_PORT > +SHOW REPLICA HOSTS; Hm, won't this output be non-deterministic? From a quick look in the code, Rpl_semi_sync_master_clients is incremented from the start of mysql_binlog_send(), but the state is only set from update_sync_header() when the first event is sent to the slave. So won't there be a (small) window where the state has not yet switched to ACTIVE when this SHOW SLAVE HOSTS runs? So better wait for the state to become ACTIVE instead. Or for the expected value in Gtid_Pos_Sent, where that is non-empty. Also, better use SHOW SLAVE HOSTS here, for consistency with rest of test suite. > +--let $nextval= `SELECT max(a)+1 from t1` > +--eval insert into t1 values ($nextval) Is there a reason here to use separate --let and --eval? And not just: INSERT INTO t1 SELECT max(a)+1 FROM t1; > +--echo # Ensuring master gtid_binlog_pos matches Gtid_State_Ack > +--let $gtid_ack= query_get_value(show slave hosts, Gtid_State_Ack, 1) > +if (`SELECT strcmp("$master_gtid","$gtid_ack") != 0`) > +{ > + --echo # Master gtid_binlog_pos: $master_gtid > + --echo # Gtid_State_Ack: $gtid_ack > + --die Master's gtid_binlog_pos should match Gtid_State_Ack, but doesn't Again, sync_with_master_gtid.inc on the slave will not guarantee that the master saw the ack yet, even though it will be a rare race when it doesn't happen. So you'd need to wait for the expected value here with wait_condition.inc or somilar instead, I think. Probably the same in subsequent parts of the test case. > +--echo # > +--echo # 21322.5: When connecting a new slave (server_id 3) which initially has > +--echo # semi-sync disabled, SHOW SLAVE HOSTS on the master should show its > +--echo # Sync_Status as asynchronous (while server_id 2 is still semi-sync > +--echo # active). Sorry, it's a very long and complex test case, I didn't have the time today to carefully read and check every detail in it after this point. Please carefully check all the cases for possible races like described above. Replication tests are notorious for having non-deterministic failures (as I'm sure you've experienced first-hand), and it's important to try to avoid that as much as possible. A general remark on the test, the output of Gtid_Pos_Sent and Gtid_Pos_Ack are full GTID positions (multiple domain_id). But there is no test of having more than one domain in the output. I think you need to extend the test to test at least one case where there are multiple domains. And also need to test when the GTID state in the binlog has multiple server_id (though maybe the test case already does this, I didn't check myself). - Kristian.

2 3

Re: eba212af92d: MDEV-9101 Limit size of created disk temporary files and tables
by Sergei Golubchik 08 May '24

08 May '24

Hi, Monty, On Apr 18, Michael Widenius wrote: > commit eba212af92d > Author: Michael Widenius <monty(a)mariadb.org> > Date: Thu Mar 14 17:59:00 2024 +0100 > > diff --git a/sql/json_table.cc b/sql/json_table.cc > index 6713e796eb7..da5ab0c450a 100644 > --- a/sql/json_table.cc > +++ b/sql/json_table.cc > @@ -734,7 +734,8 @@ bool Create_json_table::finalize(THD *thd, TABLE *table, > > table->db_stat= HA_OPEN_KEYFILE; > if (unlikely(table->file->ha_open(table, table->s->path.str, O_RDWR, > - HA_OPEN_TMP_TABLE | HA_OPEN_INTERNAL_TABLE))) > + HA_OPEN_TMP_TABLE | HA_OPEN_INTERNAL_TABLE | > + HA_OPEN_SIZE_TRACKING))) shouldn't HA_OPEN_INTERNAL_TABLE always imply HA_OPEN_SIZE_TRACKING automatically? > DBUG_RETURN(true); > > table->set_created(); > diff --git a/storage/maria/ma_statrec.c b/storage/maria/ma_statrec.c > index d8a8b0a05d7..ff0c6c7a617 100644 > --- a/storage/maria/ma_statrec.c > +++ b/storage/maria/ma_statrec.c > @@ -45,6 +45,13 @@ my_bool _ma_write_static_record(MARIA_HA *info, const uchar *record) > my_errno=HA_ERR_RECORD_FILE_FULL; > return(2); > } > + if (info->s->tracked && are there temporary files that you don't want to track? are there non-temporary files that you want to track? > + _ma_update_tmp_file_size(&info->s->track_data, > + MY_ALIGN(info->s->state.state.data_file_length+ > + info->s->base.pack_reclength, > + MARIA_TRACK_INCREMENT_SIZE))) > + return(2); > + > if (info->opt_flag & WRITE_CACHE_USED) > { /* Cash in use */ > if (my_b_write(&info->rec_cache, record, > diff --git a/storage/maria/ma_page.c b/storage/maria/ma_page.c > index 5881456a69a..0877c966d1c 100644 > --- a/storage/maria/ma_page.c > +++ b/storage/maria/ma_page.c > @@ -417,6 +417,13 @@ my_off_t _ma_new(register MARIA_HA *info, int level, > DBUG_RETURN(HA_OFFSET_ERROR); > } > share->state.state.key_file_length+= block_size; > + if (info->s->tracked && > + _ma_update_tmp_file_size(&share->track_index, > + share->state.state.key_file_length)) shouldn't you do it before share->state.state.key_file_length is increased? Currently it seems that you're failing the operation, but the file will stay increased same pattern everywhere else > + { > + mysql_mutex_unlock(&share->intern_lock); > + DBUG_RETURN(HA_OFFSET_ERROR); > + } > /* Following is for not transactional tables */ > info->state->key_file_length= share->state.state.key_file_length; > mysql_mutex_unlock(&share->intern_lock); > diff --git a/storage/maria/ma_delete_all.c b/storage/maria/ma_delete_all.c > index f355d0da3e8..ed2087d3091 100644 > --- a/storage/maria/ma_delete_all.c > +++ b/storage/maria/ma_delete_all.c > @@ -132,9 +132,16 @@ int maria_delete_all_rows(MARIA_HA *info) > if (_ma_flush_table_files(info, MARIA_FLUSH_DATA|MARIA_FLUSH_INDEX, > FLUSH_IGNORE_CHANGED, FLUSH_IGNORE_CHANGED) || > mysql_file_chsize(info->dfile.file, 0, 0, MYF(MY_WME)) || > - mysql_file_chsize(share->kfile.file, share->base.keystart, 0, MYF(MY_WME))) > + mysql_file_chsize(share->kfile.file, share->base.keystart, 0, > + MYF(MY_WME))) should be > 0 here, shouldn't it? Both for kfile and dfile. > goto err; > > + if (info->s->tracked) > + { > + _ma_update_tmp_file_size(&info->s->track_data, 0); > + _ma_update_tmp_file_size(&info->s->track_index, share->base.keystart); but this shouldn't update the size if mysql_file_chsize returned -1 > + } > + > if (_ma_initialize_data_file(share, info->dfile.file)) > goto err; > > diff --git a/sql/uniques.cc b/sql/uniques.cc > index 36725e80a6b..44d93bb0f88 100644 > --- a/sql/uniques.cc > +++ b/sql/uniques.cc > @@ -720,7 +720,7 @@ bool Unique::merge(TABLE *table, uchar *buff, size_t buff_size, > /* Open cached file for table records if it isn't open */ > if (! my_b_inited(outfile) && > open_cached_file(outfile, mysql_tmpdir, TEMP_PREFIX, DISK_CHUNK_SIZE, > - MYF(MY_WME))) > + MYF(MY_WME | MY_TRACK | MY_TRACK_WITH_LIMIT))) as far as I can see, open_cached_file is only used for temporary files. it should apply MY_TRACK | MY_TRACK_WITH_LIMIT internally, it'll be less error-prone > return 1; > > bzero((char*) &sort_param,sizeof(sort_param)); > diff --git a/storage/maria/ma_write.c b/storage/maria/ma_write.c > index 9a6859bfd4d..7404db84067 100644 > --- a/storage/maria/ma_write.c > +++ b/storage/maria/ma_write.c > @@ -386,6 +387,11 @@ int maria_write(MARIA_HA *info, const uchar *record) > } > } > } > + else if (my_errno == HA_ERR_LOCAL_TMP_SPACE_FULL || > + my_errno == HA_ERR_GLOBAL_TMP_SPACE_FULL) > + { > + filepos= HA_OFFSET_ERROR; /* Avoid write_record_abort() */ why avoid? HA_ERR_LOCAL_TMP_SPACE_FULL is like EQUOT or ENOSPC. why should be masked as HA_OFFSET_ERROR, and not be a fatal error? > + } > else > fatal_error= 1; > > diff --git a/storage/maria/ma_close.c b/storage/maria/ma_close.c > index 7441e29a97b..47935d61b88 100644 > --- a/storage/maria/ma_close.c > +++ b/storage/maria/ma_close.c > @@ -14,7 +14,7 @@ > along with this program; if not, write to the Free Software > Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA */ > > -/* close a isam-database */ > +/* close an Aria table */ at last :) it's what 27 years old? > /* > TODO: > We need to have a separate mutex on the closed file to allow other threads > diff --git a/sql/sql_window.cc b/sql/sql_window.cc > index e0c7cef1a9e..4959c7833a1 100644 > --- a/sql/sql_window.cc > +++ b/sql/sql_window.cc > @@ -2922,15 +2922,16 @@ bool compute_window_func(THD *thd, > if (unlikely(thd->is_error() || thd->is_killed())) > break; > > - > /* Return to current row after notifying cursors for each window > function. */ > - tbl->file->ha_rnd_pos(tbl->record[0], rowid_buf); > + if (tbl->file->ha_rnd_pos(tbl->record[0], rowid_buf)) > + return true; how can it fail? > } > > /* We now have computed values for each window function. They can now > be saved in the current row. */ > - save_window_function_values(window_functions, tbl, rowid_buf); > + if (save_window_function_values(window_functions, tbl, rowid_buf)) > + return true; > > rownum++; > } > diff --git a/mysql-test/main/status.test b/mysql-test/main/status.test > index acf29f35b9e..b6be046a9ac 100644 > --- a/mysql-test/main/status.test > +++ b/mysql-test/main/status.test > @@ -244,9 +244,9 @@ let $rnd_next = `show global status like 'handler_read_rnd_next'`; > let $tmp_table = `show global status like 'Created_tmp_tables'`; > show status like 'com_show_status'; > show status like 'hand%write%'; > -show status like '%tmp%'; > +show status like '%_tmp%'; Doesn't look like a very meaningful change. Did you mean '%\_tmp%' ? > show status like 'hand%write%'; > -show status like '%tmp%'; > +show status like '%_tmp%'; > show status like 'com_show_status'; > let $rnd_next2 = `show global status like 'handler_read_rnd_next'`; > let $tmp_table2 = `show global status like 'Created_tmp_tables'`; > diff --git a/mysys/mf_cache.c b/mysys/mf_cache.c > index 2fec59f4e70..712187f6181 100644 > --- a/mysys/mf_cache.c > +++ b/mysys/mf_cache.c > @@ -43,7 +43,7 @@ my_bool open_cached_file(IO_CACHE *cache, const char* dir, const char *prefix, > cache->file_name=0; > cache->buffer=0; /* Mark that not open */ > if (!init_io_cache(cache, -1, cache_size, WRITE_CACHE, 0L, 0, > - MYF(cache_myflags | MY_NABP))) > + MYF(cache_myflags | MY_NABP | MY_TRACK))) Why not MY_TRACK_WITH_LIMIT ? > { > DBUG_RETURN(0); > } > diff --git a/sql/sql_select.cc b/sql/sql_select.cc > index 6f2797b6b39..324f7679e8f 100644 > --- a/sql/sql_select.cc > +++ b/sql/sql_select.cc > @@ -22275,7 +22275,8 @@ bool open_tmp_table(TABLE *table) > int error; > if (unlikely((error= table->file->ha_open(table, table->s->path.str, O_RDWR, > HA_OPEN_TMP_TABLE | > - HA_OPEN_INTERNAL_TABLE)))) > + HA_OPEN_INTERNAL_TABLE | > + HA_OPEN_SIZE_TRACKING)))) shouldn't HA_OPEN_TMP_TABLE or HA_OPEN_INTERNAL_TABLE imply HA_OPEN_SIZE_TRACKING? > { > table->file->print_error(error, MYF(0)); /* purecov: inspected */ > table->db_stat= 0; > diff --git a/sql/handler.cc b/sql/handler.cc > index fec473d97bb..a05b7476724 100644 > --- a/sql/handler.cc > +++ b/sql/handler.cc > @@ -443,9 +444,12 @@ int ha_init_errors(void) > /* Zerofill it to avoid uninitialized gaps. */ remove the comment, if you no longer zerofill > if (! (handler_errmsgs= (const char**) my_malloc(key_memory_handler_errmsgs, > HA_ERR_ERRORS * sizeof(char*), > - MYF(MY_WME | MY_ZEROFILL)))) > + MYF(MY_WME)))) > return 1; > > + /* Copy default handler error messages */ > + memcpy(handler_errmsgs, handler_error_messages, HA_ERR_ERRORS * sizeof(char*)); > + > /* Set the dedicated error messages. */ > SETMSG(HA_ERR_KEY_NOT_FOUND, ER_DEFAULT(ER_KEY_NOT_FOUND)); > SETMSG(HA_ERR_FOUND_DUPP_KEY, ER_DEFAULT(ER_DUP_KEY)); > diff --git a/storage/maria/maria_def.h b/storage/maria/maria_def.h > index 705562eb795..b256466f1f0 100644 > --- a/storage/maria/maria_def.h > +++ b/storage/maria/maria_def.h > @@ -778,6 +785,14 @@ typedef struct st_maria_share > myf write_flag; > enum data_file_type data_file_type; > enum pagecache_page_type page_type; /* value depending transactional */ > + > + /* > + tracked will cause lost bytes (not aligned) but this is ok as it is always > + used with tmp_file_tracking if set I don't understand your explanation why "this is ok" > + */ > + my_bool tracked; /* Tracked table (always internal) */ > + struct tmp_file_tracking track_data,track_index; > + > /** > if Checkpoint looking at table; protected by close_lock or THR_LOCK_maria > */ > diff --git a/include/my_sys.h b/include/my_sys.h > index af509c7df45..0df12963ab9 100644 > --- a/include/my_sys.h > +++ b/include/my_sys.h > @@ -178,6 +180,17 @@ uchar *my_large_malloc(size_t *size, myf my_flags); > void my_large_free(void *ptr, size_t size); > void my_large_page_truncate(size_t *size); > > +/* Tracking tmp file usage */ > + > +struct tmp_file_tracking > +{ > + ulonglong last_position; I had to read how "last_position" is used to realize that it's in fact previous file_size value. Why not to call it previous_file_size? Or recorded_file_size ? Or old_file_size? it's not a "position" > + ulonglong file_size; > +}; > + > +typedef int (*TMPFILE_SIZE_CB)(struct tmp_file_tracking *track, int no_error); > +extern TMPFILE_SIZE_CB update_tmp_file_size; > + > #ifdef _WIN32 > extern BOOL my_obtain_privilege(LPCSTR lpPrivilege); > #endif > diff --git a/sql/sql_class.h b/sql/sql_class.h > index c1efb49f6da..42eee9913ec 100644 > --- a/sql/sql_class.h > +++ b/sql/sql_class.h > @@ -1070,12 +1074,22 @@ typedef struct system_status_var > */ > > #define last_system_status_var questions > -#define last_cleared_system_status_var local_memory_used > + > +/* Parameters to set_status_var_init() */ > + > +#define STATUS_OFFSET(A) offsetof(STATUS_VAR,A) > +/* Clear as part of flush */ > +#define clear_up_to_tmp_space_used STATUS_OFFSET(tmp_space_used) > +/* Clear as part of startup */ > +#define clear_up_to_memory_used STATUS_OFFSET(local_memory_used) > +/* Full initialization. Note that global_memory_used is updated early! */ > +#define clear_up_to_global_memory_used STATUS_OFFSET(global_memory_used) > +#define last_restored_status_var clear_up_to_tmp_space_used those are bad names. and the fact that you need to rename these defines and change all code that uses them - this kind of shows they're not good. when there will be a new variable, clear_up_to_tmp_space_used can become clear_up_to_new_variable and everything will need to be changed again. better use names that don't depend on individual variables. last_restored_status_var and last_cleared_system_status_var are good. Try to split variables logically. First there are additive variables - global value is the sum of all local values. They're cleared, they're added up tp global, etc. Then there are non-additive, not cleared, not added - like local_memory_used and tmp_space_used, for example. And global_memory_used is special because it's initialized early, right? so, may be defines based on these properties would be easier to understand. because now I'm totally at loss when to use clear_up_to_tmp_space_used, when to use clear_up_to_memory_used, when clear_up_to_global_memory_used and when last_restored_status_var. > + > > /** Number of contiguous global status variables */ > -constexpr int COUNT_GLOBAL_STATUS_VARS= int(offsetof(STATUS_VAR, > - last_system_status_var) / > - sizeof(ulong)) + 1; > +constexpr int COUNT_GLOBAL_STATUS_VARS= > + int(STATUS_OFFSET(last_system_status_var) /sizeof(ulong)) + 1; > > /* > Global status variables > diff --git a/sql/filesort.cc b/sql/filesort.cc > index abcd4127a0b..df4508b635a 100644 > --- a/sql/filesort.cc > +++ b/sql/filesort.cc > @@ -1082,6 +1078,11 @@ static ha_rows find_all_keys(THD *thd, Sort_param *param, SQL_SELECT *select, > DBUG_RETURN(num_records); > > err: > + if (!quick_select) > + { > + (void) file->extra(HA_EXTRA_NO_CACHE); > + file->ha_rnd_end(); > + } why? > sort_form->column_bitmaps_set(save_read_set, save_write_set); > DBUG_RETURN(HA_POS_ERROR); > } /* find_all_keys */ > diff --git a/sql/mysqld.cc b/sql/mysqld.cc > index dc4a955aa79..4727b7b5413 100644 > --- a/sql/mysqld.cc > +++ b/sql/mysqld.cc > @@ -1947,6 +1950,10 @@ static void mysqld_exit(int exit_code) > if (exit_code == 0 || opt_endinfo) > SAFEMALLOC_REPORT_MEMORY(0); > } > + if (global_tmp_space_used) > + fprintf(stderr, "Warning: tmp_space not freed: %llu\n", > + (ulonglong) global_tmp_space_used); This is confusing. almost all (if not all) temp space is automatically freed by the OS when the process exits, so this warning will only scare the user about lost space when in fact no space is lost, it's only some space was not accounted for inside MariaDB. I suggest either make this warning debug-only or rephrase it like "Note: %llu tmp_space was lost and will be freed automatically when the process exits" > + > DBUG_LEAVE; > #ifdef _WIN32 > my_report_svc_status(SERVICE_STOPPED, exit_code, 0); > @@ -7571,6 +7636,7 @@ SHOW_VAR status_vars[]= { > {"Tc_log_page_size", (char*) &tc_log_page_size, SHOW_LONG_NOFLUSH}, > {"Tc_log_page_waits", (char*) &tc_log_page_waits, SHOW_LONG}, > #endif > + {"tmp_space_used", (char*) offsetof(STATUS_VAR, tmp_space_used), SHOW_LONGLONG_STATUS}, first letter - upper case > #ifdef HAVE_POOL_OF_THREADS > {"Threadpool_idle_threads", (char *) &show_threadpool_idle_threads, SHOW_SIMPLE_FUNC}, > {"Threadpool_threads", (char *) &show_threadpool_threads, SHOW_SIMPLE_FUNC}, > diff --git a/sql/log.cc b/sql/log.cc > index 460cefea47b..a9bc3758dd5 100644 > --- a/sql/log.cc > +++ b/sql/log.cc > @@ -2388,6 +2384,8 @@ bool Event_log::check_write_error(THD *thd) > case ER_TRANS_CACHE_FULL: > case ER_STMT_CACHE_FULL: > case ER_ERROR_ON_WRITE: > + case EE_LOCAL_TMP_SPACE_FULL: > + case EE_GLOBAL_TMP_SPACE_FULL: not a good idea to mix EE_ OS-like errors and ER_ SQL layer errors > case ER_BINLOG_LOGGING_IMPOSSIBLE: > checked= TRUE; > break; > @@ -6650,7 +6651,8 @@ Event_log::flush_and_set_pending_rows_event(THD *thd, Rows_log_event* event, > { > set_write_error(thd, is_transactional); > if (check_cache_error(thd, cache_data) && > - stmt_has_updated_non_trans_table(thd)) > + (stmt_has_updated_non_trans_table(thd) || > + !is_transactional)) why? > cache_data->set_incident(); > delete pending; > cache_data->set_pending(NULL); > @@ -7868,8 +7877,16 @@ int Event_log::write_cache(THD *thd, binlog_cache_data *cache_data) > DBUG_RETURN(res ? ER_ERROR_ON_WRITE : 0); > } > > - if (reinit_io_cache(cache, READ_CACHE, 0, 0, 0)) > + /* > + Allow flush of transaction logs to temporary go over the tmp space limit > + as we do not want the commit to fail > + */ > + cache->myflags&= ~MY_TRACK_WITH_LIMIT; > + res= reinit_io_cache(cache, READ_CACHE, 0, 0, 0); > + cache->myflags|= MY_TRACK_WITH_LIMIT; you also remove MY_TRACK_WITH_LIMIT inside reinit_io_cache(), so this is redundant. Or it's redundant inside reinit_io_cache(), either way, no point in doing it twice > + if (res) > DBUG_RETURN(ER_ERROR_ON_WRITE); > + > /* Amount of remaining bytes in the IO_CACHE read buffer. */ > size_t log_file_pos; > uchar header_buf[LOG_EVENT_HEADER_LEN]; > diff --git a/mysys/mf_iocache.c b/mysys/mf_iocache.c > index 4ee1331bdb3..6214057f9e3 100644 > --- a/mysys/mf_iocache.c > +++ b/mysys/mf_iocache.c > @@ -73,6 +74,50 @@ int (*_my_b_encr_read)(IO_CACHE *info,uchar *Buffer,size_t Count)= 0; > int (*_my_b_encr_write)(IO_CACHE *info,const uchar *Buffer,size_t Count)= 0; > > > +static inline my_bool tmp_file_track(IO_CACHE *info, ulonglong file_size) > +{ > + if ((info->myflags & (MY_TRACK | MY_TRACK_WITH_LIMIT)) && > + update_tmp_file_size) > + { > + if (info->tracking.file_size < file_size) > + { > + int error; > + info->tracking.file_size= file_size; > + if ((error= update_tmp_file_size(&info->tracking, > + !(info->myflags & > + MY_TRACK_WITH_LIMIT)))) > + { > + if (info->myflags & MY_WME) > + my_error(my_errno, MYF(0)); > + info->error= -1; > + return 1; > + } > + } > + } > + return 0; > +} > + > +my_bool io_cache_tmp_file_track(IO_CACHE *info, ulonglong file_size) > +{ > + return tmp_file_track(info, file_size); > +} > + > + > +void end_tracking_io_cache(IO_CACHE *info) what does "end tracking IO_CACHE" mean? I read it as "end tracking of this IO_CACHE, it is no longer tracked after this function call" - but it's not what it is doing. so what how do you read this function name? > +{ > + if ((info->myflags & (MY_TRACK | MY_TRACK_WITH_LIMIT)) && > + info->tracking.file_size) > + { > + info->tracking.file_size= 0; > + update_tmp_file_size(&info->tracking, 1); > + } > +} > + > +void truncate_io_cache(IO_CACHE *info) > +{ > + if (my_chsize(info->file, 0, 0, MYF(MY_WME)) <= 0) > + end_tracking_io_cache(info); not quite. If my_chsize() returns -1, you should not do info->tracking.file_size= 0, because file size wasn't actually changed. > +} > > static void > init_functions(IO_CACHE* info) Regards, Sergei Chief Architect, MariaDB Server and security(a)mariadb.org

2 1

Re: 0d6dc52f43d: MDEV-30046 Refactor write_record and fix idempotent replication
by Sergei Golubchik 04 May '24

04 May '24

Hi, Nikita, On Apr 29, Nikita Malyavin wrote: > revision-id: 0d6dc52f43d (mariadb-10.5.24-175-g0d6dc52f43d) > parent(s): 652e9ee405e > author: Nikita Malyavin > committer: Nikita Malyavin > timestamp: 2024-04-27 21:08:21 +0200 > message: > > MDEV-30046 Refactor write_record and fix idempotent replication > > Idempotent write_row works same as REPLACE: if there is a duplicating > record in the table, then it will be deleted and re-inserted, with the > same update optimization. > > The code in Rows:log_event::write_row was basically copy-pasted from > write_record. > > What's done: > REPLACE operation was unified across replication and sql. It is now > representred as a Write_record class, that holds the whole state, and allows > re-using some resources in between the row writes. > > Replace, IODKU and single insert implementations are split across different > methods, reluting in a much cleaner code. > > The entry point is preserved as a single Write_record::write_record() call. > The implementation to call is chosen on the constructor stage. > > This allowed several optimizations to be done: > 1. The table key list is not iterated for every row. We find last unique key in > the order of checking once and preserve it across the rows. See last_uniq_key(). > 2. ib_handler::referenced_by_foreign_key acquires a global lock. This call was > done per row as well. Not all the table config that allows optimized replace is > folded into a single boolean field can_optimize. All the fields to check are > even stored in a single register on a 64-bit platform. > 3. DUP_REPLACE and DUP_UPDATE cases now have one less level of indirection > 4. modified_non_trans_tables is checked and set only when it's really needed. > 5. Obsolete bitmap manipulations are removed. > > Also: > * Unify replace initialization step across implementations: > add prepare_for_replace and finalize_replace > * alloca is removed in favor of mem_root allocation. This memory is reused > across the rows. > * An rpl-related callback is added to the replace branch, meaning that an extra > check is made per row replace even for the common case. It can be avoided with > templates if considered a problem. > > diff --git a/mysql-test/main/long_unique_bugs_replication.test b/mysql-test/main/long_unique_bugs_replication.test > index 9c44d13e6a5..f2225cde101 100644 > --- a/mysql-test/main/long_unique_bugs_replication.test > +++ b/mysql-test/main/long_unique_bugs_replication.test > + > +--echo # > +--echo # End of 10.4 tests > +--echo # ... > --echo # > --echo # End of 10.4 tests > --echo # may be 10.5? > diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc > index 38e8b062f18..b5810eeaf7b 100644 > --- a/sql/log_event_server.cc > +++ b/sql/log_event_server.cc > @@ -5705,10 +5708,10 @@ int Rows_log_event::do_apply_event(rpl_group_info *rgi) > &m_cols_ai : &m_cols); > bitmap_intersect(table->write_set, after_image); > > - this->slave_exec_mode= slave_exec_mode_options; // fix the mode > - > + COPY_INFO copy_info; > + Write_record write_record; > // Do event specific preparations > - error= do_before_row_operations(rli); > + error= do_before_row_operations(rgi, &copy_info, &write_record); You create write_record on the stack and save it inside this->m_write_record 1. why not to have it permanently a part of Rows_log_event ? 2. if you'll keep it on the stack, please, reset m_write_record=0 before returning from ::do_apply_event(). And may be even do m_write_record= &write_record here, not in ::do_before_row_operations() it'll be a bit easier to see how a local variable is stored and then freed and that it doesn't actually leave the scope. > > /* > Bug#56662 Assertion failed: next_insert_id == 0, file handler.cc > @@ -7132,6 +7148,40 @@ Write_rows_log_event::do_before_row_operations(const Slave_reporting_capability > m_table->mark_auto_increment_column(true); > } > > + if (slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT && > + (m_table->file->ha_table_flags() & HA_DUPLICATE_POS || > + m_table->s->long_unique_table)) > + error= m_table->file->ha_rnd_init_with_error(0); I'd put this inside if (slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT) about ~40 lines above the place where you put it now. > + > + if (!error) > + { > + bzero(copy_info, sizeof *copy_info); > + copy_info->handle_duplicates= > + slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT ? > + DUP_REPLACE : DUP_ERROR; simply copy_info->handle_duplicates= thd->lex->duplicates; > + copy_info->table_list= m_table->pos_in_table_list; > + > + int (*callback)(void *, void*)= NULL; > + if (!get_flags(COMPLETE_ROWS_F)) > + { > + /* > + If row is incomplete we will use the record found to fill > + missing columns. > + */ > + callback= [](void *e, void* r)->int { > + auto rgi= static_cast<rpl_group_info*>(r); > + auto event= static_cast<Write_rows_log_event*>(e); > + return event->incomplete_record_callback(rgi); > + }; > + } > + new (write_record) Write_record(thd, m_table, copy_info, > + m_vers_from_plain && > + m_table->versioned(VERS_TIMESTAMP), > + m_table->triggers && do_invoke_trigger(), > + NULL, callback, this, rgi); please, use write_record->init(...) here. And in all similar places below, I won't comment on them separately. > + m_write_record= write_record; > + } > + > return error; > } > > @@ -7173,7 +7223,15 @@ Write_rows_log_event::do_after_row_operations(const Slave_reporting_capability * > { > m_table->file->print_error(local_error, MYF(0)); > } > - return error? error : local_error; > + int rnd_error= 0; > + if (m_table->file->inited) > + { > + DBUG_ASSERT(slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT); > + DBUG_ASSERT(m_table->file->ha_table_flags() & HA_DUPLICATE_POS || > + m_table->s->long_unique_table); DBUG_ASSERT(m_table->file->inited == handler::RND); > + rnd_error= m_table->file->ha_rnd_end(); > + } > + return error? error : local_error ? local_error : rnd_error; > } > > bool Rows_log_event::process_triggers(trg_event_type event, > diff --git a/sql/sql_insert.cc b/sql/sql_insert.cc > index ff71361493a..e410efda121 100644 > --- a/sql/sql_insert.cc > +++ b/sql/sql_insert.cc > @@ -677,6 +678,30 @@ Field **TABLE::field_to_fill() > return triggers && triggers->nullable_fields() ? triggers->nullable_fields() : field; > } > > +int prepare_for_replace(TABLE *table, enum_duplicates handle_duplicates, > + bool ignore) > +{ > + bool create_lookup_handler= handle_duplicates != DUP_ERROR; why do you initialize create_lookup_handler here and then immediately repeat it two lines below? :) less confusing would be bool create_lookup_handler= handle_duplicates != DUP_ERROR || ignore; if (create_lookup_handler) { > + if (ignore || handle_duplicates != DUP_ERROR) > + { > + create_lookup_handler= true; > + table->file->extra(HA_EXTRA_IGNORE_DUP_KEY); > + if (table->file->ha_table_flags() & HA_DUPLICATE_POS || > + table->s->long_unique_table) > + { > + if (table->file->ha_rnd_init_with_error(false)) > + return 1; > + } > + } > + return table->file->prepare_for_insert(create_lookup_handler); > +} > + > +int finalize_replace(TABLE *table, enum_duplicates handle_duplicates, > + bool ignore) > +{ DBUG_ASSERT(m_table->file->inited != handler::INDEX); > + return table->file->inited ? table->file->ha_rnd_end() : 0; > +} > + > > /** > INSERT statement implementation > @@ -1728,31 +1745,74 @@ int mysql_prepare_insert(THD *thd, TABLE_LIST *table_list, > } > > > /* Check if there is more uniq keys after field */ this comment isn't correct anymore, is it? on the other hand, the function is small and the name is clear, I think you can simply delete the comment, if you'd like. > +static ushort get_last_unique_key(TABLE *table, KEY *key_info) > +{ > + for (uint key= table->s->keys; key; --key) > + if (key_info[key-1].flags & HA_NOSAME) > + return key-1; > + return MAX_KEY; > +} > > +/** > + An update optimization can be done in REPLACE only when certain criterial met. > + In particular, if the "error key" was the last one to check. If not, there can > + be other keys that can fail the uniqueness check after we will delete the row. > + > + The problem arises with the order of the key checking. It is divided in two > + parts: > + 1. Higher-level handler::ha_write_row goes first. It checks for long unique > + and period overlaps (the latter is not supported by REPLACE yet). better "(when the latter will be supported by REPLACE)" this way the comment will be less wrong and misleading when we'll finally implement this support and but forget to update the comment :) > + 2. Engine usual unique index check. > + However, the order of the keys in TABLE_SHARE::key_info is: > + * first, engine unique keys > + * then long unique, which is not seen by engine > + WITHOUT OVERLAPS key has no particular order > + see sort_keys in sql_table.cc > + > + So we need to wrap this order. > + @return last unique key to check for duplication, or MAX_KEY if none. > + */ > -static int last_uniq_key(TABLE *table, const KEY *key, uint keynr) > +ushort Write_record::get_last_unique_key() const > { > + if (info->handle_duplicates != DUP_REPLACE) > + return MAX_KEY; // Not used. > + if (!table->s->key_info) > + return MAX_KEY; > /* > When an underlying storage engine informs that the unique key > conflicts are not reported in the ascending order by setting > the HA_DUPLICATE_KEY_NOT_IN_ORDER flag, we cannot rely on this > information to determine the last key conflict. > > The information about the last key conflict will be used to > do a replace of the new row on the conflicting row, rather > than doing a delete (of old row) + insert (of new row). > > Hence check for this flag and disable replacing the last row > - by returning 0 always. Returning 0 will result in doing > + by returning MAX_KEY always. Returning MAX_KEY will result in doing > a delete + insert always. > */ > if (table->file->ha_table_flags() & HA_DUPLICATE_KEY_NOT_IN_ORDER) > - return 0; > + return MAX_KEY; > > - while (++keynr < table->s->keys) > - if (key[keynr].flags & HA_NOSAME) > - return 0; > - return 1; > + /* > + Note, that TABLE_SHARE and TABLE see long uniques differently: > + - TABLE_SHARE sees as HA_KEY_ALG_LONG_HASH and HA_NOSAME > + - TABLE sees as usual non-unique indexes > + */ > + return > + table->key_info[0].flags & HA_NOSAME ? > + /* > + We have an in-engine unique, which should be checked last. > + Go through the TABLE list, which knows nothing about long uniques. > + */ > + ::get_last_unique_key(table, table->key_info) : > + /* > + We can only have long uniques. > + Go through the TABLE_SHARE list, where long uniques can be found. > + */ > + ::get_last_unique_key(table, table->s->key_info); > } > > > @@ -1786,449 +1846,511 @@ int vers_insert_history_row(TABLE *table) > } > > /* > - Write a record to table with optional deleting of conflicting records, > - invoke proper triggers if needed. > - > - SYNOPSIS > - write_record() > - thd - thread context > - table - table to which record should be written > - info - COPY_INFO structure describing handling of duplicates > - and which is used for counting number of records inserted > - and deleted. > - sink - result sink for the RETURNING clause > - > - NOTE > - Once this record will be written to table after insert trigger will > - be invoked. If instead of inserting new record we will update old one > - then both on update triggers will work instead. Similarly both on > - delete triggers will be invoked if we will delete conflicting records. > - > - Sets thd->transaction.stmt.modified_non_trans_table to TRUE if table which > - is updated didn't have transactions. > - > - RETURN VALUE > - 0 - success > - non-0 - error > - > -int write_record(THD *thd, TABLE *table, COPY_INFO *info, select_result *sink) > -{ > - int error, trg_error= 0; > - char *key=0; > - MY_BITMAP *save_read_set, *save_write_set; > - ulonglong prev_insert_id= table->file->next_insert_id; > - ulonglong insert_id_for_cur_row= 0; > - ulonglong prev_insert_id_for_cur_row= 0; > - DBUG_ENTER("write_record"); > + Retrieve the old (duplicated) row into record[1] to be able to > + either update or delete the offending record. We either: > + > + - use ha_rnd_pos() with a row-id (available as dup_ref) to the > + offending row, if that is possible (MyISAM and Blackhole, also long unique > + and application-time periods), or else > + > + - use ha_index_read_idx_map() with the key that is duplicated, to > + retrieve the offending row. > */ > +int Write_record::locate_dup_record() > +{ > + handler *h= table->file; > + int error= 0; > + if (h->ha_table_flags() & HA_DUPLICATE_POS || h->lookup_errkey != (uint)-1) > + { > + DBUG_PRINT("info", ("Locating offending record using rnd_pos()")); > + > + error= h->ha_rnd_pos(table->record[1], h->dup_ref); > + if (unlikely(error)) > + { > + DBUG_PRINT("info", ("rnd_pos() returns error %d",error)); > + h->print_error(error, MYF(0)); > + } > + } > + else > + { > + DBUG_PRINT("info", > + ("Locating offending record using ha_index_read_idx_map")); > > + if (h->lookup_handler) > + h= h->lookup_handler; Why? > + > + error= h->extra(HA_EXTRA_FLUSH_CACHE); > + if (unlikely(error)) > + { > + DBUG_PRINT("info",("Error when setting HA_EXTRA_FLUSH_CACHE")); > + return my_errno; > + } > > - info->records++; > - save_read_set= table->read_set; > - save_write_set= table->write_set; > + if (unlikely(key == NULL)) > + { > + key= static_cast<uchar*>(thd->alloc(table->s->max_unique_length)); > + if (key == NULL) > + { > + DBUG_PRINT("info",("Can't allocate key buffer")); > + return ENOMEM; > + } > + } > > - DBUG_EXECUTE_IF("rpl_write_record_small_sleep_gtid_100_200", > + key_copy(key, table->record[0], table->key_info + key_nr, 0); > + > + error= h->ha_index_read_idx_map(table->record[1], key_nr, key, > + HA_WHOLE_KEY, HA_READ_KEY_EXACT); why did you revert commit 5e345281e3599c79 here? > + if (unlikely(error)) > { > - if (thd->rgi_slave && (thd->rgi_slave->current_gtid.seq_no == 100 || > - thd->rgi_slave->current_gtid.seq_no == 200)) > - my_sleep(20000); > - }); > - if (info->handle_duplicates == DUP_REPLACE || > - info->handle_duplicates == DUP_UPDATE) > + DBUG_PRINT("info", ("index_read_idx() returns %d", error)); > + h->print_error(error, MYF(0)); > + } > + } > + > + if (unlikely(error)) > + return error; > + > + if (table->vfield) > + { > + /* > + We have not yet called update_virtual_fields() > + in handler methods for the just read row in record[1]. > + */ > + table->move_fields(table->field, table->record[1], table->record[0]); > + error= table->update_virtual_fields(table->file, VCOL_UPDATE_FOR_REPLACE); > + table->move_fields(table->field, table->record[0], table->record[1]); > + } > + return error; > +} > + > +/** > + REPLACE is defined as either INSERT or DELETE + INSERT. If > + possible, we can replace it with an UPDATE, but that will not > + work on InnoDB if FOREIGN KEY checks are necessary. > + > + Suppose that we got the duplicate key to be a key that is not > + the last unique key for the table and we perform an update: > + then there might be another key for which the unique check will > + fail, so we're better off just deleting the row and trying to insert again. > + > + Additionally we don't use ha_update_row in following cases: > + * when triggers should be invoked, as it may spoil the intermedite NEW_ROW > + representation > + * when we have referenced keys, as there can be different ON DELETE/ON UPDATE > + actions > +*/ > +int Write_record::replace_row(ha_rows *inserted, ha_rows *deleted) > +{ > + int error; > + while (unlikely(error=table->file->ha_write_row(table->record[0]))) > { > - while (unlikely(error=table->file->ha_write_row(table->record[0]))) > + error= prepare_handle_duplicate(error); > + if (unlikely(error)) > + return restore_on_error(); > + > + if (incomplete_records_cb && (error= incomplete_records_cb(arg1, arg2))) This callback is strange. I'd rather inherit Write_record_rpl from Write_record and put rpl-specific logic there. Also you could avoid a switch on INSERT/REPLACE/ODKU if you'd use inheritance instead. But the switch is less strange than the callback :) > + return restore_on_error(); > + > + if (can_optimize && key_nr == last_unique_key) > { > - uint key_nr; > - /* > - If we do more than one iteration of this loop, from the second one the > - row will have an explicit value in the autoinc field, which was set at > - the first call of handler::update_auto_increment(). So we must save > - the autogenerated value to avoid thd->insert_id_for_cur_row to become > - 0. > - */ > - if (table->file->insert_id_for_cur_row > 0) > - insert_id_for_cur_row= table->file->insert_id_for_cur_row; > - else > - table->file->insert_id_for_cur_row= insert_id_for_cur_row; > - bool is_duplicate_key_error; > - if (table->file->is_fatal_error(error, HA_CHECK_ALL)) > - goto err; > - is_duplicate_key_error= > - table->file->is_fatal_error(error, HA_CHECK_ALL & ~HA_CHECK_DUP); > - if (!is_duplicate_key_error) > - { > - /* > - We come here when we had an ignorable error which is not a duplicate > - key error. In this we ignore error if ignore flag is set, otherwise > - report error as usual. We will not do any duplicate key processing. > - */ > - if (info->ignore) > - { > - table->file->print_error(error, MYF(ME_WARNING)); > - goto after_trg_or_ignored_err; /* Ignoring a not fatal error */ > - } > - goto err; > - } > - if (unlikely((int) (key_nr = table->file->get_dup_key(error)) < 0)) > - { > - error= HA_ERR_FOUND_DUPP_KEY; /* Database can't find key */ > - goto err; > - } > - DEBUG_SYNC(thd, "write_row_replace"); > + DBUG_PRINT("info", ("Updating row using ha_update_row()")); > + if (table->versioned(VERS_TRX_ID)) > + table->vers_start_field()->store(0, false); > > - /* Read all columns for the row we are going to replace */ > - table->use_all_columns(); > - /* > - Don't allow REPLACE to replace a row when a auto_increment column > - was used. This ensures that we don't get a problem when the > - whole range of the key has been used. > - */ > - if (info->handle_duplicates == DUP_REPLACE && table->next_number_field && > - key_nr == table->s->next_number_index && insert_id_for_cur_row > 0) > - goto err; > - if (table->file->has_dup_ref()) > + error= table->file->ha_update_row(table->record[1], table->record[0]); > + > + if (likely(!error)) > + ++*deleted; > + else if (error != HA_ERR_RECORD_IS_THE_SAME) > + return on_ha_error(error); > + > + if (versioned) > { > - DBUG_ASSERT(table->file->inited == handler::RND); > - if (table->file->ha_rnd_pos(table->record[1],table->file->dup_ref)) > - goto err; > + store_record(table, record[2]); > + error= vers_insert_history_row(table); > + restore_record(table, record[2]); > + if (unlikely(error)) > + return on_ha_error(error); > } > + > + break; > + } > + else > + { > + DBUG_PRINT("info", ("Deleting offending row and trying to write" > + " new one again")); > + > + auto *trg = table->triggers; > + if (use_triggers && trg->process_triggers(table->in_use, TRG_EVENT_DELETE, > + TRG_ACTION_BEFORE, TRUE)) > + return restore_on_error(); > + if (!versioned) > + error = table->file->ha_delete_row(table->record[1]); > else > { > - if (table->file->extra(HA_EXTRA_FLUSH_CACHE)) /* Not needed with NISAM */ > - { > - error=my_errno; > - goto err; > - } > - > - if (!key) > - { > - if (!(key=(char*) my_safe_alloca(table->s->max_unique_length))) > - { > - error=ENOMEM; > - goto err; > - } > - } > - key_copy((uchar*) key,table->record[0],table->key_info+key_nr,0); > - key_part_map keypart_map= (1 << table->key_info[key_nr].user_defined_key_parts) - 1; > - if ((error= (table->file->ha_index_read_idx_map(table->record[1], > - key_nr, (uchar*) key, > - keypart_map, > - HA_READ_KEY_EXACT)))) > - goto err; > + store_record(table, record[2]); > + restore_record(table, record[1]); > + table->vers_update_end(); > + error = table->file->ha_update_row(table->record[1], table->record[0]); > + restore_record(table, record[2]); > } > - if (table->vfield) > + > + if (unlikely(error)) > + return on_ha_error(error); > + ++*deleted; > + > + if (use_triggers) > { > - /* > - We have not yet called update_virtual_fields(VOL_UPDATE_FOR_READ) > - in handler methods for the just read row in record[1]. > - */ > - table->move_fields(table->field, table->record[1], table->record[0]); > - int verr = table->update_virtual_fields(table->file, VCOL_UPDATE_FOR_REPLACE); > - table->move_fields(table->field, table->record[0], table->record[1]); > - if (verr) > - goto err; > + error= trg->process_triggers(table->in_use, TRG_EVENT_DELETE, > + TRG_ACTION_AFTER, TRUE); > + if (unlikely(error)) > + return restore_on_error(); > } > - if (info->handle_duplicates == DUP_UPDATE) > - { > - int res= 0; > - /* > - We don't check for other UNIQUE keys - the first row > - that matches, is updated. If update causes a conflict again, > - an error is returned > - */ > - DBUG_ASSERT(table->insert_values != NULL); > - store_record(table,insert_values); > - restore_record(table,record[1]); > - table->reset_default_fields(); > + } > + } > > - /* > - in INSERT ... ON DUPLICATE KEY UPDATE the set of modified fields can > - change per row. Thus, we have to do reset_default_fields() per row. > - Twice (before insert and before update). > - */ > - DBUG_ASSERT(info->update_fields->elements == > - info->update_values->elements); > - if (fill_record_n_invoke_before_triggers(thd, table, > - *info->update_fields, > - *info->update_values, > - info->ignore, > - TRG_EVENT_UPDATE)) > - goto before_trg_err; > - > - bool different_records= (!records_are_comparable(table) || > - compare_record(table)); > - /* > - Default fields must be updated before checking view updateability. > - This branch of INSERT is executed only when a UNIQUE key was violated > - with the ON DUPLICATE KEY UPDATE option. In this case the INSERT > - operation is transformed to an UPDATE, and the default fields must > - be updated as if this is an UPDATE. > - */ > - if (different_records && table->default_field) > - table->evaluate_update_default_function(); > - > - /* CHECK OPTION for VIEW ... ON DUPLICATE KEY UPDATE ... */ > - res= info->table_list->view_check_option(table->in_use, info->ignore); > - if (res == VIEW_CHECK_SKIP) > - goto after_trg_or_ignored_err; > - if (res == VIEW_CHECK_ERROR) > - goto before_trg_err; > - > - table->file->restore_auto_increment(prev_insert_id); > - info->touched++; > - if (different_records) > - { > - if (unlikely(error=table->file->ha_update_row(table->record[1], > - table->record[0])) && > - error != HA_ERR_RECORD_IS_THE_SAME) > - { > - if (info->ignore && > - !table->file->is_fatal_error(error, HA_CHECK_ALL)) > - { > - if (!(thd->variables.old_behavior & > - OLD_MODE_NO_DUP_KEY_WARNINGS_WITH_IGNORE)) > - table->file->print_error(error, MYF(ME_WARNING)); > - goto after_trg_or_ignored_err; > - } > - goto err; > - } > + /* > + If more than one iteration of the above loop is done, from > + the second one the row being inserted will have an explicit > + value in the autoinc field, which was set at the first call of > + handler::update_auto_increment(). This value is saved to avoid > + thd->insert_id_for_cur_row becoming 0. Use this saved autoinc > + value. > + */ > + if (table->file->insert_id_for_cur_row == 0) > + table->file->insert_id_for_cur_row = insert_id_for_cur_row; > > - if (error != HA_ERR_RECORD_IS_THE_SAME) > - { > - info->updated++; > - if (table->versioned() && > - table->vers_check_update(*info->update_fields)) > - { > - if (table->versioned(VERS_TIMESTAMP)) > - { > - store_record(table, record[2]); > - if ((error= vers_insert_history_row(table))) > - { > - info->last_errno= error; > - table->file->print_error(error, MYF(0)); > - trg_error= 1; > - restore_record(table, record[2]); > - goto after_trg_or_ignored_err; > - } > - restore_record(table, record[2]); > - } > - info->copied++; > - } > - } > - else > - error= 0; > - /* > - If ON DUP KEY UPDATE updates a row instead of inserting > - one, it's like a regular UPDATE statement: it should not > - affect the value of a next SELECT LAST_INSERT_ID() or > - mysql_insert_id(). Except if LAST_INSERT_ID(#) was in the > - INSERT query, which is handled separately by > - THD::arg_of_last_insert_id_function. > - */ > - prev_insert_id_for_cur_row= table->file->insert_id_for_cur_row; > - insert_id_for_cur_row= table->file->insert_id_for_cur_row= 0; > - trg_error= (table->triggers && > - table->triggers->process_triggers(thd, TRG_EVENT_UPDATE, > - TRG_ACTION_AFTER, TRUE)); > - info->copied++; > - } > + return after_insert(inserted); > +} > > - /* > - Only update next_insert_id if the AUTO_INCREMENT value was explicitly > - updated, so we don't update next_insert_id with the value from the > - row being updated. Otherwise reset next_insert_id to what it was > - before the duplicate key error, since that value is unused. > - */ > - if (table->next_number_field_updated) > - { > - DBUG_ASSERT(table->next_number_field != NULL); > > - table->file->adjust_next_insert_id_after_explicit_value(table->next_number_field->val_int()); > - } > - else if (prev_insert_id_for_cur_row) > - { > - table->file->restore_auto_increment(prev_insert_id_for_cur_row); > - } > - goto ok; > +int Write_record::insert_on_duplicate_update(ha_rows *inserted, > + ha_rows *updated) > +{ > + int res= table->file->ha_write_row(table->record[0]); > + if (likely(!res)) > + return after_insert(inserted); > + > + res=prepare_handle_duplicate(res); > + if (unlikely(res)) > + return restore_on_error(); > + > + ulonglong prev_insert_id_for_cur_row= 0; > + /* > + We don't check for other UNIQUE keys - the first row > + that matches, is updated. If update causes a conflict again, > + an error is returned > + */ > + DBUG_ASSERT(table->insert_values != NULL); > + store_record(table,insert_values); > + restore_record(table,record[1]); > + table->reset_default_fields(); > + > + /* > + in INSERT ... ON DUPLICATE KEY UPDATE the set of modified fields can > + change per row. Thus, we have to do reset_default_fields() per row. > + Twice (before insert and before update). > + */ > + DBUG_ASSERT(info->update_fields->elements == > + info->update_values->elements); > + if (fill_record_n_invoke_before_triggers(thd, table, > + *info->update_fields, > + *info->update_values, > + info->ignore, > + TRG_EVENT_UPDATE)) > + return restore_on_error(); > + > + bool different_records= (!records_are_comparable(table) || > + compare_record(table)); > + /* > + Default fields must be updated before checking view updateability. > + This branch of INSERT is executed only when a UNIQUE key was violated > + with the ON DUPLICATE KEY UPDATE option. In this case the INSERT > + operation is transformed to an UPDATE, and the default fields must > + be updated as if this is an UPDATE. > + */ > + if (different_records && table->default_field) > + table->evaluate_update_default_function(); > + > + /* CHECK OPTION for VIEW ... ON DUPLICATE KEY UPDATE ... */ > + res= info->table_list->view_check_option(table->in_use, info->ignore); > + if (unlikely(res)) > + { > + if (res == VIEW_CHECK_ERROR) > + return restore_on_error(); > + DBUG_ASSERT(res == VIEW_CHECK_SKIP); > + return 0; > + } > + > + table->file->restore_auto_increment(prev_insert_id); > + info->touched++; > + if (different_records) > + { > + int error= table->file->ha_update_row(table->record[1], table->record[0]); > + > + if (unlikely(error) && error != HA_ERR_RECORD_IS_THE_SAME) > + { > + if (info->ignore && !table->file->is_fatal_error(error, HA_CHECK_ALL)) why not is_fatal_error(error) ? > + { > + if (!(thd->variables.old_behavior & > + OLD_MODE_NO_DUP_KEY_WARNINGS_WITH_IGNORE)) > + table->file->print_error(error, MYF(ME_WARNING)); > + return 0; > } > - else /* DUP_REPLACE */ > + return on_ha_error(error); > + } > + > + if (error != HA_ERR_RECORD_IS_THE_SAME) > + { > + ++*updated; > + if (table->versioned() && > + table->vers_check_update(*info->update_fields)) > { > - /* > - The manual defines the REPLACE semantics that it is either > - an INSERT or DELETE(s) + INSERT; FOREIGN KEY checks in > - InnoDB do not function in the defined way if we allow MySQL > - to convert the latter operation internally to an UPDATE. > - We also should not perform this conversion if we have > - timestamp field with ON UPDATE which is different from DEFAULT. > - Another case when conversion should not be performed is when > - we have ON DELETE trigger on table so user may notice that > - we cheat here. Note that it is ok to do such conversion for > - tables which have ON UPDATE but have no ON DELETE triggers, > - we just should not expose this fact to users by invoking > - ON UPDATE triggers. > - > - Note, TABLE_SHARE and TABLE see long uniques differently: > - - TABLE_SHARE sees as HA_KEY_ALG_LONG_HASH and HA_NOSAME > - - TABLE sees as usual non-unique indexes > - */ > - bool is_long_unique= table->s->key_info && > - table->s->key_info[key_nr].algorithm == > - HA_KEY_ALG_LONG_HASH; > - if ((is_long_unique ? > - /* > - We have a long unique. Test that there are no in-engine > - uniques and the current long unique is the last long unique. > - */ > - !(table->key_info[0].flags & HA_NOSAME) && > - last_uniq_key(table, table->s->key_info, key_nr) : > - /* > - We have a normal key - not a long unique. > - Test is the current normal key is unique and > - it is the last normal unique. > - */ > - last_uniq_key(table, table->key_info, key_nr)) && > - !table->file->referenced_by_foreign_key() && > - (!table->triggers || !table->triggers->has_delete_triggers())) > - { > - if (table->versioned(VERS_TRX_ID)) > - { > - bitmap_set_bit(table->write_set, table->vers_start_field()->field_index); > - table->file->column_bitmaps_signal(); > - table->vers_start_field()->store(0, false); > - } > - if (unlikely(error= table->file->ha_update_row(table->record[1], > - table->record[0])) && > - error != HA_ERR_RECORD_IS_THE_SAME) > - goto err; > - if (likely(!error)) > - { > - info->deleted++; > - if (!table->file->has_transactions()) > - thd->transaction->stmt.modified_non_trans_table= TRUE; > - if (table->versioned(VERS_TIMESTAMP)) > - { > - store_record(table, record[2]); > - error= vers_insert_history_row(table); > - restore_record(table, record[2]); > - if (unlikely(error)) > - goto err; > - } > - } > - else > - error= 0; // error was HA_ERR_RECORD_IS_THE_SAME > - /* > - Since we pretend that we have done insert we should call > - its after triggers. > - */ > - goto after_trg_n_copied_inc; > - } > - else > + if (versioned) > { > - if (table->triggers && > - table->triggers->process_triggers(thd, TRG_EVENT_DELETE, > - TRG_ACTION_BEFORE, TRUE)) > - goto before_trg_err; > - > - if (!table->versioned(VERS_TIMESTAMP)) > - error= table->file->ha_delete_row(table->record[1]); > - else > - { > - store_record(table, record[2]); > - restore_record(table, record[1]); > - table->vers_update_end(); > - error= table->file->ha_update_row(table->record[1], > - table->record[0]); > - restore_record(table, record[2]); > - } > + store_record(table, record[2]); > + error= vers_insert_history_row(table); > + restore_record(table, record[2]); > if (unlikely(error)) > - goto err; > - if (!table->versioned(VERS_TIMESTAMP)) > - info->deleted++; > - else > - info->updated++; > - if (!table->file->has_transactions_and_rollback()) > - thd->transaction->stmt.modified_non_trans_table= TRUE; > - if (table->triggers && > - table->triggers->process_triggers(thd, TRG_EVENT_DELETE, > - TRG_ACTION_AFTER, TRUE)) > - { > - trg_error= 1; > - goto after_trg_or_ignored_err; > - } > - /* Let us attempt do write_row() once more */ > + return on_ha_error(error); > } > + ++*inserted; > } > } > - > - /* > - If more than one iteration of the above while loop is done, from > - the second one the row being inserted will have an explicit > - value in the autoinc field, which was set at the first call of > - handler::update_auto_increment(). This value is saved to avoid > - thd->insert_id_for_cur_row becoming 0. Use this saved autoinc > - value. > - */ > - if (table->file->insert_id_for_cur_row == 0) > - table->file->insert_id_for_cur_row= insert_id_for_cur_row; > - > /* > - Restore column maps if they where replaced during an duplicate key > - problem. > + If ON DUP KEY UPDATE updates a row instead of inserting > + one, it's like a regular UPDATE statement: it should not > + affect the value of a next SELECT LAST_INSERT_ID() or > + mysql_insert_id(). Except if LAST_INSERT_ID(#) was in the > + INSERT query, which is handled separately by > + THD::arg_of_last_insert_id_function. > */ > - if (table->read_set != save_read_set || > - table->write_set != save_write_set) > - table->column_bitmaps_set(save_read_set, save_write_set); > + prev_insert_id_for_cur_row= table->file->insert_id_for_cur_row; > + insert_id_for_cur_row= table->file->insert_id_for_cur_row= 0; > + > + ++*inserted; // Conforms the older behavior; > + > + if (use_triggers > + && table->triggers->process_triggers(thd, TRG_EVENT_UPDATE, > + TRG_ACTION_AFTER, TRUE)) > + return restore_on_error(); > } > - else if (unlikely((error=table->file->ha_write_row(table->record[0])))) > + > + /* > + Only update next_insert_id if the AUTO_INCREMENT value was explicitly > + updated, so we don't update next_insert_id with the value from the > + row being updated. Otherwise reset next_insert_id to what it was > + before the duplicate key error, since that value is unused. > + */ > + if (table->next_number_field_updated) > + { > + DBUG_ASSERT(table->next_number_field != NULL); > + > + table->file->adjust_next_insert_id_after_explicit_value( > + table->next_number_field->val_int()); > + } > + else if (prev_insert_id_for_cur_row) > + { > + table->file->restore_auto_increment(prev_insert_id_for_cur_row); > + } > + > + return send_data(); > +} > + > +bool Write_record::is_fatal_error(int error) > +{ > + return error == HA_ERR_LOCK_DEADLOCK || > + error == HA_ERR_LOCK_WAIT_TIMEOUT || These errors were only in rpl code, not anywhere in sql_insert.cc. I don't think it's wrong to check for them everywhere consistently, but does it change the behavior? Can you create a test case for INSERT before your changes that comes to thd->is_fatal_error() with HA_ERR_LOCK_DEADLOCK or HA_ERR_LOCK_WAIT_TIMEOUT? What will be the behavior before and after your change? > + table->file->is_fatal_error(error, HA_CHECK_ALL); > +} > + > +int Write_record::single_insert(ha_rows *inserted) > +{ > + DBUG_EXECUTE_IF("rpl_write_record_small_sleep_gtid_100_200", > + { > + uint64 seq= thd->rgi_slave ? thd->rgi_slave->current_gtid.seq_no : 0; > + if (seq == 100 || seq == 200) > + my_sleep(20000); > + }); > + > + int error= table->file->ha_write_row(table->record[0]); > + if (unlikely(error)) > { > DEBUG_SYNC(thd, "write_row_noreplace"); > - if (!info->ignore || > - table->file->is_fatal_error(error, HA_CHECK_ALL)) > - goto err; > + if (!info->ignore || is_fatal_error(error)) > + return on_ha_error(error); > if (!(thd->variables.old_behavior & > OLD_MODE_NO_DUP_KEY_WARNINGS_WITH_IGNORE)) > table->file->print_error(error, MYF(ME_WARNING)); > table->file->restore_auto_increment(prev_insert_id); > - goto after_trg_or_ignored_err; > + return 0; > } > + return after_insert(inserted); > +} > > -after_trg_n_copied_inc: > - info->copied++; > - thd->record_first_successful_insert_id_in_cur_stmt(table->file->insert_id_for_cur_row); > - trg_error= (table->triggers && > - table->triggers->process_triggers(thd, TRG_EVENT_INSERT, > - TRG_ACTION_AFTER, TRUE)); > +/** > + Write a record to table with optional deleting of conflicting records, > + invoke proper triggers if needed. > + > + @note > + Once this record will be written to table after insert trigger will > + be invoked. If instead of inserting new record we will update old one > + then both on update triggers will work instead. Similarly both on > + delete triggers will be invoked if we will delete conflicting records. > > -ok: > + Sets thd->transaction.stmt.modified_non_trans_table to TRUE if table which > + is updated didn't have transactions. > + > + @return > + 0 - success > + non-0 - error > +*/ > +int Write_record::write_record() > +{ > + ha_rows inserted= 0, updated= 0; > + prev_insert_id= table->file->next_insert_id; > + info->records++; > + ignored_error= false; > + int res= 0; > + switch(info->handle_duplicates) > + { > + case DUP_ERROR: > + res= single_insert(&inserted); > + break; > + case DUP_UPDATE: > + res= insert_on_duplicate_update(&inserted, &updated); > + info->updated+= updated; > + break; > + case DUP_REPLACE: > + res= replace_row(&inserted, &updated); > + if (versioned) > + info->updated+= updated; > + else > + info->deleted+= updated; > + } > + > + info->copied+= inserted; > + if (inserted || updated) > + notify_non_trans_table_modified(); > + return res; > +} > + > + > +int Write_record::prepare_handle_duplicate(int error) > +{ > /* > - We send the row after writing it to the table so that the > - correct values are sent to the client. Otherwise it won't show > - autoinc values (generated inside the handler::ha_write()) and > - values updated in ON DUPLICATE KEY UPDATE. > + If we do more than one iteration of the replace loop, from the second one the > + row will have an explicit value in the autoinc field, which was set at > + the first call of handler::update_auto_increment(). So we must save > + the autogenerated value to avoid thd->insert_id_for_cur_row to become > + 0. > */ > - if (sink && sink->send_data(thd->lex->returning()->item_list) < 0) > - trg_error= 1; > + if (table->file->insert_id_for_cur_row > 0) > + insert_id_for_cur_row= table->file->insert_id_for_cur_row; > + else > + table->file->insert_id_for_cur_row= insert_id_for_cur_row; > + > + if (is_fatal_error(error)) > + return on_ha_error(error); > > -after_trg_or_ignored_err: > - if (key) > - my_safe_afree(key,table->s->max_unique_length); > + bool is_duplicate_key_error= > + table->file->is_fatal_error(error, HA_CHECK_ALL & ~HA_CHECK_DUP); > + if (!is_duplicate_key_error) > + { > + /* > + We come here when we had an ignorable error which is not a duplicate > + key error. In this we ignore error if ignore flag is set, otherwise > + report error as usual. We will not do any duplicate key processing. > + */ > + if (info->ignore) > + { > + table->file->print_error(error, MYF(ME_WARNING)); > + ignored_error= true; > + return 1; /* Ignoring a not fatal error */ > + } > + return on_ha_error(error); > + } > + > + key_nr = table->file->get_dup_key(error); > + > + if (unlikely(key_nr == std::numeric_limits<decltype(key_nr)>::max())) Well, the handler returns (uint)-1, please compare with it, instead of inventing numerous other ways of archieving the same. And why did you decide to use ushort for the key number? We pretty much never use ushort, key number is uint everywhere. I agree that ushort is enough size-wise, but let's rather be consistent. > + return on_ha_error(HA_ERR_FOUND_DUPP_KEY); /* Database can't find key */ > + > + DEBUG_SYNC(thd, "write_row_replace"); > + > + /* Read all columns for the row we are going to replace */ > + table->use_all_columns(); > + /* > + Don't allow REPLACE to replace a row when an auto_increment column > + was used. This ensures that we don't get a problem when the > + whole range of the key has been used. > + */ > + if (info->handle_duplicates == DUP_REPLACE && table->next_number_field && > + key_nr == table->s->next_number_index && insert_id_for_cur_row > 0) > + return on_ha_error(error); > + > + error= locate_dup_record(); > + if (error) > + return on_ha_error(error); > + > + return 0; > +} > + > +inline > +void Write_record::notify_non_trans_table_modified() > +{ > if (!table->file->has_transactions_and_rollback()) > thd->transaction->stmt.modified_non_trans_table= TRUE; ok, as you like. I'm personally not a great fan of one-liners that are used only once. > - DBUG_RETURN(trg_error); > +} > > -err: > +inline > +int Write_record::on_ha_error(int error) > +{ > info->last_errno= error; > table->file->print_error(error,MYF(0)); > - > -before_trg_err: > + DBUG_PRINT("info", ("Returning error %d", error)); > + return restore_on_error(); > +} > + > +inline > +int Write_record::restore_on_error() > +{ > table->file->restore_auto_increment(prev_insert_id); > - if (key) > - my_safe_afree(key, table->s->max_unique_length); > - table->column_bitmaps_set(save_read_set, save_write_set); > - DBUG_RETURN(1); > + return ignored_error ? 0 : 1; > +} > + > +inline > +int Write_record::after_insert(ha_rows *inserted) > +{ > + ++*inserted; > + thd->record_first_successful_insert_id_in_cur_stmt(table->file->insert_id_for_cur_row); > + return after_ins_trg() || send_data(); > +} > + > +inline > +int Write_record::after_ins_trg() > +{ > + int res= 0; > + if (use_triggers) > + res= table->triggers->process_triggers(thd, TRG_EVENT_INSERT, > + TRG_ACTION_AFTER, TRUE); > + return res; > +} > + > +inline > +int Write_record::send_data() > +{ > + /* > + We send the row after writing it to the table so that the > + correct values are sent to the client. Otherwise it won't show > + autoinc values (generated inside the handler::ha_write()) and > + values updated in ON DUPLICATE KEY UPDATE. > + */ > + return sink ? sink->send_data(thd->lex->returning()->item_list) < 0 : 0; > } > > > + > /****************************************************************************** > Check that there aren't any null_fields > ******************************************************************************/ > @@ -4816,18 +4934,8 @@ select_create::prepare(List<Item> &_values, SELECT_LEX_UNIT *u) > > restore_record(table,s->default_values); // Get empty record > thd->cuted_fields=0; > - bool create_lookup_handler= info.handle_duplicates != DUP_ERROR; > - if (info.ignore || info.handle_duplicates != DUP_ERROR) > - { > - create_lookup_handler= true; > - table->file->extra(HA_EXTRA_IGNORE_DUP_KEY); > - if (table->file->ha_table_flags() & HA_DUPLICATE_POS) > - { > - if (table->file->ha_rnd_init_with_error(0)) > - DBUG_RETURN(1); > - } > - } > - table->file->prepare_for_insert(create_lookup_handler); > + if (prepare_for_replace(table, info.handle_duplicates, info.ignore)) ok. strange that select_create::prepare doesn't call select_insert::prepare > + DBUG_RETURN(1); > if (info.handle_duplicates == DUP_REPLACE && > (!table->triggers || !table->triggers->has_delete_triggers())) > table->file->extra(HA_EXTRA_WRITE_CAN_REPLACE); > diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc > index 5db53b6f2b8..7ce4850f534 100644 > --- a/sql/sql_parse.cc > +++ b/sql/sql_parse.cc > @@ -4814,7 +4815,8 @@ mysql_execute_command(THD *thd) > &lex->value_list, > lex->duplicates, > lex->ignore, > - result))) > + result, > + &write))) Why do you need Write_record here on the stack? Wouldn't it be more logical to have it in select_insert ? > { > if (lex->analyze_stmt) > ((select_result_interceptor*)sel_result)->disable_my_ok_calls(); Regards, Sergei Chief Architect, MariaDB Server and security(a)mariadb.org

2 1

Re: 652e9ee405e: MDEV-30046 wrong row targeted with "insert ... on duplicate" and "replace"
by Sergei Golubchik 29 Apr '24

29 Apr '24

Hi, Nikita, Good fix. Ok to push, thanks On Apr 29, Nikita Malyavin wrote: > revision-id: 652e9ee405e (mariadb-10.5.24-174-g652e9ee405e) > parent(s): 3d417476256 > author: Nikita Malyavin > committer: Nikita Malyavin > timestamp: 2024-04-27 21:08:21 +0200 > message: > > MDEV-30046 wrong row targeted with "insert ... on duplicate" and "replace" > > When HA_DUPLICATE_POS is not supported, the row to replace was navigated by > ha_index_read_idx_map, which uses only hash to navigate. > > Suchwise, given a hash collision it may choose an incorrect row. > > handler::position would be correct and very convenient to use here. > The code is updated to set dup_ref by a handler independently of engine > capabilities, when extra lookup is made (for long unique or something else, > for example WITHOUT OVERLAPS) > Regards, Sergei Chief Architect, MariaDB Server and security(a)mariadb.org

1 1

SOUNDEX sources
by lucyfrost＠tutanota.com 26 Apr '24

26 Apr '24

I'm developing a project which uses soundex matching (alongside other options more sophisticated than that). This is custom software with a mariad back end. MariaDB soundex built-in produces codes of apparently arbitary length; the usual is 4 characters. I've come to the conclusion that the best way forward is to mimic Maria's soundex in my code. I cannot find a description of the algorithm Maria uses though: the only source code I can find is in mf_soundex.c, and that is explicitly limited to 4 characters. But I have installed triggers which generate codes such as A4152636125262 Where is the soundex function found? Anyone know?

3 4

3 PRs pending second approver
by Otto Kekäläinen 25 Apr '24

25 Apr '24

Hi all with permission to review / add approval, This is ready to be merged into 11.5 and already has approval from Andrew, just needs one more approval: https://github.com/MariaDB/server/pull/2671 This has also one approval from Andrew, pending a second to be merged: https://github.com/MariaDB/server/pull/2169 Likewise, approved by Andrew, pending second: https://github.com/MariaDB/server/pull/2516 Additionally, this has been open for since January without any reviews: https://github.com/MariaDB/server/pull/2985 This was updated a month ago based on feedback, but hasn't gotten a re-review since: https://github.com/MariaDB/server/pull/3114 Pending re-review: https://github.com/MariaDB/server/pull/2556 - Otto

3 2

Re: [MariaDB commits] [PATCH] MDEV-33602: Sporadic test failure in rpl.rpl_gtid_stop_start
by Kristian Nielsen 25 Apr '24

25 Apr '24

Andrei Elkin <andrei.elkin(a)mariadb.com> writes: >>> --- a/sql/rpl_rli.cc >>> +++ b/sql/rpl_rli.cc >>> @@ -1030,7 +1030,7 @@ void Relay_log_info::inc_group_relay_log_pos(ulonglong log_pos, >>> rgi->last_master_timestamp > last_master_timestamp) >>> last_master_timestamp= rgi->last_master_timestamp; >>> } >>> - else >>> + else if (!mi->using_gtid || seen_gtid_log_list_event) >>> { >>> /* Non-parallel case. */ >>> group_relay_log_pos= event_relay_log_pos; >> >> Agree, this looks fine. >> The dump thread should always be sending the gtid_list event. And yes, the >> log_pos of the events received until then are not useful to set for the >> slave. >> >> Right, so this seems a good approach for a fix. >> >> Maybe I should push the workaround to 10.5 to help the effort to remove >> sporadic failures, and then file a separate bug about the underlying problem >> and your proposed solution? > > Sounds reasonable. Thank you! Filed as https://jira.mariadb.org/browse/MDEV-33996 - Kristian.

1 0

Re: [MariaDB commits] [PATCH] MDEV-33602: Sporadic test failure in rpl.rpl_gtid_stop_start
by Kristian Nielsen 24 Apr '24

24 Apr '24

Andrei Elkin <andrei.elkin(a)mariadb.com> writes: >> 1. I have not so far been able to reproduce the failure, so this patch is >> assumed to fix the sporadic failure, but I have not been able to verify that >> it does. > > I verified your analysis to confirm. Thanks for checking! >> But on the other hand this will not solve the fundamental issue, that until the >> GTID mode connect is completed, we will in the general case not have a valid >> corresponding old-style position. So maybe the current behaviour is >> ok? > Sure but it looks fixes are not overly difficult. > > As it's Gtid_list_log_event::log_pos that makes the file:pos (old-style coordinates) > state be valid why won't be keep the coordinates intact until the event > has been arrived/processed? Say that situation is remembered in `RLI::seen_gtid_log_list_event`. > Then for instance in the serial case the fixes would look like > > --- a/sql/rpl_rli.cc > +++ b/sql/rpl_rli.cc > @@ -1030,7 +1030,7 @@ void Relay_log_info::inc_group_relay_log_pos(ulonglong log_pos, > rgi->last_master_timestamp > last_master_timestamp) > last_master_timestamp= rgi->last_master_timestamp; > } > - else > + else if (!mi->using_gtid || seen_gtid_log_list_event) > { > /* Non-parallel case. */ > group_relay_log_pos= event_relay_log_pos; Agree, this looks fine. The dump thread should always be sending the gtid_list event. And yes, the log_pos of the events received until then are not useful to set for the slave. Right, so this seems a good approach for a fix. Maybe I should push the workaround to 10.5 to help the effort to remove sporadic failures, and then file a separate bug about the underlying problem and your proposed solution? Thanks, - Kristian.

1 0

Re: [MariaDB commits] [PATCH] MDEV-33602: Sporadic test failure in rpl.rpl_gtid_stop_start
by Kristian Nielsen 23 Apr '24

23 Apr '24

Hi Andrei, Brandon, What do you think of this patch to fix sporadic failure of rpl.rpl_gtid_stop_start in buildbot, MDEV-33602? The test fails relatively frequent, so would be good to get it fixed. Two issues: 1. I have not so far been able to reproduce the failure, so this patch is assumed to fix the sporadic failure, but I have not been able to verify that it does. 2. The question is if this will be just hiding a real bug? When the GTID-mode slave connects, it receives a few events from the start of the binlog (Gtid_list, Binlog_checkpoint, etc), and these briefly update the old-style position to a wrong value until the fake Gtid_list is received with the correct position. One might argue that this should be avoided. But on the other hand this will not solve the fundamental issue, that until the GTID mode connect is completed, we will in the general case not have a valid corresponding old-style position. So maybe the current behaviour is ok? - Kristian. Kristian Nielsen via commits <commits(a)lists.mariadb.org> writes: > The test could fail with a duplicate key error because switching to non-GTID > mode could start at the wrong old-style position. The position could be > wrong when the previous GTID connect was stopped before receiving the fake > GTID list event which gives the old-style position corresponding to the GTID > connected position. > > Work-around by injecting an extra event and syncing the slave before > switching to non-GTID mode. > > Signed-off-by: Kristian Nielsen <knielsen(a)knielsen-hq.org> > --- > .../suite/rpl/r/rpl_gtid_stop_start.result | 4 ++++ > mysql-test/suite/rpl/t/rpl_gtid_stop_start.test | 16 ++++++++++++++++ > 2 files changed, 20 insertions(+) > > diff --git a/mysql-test/suite/rpl/r/rpl_gtid_stop_start.result b/mysql-test/suite/rpl/r/rpl_gtid_stop_start.result > index ae0050c353a..e8633cd45bb 100644 > --- a/mysql-test/suite/rpl/r/rpl_gtid_stop_start.result > +++ b/mysql-test/suite/rpl/r/rpl_gtid_stop_start.result > @@ -111,6 +111,10 @@ a > 6 > 7 > *** MDEV-4486: Allow to start old-style replication even if mysql.gtid_slave_pos is unavailable > +connection server_1; > +INSERT INTO t1 VALUES (8); > +DELETE FROM t1 WHERE a=8; > +connection server_2; > connection server_2; > include/stop_slave.inc > CHANGE MASTER TO master_use_gtid= no; > diff --git a/mysql-test/suite/rpl/t/rpl_gtid_stop_start.test b/mysql-test/suite/rpl/t/rpl_gtid_stop_start.test > index b5ff294908b..032ebb77d1e 100644 > --- a/mysql-test/suite/rpl/t/rpl_gtid_stop_start.test > +++ b/mysql-test/suite/rpl/t/rpl_gtid_stop_start.test > @@ -173,6 +173,22 @@ SELECT * FROM t1 ORDER BY a; > > --echo *** MDEV-4486: Allow to start old-style replication even if mysql.gtid_slave_pos is unavailable > > +# In GTID mode, the old-style replication position is also updated. But during > +# GTID connect, the old-style position is not known until receiving the fake > +# GTID list event, which contains the required position value. If we happened > +# to stop the slave above before this fake GTID list event, the test could fail > +# with duplicate key errors due to switching to non-GTID mode at a wrong > +# position too far back in the binlog. > +# > +# Work-around this by injecting an extra dummt event and syncing the slave to > +# it, ensuring the old-style position will be updated. > +--connection server_1 > +INSERT INTO t1 VALUES (8); > +DELETE FROM t1 WHERE a=8; > +--save_master_pos > +--connection server_2 > +--sync_with_master > + > --connection server_2 > --source include/stop_slave.inc > CHANGE MASTER TO master_use_gtid= no;

1 0

MDEV-27512, --slave-skip-errors, and optimistic parallel replication
by Kristian Nielsen 18 Apr '24

18 Apr '24

Hi Brandon, I only now got to check on MDEV-27512. This bug seems to be about using --slave_skip_errors=ALL with optimistic parallel replication. I saw your patch for the bug. While the patch by itself is perhaps ok, I think it's only fixing a minor thing, and not addressing the real problem, let me explain my thinking: Now, --slave-skip-errors is obviously a dangerous option, but presumably the use case is to make the slave continue at all costs in case an unexpected error occurs, rather than stop the slave and break replication. But optimistic parallel replication by design can introduce many different transient errors due to conflicts, that are then handled by rolling back and retrying the offending transactions. These errors are *expected*, and they do *not* cause slave to stop or replication to break. However, it appears that --slave_skip_errors also affects such transient errors due to optimistic parallel replication, and will cause any such transaction to be silently ignored! This must be very wrong, it will cause massive replication divergence. It seems to me this is the real bug here. When an error is encountered during optimistic apply of a transaction (is_parallel_retry_error() returns true, eg. rgi->speculation == SPECULATE_OPTIMISTIC, then this error should _not_ be subject to --slave-skip-errors. The transaction should be rolled back as normal, and wait_for_prior_commit() done. Then after setting rgi->speculation = SPECULATE_WAIT and retrying, if we still get the error, --slave-skip-errors can apply. I put together a quick test that seems to show this behaviour, included below. This tests replicates correctly without replication stopping with error. But running it with --mysqld=--slave-skip-errors=all, it replicates incorrectly, skipping lots of transactions. The test is somewhat contrieved, but I think it shows the real problem, that --slave-skip-errors can randomly cause transactions to be skipped or not depending on if optimistic parallel replication triggers a matching transient error or not. So in summary, it looks like there is a real problem here, that optimistic parallel replication is not working correctly with --slave-skip-errors, transient errors incorrectly causes conflicts to skip transactions rather than retrying them. This will cause replication to diverge even when no real errors occur. And the patch for MDEV-27512 doesn't seem to address this issue. So I think MDEV-27512 should be re-opened, what do you think? - Kristian. ----------------------------------------------------------------------- --source include/have_innodb.inc --source include/have_binlog_format_row.inc --source include/master-slave.inc --connection master ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB; CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB; INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6); --sync_slave_with_master --source include/stop_slave.inc CHANGE MASTER TO master_use_gtid=slave_pos; SET @old_timeout= @@GLOBAL.innodb_lock_wait_timeout; SET GLOBAL innodb_lock_wait_timeout= 5; SET @old_parallel= @@GLOBAL.slave_parallel_threads; SET @old_mode= @@GLOBAL.slave_parallel_mode; SET GLOBAL slave_parallel_mode= aggressive; SET GLOBAL slave_parallel_threads= 20; --connection master UPDATE t1 SET b=b+1 WHERE a=6; --disable_query_log let $i= 0; while ($i < 40) { eval UPDATE t1 SET b=b+1 WHERE a=2; inc $i; } --enable_query_log SELECT * FROM t1 ORDER BY a; --save_master_pos --connection slave1 # Block first worker, and recursively pause all following workers that get # temporary errors before they can retry. BEGIN; SELECT * FROM t1 WHERE a=6 FOR UPDATE; --connection slave # Cause initial row not found error. SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=7 WHERE a=2; --source include/start_slave.inc --sleep 2 # Now following workers should be waiting for prior commit before retrying. # Remove the row not found error. SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=2 WHERE a=7; --connection slave1 ROLLBACK; --connection slave --sync_with_master SELECT * FROM t1 ORDER BY a; # Cleanup --connection slave --source include/stop_slave.inc SET GLOBAL innodb_lock_wait_timeout= @old_timeout; SET GLOBAL slave_parallel_threads= @old_parallel; SET GLOBAL slave_parallel_mode= @old_mode; --source include/start_slave.inc --connection default DROP TABLE t1; --source include/rpl_end.inc

2 2