[MariaDB developers] Re: 0d6dc52f43d: MDEV-30046 Refactor write_record and fix idempotent replication

1 May 2024

Hi, Nikita,

On Apr 29, Nikita Malyavin wrote:
...
revision-id: 0d6dc52f43d (mariadb-10.5.24-175-g0d6dc52f43d)
parent(s): 652e9ee405e
author: Nikita Malyavin
committer: Nikita Malyavin
timestamp: 2024-04-27 21:08:21 +0200
message:
MDEV-30046 Refactor write_record and fix idempotent replication
Idempotent write_row works same as REPLACE: if there is a duplicating
record in the table, then it will be deleted and re-inserted, with the
same update optimization.
The code in Rows:log_event::write_row was basically copy-pasted from
write_record.
What's done:
REPLACE operation was unified across replication and sql. It is now
representred as a Write_record class, that holds the whole state, and allows
re-using some resources in between the row writes.
Replace, IODKU and single insert implementations are split across different
methods, reluting in a much cleaner code.
The entry point is preserved as a single Write_record::write_record() call.
The implementation to call is chosen on the constructor stage.
This allowed several optimizations to be done:
1. The table key list is not iterated for every row. We find last unique key in
the order of checking once and preserve it across the rows. See last_uniq_key().
2. ib_handler::referenced_by_foreign_key acquires a global lock. This call was
done per row as well. Not all the table config that allows optimized replace is
folded into a single boolean field can_optimize. All the fields to check are
even stored in a single register on a 64-bit platform.
3. DUP_REPLACE and DUP_UPDATE cases now have one less level of indirection
4. modified_non_trans_tables is checked and set only when it's really needed.
5. Obsolete bitmap manipulations are removed.
Also:
* Unify replace initialization step across implementations:
  add prepare_for_replace and finalize_replace
* alloca is removed in favor of mem_root allocation. This memory is reused
  across the rows.
* An rpl-related callback is added to the replace branch, meaning that an extra
check is made per row replace even for the common case. It can be avoided with
templates if considered a problem.

diff --git a/mysql-test/main/long_unique_bugs_replication.test b/mysql-test/main/long_unique_bugs_replication.test
index 9c44d13e6a5..f2225cde101 100644
--- a/mysql-test/main/long_unique_bugs_replication.test
+++ b/mysql-test/main/long_unique_bugs_replication.test
+
+--echo #
+--echo # End of 10.4 tests
+--echo #
...
 --echo #
 --echo # End of 10.4 tests
 --echo #
may be 10.5?
...
diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
index 38e8b062f18..b5810eeaf7b 100644
--- a/sql/log_event_server.cc
+++ b/sql/log_event_server.cc
@@ -5705,10 +5708,10 @@ int Rows_log_event::do_apply_event(rpl_group_info *rgi)
                              &m_cols_ai : &m_cols);
     bitmap_intersect(table->write_set, after_image);
-    this->slave_exec_mode= slave_exec_mode_options; // fix the mode
-
+    COPY_INFO copy_info;
+    Write_record write_record;
     // Do event specific preparations 
-    error= do_before_row_operations(rli);
+    error= do_before_row_operations(rgi, ©_info, &write_record);
You create write_record on the stack and save it inside this->m_write_record
1. why not to have it permanently a part of Rows_log_event ?
2. if you'll keep it on the stack, please, reset m_write_record=0
before returning from ::do_apply_event(). And may be even do
m_write_record= &write_record here, not in ::do_before_row_operations()
it'll be a bit easier to see how a local variable is stored and then freed
and that it doesn't actually leave the scope.
...
/*
       Bug#56662 Assertion failed: next_insert_id == 0, file handler.cc
@@ -7132,6 +7148,40 @@ Write_rows_log_event::do_before_row_operations(const Slave_reporting_capability
     m_table->mark_auto_increment_column(true);
   }
+  if (slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT &&
+      (m_table->file->ha_table_flags() & HA_DUPLICATE_POS ||
+       m_table->s->long_unique_table))
+    error= m_table->file->ha_rnd_init_with_error(0);
I'd put this inside if (slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT)
about ~40 lines above the place where you put it now.
...
+
+  if (!error)
+  {
+    bzero(copy_info, sizeof *copy_info);
+    copy_info->handle_duplicates=
+            slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT ?
+            DUP_REPLACE : DUP_ERROR;
simply copy_info->handle_duplicates= thd->lex->duplicates;
...
+    copy_info->table_list= m_table->pos_in_table_list;
+
+    int (*callback)(void *, void*)= NULL;
+    if (!get_flags(COMPLETE_ROWS_F))
+    {
+      /*
+        If row is incomplete we will use the record found to fill
+        missing columns.
+      */
+      callback= [](void *e, void* r)->int {
+        auto rgi= static_cast<rpl_group_info*>(r);
+        auto event= static_cast<Write_rows_log_event*>(e);
+        return event->incomplete_record_callback(rgi);
+      };
+    }
+    new (write_record) Write_record(thd, m_table, copy_info,
+                                    m_vers_from_plain &&
+                                    m_table->versioned(VERS_TIMESTAMP),
+                                    m_table->triggers && do_invoke_trigger(),
+                                    NULL, callback, this, rgi);
please, use write_record->init(...) here. And in all similar places below,
I won't comment on them separately.
...
+    m_write_record= write_record;
+  }
+
   return error;
 }
@@ -7173,7 +7223,15 @@ Write_rows_log_event::do_after_row_operations(const Slave_reporting_capability *
   {
     m_table->file->print_error(local_error, MYF(0));
   }
-  return error? error : local_error;
+  int rnd_error= 0;
+  if (m_table->file->inited)
+  {
+    DBUG_ASSERT(slave_exec_mode == SLAVE_EXEC_MODE_IDEMPOTENT);
+    DBUG_ASSERT(m_table->file->ha_table_flags() & HA_DUPLICATE_POS ||
+                m_table->s->long_unique_table);
DBUG_ASSERT(m_table->file->inited == handler::RND);
...
+    rnd_error= m_table->file->ha_rnd_end();
+  }
+  return error? error : local_error ? local_error : rnd_error;
 }
bool Rows_log_event::process_triggers(trg_event_type event,
diff --git a/sql/sql_insert.cc b/sql/sql_insert.cc
index ff71361493a..e410efda121 100644
--- a/sql/sql_insert.cc
+++ b/sql/sql_insert.cc
@@ -677,6 +678,30 @@ Field **TABLE::field_to_fill()
   return triggers && triggers->nullable_fields() ? triggers->nullable_fields() : field;
 }
+int prepare_for_replace(TABLE *table, enum_duplicates handle_duplicates,
+                        bool ignore)
+{
+  bool create_lookup_handler= handle_duplicates != DUP_ERROR;
why do you initialize create_lookup_handler here
and then immediately repeat it two lines below? :)

less confusing would be

  bool create_lookup_handler= handle_duplicates != DUP_ERROR || ignore;
  if (create_lookup_handler)
  {
...
+  if (ignore || handle_duplicates != DUP_ERROR)
+  {
+    create_lookup_handler= true;
+    table->file->extra(HA_EXTRA_IGNORE_DUP_KEY);
+    if (table->file->ha_table_flags() & HA_DUPLICATE_POS ||
+        table->s->long_unique_table)
+    {
+      if (table->file->ha_rnd_init_with_error(false))
+        return 1;
+    }
+  }
+  return table->file->prepare_for_insert(create_lookup_handler);
+}
+
+int finalize_replace(TABLE *table, enum_duplicates handle_duplicates,
+                      bool ignore)
+{
DBUG_ASSERT(m_table->file->inited != handler::INDEX);
...
+  return table->file->inited ? table->file->ha_rnd_end() : 0;
+}
+
/**
   INSERT statement implementation
@@ -1728,31 +1745,74 @@ int mysql_prepare_insert(THD *thd, TABLE_LIST *table_list,
 }
/* Check if there is more uniq keys after field */
this comment isn't correct anymore, is it?
on the other hand, the function is small and the name is clear,
I think you can simply delete the comment, if you'd like.
...
+static ushort get_last_unique_key(TABLE *table, KEY *key_info)
+{
+  for (uint key= table->s->keys; key; --key)
+    if (key_info[key-1].flags & HA_NOSAME)
+      return key-1;
+  return MAX_KEY;
+}
+/**
+  An update optimization can be done in REPLACE only when certain criterial met.
+  In particular, if the "error key" was the last one to check. If not, there can
+  be other keys that can fail the uniqueness check after we will delete the row.
+
+  The problem arises with the order of the key checking. It is divided in two
+  parts:
+  1. Higher-level handler::ha_write_row goes first. It checks for long unique
+  and period overlaps (the latter is not supported by REPLACE yet).
better "(when the latter will be supported by REPLACE)"

this way the comment will be less wrong and misleading when we'll finally
implement this support and but forget to update the comment :)
...
+  2. Engine usual unique index check.
+  However, the order of the keys in TABLE_SHARE::key_info is:
+  * first, engine unique keys
+  * then long unique, which is not seen by engine
+  WITHOUT OVERLAPS key has no particular order
+  see sort_keys in sql_table.cc
+
+  So we need to wrap this order.
+  @return last unique key to check for duplication, or MAX_KEY if none.
+ */
-static int last_uniq_key(TABLE *table, const KEY *key, uint keynr)
+ushort Write_record::get_last_unique_key() const
 {
+  if (info->handle_duplicates != DUP_REPLACE)
+    return MAX_KEY; // Not used.
+  if (!table->s->key_info)
+    return MAX_KEY;
   /*
     When an underlying storage engine informs that the unique key
     conflicts are not reported in the ascending order by setting
     the HA_DUPLICATE_KEY_NOT_IN_ORDER flag, we cannot rely on this
     information to determine the last key conflict.
The information about the last key conflict will be used to
     do a replace of the new row on the conflicting row, rather
     than doing a delete (of old row) + insert (of new row).
Hence check for this flag and disable replacing the last row
-    by returning 0 always. Returning 0 will result in doing
+    by returning MAX_KEY always. Returning MAX_KEY will result in doing
     a delete + insert always.
   */
   if (table->file->ha_table_flags() & HA_DUPLICATE_KEY_NOT_IN_ORDER)
-    return 0;
+    return MAX_KEY;
-  while (++keynr < table->s->keys)
-    if (key[keynr].flags & HA_NOSAME)
-      return 0;
-  return 1;
+  /*
+    Note, that TABLE_SHARE and TABLE see long uniques differently:
+    - TABLE_SHARE sees as HA_KEY_ALG_LONG_HASH and HA_NOSAME
+    - TABLE sees as usual non-unique indexes
+  */
+  return
+       table->key_info[0].flags & HA_NOSAME ?
+       /*
+         We have an in-engine unique, which should be checked last.
+         Go through the TABLE list, which knows nothing about long uniques.
+       */
+       ::get_last_unique_key(table, table->key_info) :
+       /*
+         We can only have long uniques.
+         Go through the TABLE_SHARE list, where long uniques can be found.
+       */
+       ::get_last_unique_key(table, table->s->key_info);
 }
@@ -1786,449 +1846,511 @@ int vers_insert_history_row(TABLE *table)
 }
/*
-  Write a record to table with optional deleting of conflicting records,
-  invoke proper triggers if needed.
-
-  SYNOPSIS
-     write_record()
-      thd   - thread context
-      table - table to which record should be written
-      info  - COPY_INFO structure describing handling of duplicates
-              and which is used for counting number of records inserted
-              and deleted.
-      sink  - result sink for the RETURNING clause
-
-  NOTE
-    Once this record will be written to table after insert trigger will
-    be invoked. If instead of inserting new record we will update old one
-    then both on update triggers will work instead. Similarly both on
-    delete triggers will be invoked if we will delete conflicting records.
-
-    Sets thd->transaction.stmt.modified_non_trans_table to TRUE if table which
-    is updated didn't have transactions.
-
-  RETURN VALUE
-    0     - success
-    non-0 - error
-
-int write_record(THD *thd, TABLE *table, COPY_INFO *info, select_result *sink)
-{
-  int error, trg_error= 0;
-  char *key=0;
-  MY_BITMAP *save_read_set, *save_write_set;
-  ulonglong prev_insert_id= table->file->next_insert_id;
-  ulonglong insert_id_for_cur_row= 0;
-  ulonglong prev_insert_id_for_cur_row= 0;
-  DBUG_ENTER("write_record");
+  Retrieve the old (duplicated) row into record[1] to be able to
+  either update or delete the offending record.  We either:
+
+  - use ha_rnd_pos() with a row-id (available as dup_ref) to the
+    offending row, if that is possible (MyISAM and Blackhole, also long unique
+    and application-time periods), or else
+
+  - use ha_index_read_idx_map() with the key that is duplicated, to
+    retrieve the offending row.
  */
+int Write_record::locate_dup_record()
+{
+  handler *h= table->file;
+  int error= 0;
+  if (h->ha_table_flags() & HA_DUPLICATE_POS || h->lookup_errkey != (uint)-1)
+  {
+    DBUG_PRINT("info", ("Locating offending record using rnd_pos()"));
+
+    error= h->ha_rnd_pos(table->record[1], h->dup_ref);
+    if (unlikely(error))
+    {
+      DBUG_PRINT("info", ("rnd_pos() returns error %d",error));
+      h->print_error(error, MYF(0));
+    }
+  }
+  else
+  {
+    DBUG_PRINT("info",
+               ("Locating offending record using ha_index_read_idx_map"));
+    if (h->lookup_handler)
+      h= h->lookup_handler;
Why?
...
+
+    error= h->extra(HA_EXTRA_FLUSH_CACHE);
+    if (unlikely(error))
+    {
+      DBUG_PRINT("info",("Error when setting HA_EXTRA_FLUSH_CACHE"));
+      return my_errno;
+    }
-  info->records++;
-  save_read_set= table->read_set;
-  save_write_set= table->write_set;
+    if (unlikely(key == NULL))
+    {
+      key= static_cast<uchar*>(thd->alloc(table->s->max_unique_length));
+      if (key == NULL)
+      {
+        DBUG_PRINT("info",("Can't allocate key buffer"));
+        return ENOMEM;
+      }
+    }
-  DBUG_EXECUTE_IF("rpl_write_record_small_sleep_gtid_100_200",
+    key_copy(key, table->record[0], table->key_info + key_nr, 0);
+
+    error= h->ha_index_read_idx_map(table->record[1], key_nr, key,
+                                    HA_WHOLE_KEY, HA_READ_KEY_EXACT);
why did you revert commit 5e345281e3599c79 here?
...
+    if (unlikely(error))
     {
-      if (thd->rgi_slave && (thd->rgi_slave->current_gtid.seq_no == 100 ||
-                             thd->rgi_slave->current_gtid.seq_no == 200))
-        my_sleep(20000);
-    });
-  if (info->handle_duplicates == DUP_REPLACE ||
-      info->handle_duplicates == DUP_UPDATE)
+      DBUG_PRINT("info", ("index_read_idx() returns %d", error));
+      h->print_error(error, MYF(0));
+    }
+  }
+
+  if (unlikely(error))
+    return error;
+
+  if (table->vfield)
+  {
+    /*
+      We have not yet called update_virtual_fields()
+      in handler methods for the just read row in record[1].
+    */
+    table->move_fields(table->field, table->record[1], table->record[0]);
+    error= table->update_virtual_fields(table->file, VCOL_UPDATE_FOR_REPLACE);
+    table->move_fields(table->field, table->record[0], table->record[1]);
+  }
+  return error;
+}
+
+/**
+   REPLACE is defined as either INSERT or DELETE + INSERT.  If
+   possible, we can replace it with an UPDATE, but that will not
+   work on InnoDB if FOREIGN KEY checks are necessary.
+
+   Suppose that we got the duplicate key to be a key that is not
+   the last unique key for the table and we perform an update:
+   then there might be another key for which the unique check will
+   fail, so we're better off just deleting the row and trying to insert again.
+
+   Additionally we don't use ha_update_row in following cases:
+   * when triggers should be invoked, as it may spoil the intermedite NEW_ROW
+     representation
+   * when we have referenced keys, as there can be different ON DELETE/ON UPDATE
+     actions
+*/
+int Write_record::replace_row(ha_rows *inserted, ha_rows *deleted)
+{
+  int error;
+  while (unlikely(error=table->file->ha_write_row(table->record[0])))
   {
-    while (unlikely(error=table->file->ha_write_row(table->record[0])))
+    error= prepare_handle_duplicate(error);
+    if (unlikely(error))
+      return restore_on_error();
+
+    if (incomplete_records_cb && (error= incomplete_records_cb(arg1, arg2)))
This callback is strange. I'd rather inherit Write_record_rpl
from Write_record and put rpl-specific logic there.

Also you could avoid a switch on INSERT/REPLACE/ODKU if you'd use
inheritance instead. But the switch is less strange than the callback :)
...
+      return restore_on_error();
+
+    if (can_optimize && key_nr == last_unique_key)
     {
-      uint key_nr;
-      /*
-        If we do more than one iteration of this loop, from the second one the
-        row will have an explicit value in the autoinc field, which was set at
-        the first call of handler::update_auto_increment(). So we must save
-        the autogenerated value to avoid thd->insert_id_for_cur_row to become
-        0.
-      */
-      if (table->file->insert_id_for_cur_row > 0)
-        insert_id_for_cur_row= table->file->insert_id_for_cur_row;
-      else
-        table->file->insert_id_for_cur_row= insert_id_for_cur_row;
-      bool is_duplicate_key_error;
-      if (table->file->is_fatal_error(error, HA_CHECK_ALL))
-        goto err;
-      is_duplicate_key_error=
-        table->file->is_fatal_error(error, HA_CHECK_ALL & ~HA_CHECK_DUP);
-      if (!is_duplicate_key_error)
-      {
-        /*
-          We come here when we had an ignorable error which is not a duplicate
-          key error. In this we ignore error if ignore flag is set, otherwise
-          report error as usual. We will not do any duplicate key processing.
-        */
-        if (info->ignore)
-        {
-          table->file->print_error(error, MYF(ME_WARNING));
-          goto after_trg_or_ignored_err; /* Ignoring a not fatal error */
-        }
-        goto err;
-      }
-      if (unlikely((int) (key_nr = table->file->get_dup_key(error)) < 0))
-      {
-	error= HA_ERR_FOUND_DUPP_KEY;         /* Database can't find key */
-	goto err;
-      }
-      DEBUG_SYNC(thd, "write_row_replace");
+      DBUG_PRINT("info", ("Updating row using ha_update_row()"));
+      if (table->versioned(VERS_TRX_ID))
+        table->vers_start_field()->store(0, false);
-      /* Read all columns for the row we are going to replace */
-      table->use_all_columns();
-      /*
-	Don't allow REPLACE to replace a row when a auto_increment column
-	was used.  This ensures that we don't get a problem when the
-	whole range of the key has been used.
-      */
-      if (info->handle_duplicates == DUP_REPLACE && table->next_number_field &&
-          key_nr == table->s->next_number_index && insert_id_for_cur_row > 0)
-	goto err;
-      if (table->file->has_dup_ref())
+      error= table->file->ha_update_row(table->record[1], table->record[0]);
+
+      if (likely(!error))
+        ++*deleted;
+      else if (error != HA_ERR_RECORD_IS_THE_SAME)
+        return on_ha_error(error);
+
+      if (versioned)
       {
-        DBUG_ASSERT(table->file->inited == handler::RND);
-	if (table->file->ha_rnd_pos(table->record[1],table->file->dup_ref))
-	  goto err;
+        store_record(table, record[2]);
+        error= vers_insert_history_row(table);
+        restore_record(table, record[2]);
+        if (unlikely(error))
+          return on_ha_error(error);
       }
+
+      break;
+    }
+    else
+    {
+      DBUG_PRINT("info", ("Deleting offending row and trying to write"
+                          " new one again"));
+
+      auto *trg = table->triggers;
+      if (use_triggers && trg->process_triggers(table->in_use, TRG_EVENT_DELETE,
+                                                TRG_ACTION_BEFORE, TRUE))
+        return restore_on_error();
+      if (!versioned)
+        error = table->file->ha_delete_row(table->record[1]);
       else
       {
-	if (table->file->extra(HA_EXTRA_FLUSH_CACHE)) /* Not needed with NISAM */
-	{
-	  error=my_errno;
-	  goto err;
-	}
-
-	if (!key)
-	{
-	  if (!(key=(char*) my_safe_alloca(table->s->max_unique_length)))
-	  {
-	    error=ENOMEM;
-	    goto err;
-	  }
-	}
-	key_copy((uchar*) key,table->record[0],table->key_info+key_nr,0);
-        key_part_map keypart_map= (1 << table->key_info[key_nr].user_defined_key_parts) - 1;
-	if ((error= (table->file->ha_index_read_idx_map(table->record[1],
-                                                        key_nr, (uchar*) key,
-                                                        keypart_map,
-                                                        HA_READ_KEY_EXACT))))
-	  goto err;
+        store_record(table, record[2]);
+        restore_record(table, record[1]);
+        table->vers_update_end();
+        error = table->file->ha_update_row(table->record[1], table->record[0]);
+        restore_record(table, record[2]);
       }
-      if (table->vfield)
+
+      if (unlikely(error))
+        return on_ha_error(error);
+      ++*deleted;
+
+      if (use_triggers)
       {
-        /*
-          We have not yet called update_virtual_fields(VOL_UPDATE_FOR_READ)
-          in handler methods for the just read row in record[1].
-        */
-        table->move_fields(table->field, table->record[1], table->record[0]);
-        int verr = table->update_virtual_fields(table->file, VCOL_UPDATE_FOR_REPLACE);
-        table->move_fields(table->field, table->record[0], table->record[1]);
-        if (verr)
-          goto err;
+        error= trg->process_triggers(table->in_use, TRG_EVENT_DELETE,
+                                     TRG_ACTION_AFTER, TRUE);
+        if (unlikely(error))
+          return restore_on_error();
       }
-      if (info->handle_duplicates == DUP_UPDATE)
-      {
-        int res= 0;
-        /*
-          We don't check for other UNIQUE keys - the first row
-          that matches, is updated. If update causes a conflict again,
-          an error is returned
-        */
-	DBUG_ASSERT(table->insert_values != NULL);
-        store_record(table,insert_values);
-        restore_record(table,record[1]);
-        table->reset_default_fields();
+    }
+  }
-        /*
-          in INSERT ... ON DUPLICATE KEY UPDATE the set of modified fields can
-          change per row. Thus, we have to do reset_default_fields() per row.
-          Twice (before insert and before update).
-        */
-        DBUG_ASSERT(info->update_fields->elements ==
-                    info->update_values->elements);
-        if (fill_record_n_invoke_before_triggers(thd, table,
-                                                 *info->update_fields,
-                                                 *info->update_values,
-                                                 info->ignore,
-                                                 TRG_EVENT_UPDATE))
-          goto before_trg_err;
-
-        bool different_records= (!records_are_comparable(table) ||
-                                 compare_record(table));
-        /*
-          Default fields must be updated before checking view updateability.
-          This branch of INSERT is executed only when a UNIQUE key was violated
-          with the ON DUPLICATE KEY UPDATE option. In this case the INSERT
-          operation is transformed to an UPDATE, and the default fields must
-          be updated as if this is an UPDATE.
-        */
-        if (different_records && table->default_field)
-          table->evaluate_update_default_function();
-
-        /* CHECK OPTION for VIEW ... ON DUPLICATE KEY UPDATE ... */
-        res= info->table_list->view_check_option(table->in_use, info->ignore);
-        if (res == VIEW_CHECK_SKIP)
-          goto after_trg_or_ignored_err;
-        if (res == VIEW_CHECK_ERROR)
-          goto before_trg_err;
-
-        table->file->restore_auto_increment(prev_insert_id);
-        info->touched++;
-        if (different_records)
-        {
-          if (unlikely(error=table->file->ha_update_row(table->record[1],
-                                                        table->record[0])) &&
-              error != HA_ERR_RECORD_IS_THE_SAME)
-          {
-            if (info->ignore &&
-                !table->file->is_fatal_error(error, HA_CHECK_ALL))
-            {
-              if (!(thd->variables.old_behavior &
-                    OLD_MODE_NO_DUP_KEY_WARNINGS_WITH_IGNORE))
-                table->file->print_error(error, MYF(ME_WARNING));
-              goto after_trg_or_ignored_err;
-            }
-            goto err;
-          }
+  /*
+    If more than one iteration of the above loop is done, from
+    the second one the row being inserted will have an explicit
+    value in the autoinc field, which was set at the first call of
+    handler::update_auto_increment(). This value is saved to avoid
+    thd->insert_id_for_cur_row becoming 0. Use this saved autoinc
+    value.
+  */
+  if (table->file->insert_id_for_cur_row == 0)
+    table->file->insert_id_for_cur_row = insert_id_for_cur_row;
-          if (error != HA_ERR_RECORD_IS_THE_SAME)
-          {
-            info->updated++;
-            if (table->versioned() &&
-                table->vers_check_update(*info->update_fields))
-            {
-              if (table->versioned(VERS_TIMESTAMP))
-              {
-                store_record(table, record[2]);
-                if ((error= vers_insert_history_row(table)))
-                {
-                  info->last_errno= error;
-                  table->file->print_error(error, MYF(0));
-                  trg_error= 1;
-                  restore_record(table, record[2]);
-                  goto after_trg_or_ignored_err;
-                }
-                restore_record(table, record[2]);
-              }
-              info->copied++;
-            }
-          }
-          else
-            error= 0;
-          /*
-            If ON DUP KEY UPDATE updates a row instead of inserting
-            one, it's like a regular UPDATE statement: it should not
-            affect the value of a next SELECT LAST_INSERT_ID() or
-            mysql_insert_id().  Except if LAST_INSERT_ID(#) was in the
-            INSERT query, which is handled separately by
-            THD::arg_of_last_insert_id_function.
-          */
-          prev_insert_id_for_cur_row= table->file->insert_id_for_cur_row;
-          insert_id_for_cur_row= table->file->insert_id_for_cur_row= 0;
-          trg_error= (table->triggers &&
-                      table->triggers->process_triggers(thd, TRG_EVENT_UPDATE,
-                                                        TRG_ACTION_AFTER, TRUE));
-          info->copied++;
-        }
+  return after_insert(inserted);
+}
-        /*
-          Only update next_insert_id if the AUTO_INCREMENT value was explicitly
-          updated, so we don't update next_insert_id with the value from the
-          row being updated. Otherwise reset next_insert_id to what it was
-          before the duplicate key error, since that value is unused.
-        */
-        if (table->next_number_field_updated)
-        {
-          DBUG_ASSERT(table->next_number_field != NULL);
-          table->file->adjust_next_insert_id_after_explicit_value(table->next_number_field->val_int());
-        }
-        else if (prev_insert_id_for_cur_row)
-        {
-          table->file->restore_auto_increment(prev_insert_id_for_cur_row);
-        }
-        goto ok;
+int Write_record::insert_on_duplicate_update(ha_rows *inserted,
+                                             ha_rows *updated)
+{
+  int res= table->file->ha_write_row(table->record[0]);
+  if (likely(!res))
+    return after_insert(inserted);
+
+  res=prepare_handle_duplicate(res);
+  if (unlikely(res))
+    return restore_on_error();
+
+  ulonglong prev_insert_id_for_cur_row= 0;
+  /*
+    We don't check for other UNIQUE keys - the first row
+    that matches, is updated. If update causes a conflict again,
+    an error is returned
+  */
+  DBUG_ASSERT(table->insert_values != NULL);
+  store_record(table,insert_values);
+  restore_record(table,record[1]);
+  table->reset_default_fields();
+
+  /*
+    in INSERT ... ON DUPLICATE KEY UPDATE the set of modified fields can
+    change per row. Thus, we have to do reset_default_fields() per row.
+    Twice (before insert and before update).
+  */
+  DBUG_ASSERT(info->update_fields->elements ==
+              info->update_values->elements);
+  if (fill_record_n_invoke_before_triggers(thd, table,
+                                           *info->update_fields,
+                                           *info->update_values,
+                                           info->ignore,
+                                           TRG_EVENT_UPDATE))
+    return restore_on_error();
+
+  bool different_records= (!records_are_comparable(table) ||
+                           compare_record(table));
+  /*
+    Default fields must be updated before checking view updateability.
+    This branch of INSERT is executed only when a UNIQUE key was violated
+    with the ON DUPLICATE KEY UPDATE option. In this case the INSERT
+    operation is transformed to an UPDATE, and the default fields must
+    be updated as if this is an UPDATE.
+  */
+  if (different_records && table->default_field)
+    table->evaluate_update_default_function();
+
+  /* CHECK OPTION for VIEW ... ON DUPLICATE KEY UPDATE ... */
+  res= info->table_list->view_check_option(table->in_use, info->ignore);
+  if (unlikely(res))
+  {
+    if (res == VIEW_CHECK_ERROR)
+      return restore_on_error();
+    DBUG_ASSERT(res == VIEW_CHECK_SKIP);
+    return 0;
+  }
+
+  table->file->restore_auto_increment(prev_insert_id);
+  info->touched++;
+  if (different_records)
+  {
+    int error= table->file->ha_update_row(table->record[1], table->record[0]);
+
+    if (unlikely(error) && error != HA_ERR_RECORD_IS_THE_SAME)
+    {
+      if (info->ignore && !table->file->is_fatal_error(error, HA_CHECK_ALL))
why not is_fatal_error(error) ?
...
+      {
+        if (!(thd->variables.old_behavior &
+              OLD_MODE_NO_DUP_KEY_WARNINGS_WITH_IGNORE))
+          table->file->print_error(error, MYF(ME_WARNING));
+        return 0;
       }
-      else /* DUP_REPLACE */
+      return on_ha_error(error);
+    }
+
+    if (error != HA_ERR_RECORD_IS_THE_SAME)
+    {
+      ++*updated;
+      if (table->versioned() &&
+          table->vers_check_update(*info->update_fields))
       {
-	/*
-	  The manual defines the REPLACE semantics that it is either
-	  an INSERT or DELETE(s) + INSERT; FOREIGN KEY checks in
-	  InnoDB do not function in the defined way if we allow MySQL
-	  to convert the latter operation internally to an UPDATE.
-          We also should not perform this conversion if we have 
-          timestamp field with ON UPDATE which is different from DEFAULT.
-          Another case when conversion should not be performed is when
-          we have ON DELETE trigger on table so user may notice that
-          we cheat here. Note that it is ok to do such conversion for
-          tables which have ON UPDATE but have no ON DELETE triggers,
-          we just should not expose this fact to users by invoking
-          ON UPDATE triggers.
-
-          Note, TABLE_SHARE and TABLE see long uniques differently:
-          - TABLE_SHARE sees as HA_KEY_ALG_LONG_HASH and HA_NOSAME
-          - TABLE sees as usual non-unique indexes
-        */
-        bool is_long_unique= table->s->key_info &&
-                             table->s->key_info[key_nr].algorithm ==
-                             HA_KEY_ALG_LONG_HASH;
-        if ((is_long_unique ?
-             /*
-               We have a long unique. Test that there are no in-engine
-               uniques and the current long unique is the last long unique.
-             */
-             !(table->key_info[0].flags & HA_NOSAME) &&
-             last_uniq_key(table, table->s->key_info, key_nr) :
-             /*
-               We have a normal key - not a long unique.
-               Test is the current normal key is unique and
-               it is the last normal unique.
-             */
-             last_uniq_key(table, table->key_info, key_nr)) &&
-            !table->file->referenced_by_foreign_key() &&
-            (!table->triggers || !table->triggers->has_delete_triggers()))
-        {
-          if (table->versioned(VERS_TRX_ID))
-          {
-            bitmap_set_bit(table->write_set, table->vers_start_field()->field_index);
-            table->file->column_bitmaps_signal();
-            table->vers_start_field()->store(0, false);
-          }
-          if (unlikely(error= table->file->ha_update_row(table->record[1],
-                                                         table->record[0])) &&
-              error != HA_ERR_RECORD_IS_THE_SAME)
-            goto err;
-          if (likely(!error))
-          {
-            info->deleted++;
-            if (!table->file->has_transactions())
-              thd->transaction->stmt.modified_non_trans_table= TRUE;
-            if (table->versioned(VERS_TIMESTAMP))
-            {
-              store_record(table, record[2]);
-              error= vers_insert_history_row(table);
-              restore_record(table, record[2]);
-              if (unlikely(error))
-                goto err;
-            }
-          }
-          else
-            error= 0;   // error was HA_ERR_RECORD_IS_THE_SAME
-          /*
-            Since we pretend that we have done insert we should call
-            its after triggers.
-          */
-          goto after_trg_n_copied_inc;
-        }
-        else
+        if (versioned)
         {
-          if (table->triggers &&
-              table->triggers->process_triggers(thd, TRG_EVENT_DELETE,
-                                                TRG_ACTION_BEFORE, TRUE))
-            goto before_trg_err;
-
-          if (!table->versioned(VERS_TIMESTAMP))
-            error= table->file->ha_delete_row(table->record[1]);
-          else
-          {
-            store_record(table, record[2]);
-            restore_record(table, record[1]);
-            table->vers_update_end();
-            error= table->file->ha_update_row(table->record[1],
-                                              table->record[0]);
-            restore_record(table, record[2]);
-          }
+          store_record(table, record[2]);
+          error= vers_insert_history_row(table);
+          restore_record(table, record[2]);
           if (unlikely(error))
-            goto err;
-          if (!table->versioned(VERS_TIMESTAMP))
-            info->deleted++;
-          else
-            info->updated++;
-          if (!table->file->has_transactions_and_rollback())
-            thd->transaction->stmt.modified_non_trans_table= TRUE;
-          if (table->triggers &&
-              table->triggers->process_triggers(thd, TRG_EVENT_DELETE,
-                                                TRG_ACTION_AFTER, TRUE))
-          {
-            trg_error= 1;
-            goto after_trg_or_ignored_err;
-          }
-          /* Let us attempt do write_row() once more */
+            return on_ha_error(error);
         }
+        ++*inserted;
       }
     }
-    
-    /*
-      If more than one iteration of the above while loop is done, from
-      the second one the row being inserted will have an explicit
-      value in the autoinc field, which was set at the first call of
-      handler::update_auto_increment(). This value is saved to avoid
-      thd->insert_id_for_cur_row becoming 0. Use this saved autoinc
-      value.
-     */
-    if (table->file->insert_id_for_cur_row == 0)
-      table->file->insert_id_for_cur_row= insert_id_for_cur_row;
-      
     /*
-      Restore column maps if they where replaced during an duplicate key
-      problem.
+      If ON DUP KEY UPDATE updates a row instead of inserting
+      one, it's like a regular UPDATE statement: it should not
+      affect the value of a next SELECT LAST_INSERT_ID() or
+      mysql_insert_id().  Except if LAST_INSERT_ID(#) was in the
+      INSERT query, which is handled separately by
+      THD::arg_of_last_insert_id_function.
     */
-    if (table->read_set != save_read_set ||
-        table->write_set != save_write_set)
-      table->column_bitmaps_set(save_read_set, save_write_set);
+    prev_insert_id_for_cur_row= table->file->insert_id_for_cur_row;
+    insert_id_for_cur_row= table->file->insert_id_for_cur_row= 0;
+
+    ++*inserted; // Conforms the older behavior;
+
+    if (use_triggers
+        && table->triggers->process_triggers(thd, TRG_EVENT_UPDATE,
+                                             TRG_ACTION_AFTER, TRUE))
+      return restore_on_error();
   }
-  else if (unlikely((error=table->file->ha_write_row(table->record[0]))))
+
+  /*
+    Only update next_insert_id if the AUTO_INCREMENT value was explicitly
+    updated, so we don't update next_insert_id with the value from the
+    row being updated. Otherwise reset next_insert_id to what it was
+    before the duplicate key error, since that value is unused.
+  */
+  if (table->next_number_field_updated)
+  {
+    DBUG_ASSERT(table->next_number_field != NULL);
+
+    table->file->adjust_next_insert_id_after_explicit_value(
+                                           table->next_number_field->val_int());
+  }
+  else if (prev_insert_id_for_cur_row)
+  {
+    table->file->restore_auto_increment(prev_insert_id_for_cur_row);
+  }
+
+  return send_data();
+}
+
+bool Write_record::is_fatal_error(int error)
+{
+  return error == HA_ERR_LOCK_DEADLOCK ||
+         error == HA_ERR_LOCK_WAIT_TIMEOUT ||
These errors were only in rpl code, not anywhere in sql_insert.cc.
I don't think it's wrong to check for them everywhere consistently,
but does it change the behavior? Can you create a test case for INSERT
before your changes that comes to thd->is_fatal_error() with
HA_ERR_LOCK_DEADLOCK or HA_ERR_LOCK_WAIT_TIMEOUT? What will be the
behavior before and after your change?
...
+         table->file->is_fatal_error(error, HA_CHECK_ALL);
+}
+
+int Write_record::single_insert(ha_rows *inserted)
+{
+  DBUG_EXECUTE_IF("rpl_write_record_small_sleep_gtid_100_200",
+  {
+    uint64 seq= thd->rgi_slave ? thd->rgi_slave->current_gtid.seq_no : 0;
+    if (seq == 100 || seq == 200)
+      my_sleep(20000);
+  });
+
+  int error= table->file->ha_write_row(table->record[0]);
+  if (unlikely(error))
   {
     DEBUG_SYNC(thd, "write_row_noreplace");
-    if (!info->ignore ||
-        table->file->is_fatal_error(error, HA_CHECK_ALL))
-      goto err;
+    if (!info->ignore || is_fatal_error(error))
+      return on_ha_error(error);
     if (!(thd->variables.old_behavior &
           OLD_MODE_NO_DUP_KEY_WARNINGS_WITH_IGNORE))
       table->file->print_error(error, MYF(ME_WARNING));
     table->file->restore_auto_increment(prev_insert_id);
-    goto after_trg_or_ignored_err;
+    return 0;
   }
+  return after_insert(inserted);
+}
-after_trg_n_copied_inc:
-  info->copied++;
-  thd->record_first_successful_insert_id_in_cur_stmt(table->file->insert_id_for_cur_row);
-  trg_error= (table->triggers &&
-              table->triggers->process_triggers(thd, TRG_EVENT_INSERT,
-                                                TRG_ACTION_AFTER, TRUE));
+/**
+  Write a record to table with optional deleting of conflicting records,
+  invoke proper triggers if needed.
+
+  @note
+    Once this record will be written to table after insert trigger will
+    be invoked. If instead of inserting new record we will update old one
+    then both on update triggers will work instead. Similarly both on
+    delete triggers will be invoked if we will delete conflicting records.
-ok:
+    Sets thd->transaction.stmt.modified_non_trans_table to TRUE if table which
+    is updated didn't have transactions.
+
+  @return
+    0     - success
+    non-0 - error
+*/
+int Write_record::write_record()
+{
+  ha_rows inserted= 0, updated= 0;
+  prev_insert_id= table->file->next_insert_id;
+  info->records++;
+  ignored_error= false;
+  int res= 0;
+  switch(info->handle_duplicates)
+  {
+  case DUP_ERROR:
+    res= single_insert(&inserted);
+    break;
+  case DUP_UPDATE:
+    res= insert_on_duplicate_update(&inserted, &updated);
+    info->updated+= updated;
+    break;
+  case DUP_REPLACE:
+    res= replace_row(&inserted, &updated);
+    if (versioned)
+      info->updated+= updated;
+    else
+      info->deleted+= updated;
+  }
+
+  info->copied+= inserted;
+  if (inserted || updated)
+    notify_non_trans_table_modified();
+  return res;
+}
+
+
+int Write_record::prepare_handle_duplicate(int error)
+{
   /*
-    We send the row after writing it to the table so that the
-    correct values are sent to the client. Otherwise it won't show
-    autoinc values (generated inside the handler::ha_write()) and
-    values updated in ON DUPLICATE KEY UPDATE.
+    If we do more than one iteration of the replace loop, from the second one the
+    row will have an explicit value in the autoinc field, which was set at
+    the first call of handler::update_auto_increment(). So we must save
+    the autogenerated value to avoid thd->insert_id_for_cur_row to become
+    0.
   */
-  if (sink && sink->send_data(thd->lex->returning()->item_list) < 0)
-    trg_error= 1;
+  if (table->file->insert_id_for_cur_row > 0)
+    insert_id_for_cur_row= table->file->insert_id_for_cur_row;
+  else
+    table->file->insert_id_for_cur_row= insert_id_for_cur_row;
+
+  if (is_fatal_error(error))
+    return on_ha_error(error);
-after_trg_or_ignored_err:
-  if (key)
-    my_safe_afree(key,table->s->max_unique_length);
+  bool is_duplicate_key_error=
+          table->file->is_fatal_error(error, HA_CHECK_ALL & ~HA_CHECK_DUP);
+  if (!is_duplicate_key_error)
+  {
+    /*
+      We come here when we had an ignorable error which is not a duplicate
+      key error. In this we ignore error if ignore flag is set, otherwise
+      report error as usual. We will not do any duplicate key processing.
+    */
+    if (info->ignore)
+    {
+      table->file->print_error(error, MYF(ME_WARNING));
+      ignored_error= true;
+      return 1; /* Ignoring a not fatal error */
+    }
+    return on_ha_error(error);
+  }
+
+  key_nr = table->file->get_dup_key(error);
+
+  if (unlikely(key_nr == std::numeric_limits<decltype(key_nr)>::max()))
Well, the handler returns (uint)-1, please compare with it, instead of
inventing numerous other ways of archieving the same.

And why did you decide to use ushort for the key number?
We pretty much never use ushort, key number is uint everywhere.
I agree that ushort is enough size-wise, but let's rather be consistent.
...
+    return on_ha_error(HA_ERR_FOUND_DUPP_KEY); /* Database can't find key */
+
+  DEBUG_SYNC(thd, "write_row_replace");
+
+  /* Read all columns for the row we are going to replace */
+  table->use_all_columns();
+  /*
+    Don't allow REPLACE to replace a row when an auto_increment column
+    was used.  This ensures that we don't get a problem when the
+    whole range of the key has been used.
+  */
+  if (info->handle_duplicates == DUP_REPLACE && table->next_number_field &&
+      key_nr == table->s->next_number_index && insert_id_for_cur_row > 0)
+    return on_ha_error(error);
+
+  error= locate_dup_record();
+  if (error)
+    return on_ha_error(error);
+
+  return 0;
+}
+
+inline
+void Write_record::notify_non_trans_table_modified()
+{
   if (!table->file->has_transactions_and_rollback())
     thd->transaction->stmt.modified_non_trans_table= TRUE;
ok, as you like.
I'm personally not a great fan of one-liners that are used only once.
...
-  DBUG_RETURN(trg_error);
+}
-err:
+inline
+int Write_record::on_ha_error(int error)
+{
   info->last_errno= error;
   table->file->print_error(error,MYF(0));
-  
-before_trg_err:
+  DBUG_PRINT("info", ("Returning error %d", error));
+  return restore_on_error();
+}
+
+inline
+int Write_record::restore_on_error()
+{
   table->file->restore_auto_increment(prev_insert_id);
-  if (key)
-    my_safe_afree(key, table->s->max_unique_length);
-  table->column_bitmaps_set(save_read_set, save_write_set);
-  DBUG_RETURN(1);
+  return ignored_error ? 0 : 1;
+}
+
+inline
+int Write_record::after_insert(ha_rows *inserted)
+{
+  ++*inserted;
+  thd->record_first_successful_insert_id_in_cur_stmt(table->file->insert_id_for_cur_row);
+  return after_ins_trg() || send_data();
+}
+
+inline
+int Write_record::after_ins_trg()
+{
+  int res= 0;
+  if (use_triggers)
+    res= table->triggers->process_triggers(thd, TRG_EVENT_INSERT,
+                                           TRG_ACTION_AFTER, TRUE);
+  return res;
+}
+
+inline
+int Write_record::send_data()
+{
+  /*
+    We send the row after writing it to the table so that the
+    correct values are sent to the client. Otherwise it won't show
+    autoinc values (generated inside the handler::ha_write()) and
+    values updated in ON DUPLICATE KEY UPDATE.
+  */
+  return sink ? sink->send_data(thd->lex->returning()->item_list) < 0 : 0;
 }
+
 /******************************************************************************
   Check that there aren't any null_fields
 ******************************************************************************/
@@ -4816,18 +4934,8 @@ select_create::prepare(List<Item> &_values, SELECT_LEX_UNIT *u)
restore_record(table,s->default_values);      // Get empty record
   thd->cuted_fields=0;
-  bool create_lookup_handler= info.handle_duplicates != DUP_ERROR;
-  if (info.ignore || info.handle_duplicates != DUP_ERROR)
-  {
-    create_lookup_handler= true;
-    table->file->extra(HA_EXTRA_IGNORE_DUP_KEY);
-    if (table->file->ha_table_flags() & HA_DUPLICATE_POS)
-    {
-      if (table->file->ha_rnd_init_with_error(0))
-        DBUG_RETURN(1);
-    }
-  }
-  table->file->prepare_for_insert(create_lookup_handler);
+  if (prepare_for_replace(table, info.handle_duplicates, info.ignore))
ok.
strange that select_create::prepare doesn't call select_insert::prepare
...
+    DBUG_RETURN(1);
   if (info.handle_duplicates == DUP_REPLACE &&
       (!table->triggers || !table->triggers->has_delete_triggers()))
     table->file->extra(HA_EXTRA_WRITE_CAN_REPLACE);
diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
index 5db53b6f2b8..7ce4850f534 100644
--- a/sql/sql_parse.cc
+++ b/sql/sql_parse.cc
@@ -4814,7 +4815,8 @@ mysql_execute_command(THD *thd)
                                     &lex->value_list,
                                     lex->duplicates,
                                     lex->ignore,
-                                    result)))
+                                    result,
+                                    &write)))
Why do you need Write_record here on the stack?
Wouldn't it be more logical to have it in select_insert ?
...
{
         if (lex->analyze_stmt)
           ((select_result_interceptor*)sel_result)->disable_my_ok_calls();
Regards,
Sergei
Chief Architect, MariaDB Server
and security@mariadb.org

    

[MariaDB developers] Re: 0d6dc52f43d: MDEV-30046 Refactor write_record and fix idempotent replication

Sergei Golubchik