May 2021: 8 participants, 35 discussions
Re: [Maria-developers] 308c28e8a60: MDEV-15888 Implement FLUSH TABLES tbl_name [, tbl_name] ... WITH READ LOCK for views.
by Sergei Golubchik 23 May '21
Hi, Alexey!
On Mar 21, Alexey Botchkov wrote:
> revision-id: 308c28e8a60 (mariadb-10.5.2-393-g308c28e8a60)
> parent(s): 786bc312b85
> author: Alexey Botchkov <holyfoot(a)mariadb.com>
> committer: Alexey Botchkov <holyfoot(a)mariadb.com>
> timestamp: 2021-02-09 00:59:55 +0400
> message:
>
> MDEV-15888 Implement FLUSH TABLES tbl_name [, tbl_name] ... WITH READ LOCK for views.
>
> Kind of 'naive' solution that surprisingly worked so trying to
> understand why.
>
> diff --git a/mysql-test/main/flush-innodb.result b/mysql-test/main/flush-innodb.result
> index 21e5bda7785..2c886e4f9fc 100644
> --- a/mysql-test/main/flush-innodb.result
> +++ b/mysql-test/main/flush-innodb.result
> @@ -60,7 +60,7 @@ DROP TABLE export;
> CREATE VIEW v1 AS SELECT 1;
> CREATE TEMPORARY TABLE t1 (a INT);
> FLUSH TABLES v1 FOR EXPORT;
> -ERROR HY000: 'test.v1' is not of type 'BASE TABLE'
> +UNLOCK TABLES;
good, but it only tests that there's no error.
maybe you need to add tests to verify that the tables are actually
locked?
> FLUSH TABLES t1 FOR EXPORT;
> ERROR 42S02: Table 'test.t1' doesn't exist
> FLUSH TABLES non_existent FOR EXPORT;
> diff --git a/sql/sql_reload.cc b/sql/sql_reload.cc
> index 8f87d633d19..c146b7693c1 100644
> --- a/sql/sql_reload.cc
> +++ b/sql/sql_reload.cc
> @@ -543,6 +543,7 @@ bool flush_tables_with_read_lock(THD *thd, TABLE_LIST *all_tables)
>
> if (thd->lex->type & REFRESH_READ_LOCK)
> {
> +#ifdef MDEV15887
> /*
> Acquire SNW locks on tables to be flushed. Don't acquire global
> IX and database-scope IX locks on the tables as this will make
> @@ -559,6 +560,7 @@ bool flush_tables_with_read_lock(THD *thd, TABLE_LIST *all_tables)
> for (auto table_list= all_tables; table_list;
> table_list= table_list->next_global)
> table_list->mdl_request.ticket= NULL;
> +#endif /*MDEV15887*/
what's that?
btw, the number is MDEV15888
> }
>
> thd->variables.option_bits|= OPTION_TABLE_LOCK;
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
Re: [Maria-developers] Request for a discussion: A fine-grained concurrent ring buffer mode for IO_CACHE
by Nikita Malyavin 22 May '21
Welcome to the thread, Andrei!
For everybody: we just had a productive chat with Andrei, and I'd like to
outline the results.
Let's take a look at binlog_commit, or more exactly at
MYSQL_BIN_LOG::trx_group_commit_leader called from there
(follow the binlog_commit_flush_trx_cache
-> THD::binlog_flush_pending_rows_event
-> MYSQL_BIN_LOG::flush_and_set_pending_rows_event
-> MYSQL_BIN_LOG::write_transaction_to_binlog call chain).
Committing is currently strictly sequenced. The transactions are organized
into groups, and when the last transaction is acknowledged in the first
commit phase, they are all committed by the leader in a chosen order.
However, even here we can write the transactions in parallel while
preserving the order. Andrei also claims that the order can potentially be
restored on the replication side.
Anyway, we technically can't send the transaction to the replication slave
before the binlog flush & fsync; nevertheless, the data will still be
preserved in the volatile append cache.
There was MDEV-20925 <https://jira.mariadb.org/browse/MDEV-20925> to store
the transaction length in the event, but unfortunately it was rejected:
> I will be closing this issue because we have COMMIT/ROLLBACK query log event
> in the end of transaction, whose size is difficult to determine,
> so current plan is to do MDEV-19687 without transaction length.
The replication team decided to calculate the transaction sizes while
receiving the data from the IO thread and store them in a hash, so no
protocol modifications would be required.
I suggested buffering the transaction separately, and then pushing it into
the relay log in a data frame that stores the length.
Then, we have MDEV-19687 <https://jira.mariadb.org/browse/MDEV-19687>,
which was the supertask for MDEV-20925
<https://jira.mariadb.org/browse/MDEV-20925>.
The aim is to parallelize replication on the slave side. They are
implementing their own parallel circular buffer for a single-writer,
multiple-reader use case: the parallel workers are going to pick out the
transaction data and resolve the commit order later.
Sergei Petrunya, you asked:
> I see SEQ_READ_APPEND is only used for Relay Log on the slave. Afaiu, the relay
> log has only one producer, the network io thread.
Andrei clarified that there can actually be multiple sources. However, they
are just going to create a work queue for each source.
===============================================================
To sum up, the rationale for this IO_CACHE improvement has changed:
* The replication relay log is not the use case anymore, since a separate,
much simplified circular buffer is going to be implemented (AFAIU, Sachin
is in the middle of MDEV-19687
<https://jira.mariadb.org/browse/MDEV-19687>).
* The binlog commit is a good case instead! Most likely I had forgotten
about it when writing the rationale; in any case, it wasn't clear to me
whether the transactions could be written in parallel there.
On the reader side, the network replication sender should not block the
committing process by reading from the binlog.
Regards,
Nikita
[Maria-developers] Request for a discussion: A fine-grained concurrent ring buffer mode for IO_CACHE
by Nikita Malyavin 19 May '21
Hello, dear MariaDB community!
I am glad to present the design for IO_CACHE's SEQ_READ_APPEND mode
concurrency improvements. The Jira task is MDEV-24676.
A student, Vladislav Kakurin, has applied to GSoC for this task and
has passed, so he will be engaged in implementing it this summer.
The discussion is very important here: I could have made mistakes,
and Vladislav may miss them, and the success of the task may depend on it.
We're hoping to finish the multi-producer implementation by the end of GSoC.
So please don't hesitate to ask for clarifications.
Note that the reader/writer and producer/consumer pairs are used
interchangeably in the text.
The text is attached as a PDF, but for the sake of convenient quoting it
is additionally embedded in the email, with reduced markup and no
illustrations.
Sincerely,
Nikita Malyavin
Software Engineer of MariaDB
====================================================================
A fine-grained concurrent ring buffer mode for IO_CACHE
1. Rationale.
---------------------------------------------------
The current implementation of IO_CACHE's SEQ_READ_APPEND mode is
coarse-grained on its write buffer: every read and every write is
protected by append_buffer_lock.
int _my_b_seq_read(IO_CACHE *info, ...) {
  lock_append_buffer(info);
  ... // read logic
  unlock_append_buffer(info);
  return Count ? 1 : 0;
}
int my_b_append(IO_CACHE *info, ...) {
  lock_append_buffer(info);
  ... // append logic
  unlock_append_buffer(info);
  return 0;
}
Although the separate read buffer is read-only, and is therefore
accessed wait-free, the write buffer can suffer contention with
medium-sized transactions.
The design described hereafter solves this issue, and an
extension for a parallel multi-producer workflow is additionally
provided.
Furthermore, an API extension supporting the multi-producer approach is
proposed, and multi-consumerness is discussed.
2. The single-producer, single-consumer case.
---------------------------------------------------
Idea.
The memcpy operations of the consumer and the producer never overlap,
therefore they can be freed of locks.
Overflow and emptiness.
We cannot begin writing in an area still involved in reading.
Therefore, the reader should not update the pointers before it finishes
reading. This means that we should lock in the beginning, to atomically
read the data, and in the end, to write the new reader state.
The same holds vice versa: we cannot read from an area still involved
in writing, therefore a read should finish with an EMPTY error
(currently _my_b_seq_read just returns 1).
When we reach a “buffer is full” condition, we can flip the read and
write (append) buffers, if we were reading from the append buffer.
Otherwise, the append buffer is flushed.
The algorithm.
The following pseudocode describes the single-consumer,
single-producer approach.
It is assumed that reading from the read buffer is handled in the usual
way.
io->total_size and io->read_buffer are considered to be accessed
atomically.
io_err_t read(IO_CACHE *io, uchar *buffer, size_t sz)
{
  if (sz > io->total_size)
    return E_IO_EMPTY;
  size_t sz_read = 0;
  uchar *read_buffer = io->read_buffer;
  if (io->read_pos points to read_buffer)
    sz_read = read_from_read_buffer(io, buffer, sz);
  buffer += sz_read;
  sz -= sz_read;
  io->total_size -= sz_read;
  if (sz == 0)
    return 0;
  // else copy from append buffer
  lock(io->append_buffer_lock);
  // copy the shared state to local variables
  uchar *read_pos = io->read_pos;
  uchar *append_start_pos = io->append_start_pos;
  size_t append_size = io->append_size;
  uchar *append_pos = io->append_pos;
  // etc, if needed
  unlock(io->append_buffer_lock);
  read from append buffer;
  lock(io->append_buffer_lock);
  // update the variables
  io->append_size -= sz;
  io->append_start_pos += sz;
  if (io->append_start_pos >= io->append_buffer + io->cache_size)
    io->append_start_pos -= io->cache_size;
  unlock(io->append_buffer_lock);
  io->total_size -= sz;
  return 0;
}
The first part of read() tries to read from the read-only buffer.
If it's empty, it moves the effort to the volatile append buffer.
All the metadata is copied to the stack in the first critical section,
before the data copying, and is written back in the second critical
section, after the data copying.
io_err_t write(IO_CACHE *io, uchar *buffer, size_t sz)
{
  lock(io->append_buffer_lock);
  if (append_buffer is full and io->total_size <= io->append_size)
    swap(io->append_buffer, io->read_buffer);
  else flush the append buffer if needed;
  write to disk directly, if the data is too large;
  uchar *write_pos = io->write_pos;
  unlock(io->append_buffer_lock);
  write to append buffer;
  lock(io->append_buffer_lock);
  io->write_pos = new_write_pos;
  unlock(io->append_buffer_lock);
  io->total_size += sz;
}
The important note here is that we access io->read_buffer in the
reader's thread without the lock. However, this access happens only
once, in the beginning, and it is safe:
* Only the writer changes read_buffer.
* The writer can change it at most once during one read().
* If io->read_buffer is considered read-only, then it will not flip
again, and continues to be consistent, until io->total_size is changed:
io->total_size -= sz_read;
Then the lock happens. It should be fine to read from a flipped buffer
at that stage.
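To make the memory-ordering argument concrete, here is a minimal C11
sketch of that lock-free prologue. The atomic field types and names are
illustrative assumptions, not the actual IO_CACHE layout:

#include <stdatomic.h>
#include <stddef.h>

/* Assumed layout, for illustration only: the two fields the reader
   touches without append_buffer_lock are declared atomic. */
typedef struct {
  _Atomic size_t total_size;            /* bytes available for reading */
  _Atomic(unsigned char *) read_buffer; /* current read-only buffer */
} io_cache_sketch;

/* Lock-free prologue of read(): one acquire-load per shared field.
   Even if the writer flips read_buffer concurrently, the snapshot is
   consistent: only the writer modifies it, and it can flip at most
   once before we decrement total_size. */
static int read_prologue(io_cache_sketch *io, size_t sz)
{
  if (sz > atomic_load_explicit(&io->total_size, memory_order_acquire))
    return 1;                                  /* E_IO_EMPTY */
  unsigned char *rb =
    atomic_load_explicit(&io->read_buffer, memory_order_acquire);
  (void) rb;     /* ... continue copying from rb without the lock ... */
  return 0;
}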
3. Multi-producer concurrency
---------------------------------------------------
Idea.
Writes start from io->write_start, which is updated immediately.
Reads are possible only up to io->read_end, which is updated as soon
as writes finish.
Medium-grained approach.
io->write_start is updated immediately to allow parallel writes.
However, we cannot update io->read_end immediately after this thread's
write ends, because earlier writes can still be in progress. We should
wait for them, i.e. we wait while (io->read_end != local_write_start).
Algorithm (medium-grained).
The medium-grained approach modifies the write() function as follows:
io_err_t write(IO_CACHE *io, uchar *buffer, size_t sz)
{
  lock(io->append_buffer_lock);
  if (buffer flip or flush is needed)
    wait until all the writes are finished;
  if (append_buffer is full &&
      io->write_total_size <= io->append_size)
    swap(io->append_buffer, io->read_buffer);
  else flush the append buffer if needed;
  write to disk directly, if the data is too large;
  uchar *local_write_start = io->write_start;
  io->write_total_size += sz;
  io->write_start += sz;
  if (io->write_start > io->append_buffer + io->cache_size)
    io->write_start -= io->cache_size;
  unlock(io->append_buffer_lock);
  write to append buffer;
  lock(io->write_event_lock);
  while (local_write_start != io->read_end)
    cond_wait(io->write_event, io->write_event_lock);
  unlock(io->write_event_lock);
  lock(io->append_buffer_lock);
  io->read_end = new_read_end;
  unlock(io->append_buffer_lock);
  cond_signal(io->write_event);
  io->total_size += sz;
}
The read function should be modified mostly cosmetically.
Fine graining.
The writers still wait for each other to finish. The approach
described here defers the waiting through a helping pattern, by
introducing progress slots.
Each time a writer begins progress, it allocates a slot in a dedicated
(fixed-size) array.
When the writer finishes its job, it checks whether it is the leftmost
one (relative to its read_end value). If it is, it updates read_end for
itself, and for all the consecutive writers that have already finished.
The slot allocation will be controlled by a semaphore to prevent
overflow. Therefore, only a fixed number of producers can work
simultaneously.
The slot array is made of elements of private cache_slot_t structure:
struct cache_slot_t {
  bool vacant: 1;
  bool finished: 1;
  uint next: size_bits(uint) - 2;
  uint pos;
};
A slot is acquired whenever a write begins, by searching for an array
cell with vacant=1. When it's found, vacant=0, finished=0 is set. The
last_slot variable holds the slot index of the latest write.
slots[last_slot].next is set to the new index, and last_slot itself is
updated.
The following example demonstrates how the slots work:
+-------------------------------------------------------------+
| | write1 | write2 | write3 | |
+-------------------------------------------------------------+
A B C D
+-------------------------------------------------+
| vacant=0 | vacant=1 | vacant=0 | vacant=0 |
| finished=0 | | finished=1 | finished=1 |
| next=2 | | next=3 | next=? |
| pos=A | | pos=B | pos=C |
+-------------------------------------------------+
There were three writes running in parallel. write2 and write3 are
finished, but write1 is still running. When it finishes, it will hop
through slot.next while vacant==0 and finished==1 and pos !=
io->write_start. Therefore, read_end will be updated to C if no other
write begins in parallel.
If another write begins in parallel before write1 finishes, it
allocates slots[1] and sets pos=D. slots[3].next is set to 1, and
last_slot is updated from 3 to 1.
The expected complexity of the slot run-through is O(1). Proving the
same bound for slot acquisition is less obvious, and no effort was
spent on it (it's only obvious that it's O(slots)).
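For illustration, here is a hedged C sketch of that run-through. The
tail slot's self-pointing next and the call-under-lock convention are
assumptions layered on top of the design, not decisions made above:

/* Assumed to be called under append_buffer_lock, right after this
   writer sets finished=1 in its own slot; the tail slot's next is
   assumed to point at itself until a successor registers. */
struct sketch_slot_t {
  unsigned vacant: 1;
  unsigned finished: 1;
  unsigned next: sizeof(unsigned) * 8 - 2; /* index of the next write */
  unsigned pos;                          /* start offset of the write */
};

static unsigned advance_read_end(struct sketch_slot_t *slots,
                                 unsigned my_slot, unsigned read_end,
                                 unsigned write_start)
{
  unsigned i = my_slot;
  if (slots[i].pos != read_end)   /* an earlier write is still running */
    return read_end;
  while (!slots[i].vacant && slots[i].finished &&
         slots[i].pos != write_start)
  {
    unsigned next = slots[i].next; /* read before releasing the slot */
    read_end = slots[i].pos;       /* furthest position proven done */
    slots[i].vacant = 1;           /* free the slot for new writers */
    if (next == i)                 /* tail sentinel: no successor yet */
      break;
    i = next;
  }
  return read_end;
}

With the example above, write1 finishing at pos=A hops over the
finished slots for B and C and returns C, matching the text.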
4. Arbitrary data sources support
---------------------------------------------------
A widespread use case is pouring data from another IO_CACHE source
(like a statement or transaction cache). The operation may require
several consecutive write() calls under an external lock:
lock(write_lock);
uchar buffer[SIZE];
while (cache_out is not empty) {
  read(cache_out, buffer, SIZE);
  write(cache_in, buffer, SIZE);
}
unlock(write_lock);
This case destroys the whole parallel design described above.
However, let's make API changes to allow blocks of predicted size to be
written in parallel:
/**
Allocates a slot of the requested size for a writer. Returns the new
slot id.
*/
slot_id_t append_allocate(IO_CACHE*, size_t block_size);
/**
Frees the slot and propagates the data to be available for reading
*/
void append_commit(IO_CACHE*, slot_id_t);
These two functions just decompose our write() function:
append_allocate includes the first critical section, and
append_commit includes the second one.
The use case changes slightly:
slot_id_t slot = append_allocate(cache_in, append_tell(cache_out));
uchar buffer[SIZE];
while (cache_out is not empty) {
  read(cache_out, buffer, SIZE);
  write(cache_in, buffer, SIZE);
}
append_commit(cache_in, slot);
5. Multi-consumerness
---------------------------------------------------
We currently have no cases where several readers work in parallel in
SEQ_READ_APPEND mode. It is only used by the replication thread to read
out the log, which is delegated to a dedicated worker. The first
problem is that a parallel readout would require additional coordination
-- the order of event application can be important.
Another problem is that variable-sized blocks require at least two
consecutive reads if the structure is not known. If the length is
stored, a block can be read out with exactly two reads (the first reads
the length, the second reads the body).
The slot allocation strategy can be applied here as well, and an API
can be added similar to the new write API:
/** lock the cache and allocate the read slot */
slot_id_t read_allocate_lock(IO_CACHE*);
/** Allocate a read zone of the requested size and unlock the cache */
void read_allocate_unlock(IO_CACHE*, slot_id_t, size_t size);
/** Finish reading; deallocate the read slot */
void read_commit(IO_CACHE*, slot_id_t);
The reading API needs one more function than the writing API -- the
allocation is split into two phases: a locking phase (to compute the
block length) and the actual requesting phase.
This approach has several disadvantages:
1. The read buffer access is no longer lock-free.
2. read_allocate_lock leaves the IO_CACHE in a locked state, which can
be potentially misused.
Additionally, two SX locks can be used (one for readers and one for
writers) for extra parallelism.
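To show how the three calls would compose, here is a hedged C sketch of
reading one length-prefixed block; read_length_prefix() and
copy_reserved_zone() are hypothetical helpers standing in for the
two-phase read described above:

#include <stdlib.h>

typedef unsigned char uchar;
typedef unsigned int slot_id_t;
typedef struct IO_CACHE IO_CACHE;

/* The first three declarations are the proposed API from this section;
   the last two are hypothetical helpers. */
slot_id_t read_allocate_lock(IO_CACHE *cache);
void read_allocate_unlock(IO_CACHE *cache, slot_id_t slot, size_t size);
void read_commit(IO_CACHE *cache, slot_id_t slot);
size_t read_length_prefix(IO_CACHE *cache);  /* peek the stored length */
void copy_reserved_zone(IO_CACHE *cache, slot_id_t slot,
                        uchar *to, size_t size);

void read_one_block(IO_CACHE *cache)
{
  slot_id_t slot = read_allocate_lock(cache); /* cache is locked here */
  size_t len = read_length_prefix(cache);     /* first read: length */
  read_allocate_unlock(cache, slot, len);     /* reserve zone, unlock */

  uchar *body = (uchar *) malloc(len);
  copy_reserved_zone(cache, slot, body, len); /* second read: body */
  read_commit(cache, slot);                   /* deallocate the slot */
  /* ... apply the event ... */
  free(body);
}

The lock is held only long enough to learn the block length; the body
copy itself can run in parallel with other readers.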
16 May '21
Hi, Monty!
On May 10, Michael Widenius wrote:
> revision-id: eebebe8ef75 (mariadb-10.6.0-77-geebebe8ef75)
> parent(s): 5a602e2ce7f
> author: Michael Widenius <michael.widenius(a)gmail.com>
> committer: Michael Widenius <michael.widenius(a)gmail.com>
> timestamp: 2021-05-04 21:09:10 +0300
> message:
>
> MDEV-23842 Atomic RENAME TABLE
> diff --git a/client/mysqltest.cc b/client/mysqltest.cc
> index 350c55edee2..133d1f76839 100644
> --- a/client/mysqltest.cc
> +++ b/client/mysqltest.cc
> @@ -8095,9 +8095,10 @@ void handle_error(struct st_command *command,
> const char *err_sqlstate, DYNAMIC_STRING *ds)
> {
> int i;
> -
> DBUG_ENTER("handle_error");
>
> + var_set_int("$errno", err_errno);
there was $mysql_errno already, was it not something
you could've used?
> +
> command->used_replace= 1;
> if (command->require_file)
> {
> diff --git a/mysql-test/suite/atomic/rename_case.result b/mysql-test/suite/atomic/rename_case.result
> new file mode 100644
> index 00000000000..4b58c555cf2
> --- /dev/null
> +++ b/mysql-test/suite/atomic/rename_case.result
> @@ -0,0 +1,52 @@
> +create database test2;
> +#
> +# Testing rename error in different places
> +#
> +create table t1 (a int);
> +create table T2 (b int);
> +create table t3 (c int);
> +create table T4 (d int);
> +insert into t1 values(1);
> +insert into T2 values(2);
> +insert into t3 values(3);
> +insert into T4 values(4);
> +create temporary table tmp1 (a int);
> +create temporary table tmp2 (b int);
> +create temporary table tmp3 (c int);
> +create temporary table tmp4 (d int);
> +insert into tmp1 values(11);
> +insert into tmp2 values(22);
> +insert into tmp3 values(33);
> +insert into tmp4 values(44);
> +rename table t3 to T4, t1 to t5, T2 to t1, t5 to T2;
> +ERROR 42S01: Table 'T4' already exists
> +rename table t1 to t5, t3 to T4, T2 to t1, t5 to T2;
> +ERROR 42S01: Table 'T4' already exists
> +rename table t1 to t5, T2 to t1, t3 to T4, t5 to T2;
> +ERROR 42S01: Table 'T4' already exists
> +rename table t1 to t5, T2 to t1, t5 to T2, t3 to T4;
> +ERROR 42S01: Table 'T4' already exists
> +# Try failed rename using two databases
> +rename table test.t1 to test2.t5, test.T2 to test.t1, t5 to test.T2;
> +ERROR 42S02: Table 'test.t5' doesn't exist
> +select t1.a+T2.b+t3.c+T4.d from t1,T2,t3,T4;
better do
select t1.a,T2.b,t3.c,T4.d from t1,T2,t3,T4;
same one row, but it can detect when tables are swapped.
or if you need one scalar value (you do below) it could be
select CONCAT(t1.a,T2.b,t3.c,T4.d) from t1,T2,t3,T4;
> +t1.a+T2.b+t3.c+T4.d
> +10
> +select * from t5;
> +ERROR 42S02: Table 'test.t5' doesn't exist
> +T2.MYD
> +T2.MYI
> +T2.frm
> +T4.MYD
> +T4.MYI
> +T4.frm
> +db.opt
> +t1.MYD
> +t1.MYI
> +t1.frm
> +t3.MYD
> +t3.MYI
> +t3.frm
> +# Cleanup
> +drop table t1,T2,t3,T4;
> +drop database test2;
> diff --git a/sql/sql_class.h b/sql/sql_class.h
> index 0dcb2e04ba4..5ae0866672d 100644
> --- a/sql/sql_class.h
> +++ b/sql/sql_class.h
> @@ -2835,6 +2835,7 @@ class THD: public THD_count, /* this must be first */
>
> #ifndef MYSQL_CLIENT
> binlog_cache_mngr * binlog_setup_trx_data();
> + ulonglong binlog_xid; /* binlog request storing of xid */
"Tell binlog to store thd->query_id in the next Query_log_event"
>
> /*
> Public interface to write RBR events to the binlog
> diff --git a/sql/log_event_client.cc b/sql/log_event_client.cc
> index 6b5d71348e1..733fc14c697 100644
> --- a/sql/log_event_client.cc
> +++ b/sql/log_event_client.cc
> @@ -1822,7 +1822,7 @@ bool Query_log_event::print_query_header(IO_CACHE* file,
> my_b_printf(file,
> "\t%s\tthread_id=%lu\texec_time=%lu\terror_code=%d\n",
> get_type_str(), (ulong) thread_id, (ulong) exec_time,
> - error_code))
> + error_code, (ulong) xid))
huh? you didn't add "\nxid=%lu" to the format string
> goto err;
> }
>
> diff --git a/sql/debug_sync.h b/sql/debug_sync.h
> index 3b8aa8815e1..4e3e10fcc51 100644
> --- a/sql/debug_sync.h
> +++ b/sql/debug_sync.h
> @@ -53,4 +53,12 @@ static inline bool debug_sync_set_action(THD *, const char *, size_t)
> { return false; }
> #endif /* defined(ENABLED_DEBUG_SYNC) */
>
> +#ifndef DBUG_OFF
> +void debug_crash_here(const char *keyword);
> +bool debug_simulate_error(const char *keyword, uint error);
> +#else
> +#define debug_crash_here(A) do { } while(0)
> +#define debug_simulate_error(A, B) 0
> +#endif
this doesn't belong to debug_sync.h or debug_sync.cc.
put it, hmm, maybe in sql_priv.h? or sql_class.h/sql_class.cc
> +
> #endif /* DEBUG_SYNC_INCLUDED */
> diff --git a/sql/debug_sync.cc b/sql/debug_sync.cc
> index 1fbb15592a4..f523e22f6ce 100644
> --- a/sql/debug_sync.cc
> +++ b/sql/debug_sync.cc
> @@ -1628,3 +1628,74 @@ bool debug_sync_set_action(THD *thd, const char *action_str, size_t len)
> /* prevent linker/lib warning about file without public symbols */
> int debug_sync_dummy;
> #endif /* defined(ENABLED_DEBUG_SYNC) */
> +
> +
> +/**
> + Debug utility to do crash after a set number of executions
> +
> + The user variable, either @debug_crash_counter or @debug_error_counter,
this is very hackish. better to use @@debug_dbug_counter or
something like that. also it could be global, for an extra benefit,
if you'd need to crash a non-user thread.
> + is decremented each time debug_crash() or debug_simulate_error is called
> + if the keyword is set with @@debug_push, like
> + @@debug_push="d+frm_data_type_info_emulate"
> +
> + If the variable is not set or is not an integer it will be ignored.
> +*/
> +
> +#ifndef DBUG_OFF
> +
> +static const LEX_CSTRING debug_crash_counter=
> +{ STRING_WITH_LEN("debug_crash_counter") };
> +static const LEX_CSTRING debug_error_counter=
> +{ STRING_WITH_LEN("debug_error_counter") };
> +
> +static bool debug_decrement_counter(const LEX_CSTRING *name)
> +{
> + THD *thd= current_thd;
> + user_var_entry *entry= (user_var_entry*)
> + my_hash_search(&thd->user_vars, (uchar*) name->str, name->length);
> + if (!entry || entry->type != INT_RESULT || ! entry->value)
> + return 0;
> + (*(ulonglong*) entry->value)= (*(ulonglong*) entry->value)-1;
> + return !*(ulonglong*) entry->value;
> +}
> +
> +void debug_crash_here(const char *keyword)
> +{
> + DBUG_ENTER("debug_crash_here");
> + DBUG_PRINT("enter", ("keyword: %s", keyword));
> +
> + DBUG_EXECUTE_IF(keyword,
> + if (debug_decrement_counter(&debug_crash_counter))
> + {
> + my_printf_error(ER_INTERNAL_ERROR,
> + "Crashing at %s",
> + MYF(ME_ERROR_LOG | ME_NOTE), keyword);
> + DBUG_SUICIDE();
> + });
> + DBUG_VOID_RETURN;
> +}
> +
> +/*
> + This can be used as debug_counter to simulate an error at a specific
> + position.
> +
> + Typical usage would be
> + if (debug_simualte_error("keyword"))
> + error= 1;
> +*/
> +
> +bool debug_simulate_error(const char *keyword, uint error)
> +{
> + DBUG_ENTER("debug_crash_here");
> + DBUG_PRINT("enter", ("keyword: %s", keyword));
> + DBUG_EXECUTE_IF(keyword,
> + if (debug_decrement_counter(&debug_error_counter))
> + {
> + my_printf_error(error,
> + "Simulating error for '%s'",
> + MYF(ME_ERROR_LOG), keyword);
> + DBUG_RETURN(1);
> + });
> + DBUG_RETURN(0);
> +}
> +#endif /* DBUG_OFF */
> diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
> index cca05f1fcca..0cf037e3a6f 100644
> --- a/sql/log_event_server.cc
> +++ b/sql/log_event_server.cc
> @@ -1294,6 +1294,15 @@ bool Query_log_event::write()
> int3store(start, when_sec_part);
> start+= 3;
> }
> +
> + /* xid's is used with ddl_log handling */
it's not really an XID, it's the query_id of the DDL statement.
An XID is a combination of a 64-byte global transaction identifier and
a 64-byte branch qualifier.
It would be much less confusing not to pretend that the query_id
is an "xid" in some shape or form. Better to rename it to, say,
ddl_query_id. Or even ddl_id, if you like it short.
> + if (thd && thd->binlog_xid)
> + {
> + *start++= Q_XID;
> + int8store(start, thd->query_id);
> + start+= 8;
> + }
> +
> /*
> NOTE: When adding new status vars, please don't forget to update
> the MAX_SIZE_LOG_EVENT_STATUS in log_event.h and update the function
> diff --git a/sql/log.cc b/sql/log.cc
> index 17c599627cb..6e0a9706ec8 100644
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -10654,6 +10669,8 @@ int TC_LOG_BINLOG::recover(LOG_INFO *linfo, const char *last_log_name,
> {
> if (ha_recover(&xids))
> goto err2;
> + if (ddl_log_close_binlogged_events(&xids))
your xids and normal Xid_log_event xids are from two different
non-intersecting sets and have even different semantics.
It'd be safer not to have them in the same hash.
> + goto err2;
>
> free_root(&mem_root, MYF(0));
> my_hash_free(&xids);
> diff --git a/sql/sql_view.cc b/sql/sql_view.cc
> index 5df5ecb76a9..e6d726b30d7 100644
> --- a/sql/sql_view.cc
> +++ b/sql/sql_view.cc
> @@ -2231,16 +2234,25 @@ mysql_rename_view(THD *thd,
> (uchar*)&view_def, view_parameters))
> {
> /* restore renamed view in case of error */
> - rename_in_schema_file(thd, new_db->str, new_name->str, view->db.str,
> - view->table_name.str);
> + rename_in_schema_file(thd, new_db->str, new_name->str, old_db->str,
> + old_name->str);
> goto err;
> }
> - } else
> + }
> + else
> DBUG_RETURN(1);
>
> /* remove cache entries */
> - query_cache_invalidate3(thd, view, 0);
> - sp_cache_invalidate();
> + {
> + char key[NAME_LEN*2+1], *ptr;
> + memcpy(key, old_db->str, old_db->length);
> + ptr= key+ old_db->length;
> + *ptr++= 0;
> + memcpy(key, old_name->str, old_name->length);
> + ptr= key+ old_db->length;
> + *ptr++= 0;
there's tdc_create_key for that.
> + query_cache.invalidate(thd, key, (size_t) (ptr-key), 0);
> + }
> error= FALSE;
>
> err:
> diff --git a/sql/ddl_log.h b/sql/ddl_log.h
> index 0747699cd85..316e6708f22 100644
> --- a/sql/ddl_log.h
> +++ b/sql/ddl_log.h
> @@ -24,6 +24,9 @@
> enum ddl_log_entry_code
> {
> /*
> + DDL_LOG_UNKOWN
s/UNKOWN/UNKNOWN/
> + Here mainly to detect blocks that are all zero
> +
> DDL_LOG_EXECUTE_CODE:
> This is a code that indicates that this is a log entry to
> be executed, from this entry a linked list of log entries
> @@ -34,48 +37,95 @@ enum ddl_log_entry_code
> DDL_IGNORE_LOG_ENTRY_CODE:
> An entry that is to be ignored
> */
> - DDL_LOG_EXECUTE_CODE = 'e',
> - DDL_LOG_ENTRY_CODE = 'l',
> - DDL_IGNORE_LOG_ENTRY_CODE = 'i'
> + DDL_LOG_UNKNOWN= 0,
> + DDL_LOG_EXECUTE_CODE= 1,
> + DDL_LOG_ENTRY_CODE= 2,
> + DDL_IGNORE_LOG_ENTRY_CODE= 3,
> + DDL_LOG_ENTRY_CODE_LAST= 4
> };
>
> +
> +/*
> + When adding things below, also add an entry to ddl_log_entry_phases in
> + ddl_log.cc
> +*/
> +
> enum ddl_log_action_code
> {
> /*
> The type of action that a DDL_LOG_ENTRY_CODE entry is to
> perform.
> - DDL_LOG_DELETE_ACTION:
> - Delete an entity
> - DDL_LOG_RENAME_ACTION:
> - Rename an entity
> - DDL_LOG_REPLACE_ACTION:
> - Rename an entity after removing the previous entry with the
> - new name, that is replace this entry.
> - DDL_LOG_EXCHANGE_ACTION:
> - Exchange two entities by renaming them a -> tmp, b -> a, tmp -> b.
> */
> - DDL_LOG_DELETE_ACTION = 'd',
> - DDL_LOG_RENAME_ACTION = 'r',
> - DDL_LOG_REPLACE_ACTION = 's',
> - DDL_LOG_EXCHANGE_ACTION = 'e'
> + DDL_LOG_UNKNOWN_ACTION= 0,
> +
> + /* Delete a .frm file or a table in the partition engine */
> + DDL_LOG_DELETE_ACTION= 1,
> +
> + /* Rename a .frm fire a table in the partition engine */
> + DDL_LOG_RENAME_ACTION= 2,
> +
> + /*
> + Rename an entity after removing the previous entry with the
> + new name, that is replace this entry.
> + */
> + DDL_LOG_REPLACE_ACTION= 3,
> +
> + /* Exchange two entities by renaming them a -> tmp, b -> a, tmp -> b */
> + DDL_LOG_EXCHANGE_ACTION= 4,
> + /*
> + log do_rename(): Rename of .frm file, table, stat_tables and triggers
> + */
> + DDL_LOG_RENAME_TABLE_ACTION= 5,
> + DDL_LOG_RENAME_VIEW_ACTION= 6,
> + DDL_LOG_LAST_ACTION /* End marker */
> };
>
> +
> +/* Number of phases for each ddl_log_action_code */
> +extern const uchar ddl_log_entry_phases[DDL_LOG_LAST_ACTION];
> +
> +
> enum enum_ddl_log_exchange_phase {
> EXCH_PHASE_NAME_TO_TEMP= 0,
> EXCH_PHASE_FROM_TO_NAME= 1,
> EXCH_PHASE_TEMP_TO_FROM= 2
> };
>
> +enum enum_ddl_log_rename_table_phase {
> + DDL_RENAME_PHASE_TRIGGER= 0,
> + DDL_RENAME_PHASE_STAT,
> + DDL_RENAME_PHASE_TABLE,
> +};
> +
> +/*
> + Setting ddl_log_entry.phase to this has the same effect as setting
> + action_type to DDL_IGNORE_LOG_ENTRY_CODE
this doesn't make much sense. maybe you can add a line or two
about why it's needed and why one cannot simply set action_type
to DDL_IGNORE_LOG_ENTRY_CODE instead?
> +*/
> +
> +#define DDL_LOG_FINAL_PHASE ((uchar) 0xff)
>
> typedef struct st_ddl_log_entry
> {
> - const char *name;
> - const char *from_name;
> - const char *handler_name;
> - const char *tmp_name;
> + LEX_CSTRING name;
> + LEX_CSTRING from_name;
> + LEX_CSTRING handler_name;
> + LEX_CSTRING tmp_name;
> + LEX_CSTRING db;
> + LEX_CSTRING from_db;
> + LEX_CSTRING from_handler_name;
> + uchar uuid[MY_UUID_SIZE]; // UUID for new frm file
> +
> + ulonglong xid; // Xid stored in the binary log
> uint next_entry;
> - uint entry_pos;
> - enum ddl_log_entry_code entry_type;
> + uint entry_pos; // Set by write_dll_log_entry()
> + /*
> + unique_id can be used to store a unique number to check current state.
> + Currently it is used to store new size of frm file or link to a ddl log
> + entry.
is that so? I don't see it storing a file version, a new size of the frm,
or a link; I see it stores some kind of a counter to abort recovery when
it was repeated more than DDL_LOG_MAX_RETRY times
> + */
> + uint32 unique_id; // Unique id for file version
> + uint16 flags; // Flags unique for each command
> + enum ddl_log_entry_code entry_type; // Set automatically
> enum ddl_log_action_code action_type;
> /*
> Most actions have only one phase. REPLACE does however have two
> @@ -95,17 +145,52 @@ typedef struct st_ddl_log_memory_entry
> } DDL_LOG_MEMORY_ENTRY;
>
>
> +typedef struct st_ddl_log_state
> +{
> + /* List of ddl log entries */
> + DDL_LOG_MEMORY_ENTRY *list;
> + /* One execute entry per list */
> + DDL_LOG_MEMORY_ENTRY *execute_entry;
> +} DDL_LOG_STATE;
would be nice to have a small comment explaining what a "state"
is and how two lists can represent it.
is execute_entry a pointer to some element of the list?
> +
> +
> +/* These functions are for recovery */
> +bool ddl_log_initialize();
> +void ddl_log_release();
> +bool ddl_log_close_binlogged_events(HASH *xids);
> +int ddl_log_execute_recovery();
> +
> +/* functions for updating the ddl log */
> bool ddl_log_write_entry(DDL_LOG_ENTRY *ddl_log_entry,
> DDL_LOG_MEMORY_ENTRY **active_entry);
> +
> bool ddl_log_write_execute_entry(uint first_entry,
> - bool complete,
> - DDL_LOG_MEMORY_ENTRY **active_entry);
> -bool ddl_log_increment_phase(uint entry_no);
> + DDL_LOG_MEMORY_ENTRY **active_entry);
> +bool ddl_log_disable_execute_entry(DDL_LOG_MEMORY_ENTRY **active_entry);
> +
> +void ddl_log_complete(DDL_LOG_STATE *ddl_log_state);
> +void ddl_log_revert(THD *thd, DDL_LOG_STATE *ddl_log_state);
> +
> +bool ddl_log_update_phase(DDL_LOG_STATE *entry, uchar phase);
> +bool ddl_log_update_xid(DDL_LOG_STATE *state, ulonglong xid);
> +bool ddl_log_disable_entry(DDL_LOG_STATE *state);
> +bool ddl_log_increment_phase(uint entry_pos);
> void ddl_log_release_memory_entry(DDL_LOG_MEMORY_ENTRY *log_entry);
> bool ddl_log_sync();
> -void ddl_log_release();
> -void ddl_log_execute_recovery();
> bool ddl_log_execute_entry(THD *thd, uint first_entry);
>
> +void ddl_log_release_entries(DDL_LOG_STATE *ddl_log_state);
> +bool ddl_log_rename_table(THD *thd, DDL_LOG_STATE *ddl_state,
> + handlerton *hton,
> + const LEX_CSTRING *org_db,
> + const LEX_CSTRING *org_alias,
> + const LEX_CSTRING *new_db,
> + const LEX_CSTRING *new_alias);
> +bool ddl_log_rename_view(THD *thd, DDL_LOG_STATE *ddl_state,
> + const LEX_CSTRING *org_db,
> + const LEX_CSTRING *org_alias,
> + const LEX_CSTRING *new_db,
> + const LEX_CSTRING *new_alias);
> +
> extern mysql_mutex_t LOCK_gdl;
> #endif /* DDL_LOG_INCLUDED */
> diff --git a/sql/sql_rename.cc b/sql/sql_rename.cc
> index b7aed97a8a2..2b2f32cf126 100644
> --- a/sql/sql_rename.cc
> +++ b/sql/sql_rename.cc
> @@ -186,49 +172,43 @@ bool mysql_rename_tables(THD *thd, TABLE_LIST *table_list, bool silent,
> /* Add IF EXISTS to binary log */
> thd->variables.option_bits|= OPTION_IF_EXISTS;
> }
> +
> + debug_crash_here("ddl_log_rename_before_binlog");
> + /*
> + Store xid in ddl log and binary log so that we can check on ddl recovery
> + if the item is in the binary log (and thus the operation was complete
> + */
> + thd->binlog_xid= thd->query_id;
> + ddl_log_update_xid(&ddl_log_state, thd->binlog_xid);
> binlog_error= write_bin_log(thd, TRUE, thd->query(), thd->query_length());
I'd prefer to avoid blowing up THD with arbitrary flags
that are rarely needed, in one specific place in the code,
to pass some argument a few stack frames down.
Perhaps you can do instead something like
bool Query_log_event::write()
{
...
/* xid's is used with ddl_log handling */
if (ddl_log_entry_incomplete())
{
*start++= Q_XID;
int8store(start, thd->query_id);
start+= 8;
}
or at least you could replace ulonglong binlog_xid
with a one bit state, like thd->server_status & SERVER_STATUS_IN_DDL
or with some other bitmap of flags.
> + if (binlog_error)
> + error= 1;
> + thd->binlog_xid= 0;
> thd->variables.option_bits= save_option_bits;
> + debug_crash_here("ddl_log_rename_after_binlog");
>
> if (likely(!binlog_error))
> my_ok(thd);
> }
>
> if (likely(!error))
> + {
> query_cache_invalidate3(thd, table_list, 0);
> + ddl_log_complete(&ddl_log_state);
> + }
> + else
> + {
> + /* Revert the renames of normal tables with the help of the ddl log */
> + ddl_log_revert(thd, &ddl_log_state);
> + }
>
> err:
> DBUG_RETURN(error || binlog_error);
> }
>
>
> -/*
> - reverse table list
> -
> - SYNOPSIS
> - reverse_table_list()
> - table_list pointer to table _list
> -
> - RETURN
> - pointer to new (reversed) list
> -*/
> -static TABLE_LIST *reverse_table_list(TABLE_LIST *table_list)
> -{
> - TABLE_LIST *prev= 0;
> -
> - while (table_list)
> - {
> - TABLE_LIST *next= table_list->next_local;
> - table_list->next_local= prev;
> - prev= table_list;
> - table_list= next;
> - }
> - return (prev);
> -}
> -
> -
> static bool
> -do_rename_temporary(THD *thd, TABLE_LIST *ren_table, TABLE_LIST *new_table,
> - bool skip_error)
> +do_rename_temporary(THD *thd, TABLE_LIST *ren_table, TABLE_LIST *new_table)
I wonder why we never got any warnings about it
> {
> LEX_CSTRING *new_alias;
> DBUG_ENTER("do_rename_temporary");
> diff --git a/sql/ddl_log.cc b/sql/ddl_log.cc
> index bad786b3799..777dfdddbc7 100644
> --- a/sql/ddl_log.cc
> +++ b/sql/ddl_log.cc
> @@ -22,176 +22,333 @@
> #include "log.h" // sql_print_error()
> #include "ddl_log.h"
> #include "ha_partition.h" // PAR_EXT
> +#include "sql_table.h" // build_table_filename
> +#include "sql_statistics.h" // rename_table_in_stats_tables
> +#include "sql_view.h" // mysql_rename_view()
> +#include "strfunc.h" // strconvert
> +#include <mysys_err.h> // EE_LINK
>
>
> /*--------------------------------------------------------------------------
>
> MODULE: DDL log
> -----------------
>
> - This module is used to ensure that we can recover from crashes that occur
> - in the middle of a meta-data operation in MySQL. E.g. DROP TABLE t1, t2;
> - We need to ensure that both t1 and t2 are dropped and not only t1 and
> - also that each table drop is entirely done and not "half-baked".
> + This module is used to ensure that we can recover from crashes that
> + occur in the middle of a meta-data operation in MySQL. E.g. DROP
> + TABLE t1, t2; We need to ensure that both t1 and t2 are dropped and
> + not only t1 and also that each table drop is entirely done and not
> + "half-baked".
>
> - To support this we create log entries for each meta-data statement in the
> - ddl log while we are executing. These entries are dropped when the
> - operation is completed.
> + To support this we create log entries for each meta-data statement
> + in the ddl log while we are executing. These entries are dropped
> + when the operation is completed.
>
> At recovery those entries that were not completed will be executed.
>
> There is only one ddl log in the system and it is protected by a mutex
> and there is a global struct that contains information about its current
> state.
>
> + DDl recovery after a crash works the following way:
DDL or ddl
> +
> + - ddl_log_initialize() initializes the global global_ddl_log variable
> + and opens the binary log if it exists. If it doesn't exists a new one
> + is created.
it doesn't seem to be opening a binary log. Did you mean ddl log?
> + - ddl_log_close_binlogged_events() loops over all log events and checks if
> + their xid (stored in the EXECUTE_CODE event) is in the binary log. If xid
> + exists in the binary log the entry is marked as finished in the ddl log.
> + - After a new binary log is created and is open for new entries,
> + ddl_log_execute_recovery() is executed on remaining open events:
> + - Loop over all events
> + - For each entry with DDL_LOG_ENTRY_CODE execute the remaining phases
> + in ddl_log_execute_entry_no_lock()
> +
> + The ddl_log.log file is created at startup and deleted when server goes down.
> + After the final recovery phase is done, the file is truncated.
> +
> History:
> First version written in 2006 by Mikael Ronstrom
> - Second version written in 2020 by Monty
> + Second version in 2020 by Monty
> --------------------------------------------------------------------------*/
>
> +#define DDL_LOG_MAGIC_LENGTH 4
> +/* How many times to try to execute a ddl log entry that causes crashes */
> +#define DDL_LOG_MAX_RETRY 3
> +
> +uchar ddl_log_file_magic[]=
> +{ (uchar) 254, (uchar) 254, (uchar) 11, (uchar) 2 };
> +
> +/* Action names for ddl_log_action_code */
> +
> +const char *ddl_log_action_name[DDL_LOG_LAST_ACTION]=
> +{
> + "Unknown", "partitioning delete", "partitioning rename",
> + "partitioning replace", "partitioning exchange",
> + "rename table", "rename view"
> +};
> +
> +/* Number of phases per entry */
> +const uchar ddl_log_entry_phases[DDL_LOG_LAST_ACTION]=
> +{
> + 1, 1, 2, 3, 4, 1
> +};
> +
> +
> struct st_global_ddl_log
> {
> - /*
> - We need to adjust buffer size to be able to handle downgrades/upgrades
> - where IO_SIZE has changed. We'll set the buffer size such that we can
> - handle that the buffer size was upto 4 times bigger in the version
> - that wrote the DDL log.
> - */
> - char file_entry_buf[4*IO_SIZE];
> - char file_name_str[FN_REFLEN];
> - char *file_name;
> + uchar *file_entry_buf;
> DDL_LOG_MEMORY_ENTRY *first_free;
> DDL_LOG_MEMORY_ENTRY *first_used;
> File file_id;
> - uint name_len;
> uint num_entries;
> + uint name_pos;
> uint io_size;
> - bool inited;
> - bool do_release;
> - bool recovery_phase;
> - st_global_ddl_log() : inited(false), do_release(false) {}
> + bool initialized;
> + bool open;
> };
>
> st_global_ddl_log global_ddl_log;
static?
>
> mysql_mutex_t LOCK_gdl;
>
> +/* Positions to different data in a ddl log block */
> #define DDL_LOG_ENTRY_TYPE_POS 0
> +/*
> + Note that ACTION_TYPE and PHASE_POS must be after each other.
> + See update_phase()
> +*/
> #define DDL_LOG_ACTION_TYPE_POS 1
> #define DDL_LOG_PHASE_POS 2
> #define DDL_LOG_NEXT_ENTRY_POS 4
> -#define DDL_LOG_NAME_POS 8
> +/* Flags to remember something unique about the query, like if .frm was used */
> +#define DDL_LOG_FLAG_POS 8
> +/* Used to store XID entry that was written to binary log */
> +#define DDL_LOG_XID_POS 10
> +/* Used to store unique uuid from the .frm file */
"unique uuid" is a weird choice of words
> +#define DDL_LOG_UUID_POS 18
> +/* ID_POS can be used to store something unique, like file size (4 bytes) */
> +#define DDL_LOG_ID_POS DDL_LOG_UUID_POS + MY_UUID_SIZE
> +#define DDL_LOG_END_POS DDL_LOG_ID_POS + 8
>
> -#define DDL_LOG_NUM_ENTRY_POS 0
> -#define DDL_LOG_NAME_LEN_POS 4
> -#define DDL_LOG_IO_SIZE_POS 8
> +/*
> + Position to where names are stored in the ddl log blocks. The current
> + value is stored in the header and can thus be changed if we need more
> + space for constants in the header than what is between DDL_LOG_ID_POS and
> + DDL_LOG_TMP_NAME_POS.
> +*/
> +#define DDL_LOG_TMP_NAME_POS 56
>
> +/* Definitions for the ddl log header, the first block in the file */
> +/* IO_SIZE is stored in the header and can thus be changed */
> +#define DDL_LOG_IO_SIZE IO_SIZE
> +
> +/* Header is stored in positions 0-3 */
> +#define DDL_LOG_IO_SIZE_POS 4
> +#define DDL_LOG_NAME_OFFSET_POS 6
> +/* Sum of the above variables */
> +#define DDL_LOG_HEADER_SIZE 4+2+2
> +
> /**
> + Sync the ddl log file.
> +
> + @return Operation status
> + @retval FALSE Success
> + @retval TRUE Error
> +*/
> +
> +static bool ddl_log_sync_file()
> +{
> + DBUG_ENTER("ddl_log_sync_file");
> + DBUG_RETURN(mysql_file_sync(global_ddl_log.file_id, MYF(MY_WME)));
> +}
I don't quite understand that, are there cases when you want to
sync the ddl log and you *don't know* whether LOCK_gdl is locked or not?
because if you know there should be no lock, you'd better use
mysql_mutex_assert_not_owner here.
> +
> +/* Same as above, but ensure we have the LOCK_gdb locked */
that's an interesting Freudian slip
> +
> +static bool ddl_log_sync_no_lock()
> +{
> + DBUG_ENTER("ddl_log_sync_no_lock");
> +
> + mysql_mutex_assert_owner(&LOCK_gdl);
> + DBUG_RETURN(ddl_log_sync_file());
> +}
...
> +/*
> + Store and read strings in ddl log buffers
> +
> + Format is:
> + 2 byte: length (not counting end \0)
> + X byte: string value of length 'length'
> + 1 byte: \0
a bit redundant, don't you think? Why did you do it?
> +*/
> +
> +static uchar *store_string(uchar *pos, uchar *end, const LEX_CSTRING *str)
> {
> + uint32 length= (uint32) str->length;
> + if (unlikely(pos + 2 + length + 1 > end))
> + {
> + DBUG_ASSERT(0);
> + return end; // Overflow
> }
> +
> + int2store(pos, length);
> + if (likely(length))
> + memcpy(pos+2, str->str, length);
> + pos[2+length]= 0; // Store end \0
> + return pos + 2 + length +1;
> +}
> +
> +
> +static LEX_CSTRING get_string(uchar **pos, const uchar *end)
> {
> + LEX_CSTRING tmp;
> + uint32 length;
> + if (likely(*pos + 3 <= end))
> + {
> + length= uint2korr(*pos);
> + if (likely(*pos + 2 + length + 1 <= end))
> + {
> + char *str= (char*) *pos+2;
> + *pos= *pos + 2 + length + 1;
> + tmp.str= str;
> + tmp.length= length;
> + return tmp;
> }
> + }
> + /*
> + Overflow on read, should never happen
> + Set *pos to end to ensure any future calls also returns empty string
> + */
> + DBUG_ASSERT(0);
> + *pos= (uchar*) end;
> + tmp.str= "";
> + tmp.length= 0;
> + return tmp;
> }
>
>
> @@ -256,38 +484,32 @@ static uint read_ddl_log_header()
>
> static void set_global_from_ddl_log_entry(const DDL_LOG_ENTRY *ddl_log_entry)
it'd be good to have some comment explaining why you need this
function. Couldn't you write directly from the ddl_log_entry?
or use a local buffer? why serialize into a buffer inside global_ddl_log?
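E.g. a local buffer, something like this sketch (the helper name is made
up; assumes the serialization is refactored to take an explicit target
buffer and that io_size fits on the stack):

static bool ddl_log_write_entry_local(const DDL_LOG_ENTRY *entry,
                                      uint entry_pos)
{
  uchar buf[DDL_LOG_IO_SIZE];               // no global state touched
  set_buf_from_ddl_log_entry(buf, entry);   // hypothetical refactoring
  return mysql_file_pwrite(global_ddl_log.file_id, buf, sizeof(buf),
                           global_ddl_log.io_size * entry_pos,
                           MYF(MY_WME | MY_NABP)) != 0;
}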
> {
> + uchar *file_entry_buf= global_ddl_log.file_entry_buf, *pos, *end;
> +
> mysql_mutex_assert_owner(&LOCK_gdl);
it would be logical to have LOCK_gdl inside the global_ddl_log struct
> - global_ddl_log.file_entry_buf[DDL_LOG_ENTRY_TYPE_POS]=
> - (char)DDL_LOG_ENTRY_CODE;
> - global_ddl_log.file_entry_buf[DDL_LOG_ACTION_TYPE_POS]=
> - (char)ddl_log_entry->action_type;
> - global_ddl_log.file_entry_buf[DDL_LOG_PHASE_POS]= 0;
> - int4store(&global_ddl_log.file_entry_buf[DDL_LOG_NEXT_ENTRY_POS],
> - ddl_log_entry->next_entry);
> - DBUG_ASSERT(strlen(ddl_log_entry->name) < FN_REFLEN);
> - strmake(&global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS],
> - ddl_log_entry->name, FN_REFLEN - 1);
> - if (ddl_log_entry->action_type == DDL_LOG_RENAME_ACTION ||
> - ddl_log_entry->action_type == DDL_LOG_REPLACE_ACTION ||
> - ddl_log_entry->action_type == DDL_LOG_EXCHANGE_ACTION)
> - {
> - DBUG_ASSERT(strlen(ddl_log_entry->from_name) < FN_REFLEN);
> - strmake(&global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS + FN_REFLEN],
> - ddl_log_entry->from_name, FN_REFLEN - 1);
> - }
> - else
> - global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS + FN_REFLEN]= 0;
> - DBUG_ASSERT(strlen(ddl_log_entry->handler_name) < FN_REFLEN);
> - strmake(&global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS + (2*FN_REFLEN)],
> - ddl_log_entry->handler_name, FN_REFLEN - 1);
> - if (ddl_log_entry->action_type == DDL_LOG_EXCHANGE_ACTION)
> - {
> - DBUG_ASSERT(strlen(ddl_log_entry->tmp_name) < FN_REFLEN);
> - strmake(&global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS + (3*FN_REFLEN)],
> - ddl_log_entry->tmp_name, FN_REFLEN - 1);
> - }
> - else
> - global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS + (3*FN_REFLEN)]= 0;
> +
> + file_entry_buf[DDL_LOG_ENTRY_TYPE_POS]= (uchar) ddl_log_entry->entry_type;
> + file_entry_buf[DDL_LOG_ACTION_TYPE_POS]= (uchar) ddl_log_entry->action_type;
> + file_entry_buf[DDL_LOG_PHASE_POS]= (uchar) ddl_log_entry->phase;
> + int4store(file_entry_buf+DDL_LOG_NEXT_ENTRY_POS, ddl_log_entry->next_entry);
> + int2store(file_entry_buf+DDL_LOG_FLAG_POS, ddl_log_entry->flags);
> + int8store(file_entry_buf+DDL_LOG_XID_POS, ddl_log_entry->xid);
> + memcpy(file_entry_buf+DDL_LOG_UUID_POS, ddl_log_entry->uuid, MY_UUID_SIZE);
> + int8store(file_entry_buf+DDL_LOG_ID_POS, ddl_log_entry->unique_id);
> + bzero(file_entry_buf+DDL_LOG_END_POS,
> + global_ddl_log.name_pos - DDL_LOG_END_POS);
> +
> + pos= file_entry_buf + global_ddl_log.name_pos;
> + end= file_entry_buf + global_ddl_log.io_size;
> +
> + pos= store_string(pos, end, &ddl_log_entry->handler_name);
> + pos= store_string(pos, end, &ddl_log_entry->db);
> + pos= store_string(pos, end, &ddl_log_entry->name);
> + pos= store_string(pos, end, &ddl_log_entry->from_handler_name);
> + pos= store_string(pos, end, &ddl_log_entry->from_db);
> + pos= store_string(pos, end, &ddl_log_entry->from_name);
> + pos= store_string(pos, end, &ddl_log_entry->tmp_name);
you have DBUG_ASSERT(0) that all those strings won't overflow io_size.
How can you guarantee that?
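You could at least refuse to write the entry instead of just asserting,
e.g. (a sketch; 3 bytes per string for the length bytes and the \0):

size_t needed= global_ddl_log.name_pos + 7 * 3 +
               ddl_log_entry->handler_name.length +
               ddl_log_entry->db.length +
               ddl_log_entry->name.length +
               ddl_log_entry->from_handler_name.length +
               ddl_log_entry->from_db.length +
               ddl_log_entry->from_name.length +
               ddl_log_entry->tmp_name.length;
if (needed > global_ddl_log.io_size)
  return TRUE;                              // report, don't truncate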
> + bzero(pos, global_ddl_log.io_size - (pos - file_entry_buf));
> }
>
>
> @@ -442,63 +691,120 @@ static bool ddl_log_sync_no_lock()
...
> +
> +
> +/*
> + Build a filename for a table, trigger file or .frm
> + Delete also any temporary file suffixed with ~
Where do you get these *~ files from? What creates them?
> +*/
> +
> +static void build_filename_and_delete_tmp_file(char *path, size_t path_length,
> + const LEX_CSTRING *db,
> + const LEX_CSTRING *name,
> + const char *ext,
> + PSI_file_key psi_key)
> +{
> + uint length= build_table_filename(path, path_length-1,
> + db->str, name->str, ext, 0);
> +
> + path[length]= '~';
> + path[length+1]= 0;
> + (void) mysql_file_delete(psi_key, path, MYF(0));
> + path[length]= 0;
> }
>
>
> @@ -806,154 +1296,132 @@ static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry)
> bool ddl_log_write_entry(DDL_LOG_ENTRY *ddl_log_entry,
> DDL_LOG_MEMORY_ENTRY **active_entry)
> {
> - bool error, write_header;
> + bool error;
> + uchar *pos, *end;
> DBUG_ENTER("ddl_log_write_entry");
>
> mysql_mutex_assert_owner(&LOCK_gdl);
> - if (init_ddl_log())
> + if (!global_ddl_log.open)
> DBUG_RETURN(TRUE);
> +
> + ddl_log_entry->entry_type= DDL_LOG_ENTRY_CODE;
> set_global_from_ddl_log_entry(ddl_log_entry);
> - if (get_free_ddl_log_entry(active_entry, &write_header))
> + if (ddl_log_get_free_entry(active_entry))
> DBUG_RETURN(TRUE);
> +
> error= FALSE;
> + pos= global_ddl_log.file_entry_buf + global_ddl_log.name_pos;
> + end= global_ddl_log.file_entry_buf + global_ddl_log.io_size;
> DBUG_PRINT("ddl_log",
> - ("write type %c next %u name '%s' from_name '%s' handler '%s'"
> - " tmp_name '%s'",
> + ("type: %c next: %u handler: %s "
> + "to_name: '%s.%s' from_name: '%s.%s' "
> + "tmp_name: '%s'",
> (char) global_ddl_log.file_entry_buf[DDL_LOG_ACTION_TYPE_POS],
> ddl_log_entry->next_entry,
> - (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS],
> - (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS
> - + FN_REFLEN],
> - (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS
> - + (2*FN_REFLEN)],
> - (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS
> - + (3*FN_REFLEN)]));
> + get_string(&pos, end).str, // Handler
> + get_string(&pos, end).str, // to db.table
> + get_string(&pos, end).str,
> + get_string(&pos, end).str, // From db.table
> + get_string(&pos, end).str,
> + get_string(&pos, end).str)); // Tmp name
This is not a critical bug, because it's in a DBUG_PRINT,
but it's still a bug. The C and C++ standards both say that
the order of evaluation of function arguments is *unspecified*.
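To get the strings printed in the intended order you'd have to read them
into locals first, e.g. (a sketch):

LEX_CSTRING handler=   get_string(&pos, end);
LEX_CSTRING to_db=     get_string(&pos, end);
LEX_CSTRING to_name=   get_string(&pos, end);
LEX_CSTRING from_db=   get_string(&pos, end);
LEX_CSTRING from_name= get_string(&pos, end);
LEX_CSTRING tmp=       get_string(&pos, end);
DBUG_PRINT("ddl_log",
           ("type: %c next: %u handler: %s "
            "to_name: '%s.%s' from_name: '%s.%s' tmp_name: '%s'",
            (char) global_ddl_log.file_entry_buf[DDL_LOG_ACTION_TYPE_POS],
            ddl_log_entry->next_entry,
            handler.str, to_db.str, to_name.str,
            from_db.str, from_name.str, tmp.str));

And note that set_global_from_ddl_log_entry() stores from_handler_name
between the two name pairs, so six sequential get_string() calls don't
line up with this format string anyway.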
> +
> if (unlikely(write_ddl_log_file_entry((*active_entry)->entry_pos)))
> {
> + ddl_log_release_memory_entry(*active_entry);
> + *active_entry= 0;
> error= TRUE;
> - sql_print_error("Failed to write entry_no = %u",
> + sql_print_error("DDL_LOG: Failed to write entry %u",
> (*active_entry)->entry_pos);
> }
> - if (write_header && likely(!error))
> - {
> - (void) ddl_log_sync_no_lock();
> - if (write_ddl_log_header())
> - error= TRUE;
> - }
> - if (unlikely(error))
> - ddl_log_release_memory_entry(*active_entry);
> DBUG_RETURN(error);
> }
>
>
> /**
> - @brief Write final entry in the ddl log.
> + @brief Write or update execute entry in the ddl log.
>
> - @details This is the last write in the ddl log. The previous log entries
> - have already been written but not yet synched to disk.
> - We write a couple of log entries that describes action to perform.
> - This entries are set-up in a linked list, however only when a first
> - execute entry is put as the first entry these will be executed.
> - This routine writes this first.
> + @details An execute entry points to the first entry that should
> + be excuted during recovery. In some cases it's only written once,
> + in other cases it's updated for each log entry to point to the new
> + header for the list.
>
> + When called, the previous log entries have already been written but not yet
> + synched to disk. We write a couple of log entries that describe the
> + action to perform. These entries are set up in a linked list;
> + however, only when an execute entry is put as the first entry will
> + they be executed during recovery.
> +
> @param first_entry First entry in linked list of entries
> - to execute, if 0 = NULL it means that
> - the entry is removed and the entries
> - are put into the free list.
> - @param complete Flag indicating we are simply writing
> - info about that entry has been completed
> + to execute.
> @param[in,out] active_entry Entry to execute, 0 = NULL if the entry
> is written first time and needs to be
> returned. In this case the entry written
> is returned in this parameter
> -
> @return Operation status
> @retval TRUE Error
> @retval FALSE Success
> */
>
> bool ddl_log_write_execute_entry(uint first_entry,
> - bool complete,
> DDL_LOG_MEMORY_ENTRY **active_entry)
> {
> - bool write_header= FALSE;
> - char *file_entry_buf= (char*)global_ddl_log.file_entry_buf;
> + uchar *file_entry_buf= global_ddl_log.file_entry_buf;
> + bool got_free_entry= 0;
> DBUG_ENTER("ddl_log_write_execute_entry");
>
> mysql_mutex_assert_owner(&LOCK_gdl);
> - if (init_ddl_log())
> - {
> - DBUG_RETURN(TRUE);
> - }
> - if (!complete)
> - {
> /*
> We haven't synched the log entries yet, we synch them now before
> writing the execute entry. If complete is true we haven't written
there's no 'complete' anymore
> any log entries before, we are only here to write the execute
> entry to indicate it is done.
> */
> (void) ddl_log_sync_no_lock();
Why do you need to sync previous log entries? As far as I understand,
they're useless without DDL_LOG_EXECUTE_CODE entry, so why would it matter
whether they're on disk or not after a crash?
> - file_entry_buf[DDL_LOG_ENTRY_TYPE_POS]= (char)DDL_LOG_EXECUTE_CODE;
> - }
> - else
> - file_entry_buf[DDL_LOG_ENTRY_TYPE_POS]= (char)DDL_IGNORE_LOG_ENTRY_CODE;
> - file_entry_buf[DDL_LOG_ACTION_TYPE_POS]= 0; /* Ignored for execute entries */
> - file_entry_buf[DDL_LOG_PHASE_POS]= 0;
> - int4store(&file_entry_buf[DDL_LOG_NEXT_ENTRY_POS], first_entry);
> - file_entry_buf[DDL_LOG_NAME_POS]= 0;
> - file_entry_buf[DDL_LOG_NAME_POS + FN_REFLEN]= 0;
> - file_entry_buf[DDL_LOG_NAME_POS + 2*FN_REFLEN]= 0;
> + bzero(file_entry_buf, global_ddl_log.io_size);
> +
> + file_entry_buf[DDL_LOG_ENTRY_TYPE_POS]= (uchar)DDL_LOG_EXECUTE_CODE;
> + int4store(file_entry_buf + DDL_LOG_NEXT_ENTRY_POS, first_entry);
> +
> if (!(*active_entry))
> {
> - if (get_free_ddl_log_entry(active_entry, &write_header))
> - {
> + if (ddl_log_get_free_entry(active_entry))
> DBUG_RETURN(TRUE);
> + got_free_entry= TRUE;
> }
> - write_header= TRUE;
> - }
> if (write_ddl_log_file_entry((*active_entry)->entry_pos))
> {
> - sql_print_error("Error writing execute entry in ddl log");
> - ddl_log_release_memory_entry(*active_entry);
> - DBUG_RETURN(TRUE);
> - }
> - (void) ddl_log_sync_no_lock();
> - if (write_header)
> + if (got_free_entry)
> {
> - if (write_ddl_log_header())
> - {
> ddl_log_release_memory_entry(*active_entry);
> + *active_entry= 0;
> + }
> + sql_print_error("DDL_LOG: Error writing execute entry %u",
> + (*active_entry)->entry_pos);
> DBUG_RETURN(TRUE);
> }
> - }
> + (void) ddl_log_sync_no_lock();
> DBUG_RETURN(FALSE);
> }
>
> -
> /**
> - Deactivate an individual entry.
> + Increment phase for entry. Will deactivate the entry after all phases are done
>
> @details see ddl_log_increment_phase_no_lock.
>
> - @param entry_no Entry position of record to change
> + @param entry_pos Entry position of record to change
>
> @return Operation status
> @retval TRUE Error
> @retval FALSE Success
> */
>
> -bool ddl_log_increment_phase(uint entry_no)
> +bool ddl_log_increment_phase(uint entry_pos)
> {
> bool error;
> DBUG_ENTER("ddl_log_increment_phase");
>
> mysql_mutex_lock(&LOCK_gdl);
> - error= ddl_log_increment_phase_no_lock(entry_no);
> + error= ddl_log_increment_phase_no_lock(entry_pos);
> mysql_mutex_unlock(&LOCK_gdl);
> DBUG_RETURN(error);
> }
> @@ -1042,87 +1484,157 @@ static void close_ddl_log()
> (void) mysql_file_close(global_ddl_log.file_id, MYF(MY_WME));
> global_ddl_log.file_id= (File) -1;
> }
> + global_ddl_log.open= 0;
> DBUG_VOID_RETURN;
> }
>
>
> /**
> + Loop over ddl log execute entries and mark those that are already stored
> + in the binary log as completed
> +
> + @return
> + @retval 0 ok
> + @retval 1 fail (write error)
> +
> +*/
> +
> +bool ddl_log_close_binlogged_events(HASH *xids)
> +{
> + uint i;
> + DDL_LOG_ENTRY ddl_log_entry;
> + DBUG_ENTER("ddl_log_close_binlogged_events");
> +
> + if (global_ddl_log.num_entries == 0 || xids->records == 0)
> + DBUG_RETURN(0);
> +
> + mysql_mutex_lock(&LOCK_gdl);
> + for (i= 1; i <= global_ddl_log.num_entries; i++)
> + {
> + if (read_ddl_log_entry(i, &ddl_log_entry))
> + break; // Read error. Ignore
it's not really "ignore", it's more like "stop reading"
> + if (ddl_log_entry.entry_type == DDL_LOG_EXECUTE_CODE &&
> + ddl_log_entry.xid != 0 &&
> + my_hash_search(xids, (uchar*) &ddl_log_entry.xid,
> + sizeof(ddl_log_entry.xid)))
> + {
> + if (disable_execute_entry(i))
> + {
> + mysql_mutex_unlock(&LOCK_gdl);
> + DBUG_RETURN(1); // Write error. Fatal!
> + }
> + }
> + }
> + (void) ddl_log_sync_no_lock();
> + mysql_mutex_unlock(&LOCK_gdl);
> + DBUG_RETURN(0);
> +}
> +
> +
> +/**
> Execute the ddl log at recovery of MySQL Server.
> +
> + @return
> + @retval 0 Ok.
> + @retval > 0 Fatal error. We have to abort (can't create ddl log)
> + @retval < 0 Recovery failed, but new log exists and is usable
> +
> */
>
> -void ddl_log_execute_recovery()
> +int ddl_log_execute_recovery()
> {
> - uint num_entries, i;
> + uint i, count= 0;
> + int error= 0;
> THD *thd;
> DDL_LOG_ENTRY ddl_log_entry;
> - char file_name[FN_REFLEN];
> static char recover_query_string[]= "INTERNAL DDL LOG RECOVER IN PROGRESS";
> DBUG_ENTER("ddl_log_execute_recovery");
>
> - /*
> - Initialise global_ddl_log struct
> - */
> - bzero(global_ddl_log.file_entry_buf, sizeof(global_ddl_log.file_entry_buf));
> - global_ddl_log.inited= FALSE;
> - global_ddl_log.recovery_phase= TRUE;
> - global_ddl_log.io_size= IO_SIZE;
> - global_ddl_log.file_id= (File) -1;
> + if (global_ddl_log.num_entries == 0)
> + DBUG_RETURN(0);
>
> /*
> To be able to run this from boot, we allocate a temporary THD
> */
> if (!(thd=new THD(0)))
> - DBUG_VOID_RETURN;
> + {
> + DBUG_ASSERT(0); // Fatal error
> + DBUG_RETURN(1);
> + }
> thd->thread_stack= (char*) &thd;
> thd->store_globals();
> + thd->log_all_errors= (global_system_variables.log_warnings >= 3);
>
> thd->set_query(recover_query_string, strlen(recover_query_string));
>
> - /* this also initialize LOCK_gdl */
> - num_entries= read_ddl_log_header();
> mysql_mutex_lock(&LOCK_gdl);
> - for (i= 1; i < num_entries + 1; i++)
> + for (i= 1; i <= global_ddl_log.num_entries; i++)
> {
> if (read_ddl_log_entry(i, &ddl_log_entry))
> {
> - sql_print_error("Failed to read entry no = %u from ddl log",
> - i);
> + error= -1;
> continue;
> }
> if (ddl_log_entry.entry_type == DDL_LOG_EXECUTE_CODE)
> {
> + /* purecov: begin tested */
really? you tested with purify?
> + if (ddl_log_entry.unique_id > DDL_LOG_MAX_RETRY)
> + {
> + error= -1;
> + continue;
> + }
> + update_unique_id(i, ++ddl_log_entry.unique_id);
> + if (ddl_log_entry.unique_id > DDL_LOG_MAX_RETRY)
> + {
> + sql_print_error("DDL_LOG: Aborting executing entry %u after %llu "
> + "retries", i, ddl_log_entry.unique_id);
> + error= -1;
> + continue;
> + }
> + /* purecov: end tested */
> if (ddl_log_execute_entry_no_lock(thd, ddl_log_entry.next_entry))
> {
> - /* Real unpleasant scenario but we continue anyways. */
> + /* Real unpleasant scenario but we have to continue anyway */
why can this happen? do you have an example?
besides hardware failure, like a disk cannot be read
> + error= -1;
> continue;
> }
> + count++;
> }
> }
> close_ddl_log();
> - create_ddl_log_file_name(file_name);
> - (void) mysql_file_delete(key_file_global_ddl_log, file_name, MYF(0));
> - global_ddl_log.recovery_phase= FALSE;
> mysql_mutex_unlock(&LOCK_gdl);
> thd->reset_query();
> delete thd;
> - DBUG_VOID_RETURN;
> +
> + /*
> + Create a new ddl_log to get rid of old stuff and ensure that header matches
> + the current source version
> + */
because here you delete/overwrite the old ddl log, it'd make sense
to log all details of every entry that you failed to execute but decided
to continue anyway. Now you log only the entry number on failure, which is
rather useless, because it's a number in a ddl log that you overwrite.
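E.g. something like this (a sketch) from inside ddl_log_execute_action(),
where the per-entry details are known:

sql_print_error("DDL_LOG: Failed to execute entry %u: action: %s "
                "phase: %u db: '%s' name: '%s'",
                entry_pos,
                ddl_log_action_name[ddl_log_entry->action_type],
                ddl_log_entry->phase,
                ddl_log_entry->db.str, ddl_log_entry->name.str);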
> + if (create_ddl_log())
> + DBUG_RETURN(1);
> + if (count > 0)
> + sql_print_information("DDL_LOG: Crash recovery executed %u entries",
> + count);
> + DBUG_RETURN(error);
> }
>
>
> /**
> - Release all memory allocated to the ddl log.
> + Release all memory allocated to the ddl log and delete the ddl log
> */
>
> void ddl_log_release()
> {
> + char file_name[FN_REFLEN];
> DDL_LOG_MEMORY_ENTRY *free_list;
> DDL_LOG_MEMORY_ENTRY *used_list;
> DBUG_ENTER("ddl_log_release");
>
> - if (!global_ddl_log.do_release)
> + if (!global_ddl_log.initialized)
> DBUG_VOID_RETURN;
>
> - mysql_mutex_lock(&LOCK_gdl);
> + global_ddl_log.initialized= 0;
> +
> free_list= global_ddl_log.first_free;
> used_list= global_ddl_log.first_used;
> while (used_list)
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
Re: [Maria-developers] c80f867b0df: MDEV-23844 Atomic DROP TABLE
by Sergei Golubchik 16 May '21
Hi, Michael!
On May 12, Michael Widenius wrote:
> revision-id: c80f867b0df (mariadb-10.6.0-78-gc80f867b0df)
> parent(s): eebebe8ef75
> author: Michael Widenius <michael.widenius(a)gmail.com>
> committer: Michael Widenius <michael.widenius(a)gmail.com>
> timestamp: 2021-05-04 21:09:11 +0300
> message:
>
> MDEV-23844 Atomic DROP TABLE
...
> diff --git a/sql/sql_view.cc b/sql/sql_view.cc
> index e6d726b30d7..ff8aa55adef 100644
> --- a/sql/sql_view.cc
> +++ b/sql/sql_view.cc
> @@ -1861,8 +1868,18 @@ bool mysql_drop_view(THD *thd, TABLE_LIST *views, enum_drop_mode drop_mode)
> not_exists_count++;
> continue;
> }
> + if (!first_table++)
yuck. so your "first_table" actually means "not_first_view",
because when it's FALSE, it's the first view.
why not rename it to "not_first_view"?
> + {
> + if (ddl_log_drop_view_init(thd, &ddl_log_state))
> + DBUG_RETURN(TRUE);
what about a crash here? the first entry is written, the second is not.
> + }
> + if (ddl_log_drop_view(thd, &ddl_log_state, &cpath, &view->db,
> + &view->table_name))
> + DBUG_RETURN(TRUE);
> + debug_crash_here("ddl_log_drop_before_delete_view");
> if (unlikely(mysql_file_delete(key_file_frm, path, MYF(MY_WME))))
> delete_error= TRUE;
> + debug_crash_here("ddl_log_drop_after_delete_view");
>
> some_views_deleted= TRUE;
>
> diff --git a/sql/log.cc b/sql/log.cc
> index 6e0a9706ec8..dba999054e4 100644
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -10471,16 +10483,16 @@ int TC_LOG_BINLOG::recover(LOG_INFO *linfo, const char *last_log_name,
> bool last_gtid_standalone= false;
> bool last_gtid_valid= false;
> #endif
> + DBUG_ENTER("TC_LOG_BINLOG::recover");
>
> if (! fdle->is_valid() ||
> - (do_xa && my_hash_init(key_memory_binlog_recover_exec, &xids,
> - &my_charset_bin, TC_LOG_PAGE_SIZE/3, 0,
> - sizeof(my_xid), 0, 0, MYF(0))))
> + (my_hash_init(key_memory_binlog_recover_exec, &xids,
> + &my_charset_bin, TC_LOG_PAGE_SIZE/3, 0,
> + sizeof(my_xid), 0, 0, MYF(0))))
> goto err1;
>
> - if (do_xa)
> - init_alloc_root(key_memory_binlog_recover_exec, &mem_root,
> - TC_LOG_PAGE_SIZE, TC_LOG_PAGE_SIZE, MYF(0));
> + init_alloc_root(key_memory_binlog_recover_exec, &mem_root,
> + TC_LOG_PAGE_SIZE, TC_LOG_PAGE_SIZE, MYF(0));
I don't think these two changes are correct. do_xa is TRUE on crash recovery
and FALSE on gtid state recovery on 5.5->10.0 upgrade or if gtid state
file was deleted. There is no point to initialize the xid hash or mem_root,
as it's not a crash recovery and there can be no incomplete transactions
or ddl statements.
Yes, I've seen the commit comment:
> - TC_LOG_BINLOG::recover() was changed to always collect Xid's from the
> binary log and always call ddl_log_close_binlogged_events(). This was
> needed to be able to collect DROP TABLE events with embedded Xid's
> (used by ddl log).
but I cannot understand how it applies here.
> fdle->flags&= ~LOG_EVENT_BINLOG_IN_USE_F; // abort on the first error
>
> diff --git a/sql/sql_table.cc b/sql/sql_table.cc
> index 7a88ffdfd1b..0a83396ae14 100644
> --- a/sql/sql_table.cc
> +++ b/sql/sql_table.cc
> @@ -1363,6 +1373,17 @@ int mysql_rm_table_no_locks(THD *thd, TABLE_LIST *tables, bool if_exists,
>
> thd->replication_flags= 0;
> was_view= table_type == TABLE_TYPE_VIEW;
> +
> + if (!first_table++)
first_table -> not_first_table
> + {
> + LEX_CSTRING comment= {comment_start, (size_t) comment_len};
> + if (ddl_log_drop_table_init(thd, &ddl_log_state, &comment))
> + {
> + error= 1;
> + goto err;
> + }
> + }
> +
> if ((table_type == TABLE_TYPE_UNKNOWN) || (was_view && !drop_view) ||
> (table_type != TABLE_TYPE_SEQUENCE && drop_sequence))
> {
> @@ -1565,6 +1611,7 @@ int mysql_rm_table_no_locks(THD *thd, TABLE_LIST *tables, bool if_exists,
> table_name.str, (uint)table_name.length);
> mysql_audit_drop_table(thd, table);
> }
> + ddl_log_update_phase(&ddl_log_state, DDL_DROP_PHASE_COLLECT);
better call it DDL_DROP_PHASE_BINLOG. It would've saved me ~10 minutes
reading the code back and forth, as "DDL_DROP_PHASE_BINLOG" immediately
explains what this phase is for.
>
> if (!dont_log_query &&
> (!error || table_dropped || non_existing_table_error(error)))
> diff --git a/sql/ddl_log.h b/sql/ddl_log.h
> index 316e6708f22..b843404fc3c 100644
> --- a/sql/ddl_log.h
> +++ b/sql/ddl_log.h
> @@ -97,6 +101,14 @@ enum enum_ddl_log_rename_table_phase {
> DDL_RENAME_PHASE_TABLE,
> };
>
> +enum enum_ddl_log_drop_table_phase {
> + DDL_DROP_PHASE_TABLE=0,
> + DDL_DROP_PHASE_TRIGGER,
no phase for deleting data from stat tables?
> + DDL_DROP_PHASE_COLLECT,
> + DDL_DROP_PHASE_RESET, /* Reset found list of dropped tables */
this one is unused
(also unused in further patches, I've checked)
> + DDL_DROP_PHASE_END
> +};
> +
> /*
> Setting ddl_log_entry.phase to this has the same effect as setting
> action_type to DDL_IGNORE_LOG_ENTRY_CODE
> diff --git a/sql/ddl_log.cc b/sql/ddl_log.cc
> index 777dfdddbc7..6145b2f7318 100644
> --- a/sql/ddl_log.cc
> +++ b/sql/ddl_log.cc
> @@ -109,6 +112,7 @@ struct st_global_ddl_log
> };
>
> st_global_ddl_log global_ddl_log;
> +String ddl_drop_query; // Used during startup recovery
static
>
> mysql_mutex_t LOCK_gdl;
>
> @@ -1132,6 +1149,122 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,
> (void) update_phase(entry_pos, DDL_LOG_FINAL_PHASE);
> }
> break;
> + case DDL_LOG_DROP_TABLE_INIT_ACTION:
> + {
> + LEX_CSTRING *comment= &ddl_log_entry->tmp_name;
> + ddl_drop_query.length(0);
> + ddl_drop_query.set_charset(system_charset_info);
> + ddl_drop_query.append(STRING_WITH_LEN("DROP TABLE IF EXISTS "));
> + if (comment->length)
> + {
> + ddl_drop_query.append(comment);
> + ddl_drop_query.append(' ');
> + }
> + /* We don't increment phase as we want to retry this in case of crash */
> + break;
> + }
> + case DDL_LOG_DROP_TABLE_ACTION:
> + {
> + LEX_CSTRING db, table, path;
> + db= ddl_log_entry->db;
> + table= ddl_log_entry->name;
> + /* Note that path is without .frm extension */
> + path= ddl_log_entry->tmp_name;
> +
> + switch (ddl_log_entry->phase) {
> + case DDL_DROP_PHASE_TABLE:
> + if (hton)
> + {
> + if ((error= hton->drop_table(hton, path.str)))
> + {
> + if (!non_existing_table_error(error))
> + break;
> + error= -1;
> + }
> + }
> + else
> + error= ha_delete_table_force(thd, path.str, &db, &table);
> + if (error <= 0)
> + {
> + /* Not found or already deleted. Delete .frm if it exists */
> + strxnmov(to_path, sizeof(to_path)-1, path.str, reg_ext, NullS);
> + mysql_file_delete(key_file_frm, to_path, MYF(MY_WME|MY_IGNORE_ENOENT));
this should rather be inside the previous if().
no need to do it for ha_delete_table_force(), as it doesn't
leave any traces and removes all remnants of a table.
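That is, something like (a sketch):

if (hton)
{
  if ((error= hton->drop_table(hton, path.str)))
  {
    if (!non_existing_table_error(error))
      break;
    error= -1;
  }
  if (error <= 0)
  {
    /* Not found or already deleted. Delete .frm if it exists */
    strxnmov(to_path, sizeof(to_path)-1, path.str, reg_ext, NullS);
    mysql_file_delete(key_file_frm, to_path, MYF(MY_WME|MY_IGNORE_ENOENT));
  }
}
else
  error= ha_delete_table_force(thd, path.str, &db, &table);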
> + }
> + if (ddl_log_increment_phase_no_lock(entry_pos))
> + break;
> + (void) ddl_log_sync_no_lock();
No need to, ddl_log_increment_phase_no_lock() syncs internally
> + /* Fall through */
> + case DDL_DROP_PHASE_TRIGGER:
> + Table_triggers_list::drop_all_triggers(thd, &db, &table,
> + MYF(MY_WME | MY_IGNORE_ENOENT));
> + if (ddl_log_increment_phase_no_lock(entry_pos))
> + break;
> + (void) ddl_log_sync_no_lock();
same. and in other places too, I'd expect.
> + /* Fall through */
> +
> + case DDL_DROP_PHASE_COLLECT:
indentation?
> + append_identifier(thd, &ddl_drop_query, &db);
> + ddl_drop_query.append('.');
> + append_identifier(thd, &ddl_drop_query, &table);
> + ddl_drop_query.append(',');
> + /* We don't increment phase as we want to retry this in case of crash */
> +
> + if (!ddl_log_entry->next_entry && mysql_bin_log.is_open())
> + {
> + /* Last drop table. Write query to binlog */
> + LEX_CSTRING end_comment=
> + { STRING_WITH_LEN(" /* generated by ddl log */")};
"generated by ddl log" ? a "log" cannot "generate" a statement,
may be "generated during ddl recovery" ?
> + ddl_drop_query.length(ddl_drop_query.length()-1);
> + ddl_drop_query.append(&end_comment);
> +
> + mysql_mutex_unlock(&LOCK_gdl);
> + (void) thd->binlog_query(THD::STMT_QUERY_TYPE, ddl_drop_query.ptr(),
> + ddl_drop_query.length(), TRUE, FALSE,
> + FALSE, 0);
> + mysql_mutex_lock(&LOCK_gdl);
> + }
> + break;
> + }
> + break;
> + }
> + case DDL_LOG_DROP_VIEW_INIT_ACTION:
> + {
> + ddl_drop_query.length(0);
> + ddl_drop_query.set_charset(system_charset_info);
> + ddl_drop_query.append(STRING_WITH_LEN("DROP VIEW IF EXISTS "));
> + /* We don't increment phase as we want to retry this in case of crash */
> + break;
> + }
> + case DDL_LOG_DROP_VIEW_ACTION:
> + {
> + LEX_CSTRING db, table, path;
> + db= ddl_log_entry->db;
> + table= ddl_log_entry->name;
> + /* Note that for views path is WITH .frm extension */
> + path= ddl_log_entry->tmp_name;
> +
> + mysql_file_delete(key_file_frm, path.str, MYF(MY_WME|MY_IGNORE_ENOENT));
> + append_identifier(thd, &ddl_drop_query, &db);
> + ddl_drop_query.append('.');
> + append_identifier(thd, &ddl_drop_query, &table);
> + ddl_drop_query.append(',');
> +
> + if (!ddl_log_entry->next_entry)
> + {
> + /* Last drop view. Write query to binlog */
> + LEX_CSTRING end_comment=
> + { STRING_WITH_LEN(" /* generated by ddl log */")};
> + ddl_drop_query.length(ddl_drop_query.length()-1);
> + ddl_drop_query.append(&end_comment);
> +
> + mysql_mutex_unlock(&LOCK_gdl);
> + (void) thd->binlog_query(THD::STMT_QUERY_TYPE, ddl_drop_query.ptr(),
> + ddl_drop_query.length(), TRUE, FALSE,
> + FALSE, 0);
> + mysql_mutex_lock(&LOCK_gdl);
> + }
> + break;
> + }
> default:
> DBUG_ASSERT(0);
> break;
> @@ -1682,6 +1823,7 @@ void ddl_log_release_entries(DDL_LOG_STATE *ddl_log_state)
> next= log_entry->next_active_log_entry;
> ddl_log_release_memory_entry(log_entry);
> }
> + ddl_log_state->list= 0;
maybe you should also clear ddl_log_state->execute_entry below?
>
> if (ddl_log_state->execute_entry)
> ddl_log_release_memory_entry(ddl_log_state->execute_entry);
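i.e. (a sketch):

if (ddl_log_state->execute_entry)
{
  ddl_log_release_memory_entry(ddl_log_state->execute_entry);
  ddl_log_state->execute_entry= 0;
}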
> @@ -1776,6 +1918,33 @@ bool ddl_log_update_xid(DDL_LOG_STATE *state, ulonglong xid)
> }
>
>
> +/*
> + Write ddl_log_entry and write or update ddl_execute_entry
> +*/
> +
> +static bool ddl_log_write(DDL_LOG_STATE *ddl_state,
> + DDL_LOG_ENTRY *ddl_log_entry)
could you use a more descriptive name, please?
> +{
> + int error;
> + DDL_LOG_MEMORY_ENTRY *log_entry;
> + DBUG_ENTER("ddl_log_write");
> +
> + mysql_mutex_lock(&LOCK_gdl);
> + error= ((ddl_log_write_entry(ddl_log_entry, &log_entry)) ||
> + ddl_log_write_execute_entry(log_entry->entry_pos,
> + &ddl_state->execute_entry));
> + mysql_mutex_unlock(&LOCK_gdl);
> + if (error)
> + {
> + if (log_entry)
> + ddl_log_release_memory_entry(log_entry);
> + DBUG_RETURN(1);
> + }
> + add_log_entry(ddl_state, log_entry);
> + DBUG_RETURN(0);
> +}
> +
> +
> /**
> Logging of rename table
> */
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
Re: [Maria-developers] 2be9b69f4ff: MDEV-23001 Precreate static Item_bool() to simplify code
by Sergei Golubchik 14 May '21
Hi, Monty!
On Mar 25, Michael Widenius wrote:
> revision-id: 2be9b69f4ff (mariadb-10.5.2-510-g2be9b69f4ff)
> parent(s): cb545f11169
> author: Michael Widenius <michael.widenius(a)gmail.com>
> committer: Michael Widenius <michael.widenius(a)gmail.com>
> timestamp: 2021-03-24 14:05:15 +0200
> message:
>
> MDEV-23001 Precreate static Item_bool() to simplify code
>
> The following changes where done:
> - Create global Item: Item_false and Item_true
> - Replace all creation if 'FALSE' and 'TRUE' top level items used for
> WHERE/HAVING/ON clauses to use Item_false and Item_true.
>
> The benefit are:
> - No test needed if we were able to create the new item.
> - Fixed possible errors if 'new' would have failed for the Item_bool's
agree
> - Less and faster code
I don't quite agree with that, it's not less code:
> 7 files changed, 65 insertions(+), 43 deletions(-)
and I doubt that speedup can be measured in any benchmarks whatsoever.
> diff --git a/sql/item.cc b/sql/item.cc
> index 1a86b8b3114..0f160fa257b 100644
> --- a/sql/item.cc
> +++ b/sql/item.cc
> @@ -55,7 +55,8 @@ const char *item_empty_name="";
> const char *item_used_name= "\0";
>
> static int save_field_in_field(Field *, bool *, Field *, bool);
> -
> +const Item_bool_static Item_false("FALSE", 0);
> +const Item_bool_static Item_true("TRUE", 1);
The main question - we have tons of modifiable Item members. How do you
know none of them will be modified in your Item_bool_static objects?
> @@ -421,6 +421,9 @@ Item::Item(THD *thd):
> /* Initially this item is not attached to any JOIN_TAB. */
> join_tab_idx= MAX_TABLES;
>
> + /* thd is NULL in case of static items like Item_true */
> + if (!thd)
> + return;
No, please, don't.
thd can be NULL only twice, for statically initialized Item_true and
Item_false. For that you removed a useful assert and added an if() that
is completely unnecessary for the billions of items created at runtime.
Better to have a special constructor just for these two very special items.
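E.g. (a sketch; it would need the same member initializers as Item(THD*),
just without the free_list linking):

Item::Item()                /* only for static Item_true / Item_false */
{
  join_tab_idx= MAX_TABLES;
  /* deliberately not put on any THD::free_list */
}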
> /* Put item in free list so that we can free all items at end */
> next= thd->free_list;
> thd->free_list= this;
> @@ -3637,6 +3641,19 @@ void Item_int::print(String *str, enum_query_type query_type)
> }
>
>
> +/*
> + This function is needed to ensure that Item_bool_static doesn't change
> + the value of the member str_value.
> +*/
> +
> +void Item_bool::print(String *str, enum_query_type query_type)
> +{
> + // my_charset_bin is good enough for numbers
> + String tmp(value ? (char*) "1" : (char*) "0" , 1, &my_charset_bin);
> + str->append(tmp);
> +}
This is not needed anymore. Item_bool inherits from Item_int,
and Item_int::print is
void Item_int::print(String *str, enum_query_type query_type)
{
StringBuffer<LONGLONG_BUFFER_SIZE> buf;
// my_charset_bin is good enough for numbers
buf.set_int(value, unsigned_flag, &my_charset_bin);
str->append(buf);
}
You removed str_value from Item_int::print in July.
> Item *Item_bool::neg_transformer(THD *thd)
> {
> value= !value;
> diff --git a/sql/item.h b/sql/item.h
> index a1c288ab1f0..1087c08869e 100644
> --- a/sql/item.h
> +++ b/sql/item.h
> @@ -1976,7 +1976,7 @@ class Item: public Value_source,
> virtual bool limit_index_condition_pushdown_processor(void *arg) { return 0; }
> virtual bool exists2in_processor(void *arg) { return 0; }
> virtual bool find_selective_predicates_list_processor(void *arg) { return 0; }
> - bool cleanup_is_expensive_cache_processor(void *arg)
> + virtual bool cleanup_is_expensive_cache_processor(void *arg)
You've added a new virtual method? First you try so hard to remove
virtual methods, then you add a new one specifically for these two
static items. Why does it matter what
cleanup_is_expensive_cache_processor will do with them? You override
is_expensive() anyway.
> {
> is_expensive_cache= (int8)(-1);
> return 0;
> @@ -3220,9 +3220,11 @@ class Item_literal: public Item_basic_constant
> Item_literal(THD *thd): Item_basic_constant(thd)
> { }
> Type type() const override { return CONST_ITEM; }
> - bool check_partition_func_processor(void *) override { return false;}
> + bool check_partition_func_processor(void *int_arg) override { return false;}
> bool const_item() const override { return true; }
> bool basic_const_item() const override { return true; }
> + bool is_expensive() override { return false; }
> + bool cleanup_is_expensive_cache_processor(void *arg) override { return 0; }
> };
>
>
> diff --git a/sql/opt_index_cond_pushdown.cc b/sql/opt_index_cond_pushdown.cc
> index 15bc2074e1f..e950ad1b7ca 100644
> --- a/sql/opt_index_cond_pushdown.cc
> +++ b/sql/opt_index_cond_pushdown.cc
> @@ -185,8 +185,8 @@ bool uses_index_fields_only(Item *item, TABLE *tbl, uint keyno,
> static Item *make_cond_for_index(THD *thd, Item *cond, TABLE *table, uint keyno,
> bool other_tbls_ok)
> {
> - if (!cond)
> - return NULL;
> + if (!cond || cond->basic_const_item())
> + return cond;
The effect of this change is that you don't set
cond->marker= ICP_COND_USES_INDEX_ONLY for basic_const_item's anymore.
This looked suspicious at first. But then I noticed (and Sergey Petrunia
confirmed) that ICP_COND_USES_INDEX_ONLY is not used at all. Since 5.3.
So here I'd suggest to revert this your change and completely remove any
ICP_COND_USES_INDEX_ONLY manipulations together with the
ICP_COND_USES_INDEX_ONLY definition.
> if (cond->type() == Item::COND_ITEM)
> {
> uint n_marked= 0;
> diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
> index 119c7360f07..e63e120a02b 100644
> --- a/sql/sql_parse.cc
> +++ b/sql/sql_parse.cc
> @@ -2395,6 +2395,7 @@ dispatch_command_return dispatch_command(enum enum_server_command command, THD *
> thd->update_server_status();
> thd->protocol->end_statement();
> query_cache_end_of_result(thd);
> + /* Check that the query didn't change const items */
what does that mean? Is it a TODO comment? Or does it document what
happens below?
> }
> if (drop_more_results)
> thd->server_status&= ~SERVER_MORE_RESULTS_EXISTS;
> diff --git a/sql/sql_select.cc b/sql/sql_select.cc
> index 93f5d3591ed..894d511adf4 100644
> --- a/sql/sql_select.cc
> +++ b/sql/sql_select.cc
> @@ -2364,7 +2364,7 @@ int JOIN::optimize_stage2()
> if (!conds && outer_join)
> {
> /* Handle the case where we have an OUTER JOIN without a WHERE */
> - conds= new (thd->mem_root) Item_bool(thd, true); // Always true
> + conds= (Item*) &Item_true;
Hmm, does casting const away even work in the Intel compiler?
> }
>
> if (impossible_where)
> @@ -22743,6 +22743,8 @@ make_cond_for_table_from_pred(THD *thd, Item *root_cond, Item *cond,
> return new_cond;
> }
> }
> + else if (cond->basic_const_item())
> + return cond;
>
> if (is_top_and_level && used_table == rand_table_bit &&
> (cond->used_tables() & ~OUTER_REF_TABLE_BIT) != rand_table_bit)
this is rather fragile, isn't it?
You skip item->set_join_tab_idx(join_tab_idx_arg),
quite understandably.
And currently, as far as I can see, item->get_join_tab_idx()
is always used under if (!item->const_item()), so what you did is safe.
But can it change in the future? Or is join_tab_idx something that
conceptually doesn't make sense for const items?
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
Re: [Maria-developers] 9083627c50f: Make rename atomic/repeatable in MyISAM and Aria
by Sergei Golubchik 14 May '21
Hi, Michael!
On May 04, Michael Widenius wrote:
> revision-id: 9083627c50f (mariadb-10.5.2-562-g9083627c50f)
> parent(s): 105cda3b1cb
> author: Michael Widenius <michael.widenius(a)gmail.com>
> committer: Michael Widenius <michael.widenius(a)gmail.com>
> timestamp: 2021-03-25 12:06:34 +0200
> message:
>
> Make rename atomic/repeatable in MyISAM and Aria
>
> This is required to make Atomic RENAME TABLE work for these engines
>
> The requirement is that if we have a server crash in the middle of a
> storage engine rename call, the upcoming ddl log recovery should be able
> to finalize it by re-execute the rename.
>
> diff --git a/storage/maria/ma_rename.c b/storage/maria/ma_rename.c
> index a4388596f6b..64dac32a51c 100644
> --- a/storage/maria/ma_rename.c
> +++ b/storage/maria/ma_rename.c
> @@ -106,28 +106,37 @@ int maria_rename(const char *old_name, const char *new_name)
> _ma_reset_state(info);
> maria_close(info);
>
> + /*
> + This code is written so that it should be possible to re-run a
> + failed rename (even if there is a server crash in between the
> + renames) and complete it.
> + */
> fn_format(from,old_name,"",MARIA_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> fn_format(to,new_name,"",MARIA_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> if (mysql_file_rename_with_symlink(key_file_kfile, from, to,
> MYF(MY_WME | sync_dir)))
> - DBUG_RETURN(my_errno);
> + index_file_rename_error= my_errno;
Shouldn't you continue only if index_file_rename_error == ENOENT ?
also, I only very quickly looked at mysql_file_rename_with_symlink,
but it didn't look atomic to me when symlinks are present.
> fn_format(from,old_name,"",MARIA_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> fn_format(to,new_name,"",MARIA_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
...
> diff --git a/storage/myisam/mi_rename.c b/storage/myisam/mi_rename.c
> index 19df2e54213..13cf8c60897 100644
> --- a/storage/myisam/mi_rename.c
> +++ b/storage/myisam/mi_rename.c
> @@ -22,6 +22,7 @@
> int mi_rename(const char *old_name, const char *new_name)
> {
> char from[FN_REFLEN],to[FN_REFLEN];
> + int save_errno= 0;
> DBUG_ENTER("mi_rename");
>
> #ifdef EXTRA_DEBUG
> @@ -32,10 +33,13 @@ int mi_rename(const char *old_name, const char *new_name)
> fn_format(from,old_name,"",MI_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> fn_format(to,new_name,"",MI_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> if (mysql_file_rename_with_symlink(mi_key_file_kfile, from, to, MYF(MY_WME)))
> - DBUG_RETURN(my_errno);
> + save_errno= my_errno;
> fn_format(from,old_name,"",MI_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> fn_format(to,new_name,"",MI_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> - DBUG_RETURN(mysql_file_rename_with_symlink(mi_key_file_dfile,
> - from, to,
> - MYF(MY_WME)) ? my_errno : 0);
> + if (mysql_file_rename_with_symlink(mi_key_file_dfile,
> + from, to,
> + MYF(MY_WME)))
> + if (save_errno)
> + save_errno= my_errno;
> + DBUG_RETURN(save_errno);
Why mi_rename isn't trying to undo a half-failed rename like ma_rename does?
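E.g. something like this sketch, mirroring what ma_rename() does:

if (mysql_file_rename_with_symlink(mi_key_file_dfile, from, to,
                                   MYF(MY_WME)))
{
  if (!save_errno)
  {
    save_errno= my_errno;
    /* Undo the index file rename, don't leave a half-renamed table */
    fn_format(from, old_name, "", MI_NAME_IEXT,
              MY_UNPACK_FILENAME|MY_APPEND_EXT);
    fn_format(to, new_name, "", MI_NAME_IEXT,
              MY_UNPACK_FILENAME|MY_APPEND_EXT);
    (void) mysql_file_rename_with_symlink(mi_key_file_kfile, to, from,
                                          MYF(0));
  }
}
DBUG_RETURN(save_errno);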
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
Re: [Maria-developers] 32c73138ff4: Remove some usage of Check_level_instant_set and Sql_mode_save
by Sergei Golubchik 14 May '21
Hi, Monty!
On Apr 19, Michael Widenius wrote:
> revision-id: 32c73138ff4 (mariadb-10.5.2-559-g32c73138ff4)
> parent(s): 07708eb9b28
> author: Michael Widenius <michael.widenius(a)gmail.com>
> committer: Michael Widenius <michael.widenius(a)gmail.com>
> timestamp: 2021-03-25 12:06:34 +0200
> message:
>
> Remove some usage of Check_level_instant_set and Sql_mode_save
>
> The reasons for the removal are:
> - Generates more code
> - Storing and retrieving THD
> - Causes extra code and data to be generated to handle possibly thrown
> exceptions (which never happens in MariaDB code)
> - Uses more stack space
No, Monty.
We've discussed it already, this makes code more complex, fragile and
bug prone.
And it does *not* generate extra code or data or stack - before writing
this email I've compiled mariadbd both ways and compared the generated
code for the function in question, it was *identical*.
> Other things:
> - Changed convert_const_to_int() to use item->save_in_field_no_warnings(),
> which made the code shorter and simpler.
> - Added thd as argument to store_key.copy() to make function simpler
> - Added thd as argument to some subselect* constructors that inherit
> from Item_subselect.
these changes are good, thanks
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
Re: [Maria-developers] 6b06a99e4a4: Avoid creating the .frm file twice in some cases
by Sergei Golubchik 14 May '21
Hi, Monty!
On Apr 19, Michael Widenius wrote:
> revision-id: 6b06a99e4a4 (mariadb-10.5.2-557-g6b06a99e4a4)
> parent(s): 9c1e36696f3
> author: Michael Widenius <michael.widenius(a)gmail.com>
> committer: Michael Widenius <michael.widenius(a)gmail.com>
> timestamp: 2021-03-25 12:06:34 +0200
> message:
>
> Avoid creating the .frm file twice in some cases
>
> diff --git a/sql/handler.cc b/sql/handler.cc
> index c286aef3a9e..fe2fe6e9426 100644
> --- a/sql/handler.cc
> +++ b/sql/handler.cc
> @@ -5536,8 +5536,9 @@ int ha_create_table(THD *thd, const char *path,
>
> if (frm)
> {
> - bool write_frm_now= !create_info->db_type->discover_table &&
> - !create_info->tmp_table();
> + bool write_frm_now= (!create_info->db_type->discover_table &&
> + !create_info->tmp_table() &&
> + !create_info->frm_is_created);
Is this the real problem? frm created twice?
Why is ha_create_table() invoked twice at all?
>
> share.frm_image= frm;
>
> diff --git a/sql/sql_table.cc b/sql/sql_table.cc
> index e00d3ee1ed2..73642a5bd97 100644
> --- a/sql/sql_table.cc
> +++ b/sql/sql_table.cc
> @@ -547,8 +547,13 @@ uint build_table_filename(char *buff, size_t bufflen, const char *db,
>
> (void) tablename_to_filename(db, dbbuff, sizeof(dbbuff));
>
> - /* Check if this is a temporary table name. Allow it if a corresponding .frm file exists */
> - if (is_prefix(table_name, tmp_file_prefix) && strlen(table_name) < NAME_CHAR_LEN &&
> + /*
> + Check if this is a temporary table name. Allow it if a corresponding .frm
> + file exists.
> + */
> + if (!(flags & FN_IS_TMP) &&
> + is_prefix(table_name, tmp_file_prefix) &&
> + strlen(table_name) < NAME_CHAR_LEN &&
good idea!
> check_if_frm_exists(tbbuff, dbbuff, table_name))
> flags|= FN_IS_TMP;
>
> @@ -9817,7 +9827,7 @@ do_continue:;
>
> if (table->s->tmp_table != NO_TMP_TABLE)
> {
> - /* Close lock if this is a transactional table */
> + /* Unlock lock if this is a transactional temporary table */
strictly speaking, you don't "unlock lock",
you "unlock a resource" (e.g. a table)
or you "release a lock"
> if (thd->lock)
> {
> if (thd->locked_tables_mode != LTM_LOCK_TABLES &&
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org