developers
Threads by month
- ----- 2025 -----
- January
- ----- 2024 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2023 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2022 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2021 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2020 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2019 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2018 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2017 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2016 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2015 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2014 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2013 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2012 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2011 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2010 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2009 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- 6814 discussions
Hi, I'm Michael Okoko, a GSOC 2021 mentee for MariaDB and I'd be working on
switching the on-disk storage format for MariaDB histograms to JSON (
https://jira.mariadb.org/browse/MDEV-21130)
I'm looking forward to working with everyone in the community and
contributing more to the MariaDB during and after GSOC.
1
0
Re: [Maria-developers] e92037989f7: MDEV-21117: refine the server binlog-based recovery for semisync
by Sergei Golubchik 26 May '21
by Sergei Golubchik 26 May '21
26 May '21
Hi, Andrei!
On Apr 27, Andrei Elkin wrote:
> >> +--let $case = B: one engine has committed its transaction branch
> >> +# Hold off after one engine has committed.
> >> +--let $shutdown_timeout=0
> >> +--let $debug_sync_action = "commit_after_run_commit_ordered SIGNAL con1_ready WAIT_FOR signal_no_signal"
> >> +# Both debug_sync and debug-dbug are required to make sure Engines remember the commit state
> >> +# debug_sync alone will not help.
> >> +--let $restart_parameters = --rpl-semi-sync-slave-enabled=1 --debug-dbug=d,binlog_truncate_partial_commit
> >
> > in the first review I wrote
> >
> > this seems to be a rather crude way of faking a partially committed
> > transaction. better to crash after the first engine has committed,
> > that'd be much more natural.
> >
> > and you replied
> >
> > This simulation aimed at (allows for) more complicated recovery time
> > event sequences.
> > In this case, indeed, crashing by demand is about of the same efforts.
> > I can convert to that.
> >
> > [x]
>
> I had a rather sobering followup, slipped communicating to you though, sorry.
>
> It's actually turns out not to be easy. A sequence of execution events
>
> E1.prepare(trx), E2.prepare(trx), E1.commit(trx), *crash*
>
> does not make E1 such as Innodb committed persistently in file. At
> recovery in E1 trx may be found in prepared state, unless before the
> crash I'd do something like Binlog CheckPoint (BCP) notification request
> and wait of it.
I don't understand, what a binlog checkpoint has to do with it?
> In this just a selected engine would be requested, unlike in BCP event
> case that is generated when both (in this case) have persistently committed.
> (That is we even can't involve BCP event to wait for, and the involvement
> would require some simulation as well).
>
> So I propose to keep the current method leaving the "honest" crash to
> RQG that Alice has been training the patch with.
Alice was loading the server with transactions and doing `kill -9` at
some point. Many interesting cases that we discuss (for example, the one
above) happen in a very small time window, so I don't know whether
random killing could provide an adequate test coverage.
> >> diff --git a/mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test b/mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test
> >> --- /dev/null
> >> +++ b/mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test
> >> @@ -0,0 +1,143 @@
> >> +
> >> +--source include/have_innodb.inc
> >> +--source include/have_debug_sync.inc
> >> +--source include/have_binlog_format_row.inc
> >> +--let $rpl_topology=1->2
> >> +--source include/rpl_init.inc
> >
> > why not to source master-slave.inc if you're using a standard master-slave
> > topology anyway?
>
> Using numbers it's easier to manipulate with failover operatations.
> Besides it's truly uncomfortable both read and write
> when `master` would be serving the slave role as it happens in the test :-).
I believe I already replied to that earlier.
master-slave.inc creates server_1 and server_2 connections all right.
> >> diff --git a/sql/handler.h b/sql/handler.h
> >> --- a/sql/handler.h
> >> +++ b/sql/handler.h
> >> @@ -873,6 +874,15 @@ typedef struct xid_t XID;
> >> /* The 'buf' has to have space for at least SQL_XIDSIZE bytes. */
> >> uint get_sql_xid(XID *xid, char *buf);
> >>
> >> +/* struct for semisync slave binlog truncate recovery */
> >> +struct xid_recovery_member
> >> +{
> >> + my_xid xid;
> >> + uint in_engine_prepare; // number of engines that have xid prepared
> >> + bool decided_to_commit;
> >> + std::pair<uint, my_off_t> binlog_coord; // semisync recovery binlog offset
> >
> > wouldn't it be clearer to have a struct with named members?
>
> I am not sure if there's a need for naming. But we do need
> comparison that std::pair provides for granted.
>
> I could consider (a) new typedef(s) to represent the binlog id and
> offset in the file to make it more readable.
>
> You say?
>
> > in fact, I'm somewhat surprised there's no such struct for binlog coords
> > already.
>
> I had started that back at HB implementation actually, but that struct
> did not attact much of attention.
I'd still suggest a structure, like
struct binlog_coords_t {
uint binlog_num;
my_off_t offset;
};
> >
> >> +};
> >> +
> >> /* for recover() handlerton call */
> >> #define MIN_XID_LIST_SIZE 128
> >> #define MAX_XID_LIST_SIZE (1024*128)
> >> @@ -4820,7 +4830,8 @@ int ha_commit_one_phase(THD *thd, bool all);
> >> int ha_commit_trans(THD *thd, bool all);
> >> int ha_rollback_trans(THD *thd, bool all);
> >> int ha_prepare(THD *thd);
> >> -int ha_recover(HASH *commit_list);
> >> +int ha_recover(HASH *commit_list, MEM_ROOT *mem_root= NULL);
> >> +uint ha_recover_complete(HASH *commit_list, std::pair<uint, my_off_t> *coord= NULL);
> >
> > is coord a truncation position?
>
> So NULL is for the normal recovery.
> Non-NULL is an estimate of the 1st to rollback.
> Trx with lesser offsets are to commit, greater of equal - to rollback.
>
> Why it's only an estimate is 'cos the function also handles
> no-truncation cases, it then passes as `coord` the last committed
> trx's coordinate.
> The last committed can't be decided to rollback because of higher precedence of
>
> xid_recovery_member::decided_to_commit
>
> which is `true` under current circumstance, in
>
> +static bool xarecover_decide(xid_recovery_member* member,
> + xid_t x, std::pair<uint, my_off_t> *coord_ptr)
> +{
> + return
> + member->decided_to_commit ? true :
> + !coord_ptr ? false :
> + (member->binlog_coord < *coord_ptr
>
> I did not reflect the subtlety of the above in function header comments.
> To be done, if you're sastified here.
yes, please, reflect it. thanks :)
> >> diff --git a/sql/log_event.h b/sql/log_event.h
> >> --- a/sql/log_event.h
> >> +++ b/sql/log_event.h
> >> @@ -482,6 +482,16 @@ class String;
> >> */
> >> #define LOG_EVENT_IGNORABLE_F 0x80
> >>
> >> +/**
> >> + @def LOG_EVENT_ACCEPT_OWN_F
> >> +
> >> + Flag sets by the semisync slave for accepting
> >> + the same server_id ("own") events which the slave must not have
> >> + in its state. Typically such events were never committed by
> >> + their originator (this server) and discared at its semisync-slave recovery.
> >> +*/
> >> +#define LOG_EVENT_ACCEPT_OWN_F 0x4000
> >
> > may be, add an assert on all received events that such a flag is not set?
> > it can only be set on events in relay log.
>
> There's no place the flag gets logged. It'd be way too fool-proof in my
> opinion :-)!
> Or if you have a case I'll do.
too fool-proof is when you do
a=1;
assert(a == 1);
if you write a binlog event to disk, another thread reads it and sends
over the network to another server, which also writes it to disk first,
then another thread reads it again and executes - I don't think it's
"too" to check that the event doesn't have this flag.
Also, I personally value a lot a documentation role of asserts. Assert,
like
DBUG_ASSERT(ev->flags & LOG_EVENT_ACCEPT_OWN_F == 0);
is basically saying
/* LOG_EVENT_ACCEPT_OWN_F must be not set here */
but it's a self-fulfilling comment, it documents the code and at the
same time guarantees that the documentation is up-to-date.
> >> @@ -3357,6 +3367,12 @@ class Gtid_log_event: public Log_event
> >> uint64 commit_id;
> >> uint32 domain_id;
> >> uchar flags2;
> >> + uint flags_extra; // more flags area placed after the regular flags2's one
> >> + /*
> >> + Extra to a "base" engine recoverable engines participating
> >> + in the transaction. Zero, when the base engine only is present.
> >
> > what's a "base engine"?
>
> The 1st ha_info one that is not binlog.
why not to say "number of participating engines minus one" ?
this won't puzzle anyone with a new concept of a "base engine"
> >
> >> + */
> >> + uint8 extra_engines;
> >>
> >> /* Flags2. */
> >>
> >> diff --git a/sql/handler.cc b/sql/handler.cc
> >> --- a/sql/handler.cc
> >> +++ b/sql/handler.cc
> >> @@ -1637,9 +1672,17 @@ commit_one_phase_2(THD *thd, bool all, THD_TRANS *trans, bool is_real_trans)
> >> DEBUG_SYNC(thd, "commit_one_phase_2");
> >> if (ha_info)
> >> {
> >> + int err;
> >> +
>
> >> + if (has_binlog_hton(ha_info) &&
> >
> > can you replace has_binlog_hton() with, like, if trx cache is not empty or
> > binlog enabled or something like that?
>
> I see you're driving at removing binlog ha_info from the list, are not
> you?
I'm not sure :)
Yes, I'd like to get rid of binlog_hton, as we discussed. But here,
perhaps, I just didn't like that you scan the list for something that
most likely could be checked much easier.
> I already made some experiments to have observed various complications
> in the way of implementing an equivalent of
> has_binlog_hton().
> The closest would be
> non-empty cache.
>
> But `binlog_commit` may need invoking even in this case, as its branch
> shows
>
> if (cache_mngr->trx_cache.empty())
> {
> /*
> we're here because cache_log was flushed in MYSQL_BIN_LOG::log_xid()
> */
> cache_mngr->reset(false, true);
>
>
> Overall, I'd stay with the current approach to defer its refinement to
> time binlog_hton stops registering for transaction.
>
> Your comment?
Okay. I just don't like scanning the list, aesthetically, when one can
check a value in some fixed location or a variable. But it's not a
performance concern, the list has usually only two elements and hardly
ever can have more than three. So okay, let's stay with the current
approach.
> > @@ -1962,8 +2008,177 @@ struct xarecover_st
>
> >> sql_print_warning("%s transaction with xid %llu",
> >
> > may be not a sql_print_warning, but a sql_print_information?
> > it's just an informational message
>
> [->]
I presume, it means [x]
> >> + member->decided_to_commit ?
> >> + "Committed" : "Rolled back", (ulonglong) member->xid);
> >> + }
> >> +
> >> + return false;
> >> +}
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
1
0
Re: [Maria-developers] 308c28e8a60: MDEV-15888 Implement FLUSH TABLES tbl_name [, tbl_name] ... WITH READ LOCK for views.
by Sergei Golubchik 23 May '21
by Sergei Golubchik 23 May '21
23 May '21
Hi, Alexey!
On Mar 21, Alexey Botchkov wrote:
> revision-id: 308c28e8a60 (mariadb-10.5.2-393-g308c28e8a60)
> parent(s): 786bc312b85
> author: Alexey Botchkov <holyfoot(a)mariadb.com>
> committer: Alexey Botchkov <holyfoot(a)mariadb.com>
> timestamp: 2021-02-09 00:59:55 +0400
> message:
>
> MDEV-15888 Implement FLUSH TABLES tbl_name [, tbl_name] ... WITH READ LOCK for views.
>
> Kind of 'naive' solution that surprisingly worked so trying to
> understand why.
>
> diff --git a/mysql-test/main/flush-innodb.result b/mysql-test/main/flush-innodb.result
> index 21e5bda7785..2c886e4f9fc 100644
> --- a/mysql-test/main/flush-innodb.result
> +++ b/mysql-test/main/flush-innodb.result
> @@ -60,7 +60,7 @@ DROP TABLE export;
> CREATE VIEW v1 AS SELECT 1;
> CREATE TEMPORARY TABLE t1 (a INT);
> FLUSH TABLES v1 FOR EXPORT;
> -ERROR HY000: 'test.v1' is not of type 'BASE TABLE'
> +UNLOCK TABLES;
good, but it only tests that there's no error.
may be you need to add tests to verify that tables are actually
locked?
> FLUSH TABLES t1 FOR EXPORT;
> ERROR 42S02: Table 'test.t1' doesn't exist
> FLUSH TABLES non_existent FOR EXPORT;
> diff --git a/sql/sql_reload.cc b/sql/sql_reload.cc
> index 8f87d633d19..c146b7693c1 100644
> --- a/sql/sql_reload.cc
> +++ b/sql/sql_reload.cc
> @@ -543,6 +543,7 @@ bool flush_tables_with_read_lock(THD *thd, TABLE_LIST *all_tables)
>
> if (thd->lex->type & REFRESH_READ_LOCK)
> {
> +#ifdef MDEV15887
> /*
> Acquire SNW locks on tables to be flushed. Don't acquire global
> IX and database-scope IX locks on the tables as this will make
> @@ -559,6 +560,7 @@ bool flush_tables_with_read_lock(THD *thd, TABLE_LIST *all_tables)
> for (auto table_list= all_tables; table_list;
> table_list= table_list->next_global)
> table_list->mdl_request.ticket= NULL;
> +#endif /*MDEV15887*/
what's that?
btw, the number is MDEV15888
> }
>
> thd->variables.option_bits|= OPTION_TABLE_LOCK;
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
2
1
Re: [Maria-developers] Request for a discusison: A fine-grained concurrent ring buffer mode for IO_CACHE
by Nikita Malyavin 22 May '21
by Nikita Malyavin 22 May '21
22 May '21
Welcome to the thread, Andrei!
For everybody: we just had a productive chat with Andrei, and I'd like to
outline the results.
Let's take a look at binlog_commit, or more exactly
MYSQL_BIN_LOG::trx_group_commit_leader
called from there (follow binlog_commit_flush_trx_cache
-> THD::binlog_flush_pending_rows_event
-> MYSQL_BIN_LOG::flush_and_set_pending_rows_event
-> MYSQL_BIN_LOG::write_transaction_to_binlog call chain).
The committing is currently strongly sequenced. The transactions are
organized into
groups, and when the latest transaction is acknowledged on the first commit
phase,
they all are being committed by leader in a chosen order.
However, even here we can write the transactions in parallel, preserving
the order.
Andrei also claims, that the order can be potentially restored on the
replication side.
Anyway we technically can't send the transaction to the replication slave
before the
binlog flush&fsync, nevertheless the data will still be preserved in the
volatile append cache.
There was MDEV-20925 <https://jira.mariadb.org/browse/MDEV-20925> to store
the transaction length in the event, but it unfortunately
was rejected:
> I will be closing this issue because we have COMMIT/ROLLBACK query log
event
> in the end of transaction , whose size is difficult to determine ,
> So current plan is to do MDEv-19687 without transaction length.
The replication team decided to calculate the transaction sizes during
receiving the data
form the io, and then to store it in the hash, and no protocol
modifications would be required then.
I suggested to buffer the transaction separately, and then push it into the
relay
log in the data frame, storing the length.
Then, we have MDEV-19687 <https://jira.mariadb.org/browse/MDEV-19687>,
which was the supertask for MDEV-20925
<https://jira.mariadb.org/browse/MDEV-20925>.
The aim is to parallelize the replication on the slave side. They are
implementing their
own parallel circular buffer with single-writer, multiple-reader use case:
the parallel workers are going to pick out the transaction data, and
resolve the commit order
later.
Sergei petrunya, you've been questioning:
> I see SEQ_READ_APPEND is only used for Relay Log on the slave. Afaiu, the
> relay
> log has only only one producer, the network io thread.
Andrei clarified, that there can actually be multiple sources. However they
are just going to
create a working queue for each source.
===============================================================
To underline, the rationale is changed for this IO_CACHE improvement:
* The replication relay log is not the use-case anymore, since a separate,
much simplified
circular buffer is going to be implemented (AFAIU, Sachin is in the middle
of the progress
of MDEV-19687 <https://jira.mariadb.org/browse/MDEV-19687>).
* The binlog commit is instead a good case! Most likely I have forgotten
about it, when
I was writing the rationale. But anyway it wasn't clear for me, can the
transactions be written
in parallel there.
On the reader side, the network replication sender should not block the
committing process
by reading from the binlog.
Regards,
Nikita
1
0
[Maria-developers] Request for a discusison: A fine-grained concurrent ring buffer mode for IO_CACHE
by Nikita Malyavin 19 May '21
by Nikita Malyavin 19 May '21
19 May '21
Hello, dear MariaDB community!
I am glad to present the design for IO_CACHE's SEQ_READ_APPEND mode
concurrency improvements. Jira task is MDEV-24676.
A student, Vladislav Kakurin, has applied to GSoC for this task, and
has passed, so he will be engaged in implementing it this summer.
The discussion is very important here, since I could make the mistakes,
and Vladislav can miss them, but the success of the task completion
may depend on it. We're hoping to succeed in implementing the multi-producer
till GSoC end.
So please don't hesitate to ask for clarifications.
Note that reader/writer and producer/consumer pairs are used interchangeably
in the text.
The text is attached as PDF, but for the sake of convenient quoting it
is additionally embodied in the email with shrinked markup and no
illustrations.
Sincerely,
Nikita Malyavin
Software Engineer of MariaDB
====================================================================
A fine-grained concurrent ring buffer mode for IO_CACHE
1. Rationale.
---------------------------------------------------
Current implementation of IO_CACHE’s SEQ_READ_APPEND mode behaves
coarsely grained on its write buffer: every read and every write is
protected by append_buffer_lock.
int _my_b_seq_read(IO_CACHE *info, ...) {
lock_append_buffer(info);
... // read logic
unlock_append_buffer(info);
return Count ? 1 : 0;
}
int my_b_append(IO_CACHE *info, ...) {
lock_append_buffer(info);
... // append logic
unlock_append_buffer(info);
return 0;
}
Despite the separate read buffer is read-only, and therefore is
accessed wait-free, the write buffer can have a contention with
medium-sized transactions.
The design described hereafter is going to solve this issue, and an
extension for a parallel multi-producer workflow is additionally
provided.
Furthermore, the API extension for multi-producer approach support is
proposed, and the multi-consumerness is discussed.
2. The single-producer, single-consumer case.
---------------------------------------------------
Idea.
The memcpy operations of consumer and producer never overlap, therefore
they can be freed of locks.
Overflow and emptiness
We cannot begin writing in the area still involved in reading.
Therefore, reader should not update the pointers before it finishes
reading. This means that we should lock in the beginning to atomically
read the data, and in the end, to write the new reader data.
The same for vice-versa, we cannot read from the area still involved
into writing, therefore a read should finish with EMPTY error
(currently _my_b_seq_read just returns 1)
When we reach a “buffer is full” condition, we can flip the read and
write (append) buffers, if we were reading from an append buffer.
Otherwise, the append buffer is flushed.
The algorithm.
The following pseudocode will describe the single-consumer,
single-producer approach.
It is assumed that reading from the read buffer is handled in the usual
way.
io->total_size and io->read_buffer are considered to be accessed
atomically.
io_err_t read(IO_CACHE *io, uchar *buffer, size_t sz)
{
if (sz > io->total_size)
return E_IO_EMPTY;
uchar *read_buffer = io->read_buffer;
if (io->read_pos points to read_buffer)
sz_read = read_from_read_buffer(io, buffer, sz);
buffer += sz_read;
sz -= sz_read;
io->total_size -= sz_read;
if (sz == 0)
return 0;
// else copy from append buffer
lock(io->append_buffer_lock);
// copy the local variables
uchar *read_pos = io->read_pos;
uchar *read_buffer = io->read_buffer;
uchar *append_start_pos = i->append_start_pos;
uchar *append_size = io->append_size;
uchar *append_pos = io->append_pos;
// etc, if needed
unlock(io->append_buffer_lock);
read from append buffer;
lock(io->append_buffer_lock);
// update the variables
io->append_size -= sz;
io->append_start_pos += sz;
if (i->append_start_pos >= io->append_buf + io->cache_size)
io->append_start_pos -= io->cache_size;
unlock(io->append_buffer_lock);
io->total_size -= sz;
}
The first read()’s part tries to read from a read-only buffer.
If it’s empty, it moves the effort to a volatile append buffer.
All the metadata is copied in the first critical section, before the
data copying, to the stack. It is updated back in the second critical
section, after the data copying.
io_err_t write(IO_CACHE *io, uchar *buffer, size_t sz)
{
lock(io->append_buffer_lock);
if (append_buffer is full and io->total_size <= io->append_size)
swap(io->append_buffer, io->read_buffer);
else flush the append buffer if needed;
write to disk directly, if the data is too large;
uchar *write_pos = io->write_pos;
unlock(io->append_buffer_lock);
write to append buffer;
lock(io->append_buffer_lock);
io->write_pos = new_write_pos;
unlock(io->append_buffer_lock);
io->total_size += sz;
}
The important note here is that we access io->read_buffer in the
reader’s thread without the lock (the accesses are marked bold).
However this access happens only once in the beginning and is safe:
Only writer changes read_buffer.
The writer can change it only once during one read()
if io->read_buffer is considered reads-only, then it will not flip
again, and continue to be consistent, until io->total_size is changed:
io->total_size -= sz_read;
Then the lock happens. It should be fine to read from a flipped buffer
on that stage.
3. Multi-producer concurrency
---------------------------------------------------
Idea.
Writes start from io->write_start, which is to update immediately.
Reads are possible only until io->read_end, which is updated, as soon
as writes are finished.
Medium-grained approach
io->write_start is updated immediately to allow parallel writes.
However, we cannot update io->read_end immediately after this thread’s
write ends, because earlier writes can still be in progress. We should
wait for them i.e. we wait while (io->read_end != local_read_end)
Algorithm (medium-grained).
Medium-grained approach will modify write() function as follows (the
changed lines and locks are bolded):
io_err_t write(IO_CACHE *io, uchar *buffer, size_t sz)
{
lock(io->append_buffer_lock);
if (buffer flip of flush is needed)
wait until all the writes are finished;
if (append_buffer is full &&
io->write_total_size <= io->append_size)
swap(io->append_buffer, io->read_buffer);
else flush the append buffer if needed;
write to disk directly, if the data is too large;
uchar *local_write_start = io->write_start;
io->write_total_size += sz;
io->write_start += sz;
if (io->write_start > io->append_buffer + io->cache_size)
io->write_start -= io->cache_size;
unlock(io->append_buffer_lock);
write to append buffer;
lock(io->write_event_lock)
while(local_write_start != io->read_end)
cond_wait(io->write_event, io->write_event_lock);
unlock(io->write_event_lock)
lock(io->append_buffer_lock);
io->read_end = new_read_end;
unlock(io->append_buffer_lock);
cond_signal(io->write_event);
io->total_size += sz;
}
The read function should be modified mostly cosmetically.
Fine graining.
The writers are still waiting for each other’s finish. The approach
described here defers waiting through helping pattern by introducing
progress slots.
Each time a writer begins progress it allocates a slot in the dedicated
(fixed size) array.
When the writer finishes its job, it checks whether it is the leftmost
one (relative to its read_end value. If it is, it updates read_end for
itself, and for all the consecutive writers already finished.
The slot allocation will be controlled by a semaphore to prevent
overflow. Therefore, only a fixed number of producers can work
simultaneously.
The slot array is made of elements of private cache_slot_t structure:
struct cache_slot_t {
bool vacant: 1;
bool finished: 1;
uint next: size_bits(uint) - 2;
uint pos;
}
The slot is acquired whenever a write begins by searching an array cell
with vacant=1. When it’s found, vacant = 0, finished = 0 is set. The
last_slot variable holds the slot index for the latest write. slots[
last_slot].next is set to a new index, and last_slot itself is updated.
The following example demonstrates how the slots work:
+-------------------------------------------------------------+
| | write1 | write2 | write3 | |
+-------------------------------------------------------------+
A B C D
+-------------------------------------------------+
| vacant=0 | vacant=1 | vacant=0 | vacant=0 |
| finished=0 | | finished=1 | finished=1 |
| next=2 | | next=3 | next=? |
| pos=A | | pos=B | pos=C |
+-------------------------------------------------+
there were three writes currently running in parallel. write2 and
write3 are finished, but write1 is still running. When it finishes, it
will hop through slot.next while vacant==0 and finished==1 and pos !=
io->write_start. Therefore, read_end will be updated to C if no other
write will begin in parallel.
If another write begins in parallel before write1 finishes, it
allocates slots[1] and sets pos=D. slots[3].next would be set to 1, and
last_slot will be updated from 3 to 1.
The slot run through expected complexity is O(1). The proof for
acquisition is however not that obvious to prove the same, and no
effort was spent for proving it (It’s only obvious that it’s O(slots)).
4. Arbitrary data sources support
---------------------------------------------------
The widely spread use-case is pouring from another IO_CACHE source
(like a statement or transaction cache). The operation may require
several consecutive write() calls with an external lock:
lock(write_lock);
uchar buffer[SIZE];
while(cache_out is not empty) {
read(ceche_out, buffer, SIZE);
write(cache_in, buffer, SIZE);
}
unlock(write_lock);
This case destroys all the parallel design described.
However, let’s make api changes to allow blocks of predicted size be
written in parallel:
/**
Allocates the slot of a requested size for a writer. Returns new
slot id.
*/
slot_id_t append_allocate(IO_CAHCE*, size_t block_size);
/**
Frees the slot and propogates the data to be available for reading
*/
void append_commit(IO_CACHE*, slot_id_t);
These two functions just decompose our write() function:
append_allocate would include the first critical section and
append_commit would include the second one.
The use-case will be changed slightly:
slot_id_t slot = append_allocate(cache_out, append_tell(cache_in));
uchar buffer[SIZE];
while(cache_out is not empty) {
read(ceche_out, buffer, SIZE);
write(cache_in, buffer, SIZE);
}
append_commit(cache_out, slot);
5. Multi-consumerness
---------------------------------------------------
We currently have no cases with several readers working in parallel in
SEQ_READ_APPEND mode. It is only used by the replication thread to read
out the log, where it is delegated to a dedicated worker. The first
problem is that parallel readout would require additional coordination
-- the order of event application can be important.
Another problem is that a variable-sized blocks require at least two
consecutive reads if the structure is not known. If the length is
stored, it can be read out with exactly two reads (first reads length,
second reads the body).
The slot allocation strategy can be applied, and api can be added
similar to a new write api:
/** lock the cache and allocate the read slot */
slot_id_t read_allocate_lock(IO_CACHE*);
/** Allocate a read zone of the requested size and unlock the cache */
void read_allocate_unlock(IO_CACHE*, slot_id_t, size_t size);
/** Finish reading; deallocate the read slot */
viod read_commit(IO_CACHE*, slot_id_t);
Reading api needs one function more than writing api -- the allocation
is split on two phases: locking phase (to compute the block length),
and the actual requesting phase.
This approach has several disadvantages:
1. The read buffer access is no longer lock-free
2. read_allocate_lock leaves the IO_CACHE in a locked state, which can
be potentially misused.
Additionally, two SX locks can be used (one for readers and one for
writers) for extra parallelism.
2
2
[Maria-developers] Request for a discusison: A fine-grained concurrent ring buffer mode for IO_CACHE
by Nikita Malyavin 18 May '21
by Nikita Malyavin 18 May '21
18 May '21
Hello, dear MariaDB community!
I am glad to present the design for IO_CACHE's SEQ_READ_APPEND mode
concurrency improvements. Jira task is MDEV-24676.
A student, Vladislav Kakurin, has applied to GSoC for this task, and
has passed, so he will be engaged in implementing it this summer.
The discussion is very important here, since I could make the mistakes,
and Vladislav can miss them, but the success of the task completion
may depend on it. We're hoping to succeed in implementing the multi-producer
till GSoC end.
So please don't hesitate to ask for clarifications.
Note that reader/writer and producer/consumer pairs are used interchangeably
in the text.
The text is attached as PDF, but for the sake of convenient quoting it
is additionally embodied in the email with shrinked markup and no
illustrations.
Sincerely,
Nikita Malyavin
Software Engineer of MariaDB
====================================================================
A fine-grained concurrent ring buffer mode for IO_CACHE
1. Rationale.
---------------------------------------------------
Current implementation of IO_CACHE’s SEQ_READ_APPEND mode behaves
coarsely grained on its write buffer: every read and every write is
protected by append_buffer_lock.
int _my_b_seq_read(IO_CACHE *info, ...) {
lock_append_buffer(info);
... // read logic
unlock_append_buffer(info);
return Count ? 1 : 0;
}
int my_b_append(IO_CACHE *info, ...) {
lock_append_buffer(info);
... // append logic
unlock_append_buffer(info);
return 0;
}
Despite the separate read buffer is read-only, and therefore is
accessed wait-free, the write buffer can have a contention with
medium-sized transactions.
The design described hereafter is going to solve this issue, and an
extension for a parallel multi-producer workflow is additionally
provided.
Furthermore, the API extension for multi-producer approach support is
proposed, and the multi-consumerness is discussed.
2. The single-producer, single-consumer case.
---------------------------------------------------
Idea.
The memcpy operations of consumer and producer never overlap, therefore
they can be freed of locks.
Overflow and emptiness
We cannot begin writing in the area still involved in reading.
Therefore, reader should not update the pointers before it finishes
reading. This means that we should lock in the beginning to atomically
read the data, and in the end, to write the new reader data.
The same for vice-versa, we cannot read from the area still involved
into writing, therefore a read should finish with EMPTY error
(currently _my_b_seq_read just returns 1)
When we reach a “buffer is full” condition, we can flip the read and
write (append) buffers, if we were reading from an append buffer.
Otherwise, the append buffer is flushed.
The algorithm.
The following pseudocode will describe the single-consumer,
single-producer approach.
It is assumed that reading from the read buffer is handled in the usual
way.
io->total_size and io->read_buffer are considered to be accessed
atomically.
io_err_t read(IO_CACHE *io, uchar *buffer, size_t sz)
{
if (sz > io->total_size)
return E_IO_EMPTY;
uchar *read_buffer = io->read_buffer;
if (io->read_pos points to read_buffer)
sz_read = read_from_read_buffer(io, buffer, sz);
buffer += sz_read;
sz -= sz_read;
io->total_size -= sz_read;
if (sz == 0)
return 0;
// else copy from append buffer
lock(io->append_buffer_lock);
// copy the local variables
uchar *read_pos = io->read_pos;
uchar *read_buffer = io->read_buffer;
uchar *append_start_pos = i->append_start_pos;
uchar *append_size = io->append_size;
uchar *append_pos = io->append_pos;
// etc, if needed
unlock(io->append_buffer_lock);
read from append buffer;
lock(io->append_buffer_lock);
// update the variables
io->append_size -= sz;
io->append_start_pos += sz;
if (i->append_start_pos >= io->append_buf + io->cache_size)
io->append_start_pos -= io->cache_size;
unlock(io->append_buffer_lock);
io->total_size -= sz;
}
The first read()’s part tries to read from a read-only buffer.
If it’s empty, it moves the effort to a volatile append buffer.
All the metadata is copied in the first critical section, before the
data copying, to the stack. It is updated back in the second critical
section, after the data copying.
io_err_t write(IO_CACHE *io, uchar *buffer, size_t sz)
{
lock(io->append_buffer_lock);
if (append_buffer is full and io->total_size <= io->append_size)
swap(io->append_buffer, io->read_buffer);
else flush the append buffer if needed;
write to disk directly, if the data is too large;
uchar *write_pos = io->write_pos;
unlock(io->append_buffer_lock);
write to append buffer;
lock(io->append_buffer_lock);
io->write_pos = new_write_pos;
unlock(io->append_buffer_lock);
io->total_size += sz;
}
The important note here is that we access io->read_buffer in the
reader’s thread without the lock (the accesses are marked bold).
However this access happens only once in the beginning and is safe:
Only writer changes read_buffer.
The writer can change it only once during one read()
if io->read_buffer is considered reads-only, then it will not flip
again, and continue to be consistent, until io->total_size is changed:
io->total_size -= sz_read;
Then the lock happens. It should be fine to read from a flipped buffer
on that stage.
3. Multi-producer concurrency
---------------------------------------------------
Idea.
Writes start from io->write_start, which is to update immediately.
Reads are possible only until io->read_end, which is updated, as soon
as writes are finished.
Medium-grained approach
io->write_start is updated immediately to allow parallel writes.
However, we cannot update io->read_end immediately after this thread’s
write ends, because earlier writes can still be in progress. We should
wait for them i.e. we wait while (io->read_end != local_read_end)
Algorithm (medium-grained).
Medium-grained approach will modify write() function as follows (the
changed lines and locks are bolded):
io_err_t write(IO_CACHE *io, uchar *buffer, size_t sz)
{
lock(io->append_buffer_lock);
if (buffer flip of flush is needed)
wait until all the writes are finished;
if (append_buffer is full &&
io->write_total_size <= io->append_size)
swap(io->append_buffer, io->read_buffer);
else flush the append buffer if needed;
write to disk directly, if the data is too large;
uchar *local_write_start = io->write_start;
io->write_total_size += sz;
io->write_start += sz;
if (io->write_start > io->append_buffer + io->cache_size)
io->write_start -= io->cache_size;
unlock(io->append_buffer_lock);
write to append buffer;
lock(io->write_event_lock)
while(local_write_start != io->read_end)
cond_wait(io->write_event, io->write_event_lock);
unlock(io->write_event_lock)
lock(io->append_buffer_lock);
io->read_end = new_read_end;
unlock(io->append_buffer_lock);
cond_signal(io->write_event);
io->total_size += sz;
}
The read function should be modified mostly cosmetically.
Fine graining.
The writers are still waiting for each other’s finish. The approach
described here defers waiting through helping pattern by introducing
progress slots.
Each time a writer begins progress it allocates a slot in the dedicated
(fixed size) array.
When the writer finishes its job, it checks whether it is the leftmost
one (relative to its read_end value. If it is, it updates read_end for
itself, and for all the consecutive writers already finished.
The slot allocation will be controlled by a semaphore to prevent
overflow. Therefore, only a fixed number of producers can work
simultaneously.
The slot array is made of elements of private cache_slot_t structure:
struct cache_slot_t {
bool vacant: 1;
bool finished: 1;
uint next: size_bits(uint) - 2;
uint pos;
}
The slot is acquired whenever a write begins by searching an array cell
with vacant=1. When it’s found, vacant = 0, finished = 0 is set. The
last_slot variable holds the slot index for the latest write. slots[
last_slot].next is set to a new index, and last_slot itself is updated.
The following example demonstrates how the slots work:
+-------------------------------------------------------------+
| | write1 | write2 | write3 | |
+-------------------------------------------------------------+
A B C D
+-------------------------------------------------+
| vacant=0 | vacant=1 | vacant=0 | vacant=0 |
| finished=0 | | finished=1 | finished=1 |
| next=2 | | next=3 | next=? |
| pos=A | | pos=B | pos=C |
+-------------------------------------------------+
there were three writes currently running in parallel. write2 and
write3 are finished, but write1 is still running. When it finishes, it
will hop through slot.next while vacant==0 and finished==1 and pos !=
io->write_start. Therefore, read_end will be updated to C if no other
write will begin in parallel.
If another write begins in parallel before write1 finishes, it
allocates slots[1] and sets pos=D. slots[3].next would be set to 1, and
last_slot will be updated from 3 to 1.
The slot run through expected complexity is O(1). The proof for
acquisition is however not that obvious to prove the same, and no
effort was spent for proving it (It’s only obvious that it’s O(slots)).
4. Arbitrary data sources support
---------------------------------------------------
The widely spread use-case is pouring from another IO_CACHE source
(like a statement or transaction cache). The operation may require
several consecutive write() calls with an external lock:
lock(write_lock);
uchar buffer[SIZE];
while(cache_out is not empty) {
read(ceche_out, buffer, SIZE);
write(cache_in, buffer, SIZE);
}
unlock(write_lock);
This case destroys all the parallel design described.
However, let’s make api changes to allow blocks of predicted size be
written in parallel:
/**
Allocates the slot of a requested size for a writer. Returns new
slot id.
*/
slot_id_t append_allocate(IO_CAHCE*, size_t block_size);
/**
Frees the slot and propogates the data to be available for reading
*/
void append_commit(IO_CACHE*, slot_id_t);
These two functions just decompose our write() function:
append_allocate would include the first critical section and
append_commit would include the second one.
The use-case will be changed slightly:
slot_id_t slot = append_allocate(cache_out, append_tell(cache_in));
uchar buffer[SIZE];
while(cache_out is not empty) {
read(ceche_out, buffer, SIZE);
write(cache_in, buffer, SIZE);
}
append_commit(cache_out, slot);
5. Multi-consumerness
---------------------------------------------------
We currently have no cases with several readers working in parallel in
SEQ_READ_APPEND mode. It is only used by the replication thread to read
out the log, where it is delegated to a dedicated worker. The first
problem is that parallel readout would require additional coordination
-- the order of event application can be important.
Another problem is that a variable-sized blocks require at least two
consecutive reads if the structure is not known. If the length is
stored, it can be read out with exactly two reads (first reads length,
second reads the body).
The slot allocation strategy can be applied, and api can be added
similar to a new write api:
/** lock the cache and allocate the read slot */
slot_id_t read_allocate_lock(IO_CACHE*);
/** Allocate a read zone of the requested size and unlock the cache */
void read_allocate_unlock(IO_CACHE*, slot_id_t, size_t size);
/** Finish reading; deallocate the read slot */
viod read_commit(IO_CACHE*, slot_id_t);
Reading api needs one function more than writing api -- the allocation
is split on two phases: locking phase (to compute the block length),
and the actual requesting phase.
This approach has several disadvantages:
1. The read buffer access is no longer lock-free
2. read_allocate_lock leaves the IO_CACHE in a locked state, which can
be potentially misused.
Additionally, two SX locks can be used (one for readers and one for
writers) for extra parallelism.
1
0
16 May '21
Hi, Monty!
On May 10, Michael Widenius wrote:
> revision-id: eebebe8ef75 (mariadb-10.6.0-77-geebebe8ef75)
> parent(s): 5a602e2ce7f
> author: Michael Widenius <michael.widenius(a)gmail.com>
> committer: Michael Widenius <michael.widenius(a)gmail.com>
> timestamp: 2021-05-04 21:09:10 +0300
> message:
>
> MDEV-23842 Atomic RENAME TABLE
> diff --git a/client/mysqltest.cc b/client/mysqltest.cc
> index 350c55edee2..133d1f76839 100644
> --- a/client/mysqltest.cc
> +++ b/client/mysqltest.cc
> @@ -8095,9 +8095,10 @@ void handle_error(struct st_command *command,
> const char *err_sqlstate, DYNAMIC_STRING *ds)
> {
> int i;
> -
> DBUG_ENTER("handle_error");
>
> + var_set_int("$errno", err_errno);
there was $mysql_errno already, was is not something
you could've used?
> +
> command->used_replace= 1;
> if (command->require_file)
> {
> diff --git a/mysql-test/suite/atomic/rename_case.result b/mysql-test/suite/atomic/rename_case.result
> new file mode 100644
> index 00000000000..4b58c555cf2
> --- /dev/null
> +++ b/mysql-test/suite/atomic/rename_case.result
> @@ -0,0 +1,52 @@
> +create database test2;
> +#
> +# Testing rename error in different places
> +#
> +create table t1 (a int);
> +create table T2 (b int);
> +create table t3 (c int);
> +create table T4 (d int);
> +insert into t1 values(1);
> +insert into T2 values(2);
> +insert into t3 values(3);
> +insert into T4 values(4);
> +create temporary table tmp1 (a int);
> +create temporary table tmp2 (b int);
> +create temporary table tmp3 (c int);
> +create temporary table tmp4 (d int);
> +insert into tmp1 values(11);
> +insert into tmp2 values(22);
> +insert into tmp3 values(33);
> +insert into tmp4 values(44);
> +rename table t3 to T4, t1 to t5, T2 to t1, t5 to T2;
> +ERROR 42S01: Table 'T4' already exists
> +rename table t1 to t5, t3 to T4, T2 to t1, t5 to T2;
> +ERROR 42S01: Table 'T4' already exists
> +rename table t1 to t5, T2 to t1, t3 to T4, t5 to T2;
> +ERROR 42S01: Table 'T4' already exists
> +rename table t1 to t5, T2 to t1, t5 to T2, t3 to T4;
> +ERROR 42S01: Table 'T4' already exists
> +# Try failed rename using two databases
> +rename table test.t1 to test2.t5, test.T2 to test.t1, t5 to test.T2;
> +ERROR 42S02: Table 'test.t5' doesn't exist
> +select t1.a+T2.b+t3.c+T4.d from t1,T2,t3,T4;
better do
select t1.a,T2.b,t3.c,T4.d from t1,T2,t3,T4;
same one row, but it can detect when tables are swapped.
or if you need one scalar value (you do below) it could ne
select CONCAT(t1.a,T2.b,t3.c,T4.d) from t1,T2,t3,T4;
> +t1.a+T2.b+t3.c+T4.d
> +10
> +select * from t5;
> +ERROR 42S02: Table 'test.t5' doesn't exist
> +T2.MYD
> +T2.MYI
> +T2.frm
> +T4.MYD
> +T4.MYI
> +T4.frm
> +db.opt
> +t1.MYD
> +t1.MYI
> +t1.frm
> +t3.MYD
> +t3.MYI
> +t3.frm
> +# Cleanup
> +drop table t1,T2,t3,T4;
> +drop database test2;
> diff --git a/sql/sql_class.h b/sql/sql_class.h
> index 0dcb2e04ba4..5ae0866672d 100644
> --- a/sql/sql_class.h
> +++ b/sql/sql_class.h
> @@ -2835,6 +2835,7 @@ class THD: public THD_count, /* this must be first */
>
> #ifndef MYSQL_CLIENT
> binlog_cache_mngr * binlog_setup_trx_data();
> + ulonglong binlog_xid; /* binlog request storing of xid */
"Tell binlog to store thd->query_id in the next Query_log_event"
>
> /*
> Public interface to write RBR events to the binlog
> diff --git a/sql/log_event_client.cc b/sql/log_event_client.cc
> index 6b5d71348e1..733fc14c697 100644
> --- a/sql/log_event_client.cc
> +++ b/sql/log_event_client.cc
> @@ -1822,7 +1822,7 @@ bool Query_log_event::print_query_header(IO_CACHE* file,
> my_b_printf(file,
> "\t%s\tthread_id=%lu\texec_time=%lu\terror_code=%d\n",
> get_type_str(), (ulong) thread_id, (ulong) exec_time,
> - error_code))
> + error_code, (ulong) xid))
huh? you didn't add "\nxid=%lu" to the format string
> goto err;
> }
>
> diff --git a/sql/debug_sync.h b/sql/debug_sync.h
> index 3b8aa8815e1..4e3e10fcc51 100644
> --- a/sql/debug_sync.h
> +++ b/sql/debug_sync.h
> @@ -53,4 +53,12 @@ static inline bool debug_sync_set_action(THD *, const char *, size_t)
> { return false; }
> #endif /* defined(ENABLED_DEBUG_SYNC) */
>
> +#ifndef DBUG_OFF
> +void debug_crash_here(const char *keyword);
> +bool debug_simulate_error(const char *keyword, uint error);
> +#else
> +#define debug_crash_here(A) do { } while(0)
> +#define debug_simulate_error(A, B) 0
> +#endif
this doesn't belong to debug_sync.h or debug_sync.cc
put it, hmm, may be in sql_priv.h ? or sql_class.h/sql_class.cc
> +
> #endif /* DEBUG_SYNC_INCLUDED */
> diff --git a/sql/debug_sync.cc b/sql/debug_sync.cc
> index 1fbb15592a4..f523e22f6ce 100644
> --- a/sql/debug_sync.cc
> +++ b/sql/debug_sync.cc
> @@ -1628,3 +1628,74 @@ bool debug_sync_set_action(THD *thd, const char *action_str, size_t len)
> /* prevent linker/lib warning about file without public symbols */
> int debug_sync_dummy;
> #endif /* defined(ENABLED_DEBUG_SYNC) */
> +
> +
> +/**
> + Debug utility to do crash after a set number of executions
> +
> + The user variable, either @debug_crash_counter or @debug_error_counter,
this is very hackish. better to use @@debug_dbug_counter,
something like that. also it could be global for an extra benefit
if you'd need to crash a non-user thread.
> + is decremented each time debug_crash() or debug_simulate_error is called
> + if the keyword is set with @@debug_push, like
> + @@debug_push="d+frm_data_type_info_emulate"
> +
> + If the variable is not set or is not an integer it will be ignored.
> +*/
> +
> +#ifndef DBUG_OFF
> +
> +static const LEX_CSTRING debug_crash_counter=
> +{ STRING_WITH_LEN("debug_crash_counter") };
> +static const LEX_CSTRING debug_error_counter=
> +{ STRING_WITH_LEN("debug_error_counter") };
> +
> +static bool debug_decrement_counter(const LEX_CSTRING *name)
> +{
> + THD *thd= current_thd;
> + user_var_entry *entry= (user_var_entry*)
> + my_hash_search(&thd->user_vars, (uchar*) name->str, name->length);
> + if (!entry || entry->type != INT_RESULT || ! entry->value)
> + return 0;
> + (*(ulonglong*) entry->value)= (*(ulonglong*) entry->value)-1;
> + return !*(ulonglong*) entry->value;
> +}
> +
> +void debug_crash_here(const char *keyword)
> +{
> + DBUG_ENTER("debug_crash_here");
> + DBUG_PRINT("enter", ("keyword: %s", keyword));
> +
> + DBUG_EXECUTE_IF(keyword,
> + if (debug_decrement_counter(&debug_crash_counter))
> + {
> + my_printf_error(ER_INTERNAL_ERROR,
> + "Crashing at %s",
> + MYF(ME_ERROR_LOG | ME_NOTE), keyword);
> + DBUG_SUICIDE();
> + });
> + DBUG_VOID_RETURN;
> +}
> +
> +/*
> + This can be used as debug_counter to simulate an error at a specific
> + position.
> +
> + Typical usage would be
> + if (debug_simualte_error("keyword"))
> + error= 1;
> +*/
> +
> +bool debug_simulate_error(const char *keyword, uint error)
> +{
> + DBUG_ENTER("debug_crash_here");
> + DBUG_PRINT("enter", ("keyword: %s", keyword));
> + DBUG_EXECUTE_IF(keyword,
> + if (debug_decrement_counter(&debug_error_counter))
> + {
> + my_printf_error(error,
> + "Simulating error for '%s'",
> + MYF(ME_ERROR_LOG), keyword);
> + DBUG_RETURN(1);
> + });
> + DBUG_RETURN(0);
> +}
> +#endif /* DBUG_OFF */
> diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
> index cca05f1fcca..0cf037e3a6f 100644
> --- a/sql/log_event_server.cc
> +++ b/sql/log_event_server.cc
> @@ -1294,6 +1294,15 @@ bool Query_log_event::write()
> int3store(start, when_sec_part);
> start+= 3;
> }
> +
> + /* xid's is used with ddl_log handling */
it's not really an XID, it's a query_id or the DDL statetent.
XID is a combination of a 64-byte global transaction identifier and
a 64-byte branch qualifier.
It would be much less confusing not to pretend that the query_id
is an "xid" is some shape or form. Better to rename it to, say
ddl_query_id. Or even ddl_id, if you like it short.
> + if (thd && thd->binlog_xid)
> + {
> + *start++= Q_XID;
> + int8store(start, thd->query_id);
> + start+= 8;
> + }
> +
> /*
> NOTE: When adding new status vars, please don't forget to update
> the MAX_SIZE_LOG_EVENT_STATUS in log_event.h and update the function
> diff --git a/sql/log.cc b/sql/log.cc
> index 17c599627cb..6e0a9706ec8 100644
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -10654,6 +10669,8 @@ int TC_LOG_BINLOG::recover(LOG_INFO *linfo, const char *last_log_name,
> {
> if (ha_recover(&xids))
> goto err2;
> + if (ddl_log_close_binlogged_events(&xids))
your xids and normal Xid_log_event xids are from two different
non-intersecting sets and have even different semantics.
It'd be safer not to have them in the same hash.
> + goto err2;
>
> free_root(&mem_root, MYF(0));
> my_hash_free(&xids);
> diff --git a/sql/sql_view.cc b/sql/sql_view.cc
> index 5df5ecb76a9..e6d726b30d7 100644
> --- a/sql/sql_view.cc
> +++ b/sql/sql_view.cc
> @@ -2231,16 +2234,25 @@ mysql_rename_view(THD *thd,
> (uchar*)&view_def, view_parameters))
> {
> /* restore renamed view in case of error */
> - rename_in_schema_file(thd, new_db->str, new_name->str, view->db.str,
> - view->table_name.str);
> + rename_in_schema_file(thd, new_db->str, new_name->str, old_db->str,
> + old_name->str);
> goto err;
> }
> - } else
> + }
> + else
> DBUG_RETURN(1);
>
> /* remove cache entries */
> - query_cache_invalidate3(thd, view, 0);
> - sp_cache_invalidate();
> + {
> + char key[NAME_LEN*2+1], *ptr;
> + memcpy(key, old_db->str, old_db->length);
> + ptr= key+ old_db->length;
> + *ptr++= 0;
> + memcpy(key, old_name->str, old_name->length);
> + ptr= key+ old_db->length;
> + *ptr++= 0;
there's tdc_create_key for that.
> + query_cache.invalidate(thd, key, (size_t) (ptr-key), 0);
> + }
> error= FALSE;
>
> err:
> diff --git a/sql/ddl_log.h b/sql/ddl_log.h
> index 0747699cd85..316e6708f22 100644
> --- a/sql/ddl_log.h
> +++ b/sql/ddl_log.h
> @@ -24,6 +24,9 @@
> enum ddl_log_entry_code
> {
> /*
> + DDL_LOG_UNKOWN
s/UNKOWN/UNKNOWN/
> + Here mainly to detect blocks that are all zero
> +
> DDL_LOG_EXECUTE_CODE:
> This is a code that indicates that this is a log entry to
> be executed, from this entry a linked list of log entries
> @@ -34,48 +37,95 @@ enum ddl_log_entry_code
> DDL_IGNORE_LOG_ENTRY_CODE:
> An entry that is to be ignored
> */
> - DDL_LOG_EXECUTE_CODE = 'e',
> - DDL_LOG_ENTRY_CODE = 'l',
> - DDL_IGNORE_LOG_ENTRY_CODE = 'i'
> + DDL_LOG_UNKNOWN= 0,
> + DDL_LOG_EXECUTE_CODE= 1,
> + DDL_LOG_ENTRY_CODE= 2,
> + DDL_IGNORE_LOG_ENTRY_CODE= 3,
> + DDL_LOG_ENTRY_CODE_LAST= 4
> };
>
> +
> +/*
> + When adding things below, also add an entry to ddl_log_entry_phases in
> + ddl_log.cc
> +*/
> +
> enum ddl_log_action_code
> {
> /*
> The type of action that a DDL_LOG_ENTRY_CODE entry is to
> perform.
> - DDL_LOG_DELETE_ACTION:
> - Delete an entity
> - DDL_LOG_RENAME_ACTION:
> - Rename an entity
> - DDL_LOG_REPLACE_ACTION:
> - Rename an entity after removing the previous entry with the
> - new name, that is replace this entry.
> - DDL_LOG_EXCHANGE_ACTION:
> - Exchange two entities by renaming them a -> tmp, b -> a, tmp -> b.
> */
> - DDL_LOG_DELETE_ACTION = 'd',
> - DDL_LOG_RENAME_ACTION = 'r',
> - DDL_LOG_REPLACE_ACTION = 's',
> - DDL_LOG_EXCHANGE_ACTION = 'e'
> + DDL_LOG_UNKNOWN_ACTION= 0,
> +
> + /* Delete a .frm file or a table in the partition engine */
> + DDL_LOG_DELETE_ACTION= 1,
> +
> + /* Rename a .frm fire a table in the partition engine */
> + DDL_LOG_RENAME_ACTION= 2,
> +
> + /*
> + Rename an entity after removing the previous entry with the
> + new name, that is replace this entry.
> + */
> + DDL_LOG_REPLACE_ACTION= 3,
> +
> + /* Exchange two entities by renaming them a -> tmp, b -> a, tmp -> b */
> + DDL_LOG_EXCHANGE_ACTION= 4,
> + /*
> + log do_rename(): Rename of .frm file, table, stat_tables and triggers
> + */
> + DDL_LOG_RENAME_TABLE_ACTION= 5,
> + DDL_LOG_RENAME_VIEW_ACTION= 6,
> + DDL_LOG_LAST_ACTION /* End marker */
> };
>
> +
> +/* Number of phases for each ddl_log_action_code */
> +extern const uchar ddl_log_entry_phases[DDL_LOG_LAST_ACTION];
> +
> +
> enum enum_ddl_log_exchange_phase {
> EXCH_PHASE_NAME_TO_TEMP= 0,
> EXCH_PHASE_FROM_TO_NAME= 1,
> EXCH_PHASE_TEMP_TO_FROM= 2
> };
>
> +enum enum_ddl_log_rename_table_phase {
> + DDL_RENAME_PHASE_TRIGGER= 0,
> + DDL_RENAME_PHASE_STAT,
> + DDL_RENAME_PHASE_TABLE,
> +};
> +
> +/*
> + Setting ddl_log_entry.phase to this has the same effect as setting
> + action_type to DDL_IGNORE_LOG_ENTRY_CODE
this doesn't make much sense. May be you can add a line or two
about why it's needed and one cannot simply set action_type
to DDL_IGNORE_LOG_ENTRY_CODE instead?
> +*/
> +
> +#define DDL_LOG_FINAL_PHASE ((uchar) 0xff)
>
> typedef struct st_ddl_log_entry
> {
> - const char *name;
> - const char *from_name;
> - const char *handler_name;
> - const char *tmp_name;
> + LEX_CSTRING name;
> + LEX_CSTRING from_name;
> + LEX_CSTRING handler_name;
> + LEX_CSTRING tmp_name;
> + LEX_CSTRING db;
> + LEX_CSTRING from_db;
> + LEX_CSTRING from_handler_name;
> + uchar uuid[MY_UUID_SIZE]; // UUID for new frm file
> +
> + ulonglong xid; // Xid stored in the binary log
> uint next_entry;
> - uint entry_pos;
> - enum ddl_log_entry_code entry_type;
> + uint entry_pos; // Set by write_dll_log_entry()
> + /*
> + unique_id can be used to store a unique number to check current state.
> + Currently it is used to store new size of frm file or link to a ddl log
> + entry.
is that so? I don't see it storing a file version, a new size of the frm or a
link, I see it stores some kind of a counter to abort recovery when it was
repeated more than DDL_LOG_MAX_RETRY times
> + */
> + uint32 unique_id; // Unique id for file version
> + uint16 flags; // Flags unique for each command
> + enum ddl_log_entry_code entry_type; // Set automatically
> enum ddl_log_action_code action_type;
> /*
> Most actions have only one phase. REPLACE does however have two
> @@ -95,17 +145,52 @@ typedef struct st_ddl_log_memory_entry
> } DDL_LOG_MEMORY_ENTRY;
>
>
> +typedef struct st_ddl_log_state
> +{
> + /* List of ddl log entries */
> + DDL_LOG_MEMORY_ENTRY *list;
> + /* One execute entry per list */
> + DDL_LOG_MEMORY_ENTRY *execute_entry;
> +} DDL_LOG_STATE;
would be nice to have a small comment, explaining what is a "state"
and how two lists can represent it.
is execute_entry a pointer to a some element from the list?
> +
> +
> +/* These functions are for recovery */
> +bool ddl_log_initialize();
> +void ddl_log_release();
> +bool ddl_log_close_binlogged_events(HASH *xids);
> +int ddl_log_execute_recovery();
> +
> +/* functions for updating the ddl log */
> bool ddl_log_write_entry(DDL_LOG_ENTRY *ddl_log_entry,
> DDL_LOG_MEMORY_ENTRY **active_entry);
> +
> bool ddl_log_write_execute_entry(uint first_entry,
> - bool complete,
> - DDL_LOG_MEMORY_ENTRY **active_entry);
> -bool ddl_log_increment_phase(uint entry_no);
> + DDL_LOG_MEMORY_ENTRY **active_entry);
> +bool ddl_log_disable_execute_entry(DDL_LOG_MEMORY_ENTRY **active_entry);
> +
> +void ddl_log_complete(DDL_LOG_STATE *ddl_log_state);
> +void ddl_log_revert(THD *thd, DDL_LOG_STATE *ddl_log_state);
> +
> +bool ddl_log_update_phase(DDL_LOG_STATE *entry, uchar phase);
> +bool ddl_log_update_xid(DDL_LOG_STATE *state, ulonglong xid);
> +bool ddl_log_disable_entry(DDL_LOG_STATE *state);
> +bool ddl_log_increment_phase(uint entry_pos);
> void ddl_log_release_memory_entry(DDL_LOG_MEMORY_ENTRY *log_entry);
> bool ddl_log_sync();
> -void ddl_log_release();
> -void ddl_log_execute_recovery();
> bool ddl_log_execute_entry(THD *thd, uint first_entry);
>
> +void ddl_log_release_entries(DDL_LOG_STATE *ddl_log_state);
> +bool ddl_log_rename_table(THD *thd, DDL_LOG_STATE *ddl_state,
> + handlerton *hton,
> + const LEX_CSTRING *org_db,
> + const LEX_CSTRING *org_alias,
> + const LEX_CSTRING *new_db,
> + const LEX_CSTRING *new_alias);
> +bool ddl_log_rename_view(THD *thd, DDL_LOG_STATE *ddl_state,
> + const LEX_CSTRING *org_db,
> + const LEX_CSTRING *org_alias,
> + const LEX_CSTRING *new_db,
> + const LEX_CSTRING *new_alias);
> +
> extern mysql_mutex_t LOCK_gdl;
> #endif /* DDL_LOG_INCLUDED */
> diff --git a/sql/sql_rename.cc b/sql/sql_rename.cc
> index b7aed97a8a2..2b2f32cf126 100644
> --- a/sql/sql_rename.cc
> +++ b/sql/sql_rename.cc
> @@ -186,49 +172,43 @@ bool mysql_rename_tables(THD *thd, TABLE_LIST *table_list, bool silent,
> /* Add IF EXISTS to binary log */
> thd->variables.option_bits|= OPTION_IF_EXISTS;
> }
> +
> + debug_crash_here("ddl_log_rename_before_binlog");
> + /*
> + Store xid in ddl log and binary log so that we can check on ddl recovery
> + if the item is in the binary log (and thus the operation was complete
> + */
> + thd->binlog_xid= thd->query_id;
> + ddl_log_update_xid(&ddl_log_state, thd->binlog_xid);
> binlog_error= write_bin_log(thd, TRUE, thd->query(), thd->query_length());
I'd prefer to avoid blowing up THD with arbitrary flags
that are only needed rarely in one specific place in the code
to pass some argument few stack frames down.
Perhaps you can do instead something like
bool Query_log_event::write()
{
...
/* xid's is used with ddl_log handling */
if (ddl_log_entry_incomplete())
{
*start++= Q_XID;
int8store(start, thd->query_id);
start+= 8;
}
or at least you could replace ulonglong binlog_xid
with a one bit state, like thd->server_status & SERVER_STATUS_IN_DDL
or with some other bitmap of flags.
> + if (binlog_error)
> + error= 1;
> + thd->binlog_xid= 0;
> thd->variables.option_bits= save_option_bits;
> + debug_crash_here("ddl_log_rename_after_binlog");
>
> if (likely(!binlog_error))
> my_ok(thd);
> }
>
> if (likely(!error))
> + {
> query_cache_invalidate3(thd, table_list, 0);
> + ddl_log_complete(&ddl_log_state);
> + }
> + else
> + {
> + /* Revert the renames of normal tables with the help of the ddl log */
> + ddl_log_revert(thd, &ddl_log_state);
> + }
>
> err:
> DBUG_RETURN(error || binlog_error);
> }
>
>
> -/*
> - reverse table list
> -
> - SYNOPSIS
> - reverse_table_list()
> - table_list pointer to table _list
> -
> - RETURN
> - pointer to new (reversed) list
> -*/
> -static TABLE_LIST *reverse_table_list(TABLE_LIST *table_list)
> -{
> - TABLE_LIST *prev= 0;
> -
> - while (table_list)
> - {
> - TABLE_LIST *next= table_list->next_local;
> - table_list->next_local= prev;
> - prev= table_list;
> - table_list= next;
> - }
> - return (prev);
> -}
> -
> -
> static bool
> -do_rename_temporary(THD *thd, TABLE_LIST *ren_table, TABLE_LIST *new_table,
> - bool skip_error)
> +do_rename_temporary(THD *thd, TABLE_LIST *ren_table, TABLE_LIST *new_table)
I wonder why we never got any warnings about it
> {
> LEX_CSTRING *new_alias;
> DBUG_ENTER("do_rename_temporary");
> diff --git a/sql/ddl_log.cc b/sql/ddl_log.cc
> index bad786b3799..777dfdddbc7 100644
> --- a/sql/ddl_log.cc
> +++ b/sql/ddl_log.cc
> @@ -22,176 +22,333 @@
> #include "log.h" // sql_print_error()
> #include "ddl_log.h"
> #include "ha_partition.h" // PAR_EXT
> +#include "sql_table.h" // build_table_filename
> +#include "sql_statistics.h" // rename_table_in_stats_tables
> +#include "sql_view.h" // mysql_rename_view()
> +#include "strfunc.h" // strconvert
> +#include <mysys_err.h> // EE_LINK
>
>
> /*--------------------------------------------------------------------------
>
> MODULE: DDL log
> -----------------
>
> - This module is used to ensure that we can recover from crashes that occur
> - in the middle of a meta-data operation in MySQL. E.g. DROP TABLE t1, t2;
> - We need to ensure that both t1 and t2 are dropped and not only t1 and
> - also that each table drop is entirely done and not "half-baked".
> + This module is used to ensure that we can recover from crashes that
> + occur in the middle of a meta-data operation in MySQL. E.g. DROP
> + TABLE t1, t2; We need to ensure that both t1 and t2 are dropped and
> + not only t1 and also that each table drop is entirely done and not
> + "half-baked".
>
> - To support this we create log entries for each meta-data statement in the
> - ddl log while we are executing. These entries are dropped when the
> - operation is completed.
> + To support this we create log entries for each meta-data statement
> + in the ddl log while we are executing. These entries are dropped
> + when the operation is completed.
>
> At recovery those entries that were not completed will be executed.
>
> There is only one ddl log in the system and it is protected by a mutex
> and there is a global struct that contains information about its current
> state.
>
> + DDl recovery after a crash works the following way:
DDL or ddl
> +
> + - ddl_log_initialize() initializes the global global_ddl_log variable
> + and opens the binary log if it exists. If it doesn't exists a new one
> + is created.
it doesn't seem to be opening a binary log. Did you mean ddl log?
> + - ddl_log_close_binlogged_events() loops over all log events and checks if
> + their xid (stored in the EXECUTE_CODE event) is in the binary log. If xid
> + exists in the binary log the entry is marked as finished in the ddl log.
> + - After a new binary log is created and is open for new entries,
> + ddl_log_execute_recovery() is executed on remaining open events:
> + - Loop over all events
> + - For each entry with DDL_LOG_ENTRY_CODE execute the remaining phases
> + in ddl_log_execute_entry_no_lock()
> +
> + The ddl_log.log file is created at startup and deleted when server goes down.
> + After the final recovery phase is done, the file is truncated.
> +
> History:
> First version written in 2006 by Mikael Ronstrom
> - Second version written in 2020 by Monty
> + Second version in 2020 by Monty
> --------------------------------------------------------------------------*/
>
> +#define DDL_LOG_MAGIC_LENGTH 4
> +/* How many times to try to execute a ddl log entry that causes crashes */
> +#define DDL_LOG_MAX_RETRY 3
> +
> +uchar ddl_log_file_magic[]=
> +{ (uchar) 254, (uchar) 254, (uchar) 11, (uchar) 2 };
> +
> +/* Action names for ddl_log_action_code */
> +
> +const char *ddl_log_action_name[DDL_LOG_LAST_ACTION]=
> +{
> + "Unknown", "partitioning delete", "partitioning rename",
> + "partitioning replace", "partitioning exchange",
> + "rename table", "rename view"
> +};
> +
> +/* Number of phases per entry */
> +const uchar ddl_log_entry_phases[DDL_LOG_LAST_ACTION]=
> +{
> + 1, 1, 2, 3, 4, 1
> +};
> +
> +
> struct st_global_ddl_log
> {
> - /*
> - We need to adjust buffer size to be able to handle downgrades/upgrades
> - where IO_SIZE has changed. We'll set the buffer size such that we can
> - handle that the buffer size was upto 4 times bigger in the version
> - that wrote the DDL log.
> - */
> - char file_entry_buf[4*IO_SIZE];
> - char file_name_str[FN_REFLEN];
> - char *file_name;
> + uchar *file_entry_buf;
> DDL_LOG_MEMORY_ENTRY *first_free;
> DDL_LOG_MEMORY_ENTRY *first_used;
> File file_id;
> - uint name_len;
> uint num_entries;
> + uint name_pos;
> uint io_size;
> - bool inited;
> - bool do_release;
> - bool recovery_phase;
> - st_global_ddl_log() : inited(false), do_release(false) {}
> + bool initialized;
> + bool open;
> };
>
> st_global_ddl_log global_ddl_log;
static?
>
> mysql_mutex_t LOCK_gdl;
>
> +/* Positions to different data in a ddl log block */
> #define DDL_LOG_ENTRY_TYPE_POS 0
> +/*
> + Note that ACTION_TYPE and PHASE_POS must be after each other.
> + See update_phase()
> +*/
> #define DDL_LOG_ACTION_TYPE_POS 1
> #define DDL_LOG_PHASE_POS 2
> #define DDL_LOG_NEXT_ENTRY_POS 4
> -#define DDL_LOG_NAME_POS 8
> +/* Flags to remember something unique about the query, like if .frm was used */
> +#define DDL_LOG_FLAG_POS 8
> +/* Used to store XID entry that was written to binary log */
> +#define DDL_LOG_XID_POS 10
> +/* Used to store unique uuid from the .frm file */
"unique uuid" is a weird choice of words
> +#define DDL_LOG_UUID_POS 18
> +/* ID_POS can be used to store something unique, like file size (4 bytes) */
> +#define DDL_LOG_ID_POS DDL_LOG_UUID_POS + MY_UUID_SIZE
> +#define DDL_LOG_END_POS DDL_LOG_ID_POS + 8
>
> -#define DDL_LOG_NUM_ENTRY_POS 0
> -#define DDL_LOG_NAME_LEN_POS 4
> -#define DDL_LOG_IO_SIZE_POS 8
> +/*
> + Position to where names are stored in the ddl log blocks. The current
> + value is stored in the header and can thus be changed if we need more
> + space for constants in the header than what is between DDL_LOG_ID_POS and
> + DDL_LOG_TMP_NAME_POS.
> +*/
> +#define DDL_LOG_TMP_NAME_POS 56
>
> +/* Definitions for the ddl log header, the first block in the file */
> +/* IO_SIZE is stored in the header and can thus be changed */
> +#define DDL_LOG_IO_SIZE IO_SIZE
> +
> +/* Header is stored in positions 0-3 */
> +#define DDL_LOG_IO_SIZE_POS 4
> +#define DDL_LOG_NAME_OFFSET_POS 6
> +/* Sum of the above variables */
> +#define DDL_LOG_HEADER_SIZE 4+2+2
> +
> /**
> + Sync the ddl log file.
> +
> + @return Operation status
> + @retval FALSE Success
> + @retval TRUE Error
> +*/
> +
> +static bool ddl_log_sync_file()
> +{
> + DBUG_ENTER("ddl_log_sync_file");
> + DBUG_RETURN(mysql_file_sync(global_ddl_log.file_id, MYF(MY_WME)));
> +}
I don't quite understand that, are there cases when you want to
sync the ddl log and you *don't know* whether LOCK_gdl is locked or not?
because if you know there should be no lock, you'd better use
mysql_mutex_assert_not_owner here.
> +
> +/* Same as above, but ensure we have the LOCK_gdb locked */
that's an interesting Freudian slip
> +
> +static bool ddl_log_sync_no_lock()
> +{
> + DBUG_ENTER("ddl_log_sync_no_lock");
> +
> + mysql_mutex_assert_owner(&LOCK_gdl);
> + DBUG_RETURN(ddl_log_sync_file());
> +}
...
> +/*
> + Store and read strings in ddl log buffers
> +
> + Format is:
> + 2 byte: length (not counting end \0)
> + X byte: string value of length 'length'
> + 1 byte: \0
a bit redundant, don't you think? Why did you do it?
> +*/
> +
> +static uchar *store_string(uchar *pos, uchar *end, const LEX_CSTRING *str)
> {
> + uint32 length= (uint32) str->length;
> + if (unlikely(pos + 2 + length + 1 > end))
> + {
> + DBUG_ASSERT(0);
> + return end; // Overflow
> }
> +
> + int2store(pos, length);
> + if (likely(length))
> + memcpy(pos+2, str->str, length);
> + pos[2+length]= 0; // Store end \0
> + return pos + 2 + length +1;
> +}
> +
> +
> +static LEX_CSTRING get_string(uchar **pos, const uchar *end)
> {
> + LEX_CSTRING tmp;
> + uint32 length;
> + if (likely(*pos + 3 <= end))
> + {
> + length= uint2korr(*pos);
> + if (likely(*pos + 2 + length + 1 <= end))
> + {
> + char *str= (char*) *pos+2;
> + *pos= *pos + 2 + length + 1;
> + tmp.str= str;
> + tmp.length= length;
> + return tmp;
> }
> + }
> + /*
> + Overflow on read, should never happen
> + Set *pos to end to ensure any future calls also returns empty string
> + */
> + DBUG_ASSERT(0);
> + *pos= (uchar*) end;
> + tmp.str= "";
> + tmp.length= 0;
> + return tmp;
> }
>
>
> @@ -256,38 +484,32 @@ static uint read_ddl_log_header()
>
> static void set_global_from_ddl_log_entry(const DDL_LOG_ENTRY *ddl_log_entry)
it'd be good to have some comment explaining why you need this
function. Couldn't you write directly from the ddl_log_entry?
or use a local buffer? why to serialize into a buffer inside global_ddl_log?
> {
> + uchar *file_entry_buf= global_ddl_log.file_entry_buf, *pos, *end;
> +
> mysql_mutex_assert_owner(&LOCK_gdl);
it would be logical to have LOCK_gdl inside the global_ddl_log struct
> - global_ddl_log.file_entry_buf[DDL_LOG_ENTRY_TYPE_POS]=
> - (char)DDL_LOG_ENTRY_CODE;
> - global_ddl_log.file_entry_buf[DDL_LOG_ACTION_TYPE_POS]=
> - (char)ddl_log_entry->action_type;
> - global_ddl_log.file_entry_buf[DDL_LOG_PHASE_POS]= 0;
> - int4store(&global_ddl_log.file_entry_buf[DDL_LOG_NEXT_ENTRY_POS],
> - ddl_log_entry->next_entry);
> - DBUG_ASSERT(strlen(ddl_log_entry->name) < FN_REFLEN);
> - strmake(&global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS],
> - ddl_log_entry->name, FN_REFLEN - 1);
> - if (ddl_log_entry->action_type == DDL_LOG_RENAME_ACTION ||
> - ddl_log_entry->action_type == DDL_LOG_REPLACE_ACTION ||
> - ddl_log_entry->action_type == DDL_LOG_EXCHANGE_ACTION)
> - {
> - DBUG_ASSERT(strlen(ddl_log_entry->from_name) < FN_REFLEN);
> - strmake(&global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS + FN_REFLEN],
> - ddl_log_entry->from_name, FN_REFLEN - 1);
> - }
> - else
> - global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS + FN_REFLEN]= 0;
> - DBUG_ASSERT(strlen(ddl_log_entry->handler_name) < FN_REFLEN);
> - strmake(&global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS + (2*FN_REFLEN)],
> - ddl_log_entry->handler_name, FN_REFLEN - 1);
> - if (ddl_log_entry->action_type == DDL_LOG_EXCHANGE_ACTION)
> - {
> - DBUG_ASSERT(strlen(ddl_log_entry->tmp_name) < FN_REFLEN);
> - strmake(&global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS + (3*FN_REFLEN)],
> - ddl_log_entry->tmp_name, FN_REFLEN - 1);
> - }
> - else
> - global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS + (3*FN_REFLEN)]= 0;
> +
> + file_entry_buf[DDL_LOG_ENTRY_TYPE_POS]= (uchar) ddl_log_entry->entry_type;
> + file_entry_buf[DDL_LOG_ACTION_TYPE_POS]= (uchar) ddl_log_entry->action_type;
> + file_entry_buf[DDL_LOG_PHASE_POS]= (uchar) ddl_log_entry->phase;
> + int4store(file_entry_buf+DDL_LOG_NEXT_ENTRY_POS, ddl_log_entry->next_entry);
> + int2store(file_entry_buf+DDL_LOG_FLAG_POS, ddl_log_entry->flags);
> + int8store(file_entry_buf+DDL_LOG_XID_POS, ddl_log_entry->xid);
> + memcpy(file_entry_buf+DDL_LOG_UUID_POS, ddl_log_entry->uuid, MY_UUID_SIZE);
> + int8store(file_entry_buf+DDL_LOG_ID_POS, ddl_log_entry->unique_id);
> + bzero(file_entry_buf+DDL_LOG_END_POS,
> + global_ddl_log.name_pos - DDL_LOG_END_POS);
> +
> + pos= file_entry_buf + global_ddl_log.name_pos;
> + end= file_entry_buf + global_ddl_log.io_size;
> +
> + pos= store_string(pos, end, &ddl_log_entry->handler_name);
> + pos= store_string(pos, end, &ddl_log_entry->db);
> + pos= store_string(pos, end, &ddl_log_entry->name);
> + pos= store_string(pos, end, &ddl_log_entry->from_handler_name);
> + pos= store_string(pos, end, &ddl_log_entry->from_db);
> + pos= store_string(pos, end, &ddl_log_entry->from_name);
> + pos= store_string(pos, end, &ddl_log_entry->tmp_name);
you have DBUG_ASSERT(0) that all those strings won't overflow io_size.
How can you guarantee that?
> + bzero(pos, global_ddl_log.io_size - (pos - file_entry_buf));
> }
>
>
> @@ -442,63 +691,120 @@ static bool ddl_log_sync_no_lock()
...
> +
> +
> +/*
> + Build a filename for a table, trigger file or .frm
> + Delete also any temporary file suffixed with ~
Where do you get these *~ files from? What creates them?
> +*/
> +
> +static void build_filename_and_delete_tmp_file(char *path, size_t path_length,
> + const LEX_CSTRING *db,
> + const LEX_CSTRING *name,
> + const char *ext,
> + PSI_file_key psi_key)
> +{
> + uint length= build_table_filename(path, path_length-1,
> + db->str, name->str, ext, 0);
> +
> + path[length]= '~';
> + path[length+1]= 0;
> + (void) mysql_file_delete(psi_key, path, MYF(0));
> + path[length]= 0;
> }
>
>
> @@ -806,154 +1296,132 @@ static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry)
> bool ddl_log_write_entry(DDL_LOG_ENTRY *ddl_log_entry,
> DDL_LOG_MEMORY_ENTRY **active_entry)
> {
> - bool error, write_header;
> + bool error;
> + uchar *pos, *end;
> DBUG_ENTER("ddl_log_write_entry");
>
> mysql_mutex_assert_owner(&LOCK_gdl);
> - if (init_ddl_log())
> + if (!global_ddl_log.open)
> DBUG_RETURN(TRUE);
> +
> + ddl_log_entry->entry_type= DDL_LOG_ENTRY_CODE;
> set_global_from_ddl_log_entry(ddl_log_entry);
> - if (get_free_ddl_log_entry(active_entry, &write_header))
> + if (ddl_log_get_free_entry(active_entry))
> DBUG_RETURN(TRUE);
> +
> error= FALSE;
> + pos= global_ddl_log.file_entry_buf + global_ddl_log.name_pos;
> + end= global_ddl_log.file_entry_buf + global_ddl_log.io_size;
> DBUG_PRINT("ddl_log",
> - ("write type %c next %u name '%s' from_name '%s' handler '%s'"
> - " tmp_name '%s'",
> + ("type: %c next: %u handler: %s "
> + "to_name: '%s.%s' from_name: '%s.%s' "
> + "tmp_name: '%s'",
> (char) global_ddl_log.file_entry_buf[DDL_LOG_ACTION_TYPE_POS],
> ddl_log_entry->next_entry,
> - (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS],
> - (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS
> - + FN_REFLEN],
> - (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS
> - + (2*FN_REFLEN)],
> - (char*) &global_ddl_log.file_entry_buf[DDL_LOG_NAME_POS
> - + (3*FN_REFLEN)]));
> + get_string(&pos, end).str, // Handler
> + get_string(&pos, end).str, // to db.table
> + get_string(&pos, end).str,
> + get_string(&pos, end).str, // From db.table
> + get_string(&pos, end).str,
> + get_string(&pos, end).str)); // Tmp name
This is not a critical bug, because it's in a DBUG_PRINT,
but it's still a bug. C++ (and C) standard both say that
the order of evaluation of function arguments is *unspecified*.
> +
> if (unlikely(write_ddl_log_file_entry((*active_entry)->entry_pos)))
> {
> + ddl_log_release_memory_entry(*active_entry);
> + *active_entry= 0;
> error= TRUE;
> - sql_print_error("Failed to write entry_no = %u",
> + sql_print_error("DDL_LOG: Failed to write entry %u",
> (*active_entry)->entry_pos);
> }
> - if (write_header && likely(!error))
> - {
> - (void) ddl_log_sync_no_lock();
> - if (write_ddl_log_header())
> - error= TRUE;
> - }
> - if (unlikely(error))
> - ddl_log_release_memory_entry(*active_entry);
> DBUG_RETURN(error);
> }
>
>
> /**
> - @brief Write final entry in the ddl log.
> + @brief Write or update execute entry in the ddl log.
>
> - @details This is the last write in the ddl log. The previous log entries
> - have already been written but not yet synched to disk.
> - We write a couple of log entries that describes action to perform.
> - This entries are set-up in a linked list, however only when a first
> - execute entry is put as the first entry these will be executed.
> - This routine writes this first.
> + @details An execute entry points to the first entry that should
> + be excuted during recovery. In some cases it's only written once,
> + in other cases it's updated for each log entry to point to the new
> + header for the list.
>
> + When called, the previous log entries have already been written but not yet
> + synched to disk. We write a couple of log entries that describes
> + action to perform. This entries are set-up in a linked list,
> + however only when an execute entry is put as the first entry these will be
> + executed during recovery.
> +
> @param first_entry First entry in linked list of entries
> - to execute, if 0 = NULL it means that
> - the entry is removed and the entries
> - are put into the free list.
> - @param complete Flag indicating we are simply writing
> - info about that entry has been completed
> + to execute.
> @param[in,out] active_entry Entry to execute, 0 = NULL if the entry
> is written first time and needs to be
> returned. In this case the entry written
> is returned in this parameter
> -
> @return Operation status
> @retval TRUE Error
> @retval FALSE Success
> */
>
> bool ddl_log_write_execute_entry(uint first_entry,
> - bool complete,
> DDL_LOG_MEMORY_ENTRY **active_entry)
> {
> - bool write_header= FALSE;
> - char *file_entry_buf= (char*)global_ddl_log.file_entry_buf;
> + uchar *file_entry_buf= global_ddl_log.file_entry_buf;
> + bool got_free_entry= 0;
> DBUG_ENTER("ddl_log_write_execute_entry");
>
> mysql_mutex_assert_owner(&LOCK_gdl);
> - if (init_ddl_log())
> - {
> - DBUG_RETURN(TRUE);
> - }
> - if (!complete)
> - {
> /*
> We haven't synched the log entries yet, we synch them now before
> writing the execute entry. If complete is true we haven't written
there's no 'complete' anymore
> any log entries before, we are only here to write the execute
> entry to indicate it is done.
> */
> (void) ddl_log_sync_no_lock();
Why do you need to sync previous log entries? As far as I understand,
they're useless without DDL_LOG_EXECUTE_CODE entry, so why would it matter
whether they're on disk or not after a crash?
> - file_entry_buf[DDL_LOG_ENTRY_TYPE_POS]= (char)DDL_LOG_EXECUTE_CODE;
> - }
> - else
> - file_entry_buf[DDL_LOG_ENTRY_TYPE_POS]= (char)DDL_IGNORE_LOG_ENTRY_CODE;
> - file_entry_buf[DDL_LOG_ACTION_TYPE_POS]= 0; /* Ignored for execute entries */
> - file_entry_buf[DDL_LOG_PHASE_POS]= 0;
> - int4store(&file_entry_buf[DDL_LOG_NEXT_ENTRY_POS], first_entry);
> - file_entry_buf[DDL_LOG_NAME_POS]= 0;
> - file_entry_buf[DDL_LOG_NAME_POS + FN_REFLEN]= 0;
> - file_entry_buf[DDL_LOG_NAME_POS + 2*FN_REFLEN]= 0;
> + bzero(file_entry_buf, global_ddl_log.io_size);
> +
> + file_entry_buf[DDL_LOG_ENTRY_TYPE_POS]= (uchar)DDL_LOG_EXECUTE_CODE;
> + int4store(file_entry_buf + DDL_LOG_NEXT_ENTRY_POS, first_entry);
> +
> if (!(*active_entry))
> {
> - if (get_free_ddl_log_entry(active_entry, &write_header))
> - {
> + if (ddl_log_get_free_entry(active_entry))
> DBUG_RETURN(TRUE);
> + got_free_entry= TRUE;
> }
> - write_header= TRUE;
> - }
> if (write_ddl_log_file_entry((*active_entry)->entry_pos))
> {
> - sql_print_error("Error writing execute entry in ddl log");
> - ddl_log_release_memory_entry(*active_entry);
> - DBUG_RETURN(TRUE);
> - }
> - (void) ddl_log_sync_no_lock();
> - if (write_header)
> + if (got_free_entry)
> {
> - if (write_ddl_log_header())
> - {
> ddl_log_release_memory_entry(*active_entry);
> + *active_entry= 0;
> + }
> + sql_print_error("DDL_LOG: Error writing execute entry %u",
> + (*active_entry)->entry_pos);
> DBUG_RETURN(TRUE);
> }
> - }
> + (void) ddl_log_sync_no_lock();
> DBUG_RETURN(FALSE);
> }
>
> -
> /**
> - Deactivate an individual entry.
> + Increment phase for enty. Will deactivate entry after all phases are done
>
> @details see ddl_log_increment_phase_no_lock.
>
> - @param entry_no Entry position of record to change
> + @param entry_pos Entry position of record to change
>
> @return Operation status
> @retval TRUE Error
> @retval FALSE Success
> */
>
> -bool ddl_log_increment_phase(uint entry_no)
> +bool ddl_log_increment_phase(uint entry_pos)
> {
> bool error;
> DBUG_ENTER("ddl_log_increment_phase");
>
> mysql_mutex_lock(&LOCK_gdl);
> - error= ddl_log_increment_phase_no_lock(entry_no);
> + error= ddl_log_increment_phase_no_lock(entry_pos);
> mysql_mutex_unlock(&LOCK_gdl);
> DBUG_RETURN(error);
> }
> @@ -1042,87 +1484,157 @@ static void close_ddl_log()
> (void) mysql_file_close(global_ddl_log.file_id, MYF(MY_WME));
> global_ddl_log.file_id= (File) -1;
> }
> + global_ddl_log.open= 0;
> DBUG_VOID_RETURN;
> }
>
>
> /**
> + Loop over ddl log excute entries and mark those that are already stored
> + in the binary log as completed
> +
> + @return
> + @retval 0 ok
> + @return 1 fail (write error)
> +
> +*/
> +
> +bool ddl_log_close_binlogged_events(HASH *xids)
> +{
> + uint i;
> + DDL_LOG_ENTRY ddl_log_entry;
> + DBUG_ENTER("ddl_log_close_binlogged_events");
> +
> + if (global_ddl_log.num_entries == 0 || xids->records == 0)
> + DBUG_RETURN(0);
> +
> + mysql_mutex_lock(&LOCK_gdl);
> + for (i= 1; i <= global_ddl_log.num_entries; i++)
> + {
> + if (read_ddl_log_entry(i, &ddl_log_entry))
> + break; // Read error. Ignore
it's not really "ignore", it's more like "stop reading"
> + if (ddl_log_entry.entry_type == DDL_LOG_EXECUTE_CODE &&
> + ddl_log_entry.xid != 0 &&
> + my_hash_search(xids, (uchar*) &ddl_log_entry.xid,
> + sizeof(ddl_log_entry.xid)))
> + {
> + if (disable_execute_entry(i))
> + {
> + mysql_mutex_unlock(&LOCK_gdl);
> + DBUG_RETURN(1); // Write error. Fatal!
> + }
> + }
> + }
> + (void) ddl_log_sync_no_lock();
> + mysql_mutex_unlock(&LOCK_gdl);
> + DBUG_RETURN(0);
> +}
> +
> +
> +/**
> Execute the ddl log at recovery of MySQL Server.
> +
> + @return
> + @retval 0 Ok.
> + @retval > 0 Fatal error. We have to abort (can't create ddl log)
> + @return < -1 Recovery failed, but new log exists and is usable
> +
> */
>
> -void ddl_log_execute_recovery()
> +int ddl_log_execute_recovery()
> {
> - uint num_entries, i;
> + uint i, count= 0;
> + int error= 0;
> THD *thd;
> DDL_LOG_ENTRY ddl_log_entry;
> - char file_name[FN_REFLEN];
> static char recover_query_string[]= "INTERNAL DDL LOG RECOVER IN PROGRESS";
> DBUG_ENTER("ddl_log_execute_recovery");
>
> - /*
> - Initialise global_ddl_log struct
> - */
> - bzero(global_ddl_log.file_entry_buf, sizeof(global_ddl_log.file_entry_buf));
> - global_ddl_log.inited= FALSE;
> - global_ddl_log.recovery_phase= TRUE;
> - global_ddl_log.io_size= IO_SIZE;
> - global_ddl_log.file_id= (File) -1;
> + if (global_ddl_log.num_entries == 0)
> + DBUG_RETURN(0);
>
> /*
> To be able to run this from boot, we allocate a temporary THD
> */
> if (!(thd=new THD(0)))
> - DBUG_VOID_RETURN;
> + {
> + DBUG_ASSERT(0); // Fatal error
> + DBUG_RETURN(1);
> + }
> thd->thread_stack= (char*) &thd;
> thd->store_globals();
> + thd->log_all_errors= (global_system_variables.log_warnings >= 3);
>
> thd->set_query(recover_query_string, strlen(recover_query_string));
>
> - /* this also initialize LOCK_gdl */
> - num_entries= read_ddl_log_header();
> mysql_mutex_lock(&LOCK_gdl);
> - for (i= 1; i < num_entries + 1; i++)
> + for (i= 1; i <= global_ddl_log.num_entries; i++)
> {
> if (read_ddl_log_entry(i, &ddl_log_entry))
> {
> - sql_print_error("Failed to read entry no = %u from ddl log",
> - i);
> + error= -1;
> continue;
> }
> if (ddl_log_entry.entry_type == DDL_LOG_EXECUTE_CODE)
> {
> + /* purecov: begin tested */
really? you tested with purify?
> + if (ddl_log_entry.unique_id > DDL_LOG_MAX_RETRY)
> + {
> + error= -1;
> + continue;
> + }
> + update_unique_id(i, ++ddl_log_entry.unique_id);
> + if (ddl_log_entry.unique_id > DDL_LOG_MAX_RETRY)
> + {
> + sql_print_error("DDL_LOG: Aborting executing entry %u after %llu "
> + "retries", i, ddl_log_entry.unique_id);
> + error= -1;
> + continue;
> + }
> + /* purecov: end tested */
> if (ddl_log_execute_entry_no_lock(thd, ddl_log_entry.next_entry))
> {
> - /* Real unpleasant scenario but we continue anyways. */
> + /* Real unpleasant scenario but we have to continue anyway */
why can this happen? do you have an example?
besides hardware failure, like a disk cannot be read
> + error= -1;
> continue;
> }
> + count++;
> }
> }
> close_ddl_log();
> - create_ddl_log_file_name(file_name);
> - (void) mysql_file_delete(key_file_global_ddl_log, file_name, MYF(0));
> - global_ddl_log.recovery_phase= FALSE;
> mysql_mutex_unlock(&LOCK_gdl);
> thd->reset_query();
> delete thd;
> - DBUG_VOID_RETURN;
> +
> + /*
> + Create a new ddl_log to get rid of old stuff and ensure that header matches
> + the current source version
> + */
because here you delete/overwrite the old ddl log, it'd make sense
to log all details of every entry that you failed to execute but decided
to continue anyway. Now you log only the entry number on failure, which is
rather useless, because it's a number in a ddl log that you overwrite.
> + if (create_ddl_log())
> + DBUG_RETURN(1);
> + if (count > 0)
> + sql_print_information("DDL_LOG: Crash recovery executed %u entries",
> + count);
> + DBUG_RETURN(error);
> }
>
>
> /**
> - Release all memory allocated to the ddl log.
> + Release all memory allocated to the ddl log and delete the ddl log
> */
>
> void ddl_log_release()
> {
> + char file_name[FN_REFLEN];
> DDL_LOG_MEMORY_ENTRY *free_list;
> DDL_LOG_MEMORY_ENTRY *used_list;
> DBUG_ENTER("ddl_log_release");
>
> - if (!global_ddl_log.do_release)
> + if (!global_ddl_log.initialized)
> DBUG_VOID_RETURN;
>
> - mysql_mutex_lock(&LOCK_gdl);
> + global_ddl_log.initialized= 0;
> +
> free_list= global_ddl_log.first_free;
> used_list= global_ddl_log.first_used;
> while (used_list)
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
2
1
16 May '21
Hi, Michael!
On May 12, Michael Widenius wrote:
> revision-id: c80f867b0df (mariadb-10.6.0-78-gc80f867b0df)
> parent(s): eebebe8ef75
> author: Michael Widenius <michael.widenius(a)gmail.com>
> committer: Michael Widenius <michael.widenius(a)gmail.com>
> timestamp: 2021-05-04 21:09:11 +0300
> message:
>
> MDEV-23844 Atomic DROP TABLE
...
> diff --git a/sql/sql_view.cc b/sql/sql_view.cc
> index e6d726b30d7..ff8aa55adef 100644
> --- a/sql/sql_view.cc
> +++ b/sql/sql_view.cc
> @@ -1861,8 +1868,18 @@ bool mysql_drop_view(THD *thd, TABLE_LIST *views, enum_drop_mode drop_mode)
> not_exists_count++;
> continue;
> }
> + if (!first_table++)
yuck. so your "first_table" actually means "not_first_view",
because when it's FALSE, it's the first view.
why not to rename it to "not_first_view" ?
> + {
> + if (ddl_log_drop_view_init(thd, &ddl_log_state))
> + DBUG_RETURN(TRUE);
what about a crash here? the first entry is written, the second is not.
> + }
> + if (ddl_log_drop_view(thd, &ddl_log_state, &cpath, &view->db,
> + &view->table_name))
> + DBUG_RETURN(TRUE);
> + debug_crash_here("ddl_log_drop_before_delete_view");
> if (unlikely(mysql_file_delete(key_file_frm, path, MYF(MY_WME))))
> delete_error= TRUE;
> + debug_crash_here("ddl_log_drop_after_delete_view");
>
> some_views_deleted= TRUE;
>
> diff --git a/sql/log.cc b/sql/log.cc
> index 6e0a9706ec8..dba999054e4 100644
> --- a/sql/log.cc
> +++ b/sql/log.cc
> @@ -10471,16 +10483,16 @@ int TC_LOG_BINLOG::recover(LOG_INFO *linfo, const char *last_log_name,
> bool last_gtid_standalone= false;
> bool last_gtid_valid= false;
> #endif
> + DBUG_ENTER("TC_LOG_BINLOG::recover");
>
> if (! fdle->is_valid() ||
> - (do_xa && my_hash_init(key_memory_binlog_recover_exec, &xids,
> - &my_charset_bin, TC_LOG_PAGE_SIZE/3, 0,
> - sizeof(my_xid), 0, 0, MYF(0))))
> + (my_hash_init(key_memory_binlog_recover_exec, &xids,
> + &my_charset_bin, TC_LOG_PAGE_SIZE/3, 0,
> + sizeof(my_xid), 0, 0, MYF(0))))
> goto err1;
>
> - if (do_xa)
> - init_alloc_root(key_memory_binlog_recover_exec, &mem_root,
> - TC_LOG_PAGE_SIZE, TC_LOG_PAGE_SIZE, MYF(0));
> + init_alloc_root(key_memory_binlog_recover_exec, &mem_root,
> + TC_LOG_PAGE_SIZE, TC_LOG_PAGE_SIZE, MYF(0));
I don't think these two changes are correct. do_xa is TRUE on crash recovery
and FALSE on gtid state recovery on 5.5->10.0 upgade or if gtid state
file was deleted. There is no point to initialize the xid hash or mem_root,
as it's not a crash recovery and there can be no incomplete transactions
or ddl statements.
Yes, I've seen the commit comment:
> - TC_LOG_BINLOG::recover() was changed to always collect Xid's from the
> binary log and always call ddl_log_close_binlogged_events(). This was
> needed to be able to collect DROP TABLE events with embedded Xid's
> (used by ddl log).
but I cannot understand how it applies here.
> fdle->flags&= ~LOG_EVENT_BINLOG_IN_USE_F; // abort on the first error
>
> diff --git a/sql/sql_table.cc b/sql/sql_table.cc
> index 7a88ffdfd1b..0a83396ae14 100644
> --- a/sql/sql_table.cc
> +++ b/sql/sql_table.cc
> @@ -1363,6 +1373,17 @@ int mysql_rm_table_no_locks(THD *thd, TABLE_LIST *tables, bool if_exists,
>
> thd->replication_flags= 0;
> was_view= table_type == TABLE_TYPE_VIEW;
> +
> + if (!first_table++)
first_table -> not_first_table
> + {
> + LEX_CSTRING comment= {comment_start, (size_t) comment_len};
> + if (ddl_log_drop_table_init(thd, &ddl_log_state, &comment))
> + {
> + error= 1;
> + goto err;
> + }
> + }
> +
> if ((table_type == TABLE_TYPE_UNKNOWN) || (was_view && !drop_view) ||
> (table_type != TABLE_TYPE_SEQUENCE && drop_sequence))
> {
> @@ -1565,6 +1611,7 @@ int mysql_rm_table_no_locks(THD *thd, TABLE_LIST *tables, bool if_exists,
> table_name.str, (uint)table_name.length);
> mysql_audit_drop_table(thd, table);
> }
> + ddl_log_update_phase(&ddl_log_state, DDL_DROP_PHASE_COLLECT);
better call it DDL_DROP_PHASE_BINLOG. It would've saved me ~10 minutes
reading the code back and forth, as ""DDL_DROP_PHASE_BINLOG" immediately
explains what this phase is for.
>
> if (!dont_log_query &&
> (!error || table_dropped || non_existing_table_error(error)))
> diff --git a/sql/ddl_log.h b/sql/ddl_log.h
> index 316e6708f22..b843404fc3c 100644
> --- a/sql/ddl_log.h
> +++ b/sql/ddl_log.h
> @@ -97,6 +101,14 @@ enum enum_ddl_log_rename_table_phase {
> DDL_RENAME_PHASE_TABLE,
> };
>
> +enum enum_ddl_log_drop_table_phase {
> + DDL_DROP_PHASE_TABLE=0,
> + DDL_DROP_PHASE_TRIGGER,
no phase for deleting data from stat tables?
> + DDL_DROP_PHASE_COLLECT,
> + DDL_DROP_PHASE_RESET, /* Reset found list of dropped tables */
this one is unused
(also unused in further patches, I've checked)
> + DDL_DROP_PHASE_END
> +};
> +
> /*
> Setting ddl_log_entry.phase to this has the same effect as setting
> action_type to DDL_IGNORE_LOG_ENTRY_CODE
> diff --git a/sql/ddl_log.cc b/sql/ddl_log.cc
> index 777dfdddbc7..6145b2f7318 100644
> --- a/sql/ddl_log.cc
> +++ b/sql/ddl_log.cc
> @@ -109,6 +112,7 @@ struct st_global_ddl_log
> };
>
> st_global_ddl_log global_ddl_log;
> +String ddl_drop_query; // Used during startup recovery
static
>
> mysql_mutex_t LOCK_gdl;
>
> @@ -1132,6 +1149,122 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,
> (void) update_phase(entry_pos, DDL_LOG_FINAL_PHASE);
> }
> break;
> + case DDL_LOG_DROP_TABLE_INIT_ACTION:
> + {
> + LEX_CSTRING *comment= &ddl_log_entry->tmp_name;
> + ddl_drop_query.length(0);
> + ddl_drop_query.set_charset(system_charset_info);
> + ddl_drop_query.append(STRING_WITH_LEN("DROP TABLE IF EXISTS "));
> + if (comment->length)
> + {
> + ddl_drop_query.append(comment);
> + ddl_drop_query.append(' ');
> + }
> + /* We don't increment phase as we want to retry this in case of crash */
> + break;
> + }
> + case DDL_LOG_DROP_TABLE_ACTION:
> + {
> + LEX_CSTRING db, table, path;
> + db= ddl_log_entry->db;
> + table= ddl_log_entry->name;
> + /* Note that path is without .frm extension */
> + path= ddl_log_entry->tmp_name;
> +
> + switch (ddl_log_entry->phase) {
> + case DDL_DROP_PHASE_TABLE:
> + if (hton)
> + {
> + if ((error= hton->drop_table(hton, path.str)))
> + {
> + if (!non_existing_table_error(error))
> + break;
> + error= -1;
> + }
> + }
> + else
> + error= ha_delete_table_force(thd, path.str, &db, &table);
> + if (error <= 0)
> + {
> + /* Not found or already deleted. Delete .frm if it exists */
> + strxnmov(to_path, sizeof(to_path)-1, path.str, reg_ext, NullS);
> + mysql_file_delete(key_file_frm, to_path, MYF(MY_WME|MY_IGNORE_ENOENT));
this should rather be inside the previous if().
no need to do it for ha_delete_table_force(), as it doesn't
leave any traces and removes all remnants of a table.
> + }
> + if (ddl_log_increment_phase_no_lock(entry_pos))
> + break;
> + (void) ddl_log_sync_no_lock();
No need to, ddl_log_increment_phase_no_lock() syncs internally
> + /* Fall through */
> + case DDL_DROP_PHASE_TRIGGER:
> + Table_triggers_list::drop_all_triggers(thd, &db, &table,
> + MYF(MY_WME | MY_IGNORE_ENOENT));
> + if (ddl_log_increment_phase_no_lock(entry_pos))
> + break;
> + (void) ddl_log_sync_no_lock();
same. and in other places too, I'd expect.
> + /* Fall through */
> +
> + case DDL_DROP_PHASE_COLLECT:
indentation?
> + append_identifier(thd, &ddl_drop_query, &db);
> + ddl_drop_query.append('.');
> + append_identifier(thd, &ddl_drop_query, &table);
> + ddl_drop_query.append(',');
> + /* We don't increment phase as we want to retry this in case of crash */
> +
> + if (!ddl_log_entry->next_entry && mysql_bin_log.is_open())
> + {
> + /* Last drop table. Write query to binlog */
> + LEX_CSTRING end_comment=
> + { STRING_WITH_LEN(" /* generated by ddl log */")};
"generated by ddl log" ? a "log" cannot "generate" a statement,
may be "generated during ddl recovery" ?
> + ddl_drop_query.length(ddl_drop_query.length()-1);
> + ddl_drop_query.append(&end_comment);
> +
> + mysql_mutex_unlock(&LOCK_gdl);
> + (void) thd->binlog_query(THD::STMT_QUERY_TYPE, ddl_drop_query.ptr(),
> + ddl_drop_query.length(), TRUE, FALSE,
> + FALSE, 0);
> + mysql_mutex_lock(&LOCK_gdl);
> + }
> + break;
> + }
> + break;
> + }
> + case DDL_LOG_DROP_VIEW_INIT_ACTION:
> + {
> + ddl_drop_query.length(0);
> + ddl_drop_query.set_charset(system_charset_info);
> + ddl_drop_query.append(STRING_WITH_LEN("DROP VIEW IF EXISTS "));
> + /* We don't increment phase as we want to retry this in case of crash */
> + break;
> + }
> + case DDL_LOG_DROP_VIEW_ACTION:
> + {
> + LEX_CSTRING db, table, path;
> + db= ddl_log_entry->db;
> + table= ddl_log_entry->name;
> + /* Note that for views path is WITH .frm extension */
> + path= ddl_log_entry->tmp_name;
> +
> + mysql_file_delete(key_file_frm, path.str, MYF(MY_WME|MY_IGNORE_ENOENT));
> + append_identifier(thd, &ddl_drop_query, &db);
> + ddl_drop_query.append('.');
> + append_identifier(thd, &ddl_drop_query, &table);
> + ddl_drop_query.append(',');
> +
> + if (!ddl_log_entry->next_entry)
> + {
> + /* Last drop view. Write query to binlog */
> + LEX_CSTRING end_comment=
> + { STRING_WITH_LEN(" /* generated by ddl log */")};
> + ddl_drop_query.length(ddl_drop_query.length()-1);
> + ddl_drop_query.append(&end_comment);
> +
> + mysql_mutex_unlock(&LOCK_gdl);
> + (void) thd->binlog_query(THD::STMT_QUERY_TYPE, ddl_drop_query.ptr(),
> + ddl_drop_query.length(), TRUE, FALSE,
> + FALSE, 0);
> + mysql_mutex_lock(&LOCK_gdl);
> + }
> + break;
> + }
> default:
> DBUG_ASSERT(0);
> break;
> @@ -1682,6 +1823,7 @@ void ddl_log_release_entries(DDL_LOG_STATE *ddl_log_state)
> next= log_entry->next_active_log_entry;
> ddl_log_release_memory_entry(log_entry);
> }
> + ddl_log_state->list= 0;
may be you should also clear ddl_log_state->execute_entry below?
>
> if (ddl_log_state->execute_entry)
> ddl_log_release_memory_entry(ddl_log_state->execute_entry);
> @@ -1776,6 +1918,33 @@ bool ddl_log_update_xid(DDL_LOG_STATE *state, ulonglong xid)
> }
>
>
> +/*
> + Write ddl_log_entry and write or update ddl_execute_entry
> +*/
> +
> +static bool ddl_log_write(DDL_LOG_STATE *ddl_state,
> + DDL_LOG_ENTRY *ddl_log_entry)
could you use a more descriptive name, please?
> +{
> + int error;
> + DDL_LOG_MEMORY_ENTRY *log_entry;
> + DBUG_ENTER("ddl_log_write");
> +
> + mysql_mutex_lock(&LOCK_gdl);
> + error= ((ddl_log_write_entry(ddl_log_entry, &log_entry)) ||
> + ddl_log_write_execute_entry(log_entry->entry_pos,
> + &ddl_state->execute_entry));
> + mysql_mutex_unlock(&LOCK_gdl);
> + if (error)
> + {
> + if (log_entry)
> + ddl_log_release_memory_entry(log_entry);
> + DBUG_RETURN(1);
> + }
> + add_log_entry(ddl_state, log_entry);
> + DBUG_RETURN(0);
> +}
> +
> +
> /**
> Logging of rename table
> */
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
2
1
Re: [Maria-developers] 2be9b69f4ff: MDEV-23001 Precreate static Item_bool() to simplify code
by Sergei Golubchik 14 May '21
by Sergei Golubchik 14 May '21
14 May '21
Hi, Monty!
On Mar 25, Michael Widenius wrote:
> revision-id: 2be9b69f4ff (mariadb-10.5.2-510-g2be9b69f4ff)
> parent(s): cb545f11169
> author: Michael Widenius <michael.widenius(a)gmail.com>
> committer: Michael Widenius <michael.widenius(a)gmail.com>
> timestamp: 2021-03-24 14:05:15 +0200
> message:
>
> MDEV-23001 Precreate static Item_bool() to simplify code
>
> The following changes where done:
> - Create global Item: Item_false and Item_true
> - Replace all creation if 'FALSE' and 'TRUE' top level items used for
> WHERE/HAVING/ON clauses to use Item_false and Item_true.
>
> The benefit are:
> - No test needed if we where able to create the new item.
> - Fixed possible errors if 'new' would have failed for the Item_bool's
agree
> - Less and faster code
I don't quite agree with that, it's not less code:
> 7 files changed, 65 insertions(+), 43 deletions(-)
and I doubt that speedup can be measured in any benchmarks whatsoever.
> diff --git a/sql/item.cc b/sql/item.cc
> index 1a86b8b3114..0f160fa257b 100644
> --- a/sql/item.cc
> +++ b/sql/item.cc
> @@ -55,7 +55,8 @@ const char *item_empty_name="";
> const char *item_used_name= "\0";
>
> static int save_field_in_field(Field *, bool *, Field *, bool);
> -
> +const Item_bool_static Item_false("FALSE", 0);
> +const Item_bool_static Item_true("TRUE", 1);
The main question - we have tons of modifiable Item members. How do you
know none of them will modified in your Item_bool_static objects?
> @@ -421,6 +421,9 @@ Item::Item(THD *thd):
> /* Initially this item is not attached to any JOIN_TAB. */
> join_tab_idx= MAX_TABLES;
>
> + /* thd is NULL in case of static items like Item_true */
> + if (!thd)
> + return;
No, please, don't.
thd can be NULL only twice, for statically initialized Item_true and
Item_false. For that you removed a useful assert and added an if() that
is completely unnecessary for billions of items created runtime.
Better to have special contructor just for these two very special items.
> /* Put item in free list so that we can free all items at end */
> next= thd->free_list;
> thd->free_list= this;
> @@ -3637,6 +3641,19 @@ void Item_int::print(String *str, enum_query_type query_type)
> }
>
>
> +/*
> + This function is needed to ensure that Item_bool_static doesn't change
> + the value of the member str_value.
> +*/
> +
> +void Item_bool::print(String *str, enum_query_type query_type)
> +{
> + // my_charset_bin is good enough for numbers
> + String tmp(value ? (char*) "1" : (char*) "0" , 1, &my_charset_bin);
> + str->append(tmp);
> +}
This is not needed anymore. Item_bool inherits from Item_int,
and Item_int::print is
void Item_int::print(String *str, enum_query_type query_type)
{
StringBuffer<LONGLONG_BUFFER_SIZE> buf;
// my_charset_bin is good enough for numbers
buf.set_int(value, unsigned_flag, &my_charset_bin);
str->append(buf);
}
You removed str_value from Item_int::print in July.
> Item *Item_bool::neg_transformer(THD *thd)
> {
> value= !value;
> diff --git a/sql/item.h b/sql/item.h
> index a1c288ab1f0..1087c08869e 100644
> --- a/sql/item.h
> +++ b/sql/item.h
> @@ -1976,7 +1976,7 @@ class Item: public Value_source,
> virtual bool limit_index_condition_pushdown_processor(void *arg) { return 0; }
> virtual bool exists2in_processor(void *arg) { return 0; }
> virtual bool find_selective_predicates_list_processor(void *arg) { return 0; }
> - bool cleanup_is_expensive_cache_processor(void *arg)
> + virtual bool cleanup_is_expensive_cache_processor(void *arg)
You've added a new virtual method? First you try so hard to remove
virtual methods, then you add a new one specifically for these two
static items. Why does it matter what
cleanup_is_expensive_cache_processor will do with them? You overwrite
is_expensive() anyway.
> {
> is_expensive_cache= (int8)(-1);
> return 0;
> @@ -3220,9 +3220,11 @@ class Item_literal: public Item_basic_constant
> Item_literal(THD *thd): Item_basic_constant(thd)
> { }
> Type type() const override { return CONST_ITEM; }
> - bool check_partition_func_processor(void *) override { return false;}
> + bool check_partition_func_processor(void *int_arg) override { return false;}
> bool const_item() const override { return true; }
> bool basic_const_item() const override { return true; }
> + bool is_expensive() override { return false; }
> + bool cleanup_is_expensive_cache_processor(void *arg) override { return 0; }
> };
>
>
> diff --git a/sql/opt_index_cond_pushdown.cc b/sql/opt_index_cond_pushdown.cc
> index 15bc2074e1f..e950ad1b7ca 100644
> --- a/sql/opt_index_cond_pushdown.cc
> +++ b/sql/opt_index_cond_pushdown.cc
> @@ -185,8 +185,8 @@ bool uses_index_fields_only(Item *item, TABLE *tbl, uint keyno,
> static Item *make_cond_for_index(THD *thd, Item *cond, TABLE *table, uint keyno,
> bool other_tbls_ok)
> {
> - if (!cond)
> - return NULL;
> + if (!cond || cond->basic_const_item())
> + return cond;
The effect of this change is that you don't set
cond->marker= ICP_COND_USES_INDEX_ONLY for basic_const_item's anymore.
This looked suspicious at first. But then I noticed (and Sergey Petrunia
confirmed) that ICP_COND_USES_INDEX_ONLY is not used at all. Since 5.3.
So here I'd suggest to revert this your change and completely remove any
ICP_COND_USES_INDEX_ONLY manipulations together with the
ICP_COND_USES_INDEX_ONLY definition.
> if (cond->type() == Item::COND_ITEM)
> {
> uint n_marked= 0;
> diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
> index 119c7360f07..e63e120a02b 100644
> --- a/sql/sql_parse.cc
> +++ b/sql/sql_parse.cc
> @@ -2395,6 +2395,7 @@ dispatch_command_return dispatch_command(enum enum_server_command command, THD *
> thd->update_server_status();
> thd->protocol->end_statement();
> query_cache_end_of_result(thd);
> + /* Check that the query didn't change const items */
what does that mean? Is it a TODO comment? Or it documents what happens
below?
> }
> if (drop_more_results)
> thd->server_status&= ~SERVER_MORE_RESULTS_EXISTS;
> diff --git a/sql/sql_select.cc b/sql/sql_select.cc
> index 93f5d3591ed..894d511adf4 100644
> --- a/sql/sql_select.cc
> +++ b/sql/sql_select.cc
> @@ -2364,7 +2364,7 @@ int JOIN::optimize_stage2()
> if (!conds && outer_join)
> {
> /* Handle the case where we have an OUTER JOIN without a WHERE */
> - conds= new (thd->mem_root) Item_bool(thd, true); // Always true
> + conds= (Item*) &Item_true;
Hmm, does casting const away even work in the Intel compiler?
> }
>
> if (impossible_where)
> @@ -22743,6 +22743,8 @@ make_cond_for_table_from_pred(THD *thd, Item *root_cond, Item *cond,
> return new_cond;
> }
> }
> + else if (cond->basic_const_item())
> + return cond;
>
> if (is_top_and_level && used_table == rand_table_bit &&
> (cond->used_tables() & ~OUTER_REF_TABLE_BIT) != rand_table_bit)
this is rather fragile, isn't it?
You skip item->set_join_tab_idx(join_tab_idx_arg),
quite understandably.
And currently, as far as I can see, item->get_join_tab_idx()
is always used under if (!item->const_item()), so what you did is safe.
But can it change in the future? Or is join_tab_idx something that
conceptually doesn't make sense for const items?
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
2
2
Re: [Maria-developers] 9083627c50f: Make rename atomic/repeatable in MyISAM and Aria
by Sergei Golubchik 14 May '21
by Sergei Golubchik 14 May '21
14 May '21
Hi, Michael!
On May 04, Michael Widenius wrote:
> revision-id: 9083627c50f (mariadb-10.5.2-562-g9083627c50f)
> parent(s): 105cda3b1cb
> author: Michael Widenius <michael.widenius(a)gmail.com>
> committer: Michael Widenius <michael.widenius(a)gmail.com>
> timestamp: 2021-03-25 12:06:34 +0200
> message:
>
> Make rename atomic/repeatable in MyISAM and Aria
>
> This is required to make Atomic RENAME TABLE work for these engines
>
> The requirement is that if we have a server crash in the middle of a
> storage engine rename call, the upcoming ddl log recovery should be able
> to finalize it by re-execute the rename.
>
> diff --git a/storage/maria/ma_rename.c b/storage/maria/ma_rename.c
> index a4388596f6b..64dac32a51c 100644
> --- a/storage/maria/ma_rename.c
> +++ b/storage/maria/ma_rename.c
> @@ -106,28 +106,37 @@ int maria_rename(const char *old_name, const char *new_name)
> _ma_reset_state(info);
> maria_close(info);
>
> + /*
> + This code is written so that it should be possible to re-run a
> + failed rename (even if there is a server crash in between the
> + renames) and complete it.
> + */
> fn_format(from,old_name,"",MARIA_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> fn_format(to,new_name,"",MARIA_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> if (mysql_file_rename_with_symlink(key_file_kfile, from, to,
> MYF(MY_WME | sync_dir)))
> - DBUG_RETURN(my_errno);
> + index_file_rename_error= my_errno;
Shouldn't you continue only if index_file_rename_error == ENOENT ?
also, I only very quickly looked at mysql_file_rename_with_symlink,
but it didn't look atomic to me when symlinks are present.
> fn_format(from,old_name,"",MARIA_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> fn_format(to,new_name,"",MARIA_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
...
> diff --git a/storage/myisam/mi_rename.c b/storage/myisam/mi_rename.c
> index 19df2e54213..13cf8c60897 100644
> --- a/storage/myisam/mi_rename.c
> +++ b/storage/myisam/mi_rename.c
> @@ -22,6 +22,7 @@
> int mi_rename(const char *old_name, const char *new_name)
> {
> char from[FN_REFLEN],to[FN_REFLEN];
> + int save_errno= 0;
> DBUG_ENTER("mi_rename");
>
> #ifdef EXTRA_DEBUG
> @@ -32,10 +33,13 @@ int mi_rename(const char *old_name, const char *new_name)
> fn_format(from,old_name,"",MI_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> fn_format(to,new_name,"",MI_NAME_IEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> if (mysql_file_rename_with_symlink(mi_key_file_kfile, from, to, MYF(MY_WME)))
> - DBUG_RETURN(my_errno);
> + save_errno= my_errno;
> fn_format(from,old_name,"",MI_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> fn_format(to,new_name,"",MI_NAME_DEXT,MY_UNPACK_FILENAME|MY_APPEND_EXT);
> - DBUG_RETURN(mysql_file_rename_with_symlink(mi_key_file_dfile,
> - from, to,
> - MYF(MY_WME)) ? my_errno : 0);
> + if (mysql_file_rename_with_symlink(mi_key_file_dfile,
> + from, to,
> + MYF(MY_WME)))
> + if (save_errno)
> + save_errno= my_errno;
> + DBUG_RETURN(save_errno);
Why mi_rename isn't trying to undo a half-failed rename like ma_rename does?
Regards,
Sergei
VP of MariaDB Server Engineering
and security(a)mariadb.org
2
1