Here is a reproduction test case. I took the vanilla tarball of 10.0.8 and applied the following patch to it:

@@ -131,6 +131,11 @@ bool trans_begin(THD *thd, uint flags)
   DBUG_ASSERT(!thd->locked_tables_mode);
 
+#ifdef HAVE_REPLICATION
+  if (thd->slave_thread && (thd->variables.option_bits & OPTION_BEGIN))
+    abort();
+#endif
+
   if (thd->in_multi_stmt_transaction_mode() ||
       (thd->variables.option_bits & OPTION_TABLE_LOCK))
   {

Then I compiled and ran the following test:

--source include/master-slave.inc

connection master;
create table t (n int);
insert into t values (1);
show binlog events;
sync_slave_with_master;

That test had this output:

include/master-slave.inc
[connection master]
create table t (n int);
insert into t values (1);
show binlog events;
Log_name Pos Event_type Server_id End_log_pos Info
master-bin.000001 4 Format_desc 1 248 Server ver: 10.0.8-MariaDB-debug-log, Binlog ver: 4
master-bin.000001 248 Gtid_list 1 273 []
master-bin.000001 273 Binlog_checkpoint 1 313 master-bin.000001
master-bin.000001 313 Gtid 1 351 GTID 0-1-1
master-bin.000001 351 Query 1 436 use `test`; create table t (n int)
master-bin.000001 436 Gtid 1 474 BEGIN GTID 0-1-2
master-bin.000001 474 Query 1 561 use `test`; insert into t values (1)
master-bin.000001 561 Query 1 630 COMMIT

And then it said that the slave died with this stack trace:

sql/transaction.cc:139(trans_begin(THD*, unsigned int))[0x788e20]
sql/log_event.cc:6478(Gtid_log_event::do_apply_event(rpl_group_info*))[0x93a685]
sql/log_event.h:1341(Log_event::apply_event(rpl_group_info*))[0x5ca108]
sql/slave.cc:3191(apply_event_and_update_pos(Log_event*, THD*, rpl_group_info*, rpl_parallel_thread*))[0x5c0da8]
sql/slave.cc:3464(exec_relay_log_event)[0x5c1498]
sql/slave.cc:4516(handle_slave_sql)[0x5c44e9]

This means that the slave tries to execute a BEGIN event while OPTION_BEGIN is already set, which should never happen.
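To make the failing sequence easier to follow, here is a toy model of it. This is only a sketch, not MariaDB code: the names kOptionBegin, Ev, first_begin_violation and reproduced_log are made up for illustration, and the ddl_leaves_begin_set switch merely stands in for the suspected behavior of a standalone CREATE TABLE leaving the bit set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: this name and value are illustrative, not MariaDB's.
constexpr std::uint64_t kOptionBegin = 1ULL << 0;

enum class Ev { GtidStandalone, GtidBegin, QueryDdl, QueryDml, Commit };

// Returns the index of the first BEGIN-type GTID event that arrives while the
// begin bit is still set, or -1 if that never happens.
// ddl_leaves_begin_set models the suspected bug (an assumption of this sketch).
int first_begin_violation(const std::vector<Ev>& log, bool ddl_leaves_begin_set) {
  std::uint64_t bits = 0;
  for (std::size_t i = 0; i < log.size(); ++i) {
    switch (log[i]) {
      case Ev::GtidBegin:
        // This is the condition the added abort() checks for.
        if (bits & kOptionBegin) return static_cast<int>(i);
        bits |= kOptionBegin;
        break;
      case Ev::QueryDdl:
        if (ddl_leaves_begin_set) bits |= kOptionBegin;
        break;
      case Ev::Commit:
        bits &= ~kOptionBegin;
        break;
      case Ev::GtidStandalone:
      case Ev::QueryDml:
        break;
    }
  }
  return -1;
}

// The five events that follow the header events in the binlog listing above.
std::vector<Ev> reproduced_log() {
  return {Ev::GtidStandalone, Ev::QueryDdl, Ev::GtidBegin, Ev::QueryDml,
          Ev::Commit};
}
```

With ddl_leaves_begin_set set to true, first_begin_violation(reproduced_log(), true) returns 2, the position of the "BEGIN GTID 0-1-2" event; with false it returns -1.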
And to answer all of your other questions, our main concern is simple: master and slave should always have absolutely the same database contents, absolutely the same tables and absolutely the same data in those tables. Any difference between them can be created only by humans and must be resolved only by humans. Absolutely no magic, please; it's unacceptable. Whenever an inconsistency is detected, replication must stop and wait for human intervention. It's not enough to have the same data eventually. And if any DBA requests different behavior, he doesn't understand what kind of trouble awaits him in the future.

As a consequence, the slave shouldn't execute any implicit commits, because it's impossible to generate binlogs on the master that would require implicit commits.

Another consequence is that a CREATE TABLE statement should never automatically delete the table if it already exists. Who knows how the existing table was created and how important the data stored in it is? Definitely not MariaDB. These questions should be answered by a human, and a human should decide whether it's OK to delete the existing table.

Again, for the same reason, DROP TABLE should never be silently ignored if the table doesn't exist: who knows what happened and why it doesn't exist when it did exist on the master? That should be investigated by a human.

Of course, the world is not perfect. If the slave can crash in the middle of CREATE TABLE and not roll back the table creation on restart, that's a problem. But MariaDB should not assume that if the table exists then it's there because of a crash; there could be other reasons. If the slave can crash while executing DROP TABLE and not roll that back on restart, that's a problem too, but again it must be resolved by a human (or by code that does a proper rollback).

And as you rightly noted, temp tables can behave weirdly with replication; that's why we have code to prohibit creation of temp tables on masters.
CREATE IF NOT EXISTS can result in different data on master and slave; that's why we prohibit execution of such statements (as well as DROP IF EXISTS). And for any other feature that may misbehave in replication, we will put some blocks in place to avoid any breakage.

So that's our main concern and our main expectation of how MariaDB should behave. And we would really appreciate it if that behavior didn't silently change in a way that breaks the "no magic by default" expectation.

Pavel

On Tue, Feb 25, 2014 at 5:06 PM, Michael Widenius <monty@askmonty.org> wrote:
Hi!
"Pavel" == Pavel Ivanov <pivanof@google.com> writes:
Pavel> And now I found that this change is actually buggy. It turns out that
Pavel> when slave executes a standalone CREATE TABLE event now it will set
Pavel> OPTION_BEGIN flag in thd->variables.option_bits and won't reset it. I
Pavel> don't know whether slave keeps transaction actually not committed
Pavel> and/or whether it doesn't clean up some other transaction data, but
Pavel> execution of the next event will always think there is a transaction
Pavel> open and it needs to be auto-committed.
I checked my patch, but I could not find any cases where I had added setting OPTION_BEGIN, except in connection with OPTION_GTID_BEGIN. OPTION_GTID_BEGIN is only set when we *know* that there will be a COMMIT event following in the log.
I also tried to verify this by running a test that does this on the master:
"create table t2 (a int) engine=myisam"
I added a breakpoint for the slave in "mysql_create_table"
The OPTION_BEGIN flag was set neither when the function was entered nor when it exited.
Can you give me an example of where things go wrong, preferably with an extract from the binary log that shows what is actually logged?
For example, here is how a normal create table is logged. (From suite/rpl/r/create_or_replace_row.result)
slave-bin.000001 # Gtid # # GTID #-#-#
slave-bin.000001 # Query # # use `test`; create table t2 (a int) engine=myisam
slave-bin.000001 # Gtid # # BEGIN GTID #-#-#

The first GTID above should not set OPTION_BEGIN or OPTION_GTID_BEGIN on the slave.
However a CREATE ... SELECT will look like:
master-bin.000001 # Gtid # # BEGIN GTID #-#-#
master-bin.000001 # Query # # use `test`; CREATE TABLE `t1` ( `f1` int(1) NOT NULL DEFAULT '0' )
master-bin.000001 # Table_map # # table_id: # (test.t1)
master-bin.000001 # Write_rows_v1 # # table_id: # flags: STMT_END_F
master-bin.000001 # Query # # COMMIT

The above will set OPTION_BEGIN and OPTION_GTID_BEGIN for the CREATE statement, and they will be reset by the COMMIT (which is guaranteed to follow).
Pavel> But that also means that this
Pavel> state cannot be distinguished from the case when slave received BEGIN
Pavel> event, but didn't receive COMMIT event, i.e. either binlog on master
Pavel> is corrupted or slave somehow skipped some events.
- Corrupted binary logs should not be a concern. In this case the
  binary log can contain anything, including wrong DROP DATABASE
  commands that could do anything.
- If the master fails, the slave will notice this because it finds a
  'binlog start event', which will reset the BEGIN bits.
- In other words, there will always be a COMMIT event (either explicit
  or implicit, like with a binlog start event).
- The slave can only skip events with slave_skip_counter, but in this
  case it will not be in BEGIN mode. During slave_skip_counter, COMMIT
  events will be noticed and the bit will be reset.
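The points in the list above can be sketched as a small state model. This is a simplified illustration, not the server's implementation: Ev and begin_bit_after are hypothetical names, a single bool stands in for the BEGIN bits, and the skip handling only mirrors the description in the list (skipped events are not applied, but a COMMIT is still noticed).

```cpp
#include <cassert>
#include <vector>

enum class Ev { BinlogStart, Begin, Row, Commit };

// Returns true if the toy "in transaction" bit is still set after the log.
// The first `skip` events are not applied, but a skipped COMMIT (or binlog
// start event) still clears the bit, as described in the list above.
bool begin_bit_after(const std::vector<Ev>& log, int skip) {
  bool in_begin = false;
  for (Ev ev : log) {
    const bool skipped = skip > 0;
    if (skipped) --skip;
    switch (ev) {
      case Ev::BinlogStart:
        in_begin = false;  // master restarted: acts as an implicit commit
        break;
      case Ev::Commit:
        in_begin = false;  // noticed even while skipping
        break;
      case Ev::Begin:
        if (!skipped) in_begin = true;  // a skipped BEGIN never sets the bit
        break;
      case Ev::Row:
        break;
    }
  }
  return in_begin;
}
```

For example, in this model the log {Begin, Row, BinlogStart} with skip 0 ends with the bit cleared, while {Begin, Row} on its own leaves it set until a later COMMIT or binlog start event arrives.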
How can the binlog be corrupted? How do you expect the master to handle corruption? Why is CREATE TABLE a special case you are concerned about, compared to other things like DELETE FROM TABLE in row-based replication? (DELETE FROM expects a BEGIN, table_id, many delete-row events, COMMIT.)
Pavel> Would MariaDB consider this as a serious problem?
Please show me a test case first so that I can understand the problem.
Regards,
Monty