[Maria-developers] slave_ddl_exec_mode and incompatible change in MariaDB 10.0.8
Hi, I've discovered a change in MariaDB 10.0.8 that I don't quite understand motivation for and which looks really surprising to me. So I would appreciate if you could tell me what's the motivation and why you think it's appropriate to do that. The change in question is http://bazaar.launchpad.net/~maria-captains/maria/10.0/revision/3961.1.7. I see several issues with it: 1. It introduces a pretty significant variable slave_ddl_exec_mode which wasn't announced in the Release Notes. 2. The default variable value was set to IDEMPOTENT which changes MariaDB slave behavior in an incompatible way. Not everyone will want to have this variable set to IDEMPOTENT (I'd actually argue that no one should set it to IDEMPOTENT), but without any mentioning of this variable in the Release Notes people basically have the only chance to get knowledge about this change (let me stress out: _incompatible_ change) from production incidents or from accidental code inspection. 3. When slave_ddl_exec_mode is set to IDEMPOTENT every "DROP TABLE" event in master's binlog is converted into "DROP TABLE IF EXISTS" in slave's binlog, which I believe is a major no-no for replication. Binlogs on master and slave should be identical, otherwise after failover even if one would set slave_ddl_exec_mode = STRICT it will still execute "DROP TABLE IF EXISTS" because new master has that in binlogs. 4. Why this change introduces OPTION_GTID_BEGIN in the code? Why OPTION_BEGIN wasn't enough? From reading the code the only thing I understand is that with broken binlogs and sequence of events "GTID BEGIN; CREATE TABLE; CREATE TABLE" MariaDB will try to execute both CREATE TABLE statements in the same transaction without auto-committing at the end. This sounds pretty crazy to me. And given the fact that in normal situation such sequence of events on the master is not possible it seems that existence of both OPTION_BEGIN and OPTION_GTID_BEGIN only introduces confusion (MariaDB have GTID always on, right? So it should always execute GTID_BEGIN and never BEGIN without GTID, right?). So what is your word on this? Can you at least set default value for slave_ddl_exec_mode to STRICT so that slave behavior didn't change with the version upgrade? Thank you, Pavel
And now I found that this change is actually buggy. It turns out that when slave executes a standalone CREATE TABLE event now it will set OPTION_BEGIN flag in thd->variables.option_bits and won't reset it. I don't know whether slave keeps transaction actually not committed and/or whether it doesn't clean up some other transaction data, but execution of the next event will always think there is a transaction open and it needs to be auto-committed. But that also means that this state cannot be distinguished from the case when slave received BEGIN event, but didn't receive COMMIT event, i.e. either binlog on master is corrupted or slave somehow skipped some events. Would MariaDB consider this as a serious problem? On Fri, Feb 21, 2014 at 12:30 AM, Pavel Ivanov <pivanof@google.com> wrote:
Hi,
I've discovered a change in MariaDB 10.0.8 that I don't quite understand motivation for and which looks really surprising to me. So I would appreciate if you could tell me what's the motivation and why you think it's appropriate to do that.
The change in question is http://bazaar.launchpad.net/~maria-captains/maria/10.0/revision/3961.1.7. I see several issues with it:
1. It introduces a pretty significant variable slave_ddl_exec_mode which wasn't announced in the Release Notes.
2. The default variable value was set to IDEMPOTENT which changes MariaDB slave behavior in an incompatible way. Not everyone will want to have this variable set to IDEMPOTENT (I'd actually argue that no one should set it to IDEMPOTENT), but without any mentioning of this variable in the Release Notes people basically have the only chance to get knowledge about this change (let me stress out: _incompatible_ change) from production incidents or from accidental code inspection.
3. When slave_ddl_exec_mode is set to IDEMPOTENT every "DROP TABLE" event in master's binlog is converted into "DROP TABLE IF EXISTS" in slave's binlog, which I believe is a major no-no for replication. Binlogs on master and slave should be identical, otherwise after failover even if one would set slave_ddl_exec_mode = STRICT it will still execute "DROP TABLE IF EXISTS" because new master has that in binlogs.
4. Why this change introduces OPTION_GTID_BEGIN in the code? Why OPTION_BEGIN wasn't enough? From reading the code the only thing I understand is that with broken binlogs and sequence of events "GTID BEGIN; CREATE TABLE; CREATE TABLE" MariaDB will try to execute both CREATE TABLE statements in the same transaction without auto-committing at the end. This sounds pretty crazy to me. And given the fact that in normal situation such sequence of events on the master is not possible it seems that existence of both OPTION_BEGIN and OPTION_GTID_BEGIN only introduces confusion (MariaDB have GTID always on, right? So it should always execute GTID_BEGIN and never BEGIN without GTID, right?).
So what is your word on this? Can you at least set default value for slave_ddl_exec_mode to STRICT so that slave behavior didn't change with the version upgrade?
Thank you, Pavel
Hi!
"Pavel" == Pavel Ivanov <pivanof@google.com> writes:
Pavel> And now I found that this change is actually buggy. It turns out that Pavel> when slave executes a standalone CREATE TABLE event now it will set Pavel> OPTION_BEGIN flag in thd->variables.option_bits and won't reset it. I Pavel> don't know whether slave keeps transaction actually not committed Pavel> and/or whether it doesn't clean up some other transaction data, but Pavel> execution of the next event will always think there is a transaction Pavel> open and it needs to be auto-committed. I checked my patch, but I could not find any cases where I had added setting OPTION_BEGIN, except in connection with OPTION_GTID_BEGIN. OPTION_GTID_BEGIN is only set when we *know* that there will be a COMMIT event following in the log. I also try to verfiy this by running a test that does this on the master: "create table t2 (a int) engine=myisam" I added a breakpoint for the slave in "mysql_create_table" Neiter when the function was entered or exited was the OPTION_BEGIN flag set. Can you give me an example of where things goes wrong, preferably with an extract from the binary log that shows what is actually logged. For example, here is how a normal create table is logged. (From suite/rpl/r/create_or_replace_row.result) slave-bin.000001 # Gtid # # GTID #-#-# slave-bin.000001 # Query # # use `test`; create table t2 (a int) engine=myisam slave-bin.000001 # Gtid # # BEGIN GTID #-#-# The GTID above should not set OPTON_BEGIN or OPTION_GTID_BEGIN on the slave. However a CREATE ... SELECT will look like: master-bin.000001 # Gtid # # BEGIN GTID #-#-# master-bin.000001 # Query # # use `test`; CREATE TABLE `t1` ( `f1` int(1) NOT NULL DEFAULT '0' ) master-bin.000001 # Table_map # # table_id: # (tes t.t1) master-bin.000001 # Write_rows_v1 # # table_id: # flag s: STMT_END_F master-bin.000001 # Query # # COMMIT The above will set the OPTION_BEGIN and OPTION_GTID_BEGIN for the CREATE STATEMENT and this will be reset by the COMMIT (that is guaranteed to follow). Pavel> But that also means that this Pavel> state cannot be distinguished from the case when slave received BEGIN Pavel> event, but didn't receive COMMIT event, i.e. either binlog on master Pavel> is corrupted or slave somehow skipped some events. - Corrupted binary logs should not be a concern. In this case the binary log can contain anything, including wrong DROP DATABASE commands that could do anything. - If the master fails, the slave will notice this because it finds a 'binlog start event', which will reset the BEGIN bits. - In other words, there will always be a COMMIT event (either explicit or implicite, like with a binlog start event) - The slave can only skip events with slave_skip_counter, but in this case it will not be in BEGIN mode. During slave_skip_counter COMMIT events will be noticed and the bit will be reset. How can the binlog be corrupted? How do you expect the master to handle corruption? Why is CREATE TABLE a special case you are concerned about, compared to other things like DELETE FROM TABLE in row based replication? (DELETE FROM expect a BEGIN, table_id, many delete-row-events, COMMIT). Pavel> Would MariaDB consider this as a serious problem? Please show me a test case first so that I can understand the problem. Regards, Monty
Here is a reproduction test case. I took the vanilla tarball of 10.0.8, applied to it the following patch: @@ -131,6 +131,11 @@ bool trans_begin(THD *thd, uint flags) DBUG_ASSERT(!thd->locked_tables_mode); +#ifdef HAVE_REPLICATION + if (thd->slave_thread && (thd->variables.option_bits & OPTION_BEGIN)) + abort(); +#endif + if (thd->in_multi_stmt_transaction_mode() || (thd->variables.option_bits & OPTION_TABLE_LOCK)) { Then I compiled and ran the following test: --source include/master-slave.inc connection master; create table t (n int); insert into t values (1); show binlog events; sync_slave_with_master; That test had this output: include/master-slave.inc [connection master] create table t (n int); insert into t values (1); show binlog events; Log_name Pos Event_type Server_id End_log_pos Info master-bin.000001 4 Format_desc 1 248 Server ver: 10.0.8-MariaDB-debug-log, Binlog ver: 4 master-bin.000001 248 Gtid_list 1 273 [] master-bin.000001 273 Binlog_checkpoint 1 313 master-bin.000001 master-bin.000001 313 Gtid 1 351 GTID 0-1-1 master-bin.000001 351 Query 1 436 use `test`; create table t (n int) master-bin.000001 436 Gtid 1 474 BEGIN GTID 0-1-2 master-bin.000001 474 Query 1 561 use `test`; insert into t values (1) master-bin.000001 561 Query 1 630 COMMIT And then it said that slave died with the stack trace sql/transaction.cc:139(trans_begin(THD*, unsigned int))[0x788e20] sql/log_event.cc:6478(Gtid_log_event::do_apply_event(rpl_group_info*))[0x93a685] sql/log_event.h:1341(Log_event::apply_event(rpl_group_info*))[0x5ca108] sql/slave.cc:3191(apply_event_and_update_pos(Log_event*, THD*, rpl_group_info*, rpl_parallel_thread*))[0x5c0da8] sql/slave.cc:3464(exec_relay_log_event)[0x5c1498] sql/slave.cc:4516(handle_slave_sql)[0x5c44e9] Which means that slave tries to execute BEGIN event while OPTION_BEGIN is set which shouldn't ever happen. And to answer all of your other questions, our main concern is simple: master and slave should always have absolutely the same database contents, absolutely the same tables and absolutely the same data in those tables. Any difference in those can be created only by humans and must be resolved only by humans. Absolutely no magic please, it's unacceptable, whenever inconsistency is detected replication must stop and wait for human intervention. It's not enough to have the same data eventually. And if any DBA requests a different behavior he doesn't understand what kind of troubles waits him in the future. As a consequence to that slave shouldn't execute any implicit commits, because it's impossible to generate binlogs on master that will require implicit commits. Another consequence is CREATE TABLE statement should never automatically delete the table if it already exists. Who knows how the existing table was created and how important the data that is stored in it? Definitely not MariaDB. These questions should be answered by human and human should decide whether it's ok to delete existing table. Again for the same reason DROP TABLE should never be silently ignored if the table doesn't exist -- who knows what happened and why it doesn't exist when it did exist on master? That should be investigated by human. Of course world is not perfect. If slave can crash in the middle of CREATE TABLE and not rollback the table creation on restart, that's a problem. But MariaDB should not assume that if table is exists then it's there because of a crash, there could be other reasons. If slave can crash while executing DROP TABLE and not rollback that on restart, that's a problem too, but again it must be resolved by human (or by code that does a proper rollback). And as you rightfully noted temp tables can behave weirdly with replication, that's why we have code to prohibit creation of temp tables on masters. CREATE IF NOT EXISTS can result in different data on master and slave, that's why we prohibit execution of such statements (as well as DROP IF EXISTS). And for any other feature that may misbehave in replication we will put some blocks in place to avoid any breakage. So that's our main concern and our main expectation of how MariaDB should behave. And we would really appreciate if that behavior didn't silently change to break the "no magic by default" expectation. Pavel On Tue, Feb 25, 2014 at 5:06 PM, Michael Widenius <monty@askmonty.org> wrote:
Hi!
"Pavel" == Pavel Ivanov <pivanof@google.com> writes:
Pavel> And now I found that this change is actually buggy. It turns out that Pavel> when slave executes a standalone CREATE TABLE event now it will set Pavel> OPTION_BEGIN flag in thd->variables.option_bits and won't reset it. I Pavel> don't know whether slave keeps transaction actually not committed Pavel> and/or whether it doesn't clean up some other transaction data, but Pavel> execution of the next event will always think there is a transaction Pavel> open and it needs to be auto-committed.
I checked my patch, but I could not find any cases where I had added setting OPTION_BEGIN, except in connection with OPTION_GTID_BEGIN. OPTION_GTID_BEGIN is only set when we *know* that there will be a COMMIT event following in the log.
I also try to verfiy this by running a test that does this on the master:
"create table t2 (a int) engine=myisam"
I added a breakpoint for the slave in "mysql_create_table"
Neiter when the function was entered or exited was the OPTION_BEGIN flag set.
Can you give me an example of where things goes wrong, preferably with an extract from the binary log that shows what is actually logged.
For example, here is how a normal create table is logged. (From suite/rpl/r/create_or_replace_row.result)
slave-bin.000001 # Gtid # # GTID #-#-# slave-bin.000001 # Query # # use `test`; create table t2 (a int) engine=myisam slave-bin.000001 # Gtid # # BEGIN GTID #-#-#
The GTID above should not set OPTON_BEGIN or OPTION_GTID_BEGIN on the slave.
However a CREATE ... SELECT will look like:
master-bin.000001 # Gtid # # BEGIN GTID #-#-# master-bin.000001 # Query # # use `test`; CREATE TABLE `t1` ( `f1` int(1) NOT NULL DEFAULT '0' ) master-bin.000001 # Table_map # # table_id: # (tes t.t1) master-bin.000001 # Write_rows_v1 # # table_id: # flag s: STMT_END_F master-bin.000001 # Query # # COMMIT
The above will set the OPTION_BEGIN and OPTION_GTID_BEGIN for the CREATE STATEMENT and this will be reset by the COMMIT (that is guaranteed to follow).
Pavel> But that also means that this Pavel> state cannot be distinguished from the case when slave received BEGIN Pavel> event, but didn't receive COMMIT event, i.e. either binlog on master Pavel> is corrupted or slave somehow skipped some events.
- Corrupted binary logs should not be a concern. In this case the binary log can contain anything, including wrong DROP DATABASE commands that could do anything. - If the master fails, the slave will notice this because it finds a 'binlog start event', which will reset the BEGIN bits. - In other words, there will always be a COMMIT event (either explicit or implicite, like with a binlog start event) - The slave can only skip events with slave_skip_counter, but in this case it will not be in BEGIN mode. During slave_skip_counter COMMIT events will be noticed and the bit will be reset.
How can the binlog be corrupted? How do you expect the master to handle corruption? Why is CREATE TABLE a special case you are concerned about, compared to other things like DELETE FROM TABLE in row based replication? (DELETE FROM expect a BEGIN, table_id, many delete-row-events, COMMIT).
Pavel> Would MariaDB consider this as a serious problem?
Please show me a test case first so that I can understand the problem.
Regards, Monty
Hi!
"Pavel" == Pavel Ivanov <pivanof@google.com> writes:
Pavel> Here is a reproduction test case. I took the vanilla tarball of Pavel> 10.0.8, applied to it the following patch: Pavel> @@ -131,6 +131,11 @@ bool trans_begin(THD *thd, uint flags) Pavel> DBUG_ASSERT(!thd->locked_tables_mode); Pavel> +#ifdef HAVE_REPLICATION Pavel> + if (thd->slave_thread && (thd->variables.option_bits & OPTION_BEGIN)) Pavel> + abort(); Pavel> +#endif Pavel> + Pavel> if (thd->in_multi_stmt_transaction_mode() || Pavel> (thd->variables.option_bits & OPTION_TABLE_LOCK)) Pavel> { Pavel> Then I compiled and ran the following test: Pavel> --source include/master-slave.inc Pavel> connection master; Pavel> create table t (n int); Pavel> insert into t values (1); Pavel> show binlog events; Pavel> sync_slave_with_master; Pavel> That test had this output: Pavel> include/master-slave.inc Pavel> [connection master] Pavel> create table t (n int); Pavel> insert into t values (1); Pavel> show binlog events; Pavel> Log_name Pos Event_type Server_id End_log_pos Info Pavel> master-bin.000001 4 Format_desc 1 248 Server ver: Pavel> 10.0.8-MariaDB-debug-log, Binlog ver: 4 Pavel> master-bin.000001 248 Gtid_list 1 273 [] Pavel> master-bin.000001 273 Binlog_checkpoint 1 313 master-bin.000001 Pavel> master-bin.000001 313 Gtid 1 351 GTID 0-1-1 Pavel> master-bin.000001 351 Query 1 436 use `test`; create table t (n int) Pavel> master-bin.000001 436 Gtid 1 474 BEGIN GTID 0-1-2 Pavel> master-bin.000001 474 Query 1 561 use `test`; insert into t values (1) Pavel> master-bin.000001 561 Query 1 630 COMMIT Pavel> And then it said that slave died with the stack trace Pavel> sql/transaction.cc:139(trans_begin(THD*, unsigned int))[0x788e20] Pavel> sql/log_event.cc:6478(Gtid_log_event::do_apply_event(rpl_group_info*))[0x93a685] Pavel> sql/log_event.h:1341(Log_event::apply_event(rpl_group_info*))[0x5ca108] Pavel> sql/slave.cc:3191(apply_event_and_update_pos(Log_event*, THD*, Pavel> rpl_group_info*, rpl_parallel_thread*))[0x5c0da8] Pavel> sql/slave.cc:3464(exec_relay_log_event)[0x5c1498] Pavel> sql/slave.cc:4516(handle_slave_sql)[0x5c44e9] Pavel> Which means that slave tries to execute BEGIN event while OPTION_BEGIN Pavel> is set which shouldn't ever happen. The assert you put in the code doesn't show that anything is wrong. The reason is the following code in log_event.cc: thd->variables.option_bits|= OPTION_BEGIN | OPTION_GTID_BEGIN; DBUG_PRINT("info", ("Set OPTION_GTID_BEGIN")); trans_begin(thd, 0); In other words, we do set OPTION_BEGIN just before calling trans_begin(), so the assert is wrong. Pavel> And to answer all of your other questions, our main concern is simple: Pavel> master and slave should always have absolutely the same database Pavel> contents, absolutely the same tables and absolutely the same data in Pavel> those tables. Correct. Pavel> Any difference in those can be created only by humans Pavel> and must be resolved only by humans. Absolutely no magic please, it's Pavel> unacceptable, whenever inconsistency is detected replication must stop Pavel> and wait for human intervention. It's not enough to have the same data Pavel> eventually. And if any DBA requests a different behavior he doesn't Pavel> understand what kind of troubles waits him in the future. The problem you are not taking into account for is that while executing CREATE TABLE or DROP TABLE, the slave is inconsistent until we write things to the binary log and there was no before no automatic way to resolve this without user intervention. The new code I added will not make the slave less reliable in any relevant scenario I can think off. If there is an inconsistency between slave and master, it will be detected when data is applied. Pavel> As a consequence to that slave shouldn't execute any implicit commits, Pavel> because it's impossible to generate binlogs on master that will Pavel> require implicit commits. The original code that only tested OPTION_BEGIN did generate implicit commits in a lot of cases. That is why I introduced OPTION_GTID_BEGIN, just to avoid implicit commits. Pavel> Another consequence is CREATE TABLE Pavel> statement should never automatically delete the table if it already Pavel> exists. Who knows how the existing table was created and how important Pavel> the data that is stored in it? This scenario doesn't really matter for a slave. One should never create a table on the slave that does not exist on the master. The scenarios when this makes sense is very very rare and in this case on can set slave_ddl_exec_mode to IGNORE to handle this. What we do know when we see a CREATE TABLE in the binary log is that the table did not exist on the master and we should create such an empty table slave. The most likely scenario for this to happen is that the slave died while a CREATE TABLE was executed and before the binary log was written. The new code is done exactly to cope with this scenario. Pavel> Definitely not MariaDB. These questions Pavel> should be answered by human and human should decide whether it's ok to Pavel> delete existing table. Again for the same reason DROP TABLE should Pavel> never be silently ignored if the table doesn't exist -- who knows what Pavel> happened and why it doesn't exist when it did exist on master? That Pavel> should be investigated by human. The most likely reason for that table not existing is again that the slave died in the middle of a drop. The problem with the old way of handling CREATE & DROP is that if you do a lot of CREATE and DROP tables (which a lot of applications are actually doing) there is a very big change that a slave will not be able to recover from a crash. The new code is safe for all practical purposes. Any inconsistencies in data, which is not likely to occur of the new behavior with CREATE or DROP TABLE will be detected during execution of DML's. Pavel> Of course world is not perfect. If slave can crash in the middle of Pavel> CREATE TABLE and not rollback the table creation on restart, that's a Pavel> problem. But MariaDB should not assume that if table is exists then Pavel> it's there because of a crash, there could be other reasons. If slave Pavel> can crash while executing DROP TABLE and not rollback that on restart, Pavel> that's a problem too, but again it must be resolved by human (or by Pavel> code that does a proper rollback). There is no possiblity to rollback CREATE or DROP in the current code. The only possiblity we have now is to allow these DDL's to succeed. I personally think this is the safest way to go forwards, as we in this case do know that for this table we will be consistent between master and slave after the execution of the statement. This is not true for the normal slave_exec_mode, which is why this is not on by default. Pavel> And as you rightfully noted temp tables can behave weirdly with Pavel> replication, that's why we have code to prohibit creation of temp Pavel> tables on masters. This may be ok for Google, but it's not ok for most other MariaDB users. Pavel> CREATE IF NOT EXISTS can result in different data Pavel> on master and slave, that's why we prohibit execution of such Pavel> statements (as well as DROP IF EXISTS). And for any other feature that Pavel> may misbehave in replication we will put some blocks in place to avoid Pavel> any breakage. Agree that CREATE IF NOT EXIST can in theory give different data between master and slave. This is however not the case with DROP IF EXISTS. Pavel> So that's our main concern and our main expectation of how MariaDB Pavel> should behave. And we would really appreciate if that behavior didn't Pavel> silently change to break the "no magic by default" expectation. The change we did was not silently done. It was documented and blogged in several places. I added the slave_ddl_exec_mode variable to ensure that companies like yours should be able to change the behavior to the old one if they would have strange master/slave setups where it's ok that the data is different between master and slave. For normal people that needs something that will 'just work' and do not know what to do if the slave suddenly would fail to restart, the current defaults are better as behavior for CREATE or DROP will never cause inconsistencies between master and slave and will make the slave more reliable on restarts. Regards, Monty
On Wed, Feb 26, 2014 at 8:12 PM, Michael Widenius <monty@askmonty.org> wrote:
Pavel> And then it said that slave died with the stack trace
Pavel> sql/transaction.cc:139(trans_begin(THD*, unsigned int))[0x788e20] Pavel> sql/log_event.cc:6478(Gtid_log_event::do_apply_event(rpl_group_info*))[0x93a685] Pavel> sql/log_event.h:1341(Log_event::apply_event(rpl_group_info*))[0x5ca108] Pavel> sql/slave.cc:3191(apply_event_and_update_pos(Log_event*, THD*, Pavel> rpl_group_info*, rpl_parallel_thread*))[0x5c0da8] Pavel> sql/slave.cc:3464(exec_relay_log_event)[0x5c1498] Pavel> sql/slave.cc:4516(handle_slave_sql)[0x5c44e9]
Pavel> Which means that slave tries to execute BEGIN event while OPTION_BEGIN Pavel> is set which shouldn't ever happen.
The assert you put in the code doesn't show that anything is wrong.
The reason is the following code in log_event.cc:
thd->variables.option_bits|= OPTION_BEGIN | OPTION_GTID_BEGIN; DBUG_PRINT("info", ("Set OPTION_GTID_BEGIN")); trans_begin(thd, 0);
In other words, we do set OPTION_BEGIN just before calling trans_begin(), so the assert is wrong.
To me this code looks clearly wrong. You set OPTION_BEGIN just before calling trans_begin() which forces trans_begin() to kick off the commit machinery. And even though it ends up doing nothing, I don't know how trivial is the number of CPU cycles it spends on that. But why set OPTION_BEGIN if it will be set in the trans_begin() anyway? So I fixed that and made the line to look like thd->variables.option_bits|= OPTION_GTID_BEGIN; But testing that more and looking at the code I realized there's another problem in these 3 lines: why did you add the call to trans_begin() at all? Right after it mysql_parse() is called to execute "BEGIN" statement, which again kicks off the commit machinery without any necessity to do that. So FYI: I removed setting OPTION_BEGIN here and removed the call to trans_begin() and all tests passed (including my additional assert). Pavel
Hi!
"Pavel" == Pavel Ivanov <pivanof@google.com> writes:
Pavel> sql/transaction.cc:139(trans_begin(THD*, unsigned int))[0x788e20] Pavel> sql/log_event.cc:6478(Gtid_log_event::do_apply_event(rpl_group_info*))[0x93a685] Pavel> sql/log_event.h:1341(Log_event::apply_event(rpl_group_info*))[0x5ca108] Pavel> sql/slave.cc:3191(apply_event_and_update_pos(Log_event*, THD*, Pavel> rpl_group_info*, rpl_parallel_thread*))[0x5c0da8] Pavel> sql/slave.cc:3464(exec_relay_log_event)[0x5c1498] Pavel> sql/slave.cc:4516(handle_slave_sql)[0x5c44e9]
Pavel> Which means that slave tries to execute BEGIN event while OPTION_BEGIN Pavel> is set which shouldn't ever happen.
The assert you put in the code doesn't show that anything is wrong.
The reason is the following code in log_event.cc:
Pavel> On Wed, Feb 26, 2014 at 8:12 PM, Michael Widenius <monty@askmonty.org> wrote: Pavel> And then it said that slave died with the stack trace thd-> variables.option_bits|= OPTION_BEGIN | OPTION_GTID_BEGIN;
DBUG_PRINT("info", ("Set OPTION_GTID_BEGIN")); trans_begin(thd, 0);
In other words, we do set OPTION_BEGIN just before calling trans_begin(), so the assert is wrong.
Pavel> To me this code looks clearly wrong. You set OPTION_BEGIN just before Pavel> calling trans_begin() which forces trans_begin() to kick off the Pavel> commit machinery. And even though it ends up doing nothing, I don't Pavel> know how trivial is the number of CPU cycles it spends on that. But Pavel> why set OPTION_BEGIN if it will be set in the trans_begin() anyway? The reason was simply that I wanted almost all lines that was using OPTION_GTID_BEGIN to set and reset OPTION_BEGIN. This was mainly to make it trivial to ensure with 'grep' that I did not miss any cases. What I had missed was that trans_begin() did check for OPTION_BEGIN and when this was set did a few extra unnecessary tests to commit non-existing things. I agree that it should not be set here. Pavel> So I fixed that and made the line to look like thd-> variables.option_bits|= OPTION_GTID_BEGIN; Corect. Pavel> But testing that more and looking at the code I realized there's Pavel> another problem in these 3 lines: why did you add the call to Pavel> trans_begin() at all? Right after it mysql_parse() is called to Pavel> execute "BEGIN" statement, which again kicks off the commit machinery Pavel> without any necessity to do that. The original reason was to not have to call mysql_parse() at all for this case (no reason to parse something that we know what it is). Apparently I missed to remove the extra calls. I will fix that now. Thanks a lot for noticing this. Pavel> So FYI: I removed setting OPTION_BEGIN here and removed the call to Pavel> trans_begin() and all tests passed (including my additional assert). Thanks for testing this! Regards, Monty
Hi! First, very sorry that I was not able to answer this question until now. I have been traveling in Asia for 2 weeks with very limited time for internet + email. Now I am however back and hope to answer all possible questions you have on this topic.
"Pavel" == Pavel Ivanov <pivanof@google.com> writes:
Pavel> Hi, Pavel> I've discovered a change in MariaDB 10.0.8 that I don't quite Pavel> understand motivation for and which looks really surprising to me. So Pavel> I would appreciate if you could tell me what's the motivation and why Pavel> you think it's appropriate to do that. Pavel> The change in question is Pavel> http://bazaar.launchpad.net/~maria-captains/maria/10.0/revision/3961.1.7. Pavel> I see several issues with it: Pavel> 1. It introduces a pretty significant variable slave_ddl_exec_mode Pavel> which wasn't announced in the Release Notes. It looks like Daniel accidently forget to mention this in the release notes. I have now added thus to: https://mariadb.com/kb/en/mariadb-1008-release-notes/ I did however originally document this change in: https://mariadb.com/kb/en/create-table/ https://mariadb.com/kb/en/drop-table/ https://mariadb.com/kb/en/mysqld-options-full-list/ https://mariadb.com/kb/en/what-is-mariadb-100/ https://mariadb.com/kb/en/replication-and-binary-log-server-system-variables... and I wrote a very detail blog post why this change was needed: http://monty-says.blogspot.fi/2014/02/the-final-piece-of-puzzle.html Pavel> 2. The default variable value was set to IDEMPOTENT which changes Pavel> MariaDB slave behavior in an incompatible way. Not everyone will want Pavel> to have this variable set to IDEMPOTENT (I'd actually argue that no Pavel> one should set it to IDEMPOTENT), but without any mentioning of this Pavel> variable in the Release Notes people basically have the only chance to Pavel> get knowledge about this change (let me stress out: _incompatible_ Pavel> change) from production incidents or from accidental code inspection. Incompatible in which way? Can you please show me some common scenarions when the current default would not be what you want or expect? My argument is that most people would expect the following from the MariaDB replication: (Most of the following was not true before my patch and is not true for MySQL) - That row based, statement based and mixed mode replication would produce the same number of GTID. - One should be able to have different replication mode on master and slave and still get same number of GTID's. - Row based, statement based and mixed mode replication would produce the same data (was not true before my patch). - If a slave would die, it should be able restart without user intervention as long as the data on the slave and master will be exactly the same. - Statements should always be executed in the same order on the slave and master even if the slave and master used different storage engines (transactional or not). Yes, things are not 100 % compatible with how MySQL does things, but in almost all cases things are better than before. The slave will be able to handle restarts in a lot of common scenarios it could not handle before. Bcause of this I would argue that almost every single user would prefer the new mode compared to how things where before. Pavel> 3. When slave_ddl_exec_mode is set to IDEMPOTENT every "DROP TABLE" Pavel> event in master's binlog is converted into "DROP TABLE IF EXISTS" in Pavel> slave's binlog, which I believe is a major no-no for replication. Why ? In which scenario is this not desirable? Pavel> Binlogs on master and slave should be identical, otherwise after Pavel> failover even if one would set slave_ddl_exec_mode = STRICT it will Pavel> still execute "DROP TABLE IF EXISTS" because new master has that in Pavel> binlogs. The binlogs on master and slave have never been identical, especially if you are using different replication modes. Here is some strange usage: - For DROP of temporary tables IF EXISTS was added in some cases, in other cases the DROP was never written to the binary log. - CREATE IF NOT EXISTS .... LIKE other_table could create different tables based on replication mode. - CREATE ... SELECT was replicated with 1 GTID on statement based mode and 2 GTID's in row based mode. - MySQL 5.6 doesn't even allow one to use CREATE ... SELECT with GTID. The CREATE ... REPLACE patch fixes many cases that before causes data and tables to be different on master and slave. Here is some scenarios of things can go wrong and how the new mode will handle it better than the old one: 1) Repeatable DROP master: DROP TABLE t1,t2; old slave: DROP TABLE t1 crash; restart DROP TABLE t1 - Slave will stop as t1 doesn't exist anymore. New slave: DROP TABLE IF EXISTS t1 ; Drops t1 crash; restart DROP TABLE IF EXISTS t1 DROP TABLE IF EXISTS t2 End result, master and slave are consistent. 2) Repeatable CREATE master: CREATE TABLE t1 (a int); old slave: CREATE TABLE t1 (a int); - Crash before binary log is written ; restart CREATE TABLE t1 (a int); - Fail new slave: CREATE TABLE t1 (a int); - Crash before binary log is written ; restart CREATE OR REPLACE TABLE t1 (a int); This will work (or can easilbe be made to work) in most cases even if the slave fails between the .frm file being rewritten and the table is created in the storage engine. 3) Repatable CREATE ... SELECT master: CREATE TABLE t1 SELECT * from t2; old slave: CREATE TABLE t1 (a int); INSERT INTO t1 SELECT * from t2; - Crash during select ; restart Roll back insert (as part of recovery) CREATE TABLE t1 (a int); - Fail old slave: CREATE TABLE t1 (a int); INSERT INTO t1 SELECT * from t2; - Crash during select ; restart Roll back insert (as part of recovery) DROP TABLE t1; CREATE TABLE t1 (a int); INSERT INTO t1 SELECT * from t2; I don't know of any case where the new default mode would create and inconsistency between master and slave. Why do you think it's matter that DROP TABLE is replicated as DROP TABLE IF EXISTS ? When the slave sees the DROP TABLE, it knows that the table doesn't exists anymore on the master (it doesn't really matter if the table existed or not on the master, does it)? There are two cases how this can be run on the slave: - If the table existed, it will be deleted. - If the table doesn't exist, nothing will happen. In both scenarios the slave will be identical to the master (when it comes to this table). Pavel> 4. Why this change introduces OPTION_GTID_BEGIN in the code? Why Pavel> OPTION_BEGIN wasn't enough? Because the MySQL replication code doesn't honor OPTION_BEGIN. - It may do an implicit commit in many different scenarios. (On example is CREATE TABLE, but there are others) - It may reorder statements within a transaction (if some tables are transactional and others are not) - It allows us to have DDL's in a GTID transaction. When OPTION_GITD_BEGIN is set, all tables are treated as equal, in spite of table type, and no implicit commits are allowed. This help ensures that the slave and master's data will be equal after the statements where executed. (The original code didn't allow that). For example: master (where all tables are InnoDB): BEGIN INSERT into t1 values (1); INSERT into t1 select * from t2; INSERT into t2 values (2); COMMIT; slave (where t2 is MyISAM). The binlog will look like this: BEGIN INSERT into t2 values (2); INSERT into t1 select * from t2; INSERT into t1 values (1,2,3); COMMIT; This is because, by default, MySQL puts all statements that changes non transactional tables first in the binary log. Pavel> From reading the code the only thing I Pavel> understand is that with broken binlogs and sequence of events "GTID Pavel> BEGIN; CREATE TABLE; CREATE TABLE" MariaDB will try to execute both Pavel> CREATE TABLE statements in the same transaction without Pavel> auto-committing at the end. What do you mean with 'broken binlogs' ? How would you be able to create a broken binlog? If the master restarts, it will write a new 'log start' statement that will reset all 'BEGIN' flags, so there is no possible scenario, I know of, that would cause a binary log with two create statements after each other. If it's possible to get such a binary log, then you can also get similar problems with DML's. My patch had nothing to do with trying to work with broken binary logs. Pavel> This sounds pretty crazy to me. Trying to execute broken binary log's sounds very crazy to me too. By probably I don't understand what you really think is a problem. Lets look at this from a different angel: One part of the patch solves how CREATE ... SELECT is logged. Before it was logged in statement based replication: GTID BEGIN CREATE ... SELECT COMMIT and in row based replication as: GTID BEGIN CREATE COMMIT INSERT INTO ... SELECT COMMIT After my patch the second version is logged as: GTID BEGIN CREATE INSERT INTO ... SELECT COMMIT so we have the same number of GTID's in both replication modes. Is this more clear? Pavel> And given Pavel> the fact that in normal situation such sequence of events on the Pavel> master is not possible it seems that existence of both OPTION_BEGIN Pavel> and OPTION_GTID_BEGIN only introduces confusion (MariaDB have GTID Pavel> always on, right? So it should always execute GTID_BEGIN and never Pavel> BEGIN without GTID, right?). As I tried to explain above, OPTION_BEGIN doesn't make sense for a slave, that is why we need OPTION_GTID_BEGIN. And yes, a slave should never execute BEGIN without a GTID. Pavel> So what is your word on this? Can you at least set default value for Pavel> slave_ddl_exec_mode to STRICT so that slave behavior didn't change Pavel> with the version upgrade? I don't think that having a more stable slave is a behavior change. Sorry, but I really don't understand what concerns you have with the new behavior. Is it just that it was not clearly explained, or is there some scenario you are working with where it would be better to get the slave to fail than continue (even when we know that the data on master and slave will be identical). Please give me some examples of the scenarios that you are concerned about! Regards, Monty
participants (2)
-
Michael Widenius
-
Pavel Ivanov