Re: [Maria-developers] [Commits] Rev 3879: MDEV-6593 : domain_id based replication filters in lp:~maria-captains/maria/maria-10.0-galera
Nirbhay Choubey <nirbhay@skysql.com> writes:
------------------------------------------------------------
revno: 3879
revision-id: nirbhay@skysql.com-20140912181622-s33e7r0vtgpju95q
parent: nirbhay@skysql.com-20140814224304-rtkeal7p9lk06li1
committer: Nirbhay Choubey <nirbhay@skysql.com>
branch nick: maria-10.0-galera_6593
timestamp: Fri 2014-09-12 14:16:22 -0400
message:
  MDEV-6593 : domain_id based replication filters
Added support for IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS.
Here is my review. Some general comments, followed by some detailed comments in-line in the patch:

1. Did you prepare any documentation? You could either add it to the knowledgebase directly, or write up a text file and ask Daniel to add it in relevant places.

2. Did you run the full test suite? I would have expected at least some tests in the multi_source suite to need modification. The change touches enough things that you should test it in a feature tree in Buildbot to ensure that it is green before pushing it to a main tree.

3. I think some extra test cases are needed:

 - You should test (and document) what happens when the same domain_id is in both the DO and the IGNORE list. Is it actually ever sensible to have both DO_DOMAIN_IDS and IGNORE_DOMAIN_IDS? If not, I think it should be an error to try.

 - You should also add a test using non-GTID mode.

 - You should also add a test where parallel replication is used (--slave-parallel-threads > 0).

 - Add a test for the following scenario when using GTID mode: Some domain id is ignored, some events are received from the master and ignored. Then the slave is stopped, and the ignore rule is removed. What happens when the slave is restarted and reconnects? I think the previously ignored domain id events will now be re-fetched and replicated. I guess that's ok, but it should be documented. Also test what happens when binlog files on the master have been rotated and purged, so that the events are no longer available on the master (an error should occur).

 - Add a similar test for when not using GTID mode: the last event replicated by the slave is ignored, the slave is stopped, the ignore rule removed and the slave restarted; will the last event now be re-fetched or not (I think not)?

 - Add a test that checks that seconds_behind_master is properly updated to zero even when the last event is ignored due to DO_DOMAIN_IDS/IGNORE_DOMAIN_IDS. It can perhaps be a bit tricky to test, but I think you can use this existing test as a basis: mysql-test/suite/rpl/t/rpl_parallel2.test

 - It would also be good to check that the code works when used on an old master (which does not send GTID events). You can do this by creating a binlog file with a 5.5 server and storing it in mysql-test/std_data/, and then installing it in a master; there should be several other test cases that already do this.

4. Also see detailed comments for some possible problems with the implementation. The most serious is probably to ensure that events are not skipped after the end of the group; we need a couple of tests for this, see below. Also, there is a small memory leak, see below.
+DO_DOMAIN_IDS (before) : 1
+IGNORE_DOMAIN_IDS (before) : 2
+CHANGE MASTER TO DO_DOMAIN_IDS=(4,4,5,1,7,7,7,1,1,2,6,8,1,4,5,5,9,3), MASTER_USE_GTID=current_pos;
So if a domain_id is specified multiple times, the redundant ones will be ignored. It would seem preferable to give an error... but as far as I can see the behaviour is the same for IGNORE_SERVER_IDS, so better to be consistent as you did in the code. Just be sure to add this to the documentation.
=== added file 'mysql-test/suite/rpl/t/rpl_domain_id_filter.test'
--- a/mysql-test/suite/rpl/t/rpl_domain_id_filter.test	1970-01-01 00:00:00 +0000
+++ b/mysql-test/suite/rpl/t/rpl_domain_id_filter.test	2014-09-12 18:16:22 +0000
+SET @gtid_domain_id_saved = @@session.gtid_domain_id;
This is not needed, nor do you use it anywhere, you can remove it.
+##### Case 3: When both DO_DOMAIN_IDS and IGNORE_DOMAIN_IDS are non-empty
+
+source include/stop_slave.inc;
+let $do_domain_ids_before= query_get_value(SHOW SLAVE STATUS, Replicate_Do_Domain_Ids, 1);
+let $ignore_domain_ids_before= query_get_value(SHOW SLAVE STATUS, Replicate_Ignore_Domain_Ids, 1);
+--echo DO_DOMAIN_IDS (before) : $do_domain_ids_before
+--echo IGNORE_DOMAIN_IDS (before) : $ignore_domain_ids_before
+
+# Replicate events belonging to "domain_id 1". (DO_DOMAIN_IDS should win!)
+CHANGE MASTER TO DO_DOMAIN_IDS=(1), IGNORE_DOMAIN_IDS=(2), MASTER_USE_GTID=current_pos;
Did you intend here to test overlapping ids in DO_DOMAIN_IDS and IGNORE_DOMAIN_IDS (due to the comment "DO_DOMAIN_IDS should win") ? But if I understand correctly, IGNORE_DOMAIN_IDS will be ignored if DO_DOMAIN_IDS is non-empty. So I think it should be an error to try to set both - do you agree?
+--source include/rpl_end.inc
Please add additional testcases, as suggested at the start of review.
=== modified file 'sql/rpl_mi.cc'
--- a/sql/rpl_mi.cc	2014-02-19 10:05:15 +0000
+++ b/sql/rpl_mi.cc	2014-09-12 18:16:22 +0000
@@ -513,6 +513,25 @@
       else
         mi->using_gtid= Master_info::USE_GTID_NO;
     }
+
+    // DO_DOMAIN_IDS
+    if (0 == strncmp(buf, STRING_WITH_LEN("do_domain_ids=")) &&
+        mi->domain_id_filter.init_ids(buf + sizeof("do_domain_ids"),
+                                      Domain_id_filter::DO_DOMAIN_IDS))
+    {
+      sql_print_error("Failed to initialize master info do_domain_ids");
+      goto errwithmsg;
+    }
+
+    // IGNORE_DOMAIN_IDS
+    if (0 == strncmp(buf, STRING_WITH_LEN("ignore_domain_ids=")) &&
+        mi->domain_id_filter.init_ids(buf + sizeof("ignore_domain_ids"),
+                                      Domain_id_filter::IGNORE_DOMAIN_IDS))
+    {
+      sql_print_error("Failed to initialize master info "
+                      "ignore_domain_ids");
+      goto errwithmsg;
+    }
So here you use the same style as adjacent code (that I wrote). However, I have since learned that the style (0 == strncmp(...)) is not used in the server. So please change this, as well as my existing code, to use either strncmp(...) == 0 or !strncmp(), as you prefer.

Another thing from my old code is the sizeof("do_domain_ids"), which highly confused me for a bit before I realised that while the '=' is not included in the sizeof(), the terminating '\0' _is_ included. So since even I myself am confused by this code, better clarify it ;-) Perhaps it would be clearer if changed to (sizeof("do_domain_ids=")-1) (both in my old code and in your new code).

(Normally it's good to follow the style of existing code as you did, I just noticed this in the review and realised that my own "existing code" would benefit from these small improvements).
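As an illustration only, the clarified comparison might read roughly like this (a sketch following the suggestion above, not the committed patch; sizeof("do_domain_ids=") - 1 is the prefix length without the terminating '\0', so the value starts right after the '='):

  if (strncmp(buf, "do_domain_ids=", sizeof("do_domain_ids=") - 1) == 0 &&
      mi->domain_id_filter.init_ids(buf + sizeof("do_domain_ids=") - 1,
                                    Domain_id_filter::DO_DOMAIN_IDS))
  {
    sql_print_error("Failed to initialize master info do_domain_ids");
    goto errwithmsg;
  }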
@@ -634,6 +653,16 @@
     }
   }
+  char* do_domain_ids_buf=
+    mi->domain_id_filter.as_string(Domain_id_filter::DO_DOMAIN_IDS);
+  if (do_domain_ids_buf == NULL)
+    DBUG_RETURN(1);
+
+  char* ignore_domain_ids_buf=
+    mi->domain_id_filter.as_string(Domain_id_filter::IGNORE_DOMAIN_IDS);
+  if (ignore_domain_ids_buf == NULL)
+    DBUG_RETURN(1);
You have a memory leak here, do_domain_ids_buf will not be freed if ignore_domain_ids_buf fails due to out-of-memory.
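One way to avoid the leak, as a sketch only (assuming as_string() returns a buffer allocated with my_malloc(), so my_free() is the matching release):

  char *do_domain_ids_buf=
    mi->domain_id_filter.as_string(Domain_id_filter::DO_DOMAIN_IDS);
  if (do_domain_ids_buf == NULL)
    DBUG_RETURN(1);

  char *ignore_domain_ids_buf=
    mi->domain_id_filter.as_string(Domain_id_filter::IGNORE_DOMAIN_IDS);
  if (ignore_domain_ids_buf == NULL)
  {
    my_free(do_domain_ids_buf);   /* release the first buffer before bailing out */
    DBUG_RETURN(1);
  }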
@@ -1348,4 +1382,218 @@
   DBUG_RETURN(result);
 }
+/* ctor */
+Domain_id_filter::Domain_id_filter() : m_filter(false)
+{
+  for (int i= DO_DOMAIN_IDS; i <= IGNORE_DOMAIN_IDS; i ++)
+  {
+    my_init_dynamic_array(&m_domain_ids[i], sizeof(uint32), 16, 16, MYF(0));
+  }
+}
+
+/* dtor */
+Domain_id_filter::~Domain_id_filter()
I don't think comments like /* ctor */ and /* dtor */ are useful, better just omit them.
+/**
+  Update the domain id list of the given type with elements from the new list.
+
+  @param new_ids [IN]         new list of ids
+  @param type    [IN]         domain id list type to update
+  @param changed [IN]         has the domain id list changed in the
+                              last CHANGE MASTER command?
+
+  @retval void
+*/
+void Domain_id_filter::update_ids(DYNAMIC_ARRAY *new_ids, enum_list_type type,
+                                  bool changed)
+{
+  DYNAMIC_ARRAY *ids= &m_domain_ids[type];
+
+  if (changed == true)
+    reset_dynamic(ids);
+
+  /* bsearch requires an ordered list. */
+  sort_dynamic(new_ids, change_master_domain_id_cmp);
+
+  for (uint i= 0; i < new_ids->elements; i++)
+  {
+    uint32 domain_id;
+    get_dynamic(new_ids, (void *) &domain_id, i);
+
+    if (bsearch((const uint32 *) &domain_id, ids->buffer, ids->elements,
+                sizeof(uint32), change_master_domain_id_cmp) == NULL)
+    {
+      insert_dynamic(ids, (uint32*) &domain_id);
+    }
+  }
+  return;
+}
Ok, so I understand that this code is modelled after the existing code for IGNORED_SERVER_IDS. But I found it rather confusing, because it looks like there is a bug: if changed==false, then the code does not ensure that the resulting array will still be sorted. I guess the bug is not possible to trigger, because the list of stuff to add will only be non-empty if changed==true? But that is rather confusing. Please simplify the logic; maybe you only need to call this function when changed==true, so the parameter can be omitted and the array always reset? Also, consider if you can rewrite the existing code for IGNORE_SERVER_IDS so that the same common code can be used as for IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS, that would simplify the code and reduce risk of bugs or inconsistencies.
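A possible simplification along those lines, as a sketch only (assuming the function is then only called when the list actually changed, so the changed parameter and the bsearch() can be dropped):

  void Domain_id_filter::update_ids(DYNAMIC_ARRAY *new_ids, enum_list_type type)
  {
    DYNAMIC_ARRAY *ids= &m_domain_ids[type];
    uint32 prev= 0;

    reset_dynamic(ids);                 /* always rebuild from the new list */
    sort_dynamic(new_ids, (qsort_cmp) change_master_domain_id_cmp);

    for (uint i= 0; i < new_ids->elements; i++)
    {
      uint32 domain_id;
      get_dynamic(new_ids, (void *) &domain_id, i);
      /* new_ids is sorted, so duplicates are adjacent and easy to skip;
         ids stays sorted because elements are inserted in ascending order. */
      if (i > 0 && domain_id == prev)
        continue;
      insert_dynamic(ids, (uchar *) &domain_id);
      prev= domain_id;
    }
  }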
+/**
+  Initialize the given domain id list with the numbers from the specified
+  buffer.
+
+  @param buf  [IN]            buffer
+  @param type [IN]            domain id list type
+
+  @retval false               Success
+          true                Error
+*/
+bool Domain_id_filter::init_ids(char *buf, enum_list_type type)
+{
+  char *token, *last;
+  uint num_items;
+  DYNAMIC_ARRAY *ids= &m_domain_ids[type];
+
+  token= strtok_r(buf, " ", &last);
+  if (token == NULL)
+  {
+    return true;                        /* error */
+  }
+
+  num_items= atoi(token);
+  for (uint i= 0; i < num_items; i++)
+  {
+    token= strtok_r(NULL, " ", &last);
+    if (token == NULL)
+    {
+      return true;                      /* error */
+    }
+    else
+    {
+      uint32 val= atoi(token);
+      insert_dynamic(ids, (uchar *) &val);
+    }
+  }
+  return false;
+}
Again, did you consider using the same code for this and for IGNORE_SERVER_IDS? (I think you did, because you added a comment to init_dynarray_intvar_from_file ;-). Maybe consider it again; it is fine to rewrite/remove the existing code for IGNORE_SERVER_IDS and replace it with your new code. Again, it would reduce risk for bugs or inconsistencies.
+  /* getter */
+  bool get_filter() { return m_filter; }
+
+  /* setter */
+  void set_filter(uint32 domain_id);
These are not really getter and setters, I think, at least I find those names confusing. Maybe something like do_filter(uint32 domain_id) and is_group_filtered() ? And expand the comments, so that they explain that these functions respectively check a GTID for needing to be filtered, and return whether the current group is being filtered. Also, personally I dislike setter/getter, and would prefer just to access the m_filter as a public member. But I suppose that's style and up to you.
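For example, the interface could read roughly like this (a sketch of the suggested naming only, not the committed code; id_is_filtered() is a hypothetical helper standing in for whatever consults the DO/IGNORE lists):

  class Domain_id_filter
  {
    bool m_filter;        /* true while the current event group is being skipped */
  public:
    Domain_id_filter() : m_filter(false) {}

    /* Check the domain id of a GTID event against the DO/IGNORE lists and
       remember whether the following event group must be filtered out. */
    void do_filter(uint32 domain_id) { m_filter= id_is_filtered(domain_id); }

    /* Is the event group currently being queued filtered? */
    bool is_group_filtered() const { return m_filter; }

    /* Clear the flag once the end of the group has been seen. */
    void reset_filter() { m_filter= false; }

  private:
    bool id_is_filtered(uint32 domain_id);  /* hypothetical helper, consults the id lists */
  };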
+  /*
+    Initialize the given domain id list with the numbers from the specified
+    buffer.
+
+    @param buf  [IN]          buffer
+    @param type [IN]          domain id list type
+
+    @retval false             Success
+            true              Error
+  */
+  bool init_ids(char *buf, enum_list_type type);
Clarify in the comment that the list is initialized from a string of space-separated numbers, the first of which is the number of following entries.
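For example, based on the parsing code above, a DO list containing the domains 1, 2 and 5 would be stored as the string "3 1 2 5" - the element count followed by the ids themselves.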
@@ -182,6 +279,7 @@
                       bool flush_relay_log_cache, bool need_lock_relay_log);
 int change_master_server_id_cmp(ulong *id1, ulong *id2);
+int change_master_domain_id_cmp(uint32 *id1, uint32 *id2);
I think change_master_domain_id_cmp() is only used in rpl_mi.cc, so it should be static there, and not referred to here in rpl_mi.h.
=== modified file 'sql/slave.cc'
--- a/sql/slave.cc	2014-08-12 03:55:41 +0000
+++ b/sql/slave.cc	2014-09-12 18:16:22 +0000
@@ -1259,6 +1259,7 @@
   going to ignore (to not log them in the relay log). Items being read are
   supposed to be decimal output of values of a type shorter or equal of @c long
   and separated by the single space.
+  It also used to restore DO_DOMAIN_IDS & IGNORE_DOMAIN_IDS lists.
This comment suggests you wanted to share code, then later changed your mind. If you end up keeping the code separate, you still need to delete this comment. (And there is a typo: "It also" -> "It is also", but if you remove it anyway...).
@@ -5601,6 +5610,10 @@
       mi->last_queued_gtid= event_gtid;
       mi->last_queued_gtid_standalone=
         (gtid_flag & Gtid_log_event::FL_STANDALONE) != 0;
+
+      /* Should filter all the subsequent events in the current GTID group? */
+      mi->domain_id_filter.set_filter(event_gtid.domain_id);
So, here we will set m_filter to true, if we want to skip the following group. But I don't see any code that will set it to false at the end of the group? That seems a problem, as we could then end up wrongly skipping all kinds of other events, such as Gtid_list, Incident, ...

Also note that it is possible for a GTID group to not be terminated by COMMIT/XID. This can happen at least if the master crashes in the middle of writing a group; the slave will receive only the first part of it, followed by a master restart Format_description event. That case needs to be handled, and you should add a test case for it.

Also it is possible for eg. a master downgrade to introduce following event groups that do not start with GTID - seems these would also be wrongly skipped? So we also need a test case for this...

Hope this helps,

 - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
4. Also see detailed comments for some possible problems with the implementation. The most serious is probably to ensure that events are not skipped after the end of the group, we need a couple of tests for this, see
Hm, actually I thought of another potential problem. What happens if the slave disconnects from the master in the middle of receiving an event group? There are several tricky issues around this part of replication, we definitely need some test cases for this as well.

There are different cases, for example whether using GTID or non-GTID mode, whether the slave just reconnects, the I/O and/or SQL threads are restarted, or the entire server restarts. And if the filters are reconfigured before reconnecting.

For example, it seems to me that in non-GTID mode, if we restart the server in the middle of receiving an event group, we can easily end up with ignoring one half of the group and not the other, which is very bad? And in GTID mode, the reconnect issue is quite tricky also, we need a test case to check that everything is ok. Though in this case we always reconnect at the start of an event group, so maybe things are easier to handle.

For non-GTID mode, maybe we need to handle the ignoring in the SQL thread instead? Or alternatively, we can make ignoring events based on domain_id only legal in GTID mode, and give an error in non-GTID?

 - Kristian.
Hi!

On Mon, Sep 15, 2014 at 10:09 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
4. Also see detailed comments for some possible problems with the implementation. The most serious is probably to ensure that events are not skipped after the end of the group, we need a couple of tests for this, see
Hm, actually I thought of another potential problem.
What happens if the slave disconnects from the master in the middle of receiving an event group? There are several tricky issues around this part of replication, we definitely need some test cases for this as well.
There are different cases, for example whether using GTID or non-GTID mode, whether the slave just reconnects, the I/O and/or SQL threads are restarted, or the entire server restarts. And if the filters are reconfigured before reconnecting.
I have tried to add multiple test scenarios to cover these aspects.
For example, it seems to me that in non-GTID mode, if we restart the server in the middle of receiving an event group, we can easily end up with ignoring one half of the group and not the other, which is very bad?
The filtering would only work when slave is configured with GTID.
And in GTID mode, the reconnect issue is quite tricky also, we need a test case to check that everything is ok. Though in this case we always reconnect at the start of an event group, so maybe things are easier to handle.
For non-GTID mode, maybe we need to handle the ignoring in the SQL thread instead? Or alternatively, we can make ignoring events based on domain_id only legal in GTID mode, and give an error in non-GTID?
I think the patch follows the 2nd approach minus the error. Could you please elaborate on the error?

Best.
- Nirbhay
Hi Kristian!

Thank you for the review remarks. I have committed a patch addressing your comments.
http://lists.askmonty.org/pipermail/commits/2014-October/006775.html

See inline for my response on specific comments.

On Mon, Sep 15, 2014 at 9:25 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Nirbhay Choubey <nirbhay@skysql.com> writes:
------------------------------------------------------------
revno: 3879
revision-id: nirbhay@skysql.com-20140912181622-s33e7r0vtgpju95q
parent: nirbhay@skysql.com-20140814224304-rtkeal7p9lk06li1
committer: Nirbhay Choubey <nirbhay@skysql.com>
branch nick: maria-10.0-galera_6593
timestamp: Fri 2014-09-12 14:16:22 -0400
message:
  MDEV-6593 : domain_id based replication filters
Added support for IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS.
Here is my review.
Some general comments, followed by some detailed comments in-line in the patch:
1. Did you prepare any documentation? You could either add it to the knowledgebase directly, or write up a text file and ask Daniel to add it in relevant places.
[NC] I haven't but I will prepare it and pass it down to Daniel.
2. Did you run the full test suite? I would have expected at least some tests in the multi_source suite to need modification. The change touches enough things that you should test it in a feature tree in Buildbot to ensure that it is green before pushing it to a main tree.
[NC] You are right. I did that now and updated the result files.
3. I think some extra test cases are needed:
- You should test (and document) what happens when the same domain_id is in both the DO and the IGNORE list. Is it actually ever sensible to have both DO_DOMAIN_IDS and IGNORE_DOMAIN_IDS ? If not, I think it should be an error to try.
[NC] You are right, setting both DO_ and IGNORE_ at the same time would not make any sense. But I actually followed the pattern of the existing replication, where both _do_ and _ignore_ filters are allowed to hold values at the same time and _do_ is preferred in case both are set.
- You should also add a test using non-GTID mode.
[NC] Done.
- You should also add a test where parallel replication is used (--slave-parallel-threads > 0).
[NC] Done.
- Add a test for the following scenario when using GTID mode: Some domain id is ignored, some events are received from the master and ignored. Then the slave is stopped, and the ignore rule is removed. What happens when the slave is restarted and reconnects? I think the previously ignored domain id events will now be re-fetched and replicated. I guess that's ok, but it should be documented. Also test what happens when binlog files on the master have been rotated and purged, so that the events are no longer available on the master (an error should occur).
[NC] As the filtering takes place in the IO thread, I have added some test cases covering various scenarios, including ones where the IO thread crashes while reading the COMMIT/XID event. In this scenario, if the filtering is ON, the GTID is added to the ignored list (ign_gtids) on the GTID event, so when the crash happens before the COMMIT/XID event, that particular GTID gets ignored after the slave reconnects.
- Add a similar test for when not using GTID mode, the last event replicated by the slave is ignored, the slave is stopped, the ignore rule removed and the slave restarted; will the last event now be re-fetched or not (I think not).
[NC] Done.
- Add a test that checks that seconds_behind_master is properly updated to zero even when the last event is ignored due to DO_DOMAIN_IDS/IGNORE_DOMAIN_IDS. It can perhaps be a bit tricky to test, but I think you can use this existing test as a basis: mysql-test/suite/rpl/t/rpl_parallel2.test
[NC] Done.
- It would also be good to check that the code works when used on an old master (which does not send GTID events). You can do this by creating a binlog file with a 5.5 server and storing it in mysql-test/std_data/, and then installing it in a master; there should be several other test cases that already do this.
[NC] Done.
4. Also see detailed comments for some possible problems with the implementation. The most serious is probably to ensure that events are not skipped after the end of the group, we need a couple of tests for this, see below. Also, there is a small memory leak, see below.
+DO_DOMAIN_IDS (before) : 1
+IGNORE_DOMAIN_IDS (before) : 2
+CHANGE MASTER TO DO_DOMAIN_IDS=(4,4,5,1,7,7,7,1,1,2,6,8,1,4,5,5,9,3), MASTER_USE_GTID=current_pos;
So if a domain_id is specified multiple times, the redundant ones will be ignored.
[NC] Right. Also, there was a bug in the current IGNORE_SERVER_IDS implementation. It's fixed now.
It would seem preferable to give an error... but as far as I can see the behaviour is the same for IGNORE_SERVER_IDS, so better to be consistent as you did in the code. Just be sure to add this to the documentation.
[NC] Ok.
=== added file 'mysql-test/suite/rpl/t/rpl_domain_id_filter.test'
--- a/mysql-test/suite/rpl/t/rpl_domain_id_filter.test	1970-01-01 00:00:00 +0000
+++ b/mysql-test/suite/rpl/t/rpl_domain_id_filter.test	2014-09-12 18:16:22 +0000
+SET @gtid_domain_id_saved = @@session.gtid_domain_id;
This is not needed, nor do you use it anywhere, you can remove it.
[NC] was a leftover, removed.
+##### Case 3: When both DO_DOMAIN_IDS and IGNORE_DOMAIN_IDS are non-empty
+
+source include/stop_slave.inc;
+let $do_domain_ids_before= query_get_value(SHOW SLAVE STATUS, Replicate_Do_Domain_Ids, 1);
+let $ignore_domain_ids_before= query_get_value(SHOW SLAVE STATUS, Replicate_Ignore_Domain_Ids, 1);
+--echo DO_DOMAIN_IDS (before) : $do_domain_ids_before
+--echo IGNORE_DOMAIN_IDS (before) : $ignore_domain_ids_before
+
+# Replicate events belonging to "domain_id 1". (DO_DOMAIN_IDS should win!)
+CHANGE MASTER TO DO_DOMAIN_IDS=(1), IGNORE_DOMAIN_IDS=(2), MASTER_USE_GTID=current_pos;
Did you intend here to test overlapping ids in DO_DOMAIN_IDS and IGNORE_DOMAIN_IDS (due to the comment "DO_DOMAIN_IDS should win") ?
[NC] No, it's about the DO_ and IGNORE_ both being non-empty. In that case DO_ should be preferred.
But if I understand correctly, IGNORE_DOMAIN_IDS will be ignored if DO_DOMAIN_IDS is non-empty.
[NC] Right.
So I think it should be an error to try to set both - do you agree?
[NC] Following the current pattern, I think DO_ should be preferred. But I can change it to fail with an error if you insist.
+--source include/rpl_end.inc
Please add additional testcases, as suggested at the start of review.
=== modified file 'sql/rpl_mi.cc'
--- a/sql/rpl_mi.cc	2014-02-19 10:05:15 +0000
+++ b/sql/rpl_mi.cc	2014-09-12 18:16:22 +0000
@@ -513,6 +513,25 @@
       else
         mi->using_gtid= Master_info::USE_GTID_NO;
     }
+
+    // DO_DOMAIN_IDS
+    if (0 == strncmp(buf, STRING_WITH_LEN("do_domain_ids=")) &&
+        mi->domain_id_filter.init_ids(buf + sizeof("do_domain_ids"),
+                                      Domain_id_filter::DO_DOMAIN_IDS))
+    {
+      sql_print_error("Failed to initialize master info do_domain_ids");
+      goto errwithmsg;
+    }
+
+    // IGNORE_DOMAIN_IDS
+    if (0 == strncmp(buf, STRING_WITH_LEN("ignore_domain_ids=")) &&
+        mi->domain_id_filter.init_ids(buf + sizeof("ignore_domain_ids"),
+                                      Domain_id_filter::IGNORE_DOMAIN_IDS))
+    {
+      sql_print_error("Failed to initialize master info "
+                      "ignore_domain_ids");
+      goto errwithmsg;
+    }
So here you use the same style as adjacent code (that I wrote). However, I have since learned that the style (0 == strncmp(...)) is not used in the server. So please change this, as well as my existing code, to use either strncmp(...) == 0 or !strncmp(), as you prefer.
[NC] Done.
Another thing from my old code is the sizeof("do_domain_ids"), which highly confused me for a bit before I realised that while the '=' is not included in the sizeof(), the terminating '\0' _is_ included. So since even I myself am confused by this code, better clarify it ;-) Perhaps it would be clearer if changed to (sizeof("do_domain_ids=")-1) (both in my old code and in your new code).
(Normally it's good to follow the style of existing code as you did, I just noticed this in the review and realised that my own "existing code" would benefit from these small improvements).
[NC] I did a bit of refactoring. Let me know if it looks ok.
@@ -634,6 +653,16 @@
     }
   }
+  char* do_domain_ids_buf=
+    mi->domain_id_filter.as_string(Domain_id_filter::DO_DOMAIN_IDS);
+  if (do_domain_ids_buf == NULL)
+    DBUG_RETURN(1);
+
+  char* ignore_domain_ids_buf=
+    mi->domain_id_filter.as_string(Domain_id_filter::IGNORE_DOMAIN_IDS);
+  if (ignore_domain_ids_buf == NULL)
+    DBUG_RETURN(1);
You have a memory leak here, do_domain_ids_buf will not be freed if ignore_domain_ids_buf fails due to out-of-memory.
[NC] Fixed.
@@ -1348,4 +1382,218 @@
   DBUG_RETURN(result);
 }
+/* ctor */
+Domain_id_filter::Domain_id_filter() : m_filter(false)
+{
+  for (int i= DO_DOMAIN_IDS; i <= IGNORE_DOMAIN_IDS; i ++)
+  {
+    my_init_dynamic_array(&m_domain_ids[i], sizeof(uint32), 16, 16, MYF(0));
+  }
+}
+
+/* dtor */
+Domain_id_filter::~Domain_id_filter()
I don't think comments like /* ctor */ and /* dtor */ are useful, better just omit them.
[NC] Done.
+/**
+  Update the domain id list of the given type with elements from the new list.
+
+  @param new_ids [IN]         new list of ids
+  @param type    [IN]         domain id list type to update
+  @param changed [IN]         has the domain id list changed in the
+                              last CHANGE MASTER command?
+
+  @retval void
+*/
+void Domain_id_filter::update_ids(DYNAMIC_ARRAY *new_ids, enum_list_type type,
+                                  bool changed)
+{
+  DYNAMIC_ARRAY *ids= &m_domain_ids[type];
+
+  if (changed == true)
+    reset_dynamic(ids);
+
+  /* bsearch requires an ordered list. */
+  sort_dynamic(new_ids, change_master_domain_id_cmp);
+
+  for (uint i= 0; i < new_ids->elements; i++)
+  {
+    uint32 domain_id;
+    get_dynamic(new_ids, (void *) &domain_id, i);
+
+    if (bsearch((const uint32 *) &domain_id, ids->buffer, ids->elements,
+                sizeof(uint32), change_master_domain_id_cmp) == NULL)
+    {
+      insert_dynamic(ids, (uint32*) &domain_id);
+    }
+  }
+  return;
+}
Ok, so I understand that this code is modelled after the existing code for IGNORED_SERVER_IDS. But I found it rather confusing, because it looks like there is a bug: if changed==false, then the code does not ensure that the resulting array will still be sorted.
I guess the bug is not possible to trigger, because the list of stuff to add will only be non-empty if changed==true? But that is rather confusing. Please simplify the logic; maybe you only need to call this function when changed==true, so the parameter can be omitted and the array always reset?
[NC] Right, in case of domain_ids the list should only be reset/updated when it has changed. I have removed the extra parameter.
Also, consider if you can rewrite the existing code for IGNORE_SERVER_IDS so that the same common code can be used as for IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS, that would simplify the code and reduce risk of bugs or inconsistencies.
[NC] Done.
+/**
+  Initialize the given domain id list with the numbers from the specified
+  buffer.
+
+  @param buf  [IN]            buffer
+  @param type [IN]            domain id list type
+
+  @retval false               Success
+          true                Error
+*/
+bool Domain_id_filter::init_ids(char *buf, enum_list_type type)
+{
+  char *token, *last;
+  uint num_items;
+  DYNAMIC_ARRAY *ids= &m_domain_ids[type];
+
+  token= strtok_r(buf, " ", &last);
+  if (token == NULL)
+  {
+    return true;                        /* error */
+  }
+
+  num_items= atoi(token);
+  for (uint i= 0; i < num_items; i++)
+  {
+    token= strtok_r(NULL, " ", &last);
+    if (token == NULL)
+    {
+      return true;                      /* error */
+    }
+    else
+    {
+      uint32 val= atoi(token);
+      insert_dynamic(ids, (uchar *) &val);
+    }
+  }
+  return false;
+}
Again, did you consider using the same code for this and for IGNORE_SERVER_IDS? (I think you did, because you added a comment to init_dynarray_intvar_from_file ;-). Maybe consider it again; it is fine to rewrite/remove the existing code for IGNORE_SERVER_IDS and replace it with your new code. Again, it would reduce risk for bugs or inconsistencies.
[NC] True, done.
+  /* getter */
+  bool get_filter() { return m_filter; }
+
+  /* setter */
+  void set_filter(uint32 domain_id);
These are not really getter and setters, I think, at least I find those names confusing.
Maybe something like do_filter(uint32 domain_id) and is_group_filtered() ? And expand the comments, so that they explain that these functions respectively check a GTID for needing to be filtered, and return whether the current group is being filtered.
[NC] Done.
Also, personally I dislike setter/getter, and would prefer just to access the m_filter as a public member. But I suppose that's style and up to you.
[NC] I have kept it private for now to avoid accidental modification.
+  /*
+    Initialize the given domain id list with the numbers from the specified
+    buffer.
+
+    @param buf  [IN]          buffer
+    @param type [IN]          domain id list type
+
+    @retval false             Success
+            true              Error
+  */
+  bool init_ids(char *buf, enum_list_type type);
Clarify in the comment that the list is initialized from a string of space-separated numbers, the first of which is the number of following entries.
[NC] Done.
@@ -182,6 +279,7 @@
                       bool flush_relay_log_cache, bool need_lock_relay_log);
 int change_master_server_id_cmp(ulong *id1, ulong *id2);
+int change_master_domain_id_cmp(uint32 *id1, uint32 *id2);
I think change_master_domain_id_cmp() is only used in rpl_mi.cc, so it should be static there, and not referred to here in rpl_mi.h.
[NC] Done. Also renamed it to change_master_ids_cmp to be used by IGNORE_SERVER_IDS impl as well.
=== modified file 'sql/slave.cc'
--- a/sql/slave.cc	2014-08-12 03:55:41 +0000
+++ b/sql/slave.cc	2014-09-12 18:16:22 +0000
@@ -1259,6 +1259,7 @@
   going to ignore (to not log them in the relay log). Items being read are
   supposed to be decimal output of values of a type shorter or equal of @c long
   and separated by the single space.
+  It also used to restore DO_DOMAIN_IDS & IGNORE_DOMAIN_IDS lists.
This comment suggests you wanted to share code, then later changed your mind. If you end up keeping the code separate, you still need to delete this comment. (And there is a typo: "It also" -> "It is also", but if you remove it anyway...).
[NC] I have modified the patch to use this function now.
@@ -5601,6 +5610,10 @@
       mi->last_queued_gtid= event_gtid;
       mi->last_queued_gtid_standalone=
         (gtid_flag & Gtid_log_event::FL_STANDALONE) != 0;
+
+      /* Should filter all the subsequent events in the current GTID group? */
+      mi->domain_id_filter.set_filter(event_gtid.domain_id);
So, here we will set m_filter to true, if we want to skip the following group.
But I don't see any code that will set it to false at the end of the group?
[NC] With the previous logic m_filter was set/reset only at GTID event. I have now added the logic to reset this flag on COMMIT/XID events too.
That seems a problem, as we could then end up wrongly skipping all kinds of other events, such as Gtid_list, Incident, ...
[NC] Right. I have added the following event types in the exception list: FORMAT_DESCRIPTION_EVENT, ROTATE_EVENT, GTID_LIST_EVENT, HEARTBEAT_LOG_EVENT and INCIDENT_EVENT.
Also note that it is possible for a GTID group to not be terminated by COMMIT/XID. This can happen at least if the master crashes in the middle of writing a group; the slave will receive only the first part of it, followed by a master restart Format_description event. That case needs to be handled, and you should add a test case for it.
[NC] Added a test to cover this scenario.
Also it is possible for eg. a master downgrade to introduce following event groups that do not start with GTID - seems these would also be wrongly skipped? So we also need a test case for this...
[NC] Right, it's a valid case, but I couldn't find a way to repeat this scenario in mtr. Suggestions?
Hope this helps,
Thanks! -- Nirbhay
- Kristian.
Nirbhay Choubey <nirbhay@mariadb.com> writes:
[NC] You are right, setting both DO_ and IGNORE_ at the same time would not make any sense. But I actually followed the pattern of the existing replication, where both _do_ and _ignore_ filters are allowed to hold values at the same time and _do_ is preferred in case both are set.
But if it does not make any sense, why allow it? Not giving an error will just risk confusing users, there seems no benefit. So I still think we need an error.
- Add a test for the following scenario when using GTID mode: Some domain id is ignored, some events are received from the master and ignored. Then the slave is stopped, and the ignore rule is removed. What happens when the slave is restarted and reconnects? I think the previously ignored domain id events will now be re-fetched and replicated. I guess that's ok, but it should be documented. Also test what happens when binlog files on the master have been rotated and purged, so that the events are no longer available on the master (an error should occur).
[NC] As the filtering takes place in the IO thread, I have added some test cases covering various scenarios, including ones where the IO thread crashes while reading the COMMIT/XID event. In this scenario, if the filtering is ON, the GTID is added to the ignored list (ign_gtids) on the GTID event, so when the crash happens before the COMMIT/XID event, that particular GTID gets ignored after the slave reconnects.
Ok, I'd forgotten about the ign_gtids. So if I understand correctly, this means that when we ignore a transaction due to filtered domain, the IO thread logs a GTID_LIST event to the relay log, and the SQL thread then updates the slave position to after the ignored events. So this means that an ignored event stays ignored, even if the domain id filter is later removed, that is good.

Crash tests are always good and needed. However, the test I was asking for is what happens when the configuration is changed:

  CHANGE MASTER TO ignore_domain_ids=(2), master_use_gtid=current_pos;
  START SLAVE;
  # Replicate a number of events in domain 2.
  STOP SLAVE;
  CHANGE MASTER TO ignore_domain_ids=();
  START SLAVE;
  # Replicate some more events in domain 2 from master.
  # Test that the events received before the filter was removed are still
  # ignored, while those received after were applied.
@@ -5601,6 +5610,10 @@
       mi->last_queued_gtid= event_gtid;
       mi->last_queued_gtid_standalone=
         (gtid_flag & Gtid_log_event::FL_STANDALONE) != 0;
+
+      /* Should filter all the subsequent events in the current GTID group? */
+      mi->domain_id_filter.set_filter(event_gtid.domain_id);
So, here we will set m_filter to true, if we want to skip the following group.
But I don't see any code that will set it to false at the end of the group?
[NC] With the previous logic m_filter was set/reset only at GTID event. I have now added the logic to reset this flag on COMMIT/XID events too.
I guess you mean this code?

  if ((mi->using_gtid != Master_info::USE_GTID_NO) &&
      mi->domain_id_filter.is_group_filtered() && is_commit_rollback)
  {
    mi->domain_id_filter.reset_filter();
  }

But that is not sufficient. This is only valid for GTIDs that have the FL_STANDALONE flag clear. For GTIDs with FL_STANDALONE set, the event group ends after the first event for which Log_event::is_part_of_group() returns false.

(There are two kinds of event groups. Some use BEGIN ... COMMIT or BEGIN ... XID; others have just a single statement, possibly with INTVAR or so events before, like CREATE TABLE ...)
That seems a problem, as we could then end up wrongly skipping all kinds of other events, such as Gtid_list, Incident, ...
[NC] Right. I have added the following events types in the exception list : FORMAT_DESCRIPTION_EVENT, ROTATE_EVENT, GTID_LIST_EVENT, HEARTBEAT_LOG_EVENT and INCIDENT_EVENT.
Hm, that seems quite error prone if a new event is added later and not added to the list. Can you use the Log_event::is_group_event() function to check? That function is more likely to not be forgotten, and it seems all of its events should not be ignored in case of domain id filter?
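A rough sketch of what such a check might look like inside queue_event() (assuming the existing skip_relay_logging code path is reused and that Log_event::is_group_event() takes the event type, as described above; this is not the committed code):

  if (mi->domain_id_filter.is_group_filtered() &&
      Log_event::is_group_event((Log_event_type) (uchar) buf[EVENT_TYPE_OFFSET]))
  {
    /* The event belongs to a filtered-out group: do not write it to the
       relay log. Events that are not group events (Format_description,
       Rotate, Gtid_list, Heartbeat, Incident, ...) fall through and are
       queued normally. */
    goto skip_relay_logging;
  }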
Also it is possible for eg. a master downgrade to introduce following event groups that do not start with GTID - seems these would also be wrongly skipped? So we also need a test case for this...
[NC] Right, it's a valid case, but I couldn't find a way to repeat this scenario in mtr. Suggestions?
I guess like your test in rpl_domain_id_filter_old_master.test, where you copy an old binlog file into the master server. To test this, just copy in a master-bin.000002 with old events, when the last event in master-bin.000001 is skipped by the slave due to domain filters.
For example, it seems to me that in non-GTID mode, if we restart the server in the middle of receiving an event group, we can easily end up with ignoring one half of the group and not the other, which is very bad?
The filtering would only work when slave is configured with GTID.
What do you mean "only work"? That trying to use domain id filters in non-gtid mode will give an error? Or that the filter will be ignored? Or that it will just malfunction?

I think it is reasonable to say that domain filtering is only available in GTID mode. But then it needs to be an error if the user tries to use the domain filter in non-gtid mode. Thus:

  CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=(1)

should give an error. And setting just master_use_gtid=no when there is a filter, or just setting a filter when master_use_gtid=no, both should give errors as well. Right?

And now I saw a commit "Added some more test cases to non-GTID mode." which seems to add test cases to test domain id filters precisely when MASTER_USE_GTID=no? So now I am rather confused as to what you mean with "only work when slave is configured with GTID" ?
For non-GTID mode, maybe we need to handle the ignoring in the SQL thread instead? Or alternatively, we can make ignoring events based on domain_id only legal in GTID mode, and give an error in non-GTID?
I think the patch follows the 2nd approach minus the error. Could you please elaborate on the error?
If domain id filters are not available in non-GTID mode (master_use_gtid=no), then the server needs to give an error if the user tries to configure CHANGE MASTER to have domain id filters in non-gtid mode. Otherwise the user could be unpleasantly surprised.
Why do you need to use ulong? Domain ids should always be uint32, also on 64-bit.
What is the real problem? Eg. why is there a test failure?
It was due to the different sizes for ulong on different platforms. While trying to reuse the (modified) auxiliary functions for IGNORE_SERVER_IDS, where ulong is used for the ids (server_id), storing domain_id as uint32 did not work on 64-bit.
Ok, I see, this is to re-use change_master_id_cmp() between server id filter and domain id filter. For some reason the code for server_id filter is using ulong, that is what surprised me. Well, I suppose it doesn't matter much one way or the other.
##### Case 7: Stop slave into the middle of a transaction being filtered and
#             start it back with filtering disabled.
--echo # On master
connection master;
SET @@session.gtid_domain_id=1;
BEGIN;
INSERT INTO t2 VALUES(3);
INSERT INTO t3 VALUES(3);
sync_slave_with_master;
No, this does not work. Transactions are always binlogged as a whole on the master, during COMMIT. So to reliably test the slave I/O thread disconnecting in the middle of an event group, you need to use some DBUG error injection. You can see an example of doing this in rpl_gtid_reconnect.test, though here we may need a slightly different DBUG injection, as we need to reconfigure the slave in the middle of the group, not trigger an automatic reconnect.

But wouldn't it just be better to give an error if the user tries to use domain id filters with MASTER_USE_GTID=NO ? It could be tricky to implement correctly, and it doesn't seem an important use case?

In GTID mode, things are a bit simpler - because one can not change the filters without stopping both SQL and IO thread. And in GTID mode, the I/O thread deletes the old relay logs when reconnecting after both slave threads have been stopped, thus there is no opportunity to filter only half of an event group when changing domain id filters.

 - Kristian.
Hi! On Mon, Oct 27, 2014 at 9:12 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Nirbhay Choubey <nirbhay@mariadb.com> writes:
[NC] You are right, setting both DO_ and IGNORE_ at the same time would not make any sense. But I actually followed the pattern of the existing replication, where both _do_ and _ignore_ filters are allowed to hold values at the same time and _do_ is preferred in case both are set.
But if it does not make any sense, why allow it? Not giving an error will just risk confusing users, there seems no benefit.
Right.
So I still think we need an error.
Ok. CHANGE MASTER would now fail when both DO_ and IGNORE_ are non-empty at the same time or when MASTER_USE_GTID=no with non-empty DO_ or IGNORE_.
- Add a test for the following scenario when using GTID mode: Some domain id is ignored, some events are received from the master and ignored. Then the slave is stopped, and the ignore rule is removed. What happens when the slave is restarted and reconnects? I think the previously ignored domain id events will now be re-fetched and replicated. I guess that's ok, but it should be documented. Also test what happens when binlog files on the master have been rotated and purged, so that the events are no longer available on the master (an error should occur).
[NC] As the filtering takes place in the IO thread, I have added some test cases covering various scenarios, including ones where the IO thread crashes while reading the COMMIT/XID event. In this scenario, if the filtering is ON, the GTID is added to the ignored list (ign_gtids) on the GTID event, so when the crash happens before the COMMIT/XID event, that particular GTID gets ignored after the slave reconnects.
Ok, I'd forgotten about the ign_gtids.
So if I understand correctly, this means that when we ignore a transaction due to filtered domain, the IO thread logs a GTID_LIST event to the relay log, and the SQL thread then updates the slave position to after the ignored events.
So this means that an ignored event stays ignored, even if the domain id filter is later removed, that is good.
Yes.
Crash tests are always good and needed. However, the test I was asking for is what happens when the configuration is changed:
CHANGE MASTER TO ignore_domain_ids=(2), master_use_gtid=current_pos;
START SLAVE;
# Replicate a number of events in domain 2.
STOP SLAVE;
CHANGE MASTER TO ignore_domain_ids=();
START SLAVE;
# Replicate some more events in domain 2 from master.
# Test that the events received before the filter was removed are still
# ignored, while those received after were applied.
I have added a test covering this scenario.
@@ -5601,6 +5610,10 @@
       mi->last_queued_gtid= event_gtid;
       mi->last_queued_gtid_standalone=
         (gtid_flag & Gtid_log_event::FL_STANDALONE) != 0;
+
+      /* Should filter all the subsequent events in the current GTID group? */
+      mi->domain_id_filter.set_filter(event_gtid.domain_id);
So, here we will set m_filter to true, if we want to skip the following group.
But I don't see any code that will set it to false at the end of the group?
[NC] With the previous logic m_filter was set/reset only at GTID event. I have now added the logic to reset this flag on COMMIT/XID events too.
I guess you mean this code?
  if ((mi->using_gtid != Master_info::USE_GTID_NO) &&
      mi->domain_id_filter.is_group_filtered() && is_commit_rollback)
  {
    mi->domain_id_filter.reset_filter();
  }
But that is not sufficient. This is only valid for GTIDs that have the FL_STANDALONE flag clear. For GTIDs with FL_STANDALONE set, the event group ends after the first event for which Log_event::is_part_of_group() returns false.
(There are two kinds of event groups. Some use BEGIN ... COMMIT or BEGIN ... XID; Others have just a single statement, possibly with INTVAR or so events before, like CREATE TABLE ...).
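As a sketch only of what the end-of-group handling needs to cover, run after the current event has been queued or skipped (names such as is_commit_rollback, last_queued_gtid_standalone and Log_event::is_part_of_group() are taken from the discussion above; the exact signatures in the final patch may differ):

  if (mi->using_gtid != Master_info::USE_GTID_NO &&
      mi->domain_id_filter.is_group_filtered())
  {
    if (mi->last_queued_gtid_standalone)
    {
      /* Standalone GTID: the group ends with the first event that is not
         part of the group (there is no terminating COMMIT/XID). */
      if (!Log_event::is_part_of_group((Log_event_type) (uchar) buf[EVENT_TYPE_OFFSET]))
        mi->domain_id_filter.reset_filter();
    }
    else if (is_commit_rollback)
    {
      /* BEGIN ... COMMIT/XID (or ROLLBACK) style group ends here. */
      mi->domain_id_filter.reset_filter();
    }
  }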
I see. Fixed.
That seems a problem, as we could then end up wrongly skipping all kinds of other events, such as Gtid_list, Incident, ...

[NC] Right. I have added the following event types in the exception list: FORMAT_DESCRIPTION_EVENT, ROTATE_EVENT, GTID_LIST_EVENT, HEARTBEAT_LOG_EVENT and INCIDENT_EVENT.
Hm, that seems quite error prone if a new event is added later and not added to the list.
Can you use the Log_event::is_group_event() function to check? That function is more likely to not be forgotten, and it seems all of its events should not be ignored in case of domain id filter?
Right. Done!
Also it is possible for eg. a master downgrade to introduce following event groups that do not start with GTID - seems these would also be wrongly skipped? So we also need a test case for this...
[NC] Right, it's a valid case, but I couldn't find a way to repeat this scenario in mtr. Suggestions?
I guess like your test in rpl_domain_id_filter_old_master.test, where you copy an old binlog file into the master server. To test this, just copy in a master-bin.000002 with old events, when the last event in master-bin.000001 is skipped by the slave due to domain filters.
Now that DO_ & IGNORE_ can't be used with MASTER_USE_GTID=no, using CURRENT_POS/SLAVE_POS causes this old binlog file to be skipped. So, I had to remove this test case.
For example, it seems to me that in non-GTID mode, if we restart the server in the middle of receiving an event group, we can easily end up with ignoring one half of the group and not the other, which is very bad?
The filtering would only work when slave is configured with GTID.
What do you mean "only work"? That trying to use domain id filters in non-gtid mode will give an error? Or that the filter will be ignored? Or that it will just malfunction?
It meant that in non-GTID mode, the domain_id filters will simply get ignored. But, following your suggestion, I have now modified the logic to give an error if a domain_id filter is set with MASTER_USE_GTID=no.
I think it is reasonable to say that domain filtering is only available in GTID mode.
Right.
But then it needs to be an error if user tries to use the domain filter in non-gtid mode.
Thus:
CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=(1)
should give an error. And setting just master_use_gtid=no when there is a filter, or just setting a filter when master_use_gtid=no, both should give errors as well.
Right?
Yes. I have also added some tests covering various combinations.
And now I saw a commit "Added some more test cases to non-GTID mode." which seems to add test cases to test domain id filters precisely when MASTER_USE_GTID=no? So now I am rather confused as to what you mean with "only work when slave is configured with GTID" ?
The non-gtid mode tests were to make sure replication happens "normally" in non-gtid mode even when domain_id filters were set. But now that this combination (domain_id filters in non-gtid mode) is no longer valid, I have removed the non-gtid tests.
For non-GTID mode, maybe we need to handle the ignoring in the SQL thread instead? Or alternatively, we can make ignoring events based on domain_id only legal in GTID mode, and give an error in non-GTID?
I think the patch follows the 2nd approach minus the error. Could you please elaborate on the error?
If domain id filters are not available in non-GTID mode (master_use_gtid=no), then the server needs to give an error if the user tries to configure CHANGE MASTER to have domain id filters in non-gtid mode. Otherwise the user could be unpleasantly surprised.
Right, fixed.
Why do you need to use ulong? Domain ids should always be uint32, also on 64-bit.
What is the real problem? Eg. why is there a test failure?
It was due to the different sizes for ulong on different platforms. While trying to reuse the (modified) auxiliary functions for IGNORE_SERVER_IDS, where ulong is used for the ids (server_id), storing domain_id as uint32 did not work on 64-bit.
Ok, I see, this is to re-use change_master_id_cmp() between server id filter and domain id filter. For some reason the code for server_id filter is using ulong, that is what surprised me. Well, I suppose it doesn't matter much one way or the other.
ok.
##### Case 7: Stop slave into the middle of a transaction being filtered and
#             start it back with filtering disabled.
--echo # On master
connection master;
SET @@session.gtid_domain_id=1;
BEGIN;
INSERT INTO t2 VALUES(3);
INSERT INTO t3 VALUES(3);
sync_slave_with_master;
No, this does not work. Transactions are always binlogged as a whole on the master, during COMMIT.
You are right. My original intent was to test a transaction which modifies both MyISAM and InnoDB tables, where the first modification is done in the MyISAM table. In that case the changes to MyISAM are sent to the slave right away, while the rest of the trx is sent on commit. I have modified the test accordingly.
So to reliably test the slave I/O thread disconnecting in the middle of an event group, you need to use some DBUG error injection. You can see an example of doing this in rpl_gtid_reconnect.test, though here we may need a slightly different DBUG injection, as we need to reconfigure the slave in the middle of the group, not trigger an automatic reconnect.
But wouldn't it just be better to give an error if the user tries to use domain id filters with MASTER_USE_GTID=NO ? It could be tricky to implement correctly, and it doesn't seem an important use case?
I agree.
In GTID mode, things are a bit simpler - because one can not change the filters without stopping both SQL and IO thread. And in GTID mode, the I/O thread deletes the old relay logs when reconnecting after both slave threads have been stopped, thus there is no opportunity to filter only half of an event group when changing domain id filters.
Right. Best, Nirbhay
Nirbhay Choubey <nirbhay@mariadb.com> writes:
##### Case 7: Stop slave into the middle of a transaction being filtered and
#             start it back with filtering disabled.
--echo # On master
connection master;
SET @@session.gtid_domain_id=1;
BEGIN;
INSERT INTO t2 VALUES(3);
INSERT INTO t3 VALUES(3);
sync_slave_with_master;
No, this does not work. Transactions are always binlogged as a whole on the master, during COMMIT.
You are right. My original intent was to test a transaction which modifies both MyISAM and InnoDB tables, where the first modification is done in the MyISAM table. In that case the changes to MyISAM are sent to the slave right away, while the rest of the trx is sent on commit. I have modified the test accordingly.
I'm still not sure you understand the scenario I had in mind. It's not about what happens on the master during the transaction. It is about what happens in case the slave disconnects in the middle of receiving an event group/transaction.

In general in replication, the major part of the work is not implementing the functionality for the normal case - that is usually relatively easy. The major part is handling and testing all the special cases that can occur in special scenarios, especially various error cases. The replication code is really complex in this respect, and the fact that things by their nature happen in parallel between different threads and different servers make things even more complex.

What I wanted you to think about here is what happens if the slave is disconnected from the master after having received the first half of an event group. For example due to network error. This will not happen normally in a mysql-test-case run, and if it happens in a production site for a user, it will be extremely hard to track down.

In this case, the second half of the event group could be received much later than the first half. The IO thread could have been stopped (or even the whole mysqld server could have been stopped) in-between, and the replication could have been re-configured with CHANGE MASTER. Since the IO thread is doing the filtering, it seems very important to consider what will happen if eg. filters are enabled while receiving the first half of the transaction, but disabled while receiving the second half:

Suppose we have this transaction:

  BEGIN GTID 2-1-100
  INSERT INTO t1 VALUES (1);
  INSERT INTO t1 VALUES (2);
  COMMIT;

What happens in the following scenario?

  CHANGE MASTER TO master_use_gtid=current_pos, ignore_domain_ids=(2);
  START SLAVE;
  # slave IO thread connects to master;
  # slave receives: BEGIN GTID 2-1-100; INSERT INTO t1 VALUES (1);
  # slave IO thread is disconnected from master
  STOP SLAVE;
  # slave mysqld process is stopped and restarted.
  CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=();
  START SLAVE;
  # slave IO thread connects to master;
  # slave IO thread receives: INSERT INTO t1 VALUES (2); COMMIT;

Are you sure that this will work correctly? And what does "work correctly" mean in this case? Will the transaction be completely ignored? Or will it be completely replicated on the slave? The bug would be if the first half would be ignored, but the second half still written into the relay log.

To test this, you would need to use DBUG error insertion. There are already some tests that do this. They use for example

  SET GLOBAL debug_dbug="+d,binlog_force_reconnect_after_22_events";

The code will then (in debug builds) simulate a disconnect at some particular point in the replication stream, allowing this rare but important case to be tested. This is done using DBUG_EXECUTE_IF() in the code.

To work on replication without introducing nasty bugs, it is important to think through cases like this carefully, and to convince yourself that things will work correctly. Disconnects at various points, crashes on the master or slave, errors during applying events or writing to the relay logs, and so on.

Hope this helps,

 - Kristian.
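For reference, such a DBUG injection point in the server code looks roughly like the following (a sketch only; the counter dbug_event_count is hypothetical, and the assumption is that setting mi->abort_slave makes the I/O thread stop much like a KILL would - the actual placement and mechanism in the real tests differ):

  DBUG_EXECUTE_IF("kill_slave_io_after_2_events",
    {
      static int dbug_event_count= 0;   /* hypothetical counter, for illustration */
      if (++dbug_event_count == 2)
        mi->abort_slave= 1;             /* simulate the I/O thread being killed mid-group */
    });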
Hi! On Fri, Nov 14, 2014 at 3:58 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Nirbhay Choubey <nirbhay@mariadb.com> writes:
##### Case 7: Stop slave into the middle of a transaction being filtered and
#             start it back with filtering disabled.
--echo # On master
connection master;
SET @@session.gtid_domain_id=1;
BEGIN;
INSERT INTO t2 VALUES(3);
INSERT INTO t3 VALUES(3);
sync_slave_with_master;
No, this does not work. Transactions are always binlogged as a whole on the master, during COMMIT.
You are right. My original intent was to test a transaction which modifies both MyISAM and InnoDB tables, where the first modification is done in the MyISAM table. In that case the changes to MyISAM are sent to the slave right away, while the rest of the trx is sent on commit. I have modified the test accordingly.
I'm still not sure you understand the scenario I had in mind. It's not about what happens on the master during the transaction. It is about what happens in case the slave disconnects in the middle of receiving an event group/transaction.
You are perhaps looking at an older version of the test. The latest says :

<cut>
##### Case 7: Stop slave before a transaction (involving MyISAM and InnoDB
#             table) being filtered commits and start it back with filtering
#             disabled.
...
</cut>
In general in replication, the major part of the work is not implementing the functionality for the normal case - that is usually relatively easy. The major part is handling and testing all the special cases that can occur in special scenarios, especially various error cases. The replication code is really complex in this respect, and the fact that things by their nature happen in parallel between different threads and different servers make things even more complex.
What I wanted you to think about here is what happens if the slave is disconnected from the master after having received the first half of an event group. For example due to network error. This will not happen normally in a mysql-test-case run, and if it happens in a production site for a user, it will be extremely hard to track down.
In this case, the second half of the event group could be received much later than the first half. The IO thread could have been stopped (or even the whole mysqld server could have been stopped) in-between, and the replication could have been re-configured with CHANGE MASTER. Since the IO thread is doing the filtering, it seems very important to consider what will happen if eg. filters are enabled while receiving the first half of the transaction, but disabled while receiving the second half:
Suppose we have this transaction:
BEGIN GTID 2-1-100
INSERT INTO t1 VALUES (1);
INSERT INTO t1 VALUES (2);
COMMIT;
What happens in the following scenario?
CHANGE MASTER TO master_use_gtid=current_pos, ignore_domain_ids=(2);
START SLAVE;
# slave IO thread connects to master;
# slave receives: BEGIN GTID 2-1-100; INSERT INTO t1 VALUES (1);
# slave IO thread is disconnected from master
STOP SLAVE;
# slave mysqld process is stopped and restarted.
CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=();
START SLAVE;
# slave IO thread connects to master;
# slave IO thread receives: INSERT INTO t1 VALUES (2); COMMIT;
Are you sure that this will work correctly? And what does "work correctly" mean in this case? Will the transaction be completely ignored? Or will it be completely replicated on the slave? The bug would be if the first half would be ignored, but the second half still written into the relay log.
To test this, you would need to use DBUG error insertion. There are already some tests that do this. They use for example
SET GLOBAL debug_dbug="+d,binlog_force_reconnect_after_22_events";
The code will then (in debug builds) simulate a disconnect at some particular point in the replication stream, allowing this rare but important case to be tested. This is done using DBUG_EXECUTE_IF() in the code.
I had already added multiple test cases under rpl_domain_id_filter_io_crash.test using DBUG_EXECUTE_IF("kill_io_slave_before_commit") in the previous commit. Even though it is not exactly similar to what you suggest, it does, however, try to kill the I/O thread when it receives a COMMIT/XID event (cases 0 - 3), in order to test what happens when the I/O thread exits before reading the complete transaction or group, with filtering enabled before/after slave restart. Following your suggestion, I have now added 2 more cases (4 and 5) using DBUG_EXECUTE_IF("kill_slave_io_after_2_events") to kill the I/O thread after reading the first INSERT in a transaction. The outcome is as expected.
To work on replication without introducing nasty bugs, it is important to think through cases like this carefully, and to convince yourself that things will work correctly. Disconnects at various points, crashes on the master or slave, errors during applying events or writing to the relay logs, and so on.
I agree.
Hope this helps,
Indeed. Best, Nirbhay
Nirbhay Choubey <nirbhay@mariadb.com> writes:
Suppose we have this transaction:
BEGIN GTID 2-1-100
INSERT INTO t1 VALUES (1);
INSERT INTO t1 VALUES (2);
COMMIT;
What happens in the following scenario?
CHANGE MASTER TO master_use_gtid=current_pos, ignore_domain_ids=(2);
START SLAVE;
# slave IO thread connects to master;
# slave receives: BEGIN GTID 2-1-100; INSERT INTO t1 VALUES (1);
# slave IO thread is disconnected from master
STOP SLAVE;
# slave mysqld process is stopped and restarted.
CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=();
START SLAVE;
# slave IO thread connects to master;
# slave IO thread receives: INSERT INTO t1 VALUES (2); COMMIT;
Ah, now I see. CHANGE MASTER deletes the relay logs. So it's not a problem if the ignore_domain_ids changes in the middle of receiving an event group, we will discard that partial event group anyway. (I missed the fact that (almost) any CHANGE MASTER deletes the relay logs, but it's actually documented). Ok that's good, so there is no bug here.
Following your suggestion, I have now added 2 more cases (4 and 5) using DBUG_EXECUTE_IF("kill_slave_io_after_2_events") to kill the I/O thread after reading the first INSERT in a transaction. The outcome is as expected.
Cool, thanks. Looks good now, I think. - Kristian.