Hi! On Mon, Oct 27, 2014 at 9:12 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Nirbhay Choubey <nirbhay@mariadb.com> writes:
[NC] You are right, setting both DO_ and IGNORE_ at the same time would not make any sense. But I actually followed the pattern of the existing replication, where both _do_ and _ignore_ filters are allowed to hold values at the same time and _do_ is preferred in case both are set.
But if it does not make any sense, why allow it? Not giving an error will just risk confuse users, there seems no benefit.
Right.
So I still think we need an error.
Ok. CHANGE MASTER would now fail when both DO_ and INGORE_ are non-empty at the same time or when MASTER_USE_GTID=no with non-empty DO_ or IGNORE_.
- Add a test for the following scenario when using GTID mode: Some domain id is ignored, some events are received from the master and ignored. Then the slave is stopped, and the ignore rule is removed. What happens when the slave is restarted and reconnects? I think the previously ignored domain id events will now be re-fetched and replicated. I guess that's ok, but it should be documented. Also test what happens when binlog files on the master have been rotated and purged, so that the events are no longer available on the master (an error should occur).
[NC] As the filtering takes place in the IO thread. I have added some test cases covering various scenarios including ones where IO thread crashes while reading COMMIT/XID event. In this scenario if the filtering is ON the GTID is added to the ignored list (ign_gtids) on GTID event, so when the crash happens before COMMIT/XID event, that particular GTID gets ignored after the slave reconnects.
Ok, I'd forgotten about the ign_gtids.
So if I understand correctly, this means that when we ignore a transaction due to filtered domain, the IO thread logs a GTID_LIST event to the relay log, and the SQL thread then updates the slave position to after the ignored events.
So this means that an ignored event stays ignored, even if the domain id filter is later removed, that is good.
Yes.
Crash tests are always good and needed. However, the test I was asking for is what happens when the configuration is changed:
CHANGE MASTER TO ignore_domain_ids=(2), master_use_gtid=current_pos; START SLAVE; # Replicate a number of events in domain 2. STOP SLAVE; CHANGE MASTER TO ignore_domain_ids=(); START SLAVE; # Replicate some more events in domain 2 from master. # Test that the events received before the filter was removed are still # ignored, while those received after were applied.
I have added a test covering this scenario.
@@ -5601,6 +5610,10 @@ mi->last_queued_gtid= event_gtid; mi->last_queued_gtid_standalone= (gtid_flag & Gtid_log_event::FL_STANDALONE) != 0; + + /* Should filter all the subsequent events in the current GTID group? */ + mi->domain_id_filter.set_filter(event_gtid.domain_id);
So, here we will set m_filter to true, if we want to skip the following group.
But I don't see any code that will set it to false at the end of the group?
[NC] With the previous logic m_filter was set/reset only at GTID event. I have now added the logic to reset this flag on COMMIT/XID events too.
I guess you mean this code?
if ((mi->using_gtid != Master_info::USE_GTID_NO) && mi->domain_id_filter.is_group_filtered() && is_commit_rollback) { mi->domain_id_filter.reset_filter(); }
But that is not sufficient. This is only valid for GTIDs that have the FL_STANDALONE flag clear. For GTIDs with FL_STANDALONE set, the event groups ends after then first event for which Log_event::is_part_of_group() returns false.
(There are two kinds of event groups. Some use BEGIN ... COMMIT or BEGIN ... XID; Others have just a single statement, possibly with INTVAR or so events before, like CREATE TABLE ...).
I see. Fixed.
That seems a problem, as we could then end up wrongly skipping all
other events, such as Gtid_list, Incident, ...
[NC] Right. I have added the following events types in the exception
kinds of list :
FORMAT_DESCRIPTION_EVENT, ROTATE_EVENT, GTID_LIST_EVENT, HEARTBEAT_LOG_EVENT and INCIDENT_EVENT.
Hm, that seems quite error prone if a new event is added later and not added to the list.
Can you use the Log_event::is_group_event() function to check? That function is more likely to not be forgotten, and it seems all of its events should not be ignored in case of domain id filter?
Right. Done!
Also it is possible for eg. a master downgrade to introduce following event groups that do not start with GTID - seems these would also be wrongly skipped? So we also need a test case for this...
[NC] Right, its a valid case, but I couldn't find a way to repeat this scenario in mtr. Suggestions?
I guess like your test in rpl_domain_id_filter_old_master.test, where you copy an old binlog file into the master server. To test this, just copy in a master-bin.000002 with old events, when the last event in master-bin.000001 is skipped by the slave due to domain filters.
Now that DO_ & IGNORE_ can''t be used with MASTER_USE_GTID=no, using CURRENT_POS/SLAVE_POS causes this old binlog file to skip. So, I had to remove this test case.
For example, it seems to me that in non-GTID mode, if we restart the server in the middle of receiving an event group, we can easily end up with ignoring one half of the group and not the other, which is very bad?
The filtering would only work when slave is configured with GTID.
What do you mean "only work"? That trying to use domain id filters in non-gtid mode will give an error? Or that the filter will be ignored? Or that it will just malfunction?
It meant that in non-GTID mode, the domain_id filters with simply get ignored. But, following your suggestion, I have now modified the log to give error if domin_id filter is set with MASTER_USE_GTID=no.
I think it is reasonable to say that domain filtering is only available in GTID mode.
Right.
But then it needs to be an error if user tries to use the domain filter in non-gtid mode.
Thus:
CHANGE MASTER TO master_use_gtid=no, ignore_domain_ids=(1)
should give an error. And setting just master_use_gtid=no when there is a filter, or just setting a filter when master_use_gtid=no, both should give errors as well.
Right?
Yes. I have also added some tests covering various combinations.
And now I saw a commit "Added some more test cases to non-GTID mode." which seems to add test cases to test domain id filters precisely when MASTER_USE_GTID=no? So now I am rather confused as to what you mean with "only work when slave is configured with GTID" ?
non-gtid mode tests were to make sure replication happens "normally" in non-gtid mode even when domain_id filters were set. But now since this combination (domain_id filters in non-gtid mode) is not valid any more, I have removed the non-gtid tests.
For non-GTID mode, maybe we need to handle the ignoring in the SQL thread instead? Or alternatively, we can make ignoring events based on domain_id only legal in GTID mode, and give an error in non-GTID?
I think the patch follows the 2nd approach minus the error. Could you please elaborate on the error?
If domain id filters are not available in non-GTID mode (master_use_gtid=no), then the server needs to give an error if the user tries to configure CHANGE MASTER to have domain id filters in non-gtid mode. Otherwise the user could be unpleasantly surprised.
Right, fixed.
Why do you need to use ulong? Domain ids should always be uint32, also on 64-bit.
What is the real problem? Eg. why is there a test failure?
It was due to the different sizes for ulong on different platforms. While trying to reuse the (modified) auxiliary functions for IGNORE_SERVER_IDS where ulong is used for ids (server_id), uint32 store for domain_id did not work on 64-bit.
Ok, I see, this is to re-use change_master_id_cmp() between server id filter and domain id filter. For some reason the code for server_id filter is using ulong, that is what surprised me. Well, I suppose it doesn't matter much one way or the other.
ok.
##### Case 7: Stop slave into the middle of a transaction being filtered and # start it back with filtering disabled.
--echo # On master connection master; SET @@session.gtid_domain_id=1; BEGIN; INSERT INTO t2 VALUES(3); INSERT INTO t3 VALUES(3); sync_slave_with_master;
No, this does not work. Transactions are always binlogged as a whole on the master, during COMMIT.
You are right. My original intent was to test a transaction which modifies both MyISAM and InnoDB tables, where first modification is done in MyISAM table. In which case the changes to MyISAM is sent to the slave right away, while rest of trx is sent on commit. I have modified the test accordingly.
So to reliably test the slave I/O thread disconnecting in the middle of an event group, you need to use some DBUG error injection. You can see an example of doing this in rpl_gtid_reconnect.test, though here we may need a slightly different DBUG injection, as we need to reconfigure the slave in the middle of the group, not trigger an automatic reconnect.
But wouldn't it just be better to give an error if the user tries to use domain id filters with MASTER_USE_GTID=NO ? It could be tricky to implement correctly, and it doesn't seem an important use case?
I agree.
In GTID mode, things are a bit simpler - because one can not change the filters without stopping both SQL and IO thread. And in GTID mode, the I/O thread deletes the old relay logs reconnecting after both slave threads have been stopped, thus there is no opportunity to filter only half of an event group when changing domain id filters.
Right. Best, Nirbhay