Hi Kristian!

Thank you for the review remarks. I have committed a patch addressing your comments.


See inline for my response on specific comments.

On Mon, Sep 15, 2014 at 9:25 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Nirbhay Choubey <nirbhay@skysql.com> writes:

> ------------------------------------------------------------
> revno: 3879
> revision-id: nirbhay@skysql.com-20140912181622-s33e7r0vtgpju95q
> parent: nirbhay@skysql.com-20140814224304-rtkeal7p9lk06li1
> committer: Nirbhay Choubey <nirbhay@skysql.com>
> branch nick: maria-10.0-galera_6593
> timestamp: Fri 2014-09-12 14:16:22 -0400
> message:
>   MDEV-6593 : domain_id based replication filters
>
>   Added support for IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS.

Here is my review.

Some general comments, followed by some detailed comments in-line in the
patch:

1. Did you prepare any documentation? You could either add it to the
knowledgebase directly, or write up a text file and ask Daniel to add it in
relevant places.

[NC] I haven't but I will prepare it and pass it down to Daniel.
 

2. Did you run the full test suite? I would have expected at least some tests
in the multi_source suite to need modification. The chance touches enough
things that you should test it in a feature tree in Buildbot to ensure that it
is green before pushing it to a main tree.

[NC] You are right. I did that now and updated the result files.


3. I think some extra test cases are needed:

 - You should test (and document) what happens when the same domain_id is in
   both the DO and the IGNORE list. Is it actually ever sensible to have both
   DO_DOMAIN_IDS and IGNORE_DOMAIN_IDS ? If not, I think it should be an error
   to try.

[NC] You are right, setting both DO_ and IGNORE_ at the same time would not make any sense.
But I actually followed the pattern of the existing replication, where both _do_ and _ignore_
filters are allowed to hold values at the same time and _do_ is preferred in case both are set.


 - You should also add a test using non-GTID mode.

[NC] Done.
 

 - You should also add a test where parallel replication is used
   (--slave-parallel-threads > 0).

[NC] Done.
 

 - Add a test for the following scenario when using GTID mode: Some domain id
   is ignored, some events are received from the master and ignored. Then the
   slave is stopped, and the ignore rule is removed. What happens when the
   slave is restarted and reconnects? I think the previously ignored domain id
   events will now be re-fetched and replicated. I guess that's ok, but it
   should be documented. Also test what happens when binlog files on the
   master have been rotated and purged, so that the events are no longer
   available on the master (an error should occur).

[NC] As the filtering takes place in the IO thread. I have added some test cases covering
various scenarios including ones where IO thread crashes while reading COMMIT/XID event.
In this scenario if the filtering is ON the GTID is added to the ignored list (ign_gtids) on GTID
event, so when the crash happens before COMMIT/XID event, that particular GTID gets ignored
after the slave reconnects.


 - Add a similar test for when not using GTID mode, the last event replicated
   by the slave is ignored, the slave is stopped, the ignore rule removed and
   the slave restarted; will the last event now be re-fetched or not (I think
   not).

[NC] Done.
 

 - Add a test that checks that seconds_behind_master is properly updated to
   zero even when the last event is ignored due to
   DO_DOMAIN_IDS/IGNORE_DOMAIN_IDS. It can perhaps be a bit tricky to test,
   but I think you can use this existing test as a basis:
   mysql-test/suite/rpl/t/rpl_parallel2.test

[NC] Done.
 

 - It would also be good to check that the code works when used on an old
   master (which does not send GTID events). You can do this by creating a
   binlog file with a 5.5 server and storing it in mysql-test/std_data/, and
   then installing it in a master; there should be several other test cases
   that already do this.

[NC] Done.
 

4. Also see detailed comments for some possible problems with the
implementation. The most serious is probably to ensure that events are not
skipped after the end of the group, we need a couple of tests for this, see
below. Also, there is a small memory leak, see below.

> +DO_DOMAIN_IDS (before)     : 1
> +IGNORE_DOMAIN_IDS (before) : 2
> +CHANGE MASTER TO DO_DOMAIN_IDS=(4,4,5,1,7,7,7,1,1,2,6,8,1,4,5,5,9,3), MASTER_USE_GTID=current_pos;

So if a domain_id is specified multiple times, the redundant ones will be
ignored.

[NC] Right. Also, there was a bug in the current IGNORE_SERVER_IDS implementation. Its fixed now.


It would seem preferable to give an error... but as far as I can see the
behaviour is the same for IGNORE_SERVER_IDS, so better to be consistent as you
did in the code. Just be sure to add this to the documentation.

[NC] Ok.
 


> === added file 'mysql-test/suite/rpl/t/rpl_domain_id_filter.test'
> --- a/mysql-test/suite/rpl/t/rpl_domain_id_filter.test        1970-01-01 00:00:00 +0000
> +++ b/mysql-test/suite/rpl/t/rpl_domain_id_filter.test        2014-09-12 18:16:22 +0000

> +SET @gtid_domain_id_saved = @@session.gtid_domain_id;

This is not needed, nor do you use it anywhere, you can remove it.

[NC] was a leftover, removed.
 

> +##### Case 3: When both DO_DOMAIN_IDS and IGNORE_DOMAIN_IDS are non-empty
> +
> +source include/stop_slave.inc;
> +let $do_domain_ids_before= query_get_value(SHOW SLAVE STATUS, Replicate_Do_Domain_Ids, 1);
> +let $ignore_domain_ids_before= query_get_value(SHOW SLAVE STATUS, Replicate_Ignore_Domain_Ids, 1);
> +--echo DO_DOMAIN_IDS (before)     : $do_domain_ids_before
> +--echo IGNORE_DOMAIN_IDS (before) : $ignore_domain_ids_before
> +
> +# Replicate events belonging to "domain_id 1". (DO_DOMAIN_IDS should win!)
> +CHANGE MASTER TO DO_DOMAIN_IDS=(1), IGNORE_DOMAIN_IDS=(2), MASTER_USE_GTID=current_pos;

Did you intend here to test overlapping ids in DO_DOMAIN_IDS and
IGNORE_DOMAIN_IDS (due to the comment "DO_DOMAIN_IDS should win") ?

[NC] No, its about the DO_ and IGNORE_ both being non-empty. In that case DO_ should be preferred. 


But if I understand correctly, IGNORE_DOMAIN_IDS will be ignored if
DO_DOMAIN_IDS is non-empty.

[NC] Right.
 
So I think it should be an error to try to set
both - do you agree?

[NC] Following the current pattern, I think it should DO_ should be preferred. But I can change it fail with error
if you would insist.
 

> +--source include/rpl_end.inc

Please add additional testcases, as suggested at the start of review.


> === modified file 'sql/rpl_mi.cc'
> --- a/sql/rpl_mi.cc   2014-02-19 10:05:15 +0000
> +++ b/sql/rpl_mi.cc   2014-09-12 18:16:22 +0000
> @@ -513,6 +513,25 @@
>              else
>                mi->using_gtid= Master_info::USE_GTID_NO;
>            }
> +
> +          // DO_DOMAIN_IDS
> +          if (0 == strncmp(buf, STRING_WITH_LEN("do_domain_ids=")) &&
> +              mi->domain_id_filter.init_ids(buf + sizeof("do_domain_ids"),
> +                                            Domain_id_filter::DO_DOMAIN_IDS))
> +          {
> +            sql_print_error("Failed to initialize master info do_domain_ids");
> +            goto errwithmsg;
> +          }
> +
> +          // IGNORE_DOMAIN_IDS
> +          if (0 == strncmp(buf, STRING_WITH_LEN("ignore_domain_ids=")) &&
> +              mi->domain_id_filter.init_ids(buf + sizeof("ignore_domain_ids"),
> +                                            Domain_id_filter::IGNORE_DOMAIN_IDS))
> +          {
> +            sql_print_error("Failed to initialize master info "
> +                            "ignore_domain_ids");
> +            goto errwithmsg;
> +          }

So here you use the same style as adjacent code (that I wrote). However, I
have since learned that the style (0 == strncmp(...)) is not used in the
server. So please change this, as well as my existing code, to use either
strncmp(...) == 0 or !strncmp(), as you prefer.

[NC] Done.
 

Another thing from my old code is the sizeof("do_domain_ids"), which highly
confused me for a bit before I realised that while the '=' is not included in
the sizeof(), the terminating '\0' _is_ included. So since even I myself is
confused by this code, better clarify it ;-) Perhaps it would be clearer if
changed to (sizeof("do_domain_ids=")-1) (both in my old code and in your new
code).

(Normally it's good to follow the style of existing code as you did, I just
noticed this in the review and realised that my own "existing code" would
benefit from these small improvements).

[NC] I did a bit of refactoring. Let me know if it looks ok.



> @@ -634,6 +653,16 @@
>      }
>    }
>
> +  char* do_domain_ids_buf=
> +    mi->domain_id_filter.as_string(Domain_id_filter::DO_DOMAIN_IDS);
> +  if (do_domain_ids_buf == NULL)
> +    DBUG_RETURN(1);
> +
> +  char* ignore_domain_ids_buf=
> +    mi->domain_id_filter.as_string(Domain_id_filter::IGNORE_DOMAIN_IDS);
> +  if (ignore_domain_ids_buf == NULL)
> +    DBUG_RETURN(1);

You have a memory leak here, do_domain_ids_buf will not be freed if
ignore_domain_ids_buf fails due to out-of-memory.

[NC] Fixed.
 

> @@ -1348,4 +1382,218 @@
>    DBUG_RETURN(result);
>  }
>
> +/* ctor */
> +Domain_id_filter::Domain_id_filter() : m_filter(false)
> +{
> +  for (int i= DO_DOMAIN_IDS; i <= IGNORE_DOMAIN_IDS; i ++)
> +  {
> +    my_init_dynamic_array(&m_domain_ids[i], sizeof(uint32), 16, 16, MYF(0));
> +  }
> +}
> +
> +/* dtor */
> +Domain_id_filter::~Domain_id_filter()

I don't think comments like /* ctor */ and /* dtor */ are useful, better just
omit them.

[NC] Done.
 


> +/**
> +  Update the domain id list of the given type with elements from the new list.
> +
> +  @param new_ids [IN]               new list of ids
> +  @param type    [IN]               domain id list type to update
> +  @param changed [IN]               has the domain id list changed in the
> +                                    last CHANGE MASTER command?
> +
> +  @retval                           void
> +*/
> +void Domain_id_filter::update_ids(DYNAMIC_ARRAY *new_ids, enum_list_type type,
> +                                  bool changed)
> +{
> +  DYNAMIC_ARRAY *ids= &m_domain_ids[type];
> +
> +  if (changed == true)
> +    reset_dynamic(ids);
> +
> +  /* bsearch requires an ordered list. */
> +  sort_dynamic(new_ids, change_master_domain_id_cmp);
> +
> +  for (uint i= 0; i < new_ids->elements; i++)
> +  {
> +    uint32 domain_id;
> +    get_dynamic(new_ids, (void *) &domain_id, i);
> +
> +    if (bsearch((const uint32 *) &domain_id, ids->buffer, ids->elements,
> +                sizeof(uint32), change_master_domain_id_cmp) == NULL)
> +    {
> +      insert_dynamic(ids, (uint32*) &domain_id);
> +    }
> +  }
> +  return;
> +}

Ok, so I understand that this code is modelled after the existing code for
IGNORED_SERVER_IDS. But I found it rather confusing, because it looks like
there is a bug: if changed==false, then the code does not ensure that the
resulting array will still be sorted.

I guess the bug is not possible to trigger, because the list of stuff to add
will only be non-empty if changed==true? But that is rather confusing. Please
simplify the logic; maybe you only need to call this function when
changed==true, so the parameter can be omitted and the array always reset?

[NC] Right, in case of domain_ids the list should only be reset/updated when
it has changed. I have remove the extra parameter.


Also, consider if you can rewrite the existing code for IGNORE_SERVER_IDS so
that the same common code can be used as for IGNORE_DOMAIN_IDS and
DO_DOMAIN_IDS, that would simplify the code and reduce risk of bugs or
inconsistencies.

[NC] Done.
 


> +/**
> +  Initialize the given domain id list with the numbers from the specified
> +  buffer.
> +
> +  @param buf  [IN]                  buffer
> +  @param type [IN]                  domain id list type
> +
> +  @retval false                     Success
> +          true                      Error
> +*/
> +bool Domain_id_filter::init_ids(char *buf, enum_list_type type)
> +{
> +  char *token, *last;
> +  uint num_items;
> +  DYNAMIC_ARRAY *ids= &m_domain_ids[type];
> +
> +  token= strtok_r(buf, " ", &last);
> +  if (token == NULL)
> +  {
> +    return true;                                /* error */
> +  }
> +
> +  num_items= atoi(token);
> +  for (uint i= 0; i < num_items; i++)
> +  {
> +    token= strtok_r(NULL, " ", &last);
> +    if (token == NULL)
> +    {
> +      return true;                              /* error */
> +    }
> +    else
> +    {
> +      uint32 val= atoi(token);
> +      insert_dynamic(ids, (uchar *) &val);
> +    }
> +  }
> +  return false;
> +}

Again, did you consider using the same code for this and for
IGNORE_SERVER_IDS? (I think you did, because you added a comment to
init_dynarray_intvar_from_file ;-). Maybe consider it again; it is fine to
rewrite/remove the existing code for IGNORE_SERVER_IDS and replace it with
your new code. Again, it would reduce risk for bugs or inconsistencies.


[NC] True, done.
 

> +  /* getter */
> +  bool get_filter() { return m_filter; }
> +
> +  /* setter */
> +  void set_filter(uint32 domain_id);

These are not really getter and setters, I think, at least I find those names
confusing.

Maybe something like do_filter(uint32 domain_id) and is_group_filtered() ?
And expand the comments, so that they explain that these functions
respectively check a GTID for needing to be filtered, and return whether the
current group is being filtered.

[NC] Done.
 

Also, personally I dislike setter/getter, and would prefer just to access the
m_filter as a public member. But I suppose that's style and up to you.

[NC] I have kept it private for now to avoid accidental modification.



> +  /*
> +    Initialize the given domain id list with the numbers from the specified
> +    buffer.
> +
> +    @param buf  [IN]                  buffer
> +    @param type [IN]                  domain id list type
> +
> +    @retval false                     Success
> +            true                      Error
> +  */
> +  bool init_ids(char *buf, enum_list_type type);

Clarify in the comment that the list is initialized from a string of
space-separated numbers, the first of which is the number of following
entries.

[NC] Done.
 


> @@ -182,6 +279,7 @@
>                        bool flush_relay_log_cache,
>                        bool need_lock_relay_log);
>  int change_master_server_id_cmp(ulong *id1, ulong *id2);
> +int change_master_domain_id_cmp(uint32 *id1, uint32 *id2);

I think change_master_domain_id_cmp() is only used in rpl_mi.cc, so should be
static there, and not refered here in rpl_mi.h

[NC] Done. Also renamed it to change_master_ids_cmp to be used by IGNORE_SERVER_IDS impl as well.
 


> === modified file 'sql/slave.cc'
> --- a/sql/slave.cc    2014-08-12 03:55:41 +0000
> +++ b/sql/slave.cc    2014-09-12 18:16:22 +0000
> @@ -1259,6 +1259,7 @@
>     going to ignore (to not log them in the relay log).
>     Items being read are supposed to be decimal output of values of a
>     type shorter or equal of @c long and separated by the single space.
> +   It also used to restore DO_DOMAIN_IDS & IGNORE_DOMAIN_IDS lists.

This comment suggests you wanted to share code, then later changed your
mind. If you end up keeping the code separate, you still need to delete this
comment. (And there is a typo: "It also" -> "It is also", but if you remove it
anyway...).


[NC] I have modified the patch to use this function now.
 

> @@ -5601,6 +5610,10 @@
>      mi->last_queued_gtid= event_gtid;
>      mi->last_queued_gtid_standalone=
>        (gtid_flag & Gtid_log_event::FL_STANDALONE) != 0;
> +
> +    /* Should filter all the subsequent events in the current GTID group? */
> +    mi->domain_id_filter.set_filter(event_gtid.domain_id);

So, here we will set m_filter to true, if we want to skip the following group.

But I don't see any code that will set it to false at the end of the group? 

[NC] With the previous logic m_filter was set/reset only at GTID event. I have now added the logic
to reset this flag on COMMIT/XID events too.


That seems a problem, as we could then end up wrongly skipping all kinds of
other events, such as Gtid_list, Incident, ...

[NC] Right. I have added the following events types in the exception list :
FORMAT_DESCRIPTION_EVENT, ROTATE_EVENT, GTID_LIST_EVENT,
HEARTBEAT_LOG_EVENT and INCIDENT_EVENT.


Also note that it is possible for a GTID group to not be terminated by
COMMIT/XID. This can happen at least if the master crashes in the middle of
writing a group; the slave will receive only the first part of it, followed by
a master restart Format_description event. That case needs to be handled, and
you should add a test case for it.

[NC] Added a test to cover this scenario.
 

Also it is possible for eg. a master downgrade to introduce following event
groups that do not start with GTID - seems these would also be wrongly
skipped? So we also need a test case for this...

[NC] Right, its a valid case, but I couldn't find a way to repeat this scenario in mtr. Suggestions?



Hope this helps,

Thanks!

-- Nirbhay
 

 - Kristian.