Rohit Kalhans <rohit.kalhans@gmail.com> writes:
As mentioned earlier i have a patch for MDEV 5873 and i am attaching here in the bundle of the patch. I have run tests on this patch and all of them have passed.
Kindly review the patch. Also i am not sure which is the suitable mailing list for sending patches, so it will be great if you can point me to the same.
Thanks for the patch! It's fine to send it to maria-developers@lists.launchpad.net, as you did. Here is my review: I see two main problems with this patch: - You moved the element::list from the upper level hash (on just domain_id) to the lower-level (domain_id, server_id). This seems not correct, as list is associated with the mysql.gtid_slave_pos table, which only holds data about one GTID per domain. - The patch relies on the master to send GTID_LIST events with the full binlog state - sequence number for each (domain_id,server_id) tuple. But I do not see anything in your patch that makes sure that those values are updated into the rpl_slave_state on the slave. Did I miss it? As I can see, Gtid_list_log_event::do_apply_event() only does such updates for events with the FLAG_IGN_GTIDS flag set; and these are not generated on the master, they are generated on the slave for a different purpose, related to slave event filters. You don't seem to have any test case that actually tests that this works, either... I have some detailed comments on the patch below, mostly related to the first of these points. More generally, I think we need to become more clear about the exact semantics of MASTER_GTID_WAIT(). I tried to think through the problem; here are my thoughts, let me know what you think: First, MASTER_GTID_WAIT() waits for a specific _position_ in the master's binlog to be reached. It does not wait for some GTID to be applied on the slave. So here is what MASTER_GTID_WAIT("D-S1-M") should do: 1. Look up D in gtid_slave_state, say "D-S2-N" 2. Determine whether D-S1-M comes before or after D-S2-N in the binlog order. Second, MariaDB GTID has two different modes, called "strict" and "non-strict". Strict mode is the recommended. In strict mode, we assume (and enforce as well as possible) that binlog order coincides with sequence number order. This makes step (2) as simple as comparing M and N. In non-strict mode we avoid assuming that sequence number order is identical to binlog order. This is necessary to not break backwards compatibility (since GTID is on by default after an upgrade to 10.0). It is also useful to be able to deal with mistakes that cause out-of-order binlogs between servers, something that is not possible to prevent 100% given the nature of asynchronous replication. In non-strict mode, step (2) requires looking at the master's binlog to decide the order of D-S1-M and D-S2-N (if S1 != S2). Out-of-order binlogs creates a lot of complexity in various corner cases. This is seen as undesirable, especially by high-end users that employ a lot of discipline in how they manage replication. Thus the main design goal is that in strict mode, things should be simple like if out-of-order sequence numbers were impossible. The secondary goal is that in non-strict mode, we try to deal with things in a reasonable way as well as possible, but accept that some semantics become more complex. The complexity in case of MASTER_GTID_WAIT() is mainly what happens if the slave is not connected to the master. In this case we will have missing or incomplete information about binlog order, so may make incorrect or inconsistent decisions related to MASTER_GTID_WAIT(). But this complexity is avoided if we can assume that binlog order equals sequence number order. So this is my suggestion: - In strict mode, MASTER_GTID_WAIT() only compares the sequence numbers, like the code before your patch. - In non-strict mode, MASTER_GTID_WAIT() works per (domain_id,server_id), like in your patch. - In non-strict mode, we give a warning if the slave specified in @@default_master_connection is not running. This achieves the goal that in strict mode, things work simple like one would expect, and in non-strict mode, we are able to deal reasonably with most cases even if we have out-of-order sequence numbers. The one thing I do not like about this is the following situation: Binlog on master contains 0-1-1,0-1-2,0-2-3. Slave has position 0-2-3. We do MASTER_GTID_WAIT("0-1-2"). If the slave is stopped, and gtid strict mode is off, then this will hang until the slave is started. It is not perfect. But I do not see much that can be done about it. At least the user got a warning, and they can use the recommended strict mode to avoid the issue. For the implementation of this, I think there could be one element per (domain_id,server_id) and one extra "global" element just for the (domain_id). When we update the slave state, we update both the element for (domain_id,server_id) and the one for (domain_id). When we execute MASTER_GTID_WAIT(), we wait for the (domain_id) element in gtid strict mode, and for the (domain_id,server_id) element in non-strict mode. What do you think? Any better ideas? Detailed comments:
@@ -112,11 +112,18 @@ };
/* Elements in the HASH that hold the state for one domain_id. */ + struct dom_element + { + uint32 domain_id; + HASH hash; + }; + + /* Elements in the HASH that hold the state for one server_id. */ struct element { struct list_element *list; - uint32 domain_id; - /* Highest seq_no seen so far in this domain. */ + uint32 server_id; + /* Highest seq_no seen so far in this (domain,server_id). */ uint64 highest_seq_no; /* If this is non-NULL, then it is the waiter responsible for the small
I think you need to keep also the list_element *list in the dom_element. Otherwise it subtly changes the semantics of how the mysql.gtid_slave_state table is managed.
+ +# MDEV-5873 +# We use a gtid with the same domain and seq no as one of the trx already +# applied but with an impossible server id. +# This should return with a timeout (returns -1) +SELECT master_gtid_wait("1-99-1", 2); + # Now actually wait until the slave reaches the position send SELECT master_gtid_wait(@pos);
I think we need a bit more testing than this. Here are some suggestions: Create a binlog with something like 0-1-2,0-1-3,0-2-2. Setup two waiters, one MASTER_GTID_WAIT("0-1-2",100), one MASTER_GTID_WAIT("0-2-2",1). Do START SLAVE UNTIL MASTER_GTID_POS="0-1-3". Check that the first wait completes and the second times out. With binlog 0-1-1,0-1-2,0-2-3, let the slave have gtid_slave_pos=0-2-3. Restart the slave server (with --skip-slave-start), check that MASTER_GTID_WAIT("0-1-2") times out in non-strict mode (with a warning), but completes in gtid strict mode. Start the slave, check that MASTER_GTID_WAIT("0-1-2") now completes in both strict and non-strict mode. Stop the slave, check that MASTER_GTID_WAIT("0-1-2") still completes immediately in both modes (because now we have the master's binlog state still in-memory). Manually SET GLOBAL gtid_slave_pos="0-2-3", check that this clears the whole state so that now MASTER_GTID_WAIT("0-1-2") again times out in strict mode.
@@ -112,17 +121,16 @@ inited= true; }
- -void -rpl_slave_state::truncate_hash() +void truncate_server_id_hash(HASH *hash) { uint32 i;
- for (i= 0; i < hash.records; ++i) + for (i= 0; i < hash->records; ++i) { - element *e= (element *)my_hash_element(&hash, i); - list_element *l= e->list; - list_element *next; + rpl_slave_state::element *e= (rpl_slave_state::element *) + my_hash_element(hash, i); + rpl_slave_state::list_element *l= e->list; + rpl_slave_state::list_element *next; while (l) { next= l->next; @@ -131,6 +139,18 @@ } /* The element itself is freed by the hash element free function. */ } + my_hash_reset(hash); +} + +void +rpl_slave_state::truncate_hash() +{ + uint32 i; + for (i= 0; i < hash.records; ++i) + { + dom_element *de= (dom_element *)my_hash_element(&hash, i); + truncate_server_id_hash(&de->hash); + } my_hash_reset(&hash);
If you keep the list in the upper (domain) element, you don't need to change this function.
@@ -152,7 +172,7 @@ element *elem= NULL; list_element *list_elem= NULL;
- if (!(elem= get_element(domain_id))) + if (!(elem= get_element(domain_id, server_id)))
This should still be get_element(domain_id), well you can rename it to get_dom_element() if you like to match the new name struct dom_element. The list update will still work on the upper dom_element, and then to handle the MASTER_GTID_WAIT() update you need to do an update in the inner hash.
@@ -185,35 +205,67 @@
struct rpl_slave_state::element * -rpl_slave_state::get_element(uint32 domain_id) +rpl_slave_state::get_element(uint32 domain_id, uint32 server_id) { - struct element *elem; + struct element *elem= 0; + struct dom_element *d_elem= 0; + bool domain_exists= false;// assuming domain does not exist already
- elem= (element *)my_hash_search(&hash, (const uchar *)&domain_id, 0); + d_elem= (dom_element *)my_hash_search(&hash, (const uchar *)&domain_id, 0); + if (d_elem) + { + domain_exists= true; // domain exists + elem= (element *)my_hash_search(&d_elem->hash, (const uchar *)&server_id, 0); + } if (elem) return elem; - - if (!(elem= (element *)my_malloc(sizeof(*elem), MYF(MY_WME)))) + // creare new domain element. + if (!domain_exists) + { + if (!(d_elem= (dom_element *)my_malloc(sizeof(*d_elem), MYF(MY_WME)))) + return NULL; + + my_hash_init(&d_elem->hash, &my_charset_bin, 32, + offsetof(element, server_id), sizeof(uint32), + NULL, rpl_slave_state_free_element, HASH_UNIQUE); + // set domain id + d_elem->domain_id= domain_id; + // insert this domain in the hash. + if (my_hash_insert(&hash, (uchar *)d_elem)) + { + my_free(d_elem); + return NULL; + } + } + + if (!(elem=(element*)my_malloc(sizeof(*elem), MYF(MY_WME)))) return NULL; + + mysql_cond_init(key_COND_wait_gtid, &elem->COND_wait_gtid, 0); elem->list= NULL; - elem->domain_id= domain_id; + elem->server_id= server_id; elem->highest_seq_no= 0; elem->gtid_waiter= NULL; - mysql_cond_init(key_COND_wait_gtid, &elem->COND_wait_gtid, 0); - if (my_hash_insert(&hash, (uchar *)elem)) + + if(my_hash_insert(&d_elem->hash, (uchar *)elem)) { my_free(elem); return NULL; } + return elem; }
I think we still need a get_element() that works only on the domain, it is used in a few places (check_duplicate_gtid(), release_domain_owner(), and record_gtid()). So maybe another function that looks up in the sub (domain_id,server_id) hash. Or maybe a separate function that first calls the get_element() and then does the second lookup.
int -rpl_slave_state::put_back_list(uint32 domain_id, list_element *list) +rpl_slave_state::put_back_list(uint32 domain_id, uint32 server_id, + list_element *list) { + dom_element* de; element *e; - if (!(e= (element *)my_hash_search(&hash, (const uchar *)&domain_id, 0))) + if (!(de= (dom_element *)my_hash_search(&hash, (const uchar *)&domain_id, 0))) + return 1; + if (!(e=(element *)my_hash_search(&de->hash,(const uchar*)&server_id, 0))) return 1; while (list) {
Again, the list should be on the upper layer, so that this is not necessary.
@@ -556,7 +608,7 @@ rpl_slave_state::iterate(int (*cb)(rpl_gtid *, void *), void *data, rpl_gtid *extra_gtids, uint32 num_extra) { - uint32 i; + uint32 i, j; HASH gtid_hash; uchar *rec; rpl_gtid *gtid; @@ -573,25 +625,34 @@
for (i= 0; i < hash.records; ++i) { - uint64 best_sub_id; + uint64 best_sub_id= 0; rpl_gtid best_gtid; - element *e= (element *)my_hash_element(&hash, i); - list_element *l= e->list; - - if (!l) - continue; /* Nothing here */ - - best_gtid.domain_id= e->domain_id; - best_gtid.server_id= l->server_id; - best_gtid.seq_no= l->seq_no; - best_sub_id= l->sub_id; - while ((l= l->next)) + bool first= true; + dom_element *de= (dom_element *)my_hash_element(&hash, i); + for ( j= 0; j < de->hash.records; ++j) { - if (l->sub_id > best_sub_id) + element* e= (element *)my_hash_element(&de->hash, j); + list_element *l= e->list; + + if (!l) + continue; /* Nothing here */ + if (first) // executed once per domain. { best_sub_id= l->sub_id; + best_gtid.domain_id= de->domain_id; best_gtid.server_id= l->server_id; best_gtid.seq_no= l->seq_no; + first= false; + } + while(l) + { + if (l->sub_id > best_sub_id) + { + best_sub_id= l->sub_id; + best_gtid.server_id= l->server_id; + best_gtid.seq_no= l->seq_no; + } + l = l->next; } }
I don't think rpl_slave_state::iterate() should need to change either, if you put the list on the upper (domain) hash elements.
@@ -683,30 +744,48 @@ bool rpl_slave_state::domain_to_gtid(uint32 domain_id, rpl_gtid *out_gtid) { - element *elem; + dom_element *d_elem; + element* elem; list_element *list; uint64 best_sub_id; + uint32 i; + bool first= true;
mysql_mutex_lock(&LOCK_slave_state); - elem= (element *)my_hash_search(&hash, (const uchar *)&domain_id, 0); - if (!elem || !(list= elem->list)) + d_elem= (dom_element *)my_hash_search(&hash, (const uchar *)&domain_id, 0); + if (!d_elem) { mysql_mutex_unlock(&LOCK_slave_state); return false; } - - out_gtid->domain_id= domain_id; - out_gtid->server_id= list->server_id; - out_gtid->seq_no= list->seq_no; - best_sub_id= list->sub_id; - - while ((list= list->next)) + for (i= 0; i < d_elem->hash.records; ++i) { - if (best_sub_id > list->sub_id) - continue; - best_sub_id= list->sub_id; - out_gtid->server_id= list->server_id; - out_gtid->seq_no= list->seq_no; + elem= (element*)my_hash_element(&d_elem->hash, i); + if (!elem || !(list= elem->list)) + { + mysql_mutex_unlock(&LOCK_slave_state); + return false; + } + if (first) + { + out_gtid->domain_id= domain_id; + out_gtid->server_id= list->server_id; + out_gtid->seq_no= list->seq_no; + best_sub_id= list->sub_id; + first= false; + } + while (list) + { + if (best_sub_id > list->sub_id) + { + list= list->next; + continue; + } + best_sub_id= list->sub_id; + out_gtid->server_id= list->server_id; + out_gtid->seq_no= list->seq_no; + list= list->next; + } }
mysql_mutex_unlock(&LOCK_slave_state);
rpl_slave_state::domain_to_gtid() also should not need any changes.
@@ -836,18 +915,23 @@ bool rpl_slave_state::is_empty() { - uint32 i; + uint32 i, j; bool result= true;
mysql_mutex_lock(&LOCK_slave_state); for (i= 0; i < hash.records; ++i) { - element *e= (element *)my_hash_element(&hash, i); - if (e->list) + dom_element *de= (dom_element *)my_hash_element(&hash, i); + for (j= 0; j < de->hash.records; ++j) { - result= false; - break; + element *e= (element *)my_hash_element(&de->hash, j); + if (e->list) + { + result= false; + break; + } } + if(!result) break;
If you keep the list in dom_element, then this change becomes unnecessary. - Kristian.