Re: [Maria-developers] GTID: clueless user experience, day 1
[Cc: maria-developers@, in case others are interested or have comments] Elena Stepanova <elenst@montyprogram.com> writes:
So if you agree, I will put it on my ToDo list to remove the statement that an omitted domain_id defaults to 0.
It's fine with me.
Pushed to lp:~maria-captains/maria/10.0-mdev26/
Clearly I need to give a better error message; this is highly confusing. Thanks for bringing this to my attention! What do you think of:
ERROR XXXX: Duplicate domain_id 0 for GTID 0-2-1
I made it like this:

MariaDB [test]> change master to master_gtid_pos="0-1-2,1-1000-1000,2-3-1,0-2-100,4-4-4";
ERROR 1941 (HY000): GTID 0-2-100 and 0-1-2 conflict (duplicate domain id 0)
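To make the check concrete, here is a minimal Python sketch (not the server's actual C++ code) of parsing a comma-separated GTID position and rejecting two GTIDs that share a domain ID. The function names are hypothetical; only the message wording is taken from the ERROR 1941 output above.

```python
from typing import Dict, Tuple

def parse_gtid(text: str) -> Tuple[int, int, int]:
    """Split a 'domain-server-seq' GTID string into three integers."""
    domain, server, seq = (int(part) for part in text.split("-"))
    return domain, server, seq

def parse_gtid_pos(pos: str) -> Dict[int, str]:
    """Parse a comma-separated GTID position into {domain_id: gtid_string}.

    Raises ValueError when two GTIDs use the same domain ID, since a
    position may hold at most one GTID per domain.
    """
    by_domain: Dict[int, str] = {}
    for item in pos.split(","):
        item = item.strip()
        domain, _server, _seq = parse_gtid(item)
        if domain in by_domain:
            raise ValueError(
                f"GTID {item} and {by_domain[domain]} conflict "
                f"(duplicate domain id {domain})"
            )
        by_domain[domain] = item
    return by_domain
```

Reporting both conflicting GTIDs, rather than just the domain ID, tells the user which two entries of a long list to look at.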
Sounds good. Only, if I understand correctly, it is now not domain_id but gtid_domain_id? Then it had better be named as such in the error message (/me tried in vain to set domain_id, first on the server command line and then in a session).
Well, these are different but related concepts. @@gtid_domain_id is a system variable. A "domain ID" is a concept used when talking about global transaction IDs. It is like the InnoDB buffer pool: the system variable is "innodb_buffer_pool_size", while the concept is "buffer pool size". But I agree the concept should be written "domain ID", not domain_id (with an underscore).

The reason I spell this out is not to nitpick, but because I would welcome suggestions for naming; I am terrible with naming. Do you think I should rename @@gtid_domain_id and @@gtid_seq_no to just @@domain_id and @@seq_no? I thought those names were perhaps a bit too generic, though I did keep plain @@server_id for backwards compatibility. Another option is @@binlog_domain_id and @@binlog_seq_no; is that perhaps better than the somewhat arbitrary "gtid" prefix? The variables are for controlling what goes into the binlog, after all.
But on the other hand, why should it be several rows? If binlog_gtid_pos() returns a *single* comma-separated list of GTIDs for the given position, as it does now, we can just as well add this list to SHOW MASTER STATUS as a *column*, no?
Sorry, it was me who was confused. I meant adding extra columns, not rows. I am still not sure if it is safe to add extra columns. It would seem useful, unless it breaks too many existing applications.
Start after. The GTID is that of the last event executed.
Hm... I must admit I don't quite understand the reasoning. I'll ask
Let me give an example. Suppose server 1 is the master. It binlogs an event with GTID 0-1-10. This event is currently the last one in the binlog. Server 2 is a slave of server 1. When it catches up, GTID 0-1-10 is the last event applied on the slave. So what should the replication state of slave server 2 be now? If we wanted to use "start from", we could try GTID 0-1-11. But suppose we now switch the master to a new server 3. Then the event following 0-1-10 is 0-2-11, not 0-1-11. So if server 2 tried to connect as a slave to server 3 and start from 0-1-11, it would not work: that event does not exist anywhere. That is why it has to be "start after": when we save the state on the slave, we know only the GTID of the last event applied, not the one to apply next.
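A toy Python model of that example, assuming a single domain and treating a binlog as a plain list of GTID strings (both simplifications; the helper names are hypothetical). Looking up the event *after* the last applied GTID works on server 3, while guessing the next GTID for "start from" finds nothing.

```python
from typing import List

def events_after(binlog: List[str], last_applied: str) -> List[str]:
    """'Start after' semantics: everything logged after the last applied GTID."""
    if last_applied not in binlog:
        raise LookupError(f"GTID {last_applied} not found in master binlog")
    return binlog[binlog.index(last_applied) + 1:]

def naive_next_gtid(gtid: str) -> str:
    """The 'start from' guess: same domain and server ID, seq_no + 1."""
    domain, server, seq = gtid.split("-")
    return f"{domain}-{server}-{int(seq) + 1}"

# Server 3's binlog as in the example: the event following 0-1-10 is 0-2-11.
# (0-1-9 is just an illustrative earlier event.)
server3_binlog = ["0-1-9", "0-1-10", "0-2-11"]
last_applied = "0-1-10"  # what server 2 saved when it caught up with server 1

print(events_after(server3_binlog, last_applied))       # ['0-2-11']
print(naive_next_gtid(last_applied) in server3_binlog)  # False: 0-1-11 never existed
```

The slave only ever has to remember something it has actually seen (the last applied GTID), which is why the saved state stays valid across a master switch.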
There might be more in there for me to look into...
Indeed. This is a critical part of the design. I put a _lot_ of thought into getting this working, and I think I got it right, but I will feel a lot better knowing you made your best efforts to break it :-). But please wait with this until a later stage; I have some planned development around this that is not ready yet.
Gtid_Pos_Auto=1 means the new method, whether AUTO or an explicit 'xxx' GTID. So what should we call it? How about just "Using_Gtid"?
Yep, Using_Gtid sounds fine.
This is fixed in the tree now.
This brings a more conceptual question for me... Does Using_Gtid indicate how the slave *replicates*, or just how it determines the starting position while connecting to master? In other words, is there
There is a difference. There are two ways a slave can connect: the old way, using binlog file and position, and the new way, using GTID. I think we need to have both. Using_Gtid records which of the two is used. I made it this way so that users do not need to set some specific option in my.cnf, and so that with multi-source replication, one slave connection can use the old style (e.g. if the master does not support GTID) while another uses the new style. The main difference is in finding the starting position. But this also implies a difference in how the slave replicates, because with GTID there can be *multiple* starting positions (one for each domain ID). So after connecting, some events can be skipped during replication if one domain ID is ahead of another in the master binlog.
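The skipping with multiple start positions can be illustrated with a simplified Python sketch. It is a toy model under stated assumptions, not the server's real algorithm: the binlog is a list of GTID strings starting at the earliest of the slave's per-domain positions, every domain in it has an entry in the slave position, and the function names are made up for illustration.

```python
from typing import Dict, List

def domain_of(gtid: str) -> int:
    """Return the domain ID part of a 'domain-server-seq' GTID string."""
    return int(gtid.split("-")[0])

def events_to_apply(binlog: List[str], slave_pos: Dict[int, str]) -> List[str]:
    """Return the events a slave should apply, given one start position per domain.

    `slave_pos` maps domain_id -> last applied GTID in that domain.
    Assumes (simplification) every domain in `binlog` appears in `slave_pos`.
    """
    waiting = dict(slave_pos)  # domains whose start GTID has not been reached yet
    to_apply = []
    for gtid in binlog:
        domain = domain_of(gtid)
        if domain in waiting:
            if gtid == waiting[domain]:
                del waiting[domain]  # apply this domain from the next event on
            continue  # already applied on the slave: skip it
        to_apply.append(gtid)
    return to_apply

# Domain 0 is ahead of domain 1 on the slave, so 0-1-5 and 0-1-6 are skipped
# even though they come after the domain 1 start position in the binlog.
binlog = ["1-1-3", "0-1-5", "0-1-6", "1-1-4", "0-1-7"]
print(events_to_apply(binlog, {0: "0-1-6", 1: "1-1-3"}))  # ['1-1-4', '0-1-7']
```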
If not (and I couldn't yet figure out the difference since, as you said, the slave reads the GTID state anyway when it connects with file name/offset), then I start having doubts that we need the Using_Gtid flag at all. I mean, if it only indicates what I provided last time in CHANGE MASTER, how is this information useful?
It is used when the slave connects (which happens at every START SLAVE, and every server restart, including mysqld_safe autostart). If Using_Gtid is false, then slave connects at the last file/offset found in relay-log.info. It also obtains the corresponding GTID position from the master, so that it is ready in case the user wants to switch to CHANGE MASTER TO ... MASTER_GTID_POS=AUTO. If Using_Gtid is true, slave connects at the GTID position read from the mysql.rpl_slave_state table, which is crash safe. And multiple start positions are used if there is more than one domain id in the GTID position. So you see, the information is needed, otherwise the slave and the master would not know how to handle slave connect.
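The connect-time decision described above can be sketched as a small Python function. Everything here is hypothetical and simplified: the class, its fields and the returned dictionaries only model the two cases in the text (file/offset from relay-log.info versus the crash-safe GTID position from mysql.rpl_slave_state); the real server code is C++ and does much more.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class SlaveConnection:
    """Hypothetical per-connection slave state (one per master in multi-source)."""
    using_gtid: bool
    # Last (master_log_file, master_log_pos), as it would be read from relay-log.info.
    relay_log_info: Tuple[str, int]
    # domain_id -> last applied GTID, as it would be read from mysql.rpl_slave_state.
    rpl_slave_state: Dict[int, str] = field(default_factory=dict)

def connect_request(conn: SlaveConnection) -> dict:
    """Describe what the slave asks the master for at connect time
    (START SLAVE, server restart, ...)."""
    if conn.using_gtid:
        # New style: start after the last applied GTID in each domain,
        # so there may be several start positions.
        return {"mode": "gtid", "start_after": dict(conn.rpl_slave_state)}
    # Old style: start at the saved file/offset, but also ask the master for the
    # matching GTID position so that switching to MASTER_GTID_POS=AUTO later works.
    log_file, log_pos = conn.relay_log_info
    return {"mode": "file_pos", "file": log_file, "pos": log_pos,
            "fetch_gtid_pos": True}
```

Because the flag is stored per connection, two connections in a multi-source setup can take different branches.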
But frankly, I am not sure if the right place is in SHOW ALL SLAVES STATUS or in plain SHOW SLAVE STATUS. Can you help me understand what the difference is between putting the column in one or the other of these?
I guess the only real reason was that Monty didn't want to change the output format of the existing SHOW SLAVE STATUS, so as not to break anything.
Yes, this is an important issue. I will discuss it with Monty when I visit him next week.
Ideally, I'd expect the global values to be shown both in SHOW SLAVE STATUS (e.g. at the end of it) and once for all slaves in SHOW ALL SLAVES STATUS, again at the end. But I don't know if it's doable, and if it is, we need to think about what it might break.
Yes. The code actually seems to have been written for something like SHOW FULL SLAVE STATUS, which would include the extra data. I will need to ask Monty to know more.
Then I assumed you decided not to print current GTIDs in SHOW SLAVE STATUS on purpose, to avoid the kind of mess MySQL has, which was also surprising since we wouldn't have that insanity anyway. I didn't ask how I could obtain them on the slave, as I figured it was supposed to be done via mysql.rpl_slave_state. But it didn't occur to me that they were printed in SHOW ALL SLAVES STATUS...
Yes. In any case, the user interface is very much not finalised yet. Sorry for the rough experience, but hopefully we can use your experience to come up with a good interface for the final release.
Maybe the problem is that if I add a column in SHOW SLAVE STATUS, and MySQL adds another column in the same position, then scripts will break because the column count for one or the other will change?
Eh... MySQL's SHOW SLAVE STATUS currently has something like 13 more columns than ours; I really don't think scripts relying on that level of detail stand any chance of surviving.
My main worry is not if one has more than the other. The worry is if some column is in position 25 (say) in MySQL but 28 (say) in MariaDB. Is there even any easy way to pick out a column by name, rather than numerical position?
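On the client side, one common way to avoid depending on column positions is to look columns up by name using the result-set metadata. A minimal Python sketch, assuming a client library that follows the Python DB-API (PEP 249), where cursor.description has one entry per column with the column name first; the helper name is made up for illustration.

```python
from typing import Any, Dict, Sequence

def row_as_dict(description: Sequence[Sequence[Any]], row: Sequence[Any]) -> Dict[str, Any]:
    """Map column names to values for one result row.

    `description` is a PEP 249 cursor.description: one 7-item sequence per
    column, whose first item is the column name.
    """
    return {column[0]: value for column, value in zip(description, row)}

# Typical use with any PEP 249 connector (connection setup omitted):
#   cur.execute("SHOW SLAVE STATUS")
#   status = row_as_dict(cur.description, cur.fetchone())
#   print(status["Slave_IO_Running"])
```

A script written this way keeps working whether a given column sits at position 25 in one server and 28 in another, as long as the column name is the same.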
Good thing you mentioned it; I actually totally forgot to tell the tale of my *very* first experience. Well, I started when I mentioned
Then I followed the instruction again, and... it didn't work. It didn't crash, replication didn't fail, but when I switched master and slave, the new slave (old master) just wouldn't replicate the last line, (6,3).
I couldn't understand it until I compared the flow and configuration with the MTR test that you have (it does basically the same thing, and it works), and only then did I figure out that I have to have log-slave-updates=on. It wasn't clear to me why; I mean, I had the transaction in question in the binlog, and the GTID was there too, so...? That's when I switched from AUTO to explicit values and started experimenting with them.
Well, in my example, first server 1 is the master, then I switch around so that server 2 is the master. For a slave to also be able to serve as master, it needs --log-slave-updates=1, right? On the other hand, the design specifically allows for --log-slave-updates=0. I did fix one or two possibly related bugs lately, so please try again and if you can reproduce the problem I will look into it. (I was a bit confused how you could have the transaction in question in the binlog on the server that was changed from slave to master, if it did not have --log-slave-updates=1 ...)
But nothing happens... No errors in the slave, but it isn't moving, either:
This looks like a bug; your commands look correct. I could reproduce it, and I will investigate, fix it ASAP and let you know.
This should be fixed now and pushed to the tree.
Getting the wrong mutex usage warnings. I saw a discussion on IRC the other day, but I don't know the current status: whether it's supposed to be fixed, not fixed yet, or considered innocent...
This also should be fixed in the tree now. - Kristian. (who now has to run from the computer and hopes the above makes sense...)
Kristian Nielsen