[Maria-developers] MariaDB GTID first impression
Kristian,

Thanks for the docs. It's a very good start. Compared to MySQL 5.6, GTID in MariaDB has some advantages: it is enabled by default and it is easy to read.

There are a few items, though, that need some more thought:

1) rpl_slave_state is probably ill-named. In this architecture, it is only written by the slaves, and thus the name matches, but I wonder if the master should write this table (or a similar one) to keep track of its progress in the binary log. Rationale: all the info related to replication should be in tables and easy to query.

2) rpl_slave_state does not have enough metadata. What is missing here, and would be extremely useful, is the corresponding binlog+position for each GTID, a timestamp of when the transaction was saved to the binary log, and the timestamp of when it was applied. (The host name of the origin would also be useful). Also very important (when you implement parallel replication) is a shard identifier.

3) Multiple domains do not integrate well with GTID. In multi-source replication, we use "connection names", but for GTID we use "domain numbers". They do not play well together: there is no correspondence between the gtid_domain_id and the connection name, and a "SHOW ALL SLAVES STATUS" lists all the GTIDs for every connection. For example:

    mysql -e 'show all slaves status\G' | grep 'Connection_name\|Running\|Master_Host\|Master_Port\|Gtid_Pos\|Master_Log'

    Connection_name: Bear
    Master_Host: 127.0.0.1
    Master_Port: 8203
    Master_Log_File: mysql-bin.000001
    Read_Master_Log_Pos: 1020
    Relay_Master_Log_File: mysql-bin.000001
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Exec_Master_Log_Pos: 1020
    Gtid_Pos: 100-101-21,400-104-23,200-102-8,300-103-12,500-105-20

    Connection_name: Wolf
    Master_Host: 127.0.0.1
    Master_Port: 8204
    Master_Log_File: mysql-bin.000001
    Read_Master_Log_Pos: 896
    Relay_Master_Log_File: mysql-bin.000001
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Exec_Master_Log_Pos: 896
    Gtid_Pos: 100-101-21,400-104-23,200-102-8,300-103-12,500-105-20

    Connection_name: Fox
    Master_Host: 127.0.0.1
    Master_Port: 8205
    Master_Log_File: mysql-bin.000002
    Read_Master_Log_Pos: 365
    Relay_Master_Log_File: mysql-bin.000002
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Exec_Master_Log_Pos: 365
    Gtid_Pos: 100-101-21,400-104-23,200-102-8,300-103-12,500-105-20

    Connection_name: Deer
    Master_Host: 127.0.0.1
    Master_Port: 8206
    Master_Log_File: mysql-bin.000001
    Read_Master_Log_Pos: 1144
    Relay_Master_Log_File: mysql-bin.000001
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Exec_Master_Log_Pos: 1144
    Gtid_Pos: 100-101-21,400-104-23,200-102-8,300-103-12,500-105-20

Which connection is running which gtid number? Not easy to tell, as the correspondence between domain and connections is only in the DBA's head.

4) gtid_pos is text that includes all sequences from all domains. The docs say that I can modify it if needed, but in a case like the one above, modifying this string can be very error prone. For example, if server_ids are assigned using inet_aton(@IP), you could have a gtid_pos like this:

    '1001-170062692-230000,1002-170062593-12893445,1003-170062700-672009,1003-170062699-700002'

Instead of editing a string, and risking breaking replication in all domains, I would expect the availability of commands like:

    CHANGE GTID for domain 'Wolf' set gtid_pos=1900

That's it for now. Hope there is still room for improvements.

Cheers

Giuseppe
Giuseppe Maxia <g.maxia@gmail.com> writes:
Thanks for the docs. It's a very good start. Compared to MySQL 5.6, GTID in MariaDB has some advantages: it is enabled by default and it is easy to read.
There are a few items, though, that need some more thought:
1) rpl_slave_state is probably ill-named. In this architecture, it is only
Yeah, I agree. Let's find a better name. What about: mysql.rpl_gtid_state? Or maybe: mysql.gtid_pos? This matches the similar @@GLOBAL.gtid_pos variable, though the table is not a simple position; it has a more complex structure. Opinions?
written by the slaves, and thus the name matches, but I wonder if the master should write this table (or a similar one) to keep track of its progress in
Right. Actually, I've carefully avoided writing the table on the master, to avoid introducing extra overhead.
the binary log. Rationale: all the info related to replication should be in tables and easy to query.
That's one valid viewpoint. Unfortunately this is not really supported in the design. I mean, you can query the table if you want, but there is not guaranteed to be an easy way to map the results to the actual internal state. Similarly, updating the table is not guaranteed to be reflected in the internal in-memory state. This is all to reduce overhead.
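For illustration, querying might look roughly like this (a sketch only; the exact column set of mysql.rpl_slave_state is an assumption here, and as noted above the results are not guaranteed to match the in-memory state):

    -- Assumed columns: domain_id, sub_id, server_id, seq_no
    SELECT * FROM mysql.rpl_slave_state;

    -- The corresponding summary string:
    SELECT @@GLOBAL.gtid_pos;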
2) rpl_slave_state does not have enough metadata. What is missing here, and would be extremely useful, is the corresponding binlog+position for each GTID, a timestamp of when the transaction was saved to the binary log, and the timestamp of when it was applied. (The host name of the origin would
Right. I had this thought myself that I should maybe put more stuff in there, so it is good to get your suggestion, based on more experience with actual usage. I think most of this info is already available. Though it might be worth logging timestamps into the Gtid event with better than whole-second precision? And I am not sure about the host name (but we do have server_id, of course). I will think about this. At least, this should be easy to add later, but it would be good to have as much as possible in the first release.
also be useful). Also very important (when you implement parallel replication) is a shard identifier.
Actually, the domain_id is the shard identifier. This is a core part of the design, though the actual implementation will come later.
3) Multiple domains do not integrate well with GTID. In multi-source replication, we use "connection names", but for GTID we use "domain numbers". They do not play well together: there is no correspondence between the gtid_domain_id and the connection name, and a "SHOW ALL SLAVES STATUS" lists all the GTIDs for every connection. For example:
Right. I understand your point of course; it would be natural to expect the domain id to be equivalent somehow to the multi-source connection name. Unfortunately, that is not how things work. There is no simple translation between multi-source connection name and domain id.

For example, consider a topology like A->C, B->C, C->E, D->E. Here, let's say A has domain id 1, B has 2, and D has 3. There is no single domain_id for the connection from E to C.

Or a ring topology: A->B->C->A. Here, there will be three domains, but there is only a single connection.

And later, we will do parallel replication, where a single master will use multiple different domain_ids (basically one for each shard, I suppose).

On the other hand, normally a given domain_id should only be received on a single one of the multi-source connections. It sounds like a good idea to be able to specify a list of valid domains for each multi-source connection; then the code could also check for violations and stop replication with an error if such a violation happens. I will think about adding this.
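A purely hypothetical sketch of what such a per-connection restriction might look like (the option name below is invented for illustration and does not exist; CHANGE MASTER 'name' and master_use_gtid=1 are the syntax used elsewhere in this thread):

    -- Hypothetical: allow only domain 400 on the 'Wolf' connection
    CHANGE MASTER 'Wolf' TO
        master_host='127.0.0.1',
        master_port=8204,
        master_use_gtid=1
        /* , allowed_domain_ids=(400)   -- invented option, not implemented */ ;

The slave could then stop with an error if an event from any other domain arrived on that connection.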
Connection_name: Deer
Master_Host: 127.0.0.1
Master_Port: 8206
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 1144
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 1144
Gtid_Pos: 100-101-21,400-104-23,200-102-8,300-103-12,500-105-20
Which connection is running which gtid number? Not easy to tell, as the correspondence between domain and connections is only in the DBA's head.
Well, there is a single global Gtid_pos; it is shared by all connections. I do not see how this could be different. For example, suppose we have A->C, B->C, C->D. If C goes away, we will want to switch to A->D, B->D. But the Gtid_pos should not change while doing this switch.

Hm, I will admit I had not thought much before about a tighter connection between GTID position and individual slave connections. I have just considered the GTID state as associated with the data actually in the tables, not with any specific configuration ...
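As a minimal sketch of that switch on D (connection names and host names are made up; master_use_gtid=1 is the syntax used elsewhere in this thread):

    -- On D, after C goes away: drop the old connection to C and
    -- replace it with direct connections to A and B.
    STOP ALL SLAVES;
    RESET SLAVE 'from_C' ALL;
    CHANGE MASTER 'from_A' TO master_host='hostA', master_use_gtid=1;
    CHANGE MASTER 'from_B' TO master_host='hostB', master_use_gtid=1;
    START ALL SLAVES;
    -- @@GLOBAL.gtid_pos is not touched; both new connections resume from it.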
4) gtid_pos is text that includes all sequences from all domains. The docs say that I can modify it if needed, but in a case like the one above, modifying this string can be very error prone. For example, if server_ids are assigned using inet_aton(@IP), you could have a gtid_pos like this:

    '1001-170062692-230000,1002-170062593-12893445,1003-170062700-672009,1003-170062699-700002'

Instead of editing a string, and risking breaking replication in all domains, I would expect the availability of commands like:

    CHANGE GTID for domain 'Wolf' set gtid_pos=1900
Right, I agree. On the other hand, if one is doing scripting to e.g. set up a new slave automatically, it is much easier to be able to pass around a single string of GTID position, rather than having to construct complicated SQL expressions to add some, change others, and remove any extra. So that is why I chose the simple approach with a single string for the first version.

Your comments made me realise that much of the current design is biased towards how I prefer to use databases, where "everything in the dba's/developer's head" (or better, the source revision control system) is a good thing. But there is also a need to support a more database-centric approach, where the only source of configuration and metadata is the current state of the DB. Hopefully more of this other approach can be added and extended as we go along.
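A sketch of that scripting flow on a freshly provisioned slave (the position string and host name are placeholders; master_use_gtid=1 is the syntax discussed later in this thread):

    -- The single gtid_pos string captured when the data snapshot was taken:
    SET GLOBAL gtid_pos = '100-101-21,200-102-8';
    CHANGE MASTER TO master_host='master1', master_use_gtid=1;
    START SLAVE;

The whole position travels as one opaque string; the script does not need to know which domains exist.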
That's it for now. Hope there is still room for improvements.
Yes. Thanks a lot Giuseppe for taking the time to review this and provide your opinions. - Kristian.
Kristian,

If performance is an issue that prevents writing to the gtid table, perhaps an INFORMATION_SCHEMA table would do the trick? (Having the state of replication in tables has been my pet peeve since 2006, and none of the MySQL implementations seems to share my view.)

Cheers

Giuseppe
Giuseppe Maxia <g.maxia@gmail.com> writes:
If performance is an issue that prevents writing to the gtid table, perhaps an INFORMATION_SCHEMA table would do the trick? (Having the state of
Yes, that sounds easy to do. Something like information_schema.gtid_pos, which has one row per domain_id, giving the GTID and other metadata. That sounds like a very good idea. I'll add it to the list of ToDos (but no promises that this will be done within the next 1-2 weeks; I'm feeling a bit overloaded currently).

I do not think information_schema tables are updatable, else this would also be a nice way to expose a flexible interface for manipulating the @@gtid_pos string in a per-gtid manner.

 - Kristian.
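A hypothetical sketch of what such a table might return (the table does not exist yet; the name and columns below are invented for illustration):

    SELECT * FROM information_schema.GTID_POS;
    -- DOMAIN_ID  SERVER_ID  SEQ_NO  ...
    --       100        101      21  ...
    --       200        102       8  ...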
Performance_schema tables can be updated, though ...
On Tue, May 7, 2013 at 3:52 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
written by the slaves, and thus the name matches, but I wonder if the master should write this table (or a similar one) to keep track of its progress in
Right. Actually, I've carefully avoided writing the table on the master, to avoid introducing extra overhead.
Kristian,

I was thinking about your words above and they make me wonder: if we have a master and it doesn't write its current state into rpl_slave_state, then we shut down the master, and while it's out we failover to another server. Then the first server comes back and should start to replicate from the new master. But how will it know which GTID to start from if it doesn't have any state in rpl_slave_state?

I can guess that with a graceful shutdown there's also the file <log-bin>.state which seems to contain the last state of the binlog. BTW, can you tell more about this file and how it participates in replication? Will it be used to restore gtid_pos after shutdown? But anyway, that file seems to be written only at shutdown, so in case mysqld crashes both rpl_slave_state and <log-bin>.state won't exist. Will MariaDB restore its gtid_pos by scanning binlogs?

Another question I have: I see with multiple domains gtid_pos is not sorted by domain id. I would guess it's sorted by the order of last transactions in each domain. Is this a deliberate decision to make it like that? Can it be sorted by domain id to be more predictable? I mean, if gtid_pos is not sorted, with multiple domains it won't be enough to compare values of gtid_pos on different servers to understand whether they have the same transactions...

Thank you,
Pavel
Pavel Ivanov <pivanof@google.com> writes:
I was thinking about your words above and they make me wonder: if we have a master and it doesn't write its current state into rpl_slave_state, then we shut down the master, and while it's out we failover to another server. Then the first server comes back and should start to replicate from the new master. But how will it know which GTID to start from if it doesn't have any state in rpl_slave_state?
It uses what it has in the binlog (I call this the "binlog state").

The value of @@GLOBAL.gtid_pos is obtained as follows:

 - First take the row with the highest sub_id from mysql.rpl_slave_state, per domain_id.
 - Then look in the binlog state. If we have something there with higher seq_no, then it overrides what was in rpl_slave_state.

So basically if the last transaction originated on this server it uses the binlog state, if the last transaction was replicated it uses mysql.rpl_slave_state.
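As a rough illustration of the first step (a sketch; the column names domain_id, sub_id, server_id, seq_no are assumptions about the table layout):

    -- Last replicated GTID per domain: the row with the highest sub_id in each domain
    SELECT s.domain_id, s.server_id, s.seq_no
      FROM mysql.rpl_slave_state s
      JOIN (SELECT domain_id, MAX(sub_id) AS max_sub_id
              FROM mysql.rpl_slave_state
             GROUP BY domain_id) m
        ON m.domain_id = s.domain_id AND s.sub_id = m.max_sub_id;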
I can guess that with a graceful shutdown there's also the file <log-bin>.state which seems to contain the last state of the binlog. BTW, can you tell more about this file and how it participates in replication? Will it be used to restore gtid_pos after shutdown?
<log-bin>.state contains the current binlog state at the time of shutdown. This is used to be able to output the correct Gtid_list event at the start of the new binlog created at server startup, and also to initialise the sequence number counter for new events.
But anyway, that file seems to be written only at shutdown, so in case mysqld crashes both rpl_slave_state and <log-bin>.state won't exist. Will MariaDB restore its gtid_pos by scanning binlogs?
Yes. This uses the existing crash recovery framework. If the binlog was not closed cleanly ("we crashed"), the old binlog is scanned, and any non-committed InnoDB transactions are either committed or rolled back as appropriate. And now with GTID, the binlog state is also restored during the scan.
Another question I have: I see with multiple domains gtid_pos is not sorted by domain id. I would guess it's sorted by the order of last
Yeah, I just did not think of it; it is probably in whatever order the internal hash table spits out stuff. Sorting it is a good idea, I'll add it to my ToDo.

 - Kristian.
Sorry, I didn't quite understand. On Sun, May 12, 2013 at 10:24 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Pavel Ivanov <pivanof@google.com> writes:
I was thinking about your words above and they make me wonder: if we have a master and it doesn't write its current state into rpl_slave_state, then we shut down the master, and while it's out we failover to another server. Then the first server comes back and should start to replicate from the new master. But how will it know which GTID to start from if it doesn't have any state in rpl_slave_state?
It uses what it has in the binlog (I call this the "binlog state").
The value of @@GLOBAL.gtid_pos is obtained as follows:
- First take the row with the highest sub_id from mysql.rpl_slave_state, per domain_id.
- Then look in the binlog state. If we have something there with higher seq_no, then it overrides what was in rpl_slave_state.
So basically if the last transaction originated on this server it uses the binlog state, if the last transaction was replicated it uses mysql.rpl_slave_state.
You didn't mention <log-bin>.state here. Is it intentional? Does it try to look into <log-bin>.state after rpl_slave_state and binlogs? Or e.g. in the situation when binlogs are not found?
I can guess that with a graceful shutdown there's also the file <log-bin>.state which seems to contain the last state of the binlog. BTW, can you tell more about this file and how it participates in replication? Will it be used to restore gtid_pos after shutdown?
<log-bin>.state contains the current binlog state at the time of shutdown. This is used to be able to output the correct Gtid_list event at the start of the new binlog created at server startup, and also to initialise the sequence number counter for new events.
Again it "used to be" which means it's not used to do that anymore? Thank you, Pavel
Pavel Ivanov <pivanof@google.com> writes:
Sorry, I didn't quite understand.
Sorry for being unclear, let me elaborate:
You didn't mention <log-bin>.state here. Is it intentional? Does it try to look into <log-bin>.state after rpl_slave_state and binlogs? Or e.g. in the situation when binlogs are not found?
There is just a single binlog state. The binlog state consists of the last GTID logged into the binlog for every (domain_id, server_id) pair. It also remembers the most recent GTID per domain. During normal server operation, the binlog state is kept only in-memory. The binlog state can be recovered from the binlog simply by scanning and recording every GTID encountered on the way. We do this after a crash. However, as an optimisation, to avoid having to scan binlog files at normal startup, we write out the binlog state to <log-bin>.state at server shutdown and read it back in at server startup. This is the only use of the file <log-bin>.state.
<log-bin>.state contains the current binlog state at the time of shutdown. This is used to be able to output the correct Gtid_list event at the start of the new binlog created at server startup, and also to initialise the sequence number counter for new events.
Again it "used to be" which means it's not used to do that anymore?
Just bad wording on my part, I meant "it is used", not that "it used to be". So to clarify:

The server maintains an in-memory struct rpl_binlog_state with last GTIDs binlogged per-domain and per-(domain_id,server_id).

At server shutdown, this in-memory state is saved to <log-bin>.state.

At server startup, if we detect that binlog was properly closed and we do not need to do crash recovery, the in-memory struct rpl_binlog_state is initialised from the file <log-bin>.state.

If we detect that we crashed prior and need to do crash recovery, <log-bin>.state is not consulted. Instead the in-memory struct rpl_binlog_state is initialised during the scan of the binlog, as part of crash recovery.

I hope this makes things clear,

 - Kristian.
Thank you, Kristian. Could you please make just one more clarification: On Mon, May 13, 2013 at 12:02 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
At server startup, if we detect that binlog was properly closed and we do not need to do crash recovery, the in-memory struct rpl_binlog_state is initialised from the file <log-bin>.state.
If we detect that we crashed prior and need to do crash recovery, <log-bin>.state is not consulted. Instead the in-memory struct rpl_binlog_state is initialised during the scan of the binlog, as part of crash recovery.
How does it detect if binlogs were closed properly? I mean if at the start MariaDB sees that there's no binlogs at all but there is <log-bin>.state file will it read it? In other words is just having <log-bin>.state enough to restore gtid_pos after graceful shutdown?

And do I understand correctly that gtid_pos will be combined from the last records in <log-bin>.state for each domain?

Thank you,
Pavel
Pavel Ivanov <pivanof@google.com> writes:
How does it detect if binlogs were closed properly?
This is the standard binlog crash recovery going back to 5.1 (or possibly 5.0): When a new binlog file is created, the first Format_description event has a flag LOG_EVENT_BINLOG_IN_USE_F set. At normal shutdown, everything is written and fsync()ed to disk, then finally the flag is cleared. At startup, if the flag is still set in the most recent binlog, it means we did not shut down cleanly.
I mean if at the start MariaDB sees that there's no binlogs at all but there is <log-bin>.state file will it read it?
If there are no binlog files at all, it will not read <log-bin>.state. But then, I do not see how it is possible that <log-bin>.state exists but no binlog files exist? This would seem to indicate severe corruption of the data directory?
And do I understand correctly that gtid_pos will be combined from the last records in <log-bin>.state for each domain?
Yes. If we want to point an existing slave to a new master, we need to start at the position of the last replicated GTID, which is in mysql.rpl_slave_state. But if we have a master server that we want to make a slave, and point it to a new master (e.g. an old slave which is now promoted as master), we need to start replication at the last transaction originated on the master, which will be the last transaction logged in the binlog.

Originally I wanted the exact same CHANGE MASTER TO master_use_gtid=1 command to work for both cases. So the code tries to guess whether the server was last a slave or a master, by looking at which of mysql.rpl_slave_state and the binlog has the newest transaction, and then uses either the slave state or the binlog state as the start position. So currently we have this magic where the value of @@GLOBAL.gtid_pos is sometimes taken from what is in mysql.rpl_slave_state, and sometimes from what is in the binlog.

However, I am having second thoughts about this idea; it is too magical. I want to change it so that @@GLOBAL.gtid_pos is just the last replicated transaction, as stored in mysql.rpl_slave_state. And there will be separate CHANGE MASTER options for the two separate cases (pointing an old slave to a new master, or making an old master into a slave). The current stuff is just too magical. I was planning to do this after I've finished START SLAVE UNTIL master_gtid_pos=xxx.

But this is basically a user interface issue only; it does not affect how mysql.rpl_slave_state or <log-bin>.state is handled.
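For the first case (pointing an existing slave to a new master), the commands would be roughly (the host name is a placeholder; master_use_gtid=1 is the syntax mentioned above):

    STOP SLAVE;
    CHANGE MASTER TO master_host='new-master', master_use_gtid=1;
    START SLAVE;

With the planned change, the second case (turning an old master into a slave) would get its own explicit CHANGE MASTER option instead of relying on the guess.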
In other words is just having <log-bin>.state enough to restore gtid_pos after graceful shutdown?
There are two separate issues here.

One is to load the binlog state at startup, you should think of this _only_ as conceptually scanning the binlog files and remembering last GTID logged per domain and server_id. We optimise this to avoid the binlog scan by caching the binlog state in <log-bin>.state, but it would be a bug if this optimisation yielded different results than a re-scan of the binlog files would give.

The second issue is how the value of @@GLOBAL.gtid_pos is derived. This is a function of what is in mysql.rpl_slave_state and what is in the binlog. The <log-bin>.state should not affect this in any way (except make server startup faster).

I have a hunch that what you are really asking is how to take a backup of the master server and restore this on a new server to provision a new slave?

My intention for this was to use the existing facilities for such backup/restore, like mysqldump and XtraDB. If you take a mysqldump or XtraDB backup of the master, there are facilities to get the current binlog position (filename/offset) in a non-blocking way. You can then convert that into a GTID position with BINLOG_GTID_POS(filename, offset). Finally, on the new slave, you SET GLOBAL gtid_pos to the correct value. This is how I planned slave provisioning to work.

You should not think of somehow copying around <log-bin>.state on its own without the binlog files, this will not work. Of course, a full consistent filesystem copy of everything (datadir and binlogs), like tar file of a stopped server or LVM snapshot or whatever, is ok.

 - Kristian.
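A sketch of that provisioning flow (the binlog file name, offset, and resulting position string are placeholders):

    -- Convert the binlog coordinates recorded by the backup into a GTID position:
    SELECT BINLOG_GTID_POS('mysql-bin.000123', 456789);
    -- e.g. returns '100-101-21,200-102-8'

    -- On the freshly restored slave:
    SET GLOBAL gtid_pos = '100-101-21,200-102-8';
    CHANGE MASTER TO master_host='master1', master_use_gtid=1;
    START SLAVE;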
On Mon, May 13, 2013 at 12:08 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I have a hunch that what you are really asking is how to take a backup of the master server and restore this on a new server to provision a new slave?
That's right, I'm thinking about taking backup and restoring.
My intention for this was to use the existing facilities for such backup/restore, like mysqldump and XtraDB.
I didn't check with the 10.0.2, but in 10.0.0 XtraDB didn't compile. Does it compile now?
If you take a mysqldump or XtraDB backup of the master, there are facilities to get the current binlog position (filename/offset) in a non-blocking way. You can then convert that into a GTID position with BINLOG_GTID_POS(filename, offset). Finally, on the new slave, you SET GLOBAL gtid_pos to the correct value. This is how I planned slave provisioning to work.
You should not think of somehow copying around <log-bin>.state on its own without the binlog files, this will not work. Of course, a full consistent filesystem copy of everything (datadir and binlogs), like tar file of a stopped server or LVM snapshot or whatever, is ok.
We are taking backup by the full consistent filesystem copy of datadir. Copying binlogs seems to be overkill because they can be huge and the only info that matters in there is information to restore gtid_pos. Somehow I thought that additional copying of <log-bin>.state will be enough for MariaDB to restore that state. If it's not then I guess we'll just need to extract the gtid_pos while taking backup and then force-set it during the restore. This should work, right?

Thanks,
Pavel
Pavel Ivanov <pivanof@google.com> writes:
My intention for this was to use the existing facilities for such backup/restore, like mysqldump and XtraDB.
I didn't check with the 10.0.2, but in 10.0.0 XtraDB didn't compile. Does it compile now?
I do not know; I do not think anyone has tried. It may need some updates, as 10.0 is based on InnoDB from 5.6. But surely it will be made to work at some point.
If you take a mysqldump or XtraDB backup of the master, there are facilities to get the current binlog position (filename/offset) in a non-blocking way. You can then convert that into a GTID position with BINLOG_GTID_POS(filename, offset). Finally, on the new slave, you SET GLOBAL gtid_pos to the correct value. This is how I planned slave provisioning to work.
You should not think of somehow copying around <log-bin>.state on its own without the binlog files, this will not work. Of course, a full consistent filesystem copy of everything (datadir and binlogs), like tar file of a stopped server or LVM snapshot or whatever, is ok.
We are taking backup by the full consistent filesystem copy of datadir.
Copying binlogs seems to be overkill because they can be huge and the only info that matters in there is information to restore gtid_pos. Somehow I thought that additional copying of <log-bin>.state
Agree it seems overkill. In fact, you only need the last binlog file (or last two just around binlog rotate). But that may be too fiddly to be worth it.
will be enough for MariaDB to restore that state. If it's not then I guess we'll just need to extract the gtid_pos while taking backup and then force-set it during the restore. This should work, right?
Yes, that should work fine.

Though I wonder: if you are not copying binlog files, how do you deal with transactions inside InnoDB that are in the prepared state when the consistent filesystem copy is taken? Normally, the binlog is used to recover correctly, by rolling back those that did not commit to the binlog, and committing those that did. But without the binlog on the restored server, I wonder how you handle this.

Are you sure that you get prepared transactions rolled back/committed correctly inside InnoDB? And that the resulting InnoDB state matches the binlog position you associate with the restored backup?

 - Kristian.
On Tue, May 14, 2013 at 5:31 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Though I wonder: if you are not copying binlog files, how do you deal with transactions inside InnoDB that are in the prepared state when the consistent filesystem copy is taken? Normally, the binlog is used to recover correctly, by rolling back those that did not commit to the binlog, and committing those that did. But without the binlog on the restored server, I wonder how you handle this.
Are you sure that you get prepared transactions rolled back/committed correctly inside InnoDB? And that the resulting InnoDB state matches the binlog position you associate with the restored backup?
We reach consistency in the backup by shutting down the server. I would think InnoDB state shouldn't be a problem in such a situation. Am I right?
Pavel Ivanov <pivanof@google.com> writes:
We reach consistency in the backup by shutting down the server. I
So thinking about this again ...

If we shut down the server gracefully, copy over everything to a new server, delete all binlogs (or omit binlogs from the copy), and restart the server, then it could actually work to read the <log-bin>.state file. The result on the new server would be exactly the same as if we copied everything (including all binlogs) and then immediately purged all binlog files except the newly created one.

Your use case seems valid. Do you think it is worth it to have that (reading <log-bin>.state at server startup if there are no binlogs)? The implementation should be trivial.

The main potential issue I can think of is users doing it wrong, e.g. taking an LVM snapshot of a _running_ server, removing the binlogs, and forgetting to remove the <log-bin>.state. Then the binlog state will be wrong (it will be a stale file from when the original server was last restarted). But I could solve that by simply unlinking <log-bin>.state during server startup. That seems like a good thing to do anyway.

With this, it should be possible to shut down the master, copy all files except <log-bin>.NNNNNN/<log-bin>.index, start the server on the new copy, and join the original master as a slave using GTID.

What do you think? I did not understand your use case correctly the first time I read it, as I was thinking about snapshots of a running server...

 - Kristian.
This sounds good to me. So: always read the state file (unless a crash is detected) and delete the file after the state is read. It seems that in such a situation it will even be impossible to find the state file and detect a crash at the same time. If you don't see any other issues with this, it looks worth implementing.

Pavel
participants (3)

- Giuseppe Maxia
- Kristian Nielsen
- Pavel Ivanov