[Maria-developers] Fwd: possible bug in MySQL 5.6.7-rc slave replication crash safety?
Hello,

Does Maria 10.0 implement crash safe replication that is found in MySQL 5.6.7? If so, does this issue exist there?

Thanks
-Zardosht

---------- Forwarded message ----------
From: Zardosht Kasheff <zardosht@gmail.com>
Date: Wed, Oct 17, 2012 at 8:59 AM
Subject: possible bug in MySQL 5.6.7-rc slave replication crash safety?
To: internals@lists.mysql.com

Hello all,

We downloaded MySQL 5.6.7-rc to see slave replication crash safety in action. A simple test did not seem to work, as described in http://bugs.mysql.com/bug.php?id=67246. Does anyone have insight into the problem? Perhaps we screwed something up?

Thanks
-Zardosht
Zardosht Kasheff <zardosht@gmail.com> writes:
Does Maria 10.0 implement crash safe replication that is found in MySQL 5.6.7? If so, does this issue exist there?
No (not yet at least). So the issue does not exist at the moment.

Note that XtraDB in MariaDB has another solution for crash safe replication.

 - Kristian.
What is that solution? Is the solution in MariaDB 5.5? Is there a way for another engine to also have this solution?

Thanks
-Zardosht

On Fri, Oct 19, 2012 at 3:16 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Zardosht Kasheff <zardosht@gmail.com> writes:
Does Maria 10.0 implement crash safe replication that is found in MySQL 5.6.7? If so, does this issue exist there?
No (not yet at least). So the issue does not exist at the moment.
Note that XtraDB in MariaDB has another solution for crash safe replication.
- Kristian.
Zardosht Kasheff <zardosht@gmail.com> writes:
What is that solution? Is the solution in MariaDB 5.5? Is there a way for another engine to also have this solution?
1. http://www.percona.com/doc/percona-server/5.5/reliability/innodb_recovery_up...
2. Yes (I think it's even in MariaDB 5.1)
3. No, this is completely innodb-specific.

My current plan for crash safe replication is that this will be part of MDEV-26:

https://mariadb.atlassian.net/browse/MDEV-26

This solution works by storing the state in the slave binlog, so it works with any (XA-capable) engine.

 - Kristian.
Hello Kristian,

Thank you for the reply. I want to make sure I understand this correctly.

Is the planned design for crash safe replication to be that the slave must have the binary log enabled, so that slaves may use the GTID and XA to ensure that the slave and master are in sync? If the binary log is not enabled, then the slaves are not crash safe. Is my understanding correct?

Thanks
-Zardosht

On Fri, Oct 19, 2012 at 7:36 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Zardosht Kasheff <zardosht@gmail.com> writes:
What is that solution? Is the solution in MariaDB 5.5? Is there a way for another engine to also have this solution?
1. http://www.percona.com/doc/percona-server/5.5/reliability/innodb_recovery_up...
2. Yes (I think it's even in MariaDB 5.1)
3. No, this is completely innodb-specific.
My current plan for crash safe replication is that this will be part of MDEV-26:
https://mariadb.atlassian.net/browse/MDEV-26
This solution works by storing the state in the slave binlog, so it works with any (XA-capable) engine.
- Kristian.
Zardosht Kasheff <zardosht@gmail.com> writes:
Thank you for the reply. I want to make sure I understand this correctly. Is the planned design for crash safe replication to be that the slave must have the binary log enabled, so that slaves may use the GTID and XA to ensure that the slave and master are in sync? If the binary log is not enabled, then the slaves are not crash safe. Is my understanding correct?
I know of three different features that touch on "crash safe slave":

1. MDEV-26. If/when this is implemented, the plan is that the state is made crash-safe using the binary log, as you said.

2. The MySQL feature, where we store in a transactional table the info that is now in relay-log.info.

3. The XtraDB hack, where InnoDB overwrites the relay-log.info at startup from internal data salvaged from the InnoDB redo log.

Some comments:

- None of these make replication slaves truly "crash safe". DDL, for example, is still not crash safe in any MySQL/MariaDB version.

- (1) and (2) work for any storage engine. (3) is an innodb-only hack.

- As far as I understand, (1) is also what is used in MySQL 5.6 global transaction ID. However, they may have modified the design since I last looked.

- Option (2) makes it harder to implement parallel replication. The problem is that two transactions running in parallel could get row lock conflicts on the transactional table. One way to solve this is to insert a new row with every commit (and use SELECT MAX(...) to get the state). Then we need a garbage-collection thread to periodically remove old rows.

- Option (1) is the nicest from a design point of view, as it reuses the existing mechanism for recovering consistently after a crash in an engine-neutral way. But the disadvantage is that it is quite hard (read: I doubt it will ever happen) to implement it without at least one fsync() per (group) commit, due to the requirement of having the binlog on the slave. In contrast, InnoDB in itself is crash-safe (it can lose transactions but not become inconsistent) with innodb_flush_log_at_trx_commit=0|2.

Hope this helps,

 - Kristian.
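[Editorial sketch] The insert-a-row-per-commit idea from the comment on option (2) can be illustrated with a toy state table. This uses SQLite purely for demonstration; all table and column names here are invented for the sketch, not MariaDB's or MySQL's actual schema:

```python
import sqlite3

# Toy model: the slave records the global transaction ID of each applied
# transaction in a state table, inside the same transaction as the data
# change itself, so no separate fsync of a metadata file is needed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE rpl_slave_state (gtid_seq INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO accounts VALUES (1, 0)")

def apply_transaction(gtid_seq, delta):
    # Data change and state row commit atomically: after a crash, either
    # both survive or neither does, so recovery is always consistent.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 1",
                     (delta,))
        conn.execute("INSERT INTO rpl_slave_state (gtid_seq) VALUES (?)",
                     (gtid_seq,))

def recover_position():
    # After a crash, the last applied transaction is simply the maximum
    # GTID sequence number found in the state table.
    return conn.execute("SELECT MAX(gtid_seq) FROM rpl_slave_state").fetchone()[0]

def garbage_collect(keep_last_n=1):
    # The periodic cleanup mentioned above: purge old state rows so the
    # table does not grow without bound.
    with conn:
        conn.execute(
            "DELETE FROM rpl_slave_state WHERE gtid_seq NOT IN "
            "(SELECT gtid_seq FROM rpl_slave_state ORDER BY gtid_seq DESC LIMIT ?)",
            (keep_last_n,))

for seq, delta in [(1, 10), (2, -3), (3, 7)]:
    apply_transaction(seq, delta)

garbage_collect()
print(recover_position())  # prints 3
```

Because each commit inserts a fresh row rather than updating a shared one, parallel appliers never block each other on a row lock, at the cost of the garbage-collection pass.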
Thinking more about this, I cannot quite make up my mind about the best approach. I would like to get some input.

On the one hand, when binlog is enabled on a slave, it really is the best solution for storing replication state and crash safety.

On the other hand, running a slave with innodb_flush_log_at_trx_commit=0 and no binlog will perform much faster and makes a lot of sense for some setups (after all, if we crash we can just replicate again any transactions lost).

Hm. It seems that to get the best of both worlds, we need a dual approach:

When binlog is enabled, we use it to store and crash-recover slave state.

When binlog is disabled, every transaction inserts a row in some table containing the global transaction ID. After a crash we then obtain the last applied transaction with SELECT MAX(...) from that table. And we need to periodically delete old rows.

Both methods work with any (XA-capable) storage engine. However, the table approach will have a large overhead when the engine used for the table is different from the engine used in transactions.

Thoughts?

 - Kristian.

Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Zardosht Kasheff <zardosht@gmail.com> writes:
Thank you for the reply. I want to make sure I understand this correctly. Is the planned design for crash safe replication to be that the slave must have the binary log enabled, so that slaves may use the GTID and XA to ensure that the slave and master are in sync? If the binary log is not enabled, then the slaves are not crash safe. Is my understanding correct?
I know of three different features that touch on "crash safe slave":
1. MDEV-26. If/when this is implemented, the plan is that the state is made crash-safe using the binary log, as you said.
2. The MySQL feature, where we store in a transactional table the info that is now in relay-log.info.
3. The XtraDB hack, where InnoDB overwrites the relay-log.info at startup from internal data salvaged from the InnoDB redo log.
Hello Kristian,

Thanks for the feedback.

For me, the biggest design goal is to have a way for slaves to be crash safe without requiring the storage engine to fsync every transaction it processes. This would be a huge performance win.

In a world where slave replication is single threaded, I favor solution (2), where the information is stored in a slave_relay_log_info table, for the performance reasons you mention. Not requiring fsyncs on transactional commit seems like a big deal, not just for InnoDB, but for any transactional storage engine.

As far as crash safety goes, is DDL the only issue? If so, engines can implement the handlerton's discover API to make themselves crash safe. Should there be a crash, a little tweaking will be needed to get the relay log info to the right position, but it should be doable.

Inlining other points below.

On Tue, Oct 23, 2012 at 7:49 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Thinking more about this, I cannot quite make up my mind about the best approach. I would like to get some input.
On the one hand, when binlog is enabled on a slave, it really is the best solution for storing replication state and crash safety.
On the other hand, running a slave with innodb_flush_log_at_trx_commit=0 and no binlog will perform much faster and makes a lot of sense for some setups (after all if we crash we can just replicate again any transactions lost).
For performance, this sounds like what users who want to scale will want, so I am partial towards this solution.
Hm.
It seems that to get the best of both worlds, we need a dual approach.
When binlog is enabled, we use it to store and crash-recover slave state.
When binlog is disabled, every transaction inserts a row in some table containing the global transaction ID. After a crash we then obtain the last applied transaction with SELECT MAX(...) from that table. And we need to periodically delete old rows.
To better analyze this, can you please give some details on how a slave would apply a binary log in parallel? If it is just what MySQL does, and applies it per database, then perhaps a row in the relay log table per database would be sufficient. I think whatever solution makes sense would depend on the implementation of parallel replication.
Both methods work with any (XA-capable) storage engine. However, the table approach will have a large overhead when the engine used for the table is different from the engine used in transactions.
True, but that's the way it is. It is (hopefully) well known that keeping data across multiple engines consistent incurs performance hits with XA. So, users will hopefully have their data predominantly in one engine. If not, they suffer the performance hit in this scenario. Nothing anyone can do about it.
Thoughts?
- Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Zardosht Kasheff <zardosht@gmail.com> writes:
Thank you for the reply. I want to make sure I understand this correctly. Is the planned design for crash safe replication to be that the slave must have the binary log enabled, so that slaves may use the GTID and XA to ensure that the slave and master are in sync? If the binary log is not enabled, then the slaves are not crash safe. Is my understanding correct?
I know of three different features that touch on "crash safe slave":
1. MDEV-26. If/when this is implemented, the plan is that the state is made crash-safe using the binary log, as you said.
2. The MySQL feature, where we store in a transactional table the info that is now in relay-log.info.
3. The XtraDB hack, where InnoDB overwrites the relay-log.info at startup from internal data salvaged from the InnoDB redo log.
Zardosht Kasheff <zardosht@gmail.com> writes:
For me, the biggest design goal is to have a way for slaves to be crash safe without requiring the storage engine to fsync every transaction it processes. This would be a huge performance win.
Right. This basically requires [*] that everything goes through a single transaction log, i.e. the InnoDB redo log, or alternatively the log of some other transactional storage engine. This can be done by writing the state to a table in the engine. (Another way is the XtraDB hack, where it is essentially written directly to the redo log, but this has problems when XA rollback is done.)

BTW, when using global transaction ID in MySQL 5.6, the binlog on the slave is required. "Normal" replication seems to use the mysql.slave_relay_log_info table. Parallel replication seems to use mysql.slave_worker_info. (Am I the only one who is sad to see this total mess, all new replication features doing something different from each other?)
In a world where slave replication is single threaded, I favor solution (2), where the information is stored in a slave_relay_log_info table, for the performance reasons you mention. Not requiring fsyncs on transactional commit seems like a big deal, not just for InnoDB, but for any transactional storage engine.
Yeah. It depends on the system, of course. On a high-end server with a battery-backed RAID cache and a working set that does not fit in memory, it will be a minor issue. On consumer-grade disks with all data in memory and small transactions, it would make a huge difference.
As far as crash safety goes, is DDL the only issue? If so, engines can implement the handlerton's discover API to make itself crash safe. Should there be a crash, a little tweaking will be needed to get the relay log info to the right position, but it should be doable.
There is the slave relay log and the relay-log.info file, but those are the ones we are talking about fixing. Then there is master.info, but it is only changed during CHANGE MASTER, so the window where a crash can corrupt things is very small.

Neither the binlog nor the relay log has protection against partial disk block writes (sometimes called "torn pages") - InnoDB protects against this with the doublewrite buffer.

Apart from those and DDL, I can't think of anything at the moment; that does not mean there isn't anything else, though.
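[Editorial sketch] On the torn-page problem mentioned above: a crash in the middle of a block write can leave a page half new, half old. InnoDB's doublewrite buffer lets it recover such pages; the simpler, detection-only half of the idea is a per-page checksum, sketched here as a toy (this is not InnoDB's actual page format):

```python
import zlib

PAGE_SIZE = 64  # toy page size; real systems use 4K-16K blocks

def make_page(payload: bytes) -> bytes:
    # Store a checksum of the page body in the header so a torn
    # (partially written) page can be detected on recovery.
    body = payload.ljust(PAGE_SIZE - 4, b"\x00")
    return zlib.crc32(body).to_bytes(4, "big") + body

def page_is_intact(page: bytes) -> bool:
    stored = int.from_bytes(page[:4], "big")
    return stored == zlib.crc32(page[4:])

good = make_page(b"N" * 40)  # the new page image
old = make_page(b"O" * 40)   # the previous on-disk image

# Simulate a torn write: only the first half of the new page hit disk;
# the second half still holds bytes from the old page image.
torn = good[:PAGE_SIZE // 2] + old[PAGE_SIZE // 2:]

print(page_is_intact(good), page_is_intact(torn))  # True False
```

A checksum only detects the corruption; repairing it needs a second intact copy of the page, which is what the doublewrite buffer provides.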
To better analyze this, can you please give some details on how a slave would apply a binary log in parallel? If it is just what MySQL does, and applies it per database, then perhaps a row in the relay log table per database would be sufficient. I think whatever solution makes sense would depend on the implementation of parallel replication.
We are thinking of several methods, which could supplement each other:

- http://askmonty.org/worklog/Server-RawIdeaBin/?tid=169
- http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184
- https://lists.launchpad.net/maria-developers/msg04911.html (see also http://mysql.taobao.org/index.php/RLPR_for_MariaDB)
- http://askmonty.org/worklog/Server-RawIdeaBin/?tid=186
- http://askmonty.org/worklog/Server-RawIdeaBin/?tid=208

But the short story is that there are basically two kinds of approaches: out-of-order (transactions commit in a different order on the slave than on the master) and in-order (transactions may execute in a different order on the slave, but commits are synchronised to happen in the same order as on the master). The MySQL approach that applies per database is an out-of-order approach.

I agree that the out-of-order approaches need distinct rows in the table per database (or per-whatever), because every database could be in a different place in the binlog. The in-order approaches have just a single location in the master binlog at any one time, but if we use just one row in the table for all of them, then we get row lock contention and lose parallelism.

I think it is best just to always insert a new row for every commit, and periodically delete old rows. This works for both out-of-order and in-order.
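[Editorial sketch] For the out-of-order (per-database) case, the new-row-per-commit scheme recovers one high-water mark per database instead of a single global one. A toy sketch, again with SQLite and an invented schema:

```python
import sqlite3

# Sketch of per-commit state rows for out-of-order (per-database) apply.
# Each parallel worker inserts its own row on commit, so workers never
# contend on a shared row. Schema and names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE rpl_state (
                    db        TEXT,
                    gtid_seq  INTEGER,
                    PRIMARY KEY (db, gtid_seq))""")

def record_commit(db, gtid_seq):
    # Insert-only: no row lock conflicts between parallel workers.
    with conn:
        conn.execute("INSERT INTO rpl_state (db, gtid_seq) VALUES (?, ?)",
                     (db, gtid_seq))

def recover_positions():
    # After a crash, each database restarts from its own high-water mark,
    # since out-of-order apply lets databases drift apart in the binlog.
    return dict(conn.execute(
        "SELECT db, MAX(gtid_seq) FROM rpl_state GROUP BY db"))

# Two databases applied in parallel, with interleaved commit order:
for db, seq in [("shop", 1), ("blog", 2), ("shop", 4), ("blog", 3)]:
    record_commit(db, seq)

print(recover_positions())  # per-db maxima: shop -> 4, blog -> 3
```

An in-order scheme would use the same table but could take a single global MAX over all rows, since commits are synchronised to the master's order.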
True, but that's the way it is. It is (hopefully) well known that keeping data across multiple engines consistent incurs performance hits with XA. So, users will hopefully have their data predominantly
Agree. - Kristian.