Re: [Maria-developers] what pros/cons of storing binary log in an InnoDB table?
Zardosht Kasheff <zardosht@gmail.com> writes:
Instead of having the binary log be stored as a log file, a table with a transactional engine (like InnoDB) is used. I have NOT done any
Yes, this is a tempting idea, is it not? Drizzle even implemented this. But there are some subtle problems, and I think there are better solutions.

One problem you will find if you try implementing this is - how are you going to keep the concept of binlog order? A relational table is unordered. You can use an increasing uint64 as the primary key, but then how are you going to ensure that the order in the table is the same as the commit order in InnoDB (InnoDB redo log)? If you take a lock from allocating the sequence number to doing the commit, then you have introduced a serialisation bottleneck and killed group commit. If you accept different order (as Drizzle does), then when you take an XtraBackup, it may not be consistent with the binlog.

You will be writing your data *6* times: To the real table, when flushing its buffer pool pages. To the double-write buffer while flushing, and to the redo log before flushing. Then to the binlog table, and before that to the redo log and the doublewrite buffer. I could never reconcile myself with this.

So it is tempting, but re-implementing binlog is a huge change (also in terms of end-user visibility), and my personal opinion is that for such a huge change we need to go to the best solution, not second-best.
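To make the dilemma concrete, here is a minimal, self-contained C++ sketch (hypothetical code, nothing from the server tree): either the lock covers both the sequence-number allocation and the engine commit, which serialises every commit, or it does not, in which case binlog-table order and commit order can diverge.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <mutex>

std::atomic<std::uint64_t> next_binlog_seq{1};  // PK for the binlog table
std::mutex commit_order_mutex;

void commit_strictly_ordered() {
    std::lock_guard<std::mutex> guard(commit_order_mutex);
    std::uint64_t seq = next_binlog_seq.fetch_add(1);
    // Write the binlog-table row with key `seq` and commit in the engine,
    // all while still holding the lock. Table order now equals commit
    // order, but every commit waits for the previous one: no group commit.
    std::printf("committed in strict order, seq=%llu\n",
                (unsigned long long)seq);
}

void commit_unordered() {
    std::uint64_t seq = next_binlog_seq.fetch_add(1);
    // Other transactions can allocate and commit in between, so the
    // binlog-table order can differ from the engine commit order -- the
    // inconsistency that a backup taken at the wrong moment will expose.
    std::printf("committed without ordering, seq=%llu\n",
                (unsigned long long)seq);
}

int main() {
    commit_strictly_ordered();
    commit_unordered();
    return 0;
}
```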
investigation into whether this is possible, but I'm wondering what people abstractly think of this idea.
My major motivation is to find a way to have users not be forced to fsync on every transaction commit on slaves. With crash safety of
My preference is MWL#164:
http://askmonty.org/worklog/Server-RawIdeaBin/?tid=164

This reduces to one fsync() per group commit. But as you point out, this is still horrendously expensive for a single-threaded slave :-(

Abstractly, the proper solution is to store the binlog and the InnoDB redo log (and all other transactional-type logs) in the *same* log. The upper layer would provide a general transactional log facility, and binlog, InnoDB, partitioning, etc. would use this instead of their own logs. But it is hard to see how we could arrive there from current code ...

For MariaDB global transaction ID, I support crash-safe without binlog enabled, so there turning off fsync() will be possible. But this does not help if a slave also needs to act as a master.

Parallel replication can help a lot, by allowing group commit, amortising the cost of the fsync(). But it would still be nice to solve the original problem.

As you say, for a slave it is not a problem to lose transactions in a crash as they can be just replicated again. The problem is to be consistent between what is in the InnoDB tables and what is in the binlog.

InnoDB already records in a crash-safe way the binlog position of the last commit. So if we can ensure that InnoDB is never ahead of the binlog, we can just cut the binlog after a crash if it has more events than what InnoDB recovered. We could do this by doing binlog writes as async I/O O_DIRECT writes, and sending a signal somehow to InnoDB so it would buffer its own redo log until binlog is safe on disk. Then no fsync() would be needed. But this is so far just a loose idea, it needs more thought to be sure it can work...

Anyway, this is my current long-term plan: if we cannot have a proper single database-wide transaction log, then instead coordinate writes between the multiple logs that exist. Delay writes to logs so that they are never ahead of the master log. And at crash, either recover missing pieces of logs from the master log (MWL#164), or cut logs to be consistent. Then we can be consistent without any fsync() delays.
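A rough, self-contained sketch of the "cut the binlog after a crash" idea (the file name and the recovery-info source are made up for illustration; InnoDB does record the binlog coordinates of its last commit, but this is not server code):

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>

struct EngineRecoveryInfo {
    std::string binlog_file;   // e.g. "binlog.000042" (illustrative)
    std::uint64_t binlog_pos;  // byte offset of the last committed event
};

// Stand-in for what the engine would report after crash recovery: the
// binlog position of the last transaction it durably committed.
EngineRecoveryInfo read_engine_recovery_info() {
    return {"binlog.000042", 64};
}

void cut_binlog_after_crash() {
    EngineRecoveryInfo info = read_engine_recovery_info();
    std::uintmax_t size = std::filesystem::file_size(info.binlog_file);
    if (size < info.binlog_pos)
        throw std::runtime_error("engine ahead of binlog: the write "
                                 "ordering guarantee was violated");
    // Drop any trailing events the engine never committed. Binlog and
    // engine are consistent again without any fsync() at commit time.
    std::filesystem::resize_file(info.binlog_file, info.binlog_pos);
}

int main() {
    // Create a dummy 128-byte "binlog" so the sketch runs end to end.
    std::ofstream("binlog.000042", std::ios::binary)
        << std::string(128, '\0');
    cut_binlog_after_crash();
    return 0;
}
```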
- As I understand it, for two-phase commit, storage engines are currently expected to fsync once after a transaction is prepared and once after a transaction is committed. If the binary log is an InnoDB table, then two-phase commit is unnecessary and the number of fsyncs on the system is cut in half.
In MariaDB 10 and MySQL 5.6, the fsync() at commit is no longer needed (InnoDB does not have it).

[Note that IIRC, MySQL 5.6 breaks the storage engine API, you will need to implement the skip-fsync-at-commit-stage, or you will lose group commit.]
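As a back-of-the-envelope illustration (plain standalone C++; the counts only restate what is described in this thread), the per-transaction fsync() cost looks roughly like this:

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    // Classic 2PC with the binlog: engine prepare, binlog write and engine
    // commit are each made durable on their own.
    int fsyncs_classic = 1 /* prepare */ + 1 /* binlog */ + 1 /* commit */;

    // MariaDB 10 / MySQL 5.6: the fsync() at the commit step is dropped.
    int fsyncs_without_commit_sync = 1 /* prepare */ + 1 /* binlog */;

    std::printf("per transaction, no group commit: %d vs %d fsyncs\n",
                fsyncs_classic, fsyncs_without_commit_sync);

    // With group commit the remaining syncs are shared by all N
    // transactions that commit together, so the per-transaction cost
    // shrinks as the group grows -- but a single-threaded slave always
    // has N = 1.
    for (int n : {1, 10, 100})
        std::printf("group of %3d: ~%.2f fsyncs per transaction\n",
                    n, (double)fsyncs_without_commit_sync / n);
    return 0;
}
```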
- With crash safe slaves on MySQL 5.6, one theoretically does not need to enable fsyncs on transaction commit on slaves. If some data gets lost after a crash, then the data may be replayed from the master. However, to get GTIDs working, one must enable the binary log on slaves, which requires two-phase commit, which require fsyncs. By
Yes, requiring binlog on slaves is a serious mistake :-(

Ok, as you can perhaps tell, this is something I have thought a lot about over the past years, so I have a lot of ideas but it is a bit difficult to put it all in a single mail reply without it becoming mostly a jumble of loose ideas. You are welcome to ask for more details on any specific points of interest.

 - Kristian.
Hello Kristian,

Thank you for the detailed reply. Hearing the pros and cons of these different design choices is helpful. There is too much to reply to in a single email, so I will focus on different pieces in different emails. I see the following interesting pieces that I would like to expand on:
- problems with having the binary log be an InnoDB (or any transactional engine) table
- the "proper solution"
- the "current long-term plan"
- improvements made to replication in MySQL 5.6 that does not require fsync on commit and to Maria 10 that does not require binary logging on slaves for GTID.

In this email, I will focus on (and hope to understand better) just the problems with having the binary log be an InnoDB table. Here are some design problems you mentioned:

"One problem you will find if you try implementing this is - how are you going to keep the concept of binlog order? A relational table is unordered. You can use an increasing uint64 as the primary key, but then how are you going to ensure that the order in the table is the same as the commit order in InnoDB (InnoDB redo log)? If you take a lock from allocating the sequence number to doing the commit, then you have introduced a serialisation bottleneck and killed group commit. If you accept different order (as Drizzle does), then when you take an XtraBackup, it may not be consistent with the binlog."

I don't see issues with using an increasing uint64 as the primary key. You ask how we ensure that the order in the table is the same as the commit order in InnoDB. Why does this matter? As long as the user work done to InnoDB tables and changes to the binary log are done with the same transaction (the way MySQL 5.6 updates the relay-log info on slaves with the same transaction) I don't understand why it matters. If XtraBackup works properly, then the backed up data should have something that is consistent across the board. I must say, I have a limited knowledge of InnoDB. Is there something I am missing here?

As far as the performance issues go, you say we are writing the data 6 times. With the current solution, we write the data four times, three for InnoDB and once to the binary log. So really, there seems to be a 50% increase of data written. This does not strike me as so bad. Also, other engines will probably only need to write the data one more time, not two more times. Yes, this is more data written, but it is sequential data being written, so hopefully the hit is not too bad. Thoughts?

-Zardosht

On Mon, Jan 28, 2013 at 4:28 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Zardosht Kasheff <zardosht@gmail.com> writes:
Instead of having the binary log be stored as a log file, a table with a transactional engine (like InnoDB) is used. I have NOT done any
Yes, this is a tempting idea, is it not? Drizzle even implemented this. But there are some subtle problems, and I think there are better solutions.
One problem you will find if you try implementing this is - how are you going to keep the concept of binlog order? A relational table is unordered. You can use an increasing uint64 as the primary key, but then how are you going to ensure that the order in the table is the same as the commit order in InnoDB (InnoDB redo log)? If you take a lock from allocating the sequence number to doing the commit, then you have introduced a serialisation bottleneck and killed group commit. If you accept different order (as Drizzle does), then when you take an XtraBackup, it may not be consistent with the binlog.
You will be writing your data *6* times: To the real table, when flushing its buffer pool pages. To the double-write buffer while flushing, and to the redo log before flushing. Then to the binlog table, and before that to the redo log and the doublewrite buffer. I could never reconcile myself with this.
So it is tempting, but re-implementing binlog is a huge change (also in terms of end-user visibility), and my personal opinion is that for such a huge change we need to go to the best solution, not second-best.
investigation into whether this is possible, but I'm wondering what people abstractly think of this idea.
My major motivation is to find a way to have users not be forced to fsync on every transaction commit on slaves. With crash safety of
My preference is MWL#164:
http://askmonty.org/worklog/Server-RawIdeaBin/?tid=164
This reduces to one fsync() per group commit. But as you point out, this is still horrendously expensive for a single-threaded slave :-(
Abstractly, the proper solution is to store the binlog and the InnoDB redo log (and all other transactional-type logs) in the *same* log. The upper layer would provide a general transactional log facility, and binlog, InnoDB, partitioning, etc. would use this instead of their own logs. But it is hard to see how we could arrive there from current code ...
For MariaDB global transaction ID, I support crash-safe without binlog enabled, so there turning off fsync() will be possible. But this does not help if a slave also needs to act as a master.
Parallel replication can help a lot, by allowing group commit, amortising the cost of the fsync(). But it would still be nice to solve the original problem.
As you say, for a slave it is not a problem to lose transactions in a crash as they can be just replicated again. The problem is to be consistent between what is in the InnoDB tables and what is in the binlog.
InnoDB already records in a crash-safe way the binlog position of the last commit. So if we can ensure that InnoDB is never ahead of the binlog, we can just cut the binlog after a crash if it has more events than what InnoDB recovered. We could do this by doing binlog writes as async I/O O_DIRECT writes, and sending a signal somehow to InnoDB so it would buffer its own redo log until binlog is safe on disk. Then no fsync() would be needed. But this is so far just a loose idea, it needs more thought to be sure it can work...
Anyway, this is my current long-term plan: if we cannot have a proper single database-wide transaction log, then instead coordinate writes between the multiple logs that exist. Delay writes to logs so that they are never ahead of the master log. And at crash, either recover missing pieces of logs from the master log (MWL#164), or cut logs to be consistent. Then we can be consistent without any fsync() delays.
- As I understand it, for two-phase commit, storage engines are currently expected to fsync once after a transaction is prepared and once after a transaction is committed. If the binary log is an InnoDB table, then two-phase commit is unnecessary and the number of fsyncs on the system is cut in half.
In MariaDB 10 and MySQL 5.6, the fsync() at commit is no longer needed (InnoDB does not have it).
[Note that IIRC, MySQL 5.6 breaks the storage engine API, you will need to implement the skip-fsync-at-commit-stage, or you will lose group commit.]
- With crash safe slaves on MySQL 5.6, one theoretically does not need to enable fsyncs on transaction commit on slaves. If some data gets lost after a crash, then the data may be replayed from the master. However, to get GTIDs working, one must enable the binary log on slaves, which requires two-phase commit, which require fsyncs. By
Yes, requiring binlog on slaves is a serious mistake :-(
Ok, as you can perhaps tell, this is something I have thought a lot about over the past years, so I have a lot of ideas but it is a bit difficult to put it all in a single mail reply without it becoming mostly a jumble of loose ideas. You are welcome to ask for more details on any specific points of interest.
- Kristian.
Zardosht Kasheff <zardosht@gmail.com> writes:
In this email, I will focus on (and hope to understand better) just the problems with having the binary log be an InnoDB table.
I don't see issues with using an increasing uint64 as the primary key. You ask how we ensure that the order in the table is the same as the commit order in InnoDB. Why does this matter? As long as the user work done to InnoDB tables and changes to the binary log are done with the same transaction (the way MySQL 5.6 updates the relay-log info on slaves with the same transaction) I don't understand why it matters. If XtraBackup works properly, then the backed up data should have something that is consistent across the board.
Yes, it is a subtle issue. Let me explain it in more depth.

Suppose we have 4 transactions: T1 T2 T3 T4. They run in parallel and commit around the same time.

When we allocate rows for them in the binlog (in an InnoDB table), we happen to assign them numbers like this:

1: T1 2: T2 3: T3 4: T4

And in the InnoDB redo log, they happen to be committed in order: T1 T3 T4 T2.

Now suppose a non-blocking XtraBackup is running in parallel with this with the intention of provisioning a new slave. XtraBackup happens to take a snapshot of InnoDB that has T1 T3 T4 committed (but not T2).

We restore the XtraBackup to a new slave. Now the problem is - which binlog position should the slave start replicating from? If we start after (4: T4), then we will have lost T2 on the slave. If we start at (2: T2), then we will duplicate T3 and T4 on the slave. So the problem is that without a consistent binlog order between the InnoDB redo log and the binlog, we do not have a unique position to start replicating from in the new slave.

So this is a small thing perhaps - you can just e.g. give up on non-blocking XtraBackup provisioning of slaves.

And anyway, it just occurs to me that MySQL 5.6 global transaction ID does not have this issue, because it anyway gives up on having a consistent binlog order. Instead it keeps track of all applied and not applied transactions, so should be able to replicate T2 and skip T3 and T4. You might still need to support non-GTID replication though.

BTW, a closely related idea would be to store the binlog inside the InnoDB redo log (as extra info logged along with the transaction). This would solve most problems, I believe. This might work well for many transactional storage engines. The problem with InnoDB is that it has a cyclic log, so there is no way to ensure that old binlogs are not overwritten while slaves might still need them.
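To restate the T1 .. T4 example above in runnable form (a self-contained illustration, not server code): with only a single binlog start position there is no correct choice, while a GTID-style set of applied transactions lets the new slave request exactly what its snapshot is missing.

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

int main() {
    // Binlog-table order: 1:T1, 2:T2, 3:T3, 4:T4. The InnoDB snapshot in
    // the backup happens to contain T1, T3 and T4, but not T2.
    std::set<std::uint64_t> in_snapshot = {1, 3, 4};

    // A single start position cannot describe this snapshot:
    //   start after seq 4 -> T2 is lost on the new slave
    //   start at seq 2    -> T3 and T4 are applied twice
    //
    // A set of applied transactions can: replicate exactly the events the
    // snapshot is missing, whatever their order.
    for (std::uint64_t seq = 1; seq <= 4; ++seq) {
        if (in_snapshot.count(seq))
            std::printf("seq %llu: already in snapshot, skip\n",
                        (unsigned long long)seq);
        else
            std::printf("seq %llu: missing, apply on the new slave\n",
                        (unsigned long long)seq);
    }
    return 0;
}
```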
As far as the performance issues go, you say we are writing the data 6 times. With the current solution, we write the data four times, three for InnoDB and once to the binary log. So really, there seems to be a
Agree.

So as I said, it is tempting. Just use a table in the storage engine, and get all the transactional consistency for free. We have a lot of nasty problems in current MySQL/MariaDB because of things that write to all kinds of different files, rather than use a common transactional framework.

So what I am saying - I thought a lot about this and similar ideas a couple of years ago. I ended up deciding not to go this way, because of above-mentioned problems and probably others that I have forgotten. But it is not clear that it cannot work.

Of course, there will be quite a lot of practical work to move the binlog to a storage engine. I suppose you have in mind a general extension to the storage engine API so that other engines could own the binlog as well? A lot of existing infrastructure and tools would need to deal with binlog tables rather than binlog files.

Maybe the decisive factor was mostly that keeping the binlog, and slowly adding improvements, can be done in small, evolutionary steps. So it seems the more realistic approach, compared to a revolutionary approach of completely rewriting the binlog.

I do not have the final answer. It would definitely be nice to see us move towards being more transactional. It is a hard journey though.

 - Kristian.
Hello Kristian,

Thanks again for the reply. Inlining my thoughts.

On Mon, Jan 28, 2013 at 10:06 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Zardosht Kasheff <zardosht@gmail.com> writes:
In this email, I will focus on (and hope to understand better) just the problems with having the binary log be an InnoDB table.
I don't see issues with using an increasing uint64 as the primary key. You ask how we ensure that the order in the table is the same as the commit order in InnoDB. Why does this matter? As long as the user work done to InnoDB tables and changes to the binary log are done with the same transaction (the way MySQL 5.6 updates the relay-log info on slaves with the same transaction) I don't understand why it matters. If XtraBackup works properly, then the backed up data should have something that is consistent across the board.
Yes, it is a subtle issue. Let me explain it in more depth.
Suppose we have 4 transactions: T1 T2 T3 T4. They run in parallel and commit around the same time.
When we allocate rows for them in the binlog (in an InnoDB table), we happen to assign them numbers like this:
1: T1 2: T2 3: T3 4: T4
And in the InnoDB redo log, they happen to be committed in order: T1 T3 T4 T2.
Now suppose a non-blocking XtraBackup is running in parallel with this with the intention of provisioning a new slave. XtraBackup happens to take a snapshot of InnoDB that has T1 T3 T4 committed (but not T2).
We restore the XtraBackup to a new slave. Now the problem is - which binlog position should the slave start replicating from? If we start after (4: T4), then we will have lost T2 on the slave. If we start at (2: T2), then we will duplicate T3 and T4 on the slave. So the problem is that without a consistent binlog order between the InnoDB redo log and the binlog, we do not have a unique position to start replicating from in the new slave.
So this is a small thing perhaps - you can just e.g. give up on non-blocking XtraBackup provisioning of slaves.
And anyway, it just occurs to me that MySQL 5.6 global transaction ID does not have this issue, because it anyway gives up on having a consistent binlog order. Instead it keeps track of all applied and not applied transactions, so should be able to replicate T2 and skip T3 and T4. You might still need to support non-GTID replication though.
Thanks for the explanation. I now understand this. A possible approach may be that legacy replication works as it does now, and GTID replication (or "new" replication) works with this scheme.
BTW, a closely related idea would be to store the binlog inside the InnoDB redo log (as extra info logged along with the transaction). This would solve most problems, I believe. This might work well for many transactional storage engines. The problem with InnoDB is that it has a cyclic log, so there is no way to ensure that old binlogs are not overwritten while slaves might still need them.
I think this goes counter to the ideas that I am proposing/thinking about. My goal is to use the existing transactional storage engine infrastructure to implement the binary log, so that other transactional engines may use it, and to keep storage engine implementations and binary logs decoupled. This couples the implementation of InnoDB and the binary log.
As far as the performance issues go, you say we are writing the data 6 times. With the current solution, we write the data four times, three for InnoDB and once to the binary log. So really, there seems to be a
Agree.
So as I said, it is tempting. Just use a table in the storage engine, and get all the transactional consistency for free. We have a lot of nasty problems in current MySQL/MariaDB because of things that write to all kinds of different files, rather than use a common transactional framework.
So what I am saying - I thought a lot about this and similar ideas a couple of years ago. I ended up deciding not to go this way, because of above-mentioned problems and probably others that I have forgotten. But it is not clear that it cannot work.
So in summary, these seem to be the non-trivial (and very legitimate) issues:
- This would be revolutionary, as opposed to evolutionary, making it a risky project to undertake.
- Legacy (non-GTID) replication requires a consistent order of transactions in the binary log, making it difficult to work with this feature.

Are there any other issues?
Of course, there will be quite a lot of practical work to move the binlog to a storage engine. I suppose you have in mind a general extension to the storage engine API so that other engines could own the binlog as well? A lot of existing infrastructure and tools would need to deal with binlog tables rather than binlog files.
I actually have no APIs in mind. I was hoping this would work with existing APIs writing data to a table and reading data from a table.
Maybe the decisive factor was mostly that keeping the binlog, and slowly adding improvements, can be done in small, evolutionary steps. So it seems the more realistic approach, compared to a revolutionary approach of completely rewriting the binlog.
I do not have the final answer. It would definitely be nice to see us move towards being more transactional. It is a hard journey though.
- Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
You will be writing your data *6* times: To the real table, when flushing its buffer pool pages. To the double-write buffer while flushing, and to the redo log before flushing. Then to the binlog table, and before that to the redo log and the doublewrite buffer. I could never reconcile myself with this.
Mind you, it performs really quite well due to the reduced fsync()s and (at least in Drizzle code) was rather trivial to implement. Big bang for buck :)

-- Stewart Smith
Stewart,

I would pretty much like to see benchmark numbers, especially under IO-bound workload before agreeing that it performs well.

On Mon, Jan 28, 2013 at 5:05 PM, Stewart Smith <stewart@flamingspork.com> wrote:
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
You will be writing your data *6* times: To the real table, when flushing its buffer pool pages. To the double-write buffer while flushing, and to the redo log before flushing. Then to the binlog table, and before that to the redo log and the doublewrite buffer. I could never reconcile myself with this.
Mind you, it performs really quite well due to the reduced fsync()s and (at least in Drizzle code) was rather trivial to implement. Big bang for buck :)
-- Stewart Smith
-- Vadim Tkachenko, CTO, Percona
Stewart, thanks for the feedback. That this was simple for Drizzle is encouraging to hear.

Is there a proposal somewhere for eliminating the need for fsyncs on a slave? As I understand, MWL#164 still requires the fsync.

On Mon, Jan 28, 2013 at 8:38 PM, Vadim Tkachenko <vadim@percona.com> wrote:
Stewart,
I would pretty much like to see benchmark numbers, especially under IO-bound workload before agreeing that it performs well.
On Mon, Jan 28, 2013 at 5:05 PM, Stewart Smith <stewart@flamingspork.com> wrote:
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
You will be writing your data *6* times: To the real table, when flushing its buffer pool pages. To the double-write buffer while flushing, and to the redo log before flushing. Then to the binlog table, and before that to the redo log and the doublewrite buffer. I could never reconcile myself with this.
Mind you, it performs really quite well due to the reduced fsync()s and (at least in Drizzle code) was rather trivial to implement. Big bang for buck :)
-- Stewart Smith
Zardosht Kasheff <zardosht@gmail.com> writes:
Is there a proposal somewhere for eliminating the need for fsyncs on a slave? As I understand, MWL#164 still requires the fsync.
Do you mean for MariaDB? My intention is that MDEV-26 will make the slave crash-safe without the need for fsync() if no --log-slave-updates. If binlog is enabled, then fsync() will still be needed to prevent the binlog and the engine from getting out of sync.

 - Kristian.
These are some results from sysbench for the GA release of Drizzle; it was about a 10% hit having the transaction log stored in inno. The only automated testing I did was sysbench; I do recall how i/o bound sysbench is.

https://lists.launchpad.net/drizzle-benchmark/msg09626.html

Cheers,
--Joe

On Mon, Jan 28, 2013 at 8:38 PM, Vadim Tkachenko <vadim@percona.com> wrote:
Stewart,
I would pretty much like to see benchmark numbers, especially under IO-bound workload before agreeing that it performs well.
On Mon, Jan 28, 2013 at 5:05 PM, Stewart Smith <stewart@flamingspork.com> wrote:
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
You will be writing your data *6* times: To the real table, when flushing its buffer pool pages. To the double-write buffer while flushing, and to the redo log before flushing. Then to the binlog table, and before that to the redo log and the doublewrite buffer. I could never reconcile myself with this.
Mind you, it performs really quite well due to the reduced fsync()s and (at least in Drizzle code) was rather trivial to implement. Big bang for buck :)
-- Stewart Smith
Oops, meant to say: I don't recall how i/o bound sysbench is.

On Thu, Jan 31, 2013 at 10:07 AM, Joe Daly <skinny.moey@gmail.com> wrote:
These are some results from sysbench for the GA release of Drizzle; it was about a 10% hit having the transaction log stored in inno. The only automated testing I did was sysbench; I do recall how i/o bound sysbench is.
https://lists.launchpad.net/drizzle-benchmark/msg09626.html
Cheers, --Joe
On Mon, Jan 28, 2013 at 8:38 PM, Vadim Tkachenko <vadim@percona.com> wrote:
Stewart,
I would pretty much like to see benchmark numbers, especially under IO-bound workload before agreeing that it performs well.
On Mon, Jan 28, 2013 at 5:05 PM, Stewart Smith <stewart@flamingspork.com> wrote:
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
You will be writing your data *6* times: To the real table, when flushing its buffer pool pages. To the double-write buffer while flushing, and to the redo log before flushing. Then to the binlog table, and before that to the redo log and the doublewrite buffer. I could never reconcile myself with this.
Mind you, it performs really quite well due to the reduced fsync()s and (at least in Drizzle code) was rather trivial to implement. Big bang for buck :)
-- Stewart Smith
participants (5)
- Joe Daly
- Kristian Nielsen
- Stewart Smith
- Vadim Tkachenko
- Zardosht Kasheff