Hello Kristian,

Thanks again for the reply. Inlining my thoughts.

On Mon, Jan 28, 2013 at 10:06 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Zardosht Kasheff <zardosht@gmail.com> writes:
In this email, I will focus on (and hope to understand better) just the problems with having the binary log be an InnoDB table.
I don't see issues with using an increasing uint64 as the primary key. You ask how we ensure that the order in the table is the same as the commit order in InnoDB. Why does this matter? As long as the user work done to InnoDB tables and the changes to the binary log are done in the same transaction (the way MySQL 5.6 updates the relay-log info on slaves in the same transaction), I don't understand why it matters. If XtraBackup works properly, then the backed-up data should be consistent across the board.
Yes, it is a subtle issue. Let me explain it in more depth.
Suppose we have 4 transactions: T1 T2 T3 T4. They run in parallel and commit around the same time.
When we allocate rows for them in the binlog (in an InnoDB table), we happen to assign them numbers like this:
  1: T1
  2: T2
  3: T3
  4: T4
And in the InnoDB redo log, they happen to be committed in order: T1 T3 T4 T2.
Now suppose a non-blocking XtraBackup is running in parallel with this with the intention of provisioning a new slave. XtraBackup happens to take a snapshot of InnoDB that has T1 T3 T4 committed (but not T2).
We restore the XtraBackup to a new slave. Now the problem is - which binlog position should the slave start replicating from? If we start after (4: T4), then we will have lost T2 on the slave. If we start at (2: T2), then we will duplicate T3 and T4 on the slave. So the problem is that without a consistent order between the InnoDB redo log and the binlog, we do not have a unique position to start replicating from on the new slave.
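The scenario above can be played out in a few lines of Python. This is only a toy model under stated assumptions: the names `binlog`, `commit_order`, `snapshot`, `replay_from` and `consistent_start_positions` are made up for illustration, and nothing here reflects how InnoDB, the binlog or XtraBackup are actually implemented.

```python
# Toy model (hypothetical names): binlog row numbers vs. InnoDB commit order.
binlog = {1: "T1", 2: "T2", 3: "T3", 4: "T4"}   # binlog row number -> transaction
commit_order = ["T1", "T3", "T4", "T2"]          # order of commits in the redo log


def snapshot(n):
    """Transactions visible in a backup taken after the first n commits."""
    return set(commit_order[:n])


def replay_from(start_pos, restored):
    """Replay every binlog row numbered >= start_pos on top of a restored backup.

    Returns (lost, duplicated): transactions the slave never gets,
    and transactions the slave would apply a second time.
    """
    replayed = {binlog[pos] for pos in binlog if pos >= start_pos}
    lost = set(binlog.values()) - restored - replayed
    duplicated = restored & replayed
    return lost, duplicated


def consistent_start_positions(restored):
    """All start positions that neither lose nor duplicate a transaction."""
    candidates = range(1, len(binlog) + 2)  # includes "after the last row"
    return [p for p in candidates if replay_from(p, restored) == (set(), set())]


backup = snapshot(3)                         # contains T1, T3, T4 but not T2
print(replay_from(5, backup))                # start after (4: T4): T2 is lost
print(replay_from(2, backup))                # start at (2: T2): T3, T4 duplicated
print(consistent_start_positions(backup))    # no single position works
```

In this model a single start position only exists when the backup contains exactly a prefix of the binlog rows, which is what a consistent order between the two logs would guarantee.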
So this is a small thing perhaps - you can just, e.g., give up on non-blocking XtraBackup provisioning of slaves.
And anyway, it just occurs to me that MySQL 5.6 global transaction ID does not have this issue, because it gives up on having a consistent binlog order anyway. Instead it keeps track of all applied and not applied transactions, so it should be able to replicate T2 and skip T3 and T4. You might still need to support non-GTID replication though.
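A rough sketch of why tracking the set of applied transactions sidesteps the ordering problem. This is a simplified model, not MySQL's GTID implementation: the transaction names stand in for real global transaction IDs, and `transactions_to_apply` is a hypothetical helper.

```python
def transactions_to_apply(binlog_rows, applied):
    """Return the transactions the slave still needs, in binlog order.

    binlog_rows: dict mapping binlog row number -> transaction id
    applied:     set of transaction ids already present in the restored backup
    """
    return [txn for _, txn in sorted(binlog_rows.items()) if txn not in applied]


binlog_rows = {1: "T1", 2: "T2", 3: "T3", 4: "T4"}
applied = {"T1", "T3", "T4"}          # state of the restored backup

print(transactions_to_apply(binlog_rows, applied))  # ['T2']
```

Because the slave filters by membership rather than by a single start position, it does not matter that T2 sits in the middle of the binlog.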
Thanks for the explanation. I now understand this. A possible approach may be that legacy replication works as it does now, and GTID replication (or "new" replication) works with this scheme.
BTW, a closely related idea would be to store the binlog inside the InnoDB redo log (as extra info logged along with the transaction). This would solve most problems, I believe. This might work well for many transactional storage engines. The problem with InnoDB is that it has a cyclic log, so there is no way to ensure that old binlogs are not overwritten while slaves might still need them.
I think this goes counter to the ideas that I am proposing/thinking about. My goal is to use the existing transactional storage engine infrastructure to implement the binary log, so that other transactional engines may use it, and to keep storage engine implementations and binary logs decoupled. This couples the implementation of InnoDB and the binary log.
As far as the performance issues go, you say we are writing the data 6 times. With the current solution, we write the data four times, three for InnoDB and once to the binary log. So really, there seems to be a
Agree.
So as I said, it is tempting. Just use a table in the storage engine, and get all the transactional consistency for free. We have a lot of nasty problems in current MySQL/MariaDB because of things that write to all kinds of different files, rather than use a common transactional framework.
So what I am saying - I thought a lot about this and similar ideas a couple of years ago. I ended up deciding not to go this way, because of the above-mentioned problems and probably others that I have forgotten. But it is not clear that it cannot work.
So in summary, these seem to be the non-trivial (and very legitimate) issues:

- This would be revolutionary, as opposed to evolutionary, making it a risky project to undertake.
- Legacy (non-GTID) replication requires a consistent order of transactions in the binary log, making it difficult to work with this feature.

Are there any other issues?
Of course, there will be quite a lot of practical work to move the binlog to a storage engine. I suppose you have in mind a general extension to the storage engine API so that other engines could own the binlog as well? A lot of existing infrastructure and tools would need to deal with binlog tables rather than binlog files.
I actually have no APIs in mind. I was hoping this would work with existing APIs writing data to a table and reading data from a table.
Maybe the decisive factor was mostly that keeping the binlog, and slowly adding improvements, can be done in small, evolutionary steps. So it seems the more realistic approach, compared to a revolutionary approach of completely rewriting the binlog.
I do not have the final answer. It would definitely be nice to see us move towards being more transactional. It is a hard journey though.
- Kristian.