On Tue, 23 Mar 2010 10:12:53 +0200, Henrik Ingo <henrik.ingo@avoinelama.fi> wrote:
Meta discussion first, replication discussion below :-)
<cut>

I guess we can consider meta-discussion closed for now unless someone wants to add to it. I'm content ;)
So those are the requirements I could derive from having NDB use our to-be-implemented API. My conclusion from the above is that we should consider adding to the model the concept of a transaction group, where:

-> the engine (or MariaDB server, for multi-engine transactions?) MAY provide information about which transactions were committed within the same group.
-> if such information was provided, a redundancy service MAY process transactions inside a group in parallel or out of order, but MUST make sure that all transactions in transaction group G1 are processed/committed before the first transaction in G2 is processed/committed.
Well, that's a pretty cool concept. One way to describe it is "controlled eventual consistency". But does the redundancy service have to know about it?
If the redundancy service does not know about it, how would the information be transmitted by it? For instance, take the example of the binlog, which is a redundancy service in this model. If it supported this information (which it MAY do), it would of course have to save it in some format in the binlog file.
First of all, these groups are just superpositions of individual atomic transactions. That is, this CAN be implemented on top of the current model.
Yes, this is the intent.
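(To make the intent concrete, here is a minimal sketch of the ordering rule. All names are invented; this is not the actual API, just an illustration: transactions within a group commit in parallel, with a barrier at each group boundary.)

    // Sketch only: invented types, not the MariaDB/NDB API.
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct Txn { long id; long group_id; };

    // Stand-in for the engine committing one replicated transaction.
    static void apply_txn(Txn t) {
        std::printf("committed txn %ld (group %ld)\n", t.id, t.group_id);
    }

    int main() {
        // The stream is ordered by group; within a group, any order is legal.
        std::vector<Txn> stream = {{1,1},{2,1},{3,1},{4,2},{5,2}};
        std::vector<std::thread> in_flight;
        long current = stream.front().group_id;
        for (Txn t : stream) {
            if (t.group_id != current) {
                // Barrier: everything in G1 must be committed before the
                // first transaction of G2 starts.
                for (auto &th : in_flight) th.join();
                in_flight.clear();
                current = t.group_id;
            }
            in_flight.emplace_back(apply_txn, t);  // parallel within a group
        }
        for (auto &th : in_flight) th.join();
        // Note: with parallel apply, crash recovery needs the set of already-
        // committed txn ids in the current group, not just one last id.
    }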
Secondly, transaction applying is done by the engine, so the engine or the server HAS to have support for this, both on the master and on the slave side. So why not keep the redundancy service API free from it altogether? Consider this scheme:
    Database Server   |   Redundancy Service
    (database data)   |   (redundancy information)
                      |
                Redundancy API
The task of redundancy service is to store and provide redundancy information that can be used in restoring the database to a desired state. Keeping the information and using it are two different things. The purpose of API is to separate details of one part of the program from the logic of another. So I'd keep the model and the API as simple and as free from the server as possible.

What it means here: redundancy service stores atomic database changes in a certain order, and it guarantees that it will return these changes in the same order. This is sufficient to restore the database to any state it had. It is up to the server in what order it will apply these changes and whether it wants to skip some states. (This assumes that the changesets are opaque to the redundancy service and the server can include whatever information it wants in them, including ordering prefixes.)
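(A rough sketch of how thin the service's contract could then be. The names are invented; this is just the store-in-order/return-in-order guarantee with opaque changesets, written out in code.)

    // Sketch only: not a proposed interface, just the contract in code form.
    #include <cstdint>
    #include <vector>

    struct Changeset {
        uint64_t seqno;              // ordering prefix assigned by the server
        std::vector<uint8_t> data;   // opaque to the redundancy service
    };

    class RedundancyService {
    public:
        // The order of store() calls defines the order of replay.
        void store(const Changeset &cs) { log_.push_back(cs); }
        // Returns the stored changes in exactly the order they were stored.
        const std::vector<Changeset> &replay() const { return log_; }
    private:
        std::vector<Changeset> log_;
    };

    int main() {
        RedundancyService rs;
        rs.store({1, {0x01}});
        rs.store({2, {0x02}});
        return rs.replay().front().seqno == 1 ? 0 : 1;  // order preserved
    }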
Ok, this is an interesting distinction you make.
So in current MySQL/MariaDB, one place where transactions are applied to a replica is the slave SQL thread. Conceptually I've always thought of this as "part of replication code". You propose here that this should be a common module on the MariaDB server side of the API, rather than part of each redundancy service.
Yes.
I guess this may make sense.
Well, it is of course a matter of debate, but not all of the redundancy-related code has to be encompassed by the redundancy API. The main purpose of an API is to hide implementation details, and it goes both ways: we want to hide the redundancy details from the server, and likewise we want to hide the server details from the redundancy service. Thus flexibility and maintainability are achieved. And the thinner the API, the better. That is one of the reasons for identifying the model - it is the best way to see what this API should contain. To put it another way, there are APIs and there is integration code that holds them together. Like, for example, the code that we exchanged with Kristian.
This opens up a new field of questions related to the user interface of all this. Typically, or "how things are today", a user will initiate replication/redundancy related events from the side of the redundancy service. E.g. if I want to set up MySQL statement-based replication, there is a set of commands to do that. If I want to recover the database by replaying the binlog file, there is a set of binlog-specific tools to do that. Each redundancy service solves some problems from its own specific approach, and provides a user interface for those tasks. So I guess at some point it will be interesting to see what the command interface to all this will look like, and whether I'd use something specific to the redundancy service or some general MariaDB command set to make replication happen.
It depends not so much on where you draw the API line as on what aspects of the model you want to expose to the user. Most probably - all. Thus we'll need the ability to create a replication set, add plugins to its stack (perhaps first create the stack), and configure individual plugin instances. Setting variables is definitely not enough for that, so you'll need either a special set of commands, something along the lines of GRANT, or, considering that replication configuration tends to be highly structured and you'll keep it in tables, a special (don't laugh yet) storage engine through which you will be able to modify table contents using regular SQL, and which will in turn make the corresponding API calls. I think there could be a number of benefits in such an arrangement, although I'm not sure about performance.
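(Something like this toy model, to show what I mean - the names and the mapping are entirely made up: a write to a config table row turns into a call on the plugin stack.)

    // Toy illustration only: not a real storage engine, invented names.
    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>

    // Stand-in for the plugin-facing side of the redundancy API.
    static void plugin_set_option(const std::string &plugin,
                                  const std::string &option,
                                  const std::string &value) {
        std::printf("plugin %s: %s = %s\n",
                    plugin.c_str(), option.c_str(), value.c_str());
    }

    // The "engine": writing a (plugin, option, value) row becomes an API
    // call instead of a write to disk.
    struct ConfigTable {
        void write_row(const std::string &plugin, const std::string &option,
                       const std::string &value) {
            rows_[{plugin, option}] = value;
            plugin_set_option(plugin, option, value);  // side effect of the write
        }
        std::map<std::pair<std::string, std::string>, std::string> rows_;
    };

    int main() {
        ConfigTable t;
        // What e.g. "UPDATE replication_config SET value = '2' WHERE ..."
        // could map to:
        t.write_row("binlog", "sync_period", "2");
    }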
At least the application of replicated transactions certainly should not be part of each storage engine. From the engine point of view, applying a set of replicated transactions should be "just another transaction". For the engine it should not matter if a transaction comes from the application, mysqldump, or a redundancy service. (There may be small details: when the application does a transaction, we need a new global txn id, but when applying a replicated transaction, the id is already there.)
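(The parenthetical point fits in a few lines - a hypothetical commit path with invented names: the only asymmetry between a local and a replicated transaction is who supplies the global txn id.)

    // Sketch only: invented commit path, not server code.
    #include <cstdint>
    #include <cstdio>

    static uint64_t next_global_id = 1;

    // id == 0: local transaction, allocate a fresh global id.
    // id != 0: replicated transaction, the id arrived with the changeset.
    static void commit(uint64_t preassigned_id) {
        uint64_t id = preassigned_id ? preassigned_id : next_global_id++;
        std::printf("commit, global txn id %llu\n", (unsigned long long)id);
    }

    int main() {
        commit(0);   // from the application
        commit(42);  // replayed from the redundancy service
    }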
Certainly. I think this goes without question. What I meant back there was that either the engine or the server should be capable of parallel (out-of-order is interesting only if it is parallel, right?) applying, and for the purposes of recovery it will no longer be enough for the engine to just maintain the last committed transaction ID; it'll have to keep the list of uncommitted transactions from the last group.

-- 
Alexey Yurchenko, Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011