Quoting Henrik Ingo <henrik.ingo@avoinelama.fi>:
Meta discussion first, replication discussion below :-)
On Mon, Mar 22, 2010 at 4:41 PM, Alex Yurchenko <alexey.yurchenko@codership.com> wrote:
Uh, I'm not sure I can accept this proposition. At least it seems contradictory to MariaDB's vision of being a practical, user- and customer-driven database.
I do understand the desire to marry marketing to software design, but they are simply unrelated areas of human activity. "Computer science" is called "science" because there are real laws which no marketing genius can invalidate. So YMMV.
It is not marketing. Science can produce things with practical value, and things with little or no practical value. We want to produce things with practical value.
As I see it, for real world applications, you should always start with
I never suggested implementing a model without connection to use cases, and I believe I went to sufficient lengths to explain how the proposed model can satisfy a broad range of use cases. What I was saying is that you're always programming a model, not use cases, and therefore anything that you want to implement must be expressed in terms of the model.
This is true. Skipping the part where you create a model leads to chaos.
In this connection saying that you have a use case that does not need linearly ordered commits really means nothing. Either you need to propose another model, live with linearly ordered commits or drop the case. Either way it has no effect on the design of this model implementation, because linearly ordered commits IS the model. You cannot throw them out without breaking the rest of the concept. So much for the usefulness of use cases in high-level design: some of them fit, some of them don't.
I'm not sure about where Kristian is, but at least my participation is based on the assumption that we are still exploring the proposed model to see if we like it or whether we should modify it or have a different model. This assessment is based on asking which use cases are served well by the model.
I'm also a fan of abstract thinking though. Sometimes you can get great innovations from starting with a nice abstract model, and then ask yourself which real world problems it would (and would not) solve.
And that's exactly what I'm trying to do in this thread - start with a model, not use cases.
Either way, you end up anchoring yourself in real world use cases.
Well, when you start with a model, it means that you use it as a reference stick to accept or reject use cases, doesn't it? So that makes the model an anchor. And leaves use cases only as means to see how practical the model is.
No, this is what I disagree with. You could propose a model that is sound in a theoretical sense, but useless in practice because it doesn't serve any use cases that real world users are interested in. So the use cases are the reference stick to accept or reject the model. But the full set of use cases is also not set in stone. We can decide that we like a model because it serves many use cases and then reject the use cases not served by it.
And there is another curious property to models: the more abstract the model is (i.e. the less it is rooted in use cases), the more use cases it can satisfy. Once you stop designing specifically for asynchronous replication, you find out that the same scheme works for synchronous too.
True. Abstract thinking sure is a win, there's no question about that. But universities are also full of those scientists who produce little of practical value. I worked one year at HUT - it was the most relaxed job I ever had, there is no requirement to produce anything useful unless you really want to. My master's thesis contributes something new to the field of eLearning that nobody had researched before. But if I had to explain the main points of it in a business world, I could do so in 60 seconds. The rest is just "scientific fluff".
Good science has practical value (sometimes apparent only after decades). But not everything that happens in science is good science.
Back on track: So the API should of course implement something which has as broad applicability as possible. This is the whole point of questioning you, since now you have just suggested a model which happens to nicely satisfy Galera's needs :-)
Well, this may seem like it because Galera is the only explicit implementation of that model. But the truth is Galera is possible only because this model was explicitly followed. And this model didn't come out of thin air. It is a result of years of research and experience - not only ours.
Yes. The model certainly looks sound and promising, no question about that. I think the discussion is more about corner cases.
For example, MySQL|MariaDB is already implementing a large portion of the proposed model by representing the evolution of a database as a _series_ of atomic changes recorded in a binlog. In fact it had global transaction IDs from day one. They are just expressed in a way that makes sense only in the context of a given file on a given server. Had they been recognized as global transaction IDs, implementing a mapping from a file offset to an ordinal number is below trivial. Then we would not be having 3rd party patches applicable only to MySQL 5.0. (Let's face it, global transaction IDs in master-slave replication are so trivial they are practically built in.) The reason why there is no nice replication API in MariaDB yet is that this model was never explicitly recognized. And an API is a description of a model. You cannot describe what you don't recognize ;)
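To make the "file offset to ordinal number" mapping concrete, here is a minimal sketch, not MariaDB code. It assumes each committed transaction is identified on one server by a (binlog file name, byte offset) pair and that commits are registered in commit order. The names `OrdinalMap`, `record_commit` and `position_of` are invented for this illustration.

```python
from typing import Dict, Tuple

# Hypothetical local position: (binlog file name, byte offset of the commit event).
BinlogPos = Tuple[str, int]


class OrdinalMap:
    """Assigns consecutive ordinals to binlog positions in commit order."""

    def __init__(self, first_seqno: int = 1) -> None:
        self._next = first_seqno
        self._by_pos: Dict[BinlogPos, int] = {}
        self._by_seqno: Dict[int, BinlogPos] = {}

    def record_commit(self, pos: BinlogPos) -> int:
        """Register the next committed transaction and return its ordinal."""
        if pos in self._by_pos:
            return self._by_pos[pos]  # position already known: same ordinal
        seqno = self._next
        self._next += 1
        self._by_pos[pos] = seqno
        self._by_seqno[seqno] = pos
        return seqno

    def position_of(self, seqno: int) -> BinlogPos:
        """Translate an ordinal back into this server's local position."""
        return self._by_seqno[seqno]


m = OrdinalMap()
m.record_commit(("binlog.000001", 320))
m.record_commit(("binlog.000001", 912))
seq = m.record_commit(("binlog.000002", 107))  # third commit, so ordinal 3
```

The ordinal is the part that means the same thing on every server; the (file, offset) pair stays a local detail that each server can look up for itself.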
Yes.
So in reality I am not proposing anything new or specific to Galera. I'm just suggesting to recognize what you already have there (and proposing the abstractions to express it).
And imho this joint effort is looking really promising all in all, since so many experts are exchanging their wisdom. (Not really counting myself here, although I've read many white papers about replication :-)
<cut>
So those are the requirements I could derive from having NDB use our to-be-implemented API. My conclusion from the above is that we should consider adding to the model the concept of a transaction group, for which:
-> The engine (or MariaDB server, for multi-engine transactions?) MAY provide information about which transactions had been committed within the same group.
-> If such information was provided, a redundancy service MAY process transactions inside a group in parallel or out of order, but MUST make sure that all transactions in transaction group G1 are processed/committed before the first transaction in G2 is processed/committed.
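The two rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: transactions arrive as hypothetical (group id, payload) pairs in commit order, a group id of None means no group information was provided, and `apply_in_groups` is an invented name, not part of any proposed API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical record: (group id or None, transaction payload).
GroupedTxn = Tuple[Optional[int], str]


def apply_in_groups(txns: Iterable[GroupedTxn],
                    apply: Callable[[str], None],
                    workers: int = 4) -> None:
    """Apply transactions so that each group finishes before the next starts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # groupby() collects runs of consecutive transactions with the same id.
        for gid, members in groupby(txns, key=lambda t: t[0]):
            payloads = [p for _, p in members]
            if gid is None:
                # No group information: keep strict serial order.
                for p in payloads:
                    apply(p)
            else:
                # Inside a group the order is free. list() acts as the barrier:
                # it returns only after every member has been applied.
                list(pool.map(apply, payloads))


applied: List[str] = []
lock = threading.Lock()


def record(payload: str) -> None:
    with lock:
        applied.append(payload)


log = [(1, "a"), (1, "b"), (2, "c"), (None, "d"), (None, "e")]
apply_in_groups(log, record)
```

Here "a" and "b" may land in either order, but both are applied before "c", and "d" and "e" follow serially.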
Well, that's a pretty cool concept. One way to call it is "controlled eventual consistency". But does redundancy service have to know about it?
If the redundancy service does not know about it, how would the information be transmitted by it??? For instance take the example of the binlog, which is a redundancy service in this model. If it supported this information (which it MAY do), it of course has to save it in some format in the binlog file.
First of all, these groups are just superpositions of individual atomic transactions. That is, this CAN be implemented on top of the current model.
Yes, this is the intent.
Secondly, transaction applying is done by the engine, so the engine or the server HAS to have support for this, both on the master and on the slave side. So why not keep the redundancy service API free from that at all? Consider this scheme:

Database Server    |    Redundancy Service
(database data)    |    (redundancy information)
                   |
             Redundancy API

The task of the redundancy service is to store and provide redundancy information that can be used in restoring the database to a desired state. Keeping the information and using it are two different things. The purpose of an API is to separate one part of the program from the logic of another. So I'd keep the model and the API as simple and as free from server details as possible.
What it means here: redundancy service stores atomic database changes in a certain order and it guarantees that it will return these changes in the same order. This is sufficient to restore the database to any state it had. It is up to the server in what order it will apply these changes and if it wants to skip some states. (This assumes that the changesets are opaque to redundancy service and the server can include whatever information it wants in them, including ordering prefixes)
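A toy version of that contract, purely as a sketch: the service below stores opaque byte strings in arrival order and hands them back in the same order, and it never looks inside them. The class name, the method names and the payload format are assumptions made up for this example, not a proposed API.

```python
from typing import Iterator, List, Tuple


class InMemoryRedundancyService:
    """Toy redundancy service: an append-only, ordered log of opaque changesets."""

    def __init__(self) -> None:
        self._log: List[bytes] = []

    def append(self, changeset: bytes) -> int:
        """Store one atomic change and return its ordinal (1-based)."""
        self._log.append(bytes(changeset))
        return len(self._log)

    def read_from(self, seqno: int) -> Iterator[Tuple[int, bytes]]:
        """Yield (ordinal, changeset) pairs in the order they were stored."""
        start = max(seqno, 1)
        for i in range(start - 1, len(self._log)):
            yield i + 1, self._log[i]


# Server side: the payload layout, including any ordering prefix,
# is entirely the server's business (the "group=" prefix is made up).
svc = InMemoryRedundancyService()
svc.append(b"group=1;change-a")
svc.append(b"group=1;change-b")
svc.append(b"group=2;change-c")

replay = list(svc.read_from(2))
```

Replaying from ordinal 2 returns the second and third changesets in stored order; whether the server applies them serially, in parallel, or skips intermediate states is decided on its side of the API.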
Ok, this is an interesting distinction you make.
So in current MySQL/MariaDB, one place where transactions are applied to a replica is the slave SQL thread. Conceptually I've always thought of this as "part of replication code". You propose here that this should be a common module on the MariaDB server side of the API, rather than part of each redundancy service. I guess this may make sense.
This opens up a new field of questions related to the user interface of all this. Typically, or "how things are today", a user will initiate replication/redundancy related events from the side of the redundancy service. E.g. if I want to set up mysql statement based replication, there is a set of commands to do that. If I want to recover the database by replaying the binlog file, there is a set of binlog specific tools to do that. Each redundancy service solves some problems from its own specific approach, and provides a user interface for those tasks. So I guess at some point it will be interesting to see what the command interface to all this will look like and whether I use something specific to the redundancy service or some general MariaDB command set to make replication happen.
This replication model will eventually influence the user interface. So far, in the Galera project, we have postponed user interface changes for the future, partly because our intention is to be transparent to native MySQL, and partly because we wanted to get end user requirements for the management first. For us, this MariaDB replication project comes at just the right time to lay the groundwork for replication management syntax.
At least the application of replicated transactions certainly should not be part of each storage engine. From the engine point of view, applying a set of replicated transactions should be "just another transaction". For the engine it should not matter if a transaction comes from the application, mysqldump, or a redundancy service. (There may be small details: when the application does a transaction, we need a new global txn id, but when applying a replicated transaction, the id is already there.)
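The "small detail" about ids can be sketched like this, as an assumption-laden illustration only: a hypothetical commit path that allocates a new global transaction id for local work and reuses the id that a replicated transaction already carries. Who actually allocates ids (server or redundancy service) is left open in the discussion; `Transaction` and `ToyServer` are invented names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transaction:
    changes: str
    global_id: Optional[int] = None  # already set for replicated transactions


class ToyServer:
    """Commits every transaction the same way, whatever its origin."""

    def __init__(self) -> None:
        self.last_id = 0
        self.committed: List[Transaction] = []

    def commit(self, txn: Transaction) -> int:
        if txn.global_id is None:
            # Local origin (application, mysqldump): allocate a new id.
            txn.global_id = self.last_id + 1
        # Replicated origin: the id is already there and is kept as-is.
        self.last_id = max(self.last_id, txn.global_id)
        self.committed.append(txn)
        return txn.global_id


server = ToyServer()
first = server.commit(Transaction("from the application"))
second = server.commit(Transaction("from a redundancy service", global_id=7))
third = server.commit(Transaction("from mysqldump"))
```

Apart from the id handling, the commit path does not distinguish between the three sources, which is the point being made above.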
Yes, but no. E.g. Galera replication has this strange need to use prioritized transactions for applying. The DBMS should have the responsibility to provide high priority sessions for replication appliers.