Thanks Kristian, that gives me some useful pointers and some things to think about and discuss with them when we meet tomorrow. I agree that limiting it to a subset of scenarios would be could, especially as we are only looking at proving a concept at the moment. I will keep you informed of what we find out from them. Regards Mark On 18 Mar 2014, at 08:02, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Mark Riddoch <mark.riddoch@skysql.com> writes:
we have been approached by a user with a suggestion for a project that we might do with them. They have a replication environment with a large fan out (100 slaves per master) and are looking at doing something to cut one the load on the master. the thought they had was to use MaxScale to effectively cache the bin log files, so MaxScale would act as a single Slave to the Master and would itself have a number of Slaves that read the bin log from MaxScale. My initial response was why not just use an instance of MariaDB as the intermediate node and connect slaves to one of those instances, in a tree structure. They had a few reasons why they did not want to do this;
Looking in more detail I think it is feasible, we essentially write a router module for MaxSale that acts as a slave, aching the bin log locally on the MaxScale node. We then have another router, or maybe even the same one, that relays that binlog to the real slaves. We are looking at doing something for MariaDB 10, since we would want to utilise the GTID to get the best failover semantics, I am assuming there must be some differences between the replication stream that I can find documented for MySQL and what MariaDB 10 does? Is this stuff anywhere on the Knowledge Base? I tried looking but
Right, so I agree with what Serg said, that maybe going back to the original problem and seeing if there are better solutions possible would be better. However, I will try to answer the technical aspect here.
I suppose caching and routing binlogs outside of the server should be possible, but it might be more complex than you would think at first.
I am not aware of any comprehensive documentation for this. However, it should be relatively easy to see from the code. Basically, it is the function mysql_binlog_send() in sql/sql_repl.cc that handles sending binlog data to a slave. There are good comments in there describing many of the trickier points.
You need to be aware that there is more logic in sending binlog data to a slave than just streaming raw binlog files. Maybe you can implement a subset and document any limitations.
The old-style replication (not using GTID) is the simplest, but it still does a bit extra, like sending some extra events (FORMAT_DESCRIPTION_EVENT and fake ROTATE_EVENT).
There is also the handling of old slaves. A slave sends a value in @mariadb_slave_capability, and the master will rewrite or remove any events that the slave does not understand, as appropriate.
And there is the @@skip_replication flag; when the slave sets this, the master removes events that were logged with @@skip_replication set (to reduce bandwidth needs).
The slave also does several SQL queries to obtain additional information. For example, the slave does a SELECT binlog_gtid_pos(file, offset) to obtain the GTID position associated with a given old-format replication position, so that it knows how to switch to GTID mode later. Maybe those queries could be handled by relaying them back to a real master server.
The GTID mode is significantly more complex, due to a GTID position having multiple streams, one per replication domain. The slave sends @slave_connect_state, @slave_gtid_strict_mode, @slave_gtid_ignore_duplicates, and optionally @slave_until_gtid values. The router would need to parse these values and act accordingly.
The binlog sender needs to keep track of each domain as the binlog files are scanned. Events in a domain must be skipped until the point specified in @slave_connect_state is reached. Then events are sent until the point in @slave_until_gtid is reached, at which point further events must again be skipped.
The master also sends extra GTID_LIST events, which contain the state of the binlog at the point of the event, which is needed by the slave to correctly handle START SLAVE UNTIL master_gtid_pos=XXX and MASTER_GTID_WAIT().
A lot of effort was put into error handling, to support all reasonable, and even many unreasonable uses of replication, while still giving an error in cases where things are obviously wrong (to prevent silently doing the wrong thing or hanging).
For example, if the slave tries to connect at a GTID that the master does not have, this is normally an error. This is important, because that GTID might be some transaction that was executed manually on the slave by mistake; if it only exists on the slave, then the master would endlessly skip events looking for the GTID that never shows up, and replication would silently hang.
However, there are special cases where the slave _is_ allowed to connect at a GTID that does not exist on the master. For example, one might have a slave with --log-slave-updates=0. Then if this slave is promoted as a new master, the binlogs will not have the current position, but it is still allowed to connect to it at the point where it became the new master (but not to a point earlier than that). Another special case is if the requested GTID (but no GTIDs following that) was purged from the binary logs, as can easily happen if a replication domain is unused for long time.
A _full_ implementation of a router for sending binlog to slaves would be some effort, but it might be reasonable to implement some subset. I suppose the first step would be for you to go through the code in mysql_binlog_send() and the functions it calls, and decide on which parts to implement. I will be happy to answer any questions that might pop up on the way, of course.
2. They wanted to maintain the same group commit groups as the original master, so as to benefit from the parallel replication.
Note that this may no longer apply to MariaDB 10.0.9. The issue is that an intermediate slave that is itself a master may have less group commit than the original master, which reduces the possibility for a third-level slave to efficiently do parallel replication. But in 10.0.9, it should be possible to configure the intermediate slave with --binlog-commit-wait-* to get better group commit.
3. They think that can get better failover semantics because they will have n (maybe 5 or 10) MaxScales and they will all have the same bin log, so failover between them will be easier.
I am curious how it will be assured that they all have the same binlog in case of crash? But maybe you just mean that all will be a subset of the largest one.
Hope this helps,
- Kristian.