MDEV-34705: Storing binlog in InnoDB
I am in the early stages of a project to hopefully implement a new binlog format that is managed transactionally by InnoDB. This is a complex change, and I want to solicit comments early, so I encourage feedback.

The TL;DR of this:

- The current binlog format stores events separately from InnoDB. This requires a complex and _very_ costly 2-phase commit protocol between the binlog and InnoDB for every commit.
- The binlog format is rather naive, and not well suited as a high-performance transactional log. It exposes internal implementation detail to the slaves, and its flat-file nature makes it inflexible to read (for example, finding the last event requires a sequential scan from the start). Replacing it with something new has the potential to reap many benefits.
- This project proposes to extend the storage engine API to allow an engine to implement binlog storage. It will be implemented in InnoDB so that binlog events are stored in InnoDB tablespaces, and the standard InnoDB transactional and recovery mechanisms are used to provide crash safety and optionally durability.
- This mechanism will allow full crash-recovery and consistency without any fsync per commit (--innodb-flush-log-at-trx-commit=0), providing a huge improvement in throughput for applications that do not require strict durability. And it will reduce the fsync requirement to one per commit for a durable config, with greatly improved opportunities for group commit.

Below is a long and somewhat raw write-up of the current state of my design. There was some interest expressed in learning more about this work, an interest that I'm very pleased with ;-).

The current state of my prototype implementation is available in the branch knielsen_binlog_in_engine:

https://github.com/MariaDB/server/commits/knielsen_binlog_in_engine

Architectural considerations.

The fundamental problem with the binlog is that it functions as a transactional log, but it is separate from the InnoDB transactional log. This is far from optimal.
To ensure consistency in case of crash, a two-phase commit protocol is needed that requires two fsyncs per commit, which is very costly in terms of performance. And it causes a lot of complexity in the code as well. We should make it so that there is only a single transactional log ("single source of truth") used, at least for the case where only a single storage engine is used (eg. InnoDB) for all tables.

Some different options for doing this can be considered (explained in detail below):

1. Implement a new binlog format that is stored in a new InnoDB tablespace type (or more generally in any storage engine that implements the appropriate new API for binlog).
2. Keep the current binlog format, extend the InnoDB redo log format with write-ahead records for binlog writes. Integrate binlog writes with InnoDB checkpointing and buffer pool flush.
3. Store the binlog events directly inside the InnoDB redo log. Implement log archiving of the InnoDB log to preserve old binlog data as long as desired.
4. (Mentioned only for completeness): Implement a general redo log facility on the server level, and change both InnoDB and binlog to use that facility.

I am currently working on option (1), inspired by talks I had with Marko Mäkelä at the MariaDB fest last Autumn. He suggested that a new type of tablespace be added to InnoDB which would hold the binlog, and explained that this would require relatively small changes in the InnoDB code.

I think this is the best option. The InnoDB log and tablespace implementation is very mature, and it is very good to be able to re-use it for the binlog. The integration into InnoDB looks to be very clean, and all the existing infrastructure for log write, recovery, checkpointing, and buffer pool can be re-used directly. All the low-level detail of how the binlog is stored on disk can be handled efficiently inside the InnoDB code.

This option will require a lot of changes in the binlog and replication code, which is the main cost of choosing this option.
I have a _lot_ of experience with the MariaDB replication code, and I think this is feasible. And I think there is another set of huge benefits possible from implementing a new binlog format. The existing binlog format is very naive and neither very flexible, nor well suited as a high-performance transactional log. So changing the binlog format is something that should be done eventually anyway.

Option (2) would also be feasible I think, and has the advantage of needing fewer changes to the replication code. Monty mentioned seeing a patch that sounded something like this. The main disadvantage is that I think this will require complex interaction between low-level InnoDB checkpointing code and low-level binlog writing code, which I fear will eventually lead to complex code that will be hard to maintain and will stifle further innovation on either side. It also doesn't address the other problems with the current binlog format. To me, letting the storage engine control how it stores data transactionally - including the binlog events - is the "right" solution, and worth pursuing over a quicker hack that tries to avoid changing code.

Option (3) is also interesting. The idea would be that the binlog becomes identical to the InnoDB redo log, storing the binlog data interleaved with the InnoDB log records. This is conceptually simpler than options (1) and (2), in the sense that we don't have to first write-ahead-log a record for modifying binlog data, and then later write the actual data modification to a separate tablespace/file.

But the InnoDB log is cyclic, and old data is overwritten as the log LSN wraps over the end. Thus, this option (3) will require implementing log archiving in InnoDB, and I am not sure how much work that will require. Either the binlog data could be archived by copying it out asynchronously to persistent binlog files, which then perhaps becomes somewhat similar to option (1) or (2).
Or InnoDB could be changed to create a new log file at wrap-over, renaming the old one to archive it, though this will leave the binlog data somewhat "diluted", mixed up with InnoDB internal redo records.

One pragmatic reason I've chosen option (1) over this option (3) for now is that I am more comfortable with doing large changes to replication code than to InnoDB code. Also, option (3) would more drastically change the way users can work with the binlog files, as a large amount of the most recent data would be sitting inside the InnoDB redo log and not in dedicated binlog files, unlike options (1) and (2). Still, option (3) is an interesting idea and has the potential to reduce the write amplification that will occur from (1) and (2). So I'm eager to hear any suggestions around this.

Option (4) I mention only for completeness. In some sense, the "clean" way to handle transactional logging would be if the server layer provided a general logging service, and all the components then used this for logging: InnoDB and the binlog, but also all the other logs or log-like things we have, eg. Aria, DDL log, partitioning, etc. However, I don't think it makes sense to try to discard the whole InnoDB write-ahead logging code and replace it with some new implementation done on the server level. And I'm not even sure it makes sense from a theoretical point of view: the logging method used is intrinsically a property of the storage engine, and it seems dubious whether a general server-level log facility could be designed that would be optimal for both InnoDB and RocksDB, for example.

High-level design.

Introduce a new option --binlog-storage-engine=innodb. Extend the storage engine API with the option for an engine to provide a binlog implementation, supported by InnoDB initially. Initially the option cannot be changed dynamically.

When --binlog-storage-engine=innodb, the binlog files will use a different format, stored as a new type of InnoDB tablespace.
It should be supported that the old part of the binlog is in the legacy format and the new part is in the InnoDB format, to facilitate migrations. Maybe also support going back again, though this is less critical initially.

A goal is to clean up some of the unnecessary complexity of the legacy binlog format. The old format will be supported for the foreseeable future, so the new format can break backwards compatibility. For example, the name of binlog files will be fixed, binlog-<NNNNNN>.ibb, so each file can be identified solely by its number NNNNNN. The option to decide the directory in which to store the binlog files should be supported though.

I think we can require the slaves to use GTID when --binlog-storage-engine=innodb is enabled on the master. This way we can avoid exposing slaves to internal implementation details of the binlog, and the slaves no longer need to update the file relay-log.info at every commit. We should also be able to get rid of the Format_description_log_event and the Rotate_log_event at the start of every binlog file, so that the slave can view the events from the master as one linear stream and not care about how the data is split into separate binlog files on the master.

The binlog format will be page based. This will allow pre-allocating the binlog files and efficiently writing them page by page. And it will be possible to access the files without scanning them sequentially from the start; eg. we can find the last event by reading the last page of the file, binary-searching for the last used page if the binlog file is partially written.

The GTID indexes can be stored inside the same tablespace as the binlog data (eg. at the end of the tablespace), avoiding the need for a separate index file. GTID index design is still mostly TBD, but I think it can be implemented so that indexes are updated transactionally as part of the InnoDB commit and no separate crash recovery of GTID indexes is needed.
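As an illustration of the page-based lookup mentioned above, here is a minimal sketch of binary-searching a pre-allocated file for its last used page. Everything specific in it is an assumption made for the example and not taken from the prototype: the 16 KiB page size, the convention that a never-written page is all zero bytes, the six-digit zero padding in the file name, and the helper names themselves.

```python
import io

PAGE_SIZE = 16384  # assumed page size, for illustration only


def binlog_file_name(file_no):
    # Assumption: NNNNNN is the file number zero-padded to six digits.
    return f"binlog-{file_no:06d}.ibb"


def page_is_used(f, page_no):
    # Assumption: a pre-allocated page that was never written is all zeros.
    f.seek(page_no * PAGE_SIZE)
    return any(f.read(PAGE_SIZE))


def last_used_page(f, num_pages):
    """Return the index of the last used page, or -1 if no page is used.

    Relies on used pages forming a prefix of the file, which holds when
    the binlog is written strictly page by page from the start.
    """
    lo, hi = 0, num_pages  # invariant: pages < lo are used, pages >= hi are not
    while lo < hi:
        mid = (lo + hi) // 2
        if page_is_used(f, mid):
            lo = mid + 1
        else:
            hi = mid
    return lo - 1


def make_file(used_pages, total_pages):
    # In-memory stand-in for a partially written, pre-allocated binlog file.
    used = b"\x01" * (used_pages * PAGE_SIZE)
    unused = b"\x00" * ((total_pages - used_pages) * PAGE_SIZE)
    return io.BytesIO(used + unused)
```

With a 64-page file of which 5 pages are written, the search touches at most 6 pages instead of scanning all 64.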
With GTID indexes being guaranteed available, we can use them to obtain the GTID state at the start of each binlog file, and avoid the need for the Gtid_list_log_event at the start of the binlog.

With a page-based log with an extensible format, metadata can be added to the binlog that is only used on the master, without having to introduce new replication events that are not relevant to the replication on the slave. This can be used eg. to eliminate the Binlog_checkpoint_log_event. Possibly the binlog checkpoint mechanism can be completely avoided for the --binlog-storage-engine=innodb case, since there is no more 2-phase commit needed in the common case of an InnoDB-only transaction.

The existing binlog checksum and encryption will no longer be used; instead the standard InnoDB checksums and encryption will be reused.

Implementation plan.

The storage engine API must be extended to provide facilities for writing to the binlog and for reading from the binlog. The design of the API is TBD, and should be discussed to try to make it generic and suitable for other engines than InnoDB.

When writing to the binlog in InnoDB, the central idea is to use the same mini-transaction (mtr) that marks the transaction as committed in InnoDB. This is what makes the binlog data guaranteed consistent with the InnoDB table data without the need for two-phase commit.

The initial version I think can use the existing binlog group commit framework; this will simplify the implementation. This will thus keep the LOCK_commit_ordered and the queue_for_group_commit() mechanisms. Later work can then be to see if this can be even further improved in terms of scalability.

I want to implement that large event groups can be split into multiple chunks in the binlog that no longer need to be consecutive. The final chunk will be the one that contains the GTID and marks the event group binlogged; this final chunk will then refer back to the other parts.
This way, a large transaction can be binlogged without stalling the writing of other (smaller) transactions in parallel, which is a bottleneck in the legacy binlog. And it avoids problems with exceeding the maximum size of an mtr.

In the first version, I think the binlog reader in the dump thread will collect and concatenate the different chunks before sending them to the slave, to simplify the initial implementation. But a later change could allow the slave to receive the different chunks interleaved between different event groups. This can even eventually allow the slave to speculatively execute events even before the transaction has committed on the master, to potentially reduce slave lag to less than the time of even a single transaction; this would be a generalisation of the --binlog-alter-two-phase feature.

The binlog will be stored in separate tablespace files, each of size --max-binlog-size. The binlog files will be pre-allocated by a background thread. Since event groups can spill over to the next file, each file can be a fixed size, and hopefully also the rotate events will no longer be necessary.

A busy server will quickly cycle through tablespace files, and we want to avoid "using up" a new tablespace ID for each file (InnoDB tablespace IDs are limited to 2**32 and are not reused). Two system tablespace IDs will be reserved for the binlog, and new binlog files will alternate between them. This way, the currently written binlog can be active while the previous one is being flushed to disk and the remaining writes checkpointed. Once the previous log has been flushed completely, its tablespace ID can be re-used for the next, pre-allocated binlog file and be ready for becoming active. This way, the switch to the next binlog file should be able to occur seamlessly, without any stalls, as long as page flushing can keep up.
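The alternation between two reserved tablespace IDs can be sketched as a small state machine. This is only a model of the rule described above, not the prototype's code: the class and method names and the two ID values are hypothetical, chosen for the example.

```python
class BinlogFileCycle:
    """Toy model: binlog file N uses reserved tablespace ID ids[N % 2]."""

    def __init__(self, ids=(1001, 1002)):  # hypothetical reserved IDs
        self.ids = ids
        self.active = 0          # file number currently being written
        self.preallocated = 0    # highest file number already pre-allocated
        self.flushed_upto = -1   # highest file number fully flushed to disk

    def tablespace_id(self, file_no):
        return self.ids[file_no % 2]

    def mark_flushed(self, file_no):
        # Called once all pages of file_no are flushed and checkpointed.
        self.flushed_upto = max(self.flushed_upto, file_no)

    def try_preallocate(self):
        """Background thread: pre-allocate the next file if its ID is free."""
        nxt = self.preallocated + 1
        # File nxt re-uses the ID of file nxt-2, which must be fully flushed.
        if nxt - 2 > self.flushed_upto:
            return False
        self.preallocated = nxt
        return True

    def try_rotate(self):
        """Writer: move to the next file; False means the writer would stall."""
        if self.preallocated <= self.active:
            return False
        self.active += 1
        return True
```

In this model a writer only stalls in try_rotate() when flushing of the file before the previous one has not yet completed, which matches the "as long as page flushing can keep up" condition.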
The flushing of binlog pages will be prioritised, to avoid stalling binlog writes, to free up buffer pool pages that can be used more efficiently than holding binlog data, and to quickly make the binlog files readable from outside the server (eg. with mysqlbinlog).

The binlog dump thread that reads from the binlog and sends the data to slaves will use a binlog reading API implemented in InnoDB. I would prefer to read directly from the binlog files, in order to reduce pressure on the InnoDB buffer pool etc. A slave that connects from an old position back in time may need to read a lot of old data from the binlogs; there is little value in loading this data into the buffer pool, evicting other more useful pages.

The reader can look up the page in the InnoDB buffer pool with the BUF_GET_IF_IN_POOL flag. This way, the data can be accessed from the buffer pool if it is present. If not present, we can be sure that the data will be in the data file, and can read it from the file directly. If the data is known to be already flushed to disk before the specific binlog position, then the buffer pool lookup can be skipped altogether.

The mysqlbinlog program will need to be extended somehow to be able to read from an InnoDB tablespace (or in general from another storage engine). I think this means mysqlbinlog needs some kind of engine plugin facility for reading the binlog. Note that the --read-from-remote-server option will also be available, to read data with a mysqlbinlog that doesn't support the new format, or to read the very newest data before it gets written to disk.

Final words.

This is still an early draft of the feature, as it is being worked on and refined; things are subject to change. And this also means that things are *open* to change, according to any suggestions with a good rationale.
I wanted to give an update on the progress of my work on MDEV-34705, which is a task to implement a new binlog format that is stored as an InnoDB tablespace (or in another engine that chooses to implement it).

To recap, the motivation includes removing the costly 2-phase commit between binlog and InnoDB; making replication crash-safe even when --innodb-flush-log-at-trx-commit=0 (or 2) and --sync-binlog=0; removing unnecessary complexity in the legacy binlog implementation; and removing limitations in the legacy binlog to facilitate future developments for replication.

The design is described in https://jira.mariadb.org/browse/MDEV-34705 , and the code is developed in https://github.com/MariaDB/server/commits/knielsen_binlog_in_engine.

A few weeks ago I reached a major milestone with the first working replication from an InnoDB-implemented binlog on the master to a slave. I'm currently half-way through the last big piece of the puzzle, which is to be able to split event groups into multiple pieces interleaved with other event groups in the binlog. After that there will still be many details to be implemented, as the binlog implementation is visible in many user-facing places (which is one part of the problem with the legacy binlog). So good progress, but also lots of work left still.

I want to point out some design decisions that significantly change how the new binlog works compared to the legacy one, to facilitate the discussion. Remember that an important goal is to remove some of the unnecessary complexity of the legacy binlog, so support will be dropped for some things, which will be controversial, but the design is still open to suggestions with solid technical arguments.

1. I intend to remove the option to set the base name of binlog files. File names will be set by the storage engine (I'm using "binlog-NNNNNN.ibb" in the current code), and identified only by their (64-bit) number. This avoids the need for the master-bin.index file.
The need to keep track of different base file names for different binlog files creates a _lot_ of complexity in the legacy binlog (and there are still a number of bugs due to this). It must still be possible to set the directory containing the binlog files (but it will not be possible to split the binlog amongst multiple directories).

2. There will be some delay from commit until the binlog data is readable externally from the file. This is kind of inherent in the desire to speed up binlogging exactly by delaying the physical disk write (aka fsync()). Using mysqlbinlog --read-from-remote-server will work as before (eg. it will be able to see committed transactions immediately). The code will though try to flush binlog pages to disk with high priority, so the delay will usually be small.

3. Binlog rotations, which are quite complex in the legacy binlog, will be mostly invisible. Binlog tablespace files are pre-allocated in the background, and will always have a fixed size (--max-binlog-size). Binlog writes pass seamlessly from the end of one file to the start of the next, and replication events can be split across binlog files.

4. I am thinking of requiring GTID mode in the new binlog format, disallowing slaves from connecting using filename/offset. This is not a hard decision yet; technically I think it is not too hard to keep filename/offset. But removing it will reduce complexity and potentially allow future storage engines to implement their own binlog format that does not map well to filename/offset.

5. A more controversial thought is to drop support for semi-sync replication. I think many users use semi-sync believing it does something more than the reality. Instead of semi-sync, users can always just SELECT MASTER_GTID_WAIT(@@last_gtid) on a slave to get arguably better functionality. And the semi-sync implementation has always been problematic (IMHO), what with sending the actual binlog filename string back and forth with every commit, and causing much complexity and many bugs.
Less controversial will be to release the first version without semi-sync support and add it later.

6. Large event groups (configurable, currently using --binlog-cache-size) will be written out-of-band into the binlog during query execution. This means the event group for the transaction can be binlogged in different pieces that can be interleaved with other event groups. This removes the limitation that even a huge transaction must be binlogged as a single consecutive event group in a single binlog file (which can stall other commits). It also allows a future (not in first release) enhancement where optimistic parallel replication could optionally start applying a large transaction on the slave while it is still executing on the master. In the first version, the dump thread on the master will assemble the pieces before sending to the slave. This means that if an active binlog file N contains a commit that references event data written to file (N-k), then binlog purge will be blocked not just from N, but from N-k. It also means that if a large transaction ends up being rolled back, then this will leave extra unused data in the binlog files until purged. I think this is a good trade-off, but it's easy to add an option to disable the out-of-band binlogging, if desired in some special uses.

7. For migration to the new binlog, I want to allow that the old part of the binlog is in the legacy format, and the new part is using the new implementation. This is to allow switching a replication setup to use the new implementation by simply stopping and restarting the master with the new option --binlog-storage-engine=innodb, so that the slaves can pick up from where they left off. I also want to leave a way to roll back, that is, for users to disable the new binlog and go back to the legacy one in case of problems. But I want to avoid a binlog that goes back-and-forth between different formats (only allow a single point where it switches from legacy to new).
So current thinking is that rolling back to the legacy format will be done with a script that converts any newly written binlog files in the new format to the legacy format while traffic is stopped.

As always, comments and suggestions very welcome.

 - Kristian.
Hi,

On 12/4/24 13:19, Kristian Nielsen via developers wrote:
> 5. A more controversial thought is to drop support for semi-sync replication. I think many users use semi-sync believing it does something more than the reality. Instead of semi-sync, users can always just SELECT MASTER_GTID_WAIT(@@last_gtid) on a slave to get arguably better functionality. And the semi-sync implementation has always been problematic (IMHO), what with sending the actual binlog filename string back and forth with every commit, and causing much complexity and many bugs. Less controversial will be to release the first version without semi-sync support and add it later.
As a (kind of) user of semi-sync replication, I believe it has a valid, albeit limited, use-case and that it's a necessary component in setups where no transactions are allowed to be lost when the primary node in a replication cluster goes down. Perhaps I'm wrong or the way it works isn't how I think it does, so I'll try and describe what I think it does and what it doesn't do.

When semi-synchronous replication is enabled and configured correctly (by default it isn't), the only thing that it guarantees is that if a client receives an OK for the COMMIT, it means that at least one other server in the replication topology has successfully received that transaction. This means that if the server where the transaction was committed goes down and never comes back up again, it's still possible to find the committed transaction on another node and then fail over the cluster to replicate from that node. Doing this manually requires inspecting the GTIDs and waiting for the relay log to be applied, but there are automated solutions for this (e.g. MaxScale).

What it doesn't do is guarantee that the event has been applied when the OK is received. It might not even get applied if the server that ended up receiving the event has diverged from the rest of the cluster and fails to apply it. The delay in the application of the event is a separate problem (read causality) that can be solved with different methods (e.g. MASTER_GTID_WAIT). The problem of a "broken" server receiving an event and not being able to apply it is a tougher problem to solve, but I believe it can be mostly avoided by making replicas read-only and requiring strict GTID ordering.

I think that the current default values for semi-sync replication are not useful for this use-case and I think that a lot of the misunderstanding comes from this. The default value of rpl_semi_sync_master_wait_point should be AFTER_SYNC (lossless failover) and rpl_semi_sync_master_timeout should be set to something a lot higher than 10 seconds.

Personally, I think setting it to the maximum value is the safest, as it prevents writes if they might not survive an outage of the node being written on. This does mean that a two node setup cannot tolerate a loss of a node without going read-only, but this is an unavoidable fact and trying to work around it would render semi-sync useless.

The documentation could also be improved to clearly state what it does and to describe a use-case for the feature. Right now the Semisynchronous Replication page (https://mariadb.com/kb/en/semisynchronous-replication/) mentions that "semisynchronous replication therefore comes with some negative performance impact, but increased data integrity", which doesn't adequately explain why one would want to use it.

Markus

-- 
Markus Mäkelä, Senior Software Engineer
MariaDB Corporation
Markus Mäkelä via developers <developers@lists.mariadb.org> writes:
> On 12/4/24 13:19, Kristian Nielsen via developers wrote:
>> 5. A more controversial thought is to drop support for semi-sync replication. I think many users use semi-sync believing it does something
> As a (kind of) user of semi-sync replication, I believe it has a
Hi Markus, thanks for taking the time to comment! Your input is very valuable.
> valid, albeit limited, use-case and that it's a necessary component in setups where no transactions are allowed to be lost when the primary node in a replication cluster goes down. Perhaps I'm wrong or the way
I would like to be explicit about what "no transactions are allowed to be lost" means. I know you, Markus, fully understand what it means, of course. Transactions can easily be lost if the server crashes up to and during the commit. What it really means is that the server will send a notification to the client at some point when a single point of failure will no longer cause the transaction to be lost. With semi-sync, this notification comes in the form of the "ok" result of the client's commit.

I want to understand if there are other, possibly better ways to get this notification, if that is all the relevant applications need?

I was suggesting that the application could itself use MASTER_GTID_WAIT() against a slave before accepting the commit as "ok" (or a proxy like MaxScale could do it for the application). Does the current semi-sync replication do anything more for the application than this, and if so, what?

One benefit of this method is that each commit can decide whether it needs to wait or not. One commit that "is not allowed to be lost" will not block other transactions from committing. I think with AFTER_SYNC, all following transactions will be blocked from committing until the current commit has been acknowledged by a slave, and that with AFTER_COMMIT they will not be blocked, but I'm not 100% sure.
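To make the application-side alternative concrete, here is a minimal sketch of "commit, then wait on a slave". It assumes `master` and `replica` are Python DB-API 2.0 connections from a driver that uses the "%s" parameter style; the function name and the timeout default are made up for the example. @@last_gtid and MASTER_GTID_WAIT() are the existing MariaDB facilities named above; MASTER_GTID_WAIT() returns 0 once the GTID is applied and -1 on timeout.

```python
def commit_and_wait(master, replica, timeout_s=5.0):
    """Commit on the master, then wait until the replica has applied it.

    Returns True if the replica confirmed the GTID within timeout_s seconds,
    False if MASTER_GTID_WAIT() timed out.
    """
    master.commit()

    # GTID of the last transaction binlogged by this session on the master.
    cur = master.cursor()
    cur.execute("SELECT @@last_gtid")
    (gtid,) = cur.fetchone()

    # Block on the replica until that GTID is applied, or until timeout.
    cur = replica.cursor()
    cur.execute("SELECT MASTER_GTID_WAIT(%s, %s)", (gtid, timeout_s))
    (status,) = cur.fetchone()
    return status == 0
```

Because the wait is issued per call, a commit that does not need the guarantee can simply skip it, and waits for different commits can run on different connections in parallel.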
> misunderstanding comes from this. The default value of rpl_semi_sync_master_wait_point should be AFTER_SYNC (lossless failover) and rpl_semi_sync_master_timeout should be set to something
I would like to understand the reason(s) AFTER_SYNC is better than AFTER_COMMIT.

From my understanding, from the client's narrow perspective about their own commit there is little difference; either is a notification that the transaction is now robust to a single point of failure (available on at least two servers).

I know of one use case, which is when things are set up so that if the master crashes, failover to a slave is _always_ done, and the crashed master is changed to be a slave of the new master (as opposed to letting the master restart, do crash recovery, and continue its operation as a master).

With AFTER_COMMIT, the old master might have a transaction committed that does not exist on the new master, which will prevent it from working as a slave, and it will need to be discarded (possibly restored from a backup).

With AFTER_SYNC, the old master may still (after restarting) have a transaction committed to the binlog that is not on the slave / new master. But the old master can be restarted with --rpl-semi-sync-slave-enabled, which tries to truncate the binlog to discard as many transactions from it as possible, to make sure it only has transactions that are also present on the new master. (Interestingly, this means that the purpose of AFTER_SYNC is to ensure that transactions _are_ lost, rather than ensure that they are _not_ lost).

Is this the (only) reason that AFTER_SYNC should be default? Or do you know of other reasons to prefer it?

Now, with the new binlog implementation, there is no longer any AFTER_SYNC. The whole point of the feature is to make the binlog commit and the InnoDB commit atomic with each other as a whole; there is no point at which a transaction is durably committed in the binlog and not committed in InnoDB. So the truncation of the binlog at old master restart with --rpl-semi-sync-slave-enabled no longer applies.

But I would argue that this binlog truncation is anyway a misfeature.
If we want to ensure that the master never commits a transaction before it has been received by a slave, then send the transaction to the slave and await the slave's reply _before_ writing it to the binlog. Don't first write it to the binlog, and then add complex crash recovery code to try and remove it from the binlog again.

And doing the semi-sync handshake _before_ writing the transaction to the binlog is something that could be implemented in the new binlog implementation. It would be something like BEFORE_WRITE, instead of AFTER_SYNC (which does not exist in the new binlog implementation).

Thus, I really want to understand:

1. Is the --rpl-semi-sync-slave-enabled use case, where a crashing master is always demoted to a slave, used by users in practice, to warrant implementing something like BEFORE_WRITE semisync for the new binlog format?

2. Is there another reason that AFTER_SYNC is useful that I should know, and which needs to be designed into the new binlog format?

 - Kristian.
Hi,

On 12/4/24 18:08, Kristian Nielsen wrote:
> Markus Mäkelä via developers <developers@lists.mariadb.org> writes:
>> On 12/4/24 13:19, Kristian Nielsen via developers wrote:
>>> 5. A more controversial thought is to drop support for semi-sync replication. I think many users use semi-sync believing it does something
>> As a (kind of) user of semi-sync replication, I believe it has a
> Hi Markus, thanks for taking the time to comment! Your input is very valuable.
>> valid, albeit limited, use-case and that it's a necessary component in setups where no transactions are allowed to be lost when the primary node in a replication cluster goes down. Perhaps I'm wrong or the way
> I would like to be explicit about what it means "no transactions are allowed to be lost". I know you Markus fully understand what it means, of course.
> Transactions can easily be lost if the server crashes up to and during the commit. What it really means is that the server will send a notification to the client at some point when a single point of failure will no longer cause the transaction to be lost. With semi-sync, this notification comes in the form of the "ok" result of the client's commit.
> I want to understand if there are other, possibly better ways to get this notification, if that is all the relevant applications need?
> I was suggesting that the application could itself use MASTER_GTID_WAIT() against a slave before accepting the commit as "ok" (or a proxy like MaxScale could do it for the application). Does the current semi-sync replication do anything more for the application than this, and if so, what?

I think that implementing semi-sync in each application is probably a bit too much, but doing it in a proxy like MaxScale does sound doable and the implementation would be essentially the same: delay the OK for the commit until at least one replica responds to the MASTER_GTID_WAIT(). The number of roundtrips should be the same, so the only downside of this approach is that you're forced to wait for the SQL thread to apply the transaction, which introduces more latency than the existing semi-sync approach does. If a function like MASTER_GTID_WAIT_FOR_IO_THREAD() were to exist, it would probably be very close in terms of latency.

Another use-case that I think I heard about was to use semi-sync replication to slow down the rate of writes so that replication lag is avoided. While this is possible, I believe that tuning the group commit size to be larger probably has the same effect with better overall performance.

> One benefit of this method is that each commit can decide whether it needs to wait or not. One commit that "is not allowed to be lost" will not block other transactions from committing. I think with AFTER_SYNC, all following transactions will be blocked from committing until the current commit has been acknowledged by a slave, and that with AFTER_COMMIT they will not be blocked, but I'm not 100% sure.

I had a vague memory of the group commit mechanism doing only one ACK per group but I might have remembered it wrong, I'm mostly a passive observer to all replication related discussion in Zulip and MDEVs. If it indeed does one ACK per commit even if there's a group of transactions, then doing it at the application level might potentially perform better as the waits could be done in parallel.
misunderstanding comes from this. The default value of rpl_semi_sync_master_wait_point should be AFTER_SYNC (lossless failover) and rpl_semi_sync_master_timeout should be set to something
I would like to understand the reason(s) AFTER_SYNC is better than AFTER_COMMIT.
From my understanding, from the client's narrow perspective about their own commit there is little difference, either is a notification that the transaction is now robust to single point of failure (available on at least two servers).
Yes, I think you're right and from the point of view of the client the configuration is irrelevant: if you get the OK for the commit the transaction is "durable" on more than one server.
I know of one usecase, which is when things are set up so that if the master crashes, failover to a slave is _always_ done, and the crashed master is changed to be a slave of the new master (as opposed to letting the master restart, do crash recovery, and continue its operation as a master).
With AFTER_COMMIT, the old master might have a transaction committed that does not exist on the new master, which will prevent it from working as a slave and it will need to be discarded (possibly restored from a backup).
With AFTER_SYNC, the old master may still (after restarting) have a transaction committed to the binlog that is not on the slave / new master. But the old master can be restarted with --rpl-semi-sync-slave-enabled that tries to truncate the binlog to discard as many transactions from it as possible, to make sure it only has transactions that are also present on the new master.
(Interestingly, this means that the purpose of AFTER_SYNC is to ensure that transactions _are_ lost, rather than ensure that they are _not_ lost).
Is this the (only) reason that AFTER_SYNC should be default? Or do you know of other reasons to prefer it?
I think this is the only reason, I thought that it had a more fundamental effect on things but I think I must've remembered it only in relation to failed masters rejoining the cluster. Due to MDEV-33465, I think that my initial thoughts on this are probably wrong and the default value probably isn't as important as I imagined it would be.
I think this is the use-case that MDEV-21117 and MDEV-33465 relate to. From what I remember (in relation to MDEV-33465) having the master roll back the transactions caused some problems to happen if a quick restart happened. I think it was that if GTID 0-1-123 gets replicated due to AFTER_SYNC but then the master crashes and comes back up, it rolls back 0-1-123 due to --rpl-semi-sync-slave-enabled (or --init-rpl-role=SLAVE after MDEV-33465) but before replication starts back up, another transaction gets committed as GTID 0-1-123 on the master. Now when the replication asks for "GTID position after 0-1-123", instead of getting a "I have not seen that GTID" error, the replication continues and history effectively got rewritten. I don't remember if this was the exact problem but it was something along these lines. Looking at the description of --init-rpl-role (https://mariadb.com/kb/en/mariadbd-options/#-init-rpl-role), it seems that it can also cause replication to break.
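The GTID reuse scenario with 0-1-123 can be illustrated with a toy model. This is only a sketch: the binlog is a plain list of (GTID, statement) pairs, and `events_after` is a hypothetical stand-in for the master's lookup of a connecting slave's position, not the server's actual code.

```python
def events_after(binlog, gtid):
    """Return the events following `gtid`, as a master would for a connecting slave.

    Raises LookupError if the master has never seen the GTID.
    """
    for i, (g, _) in enumerate(binlog):
        if g == gtid:
            return binlog[i + 1:]
    raise LookupError(f"GTID {gtid} not found in binlog")


master = [("0-1-122", "INSERT a"), ("0-1-123", "INSERT b")]
slave = list(master)                      # 0-1-123 reached the slave (AFTER_SYNC)

master.pop()                              # crash recovery truncates 0-1-123
master.append(("0-1-123", "INSERT x"))    # a different transaction reuses the GTID
master.append(("0-1-124", "INSERT y"))

slave += events_after(master, "0-1-123")  # no error: the GTID exists on the master
diverged = dict(master)["0-1-123"] != dict(slave)["0-1-123"]
```

In this model the slave connects without any error and applies 0-1-124, yet the two servers hold different transactions under the name 0-1-123.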
Now, with the new binlog implementation, there is no longer any AFTER_SYNC. The whole point of the feature is to make the binlog commit and the InnoDB commit atomic with each other as a whole, there is no point at which a transaction is durably committed in the binlog and not committed in InnoDB. So the truncation of the binlog at old master restart with --rpl-semi-sync-slave-enabled no longer applies.
But I would argue that this binlog truncation is anyway a misfeature. If we want to ensure that the master never commits a transaction before it has been received by a slave, then send the transaction to the slave and await slave reply _before_ writing it to the binlog. Don't first write it to the binlog, and then add complex crash recovery code to try and remove it from the binlog again.
And doing the semi-sync handshake _before_ writing the transaction to the binlog is something that could be implemented in the new binlog implementation. It would be something like BEFORE_WRITE, instead of AFTER_SYNC (which does not exist in the new binlog implementation).
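A toy model of that ordering, as a sketch only. BEFORE_WRITE does not exist in the server; the classes, the method name `commit_before_write` and the simulated crash are all hypothetical and only show the order of the two steps.

```python
class Crash(Exception):
    """Simulated master crash."""


class Slave:
    def __init__(self):
        self.received = []

    def receive(self, gtid):
        self.received.append(gtid)
        return True  # acknowledgement sent back to the master


class Master:
    def __init__(self, slave):
        self.slave = slave
        self.binlog = []

    def commit_before_write(self, gtid, crash_after_ack=False):
        """Get the slave's acknowledgement first, then write to the local binlog."""
        if not self.slave.receive(gtid):
            raise RuntimeError("no slave acknowledgement")
        if crash_after_ack:
            raise Crash("crashed before the binlog write")
        self.binlog.append(gtid)


slave = Slave()
master = Master(slave)
master.commit_before_write("0-1-1")
try:
    master.commit_before_write("0-1-2", crash_after_ack=True)
except Crash:
    pass
```

After the simulated crash the master's binlog holds only 0-1-1 while the slave has received both transactions. The master never holds a transaction that the slave lacks, but the slave can be ahead of the master.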
Thus, I really want to understand:
1. Is the --rpl-semi-sync-slave-enabled use case, where a crashing master is always demoted to a slave, used by users in practice, to warrant implementing something like BEFORE_WRITE semisync for the new binlog format?
From what I know and have seen, it is used when something fully automatic like MaxScale is used to handle failovers and rejoining of nodes to the cluster. Without it, I think that you would eventually have to start restoring the nodes from backups once enough failovers have happened. I think the bigger problem is that, until MDEV-34878 or something similar is implemented, there's no way for the crashed master to know what its role in the cluster is as it depends on the other nodes in the cluster. If a failover did take place then the crashed master must come back as a slave and try to rejoin the cluster. If no failover took place, the crashed master must come back as a master and continue accepting writes. Since --init-rpl-role=MASTER cannot be set at runtime, the safest thing to do is to live with the consequences and accept the fact that you can't always rejoin the crashed master back into the cluster.
2. Is there another reason that AFTER_SYNC is useful that I should know, and which needs to be designed into the new binlog format?
- Kristian.
-- Markus Mäkelä, Senior Software Engineer MariaDB Corporation
Thanks a lot Markus for the additional explanations, very useful.
Markus Mäkelä via developers <developers@lists.mariadb.org> writes:
From what I remember (in relation to MDEV-33465) having the master roll back the transactions caused some problems to happen if a quick restart happened. I think it was that if GTID 0-1-123 gets replicated due to AFTER_SYNC but then the master crashes and comes back up, it rolls back 0-1-123 due to --rpl-semi-sync-slave-enabled (or --init-rpl-role=SLAVE after MDEV-33465) but before replication starts back up, another transaction gets committed as GTID 0-1-123 on the master. Now when the replication asks for "GTID position after
Yes. _Either_ we need to be sure the master is ahead of all slaves, and we keep it as the master after crash-recovery. _Or_ we need to be sure at least one slave is ahead of the master, and we promote that slave as the new master and demote the old master to a slave after crash recovery. Otherwise the replication hierarchy cannot be reliably re-assembled after a master crash.
From what I know and have seen, it is used when something fully automatic like MaxScale is used to handle failovers and rejoining of nodes to the cluster. Without it, I think that you would eventually have to start restoring the nodes from backups once enough failovers have happened.
What about the following idea?
1. Implement BEFORE_WRITE semi-sync mode. The master will not write transactions to the binlog until at least one slave has acknowledged.
2. This means that if the master crashes, when it comes back up it will have no transaction that does not exist on at least one running node (assuming at most a single failure at a time).
3. When the master restarts, it will go into read-only mode and wait for MaxScale (or other management system) to tell it what to do, similar to MDEV-34878.
4. If MaxScale decides to keep it as the master, it will briefly set it up as a slave and make sure it has replicated the latest GTID on any slave in the replication topology. Then it will be set read-write and continue as the master.
5. If MaxScale decides to promote another server as the new master, the old master is kept in read-only mode and configured as a slave. The BEFORE_WRITE ensures the old master will not be ahead of the new master.
This requires the ability in MaxScale to do (4).
I think this will be much more robust than having a crashed server try to remove transactions already written to the binlog, and having to configure the server to have one or another role when it starts up.
Instead, all servers in the replication topology always wait at startup for the manager to replicate any missing transactions from the appropriate server, and then either set it read-write as a master or continue as a slave.
What do you think? Of course, this is all for the future, it requires implementing BEFORE_WRITE in the server first. But I think it sounds promising.
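Steps 3 to 5 can be sketched as a small decision function for the manager. This is a hypothetical model, not MaxScale code: positions are reduced to a single sequence number per server (one GTID domain assumed), the actions are plain strings, and picking the most advanced server for promotion is an assumption of the sketch.

```python
def plan_restart(positions, restarted, keep_as_master):
    """List the manager's actions after `restarted` comes back up read-only.

    positions: server name -> highest GTID sequence number (single domain assumed).
    """
    others = {name: seq for name, seq in positions.items() if name != restarted}
    most_advanced = max(others, key=others.get)
    actions = []
    if keep_as_master:
        # Step 4: catch up from the most advanced slave, then resume writes.
        if others[most_advanced] > positions[restarted]:
            actions.append(f"replicate {restarted} from {most_advanced}")
        actions.append(f"set {restarted} read-write")
    else:
        # Step 5: promote another server; the old master stays read-only.
        actions.append(f"set {most_advanced} read-write")
        actions.append(f"replicate {restarted} from {most_advanced}")
    return actions
```

For example, with positions `{"A": 10, "B": 11, "C": 10}` and a restarted master `A`, keeping `A` as master first replicates it from `B`, while promoting instead makes `B` read-write and attaches `A` to it.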
I think that implementing semi-sync in each application is probably a bit too much but doing it in a proxy like MaxScale does sound doable and the implementation would be essentially the same: delay the OK for
It sounds like the new binlog-in-engine should support semi-sync (perhaps not in the first release, but eventually). It could then support AFTER_COMMIT, which would be used when a crashed server is allowed to restart and continue by itself, as is the current default. And then also support BEFORE_WRITE, where transactions are sent to the slave before being written to the binlog, and a crashed server comes up in read-only mode after restart. MaxScale could still implement its own version, but probably it is best if the new binlog implementation would also support some form of semi-sync eventually.
I had a vague memory of the group commit mechanism doing only one ACK per group but I might have remembered it wrong, I'm mostly a passive
I think it still does it for every commit, but this could be improved in the server (MDEV-33491).
- Kristian.
Hi, On 12/5/24 18:02, Kristian Nielsen wrote:
What about the following idea?
1. Implement BEFORE_WRITE semi-sync mode. The master will not write transactions to the binlog until at least one slave has acknowledged.
2. This means that if the master crashes, when it comes back up it will have no transaction that does not exist on at least one running node (assuming at most a single failure at a time).
3. When the master restarts, it will go into read-only mode and wait for MaxScale (or other management system) to tell it what to do, similar to MDEV-34878.
4. If MaxScale decides to keep it as the master, it will briefly set it up as a slave and make sure it has replicated the latest GTID on any slave in the replication topology. Then it will be set read-write and continue as the master.
5. If MaxScale decides to promote another server as the new master, the old master is kept in read-only mode and configured as a slave. The BEFORE_WRITE ensures the old master will not be ahead of the new master.
This requires the ability in MaxScale to do (4).
I think this will be much more robust than having a crashed server try to remove transactions already written to the binlog, and having to configure the server to have one or another role when it starts up.
Instead, all servers in the replication topology always wait at startup for the manager to replicate any missing transactions from the appropriate server, and then either set it read-write as a master or continue as a slave.
What do you think? Of course, this is all for the future, it requires implementing BEFORE_WRITE in the server first. But I think it sounds promising.
I think that sounds like a good idea. In step 4, instead of briefly replicating the lost changes and resuming writes on the same node, I think MaxScale could just move all writes to the node with the newest GTID and turn off read-only there, essentially performing a switchover to another node. I think that it might actually already handle this case as it can happen with AFTER_SYNC.
However, I'd imagine that this BEFORE_WRITE mode might not be super useful for manually managed replication. You'd have to always switch over to another node when a server crashes.
All in all, the BEFORE_WRITE sounds promising and we'd definitely appreciate it but also doesn't seem super useful outside of this somewhat niche use-case. However I do still think semi-sync is generally useful and thus this does seem like something that, as you said, should be implemented eventually in the binlog-in-engine mode.
I'm looking forward to see more progress updates on this, it all seems very interesting.
Markus
-- Markus Mäkelä, Senior Software Engineer MariaDB Corporation
Markus Mäkelä via developers <developers@lists.mariadb.org> writes:
I think that sounds like a good idea. In step 4, instead of briefly replicating the lost changes and resuming writes on the same node, I think MaxScale could just move all writes to the node with the newest GTID and turn off read-only there, essentially performing a switchover to another node. I think that it might actually already handle this
Agree. I took a look at the original MDEV for AFTER_SYNC, https://jira.mariadb.org/browse/MDEV-162. It mentions a different motivation for AFTER_SYNC over AFTER_COMMIT: to prevent phantom reads. I.e. make sure at least one slave has the transaction before making it visible on the master; this way if any client saw the transaction, the transaction will still be recoverable on a slave if the original master is lost.
However, I'd imagine that this BEFORE_WRITE mode might not be super useful for manually managed replication. You'd have to always switch over to another node when a server crashes. All in all, the
Yes. BEFORE_WRITE would also provide "no phantom reads". But as you say, it means that a switch-over will be required after any master crash, otherwise slaves might be ahead and replication breaks. That seems to be the price for avoiding the expensive two-phase commit between binlog and InnoDB.
I'm looking forward to see more progress updates on this, it all seems very interesting.
Agree. I'll think a bit more on the BEFORE_WRITE idea.
- Kristian.