[Maria-developers] WL#188 New (by Knielsen): Using --log-slave-updates to ensure crash safe/transactional slave state
-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Using --log-slave-updates to ensure crash safe/transactional slave state
CREATION DATE..: Mon, 21 Mar 2011, 12:56
SUPERVISOR.....:
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 188 (http://askmonty.org/worklog/?tid=188)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

DESCRIPTION:

Overview
--------
A replication slave needs to preserve certain state between slave server
restarts to be able to correctly resume replication from where it left off. In
current MySQL replication, this state is kept in multiple simple files
(master.info, relay-log.info). This is a big problem if the slave server
crashes, as there is no guarantee that these files will be consistent with the
table data (and with the binlog, if --log-slave-updates is used), or even with
each other.

The Google patch rpl_transaction_enabled has a partial solution for this: it
duplicates the state inside InnoDB in a transactional way. At slave server
startup, InnoDB can then decide to overwrite the files keeping the slave state
with its own, hopefully more correct, information. There are some remaining
problems with this approach, some of which may be fixable, some not.

However, the basic problem here is the need to maintain state in a
transactional/crash-safe way across multiple subsystems of the server
(eg. replication and storage engine(s)). And we already have such a mechanism,
in the form of the two-phase commit between engines and binlog. This worklog
describes how we could use this existing mechanism to keep the replication
slave state across server restarts in a crash-safe way, rather than introduce
new complex mechanisms for every new piece of state to be kept.
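To make the failure mode concrete, here is a minimal Python simulation (not server code; all names are illustrative) of the non-transactional state update: the table data and the relay-log.info-style state are updated in two separate steps, so a crash between them leaves them inconsistent.

```python
# Hypothetical sketch: the slave applies a transaction, then updates
# relay-log.info as a separate, non-atomic step.  A crash between the
# two steps leaves the state file pointing at an already-executed event,
# so the transaction would be re-applied on restart.

def apply_event(table, state, event, pos, crash_between=False):
    table.append(event)            # step 1: commit transaction to table data
    if crash_between:
        raise RuntimeError("server crashed before state file was updated")
    state["relay_log_pos"] = pos   # step 2: update relay-log.info (too late)

table, state = [], {"relay_log_pos": 0}
apply_event(table, state, "INSERT t1", 100)
try:
    apply_event(table, state, "INSERT t2", 200, crash_between=True)
except RuntimeError:
    pass

# After "recovery", the state still points at position 100, but the table
# already contains the transaction from position 200.
assert table == ["INSERT t1", "INSERT t2"]
assert state["relay_log_pos"] == 100   # inconsistent with the table data
```

The same inconsistency can occur in the other order (state updated, crash before data is durable); either way the two pieces of state diverge, which is exactly what a shared transactional mechanism avoids.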
Idea
----
The main state we need to make crash-safe is the binlog position
(filename, offset) of the next event from the master to execute on the slave.
More state is stored currently, but there is less need for that state to be
transactional:

- relay-log.info also stores the position that event execution has reached in
  the relay log files on the slave. After a normal shutdown this file can be
  used as-is. When recovering after a crash, the relay logs may be
  inconsistent with the master and/or the slave SQL thread, so it is probably
  better to just discard any existing relay logs and re-fetch all necessary
  events from the master.

- master.info mainly stores the connection information from CHANGE MASTER TO,
  which does not change often.

The basic idea is that on slave server start, we get the required information
from the binlog on the slave rather than from these files (this requires that
--log-slave-updates is enabled). This also has the advantage that we avoid
constantly updating the state files after every transaction, saving some
execution cost.

When we shut down the server normally, we close the binlog in a way that lets
us detect at startup whether we are recovering from a crash or not. As part of
this close, we can write whatever state we need to recover at the end of the
binlog.

Recovering the state then depends on whether we crashed or not.

If we did not crash, all we need is to be able to find the position of the
state event written at the end of the binlog. One way is to have this event
start at a fixed offset from the end; however, this is not very robust against
finding wrong data there, especially with binlogs from different versions of
the server, which may have different events (or event sizes) at the end of the
log.
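A minimal sketch of the fixed-offset variant, and of why it is fragile (the record layout, function names, and text "events" are all invented for illustration; real binlogs contain binary log events):

```python
import os
import struct
import tempfile

STATE_FMT = "<Q64s"                      # (master_pos, master_log_name): invented layout
STATE_SIZE = struct.calcsize(STATE_FMT)

def close_binlog_with_state(path, master_log, master_pos):
    # On clean shutdown, append a state record as the very last thing in the log.
    with open(path, "ab") as f:
        f.write(struct.pack(STATE_FMT, master_pos,
                            master_log.encode().ljust(64, b"\0")))
        f.flush()
        os.fsync(f.fileno())

def read_state_at_fixed_offset(path):
    # At startup, seek a fixed distance back from the end and decode the record.
    # As noted above, this is fragile: a binlog written by a different server
    # version may end in a different-sized event, in which case we would
    # silently decode garbage here.
    with open(path, "rb") as f:
        f.seek(-STATE_SIZE, os.SEEK_END)
        pos, name = struct.unpack(STATE_FMT, f.read(STATE_SIZE))
    return name.rstrip(b"\0").decode(), pos

path = os.path.join(tempfile.mkdtemp(), "binlog-demo.000001")
with open(path, "wb") as f:
    f.write(b"fake binlog events...")    # stand-in for real event data
close_binlog_with_state(path, "master-bin.000042", 1234)
print(read_state_at_fixed_offset(path))  # → ('master-bin.000042', 1234)
```

This illustrates the trade-off in the text: the write/read pair is trivial, but the reader has no way to verify that what sits at the fixed offset is really a state record from a compatible server version.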
Another way is that when we close the binlog, we seek back to the start of the
log and write the position of the last event there (either in the
format_description event, which already has version information, or in a new
event written just after the format_description event). We already do this
seek anyway, to overwrite the flag in the format_description event which
signals that the binlog was closed properly (not crashed). We just need to be
sure to fsync() the binlog _before_ overwriting the crashed-or-not flag, so
that all data is guaranteed to be there in the not-crashed case.

If we did crash, then we cannot rely on any information at the end of the
binlog. In this case, we can instead build on top of the already existing
crash recovery mechanism. This mechanism scans the last binlog to build a list
of all committed transactions, then uses this list to tell storage engines
which previously prepared transactions to commit and which to roll back. As
part of this scan, we can determine the last event executed, and from this we
get the necessary state to continue replication in terms of the binlog
position (filename, offset) on the master.

Note that if something like group ID (MWL#175) or global transaction ID is
implemented, then the state to preserve is the last executed ID rather than a
binlog position; however, the basic mechanism remains the same.

Discussion
----------
The main disadvantage I see with this approach is that in order for it to be
really crash safe, we need to run with innodb_flush_log_at_trx_commit=1 and
sync_binlog=1. This requires 3 fsync() calls per commit, and group commit is
not possible since the slave is single-threaded. This is likely to be too
expensive for many installations to be able to use. However, it may be
possible to reduce this overhead sufficiently by using MWL#185 (grouping
multiple commits together on the slave to reduce fsync() overhead), or by
implementing parallel replication to utilise group commit (MWL#169, MWL#184,
MWL#186).
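The crash-recovery path described above might look roughly like this (a hedged sketch: events are modelled as dicts and all names are invented, but the flow follows the text: scan the last binlog, decide which prepared transactions to commit or roll back, and pick up the last executed master position as a by-product of the same scan):

```python
# Hypothetical sketch of extending binlog crash recovery so that the
# existing scan also yields the slave's replication position.

def recover(binlog_events, prepared_in_engine):
    committed_xids = set()
    last_master_pos = None
    # Scan the last binlog, exactly as the existing recovery does, but
    # additionally remember the master (file, offset) of the last event.
    for ev in binlog_events:
        if ev["type"] == "XID":
            committed_xids.add(ev["xid"])
        if "master_pos" in ev:
            last_master_pos = ev["master_pos"]
    # Tell the engine which prepared transactions to commit vs roll back.
    to_commit = prepared_in_engine & committed_xids
    to_rollback = prepared_in_engine - committed_xids
    return to_commit, to_rollback, last_master_pos

binlog = [
    {"type": "QUERY", "master_pos": ("master-bin.000042", 1000)},
    {"type": "XID", "xid": 7},
    {"type": "QUERY", "master_pos": ("master-bin.000042", 1200)},
    # crash: the transaction with xid 8 was prepared in the engine but
    # never got its XID event written to the binlog
]
commit, rollback, pos = recover(binlog, prepared_in_engine={7, 8})
assert commit == {7}
assert rollback == {8}
assert pos == ("master-bin.000042", 1200)
```

The point of the sketch is that no new scan is needed: the position falls out of the pass the recovery code already makes over the binlog.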
ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
worklog-noreply@askmonty.org