[Maria-developers] Slave can take a very long time to start replication
Kristian,

As I understand it, currently when a slave connects to the master and wants to start replicating, it passes the GTID to start from; the master finds the binlog file where the earliest GTID is located and then scans through that file to find the exact binlog position to start sending binlog events from. If this binlog file is pretty big, the scanning can take a very long time, and I guess especially long when several slaves try to start replicating at roughly the same time. We observed 60-90 seconds between the slave's initial connection and the first real binlog events starting to flow.

During this period the slave doesn't receive anything from the master, so it's very easy to confuse this situation with a connection loss, hit slave_net_timeout, reopen the connection to the master, and force it to start searching through the binlog file from the very beginning...

Putting aside the argument of what value is good enough for slave_net_timeout, I'd say that in any case a slave taking 60 seconds just to start receiving binlog events from the master is unacceptable.

Have you thought about this problem before? Maybe you have already planned to implement some solution for it?

Thank you,
Pavel
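A minimal C++ sketch of the linear search described above. All types and names here are illustrative simplifications, not the actual server code, and a real MariaDB GTID also carries a server_id, which is omitted for brevity:

    // Simplified model of the current behaviour: the master walks the
    // binlog file event by event from the beginning to locate the
    // slave's starting GTID. With a multi-gigabyte file this O(n) scan
    // is consistent with the 60-90 second delays observed.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Gtid { uint32_t domain_id; uint64_t seq_no; };

    struct Event {
        bool     is_gtid;   // true for GTID events that begin a group
        Gtid     gtid;      // valid only when is_gtid is true
        uint64_t offset;    // byte offset of the event in the file
    };

    // Returns the byte offset to start streaming from, or -1 if the
    // requested position is not in this file.
    int64_t find_start_offset(const std::vector<Event> &binlog,
                              const Gtid &target)
    {
        for (const Event &ev : binlog) {
            if (ev.is_gtid &&
                ev.gtid.domain_id == target.domain_id &&
                ev.gtid.seq_no   >= target.seq_no)
                return (int64_t)ev.offset;
        }
        return -1;
    }

    int main()
    {
        std::vector<Event> binlog = {
            {true, {0, 1}, 256},
            {true, {0, 2}, 1024},
            {true, {0, 3}, 4096},
        };
        // A slave resuming from domain 0, seq_no 2 gets offset 1024,
        // but only after every preceding event has been read.
        std::printf("offset %lld\n",
                    (long long)find_start_offset(binlog, {0, 2}));
        return 0;
    }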
Pavel Ivanov <pivanof@google.com> writes:
> start replicating, it passes the GTID to start from; the master finds the
> binlog file where the earliest GTID is located and then scans through that
> file to find the exact binlog position to start sending binlog events from.
> If this binlog file is pretty big, the scanning can take a very long time,
> and I guess especially long when several slaves try to start replicating
> at roughly the same time. We observed 60-90 seconds
Ouch, that's a big delay :-(
> Have you thought about this problem before? Maybe you have already planned
> to implement some solution for it?
Yes, there are two possible solutions. My preferred solution is to change the binlog to be page-based, just like other database transaction logs. This has several benefits: for example, easy pre-allocation, which reduces the fsync() penalty by half or more, and protection against partial disk writes corrupting the end of the binlog. It would also allow binary search in the log to find the starting GTID, which should greatly improve slave connect time. But re-implementing the binlog format is probably too big a task to do anytime soon, unfortunately.

So the easier plan is to implement a binlog index: a separate file master-idx.XXXXXX alongside each master-bin.XXXXXX. Periodically (say, every 100 events or whatever), the current binlog GTID state would be written out to this file along with the corresponding binlog offset, in some page-based format. When a slave connects, a binary search is done on the index file to quickly find where to start in the binlog file.

Writing the binlog index should have low overhead, as there is no need to fsync() or even flush it regularly. If we crash, we can just re-build the index file as part of the binlog scan that takes place during crash recovery anyway (or just fall back to a binlog scan if no index file is found).

There has not been time to get either of these solutions implemented at this point, so for now the workaround is to use smaller binlog files, I suppose...

 - Kristian.
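A minimal C++ sketch of how the proposed index lookup might work. The record layout, the function names, the flat (seq_no, offset) checkpoint, and the 100-event interval are all assumptions drawn from the description above, not an actual MariaDB format; a real index would have to record the full per-domain GTID state:

    // Sketch of the proposed binlog index: every N events, append a
    // record mapping the binlog GTID state to a byte offset. On slave
    // connect, binary-search these records instead of scanning the
    // whole binlog file.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct IndexRecord {
        uint64_t seq_no;   // highest GTID seq_no written so far
        uint64_t offset;   // binlog byte offset where that state holds
    };

    // Appended while the binlog is produced; no fsync() needed, since
    // a missing or stale index can be rebuilt by scanning the binlog.
    void append_index_record(std::vector<IndexRecord> &idx,
                             uint64_t seq_no, uint64_t offset)
    {
        idx.push_back({seq_no, offset});
    }

    // O(log n): find the last checkpoint at or before the slave's
    // starting GTID; the master then scans at most ~N events from it.
    uint64_t lookup_start_offset(const std::vector<IndexRecord> &idx,
                                 uint64_t slave_seq_no)
    {
        IndexRecord key{slave_seq_no, 0};
        auto it = std::upper_bound(idx.begin(), idx.end(), key,
            [](const IndexRecord &a, const IndexRecord &b) {
                return a.seq_no < b.seq_no;
            });
        if (it == idx.begin())
            return 0;             // start from the head of the file
        return (it - 1)->offset;  // last checkpoint <= slave position
    }

    int main()
    {
        std::vector<IndexRecord> idx;
        // Pretend we checkpointed every 100 events while writing.
        append_index_record(idx, 100, 40960);
        append_index_record(idx, 200, 81920);
        append_index_record(idx, 300, 122880);
        // A slave reconnecting at seq_no 250 resumes the scan at
        // offset 81920 instead of offset 0.
        std::printf("start at offset %llu\n",
                    (unsigned long long)lookup_start_offset(idx, 250));
        return 0;
    }

With checkpoints every 100 events, the residual scan after the binary search is bounded by roughly 100 events regardless of binlog size, which is what would make slave connect time largely independent of file size.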