Pavel Ivanov <pivanof@google.com> writes:
> When a slave wants to start replicating, it passes the GTID to start from; the master finds the binlog file where the earliest GTID is located and then scans through that file to find the exact binlog position to start sending binlog events from. If this binlog file is pretty big, the scan can take a very long time, I guess especially when several slaves try to start replicating at roughly the same time. We observed 60-90 second delays.
Ouch, that's a big delay :-(
> Did you think about this problem before? Maybe you have already planned to implement some solution for it?
Yes, there are two possible solutions.

My preferred solution is to change the binlog to be page-based, just like other database transaction logs. This has several benefits - for example, easy pre-allocation, which reduces the fsync() penalty by half or more, and protection against partial disk writes corrupting the end of the binlog. It would also allow a binary search in the log to find the starting GTID, which should greatly improve slave connect time. But re-implementing the binlog format is probably too big a task to do anytime soon, unfortunately.

So the easier plan is to implement a binlog index: a separate file master-idx.XXXXXX alongside each master-bin.XXXXXX. Periodically (say every 100 events or so), the current binlog GTID state would be written out to this file along with the corresponding binlog offset, in some page-based format. When a slave connects, a binary search is done on the index file to quickly find where to start in the binlog file. Writing the binlog index should have low overhead, as there is no need to fsync() or even flush it regularly. If we crash, we can simply re-build the index file as part of the binlog scan that takes place during crash recovery anyway (or just fall back to a binlog scan if no index file is found).

There has not been time to get either of these solutions implemented at this point, so for now the workaround is to use smaller binlog files, I suppose...

 - Kristian.
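A minimal sketch of the binlog-index idea described above (not MariaDB code: the GTID state is simplified to a single sequence number, an in-memory vector stands in for the on-disk page-based master-idx.XXXXXX file, and the names BinlogIndex, on_event_written and start_offset_for are hypothetical):

    // Sketch only. Every N events, remember (last GTID seq_no, binlog offset);
    // when a slave connects, binary-search those entries instead of scanning
    // the whole binlog file from the start.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct IndexEntry {
        uint64_t last_seq_no;    // highest GTID sequence number written so far (simplified)
        uint64_t binlog_offset;  // byte offset in master-bin.XXXXXX just after that event
    };

    class BinlogIndex {
    public:
        explicit BinlogIndex(unsigned interval) : interval_(interval) {}

        // Called by a (hypothetical) binlog writer after each event.
        // No fsync needed: the index can be rebuilt from the binlog
        // during crash recovery if it is missing or truncated.
        void on_event_written(uint64_t seq_no, uint64_t offset) {
            if (++events_since_last_ >= interval_) {
                entries_.push_back({seq_no, offset});
                events_since_last_ = 0;
            }
        }

        // Slave connect: find the last index entry whose sequence number is
        // below the slave's starting GTID and begin the linear scan from
        // that offset instead of from the beginning of the file.
        uint64_t start_offset_for(uint64_t slave_start_seq_no) const {
            uint64_t offset = 0;  // fall back to scanning from the start
            std::size_t lo = 0, hi = entries_.size();
            while (lo < hi) {
                std::size_t mid = lo + (hi - lo) / 2;
                if (entries_[mid].last_seq_no < slave_start_seq_no) {
                    offset = entries_[mid].binlog_offset;
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            return offset;
        }

    private:
        unsigned interval_;
        unsigned events_since_last_ = 0;
        std::vector<IndexEntry> entries_;  // stands in for the on-disk index pages
    };

    int main() {
        BinlogIndex idx(100);  // one index entry per 100 events, as suggested above
        for (uint64_t seq = 1, off = 256; seq <= 100000; ++seq, off += 300)
            idx.on_event_written(seq, off);
        std::printf("scan from offset %llu for GTID seq 54321\n",
                    (unsigned long long) idx.start_offset_for(54321));
        return 0;
    }

Because the index entries are never fsync()ed, losing the tail of the index on a crash only costs a slightly longer scan, and the crash-recovery binlog scan mentioned above can rebuild the index in the same pass.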