On Wed, Jan 29, 2025 at 4:55 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Yes. Truncate is used only on the one that is being written to. It is used to implement FLUSH BINARY LOGS, which is used to close the currently written file early and move on to the next binlog file.
If the binlog files are normally preallocated on creation, it would indeed be helpful to log file size changes explicitly. We could also log that with a WRITE record that rewrites the binlog file header block, which could specify the allocated size of the file. For this, we would have to overwrite the first binlog block in place. To avoid problems with torn writes, it could be a good idea to confine the header block payload to the first 512 or fewer bytes of the 4096-byte block. That way, the risk of the data being corrupted by an interrupted write should be minimal. It would then be up to the binlog layer to interpret the contents of the WRITE record of page 0.

We might also write an (EXTENDED,TRIM_PAGES) record for trimming the size, but it is not strictly needed. For InnoDB tablespaces, which are not append-only, these records are necessary so that any earlier log records that would write beyond the trimmed size of the tablespace can be discarded. The binlog would be strictly append-only, and FLUSH BINARY LOGS would never "overwrite" or discard any previously written data for that file.
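To make the header-block idea concrete, here is a minimal C sketch of overwriting page 0 in place while keeping the payload inside the first 512 bytes. The struct layout, the field names and the helper names are all hypothetical and are not an actual binlog or InnoDB format. The sketch also assumes that a 512-byte sector is written atomically, which is a common assumption about storage but not something POSIX guarantees.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { BLOCK_SIZE = 4096, PAYLOAD_MAX = 512 };

/* Hypothetical header payload; NOT an actual binlog file format. */
struct binlog_header {
  uint32_t magic;
  uint32_t version;
  uint64_t allocated_size; /* allocated size of the file, in bytes */
};

_Static_assert(sizeof(struct binlog_header) <= PAYLOAD_MAX,
               "header payload must fit in the first 512 bytes");

/* Overwrite block 0 in place. Only the first PAYLOAD_MAX bytes carry data;
   the rest of the block is zero-filled (assumption: it holds nothing else). */
static int write_header_block(int fd, const struct binlog_header *h)
{
  unsigned char block[BLOCK_SIZE];
  assert(h != NULL);
  memset(block, 0, sizeof block);
  memcpy(block, h, sizeof *h);
  if (pwrite(fd, block, sizeof block, 0) != (ssize_t) sizeof block) {
    perror("pwrite");
    return -1;
  }
  return 0;
}

/* Read back only the payload area at the start of block 0. */
static int read_header_block(int fd, struct binlog_header *h)
{
  unsigned char payload[PAYLOAD_MAX];
  assert(h != NULL);
  if (pread(fd, payload, sizeof payload, 0) != (ssize_t) sizeof payload) {
    perror("pread");
    return -1;
  }
  memcpy(h, payload, sizeof *h);
  return 0;
}
```

A caller would update allocated_size and call write_header_block() again whenever the file size changes. Durability (fsync() or similar) and the corresponding redo log WRITE record are deliberately left out of this sketch.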
If it is a problem to implement truncate, binlog can instead just pad the rest of the binlog file with dummy data. If we can have a truncate record in the redo log for recovery, we can avoid this dummy data and binlog can simply ftruncate() the file during recovery.
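The two alternatives mentioned here, padding the remainder of the file versus cutting it with ftruncate(), could look roughly like this in C. This is only a sketch: the helper names are made up for illustration, and the "dummy data" is assumed to be NUL bytes.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Alternative 1: fill the byte range [from, to) with NUL bytes. */
static int pad_with_zeros(int fd, off_t from, off_t to)
{
  static const unsigned char zeros[4096]; /* zero-initialized */
  assert(from <= to);
  while (from < to) {
    size_t chunk = sizeof zeros;
    if ((off_t) chunk > to - from)
      chunk = (size_t) (to - from);
    ssize_t n = pwrite(fd, zeros, chunk, from);
    if (n <= 0) {
      perror("pwrite");
      return -1;
    }
    from += n; /* a short write simply continues from the new offset */
  }
  return 0;
}

/* Alternative 2: cut the file at the end of the valid data,
   for example at the last recovered position. */
static int trim_at(int fd, off_t end_of_data)
{
  if (ftruncate(fd, end_of_data) != 0) {
    perror("ftruncate");
    return -1;
  }
  return 0;
}

/* Helper for checking the result: current file size, or -1 on error. */
static off_t file_size(int fd)
{
  struct stat st;
  return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
}
```

With padding, the file keeps its full size and a reader has to recognize the NUL bytes as the end of the data. With trim_at(), the file size itself marks the end.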
If we continue to use the last binlog file after recovery and we are preallocating the binlog files, some padding with NUL bytes will have to be implemented anyway. If we are always going to move to the next file, then we might as well trim the last recovered binlog file at the last recovered position. The POSIX interfaces for these would be posix_fallocate() and ftruncate(). Some existing code in InnoDB prefers fallocate() and falls back to pwrite() with NUL bytes. While fallocate() requires special support from the underlying file system and therefore needs a fallback to regular writes, ftruncate() should always be available.

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc