Re: [MariaDB commits] [PATCH 4/4] MDEV-31273: Precompute binlog checksums
On Fri, Aug 25, 2023 at 10:16 AM Kristian Nielsen via commits <commits@lists.mariadb.org> wrote:
Compute binlog checksums (when enabled) already when writing events into the statement or transaction caches, where before it was done when the caches are copied to the real binlog file. This moves the checksum computation outside of holding LOCK_log, improving scalabitily.
At stmt/trx cache write time, the final end_log_pos values are not known, so with this patch these will be set to 0.
[...]
Checksums cannot be pre-computed when binlog encryption is enabled, as encryption relies on correct end_log_pos to provide part of the nonce/IV. Checksum pre-computation is also disabled for WSREP/Galera.
Have you thought about a format change that would address these issues? In MDEV-14425 https://github.com/MariaDB/server/commit/685d958e38b825ad9829be311f26729cccf... I redesigned the InnoDB log file format in a way that allows checksums to be computed before the final position of the records (the log sequence number of LSN) is known. For encryption, an additional 64-bit nonce will be written; this currently is the LSN at the time the encryption was scheduled. Tablespace identifiers, page numbers and file names are not encrypted, because these were never fully encrypted in innodb_encrypt_tables=ON either. This allows the metadata to be used as initialization vectors, and also avoids the need for backup tools to know any encryption keys. Thanks to MDEV-27774, not only InnoDB computes checksums while not holding any mutex, but multiple threads can copy their mini-transactions to the shared log_sys.buf while holding a shared lock_sys.latch. An exclusive latch will only be needed during a log checkpoint when we need to guarantee that all changes up to a particular LSN have been written to the log buffer. This would not be possible without the format change that was implemented in MDEV-14425.
+/* Convenience macros since ulong is 32 or 64 bit depending on platform. */ +#if SIZEOF_LONG == 4 +#define my_atomic_storeul(P, D) my_atomic_store32((int32 volatile *)(P), (D))
In all currently supported MariaDB Server versions, C++11 is available. Is there a particular reason why you would not use std::atomic<size_t> here? Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB plc
Marko Mäkelä <marko.makela@mariadb.com> writes:
On Fri, Aug 25, 2023 at 10:16 AM Kristian Nielsen via commits <commits@lists.mariadb.org> wrote:
Checksums cannot be pre-computed when binlog encryption is enabled, as
Have you thought about a format change that would address these issues?
What I though about was how to pre-encrypt without a format change. Currently, binlog encryption encrypts each event separately using a nonce built from the binlog file (12 random bytes) and the event (4 bytes of end_log_pos). The end_log_pos is easy, we can just put an increasing counter in there, and then fix the value after decryption. The problem is the per-binlog-file nonce. If a binlog rotation happens between transaction start and commit, we will have encrypted with the wrong nonce. In this case we can decrypt and re-encrypt, assuming most transactions will not span a binlog rotation. Or we could rotate this part of the nonce less often (I don't think anything says it _has_ to be for each binlog rotation). Binlog encryption is not something that I find very useful or interesting. It's one of those features that hurt more than it helps. If it was not there, users would be happy. Now that it's there, they will think they need to use it, and suffer complexity and limitations. Full-disk encryption seems much simpler, more robust, and more complete. But with this patch series, pre-encryption shouldn't be too hard to add. I think it needs input from whoever thinks binlog encryption is useful, since their requirements for the nonce will be needed to decide how to implement it.
In MDEV-14425 https://github.com/MariaDB/server/commit/685d958e38b825ad9829be311f26729cccf... I redesigned the InnoDB log file format in a way that allows checksums to be computed before the final position of the records (the log sequence number of LSN) is known. For encryption, an additional 64-bit nonce will be written; this currently is the LSN at the time the
In binlog we already have the end_log_pos, which is mostly an unused field. It's 32 bit, not sure if that would be considered sufficient for a nonce.
Thanks to MDEV-27774, not only InnoDB computes checksums while not holding any mutex, but multiple threads can copy their mini-transactions to the shared log_sys.buf while holding a shared
For the binlog, the stmt/trx caches are written directly to the binlog file under LOCK_log, from a single thread serving multiple pending transactions. I somehow doubt that this is a current scalability bottleneck, something that would be interesting to investigate though. It also seems doubtful that the kernel would scale with multiple threads appending to a file concurrently. The binlog is a very naive implementation and not well designed for a scalable transaction log currently. I think the InnoDB redo log is somewhat different here.
+/* Convenience macros since ulong is 32 or 64 bit depending on platform. */ +#if SIZEOF_LONG == 4 +#define my_atomic_storeul(P, D) my_atomic_store32((int32 volatile *)(P), (D))
In all currently supported MariaDB Server versions, C++11 is available. Is there a particular reason why you would not use std::atomic<size_t> here?
Yes. The binlog_checksum_options variable needs to be of ulong type to be usable with the MYSQL_SYSVAR_ENUM() macro to define the --binlog-checksum configuration option. As far as I could determine, std::atomic does not provide any non-atomic access to the underlying storage location to obtain the ulong * required. The alternative would be to change the implementation of the --binlog-checksum sysvar to somehow allow a std::atomic as the backing store. I thought this was more complex and out of scope of this task, but don't have a strong opinion one way or the other. (The ulong type and sys_var was always a source of grief, with ulong being different size on different platforms and so on). - Kristian.
On Fri, Aug 25, 2023 at 11:25 AM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Binlog encryption is not something that I find very useful or interesting. It's one of those features that hurt more than it helps.
Yes, we have some of those in InnoDB as well.
For the binlog, the stmt/trx caches are written directly to the binlog file under LOCK_log, from a single thread serving multiple pending transactions.
I think that most InnoDB redo log record writes will be non-durable ones, to the shared log_sys.buf, which could be memory mapped a PMEM device, but typically is the a RAM buffer that will be explicitly written the ib_logfile0. Those writes are protected by two special synchronization devices in log0log.cc that implement "everything up to this LSN" semantics: static group_commit_lock write_lock; static group_commit_lock flush_lock;
+/* Convenience macros since ulong is 32 or 64 bit depending on platform. */ +#if SIZEOF_LONG == 4 +#define my_atomic_storeul(P, D) my_atomic_store32((int32 volatile *)(P), (D))
In all currently supported MariaDB Server versions, C++11 is available. Is there a particular reason why you would not use std::atomic<size_t> here?
Yes. The binlog_checksum_options variable needs to be of ulong type to be usable with the MYSQL_SYSVAR_ENUM() macro to define the --binlog-checksum configuration option. As far as I could determine, std::atomic does not provide any non-atomic access to the underlying storage location to obtain the ulong * required.
Oh, sorry, I should have written std::atomic<long>. We have sizeof(size_t)>sizeof(long) on LLP64 platforms, such as Microsoft Windows. I am successfully using a dirty trick like this: static SHOW_VAR innodb_status_variables[]= { … {"ibuf_merges", &ibuf.n_merges, SHOW_SIZE_T}, … {NullS, NullS, SHOW_LONG} }; The underlying C-style cast simply assumes that the Atomic_counter<ulint> can be read as a normal ulint (which is an alias for size_t). Theoretically, it might not work on some exotic architecture where std::atomic could add some "bloat". I am not aware if any such architecture exists, and I wouldn't expect the my_atomic_ interface to work either in such a case.
(The ulong type and sys_var was always a source of grief, with ulong being different size on different platforms and so on).
Right. I think that using "long" is usually a mistake, given that both LP64 and LLP64 are in widespread use. Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB plc
Marko Mäkelä <marko.makela@mariadb.com> writes:
I am successfully using a dirty trick like this: static SHOW_VAR innodb_status_variables[]= { … {"ibuf_merges", &ibuf.n_merges, SHOW_SIZE_T}, … {NullS, NullS, SHOW_LONG} };
The underlying C-style cast simply assumes that the Atomic_counter<ulint> can be read as a normal ulint (which is an alias for size_t). Theoretically, it might not work on some exotic
Oh! I see. Basically, I first used std::atomic<ulong>, then found the problem that MYSQL_SYSVAR_ENUM() expected a plain ulong *. And I didn't want to introduce the (theoretically) illegal cast, so used the old my_atomic_* interface instead. But if we're already doing those casts elsewhere, that could work too I guess. It's really only code style here. The accesses to that option are all under LOCK_log, except the one dirty read I add. And if the dirty read would somehow conflict with an update, it will be re-checked and corrected under LOCK_log anyway. And read/write of aligned long is atomic anyway by default on all relevant platforms. - Kristian.
participants (2)
-
Kristian Nielsen
-
Marko Mäkelä