Re: [PATCH] MDEV-32014 Rename binlog cache temporary file to binlog file for large transaction
Andrei Elkin <andrei.elkin@mariadb.com> writes:
So please do a small refactoring of trx_group_commit_leader() and keep only ...
and no need for refactoring really arises.
Well, the refactoring is good, even without the MDEV-32014 patch. The function trx_group_commit_leader() is already too long and complicated, and does too much in a single function. And the first part (obtaining the queue) is cleanly separated from the second part (committing each entry in the queue in-order in binlog and participating engines). It's just nice that this improvement of the existing code also makes the MDEV-32014 patch integrate cleaner.
I noted earlier on this part on the PR page and LiBing seems to have practically proven (by the most recent commit that you must've missed) that conducting the rename inside the flush stage of trx_group_commit_leader() is viable:
https://github.com/MariaDB/server/pull/3402/commits/287511a703f7dff479396334...
By that measure the flushing is expanded onto possibly multiple "implicitly" rotating log files
Hm, I don't like the change 287511a703f7dff47939633475edd436a542cf6f. In fact, that is one of the things I like the most about Libing Song's patch, that it so cleanly separates the commit-by-rotate logic from the complex group commit logic in queue_for_group_commit() and trx_group_commit_leader(), which is already too complex. The original patch without 287511a703f7dff47939633475edd436a542cf6f seems much cleaner. It is better to do as in the original patch, and treat commit-by-rotate like a normal binlog rotate, not try to participate that transaction in binlog group commit. We cannot share the fsync anyway, as the files are different. And it introduces a lot of complexities to suddenly have binlog rotations in the middle of group commit - what about the complex accounting around xid_count and binlog checkpoints? There will be a lot of corner cases eg. if _two_ commit-by-rotate happens in the same group commit, with lots of potential for subtle bugs that will occur only very rarely in production and be almost impossible to track down. I strongly prefer the original patch I reviewed (with my suggested changes to write_transaction_to_binlog_events() and trx_group_commit_leader()), where the commit-by-rotate logic is cleanly separated from the group commit logic and implemented mostly using the existing binlog rotate logic. - Kristian.
Kristian Nielsen via developers <developers@lists.mariadb.org> writes:
Andrei Elkin <andrei.elkin@mariadb.com> writes:
So please do a small refactoring of trx_group_commit_leader() and keep only ...
and no need for refactoring really arises.
Well, the refactoring is good, even without the MDEV-32014 patch. The function trx_group_commit_leader() is already too long and complicated, and does too much in a single function. And the first part (obtaining the queue) is cleanly separated from the second part (committing each entry in the queue in-order in binlog and participating engines). It's just nice that this improvement of the existing code also makes the MDEV-32014 patch integrate cleaner.
I understand your point and it makes sense. While complications of `2875...` commit are passable, I am mentioning some details further on, but your suggestion looks simpler to me.
I noted earlier on this part on the PR page and LiBing seems to have practically proven (by the most recent commit that you must've missed) that conducting the rename inside the flush stage of trx_group_commit_leader() is viable:
https://github.com/MariaDB/server/pull/3402/commits/287511a703f7dff479396334...
By that measure the flushing is expanded onto possibly multiple "implicitly" rotating log files
Hm, I don't like the change 287511a703f7dff47939633475edd436a542cf6f. In fact, that is one of the things I like the most about Libing Song's patch, that it so cleanly separates the commit-by-rotate logic from the complex group commit logic in queue_for_group_commit() and trx_group_commit_leader(), which is already too complex. The original patch without 287511a703f7dff47939633475edd436a542cf6f seems much cleaner.
I also was not sure about the skipped `queue_for_group_commit()` at time of my note. However it also looks as a safe thing to do at the expense of no actual grouping.
It is better to do as in the original patch, and treat commit-by-rotate like a normal binlog rotate, not try to participate that transaction in binlog group commit. We cannot share the fsync anyway, as the files are different. And it introduces a lot of complexities to suddenly have binlog rotations in the middle of group commit - what about the complex accounting around xid_count and binlog checkpoints? There will be a lot of corner cases eg. if _two_ commit-by-rotate happens in the same group commit, with lots of potential for subtle bugs that will occur only very rarely in production and be almost impossible to track down.
I gave some thought to these use cases and was confident they would be covered. To that matter, the interspersed with rotate group write of `287511a7...` commit can be arranged as an external loop + for (current= queue; current != NULL; current= current->next) + { if (likely(is_open())) // Should always be true { ... for (; /* current fits to the active log */; current= current->next) { ... /* flush */ ...} flush_and_sync(); ... +/- rotate(false, &check_purge); } ... trx_group_commit_with_engines(); + } to treat a rotate-as-rename()-r almost as a regular thread in a binlog group. rotate() that remains at its current position inside the flush loop would trigger (+/-) renaming if necessary. Say that occurs. The inner loop would break to leave the rest of execution intact up to commit and binlog checkpoint request for the old rotated file. The rename-rotation caused thread would be the first in the external loop's next iteration (where, taking other parts of the patch, it would have nothing to write down). This does not look to me as awfully complex, though arguably harder, still friendly to a part of the refacting suggested.
Andrei via developers <developers@lists.mariadb.org> writes: Let me admit, in the external loop description
It is better to do as in the original patch, and treat commit-by-rotate like a normal binlog rotate, not try to participate that transaction in binlog group commit. We cannot share the fsync anyway, as the files are different. And it introduces a lot of complexities to suddenly have binlog rotations in the middle of group commit - what about the complex accounting around xid_count and binlog checkpoints? There will be a lot of corner cases eg. if _two_ commit-by-rotate happens in the same group commit, with lots of potential for subtle bugs that will occur only very rarely in production and be almost impossible to track down.
I gave some thought to these use cases and was confident they would be covered. To that matter, the interspersed with rotate group write of `287511a7...` commit can be arranged as an external loop
+ for (current= queue; current != NULL; current= current->next) + { if (likely(is_open())) // Should always be true { ... for (; /* current fits to the active log */; current= current->next) { ... /* flush */ ...} flush_and_sync(); ... +/- rotate(false, &check_purge); } ... trx_group_commit_with_engines(); + }
the flush and commit phases are to run according to a special mutex lock protocol. The external loop if taken literarly can't of course do that e.g for a reason of concurrent leaders should be held off.
participants (2)
-
andrei.elkin@pp.inet.fi
-
Kristian Nielsen