Sergei Golubchik <serg@mariadb.org> writes:
Note that at this point, the group commit leader thread _is_ effectively acting as if it was each of the participating threads, the other threads are blocked from running during commit_ordered.
I'm not sure it's good enough. Practically it might work, but if one thread has modified some variable under a mutex, you cannot simply read it from another thread without a mutex, there's no guarantee that another thread will see the new value. You'll need an appropriate memory barrier for that.
Sure, I understand that, maybe I should have explained this code in more detail. This is the binlog group commit. We have N transactions doing commit at the same time: T1, T2, ..., TN. After writing to the binlog, we must run commit_ordered(T_i) for each transaction in order, i=1, 2, ..., N. The wait_for_prior_commit mechanism exists for this. Each thread will do: wait_for_prior_commit(T_{i-1}) commit_ordered(T_i) wakeup_subsequent_commits() But this is slow, since it requires thread scheduling between N threads, in-order: https://knielsen-hq.org/w/benchmarking-thread-scheduling-in-group-commit/ Instead, what we do is that T_1 will call commit_ordered() for _all_ the N transactions. Meanwhile the other threads are waiting to be signalled. Note that LOCK_wait_commit protects the hand-over from thread T_i to T_1 and back, so we _do_ have proper mutex protection. This mutex provides the necessary memory barrier between T_i modifying data and T_1 reading it, or T_1 modifying and T_i reading. Does that make it clear?
If there is some reason, then the correct fix could be to set current_thd for the duration of run_commit_ordered, which will satisfy the assertion and presumably the actual bug?
Setting current_thd is cheating, it will trick the assert, but won't fix the underlying problem as outlined above.
Nono, it's not cheating, it's in fact setting the correct value. While running commit_ordered(T_i), the current THD _is_ the THD for T_i, not the THD for T_1, which is unrelated to transaction T_i. It's already done for a similar case a bit earlier in the code, where T_1 is writing the binlog for each of the other threads: set_current_thd(current->thd); It's similar to the pool-of-threads. If a THD goes idle for some time, and then is resumed on a different thread in the thread pool, this does not mean that every access to the data must now be done under LOCK_thd_data. Rather, now the THD is owned by another thread, so this thread sets current_thd accordingly. The group commit code is similar, the THD is (temporarily) moved to be under a different thread. Does that make sense?
Monty asserted many times that taking an uncontended mutex is cheap. Did you think it's expensive or did you benchmark that?
I did, actually! (in the past). Taking a mutex needs to reserve the associated cache line, which is not free. "Expensive" is relative, of course. But here, LOCK_thd_data serves no purpose, and it's done under a very critical global lock LOCK_commit_ordered. So surely we should not be taking/releasing N extra mutexes here without any reason to do so? For the same reason, set_current_thd() was not done in this code, to avoid it if it serves no purpose. This is documented on the commit_ordered method in sql/handler.h: "Note that commit_ordered() can be called from a different thread than the one handling the transaction! So it can not do anything that depends on thread local storage" That is why I asked if you knew what the real problem is that triggered this assertion. There is no reason to add code in a performance-critical place if it serves no purpose. Worse, it might hide the real bug if some code was added to commit_ordered that violates the above requirement. I think it is fine to add set_current_thd() *if it is actually needed*. But we should understand first *why* it is needed, other wise we may hide the real bug. - Kristian.