Michael Widenius <michael.widenius@gmail.com> writes:
2. To protect access to master_info_index, eg. to prevent a Master_info from disappearing while it is accessed.
When reading about this, I think it would be better to have a counter in Master_info if someone is using it and only delete if if the counter is zero. Would be trivial to add an increment of the counter in get_master_info().
The counter could either be protected by the LOCK_active_mi or one of the mutexes in Master_info like data_lock.
Sounds reasonable. Will you do a patch? I guess this is code from the multisource feature, so you are the one familiar with it (before multi-source, the Master_info could not be created and deleted dynamically, could it?)
Both LOCK_active_mi and the new LOCK_serialize_replication_admin_commands must by themselves prevent master_info_index from having elements added or removed. This is needed for start_all_slaves() and stop_all_slaves() to work correctly, I think.
Why both? Isn't it enough to first take LOCK_serialize_replication_admin_commands and then take LOCK_active_mi if one needs to access master_info_index ?
In stop_all_slaves(), there is a loop over the master_info_hash, within which slaves are stopped one at a time. So LOCK_active_mi cannot be held across the loop because of the deadlock discussed in this thread. So it seems that LOCK_serialize_replication_admin_commands must prevent elements being added or removed, to protect this loop. But LOCK_active_mi also needs to protect elements from going away, so that they can be safely accessed in the many different places in the code that do this. Of course there are a number of different ways this could be solved - just pointing out that it is something that need to be handled one way or the other.
Actually, there would still be the deadlock problem after introducing the new lock, because it is possible for a slave thread to run CHANGE MASTER! (and I think START SLAVE / RESET SLAVE as well). But that is probably a bug, does not seem good that this is possible. I guess these need a check to throw an ER_LOCK_OR_ACTIVE_TRANSACTION error, like STOP SLAVE.
Wouldn't the new lock LOCK_serialize_replication_admin_commands fix that?
How? It's exactly the same deadlock, no matter what we call the lock. STOP SLAVE holds lock, waits for sql thread to stop. SQL thread requests lock (eg. in CHANGE MASTER), deadlocking. But really, there is no reason a slave thread should be allowed to run replication adminitrative commands, it should just be an error so that the slave threads will never try to take LOCK_serialize_replication_admin_commands. Maybe that is what you meant.
I checked the function mysql_show_binlog_events() but could not find any access to Master_info after mysql_mutex_unlock(&LOCK_active_mi) was released.
Hm, no, it doesn't seem so, I guess I misread the code. So it is only MASTER_POS_WAIT() that has the problem, then. - Kristian.