Looking at the code it looks like most places (including THD::awake) assume that thd->mysys_var->current_mutex and thd->mysys_var->current_cond can be changed only when thd->mysys_var->mutex is locked. debug_sync.cc is breaking this assumption. I've found that Item_func_sleep::val_int(), THD::enter_cond() and THD::exit_cond() break this assumption too. I believe all of those places should be fixed to lock thd->mysys_var->mutex before any attempts to read or write to thd->mysys_var->current_mutex and thd->mysys_var->current_cond. Pavel On Wed, Feb 26, 2014 at 8:29 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
I got a crash in my parallel replication test case, and this one turned out to be caused by a fundamental problem in debug_sync. I am not sure how to solve it, so I wanted to explain the issue in case others have some ideas.
The crash happens in THD::awake() when a thread is killed:
mysql_mutex_lock(&mysys_var->mutex); ... int ret= mysql_mutex_trylock(mysys_var->current_mutex); mysql_cond_broadcast(mysys_var->current_cond); if (!ret) mysql_mutex_unlock(mysys_var->current_mutex);
The problem is that mysys_var->current_mutex changed between the trylock() and the unlock(), so we unlock a different mutex than the one we locked - ouch!
The mutex changed because of this code in debug_sync.cc:
if (thd->mysys_var) { old_mutex= thd->mysys_var->current_mutex; old_cond= thd->mysys_var->current_cond; thd->mysys_var->current_mutex= &debug_sync_global.ds_mutex; thd->mysys_var->current_cond= &debug_sync_global.ds_cond; }
There is no mutex protection here.
So this means that it is not safe to use debug_sync inside enter_cond()/exit_cond() if kill will be used on the thread at that point. Since I have a lot of tests for parallel replication where I test exactly for correct error handling when killing threads at various points in the parallel processing, it is not too surprising that I would eventually be hit by this :)
I am going to avoid using debug_sync inside enter_cond()/exit_cond() for now. But any ideas for how to solve this properly? This is particularly nasty to be hit by, as it is not at all obvious that this use of debug_sync should be a problem. And it is not exactly easy to guess from the failure what the real problem is - assuming one can even reproduce the failure, the window of opportunity in the race is after all rather quite small.
- Kristian.
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp