Re: [Maria-developers] Patch for parallel replication DBUG_ASSERT check on DROP TABLE et al
Monty writes:
Can you please check if this looks correct?
diff --git a/sql/mdl.cc b/sql/mdl.cc index 28d2006..3b1f9f2 100644 --- a/sql/mdl.cc +++ b/sql/mdl.cc @@ -17,6 +17,7 @@ #include "sql_class.h" #include "debug_sync.h" #include "sql_array.h" +#include "rpl_rli.h" #include <hash.h> #include <mysqld_error.h> #include <mysql/plugin.h> @@ -442,6 +443,7 @@ class MDL_lock virtual void notify_conflicting_locks(MDL_context *ctx) = 0;
virtual bitmap_t hog_lock_types_bitmap() const = 0; + bool check_if_conflicting_replication_locks(MDL_context *ctx);
This should probably be #ifdef NDEBUG, since you're only using it in debug build?
/** List of granted tickets for this lock. */ Ticket_list m_granted; @@ -2290,6 +2292,44 @@ void MDL_scoped_lock::notify_conflicting_locks(MDL_context *ctx) } }
+/** + Check if there is any conflicting lock that could cause this thread + to wait for another thread which is not ready to commit. + This is always an error, as the upper level of parallel replication + should not allow a scheduling of a conflicting DDL until all earlier + transactions has commited. + + This function is only called for a slave using parallel replication + and trying to get an exclusive lock for the table. +*/ + +bool MDL_lock::check_if_conflicting_replication_locks(MDL_context *ctx) +{ + Ticket_iterator it(m_granted); + MDL_ticket *conflicting_ticket; + + while ((conflicting_ticket= it++)) + { + if (conflicting_ticket->get_ctx() != ctx) + { + MDL_context *conflicting_ctx= conflicting_ticket->get_ctx(); + + /* + If the conflicting thread is another parallel replication + thread for the same master and it's not in commit stage, then + the current transaction has started too early and something is + seriously wrong. + */ + if (conflicting_ctx->get_thd()->rgi_slave && + conflicting_ctx->get_thd()->rgi_slave->rli == + ctx->get_thd()->rgi_slave->rli && + !conflicting_ctx->get_thd()->rgi_slave->did_mark_start_commit) + return 1; // Fatal error
No, this is not correct. For example, the threads might be in different domains, or using different multi-source connections. You need a check like in thd_report_wait_for(). _If_ T1 and T2 are in the same parallel replication domain, _and_ T1 goes before T2, _and_ there is a DDL lock conflict but T1 has not started to commit, then it indicates a potential problem. Also, the comments are too vague, "Something is seriously wrong" is not a good description. What is wrong? Please describe the issue being checked more precisely. What is the role of each of the two threads involved? What is the situation being checked? Why is it wrong? Also, use some temporaries for conflicting_ctx->get_thd()->rgi_slave etc, so the condition becomes more readable.
+ } + } + return 0; +} +
/** Acquire one lock with waiting for conflicting locks to go away if needed. @@ -2355,6 +2395,11 @@ MDL_context::acquire_lock(MDL_request *mdl_request, ulong lock_wait_timeout) if (lock->needs_notification(ticket) && lock_wait_timeout) lock->notify_conflicting_locks(this);
+ DBUG_ASSERT(mdl_request->type != MDL_INTENTION_EXCLUSIVE || + !(get_thd()->rgi_slave && + get_thd()->rgi_slave->is_parallel_exec && + lock->check_if_conflicting_replication_locks(this))); + mysql_prlock_unlock(&lock->m_rwlock);
will_wait_for(ticket); diff --git a/sql/mdl.h b/sql/mdl.h index c4d792a..13de602 100644 --- a/sql/mdl.h +++ b/sql/mdl.h @@ -910,7 +910,6 @@ class MDL_context */ MDL_wait_for_subgraph *m_waiting_for; private: - THD *get_thd() const { return m_owner->get_thd(); } MDL_ticket *find_ticket(MDL_request *mdl_req, enum_mdl_duration *duration); void release_locks_stored_before(enum_mdl_duration duration, MDL_ticket *sentinel); @@ -919,6 +918,7 @@ class MDL_context MDL_ticket **out_ticket);
public: + THD *get_thd() const { return m_owner->get_thd(); } void find_deadlock();
ulong get_thread_id() const { return thd_get_thread_id(get_thd()); }
- Kristian.
Monty, Apparently you pushed this patch into 10.0, even though I explained that it is incorrect, and why. That's not cool, and you can even see it failing in Buildbot now. Can you please fix it ASAP? - Kristian. Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Monty writes:
Can you please check if this looks correct?
diff --git a/sql/mdl.cc b/sql/mdl.cc index 28d2006..3b1f9f2 100644 --- a/sql/mdl.cc +++ b/sql/mdl.cc @@ -17,6 +17,7 @@ #include "sql_class.h" #include "debug_sync.h" #include "sql_array.h" +#include "rpl_rli.h" #include <hash.h> #include <mysqld_error.h> #include <mysql/plugin.h> @@ -442,6 +443,7 @@ class MDL_lock virtual void notify_conflicting_locks(MDL_context *ctx) = 0;
virtual bitmap_t hog_lock_types_bitmap() const = 0; + bool check_if_conflicting_replication_locks(MDL_context *ctx);
This should probably be #ifdef NDEBUG, since you're only using it in debug build?
/** List of granted tickets for this lock. */ Ticket_list m_granted; @@ -2290,6 +2292,44 @@ void MDL_scoped_lock::notify_conflicting_locks(MDL_context *ctx) } }
+/** + Check if there is any conflicting lock that could cause this thread + to wait for another thread which is not ready to commit. + This is always an error, as the upper level of parallel replication + should not allow a scheduling of a conflicting DDL until all earlier + transactions has commited. + + This function is only called for a slave using parallel replication + and trying to get an exclusive lock for the table. +*/ + +bool MDL_lock::check_if_conflicting_replication_locks(MDL_context *ctx) +{ + Ticket_iterator it(m_granted); + MDL_ticket *conflicting_ticket; + + while ((conflicting_ticket= it++)) + { + if (conflicting_ticket->get_ctx() != ctx) + { + MDL_context *conflicting_ctx= conflicting_ticket->get_ctx(); + + /* + If the conflicting thread is another parallel replication + thread for the same master and it's not in commit stage, then + the current transaction has started too early and something is + seriously wrong. + */ + if (conflicting_ctx->get_thd()->rgi_slave && + conflicting_ctx->get_thd()->rgi_slave->rli == + ctx->get_thd()->rgi_slave->rli && + !conflicting_ctx->get_thd()->rgi_slave->did_mark_start_commit) + return 1; // Fatal error
No, this is not correct. For example, the threads might be in different domains, or using different multi-source connections. You need a check like in thd_report_wait_for(). _If_ T1 and T2 are in the same parallel replication domain, _and_ T1 goes before T2, _and_ there is a DDL lock conflict but T1 has not started to commit, then it indicates a potential problem.
Also, the comments are too vague, "Something is seriously wrong" is not a good description. What is wrong? Please describe the issue being checked more precisely. What is the role of each of the two threads involved? What is the situation being checked? Why is it wrong?
Also, use some temporaries for conflicting_ctx->get_thd()->rgi_slave etc, so the condition becomes more readable.
+ } + } + return 0; +} +
/** Acquire one lock with waiting for conflicting locks to go away if needed. @@ -2355,6 +2395,11 @@ MDL_context::acquire_lock(MDL_request *mdl_request, ulong lock_wait_timeout) if (lock->needs_notification(ticket) && lock_wait_timeout) lock->notify_conflicting_locks(this);
+ DBUG_ASSERT(mdl_request->type != MDL_INTENTION_EXCLUSIVE || + !(get_thd()->rgi_slave && + get_thd()->rgi_slave->is_parallel_exec && + lock->check_if_conflicting_replication_locks(this))); + mysql_prlock_unlock(&lock->m_rwlock);
will_wait_for(ticket); diff --git a/sql/mdl.h b/sql/mdl.h index c4d792a..13de602 100644 --- a/sql/mdl.h +++ b/sql/mdl.h @@ -910,7 +910,6 @@ class MDL_context */ MDL_wait_for_subgraph *m_waiting_for; private: - THD *get_thd() const { return m_owner->get_thd(); } MDL_ticket *find_ticket(MDL_request *mdl_req, enum_mdl_duration *duration); void release_locks_stored_before(enum_mdl_duration duration, MDL_ticket *sentinel); @@ -919,6 +918,7 @@ class MDL_context MDL_ticket **out_ticket);
public: + THD *get_thd() const { return m_owner->get_thd(); } void find_deadlock();
ulong get_thread_id() const { return thd_get_thread_id(get_thd()); }
- Kristian.
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp
Hi! On Thu, Aug 18, 2016 at 12:24 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Monty,
Apparently you pushed this patch into 10.0, even though I explained that it is incorrect, and why. That's not cool, and you can even see it failing in Buildbot now.
I pushed the patch as I didn't see (probably missed) a review from you for more than a day. I was also going away for a few days and I wanted that Elena would have my code in 10.0 while she was testing things that could trigger the assert. As this was a DBUG_ASSERT and could not cause a problem for anyone in production I didn't think it was totally critical to push it before the review.
Can you please fix it ASAP?
Of course. Still I don't know of any case in buildbot where the patch has caused any issues. I checked with Elena and she couldn't find anything either that she could attribute to the patch. Do you happen to know of any failures caused by the patch? Regards, Monty
Michael Widenius <michael.widenius@gmail.com> writes:
Do you happen to know of any failures caused by the patch?
As I mentioned on IRC, the assertion triggers during rpl_mdev6020, for example here: https://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/9... As we discussed, one possible reason is a race because did_mark_start_commit is set only after waking up the next thread. - Kristian.
Hi! On Thu, Aug 18, 2016 at 11:03 PM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Michael Widenius <michael.widenius@gmail.com> writes:
Do you happen to know of any failures caused by the patch?
As I mentioned on IRC, the assertion triggers during rpl_mdev6020, for example here:
https://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/9...
As we discussed, one possible reason is a race because did_mark_start_commit is set only after waking up the next thread.
Just a note about rpl_mdev6020 This test started to fail with a timeout, as some parallel threads are waiting on each-other, between MariaDB 10.0.17 and MariaDB 10.0.18 To get it to fail, one must run with valgrind: mysql-test-run --valgrind rpl_mdev6020,innodb_plugin,mix Now working on trying to see what changes caused it to fail. Regards, Monty
participants (2)
-
Kristian Nielsen
-
Michael Widenius