developers
Threads by month
- ----- 2025 -----
- February
- January
- ----- 2024 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2023 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2022 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2021 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2020 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2019 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2018 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2017 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2016 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2015 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2014 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2013 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2012 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2011 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2010 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2009 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- 6826 discussions
[Maria-developers] Updated (by Guest): Efficient group commit for binary log (116)
by worklog-noreply@askmonty.org 01 Jun '10
by worklog-noreply@askmonty.org 01 Jun '10
01 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 60
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Tue, 01 Jun 2010, 14:20)=-=-
Status updated.
--- /tmp/wklog.116.old.32652 2010-06-01 14:20:15.000000000 +0000
+++ /tmp/wklog.116.new.32652 2010-06-01 14:20:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Mon, 31 May 2010, 06:48)=-=-
Finish first architecture draft (changed my mind a number of times before I was satisfied).
Write up architecture in worklog.
Fix remaining test failures in proof-of-concept patch + implement xtradb part.
Run some benchmarks on proof-of-concept implementation.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Tue, 25 May 2010, 13:19)=-=-
Low Level Design modified.
--- /tmp/wklog.116.old.14255 2010-05-25 13:19:00.000000000 +0000
+++ /tmp/wklog.116.new.14255 2010-05-25 13:19:00.000000000 +0000
@@ -1 +1,363 @@
+1. Changes for ha_commit_trans()
+
+The gut of the code for commit is in the function ha_commit_trans() (and in
+commit_one_phase() which is called from it). This must be extended to use the
+new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
+
+1.1 Atomic queue of committing transactions
+
+To keep the right commit order among participants, we put transactions into a
+queue. The operations on the queue are non-locking:
+
+ - Insert THD at the head of the queue, and return old queue.
+
+ THD *enqueue_atomic(THD *thd)
+
+ - Fetch (and delete) the whole queue.
+
+ THD *atomic_grab_reverse_queue()
+
+These are simple to implement with atomic compare-and-set. Note that there is
+no ABA problem [2], as we do not delete individual elements from the queue, we
+grab the whole queue and replace it with NULL.
+
+A transaction enters the queue when it does prepare_ordered(). This way, the
+scheduling order for prepare_ordered() calls is what determines the sequence
+in the queue and effectively the commit order.
+
+The queue is grabbed by the code doing group_log_xid() and commit_ordered()
+calls. The queue is passed directly to group_log_xid(), and afterwards
+iterated to do individual commit_ordered() calls.
+
+Using a lock-free queue allows prepare_ordered() (for one transaction) to run
+in parallel with commit_ordered (in another transaction), increasing potential
+parallelism.
+
+The queue is simply a linked list of THD objects, linked through a
+THD::next_commit_ordered field. Since we add at the head of the queue, the
+list is actually in reverse order, so must be reversed when we grab and delete
+it.
+
+The reason that enqueue_atomic() returns the old queue is so that we can check
+if an insert goes to the head of the queue. The thread at the head of the
+queue will do the sequential part of group commit for everyone.
+
+
+1.2 Locks
+
+1.2.1 Global LOCK_prepare_ordered
+
+This lock is taken to serialise calls to prepare_ordered(). Note that
+effectively, the commit order is decided by the order in which threads obtain
+this lock.
+
+
+1.2.2 Global LOCK_group_commit and COND_group_commit
+
+This lock is used to protect the serial part of group commit. It is taken
+around the code where we grab the queue, call group_log_xid() on the queue,
+and call commit_ordered() on each element of the queue, to make sure they
+happen serialised and in consistent order. It also protects the variable
+group_commit_queue_busy, which is used when not using group_log_xid() to delay
+running over a new queue until the first queue is completely done.
+
+
+1.2.3 Global LOCK_commit_ordered
+
+This lock is taken around calls to commit_ordered(), to ensure they happen
+serialised.
+
+
+1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
+
+This lock protects the thd->group_commit_ready variable, as well as the
+condition variable used to wake up threads after log_xid() and
+commit_ordered() finishes.
+
+
+1.2.5 Global LOCK_group_commit_queue
+
+This is only used on platforms with no native compare-and-set operations, to
+make the queue operations atomic.
+
+
+1.3 Commit algorithm.
+
+This is the basic algorithm, simplified by
+
+ - omitting some error handling
+
+ - omitting looping over all handlers when invoking handler methods
+
+ - omitting some possible optimisations when not all calls needed (see next
+ section).
+
+ - Omitting the case where no group_log_xid() is used, see below.
+
+---- BEGIN ALGORITHM ----
+ ht->prepare()
+
+ // Call prepare_ordered() and enqueue in correct commit order
+ lock(LOCK_prepare_ordered)
+ ht->prepare_ordered()
+ old_queue= enqueue_atomic(thd)
+ thd->group_commit_ready= FALSE
+ is_group_commit_leader= (old_queue == NULL)
+ unlock(LOCK_prepare_ordered)
+
+ if (is_group_commit_leader)
+
+ // The first in queue handles group commit for everyone
+
+ lock(LOCK_group_commit)
+ // Wait while queue is busy, see below for when this occurs
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+
+ // Grab and reverse the queue to get correct order of transactions
+ queue= atomic_grab_reverse_queue()
+
+ // This call will set individual error codes in thd->xid_error
+ // It also sets the cookie for unlog() in thd->xid_cookie
+ group_log_xid(queue)
+
+ lock(LOCK_commit_ordered)
+ for (other IN queue)
+ if (!other->xid_error)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ unlock(LOCK_group_commit)
+
+ // Now we are done, so wake up all the others.
+ for (other IN TAIL(queue))
+ lock(other->LOCK_commit_ordered)
+ other->group_commit_ready= TRUE
+ cond_signal(other->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+ else
+ // If not the leader, just wait until leader did the work for us.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+
+ // Finally do any error reporting now that we're back in own thread.
+ if (thd->xid_error)
+ xid_delayed_error(thd)
+ else
+ ht->commit(thd)
+ unlog(thd->xid_cookie, thd->xid)
+---- END ALGORITHM ----
+
+If the transaction coordinator does not support group_log_xid(), we have to do
+things differently. In this case after the serialisation point at
+prepare_ordered(), we have to parallelise again when running log_xid()
+(otherwise we would loose group commit). But then when log_xid() is done, we
+have to serialise again to check for any error and call commit_ordered() in
+correct sequence for any transaction where log_xid() did not return error.
+
+The central part of the algorithm in this case (when using log_xid()) is:
+
+---- BEGIN ALGORITHM ----
+ cookie= log_xid(thd)
+ error= (cookie == 0)
+
+ if (is_group_commit_leader)
+
+ // The first to enqueue grabs the queue and runs first.
+ // But we must wait until a previous queue run is fully done.
+
+ lock(LOCK_group_commit)
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+ queue= atomic_grab_reverse_queue()
+ // The queue will be busy until last thread in it is done.
+ group_commit_queue_busy= TRUE
+ unlock(LOCK_group_commit)
+ else
+ // Not first in queue -> wait for previous one to wake us up.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+
+ if (!error) // Only if log_xid() was successful
+ lock(LOCK_commit_ordered)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ // Wake up the next thread, and release queue in last.
+ next= thd->next_commit_ordered
+
+ if (next)
+ lock(next->LOCK_commit_ordered)
+ next->group_commit_ready= TRUE
+ cond_signal(next->COND_commit_ordered)
+ unlock(next->LOCK_commit_ordered)
+ else
+ lock(LOCK_group_commit)
+ group_commit_queue_busy= FALSE
+ unlock(LOCK_group_commit)
+---- END ALGORITHM ----
+
+There are a number of locks taken in the algorithm, but in the group_log_xid()
+case most of them should be uncontended most of the time. The
+LOCK_group_commit of course will be contended, as new threads queue up waiting
+for the previous group commit (and binlog fsync()) to finish so they can do
+the next group commit. This is the whole point of implementing group commit.
+
+The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should be not much
+contended as long as handlers follow the intension of having the corresponding
+handler calls execute quickly.
+
+The per-thread LOCK_commit_ordered mutexes should not be contended; they are
+only used to wake up a sleeping thread.
+
+
+1.4 Optimisations when not using all three new calls
+
+
+The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
+optional, and if not implemented by a particular handler/transaction
+coordinator, we can optimise the algorithm to take advantage of not having to
+keep ordering for the missing parts.
+
+If there is no prepare_ordered(), then we need not take the
+LOCK_prepare_ordered mutex.
+
+If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
+mutex.
+
+If there is no group_log_xid(), then we only need the queue to ensure same
+ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
+if either of these (or both) are also not present, we do not need to use the
+queue at all.
+
+
+2. Binlog code changes (log.cc)
+
+
+The bulk of the work needed for the binary log is to extend the code to allow
+group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
+inside the binlog code for group commit.
+
+The existing code runs most of the write + fsync to the binary lock under the
+global LOCK_log mutex, preventing any group commit.
+
+To enable group commit, this code must be split into two parts:
+
+ - one part that runs per transaction, re-writing the embedded event positions
+ for the correct offset, and writing this into the in-memory log cache.
+
+ - another part that writes a set of transactions to the disk, and runs
+ fsync().
+
+Then in group_log_xid(), we can run the first part in a loop over all the
+transactions in the passed-in queue, and run the second part only once.
+
+The binlog code also has other code paths that write into the binlog,
+eg. non-transactional statements. These have to be adapted also to work with
+the new code.
+
+In order to get some group commit facility for these also, we change that part
+of the code in a similar way to ha_commit_trans. We keep another,
+binlog-internal queue of such non-transactional binlog writes, and such writes
+queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
+LOCK_log, it loops over the queue for the fast part, and does the slow part
+once, then finally wakes up the others in the queue.
+
+In the transactional case in group_log_xid(), before we run the passed-in
+queue, we add any members found in the binlog-internal queue. This allows
+these non-transactional writes to share the group commit.
+
+However, in the case where it is a non-transactional write that gets the
+LOCK_log, the transactional transactions from the ha_commit_trans() queue will
+not be able to take part (they will have to wait for their turn to do another
+fsync). It seems difficult to cleanly let the binlog code grab the queue from
+out of the ha_commit_trans() algorithm. I think the group commit is mostly
+useful in transactional workloads anyway (non-transactional engines will loose
+data anyway in case of crash, so why fsync() after each transaction?)
+
+
+3. XtraDB changes (ha_innodb.cc)
+
+The changes needed in XtraDB are comparatively simple, as XtraDB already
+implements group commit, it just needs to be enabled with the new
+commit_ordered() call.
+
+The existing commit() method already is logically in two parts. The first part
+runs under the prepare_commit_mutex() and must be run in same order as binlog
+commit. This part needs to be moved to commit_ordered(). The second part runs
+after releasing prepare_commit_mutex and does transaction log write+fsync; it
+can remain.
+
+Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
+XtraDB option to disable it).
+
+There are two asserts that check that the thread running the first part of
+XtraDB commit is the same as the thread running the other operations for the
+transaction. These have to be removed (as commit_ordered() can run in a
+different thread). Also an error reporting with sql_print_error() has to be
+delayed until commit() time.
+
+
+4. Proof-of-concept implementation
+
+There is a proof-of-concept implementation of this architecture, in the form
+of a quilt patch series [3].
+
+A quick benchmark was done, with sync_binlog=1 and
+innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
+transactions against one table.
+
+Without the patch, we get only 25 queries per second.
+
+With the patch, we get 650 queries per second.
+
+
+5. Open issues/tasks
+
+5.1 XA / other prepare() and commit() call sites.
+
+Check that user-level XA is handled correctly and working. And covered
+sufficiently with tests. Also check that any other calls of ha->prepare() and
+ha->commit() outside of ha_commit_trans() are handled correctly.
+
+5.2 Testing
+
+This worklog needs additions to the test suite, including error inserts to
+check error handling, and synchronisation points to check thread parallelism
+correctness.
+
+
+6. Alternative implementations
+
+ - The binlog code maintains its own extra atomic transaction queue to handle
+ non-transactional commits in a good way together with transactional (with
+ respect to group commit). Alternatively, we could ignore this issue and
+ just give up on group commit for non-transactional statements, for some
+ code simplifications.
+
+ - The binlog code has two ways to prepare end_event and similar, one that
+ uses stack-allocation, and another for when stack allocation is not
+ possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
+ so small that it would make sense to use the same code for both cases.
+
+ - Instead of adding extra fields to THD, we could allocate a separate
+ structure on the thd->mem_root() with the required extra fields (including
+ the THD pointer). Would seem to require initialising mutexes at every
+ commit though.
+
+ - It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
+ (should not be hard).
+
+
+-----------------------------------------------------------------------
+
+References:
+
+[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
+
+[3] https://knielsen-hq.org/maria/patches.mwl116/
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.116.old.14249 2010-05-25 13:18:34.000000000 +0000
+++ /tmp/wklog.116.new.14249 2010-05-25 13:18:34.000000000 +0000
@@ -1 +1,157 @@
+The basic idea in group commit is that multiple threads, each handling one
+transaction, prepare for commit and then queue up together waiting to do an
+fsync() on the transaction log. Then once the log is available, a single
+thread does the fsync() + other necessary book-keeping for all of the threads
+at once. After this, the single thread signals the other threads that it's
+done and they can finish up and return success (or failure) from the commit
+operation.
+
+So group commit has a parallel part, and a sequential part. So we need a
+facility for engines/binlog to participate in both the parallel and the
+sequential part.
+
+To do this, we add two new handlerton methods:
+
+ int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+ void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
+
+The idea is that the existing prepare() and commit() methods run in the
+parallel part of group commit, and the new prepare_ordered() and
+commit_ordered() run in the sequential part.
+
+The prepare_ordered() method is called after prepare(). The order of
+tranctions that call into prepare_ordered() is guaranteed to be the same among
+all storage engines and binlog, and it is serialised so no two calls can be
+running inside the same engine at the same time.
+
+The commit_ordered() method is called before commit(), and similarly is
+guaranteed to have same transaction order in all participants, and to be
+serialised within one engine.
+
+As the prepare_ordered() and commit_ordered() calls are serialised, the idea
+is that handlers should do the minimum amount of work needed in these calls,
+relaying most of the work (eg. fsync() ...) to prepare() and commit().
+
+As a concrete example, for InnoDB the commit_ordered() method will do the
+first part of commit that fixed the commit order in the transaction log
+buffer, and the commit() method will write the log to disk and fsync()
+it. This split already exists inside the InnoDB code, running before
+respectively after releasing the prepare_commit_mutex.
+
+In addition, the XA transaction coordinator (TC_LOG) is special, since it is
+the one responsible for deciding whether to commit or rollback the
+transaction. For this we need an extra method, since this decision can be done
+only after we know that all prepare() and prepare_ordered() calls succeed, and
+must be done to know whether to call commit_ordered()/commit(), or do rollback.
+
+The existing method for this is TC_LOG::log_xid(). To make implementing group
+commit simpler to implement in a transaction coordinator and more efficient,
+we introduce a new method:
+
+ void group_log_xid(THD *first_thd);
+
+This method runs in the sequential part of group commit. It receives a list of
+transactions to perform log_xid() on, in the correct commit order. (Note that
+TC_LOG can do parallel parts of group commit in its own prepare() and commit()
+methods).
+
+This method can make it easier to implement the group commit in TC_LOG, as it
+gets directly the list of transactions in the right order. Without it, it
+might need to compute such order anyway in a prepare_ordered() method, and the
+server has to create this ordered list anyway to implement the order guarantee
+for prepare_ordered() and commit_ordered().
+
+This group_log_xid() method also is more efficient, as it avoids some
+inter-thread synchronisation. Since group_log_xid() is serialised, we can run
+it together with all the commit_ordered() method calls and need only a single
+sequential code section. With the log_xid() methods, we would need first a
+sequential part for the prepare_ordered() calls, then a parallel part with
+log_xid() calls (to not loose group commit ability for log_xid()), then again
+a sequential part for the commit_ordered() method calls.
+
+The extra synchronisation is needed, as each commit_ordered() call will have
+to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
+should not be called), and also wait for commit_ordered() to finish in all
+threads handling earlier commits. In effect we will need to bounce the
+execution from one thread to the other among all participants in the group
+commit.
+
+As a consequence of the group_log_xid() optimisation, handlers must be aware
+that the commit_ordered() call can happen in another thread than the one
+running commit() (so thread local storage is not available). This should not
+be a big issue as the THD is available for storing any needed information.
+
+Since group_log_xid() runs for multiple transactions in a single thread, it
+can not do error reporting (my_error()) as that relies on thread local
+storage. Instead it sets an error code in THD::xid_error, and if there is an
+error then later another method will be called (in correct thread context) to
+actually report the error:
+
+ int xid_delayed_error(THD *thd)
+
+The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
+are optional (as is xid_delayed_error). A storage engine or transaction
+coordinator is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB needs no ordering of the prepare phase,
+so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
+all, so does not need to implement any of them.
+
+Note in particular that all existing engines (/binlog implementations if they
+exist) will work unmodified (and also without any change in group commit
+facilities or commit order guaranteed).
+
+Using these new APIs, the work will be to
+
+ - In ha_commit_trans(), implement the correct semantics for the three new
+ calls.
+
+ - In XtraDB, use the new commit_ordered() call to remove the
+ prepare_commit_mutex (and resurrect group commit) without loosing the
+ consistency with binlog commit order.
+
+ - In log.cc (binlog module), implement group_log_xid() to do group commit of
+ multiple transactions to the binlog with a single shared fsync() call.
+
+-----------------------------------------------------------------------
+Some possible alternative for this worklog:
+
+ - We could eliminate the group_log_xid() method for a simpler API, at the
+ cost of extra synchronisation between threads to do in-order
+ commit_ordered() method calls. This would also allow to call
+ commit_ordered() in the correct thread context.
+
+ - Alternatively, we could eliminate log_xid() and require that all
+ transaction coordinators implement group_log_xid() instead, again for some
+ moderate simplification.
+
+ - At the moment there is no plugin actually using prepare_ordered(), so, it
+ could be removed from the design. But it fits in well, is efficient to
+ implement, and could be useful later (eg. for the requested feature of
+ releasing locks early in InnoDB).
+
+-----------------------------------------------------------------------
+Some possible follow-up projects after this is implemented:
+
+ - Add statistics about how efficient group commit is (#fsyncs/#commits in
+ each engine and binlog).
+
+ - Implement an XtraDB prepare_ordered() methods that can release row locks
+ early (Mark Callaghan from Facebook advocates this, but need to determine
+ exactly how to do this safely).
+
+ - Implement a new crash recovery algorithm that uses the consistent commit
+ ordering to need only fsync() for the binlog. At crash recovery, any
+ missing transactions in an engine is replayed from the correct point in the
+ binlog (this point must be stored transactionally inside the engine, as
+ XtraDB already does today).
+
+ - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
+ consistent snapshow, with same set of committed and not committed
+ transactions in all engines, 2) returns a corresponding consistent binlog
+ position. This should be easy by piggybacking on the synchronisation
+ implemented for ha_commit_trans().
+
+ - Use this in XtraBackup to get consistent binlog position without having to
+ block all updates with FLUSH TABLES WITH READ LOCK.
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
--- /tmp/wklog.116.old.14234 2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234 2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have separate transaction log from the
+mysql binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on iternals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get same commit order in
+engines an binlog (except for the InnoDB prepare_commit_mutex hack), there are
+several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+ consistent with the hot backup to be able to provision a new slave, and
+ this is impossible without imposing at least partial consistent ordering
+ between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, eg. XtraBackup or
+ `mysqldump --single-transaction`, to have consistent commit order between
+ binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
+ or similar expensive blocking operation. (other backup methods, like LVM
+ snapshot, don't need consistent commit order, as they can restore
+ out-of-order commits during crash recovery using XA).
+
+ - If we have consistent commit order, we can think about optimising commit to
+ need only one fsync (for binlog); lost commits in storage engines can then
+ be recovered from the binlog at crash recovery by re-playing against the
+ engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+ TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
+ could even get it to return also a matching binlog position). Currently,
+ this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
+ engines.
+
+ - In InnoDB, the performance in the presense of hotspots can be improved if
+ we can release row locks early in the commit phase, but this requires that we
+release them in
+ the same order as commits in the binlog to ensure consistency between
+ master and slaves.
+
+ - There was some discussions around Galera [1] synchroneous replication and
+ global transaction ID that it needed consistent commit order among
+ participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+ commit order, and that the architecture described in this worklog can
+ implement such guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+
-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the archtecture. Got to something now that I am satisfied with and
that seems to be able to handle all issues.
Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.
Next step: write up the architecture description.
Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).
-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.
Found and fixed several problems in error handling when writing to binlog.
Removed redundant table map version locking.
Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.
Still to be done:
Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc 2) move the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) move fast part of InnoDB commit to innobase_commit_fast(), removing the
prepare_commit_mutex().
Write up the final design in this worklog.
Evaluate the design to see if we can do better/different.
Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).
-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg
DESCRIPTION:
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).
This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.
The idea is to
- Implement/fix group commit to work properly with the binary log enabled.
- (Optionally) avoid the need to fsync() in the engine, and instead rely on
replaying any lost transactions from the binary log against the engine
during crash recovery.
For background see these articles:
http://kristiannielsen.livejournal.com/12254.html
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
----
Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:
1. Because storage engine handlers have separate transaction log from the
mysql binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).
2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on iternals of the others.
3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.
While currently there is no server guarantee to get same commit order in
engines an binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:
- InnoDB hot backup needs to be able to extract a binlog position that is
consistent with the hot backup to be able to provision a new slave, and
this is impossible without imposing at least partial consistent ordering
between InnoDB and binlog.
- Other backup methods could have similar needs, eg. XtraBackup or
`mysqldump --single-transaction`, to have consistent commit order between
binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
or similar expensive blocking operation. (other backup methods, like LVM
snapshot, don't need consistent commit order, as they can restore
out-of-order commits during crash recovery using XA).
- If we have consistent commit order, we can think about optimising commit to
need only one fsync (for binlog); lost commits in storage engines can then
be recovered from the binlog at crash recovery by re-playing against the
engine from a particular point in the binlog.
- With consistent commit order, we can get better semantics for START
TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
could even get it to return also a matching binlog position). Currently,
this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
engines.
- In InnoDB, the performance in the presense of hotspots can be improved if
we can release row locks early in the commit phase, but this requires that we
release them in
the same order as commits in the binlog to ensure consistency between
master and slaves.
- There was some discussions around Galera [1] synchroneous replication and
global transaction ID that it needed consistent commit order among
participating engines.
- I believe there could be other applications for guaranteed consistent
commit order, and that the architecture described in this worklog can
implement such guarantee with reasonable overhead.
References:
[1] Galera: http://www.codership.com/products/galera_replication
HIGH-LEVEL SPECIFICATION:
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.
So group commit has a parallel part, and a sequential part. So we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.
To do this, we add two new handlerton methods:
int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.
The prepare_ordered() method is called after prepare(). The order of
tranctions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.
The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.
As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
relaying most of the work (eg. fsync() ...) to prepare() and commit().
As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixed the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before
respectively after releasing the prepare_commit_mutex.
In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. For this we need an extra method, since this decision can be done
only after we know that all prepare() and prepare_ordered() calls succeed, and
must be done to know whether to call commit_ordered()/commit(), or do rollback.
The existing method for this is TC_LOG::log_xid(). To make implementing group
commit simpler to implement in a transaction coordinator and more efficient,
we introduce a new method:
void group_log_xid(THD *first_thd);
This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).
This method can make it easier to implement the group commit in TC_LOG, as it
gets directly the list of transactions in the right order. Without it, it
might need to compute such order anyway in a prepare_ordered() method, and the
server has to create this ordered list anyway to implement the order guarantee
for prepare_ordered() and commit_ordered().
This group_log_xid() method also is more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (to not loose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.
The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.
As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.
Since group_log_xid() runs for multiple transactions in a single thread, it
can not do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:
int xid_delayed_error(THD *thd)
The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.
Note in particular that all existing engines (/binlog implementations if they
exist) will work unmodified (and also without any change in group commit
facilities or commit order guaranteed).
Using these new APIs, the work will be to
- In ha_commit_trans(), implement the correct semantics for the three new
calls.
- In XtraDB, use the new commit_ordered() call to remove the
prepare_commit_mutex (and resurrect group commit) without loosing the
consistency with binlog commit order.
- In log.cc (binlog module), implement group_log_xid() to do group commit of
multiple transactions to the binlog with a single shared fsync() call.
-----------------------------------------------------------------------
Some possible alternative for this worklog:
- We could eliminate the group_log_xid() method for a simpler API, at the
cost of extra synchronisation between threads to do in-order
commit_ordered() method calls. This would also allow to call
commit_ordered() in the correct thread context.
- Alternatively, we could eliminate log_xid() and require that all
transaction coordinators implement group_log_xid() instead, again for some
moderate simplification.
- At the moment there is no plugin actually using prepare_ordered(), so, it
could be removed from the design. But it fits in well, is efficient to
implement, and could be useful later (eg. for the requested feature of
releasing locks early in InnoDB).
-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:
- Add statistics about how efficient group commit is (#fsyncs/#commits in
each engine and binlog).
- Implement an XtraDB prepare_ordered() methods that can release row locks
early (Mark Callaghan from Facebook advocates this, but need to determine
exactly how to do this safely).
- Implement a new crash recovery algorithm that uses the consistent commit
ordering to need only fsync() for the binlog. At crash recovery, any
missing transactions in an engine is replayed from the correct point in the
binlog (this point must be stored transactionally inside the engine, as
XtraDB already does today).
- Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
consistent snapshow, with same set of committed and not committed
transactions in all engines, 2) returns a corresponding consistent binlog
position. This should be easy by piggybacking on the synchronisation
implemented for ha_commit_trans().
- Use this in XtraBackup to get consistent binlog position without having to
block all updates with FLUSH TABLES WITH READ LOCK.
LOW-LEVEL DESIGN:
1. Changes for ha_commit_trans()
The gut of the code for commit is in the function ha_commit_trans() (and in
commit_one_phase() which is called from it). This must be extended to use the
new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
1.1 Atomic queue of committing transactions
To keep the right commit order among participants, we put transactions into a
queue. The operations on the queue are non-locking:
- Insert THD at the head of the queue, and return old queue.
THD *enqueue_atomic(THD *thd)
- Fetch (and delete) the whole queue.
THD *atomic_grab_reverse_queue()
These are simple to implement with atomic compare-and-set. Note that there is
no ABA problem [2], as we do not delete individual elements from the queue, we
grab the whole queue and replace it with NULL.
A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order.
The queue is grabbed by the code doing group_log_xid() and commit_ordered()
calls. The queue is passed directly to group_log_xid(), and afterwards
iterated to do individual commit_ordered() calls.
Using a lock-free queue allows prepare_ordered() (for one transaction) to run
in parallel with commit_ordered (in another transaction), increasing potential
parallelism.
The queue is simply a linked list of THD objects, linked through a
THD::next_commit_ordered field. Since we add at the head of the queue, the
list is actually in reverse order, so must be reversed when we grab and delete
it.
The reason that enqueue_atomic() returns the old queue is so that we can check
if an insert goes to the head of the queue. The thread at the head of the
queue will do the sequential part of group commit for everyone.
1.2 Locks
1.2.1 Global LOCK_prepare_ordered
This lock is taken to serialise calls to prepare_ordered(). Note that
effectively, the commit order is decided by the order in which threads obtain
this lock.
1.2.2 Global LOCK_group_commit and COND_group_commit
This lock is used to protect the serial part of group commit. It is taken
around the code where we grab the queue, call group_log_xid() on the queue,
and call commit_ordered() on each element of the queue, to make sure they
happen serialised and in consistent order. It also protects the variable
group_commit_queue_busy, which is used when not using group_log_xid() to delay
running over a new queue until the first queue is completely done.
1.2.3 Global LOCK_commit_ordered
This lock is taken around calls to commit_ordered(), to ensure they happen
serialised.
1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
This lock protects the thd->group_commit_ready variable, as well as the
condition variable used to wake up threads after log_xid() and
commit_ordered() finishes.
1.2.5 Global LOCK_group_commit_queue
This is only used on platforms with no native compare-and-set operations, to
make the queue operations atomic.
1.3 Commit algorithm.
This is the basic algorithm, simplified by
- omitting some error handling
- omitting looping over all handlers when invoking handler methods
- omitting some possible optimisations when not all calls needed (see next
section).
- Omitting the case where no group_log_xid() is used, see below.
---- BEGIN ALGORITHM ----
ht->prepare()
// Call prepare_ordered() and enqueue in correct commit order
lock(LOCK_prepare_ordered)
ht->prepare_ordered()
old_queue= enqueue_atomic(thd)
thd->group_commit_ready= FALSE
is_group_commit_leader= (old_queue == NULL)
unlock(LOCK_prepare_ordered)
if (is_group_commit_leader)
// The first in queue handles group commit for everyone
lock(LOCK_group_commit)
// Wait while queue is busy, see below for when this occurs
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
// Grab and reverse the queue to get correct order of transactions
queue= atomic_grab_reverse_queue()
// This call will set individual error codes in thd->xid_error
// It also sets the cookie for unlog() in thd->xid_cookie
group_log_xid(queue)
lock(LOCK_commit_ordered)
for (other IN queue)
if (!other->xid_error)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
unlock(LOCK_group_commit)
// Now we are done, so wake up all the others.
for (other IN TAIL(queue))
lock(other->LOCK_commit_ordered)
other->group_commit_ready= TRUE
cond_signal(other->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
else
// If not the leader, just wait until leader did the work for us.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
// Finally do any error reporting now that we're back in own thread.
if (thd->xid_error)
xid_delayed_error(thd)
else
ht->commit(thd)
unlog(thd->xid_cookie, thd->xid)
---- END ALGORITHM ----
If the transaction coordinator does not support group_log_xid(), we have to do
things differently. In this case after the serialisation point at
prepare_ordered(), we have to parallelise again when running log_xid()
(otherwise we would loose group commit). But then when log_xid() is done, we
have to serialise again to check for any error and call commit_ordered() in
correct sequence for any transaction where log_xid() did not return error.
The central part of the algorithm in this case (when using log_xid()) is:
---- BEGIN ALGORITHM ----
cookie= log_xid(thd)
error= (cookie == 0)
if (is_group_commit_leader)
// The first to enqueue grabs the queue and runs first.
// But we must wait until a previous queue run is fully done.
lock(LOCK_group_commit)
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
queue= atomic_grab_reverse_queue()
// The queue will be busy until last thread in it is done.
group_commit_queue_busy= TRUE
unlock(LOCK_group_commit)
else
// Not first in queue -> wait for previous one to wake us up.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
if (!error) // Only if log_xid() was successful
lock(LOCK_commit_ordered)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
// Wake up the next thread, and release queue in last.
next= thd->next_commit_ordered
if (next)
lock(next->LOCK_commit_ordered)
next->group_commit_ready= TRUE
cond_signal(next->COND_commit_ordered)
unlock(next->LOCK_commit_ordered)
else
lock(LOCK_group_commit)
group_commit_queue_busy= FALSE
unlock(LOCK_group_commit)
---- END ALGORITHM ----
There are a number of locks taken in the algorithm, but in the group_log_xid()
case most of them should be uncontended most of the time. The
LOCK_group_commit of course will be contended, as new threads queue up waiting
for the previous group commit (and binlog fsync()) to finish so they can do
the next group commit. This is the whole point of implementing group commit.
The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should be not much
contended as long as handlers follow the intension of having the corresponding
handler calls execute quickly.
The per-thread LOCK_commit_ordered mutexes should not be contended; they are
only used to wake up a sleeping thread.
1.4 Optimisations when not using all three new calls
The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
optional, and if not implemented by a particular handler/transaction
coordinator, we can optimise the algorithm to take advantage of not having to
keep ordering for the missing parts.
If there is no prepare_ordered(), then we need not take the
LOCK_prepare_ordered mutex.
If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
mutex.
If there is no group_log_xid(), then we only need the queue to ensure same
ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
if either of these (or both) are also not present, we do not need to use the
queue at all.
2. Binlog code changes (log.cc)
The bulk of the work needed for the binary log is to extend the code to allow
group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
inside the binlog code for group commit.
The existing code runs most of the write + fsync to the binary lock under the
global LOCK_log mutex, preventing any group commit.
To enable group commit, this code must be split into two parts:
- one part that runs per transaction, re-writing the embedded event positions
for the correct offset, and writing this into the in-memory log cache.
- another part that writes a set of transactions to the disk, and runs
fsync().
Then in group_log_xid(), we can run the first part in a loop over all the
transactions in the passed-in queue, and run the second part only once.
The binlog code also has other code paths that write into the binlog,
eg. non-transactional statements. These have to be adapted also to work with
the new code.
In order to get some group commit facility for these also, we change that part
of the code in a similar way to ha_commit_trans. We keep another,
binlog-internal queue of such non-transactional binlog writes, and such writes
queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
LOCK_log, it loops over the queue for the fast part, and does the slow part
once, then finally wakes up the others in the queue.
In the transactional case in group_log_xid(), before we run the passed-in
queue, we add any members found in the binlog-internal queue. This allows
these non-transactional writes to share the group commit.
However, in the case where it is a non-transactional write that gets the
LOCK_log, the transactional transactions from the ha_commit_trans() queue will
not be able to take part (they will have to wait for their turn to do another
fsync). It seems difficult to cleanly let the binlog code grab the queue from
out of the ha_commit_trans() algorithm. I think the group commit is mostly
useful in transactional workloads anyway (non-transactional engines will loose
data anyway in case of crash, so why fsync() after each transaction?)
3. XtraDB changes (ha_innodb.cc)
The changes needed in XtraDB are comparatively simple, as XtraDB already
implements group commit, it just needs to be enabled with the new
commit_ordered() call.
The existing commit() method already is logically in two parts. The first part
runs under the prepare_commit_mutex() and must be run in same order as binlog
commit. This part needs to be moved to commit_ordered(). The second part runs
after releasing prepare_commit_mutex and does transaction log write+fsync; it
can remain.
Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
XtraDB option to disable it).
There are two asserts that check that the thread running the first part of
XtraDB commit is the same as the thread running the other operations for the
transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also an error reporting with sql_print_error() has to be
delayed until commit() time.
4. Proof-of-concept implementation
There is a proof-of-concept implementation of this architecture, in the form
of a quilt patch series [3].
A quick benchmark was done, with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
transactions against one table.
Without the patch, we get only 25 queries per second.
With the patch, we get 650 queries per second.
5. Open issues/tasks
5.1 XA / other prepare() and commit() call sites.
Check that user-level XA is handled correctly and working. And covered
sufficiently with tests. Also check that any other calls of ha->prepare() and
ha->commit() outside of ha_commit_trans() are handled correctly.
5.2 Testing
This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
6. Alternative implementations
- The binlog code maintains its own extra atomic transaction queue to handle
non-transactional commits in a good way together with transactional (with
respect to group commit). Alternatively, we could ignore this issue and
just give up on group commit for non-transactional statements, for some
code simplifications.
- The binlog code has two ways to prepare end_event and similar, one that
uses stack-allocation, and another for when stack allocation is not
possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
so small that it would make sense to use the same code for both cases.
- Instead of adding extra fields to THD, we could allocate a separate
structure on the thd->mem_root() with the required extra fields (including
the THD pointer). Would seem to require initialising mutexes at every
commit though.
- It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
(should not be hard).
-----------------------------------------------------------------------
References:
[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
[3] https://knielsen-hq.org/maria/patches.mwl116/
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Guest): Efficient group commit for binary log (116)
by worklog-noreply@askmonty.org 01 Jun '10
by worklog-noreply@askmonty.org 01 Jun '10
01 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 60
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Tue, 01 Jun 2010, 14:20)=-=-
Status updated.
--- /tmp/wklog.116.old.32652 2010-06-01 14:20:15.000000000 +0000
+++ /tmp/wklog.116.new.32652 2010-06-01 14:20:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Mon, 31 May 2010, 06:48)=-=-
Finish first architecture draft (changed my mind a number of times before I was satisfied).
Write up architecture in worklog.
Fix remaining test failures in proof-of-concept patch + implement xtradb part.
Run some benchmarks on proof-of-concept implementation.
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Tue, 25 May 2010, 13:19)=-=-
Low Level Design modified.
--- /tmp/wklog.116.old.14255 2010-05-25 13:19:00.000000000 +0000
+++ /tmp/wklog.116.new.14255 2010-05-25 13:19:00.000000000 +0000
@@ -1 +1,363 @@
+1. Changes for ha_commit_trans()
+
+The gut of the code for commit is in the function ha_commit_trans() (and in
+commit_one_phase() which is called from it). This must be extended to use the
+new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
+
+1.1 Atomic queue of committing transactions
+
+To keep the right commit order among participants, we put transactions into a
+queue. The operations on the queue are non-locking:
+
+ - Insert THD at the head of the queue, and return old queue.
+
+ THD *enqueue_atomic(THD *thd)
+
+ - Fetch (and delete) the whole queue.
+
+ THD *atomic_grab_reverse_queue()
+
+These are simple to implement with atomic compare-and-set. Note that there is
+no ABA problem [2], as we do not delete individual elements from the queue, we
+grab the whole queue and replace it with NULL.
+
+A transaction enters the queue when it does prepare_ordered(). This way, the
+scheduling order for prepare_ordered() calls is what determines the sequence
+in the queue and effectively the commit order.
+
+The queue is grabbed by the code doing group_log_xid() and commit_ordered()
+calls. The queue is passed directly to group_log_xid(), and afterwards
+iterated to do individual commit_ordered() calls.
+
+Using a lock-free queue allows prepare_ordered() (for one transaction) to run
+in parallel with commit_ordered (in another transaction), increasing potential
+parallelism.
+
+The queue is simply a linked list of THD objects, linked through a
+THD::next_commit_ordered field. Since we add at the head of the queue, the
+list is actually in reverse order, so must be reversed when we grab and delete
+it.
+
+The reason that enqueue_atomic() returns the old queue is so that we can check
+if an insert goes to the head of the queue. The thread at the head of the
+queue will do the sequential part of group commit for everyone.
+
+
+1.2 Locks
+
+1.2.1 Global LOCK_prepare_ordered
+
+This lock is taken to serialise calls to prepare_ordered(). Note that
+effectively, the commit order is decided by the order in which threads obtain
+this lock.
+
+
+1.2.2 Global LOCK_group_commit and COND_group_commit
+
+This lock is used to protect the serial part of group commit. It is taken
+around the code where we grab the queue, call group_log_xid() on the queue,
+and call commit_ordered() on each element of the queue, to make sure they
+happen serialised and in consistent order. It also protects the variable
+group_commit_queue_busy, which is used when not using group_log_xid() to delay
+running over a new queue until the first queue is completely done.
+
+
+1.2.3 Global LOCK_commit_ordered
+
+This lock is taken around calls to commit_ordered(), to ensure they happen
+serialised.
+
+
+1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
+
+This lock protects the thd->group_commit_ready variable, as well as the
+condition variable used to wake up threads after log_xid() and
+commit_ordered() finishes.
+
+
+1.2.5 Global LOCK_group_commit_queue
+
+This is only used on platforms with no native compare-and-set operations, to
+make the queue operations atomic.
+
+
+1.3 Commit algorithm.
+
+This is the basic algorithm, simplified by
+
+ - omitting some error handling
+
+ - omitting looping over all handlers when invoking handler methods
+
+ - omitting some possible optimisations when not all calls needed (see next
+ section).
+
+ - Omitting the case where no group_log_xid() is used, see below.
+
+---- BEGIN ALGORITHM ----
+ ht->prepare()
+
+ // Call prepare_ordered() and enqueue in correct commit order
+ lock(LOCK_prepare_ordered)
+ ht->prepare_ordered()
+ old_queue= enqueue_atomic(thd)
+ thd->group_commit_ready= FALSE
+ is_group_commit_leader= (old_queue == NULL)
+ unlock(LOCK_prepare_ordered)
+
+ if (is_group_commit_leader)
+
+ // The first in queue handles group commit for everyone
+
+ lock(LOCK_group_commit)
+ // Wait while queue is busy, see below for when this occurs
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+
+ // Grab and reverse the queue to get correct order of transactions
+ queue= atomic_grab_reverse_queue()
+
+ // This call will set individual error codes in thd->xid_error
+ // It also sets the cookie for unlog() in thd->xid_cookie
+ group_log_xid(queue)
+
+ lock(LOCK_commit_ordered)
+ for (other IN queue)
+ if (!other->xid_error)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ unlock(LOCK_group_commit)
+
+ // Now we are done, so wake up all the others.
+ for (other IN TAIL(queue))
+ lock(other->LOCK_commit_ordered)
+ other->group_commit_ready= TRUE
+ cond_signal(other->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+ else
+ // If not the leader, just wait until leader did the work for us.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+
+ // Finally do any error reporting now that we're back in own thread.
+ if (thd->xid_error)
+ xid_delayed_error(thd)
+ else
+ ht->commit(thd)
+ unlog(thd->xid_cookie, thd->xid)
+---- END ALGORITHM ----
+
+If the transaction coordinator does not support group_log_xid(), we have to do
+things differently. In this case after the serialisation point at
+prepare_ordered(), we have to parallelise again when running log_xid()
+(otherwise we would loose group commit). But then when log_xid() is done, we
+have to serialise again to check for any error and call commit_ordered() in
+correct sequence for any transaction where log_xid() did not return error.
+
+The central part of the algorithm in this case (when using log_xid()) is:
+
+---- BEGIN ALGORITHM ----
+ cookie= log_xid(thd)
+ error= (cookie == 0)
+
+ if (is_group_commit_leader)
+
+ // The first to enqueue grabs the queue and runs first.
+ // But we must wait until a previous queue run is fully done.
+
+ lock(LOCK_group_commit)
+ while (group_commit_queue_busy)
+ cond_wait(COND_group_commit)
+ queue= atomic_grab_reverse_queue()
+ // The queue will be busy until last thread in it is done.
+ group_commit_queue_busy= TRUE
+ unlock(LOCK_group_commit)
+ else
+ // Not first in queue -> wait for previous one to wake us up.
+ lock(thd->LOCK_commit_ordered)
+ while (!thd->group_commit_ready)
+ cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
+ unlock(other->LOCK_commit_ordered)
+
+ if (!error) // Only if log_xid() was successful
+ lock(LOCK_commit_ordered)
+ ht->commit_ordered()
+ unlock(LOCK_commit_ordered)
+
+ // Wake up the next thread, and release queue in last.
+ next= thd->next_commit_ordered
+
+ if (next)
+ lock(next->LOCK_commit_ordered)
+ next->group_commit_ready= TRUE
+ cond_signal(next->COND_commit_ordered)
+ unlock(next->LOCK_commit_ordered)
+ else
+ lock(LOCK_group_commit)
+ group_commit_queue_busy= FALSE
+ unlock(LOCK_group_commit)
+---- END ALGORITHM ----
+
+There are a number of locks taken in the algorithm, but in the group_log_xid()
+case most of them should be uncontended most of the time. The
+LOCK_group_commit of course will be contended, as new threads queue up waiting
+for the previous group commit (and binlog fsync()) to finish so they can do
+the next group commit. This is the whole point of implementing group commit.
+
+The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should be not much
+contended as long as handlers follow the intension of having the corresponding
+handler calls execute quickly.
+
+The per-thread LOCK_commit_ordered mutexes should not be contended; they are
+only used to wake up a sleeping thread.
+
+
+1.4 Optimisations when not using all three new calls
+
+
+The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
+optional, and if not implemented by a particular handler/transaction
+coordinator, we can optimise the algorithm to take advantage of not having to
+keep ordering for the missing parts.
+
+If there is no prepare_ordered(), then we need not take the
+LOCK_prepare_ordered mutex.
+
+If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
+mutex.
+
+If there is no group_log_xid(), then we only need the queue to ensure same
+ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
+if either of these (or both) are also not present, we do not need to use the
+queue at all.
+
+
+2. Binlog code changes (log.cc)
+
+
+The bulk of the work needed for the binary log is to extend the code to allow
+group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
+inside the binlog code for group commit.
+
+The existing code runs most of the write + fsync to the binary lock under the
+global LOCK_log mutex, preventing any group commit.
+
+To enable group commit, this code must be split into two parts:
+
+ - one part that runs per transaction, re-writing the embedded event positions
+ for the correct offset, and writing this into the in-memory log cache.
+
+ - another part that writes a set of transactions to the disk, and runs
+ fsync().
+
+Then in group_log_xid(), we can run the first part in a loop over all the
+transactions in the passed-in queue, and run the second part only once.
+
+The binlog code also has other code paths that write into the binlog,
+eg. non-transactional statements. These have to be adapted also to work with
+the new code.
+
+In order to get some group commit facility for these also, we change that part
+of the code in a similar way to ha_commit_trans. We keep another,
+binlog-internal queue of such non-transactional binlog writes, and such writes
+queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
+LOCK_log, it loops over the queue for the fast part, and does the slow part
+once, then finally wakes up the others in the queue.
+
+In the transactional case in group_log_xid(), before we run the passed-in
+queue, we add any members found in the binlog-internal queue. This allows
+these non-transactional writes to share the group commit.
+
+However, in the case where it is a non-transactional write that gets the
+LOCK_log, the transactional transactions from the ha_commit_trans() queue will
+not be able to take part (they will have to wait for their turn to do another
+fsync). It seems difficult to cleanly let the binlog code grab the queue from
+out of the ha_commit_trans() algorithm. I think the group commit is mostly
+useful in transactional workloads anyway (non-transactional engines will loose
+data anyway in case of crash, so why fsync() after each transaction?)
+
+
+3. XtraDB changes (ha_innodb.cc)
+
+The changes needed in XtraDB are comparatively simple, as XtraDB already
+implements group commit, it just needs to be enabled with the new
+commit_ordered() call.
+
+The existing commit() method already is logically in two parts. The first part
+runs under the prepare_commit_mutex() and must be run in same order as binlog
+commit. This part needs to be moved to commit_ordered(). The second part runs
+after releasing prepare_commit_mutex and does transaction log write+fsync; it
+can remain.
+
+Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
+XtraDB option to disable it).
+
+There are two asserts that check that the thread running the first part of
+XtraDB commit is the same as the thread running the other operations for the
+transaction. These have to be removed (as commit_ordered() can run in a
+different thread). Also an error reporting with sql_print_error() has to be
+delayed until commit() time.
+
+
+4. Proof-of-concept implementation
+
+There is a proof-of-concept implementation of this architecture, in the form
+of a quilt patch series [3].
+
+A quick benchmark was done, with sync_binlog=1 and
+innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
+transactions against one table.
+
+Without the patch, we get only 25 queries per second.
+
+With the patch, we get 650 queries per second.
+
+
+5. Open issues/tasks
+
+5.1 XA / other prepare() and commit() call sites.
+
+Check that user-level XA is handled correctly and working. And covered
+sufficiently with tests. Also check that any other calls of ha->prepare() and
+ha->commit() outside of ha_commit_trans() are handled correctly.
+
+5.2 Testing
+
+This worklog needs additions to the test suite, including error inserts to
+check error handling, and synchronisation points to check thread parallelism
+correctness.
+
+
+6. Alternative implementations
+
+ - The binlog code maintains its own extra atomic transaction queue to handle
+ non-transactional commits in a good way together with transactional (with
+ respect to group commit). Alternatively, we could ignore this issue and
+ just give up on group commit for non-transactional statements, for some
+ code simplifications.
+
+ - The binlog code has two ways to prepare end_event and similar, one that
+ uses stack-allocation, and another for when stack allocation is not
+ possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
+ so small that it would make sense to use the same code for both cases.
+
+ - Instead of adding extra fields to THD, we could allocate a separate
+ structure on the thd->mem_root() with the required extra fields (including
+ the THD pointer). Would seem to require initialising mutexes at every
+ commit though.
+
+ - It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
+ (should not be hard).
+
+
+-----------------------------------------------------------------------
+
+References:
+
+[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
+
+[3] https://knielsen-hq.org/maria/patches.mwl116/
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.116.old.14249 2010-05-25 13:18:34.000000000 +0000
+++ /tmp/wklog.116.new.14249 2010-05-25 13:18:34.000000000 +0000
@@ -1 +1,157 @@
+The basic idea in group commit is that multiple threads, each handling one
+transaction, prepare for commit and then queue up together waiting to do an
+fsync() on the transaction log. Then once the log is available, a single
+thread does the fsync() + other necessary book-keeping for all of the threads
+at once. After this, the single thread signals the other threads that it's
+done and they can finish up and return success (or failure) from the commit
+operation.
+
+So group commit has a parallel part, and a sequential part. So we need a
+facility for engines/binlog to participate in both the parallel and the
+sequential part.
+
+To do this, we add two new handlerton methods:
+
+ int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+ void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
+
+The idea is that the existing prepare() and commit() methods run in the
+parallel part of group commit, and the new prepare_ordered() and
+commit_ordered() run in the sequential part.
+
+The prepare_ordered() method is called after prepare(). The order of
+tranctions that call into prepare_ordered() is guaranteed to be the same among
+all storage engines and binlog, and it is serialised so no two calls can be
+running inside the same engine at the same time.
+
+The commit_ordered() method is called before commit(), and similarly is
+guaranteed to have same transaction order in all participants, and to be
+serialised within one engine.
+
+As the prepare_ordered() and commit_ordered() calls are serialised, the idea
+is that handlers should do the minimum amount of work needed in these calls,
+relaying most of the work (eg. fsync() ...) to prepare() and commit().
+
+As a concrete example, for InnoDB the commit_ordered() method will do the
+first part of commit that fixed the commit order in the transaction log
+buffer, and the commit() method will write the log to disk and fsync()
+it. This split already exists inside the InnoDB code, running before
+respectively after releasing the prepare_commit_mutex.
+
+In addition, the XA transaction coordinator (TC_LOG) is special, since it is
+the one responsible for deciding whether to commit or rollback the
+transaction. For this we need an extra method, since this decision can be done
+only after we know that all prepare() and prepare_ordered() calls succeed, and
+must be done to know whether to call commit_ordered()/commit(), or do rollback.
+
+The existing method for this is TC_LOG::log_xid(). To make implementing group
+commit simpler to implement in a transaction coordinator and more efficient,
+we introduce a new method:
+
+ void group_log_xid(THD *first_thd);
+
+This method runs in the sequential part of group commit. It receives a list of
+transactions to perform log_xid() on, in the correct commit order. (Note that
+TC_LOG can do parallel parts of group commit in its own prepare() and commit()
+methods).
+
+This method can make it easier to implement the group commit in TC_LOG, as it
+gets directly the list of transactions in the right order. Without it, it
+might need to compute such order anyway in a prepare_ordered() method, and the
+server has to create this ordered list anyway to implement the order guarantee
+for prepare_ordered() and commit_ordered().
+
+This group_log_xid() method also is more efficient, as it avoids some
+inter-thread synchronisation. Since group_log_xid() is serialised, we can run
+it together with all the commit_ordered() method calls and need only a single
+sequential code section. With the log_xid() methods, we would need first a
+sequential part for the prepare_ordered() calls, then a parallel part with
+log_xid() calls (to not loose group commit ability for log_xid()), then again
+a sequential part for the commit_ordered() method calls.
+
+The extra synchronisation is needed, as each commit_ordered() call will have
+to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
+should not be called), and also wait for commit_ordered() to finish in all
+threads handling earlier commits. In effect we will need to bounce the
+execution from one thread to the other among all participants in the group
+commit.
+
+As a consequence of the group_log_xid() optimisation, handlers must be aware
+that the commit_ordered() call can happen in another thread than the one
+running commit() (so thread local storage is not available). This should not
+be a big issue as the THD is available for storing any needed information.
+
+Since group_log_xid() runs for multiple transactions in a single thread, it
+can not do error reporting (my_error()) as that relies on thread local
+storage. Instead it sets an error code in THD::xid_error, and if there is an
+error then later another method will be called (in correct thread context) to
+actually report the error:
+
+ int xid_delayed_error(THD *thd)
+
+The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
+are optional (as is xid_delayed_error). A storage engine or transaction
+coordinator is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB needs no ordering of the prepare phase,
+so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
+all, so does not need to implement any of them.
+
+Note in particular that all existing engines (/binlog implementations if they
+exist) will work unmodified (and also without any change in group commit
+facilities or commit order guaranteed).
+
+Using these new APIs, the work will be to
+
+ - In ha_commit_trans(), implement the correct semantics for the three new
+ calls.
+
+ - In XtraDB, use the new commit_ordered() call to remove the
+ prepare_commit_mutex (and resurrect group commit) without loosing the
+ consistency with binlog commit order.
+
+ - In log.cc (binlog module), implement group_log_xid() to do group commit of
+ multiple transactions to the binlog with a single shared fsync() call.
+
+-----------------------------------------------------------------------
+Some possible alternative for this worklog:
+
+ - We could eliminate the group_log_xid() method for a simpler API, at the
+ cost of extra synchronisation between threads to do in-order
+ commit_ordered() method calls. This would also allow to call
+ commit_ordered() in the correct thread context.
+
+ - Alternatively, we could eliminate log_xid() and require that all
+ transaction coordinators implement group_log_xid() instead, again for some
+ moderate simplification.
+
+ - At the moment there is no plugin actually using prepare_ordered(), so, it
+ could be removed from the design. But it fits in well, is efficient to
+ implement, and could be useful later (eg. for the requested feature of
+ releasing locks early in InnoDB).
+
+-----------------------------------------------------------------------
+Some possible follow-up projects after this is implemented:
+
+ - Add statistics about how efficient group commit is (#fsyncs/#commits in
+ each engine and binlog).
+
+ - Implement an XtraDB prepare_ordered() methods that can release row locks
+ early (Mark Callaghan from Facebook advocates this, but need to determine
+ exactly how to do this safely).
+
+ - Implement a new crash recovery algorithm that uses the consistent commit
+ ordering to need only fsync() for the binlog. At crash recovery, any
+ missing transactions in an engine is replayed from the correct point in the
+ binlog (this point must be stored transactionally inside the engine, as
+ XtraDB already does today).
+
+ - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
+ consistent snapshow, with same set of committed and not committed
+ transactions in all engines, 2) returns a corresponding consistent binlog
+ position. This should be easy by piggybacking on the synchronisation
+ implemented for ha_commit_trans().
+
+ - Use this in XtraBackup to get consistent binlog position without having to
+ block all updates with FLUSH TABLES WITH READ LOCK.
-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
--- /tmp/wklog.116.old.14234 2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234 2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have separate transaction log from the
+mysql binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on iternals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get same commit order in
+engines an binlog (except for the InnoDB prepare_commit_mutex hack), there are
+several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+ consistent with the hot backup to be able to provision a new slave, and
+ this is impossible without imposing at least partial consistent ordering
+ between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, eg. XtraBackup or
+ `mysqldump --single-transaction`, to have consistent commit order between
+ binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
+ or similar expensive blocking operation. (other backup methods, like LVM
+ snapshot, don't need consistent commit order, as they can restore
+ out-of-order commits during crash recovery using XA).
+
+ - If we have consistent commit order, we can think about optimising commit to
+ need only one fsync (for binlog); lost commits in storage engines can then
+ be recovered from the binlog at crash recovery by re-playing against the
+ engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+ TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
+ could even get it to return also a matching binlog position). Currently,
+ this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
+ engines.
+
+ - In InnoDB, the performance in the presense of hotspots can be improved if
+ we can release row locks early in the commit phase, but this requires that we
+release them in
+ the same order as commits in the binlog to ensure consistency between
+ master and slaves.
+
+ - There was some discussions around Galera [1] synchroneous replication and
+ global transaction ID that it needed consistent commit order among
+ participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+ commit order, and that the architecture described in this worklog can
+ implement such guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+
-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the archtecture. Got to something now that I am satisfied with and
that seems to be able to handle all issues.
Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.
Next step: write up the architecture description.
Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).
-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.
Found and fixed several problems in error handling when writing to binlog.
Removed redundant table map version locking.
Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.
Still to be done:
Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc 2) move the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) move fast part of InnoDB commit to innobase_commit_fast(), removing the
prepare_commit_mutex().
Write up the final design in this worklog.
Evaluate the design to see if we can do better/different.
Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).
-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg
DESCRIPTION:
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).
This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.
The idea is to
- Implement/fix group commit to work properly with the binary log enabled.
- (Optionally) avoid the need to fsync() in the engine, and instead rely on
replaying any lost transactions from the binary log against the engine
during crash recovery.
For background see these articles:
http://kristiannielsen.livejournal.com/12254.html
http://kristiannielsen.livejournal.com/12408.html
http://kristiannielsen.livejournal.com/12553.html
----
Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:
1. Because storage engine handlers have separate transaction log from the
mysql binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).
2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on iternals of the others.
3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.
While currently there is no server guarantee to get same commit order in
engines an binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:
- InnoDB hot backup needs to be able to extract a binlog position that is
consistent with the hot backup to be able to provision a new slave, and
this is impossible without imposing at least partial consistent ordering
between InnoDB and binlog.
- Other backup methods could have similar needs, eg. XtraBackup or
`mysqldump --single-transaction`, to have consistent commit order between
binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
or similar expensive blocking operation. (other backup methods, like LVM
snapshot, don't need consistent commit order, as they can restore
out-of-order commits during crash recovery using XA).
- If we have consistent commit order, we can think about optimising commit to
need only one fsync (for binlog); lost commits in storage engines can then
be recovered from the binlog at crash recovery by re-playing against the
engine from a particular point in the binlog.
- With consistent commit order, we can get better semantics for START
TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
could even get it to return also a matching binlog position). Currently,
this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
engines.
- In InnoDB, the performance in the presense of hotspots can be improved if
we can release row locks early in the commit phase, but this requires that we
release them in
the same order as commits in the binlog to ensure consistency between
master and slaves.
- There was some discussions around Galera [1] synchroneous replication and
global transaction ID that it needed consistent commit order among
participating engines.
- I believe there could be other applications for guaranteed consistent
commit order, and that the architecture described in this worklog can
implement such guarantee with reasonable overhead.
References:
[1] Galera: http://www.codership.com/products/galera_replication
HIGH-LEVEL SPECIFICATION:
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.
So group commit has a parallel part, and a sequential part. So we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.
To do this, we add two new handlerton methods:
int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.
The prepare_ordered() method is called after prepare(). The order of
tranctions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.
The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.
As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
relaying most of the work (eg. fsync() ...) to prepare() and commit().
As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixed the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before
respectively after releasing the prepare_commit_mutex.
In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. For this we need an extra method, since this decision can be done
only after we know that all prepare() and prepare_ordered() calls succeed, and
must be done to know whether to call commit_ordered()/commit(), or do rollback.
The existing method for this is TC_LOG::log_xid(). To make implementing group
commit simpler to implement in a transaction coordinator and more efficient,
we introduce a new method:
void group_log_xid(THD *first_thd);
This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).
This method can make it easier to implement the group commit in TC_LOG, as it
gets directly the list of transactions in the right order. Without it, it
might need to compute such order anyway in a prepare_ordered() method, and the
server has to create this ordered list anyway to implement the order guarantee
for prepare_ordered() and commit_ordered().
This group_log_xid() method also is more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (to not loose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.
The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.
As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.
Since group_log_xid() runs for multiple transactions in a single thread, it
can not do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:
int xid_delayed_error(THD *thd)
The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.
Note in particular that all existing engines (/binlog implementations if they
exist) will work unmodified (and also without any change in group commit
facilities or commit order guaranteed).
Using these new APIs, the work will be to
- In ha_commit_trans(), implement the correct semantics for the three new
calls.
- In XtraDB, use the new commit_ordered() call to remove the
prepare_commit_mutex (and resurrect group commit) without loosing the
consistency with binlog commit order.
- In log.cc (binlog module), implement group_log_xid() to do group commit of
multiple transactions to the binlog with a single shared fsync() call.
-----------------------------------------------------------------------
Some possible alternative for this worklog:
- We could eliminate the group_log_xid() method for a simpler API, at the
cost of extra synchronisation between threads to do in-order
commit_ordered() method calls. This would also allow to call
commit_ordered() in the correct thread context.
- Alternatively, we could eliminate log_xid() and require that all
transaction coordinators implement group_log_xid() instead, again for some
moderate simplification.
- At the moment there is no plugin actually using prepare_ordered(), so, it
could be removed from the design. But it fits in well, is efficient to
implement, and could be useful later (eg. for the requested feature of
releasing locks early in InnoDB).
-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:
- Add statistics about how efficient group commit is (#fsyncs/#commits in
each engine and binlog).
- Implement an XtraDB prepare_ordered() methods that can release row locks
early (Mark Callaghan from Facebook advocates this, but need to determine
exactly how to do this safely).
- Implement a new crash recovery algorithm that uses the consistent commit
ordering to need only fsync() for the binlog. At crash recovery, any
missing transactions in an engine is replayed from the correct point in the
binlog (this point must be stored transactionally inside the engine, as
XtraDB already does today).
- Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
consistent snapshow, with same set of committed and not committed
transactions in all engines, 2) returns a corresponding consistent binlog
position. This should be easy by piggybacking on the synchronisation
implemented for ha_commit_trans().
- Use this in XtraBackup to get consistent binlog position without having to
block all updates with FLUSH TABLES WITH READ LOCK.
LOW-LEVEL DESIGN:
1. Changes for ha_commit_trans()
The gut of the code for commit is in the function ha_commit_trans() (and in
commit_one_phase() which is called from it). This must be extended to use the
new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
1.1 Atomic queue of committing transactions
To keep the right commit order among participants, we put transactions into a
queue. The operations on the queue are non-locking:
- Insert THD at the head of the queue, and return old queue.
THD *enqueue_atomic(THD *thd)
- Fetch (and delete) the whole queue.
THD *atomic_grab_reverse_queue()
These are simple to implement with atomic compare-and-set. Note that there is
no ABA problem [2], as we do not delete individual elements from the queue, we
grab the whole queue and replace it with NULL.
A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order.
The queue is grabbed by the code doing group_log_xid() and commit_ordered()
calls. The queue is passed directly to group_log_xid(), and afterwards
iterated to do individual commit_ordered() calls.
Using a lock-free queue allows prepare_ordered() (for one transaction) to run
in parallel with commit_ordered (in another transaction), increasing potential
parallelism.
The queue is simply a linked list of THD objects, linked through a
THD::next_commit_ordered field. Since we add at the head of the queue, the
list is actually in reverse order, so must be reversed when we grab and delete
it.
The reason that enqueue_atomic() returns the old queue is so that we can check
if an insert goes to the head of the queue. The thread at the head of the
queue will do the sequential part of group commit for everyone.
1.2 Locks
1.2.1 Global LOCK_prepare_ordered
This lock is taken to serialise calls to prepare_ordered(). Note that
effectively, the commit order is decided by the order in which threads obtain
this lock.
1.2.2 Global LOCK_group_commit and COND_group_commit
This lock is used to protect the serial part of group commit. It is taken
around the code where we grab the queue, call group_log_xid() on the queue,
and call commit_ordered() on each element of the queue, to make sure they
happen serialised and in consistent order. It also protects the variable
group_commit_queue_busy, which is used when not using group_log_xid() to delay
running over a new queue until the first queue is completely done.
1.2.3 Global LOCK_commit_ordered
This lock is taken around calls to commit_ordered(), to ensure they happen
serialised.
1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
This lock protects the thd->group_commit_ready variable, as well as the
condition variable used to wake up threads after log_xid() and
commit_ordered() finishes.
1.2.5 Global LOCK_group_commit_queue
This is only used on platforms with no native compare-and-set operations, to
make the queue operations atomic.
1.3 Commit algorithm.
This is the basic algorithm, simplified by
- omitting some error handling
- omitting looping over all handlers when invoking handler methods
- omitting some possible optimisations when not all calls needed (see next
section).
- Omitting the case where no group_log_xid() is used, see below.
---- BEGIN ALGORITHM ----
ht->prepare()
// Call prepare_ordered() and enqueue in correct commit order
lock(LOCK_prepare_ordered)
ht->prepare_ordered()
old_queue= enqueue_atomic(thd)
thd->group_commit_ready= FALSE
is_group_commit_leader= (old_queue == NULL)
unlock(LOCK_prepare_ordered)
if (is_group_commit_leader)
// The first in queue handles group commit for everyone
lock(LOCK_group_commit)
// Wait while queue is busy, see below for when this occurs
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
// Grab and reverse the queue to get correct order of transactions
queue= atomic_grab_reverse_queue()
// This call will set individual error codes in thd->xid_error
// It also sets the cookie for unlog() in thd->xid_cookie
group_log_xid(queue)
lock(LOCK_commit_ordered)
for (other IN queue)
if (!other->xid_error)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
unlock(LOCK_group_commit)
// Now we are done, so wake up all the others.
for (other IN TAIL(queue))
lock(other->LOCK_commit_ordered)
other->group_commit_ready= TRUE
cond_signal(other->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
else
// If not the leader, just wait until leader did the work for us.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
// Finally do any error reporting now that we're back in own thread.
if (thd->xid_error)
xid_delayed_error(thd)
else
ht->commit(thd)
unlog(thd->xid_cookie, thd->xid)
---- END ALGORITHM ----
If the transaction coordinator does not support group_log_xid(), we have to do
things differently. In this case after the serialisation point at
prepare_ordered(), we have to parallelise again when running log_xid()
(otherwise we would loose group commit). But then when log_xid() is done, we
have to serialise again to check for any error and call commit_ordered() in
correct sequence for any transaction where log_xid() did not return error.
The central part of the algorithm in this case (when using log_xid()) is:
---- BEGIN ALGORITHM ----
cookie= log_xid(thd)
error= (cookie == 0)
if (is_group_commit_leader)
// The first to enqueue grabs the queue and runs first.
// But we must wait until a previous queue run is fully done.
lock(LOCK_group_commit)
while (group_commit_queue_busy)
cond_wait(COND_group_commit)
queue= atomic_grab_reverse_queue()
// The queue will be busy until last thread in it is done.
group_commit_queue_busy= TRUE
unlock(LOCK_group_commit)
else
// Not first in queue -> wait for previous one to wake us up.
lock(thd->LOCK_commit_ordered)
while (!thd->group_commit_ready)
cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
unlock(other->LOCK_commit_ordered)
if (!error) // Only if log_xid() was successful
lock(LOCK_commit_ordered)
ht->commit_ordered()
unlock(LOCK_commit_ordered)
// Wake up the next thread, and release queue in last.
next= thd->next_commit_ordered
if (next)
lock(next->LOCK_commit_ordered)
next->group_commit_ready= TRUE
cond_signal(next->COND_commit_ordered)
unlock(next->LOCK_commit_ordered)
else
lock(LOCK_group_commit)
group_commit_queue_busy= FALSE
unlock(LOCK_group_commit)
---- END ALGORITHM ----
There are a number of locks taken in the algorithm, but in the group_log_xid()
case most of them should be uncontended most of the time. The
LOCK_group_commit of course will be contended, as new threads queue up waiting
for the previous group commit (and binlog fsync()) to finish so they can do
the next group commit. This is the whole point of implementing group commit.
The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should be not much
contended as long as handlers follow the intension of having the corresponding
handler calls execute quickly.
The per-thread LOCK_commit_ordered mutexes should not be contended; they are
only used to wake up a sleeping thread.
1.4 Optimisations when not using all three new calls
The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
optional, and if not implemented by a particular handler/transaction
coordinator, we can optimise the algorithm to take advantage of not having to
keep ordering for the missing parts.
If there is no prepare_ordered(), then we need not take the
LOCK_prepare_ordered mutex.
If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
mutex.
If there is no group_log_xid(), then we only need the queue to ensure same
ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
if either of these (or both) are also not present, we do not need to use the
queue at all.
2. Binlog code changes (log.cc)
The bulk of the work needed for the binary log is to extend the code to allow
group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
inside the binlog code for group commit.
The existing code runs most of the write + fsync to the binary lock under the
global LOCK_log mutex, preventing any group commit.
To enable group commit, this code must be split into two parts:
- one part that runs per transaction, re-writing the embedded event positions
for the correct offset, and writing this into the in-memory log cache.
- another part that writes a set of transactions to the disk, and runs
fsync().
Then in group_log_xid(), we can run the first part in a loop over all the
transactions in the passed-in queue, and run the second part only once.
The binlog code also has other code paths that write into the binlog,
eg. non-transactional statements. These have to be adapted also to work with
the new code.
In order to get some group commit facility for these also, we change that part
of the code in a similar way to ha_commit_trans. We keep another,
binlog-internal queue of such non-transactional binlog writes, and such writes
queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
LOCK_log, it loops over the queue for the fast part, and does the slow part
once, then finally wakes up the others in the queue.
In the transactional case in group_log_xid(), before we run the passed-in
queue, we add any members found in the binlog-internal queue. This allows
these non-transactional writes to share the group commit.
However, in the case where it is a non-transactional write that gets the
LOCK_log, the transactional transactions from the ha_commit_trans() queue will
not be able to take part (they will have to wait for their turn to do another
fsync). It seems difficult to cleanly let the binlog code grab the queue from
out of the ha_commit_trans() algorithm. I think the group commit is mostly
useful in transactional workloads anyway (non-transactional engines will loose
data anyway in case of crash, so why fsync() after each transaction?)
3. XtraDB changes (ha_innodb.cc)
The changes needed in XtraDB are comparatively simple, as XtraDB already
implements group commit, it just needs to be enabled with the new
commit_ordered() call.
The existing commit() method already is logically in two parts. The first part
runs under the prepare_commit_mutex() and must be run in same order as binlog
commit. This part needs to be moved to commit_ordered(). The second part runs
after releasing prepare_commit_mutex and does transaction log write+fsync; it
can remain.
Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
XtraDB option to disable it).
There are two asserts that check that the thread running the first part of
XtraDB commit is the same as the thread running the other operations for the
transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also an error reporting with sql_print_error() has to be
delayed until commit() time.
4. Proof-of-concept implementation
There is a proof-of-concept implementation of this architecture, in the form
of a quilt patch series [3].
A quick benchmark was done, with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
transactions against one table.
Without the patch, we get only 25 queries per second.
With the patch, we get 650 queries per second.
5. Open issues/tasks
5.1 XA / other prepare() and commit() call sites.
Check that user-level XA is handled correctly and working. And covered
sufficiently with tests. Also check that any other calls of ha->prepare() and
ha->commit() outside of ha_commit_trans() are handled correctly.
5.2 Testing
This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
6. Alternative implementations
- The binlog code maintains its own extra atomic transaction queue to handle
non-transactional commits in a good way together with transactional (with
respect to group commit). Alternatively, we could ignore this issue and
just give up on group commit for non-transactional statements, for some
code simplifications.
- The binlog code has two ways to prepare end_event and similar, one that
uses stack-allocation, and another for when stack allocation is not
possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
so small that it would make sense to use the same code for both cases.
- Instead of adding extra fields to THD, we could allocate a separate
structure on the thd->mem_root() with the required extra fields (including
the THD pointer). Would seem to require initialising mutexes at every
commit though.
- It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
(should not be hard).
-----------------------------------------------------------------------
References:
[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
[3] https://knielsen-hq.org/maria/patches.mwl116/
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] bzr commit into file:///home/tsk/mprog/src/5.3-mwl89/ branch (timour:2793)
by timour@askmonty.org 01 Jun '10
by timour@askmonty.org 01 Jun '10
01 Jun '10
#At file:///home/tsk/mprog/src/5.3-mwl89/ based on revid:timour@askmonty.org-20100527131347-unr62oupctbp912x
2793 timour(a)askmonty.org 2010-06-01
MWL#89: Cost-based choice between Materialization and IN->EXISTS transformation
Phase 2: Changed the code-generation for subquery materialization to be
performed in runtime memory for each (re)execution, instead of in
statement memory (once per prepared statement).
- Item_in_subselect::setup_engine() no longer wraps materialization related
objects to be created in statement memory.
- Merged subselect_hash_sj_engine::init_permanent and
subselect_hash_sj_engine::init_runtime into subselect_hash_sj_engine::init,
which is called for each (re)execution.
- Fixed deletion of the temp table accordingly.
modified:
sql/item_subselect.cc
sql/item_subselect.h
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-05-27 13:13:47 +0000
+++ b/sql/item_subselect.cc 2010-06-01 11:57:35 +0000
@@ -148,6 +148,7 @@ void Item_in_subselect::cleanup()
Item_subselect::~Item_subselect()
{
delete engine;
+ engine= NULL;
}
Item_subselect::trans_res
@@ -2090,82 +2091,62 @@ void Item_in_subselect::update_used_tabl
bool Item_in_subselect::setup_engine()
{
- subselect_hash_sj_engine *new_engine= NULL;
- bool res= FALSE;
+ subselect_hash_sj_engine *mat_engine= NULL;
+ subselect_single_select_engine *select_engine;
DBUG_ENTER("Item_in_subselect::setup_engine");
+ /*
+ The select (IN=>EXISTS) engine is pre-created already at parse time, and
+ is stored in statment memory (preserved across PS executions).
+ */
+ DBUG_ASSERT(engine->engine_type() == subselect_engine::SINGLE_SELECT_ENGINE);
+ select_engine= (subselect_single_select_engine*) engine;
- if (engine->engine_type() == subselect_engine::SINGLE_SELECT_ENGINE)
- {
- /* Create/initialize objects in permanent memory. */
- subselect_single_select_engine *old_engine;
- Query_arena *arena= thd->stmt_arena, backup;
-
- old_engine= (subselect_single_select_engine*) engine;
-
- if (arena->is_conventional())
- arena= 0;
- else
- thd->set_n_backup_active_arena(arena, &backup);
-
- if (!(new_engine= new subselect_hash_sj_engine(thd, this,
- old_engine)) ||
- new_engine->init_permanent(&old_engine->join->fields_list))
- {
- Item_subselect::trans_res trans_res;
- /*
- If for some reason we cannot use materialization for this IN predicate,
- delete all materialization-related objects, and apply the IN=>EXISTS
- transformation.
- */
- delete new_engine;
- new_engine= NULL;
- exec_method= NOT_TRANSFORMED;
- if (left_expr->cols() == 1)
- trans_res= single_value_in_to_exists_transformer(old_engine->join,
- &eq_creator);
- else
- trans_res= row_value_in_to_exists_transformer(old_engine->join);
- /*
- The IN=>EXISTS transformation above injects new predicates into the
- WHERE and HAVING clauses. Since the subquery was already optimized,
- below we force its reoptimization with the new injected conditions
- by the first call to subselect_single_select_engine::exec().
- This is the only case of lazy subquery optimization in the server.
- */
- DBUG_ASSERT(old_engine->join->optimized);
- old_engine->join->optimized= false;
- res= (trans_res != Item_subselect::RES_OK);
- }
- if (new_engine)
- engine= new_engine;
-
- if (arena)
- thd->restore_active_arena(arena, &backup);
- }
- else
- {
- DBUG_ASSERT(engine->engine_type() == subselect_engine::HASH_SJ_ENGINE);
- new_engine= (subselect_hash_sj_engine*) engine;
- }
+ /* Create/initialize execution objects. */
+ if (!(mat_engine= new subselect_hash_sj_engine(thd, this, select_engine)))
+ DBUG_RETURN(TRUE);
- /* Initilizations done in runtime memory, repeated for each execution. */
- if (new_engine)
+ if (mat_engine->init(&select_engine->join->fields_list))
{
+ Item_subselect::trans_res trans_res;
+ /*
+ If for some reason we cannot use materialization for this IN predicate,
+ delete all materialization-related objects, and apply the IN=>EXISTS
+ transformation.
+ */
+ delete mat_engine;
+ mat_engine= NULL;
+ exec_method= NOT_TRANSFORMED;
+
+ if (left_expr->cols() == 1)
+ trans_res= single_value_in_to_exists_transformer(select_engine->join,
+ &eq_creator);
+ else
+ trans_res= row_value_in_to_exists_transformer(select_engine->join);
/*
- Reset the LIMIT 1 set in Item_exists_subselect::fix_length_and_dec.
- TODO:
- Currently we set the subquery LIMIT to infinity, and this is correct
- because we forbid at parse time LIMIT inside IN subqueries (see
- Item_in_subselect::test_limit). However, once we allow this, here
- we should set the correct limit if given in the query.
+ The IN=>EXISTS transformation above injects new predicates into the
+ WHERE and HAVING clauses. Since the subquery was already optimized,
+ below we force its reoptimization with the new injected conditions
+ by the first call to subselect_single_select_engine::exec().
+ This is the only case of lazy subquery optimization in the server.
*/
- unit->global_parameters->select_limit= NULL;
- if ((res= new_engine->init_runtime()))
- DBUG_RETURN(res);
+ DBUG_ASSERT(select_engine->join->optimized);
+ select_engine->join->optimized= false;
+ DBUG_RETURN(trans_res != Item_subselect::RES_OK);
}
- DBUG_RETURN(res);
+ /*
+ Reset the "LIMIT 1" set in Item_exists_subselect::fix_length_and_dec.
+ TODO:
+ Currently we set the subquery LIMIT to infinity, and this is correct
+ because we forbid at parse time LIMIT inside IN subqueries (see
+ Item_in_subselect::test_limit). However, once we allow this, here
+ we should set the correct limit if given in the query.
+ */
+ unit->global_parameters->select_limit= NULL;
+
+ engine= mat_engine;
+ DBUG_RETURN(FALSE);
}
@@ -3680,14 +3661,14 @@ bitmap_init_memroot(MY_BITMAP *map, uint
@retval FALSE otherwise
*/
-bool subselect_hash_sj_engine::init_permanent(List<Item> *tmp_columns)
+bool subselect_hash_sj_engine::init(List<Item> *tmp_columns)
{
select_union *result_sink;
/* Options to create_tmp_table. */
ulonglong tmp_create_options= thd->options | TMP_TABLE_ALL_COLUMNS;
/* | TMP_TABLE_FORCE_MYISAM; TIMOUR: force MYISAM */
- DBUG_ENTER("subselect_hash_sj_engine::init_permanent");
+ DBUG_ENTER("subselect_hash_sj_engine::init");
if (bitmap_init_memroot(&non_null_key_parts, tmp_columns->elements,
thd->mem_root) ||
@@ -3762,6 +3743,17 @@ bool subselect_hash_sj_engine::init_perm
!(lookup_engine= make_unique_engine()))
DBUG_RETURN(TRUE);
+ /*
+ Repeat name resolution for 'cond' since cond is not part of any
+ clause of the query, and it is not 'fixed' during JOIN::prepare.
+ */
+ if (semi_join_conds && !semi_join_conds->fixed &&
+ semi_join_conds->fix_fields(thd, (Item**)&semi_join_conds))
+ DBUG_RETURN(TRUE);
+ /* Let our engine reuse this query plan for materialization. */
+ materialize_join= materialize_engine->join;
+ materialize_join->change_result(result);
+
DBUG_RETURN(FALSE);
}
@@ -3907,30 +3899,6 @@ subselect_hash_sj_engine::make_unique_en
}
-/**
- Initialize members of the engine that need to be re-initilized at each
- execution.
-
- @retval TRUE if a memory allocation error occurred
- @retval FALSE if success
-*/
-
-bool subselect_hash_sj_engine::init_runtime()
-{
- /*
- Repeat name resolution for 'cond' since cond is not part of any
- clause of the query, and it is not 'fixed' during JOIN::prepare.
- */
- if (semi_join_conds && !semi_join_conds->fixed &&
- semi_join_conds->fix_fields(thd, (Item**)&semi_join_conds))
- return TRUE;
- /* Let our engine reuse this query plan for materialization. */
- materialize_join= materialize_engine->join;
- materialize_join->change_result(result);
- return FALSE;
-}
-
-
subselect_hash_sj_engine::~subselect_hash_sj_engine()
{
delete lookup_engine;
@@ -3967,6 +3935,13 @@ void subselect_hash_sj_engine::cleanup()
count_null_only_columns= 0;
strategy= UNDEFINED;
materialize_engine->cleanup();
+ /*
+ Restore the original Item_in_subselect engine. This engine is created once
+ at parse time and stored across executions, while all other materialization
+ related engines are created and chosen for each execution.
+ */
+ ((Item_in_subselect *) item)->engine= materialize_engine;
+
if (lookup_engine_type == TABLE_SCAN_ENGINE ||
lookup_engine_type == ROWID_MERGE_ENGINE)
{
@@ -3983,6 +3958,9 @@ void subselect_hash_sj_engine::cleanup()
DBUG_ASSERT(lookup_engine->engine_type() == UNIQUESUBQUERY_ENGINE);
lookup_engine->cleanup();
result->cleanup(); /* Resets the temp table as well. */
+ DBUG_ASSERT(tmp_table);
+ free_tmp_table(thd, tmp_table);
+ tmp_table= NULL;
}
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-05-27 13:13:47 +0000
+++ b/sql/item_subselect.h 2010-06-01 11:57:35 +0000
@@ -802,8 +802,7 @@ public:
}
~subselect_hash_sj_engine();
- bool init_permanent(List<Item> *tmp_columns);
- bool init_runtime();
+ bool init(List<Item> *tmp_columns);
void cleanup();
int prepare();
int exec();
1
0
[Maria-developers] Rev 2789: Subquery cache for pre-review (MWL#66) in file:///home/bell/maria/bzr/work-maria-5.3-scache2/
by sanja@askmonty.org 31 May '10
by sanja@askmonty.org 31 May '10
31 May '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache2/
------------------------------------------------------------
revno: 2789
revision-id: sanja(a)askmonty.org-20100531212554-oal32d5v360l6cul
parent: sergii(a)pisem.net-20100510134608-oyi2vznyghgcrt0x
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-scache2
timestamp: Tue 2010-06-01 00:25:54 +0300
message:
Subquery cache for pre-review (MWL#66)
=== modified file 'libmysqld/Makefile.am'
--- a/libmysqld/Makefile.am 2010-03-20 12:01:47 +0000
+++ b/libmysqld/Makefile.am 2010-05-31 21:25:54 +0000
@@ -80,7 +80,8 @@
sql_tablespace.cc \
rpl_injector.cc my_user.c partition_info.cc \
sql_servers.cc event_parse_data.cc opt_table_elimination.cc \
- multi_range_read.cc opt_index_cond_pushdown.cc
+ multi_range_read.cc opt_index_cond_pushdown.cc \
+ sql_subquery_cache.cc
libmysqld_int_a_SOURCES= $(libmysqld_sources)
nodist_libmysqld_int_a_SOURCES= $(libmysqlsources) $(sqlsources)
=== modified file 'mysql-test/r/index_merge_myisam.result'
--- a/mysql-test/r/index_merge_myisam.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/index_merge_myisam.result 2010-05-31 21:25:54 +0000
@@ -1419,19 +1419,19 @@
#
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='index_merge=off,index_merge_union=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='index_merge_union=on';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,index_merge_sort_union=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=off,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=off,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=4;
ERROR 42000: Variable 'optimizer_switch' can't be set to the value of '4'
set optimizer_switch=NULL;
@@ -1458,21 +1458,21 @@
set optimizer_switch='index_merge=off,index_merge_union=off,default';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
select @@global.optimizer_switch;
@@global.optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set @@global.optimizer_switch=default;
select @@global.optimizer_switch;
@@global.optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
#
# Check index_merge's @@optimizer_switch flags
#
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, c int, filler char(100),
@@ -1582,5 +1582,5 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
drop table t0, t1;
=== modified file 'mysql-test/r/myisam_mrr.result'
--- a/mysql-test/r/myisam_mrr.result 2010-03-11 21:43:31 +0000
+++ b/mysql-test/r/myisam_mrr.result 2010-05-31 21:25:54 +0000
@@ -394,7 +394,7 @@
# - engine_condition_pushdown does not affect ICP
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, key(a));
=== added file 'mysql-test/r/subquery_cache.result'
--- a/mysql-test/r/subquery_cache.result 1970-01-01 00:00:00 +0000
+++ b/mysql-test/r/subquery_cache.result 2010-05-31 21:25:54 +0000
@@ -0,0 +1,591 @@
+set optimizer_switch='subquery_cache=on';
+flush status;
+create table t1 (a int, b int);
+insert into t1 values (1,2),(3,4),(1,2),(3,4),(3,4),(4,5),(4,5),(5,6),(5,6),(4,5);
+create table t2 (c int, d int);
+insert into t2 values (2,3),(3,4),(5,6);
+#single value subquery test
+select a, (select d from t2 where b=c) + 1 from t1;
+a (select d from t2 where b=c) + 1
+1 4
+3 NULL
+1 4
+3 NULL
+3 NULL
+4 7
+4 7
+5 NULL
+5 NULL
+4 7
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 6
+Subquery_cache_miss 4
+#single value subquery test (PS)
+prepare stmt1 from 'select a, (select d from t2 where b=c) + 1 from t1';
+execute stmt1;
+a (select d from t2 where b=c) + 1
+1 4
+3 NULL
+1 4
+3 NULL
+3 NULL
+4 7
+4 7
+5 NULL
+5 NULL
+4 7
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 12
+Subquery_cache_miss 8
+execute stmt1;
+a (select d from t2 where b=c) + 1
+1 4
+3 NULL
+1 4
+3 NULL
+3 NULL
+4 7
+4 7
+5 NULL
+5 NULL
+4 7
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 18
+Subquery_cache_miss 12
+deallocate prepare stmt1;
+#single value subquery test (SP)
+CREATE PROCEDURE p1() select a, (select d from t2 where b=c) + 1 from t1;
+call p1;
+a (select d from t2 where b=c) + 1
+1 4
+3 NULL
+1 4
+3 NULL
+3 NULL
+4 7
+4 7
+5 NULL
+5 NULL
+4 7
+call p1;
+a (select d from t2 where b=c) + 1
+1 4
+3 NULL
+1 4
+3 NULL
+3 NULL
+4 7
+4 7
+5 NULL
+5 NULL
+4 7
+drop procedure p1;
+#IN subquery test
+flush status;
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 0
+Subquery_cache_miss 0
+select a, b , b in (select d from t2) as SUBS from t1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 6
+Subquery_cache_miss 4
+insert into t1 values (7,8),(9,NULL);
+select a, b , b in (select d from t2) as SUBS from t1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+7 8 0
+9 NULL NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 12
+Subquery_cache_miss 10
+insert into t2 values (8,NULL);
+select a, b , b in (select d from t2) as SUBS from t1;
+a b SUBS
+1 2 NULL
+3 4 1
+1 2 NULL
+3 4 1
+3 4 1
+4 5 NULL
+4 5 NULL
+5 6 1
+5 6 1
+4 5 NULL
+7 8 NULL
+9 NULL NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 18
+Subquery_cache_miss 16
+#IN subquery tesy (PS)
+delete from t1 where a > 6;
+delete from t2 where c > 6;
+prepare stmt1 from 'select a, b , b in (select d from t2) as SUBS from t1';
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 24
+Subquery_cache_miss 20
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 30
+Subquery_cache_miss 24
+insert into t1 values (7,8),(9,NULL);
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL NULL
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 36
+Subquery_cache_miss 30
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL NULL
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 42
+Subquery_cache_miss 36
+insert into t2 values (8,NULL);
+execute stmt1;
+a b SUBS
+1 2 NULL
+3 4 1
+1 2 NULL
+3 4 1
+3 4 1
+4 5 NULL
+4 5 NULL
+5 6 1
+5 6 1
+4 5 NULL
+9 NULL NULL
+7 8 NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 48
+Subquery_cache_miss 42
+execute stmt1;
+a b SUBS
+1 2 NULL
+3 4 1
+1 2 NULL
+3 4 1
+3 4 1
+4 5 NULL
+4 5 NULL
+5 6 1
+5 6 1
+4 5 NULL
+9 NULL NULL
+7 8 NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 54
+Subquery_cache_miss 48
+deallocate prepare stmt1;
+#IN subquery tesy (SP)
+delete from t1 where a > 6;
+delete from t2 where c > 6;
+CREATE PROCEDURE p1() select a, b , b in (select d from t2) as SUBS from t1;
+call p1();
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 60
+Subquery_cache_miss 52
+call p1();
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 66
+Subquery_cache_miss 56
+insert into t1 values (7,8),(9,NULL);
+call p1();
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL NULL
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 72
+Subquery_cache_miss 62
+call p1();
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL NULL
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 78
+Subquery_cache_miss 68
+insert into t2 values (8,NULL);
+call p1();
+a b SUBS
+1 2 NULL
+3 4 1
+1 2 NULL
+3 4 1
+3 4 1
+4 5 NULL
+4 5 NULL
+5 6 1
+5 6 1
+4 5 NULL
+9 NULL NULL
+7 8 NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 84
+Subquery_cache_miss 74
+call p1();
+a b SUBS
+1 2 NULL
+3 4 1
+1 2 NULL
+3 4 1
+3 4 1
+4 5 NULL
+4 5 NULL
+5 6 1
+5 6 1
+4 5 NULL
+9 NULL NULL
+7 8 NULL
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 90
+Subquery_cache_miss 80
+drop procedure p1;
+# test of simple exists
+select a, b , exists (select * from t2 where b=d) as SUBS from t1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL 0
+7 8 0
+# test of prepared statement exists
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 96
+Subquery_cache_miss 86
+prepare stmt1 from 'select a, b , exists (select * from t2 where b=d) as SUBS from t1';
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL 0
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 102
+Subquery_cache_miss 92
+execute stmt1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL 0
+7 8 0
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 108
+Subquery_cache_miss 98
+deallocate prepare stmt1;
+# test of stored procedure exists
+CREATE PROCEDURE p1() select a, b , exists (select * from t2 where b=d) as SUBS from t1;
+call p1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL 0
+7 8 0
+call p1;
+a b SUBS
+1 2 0
+3 4 1
+1 2 0
+3 4 1
+3 4 1
+4 5 0
+4 5 0
+5 6 1
+5 6 1
+4 5 0
+9 NULL 0
+7 8 0
+drop procedure p1;
+#clean up
+drop table t1,t2;
+test different types
+#int
+CREATE TABLE t1 ( a int, b int);
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+a
+1
+3
+DROP TABLE t1;
+#char
+CREATE TABLE t1 ( a char(1), b char (1));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+a
+1
+3
+DROP TABLE t1;
+#decimal
+CREATE TABLE t1 ( a decimal(3,1), b decimal(3,1));
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+a
+1.0
+3.0
+DROP TABLE t1;
+#date
+CREATE TABLE t1 ( a date, b date);
+INSERT INTO t1 VALUES('1000-01-01','1000-01-01'),('2000-02-01','2000-02-01'),('3000-03-03','3000-03-03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-01');
+a
+1000-01-01
+3000-03-03
+DROP TABLE t1;
+#datetime
+CREATE TABLE t1 ( a datetime, b datetime);
+INSERT INTO t1 VALUES('1000-01-01 01:01:01','1000-01-01 01:01:01'),('2000-02-02 02:02:02','2000-02-02 02:02:02'),('3000-03-03 03:03:03','3000-03-03 03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-02 02:02:02');
+a
+1000-01-01 01:01:01
+3000-03-03 03:03:03
+DROP TABLE t1;
+#time
+CREATE TABLE t1 ( a time, b time);
+INSERT INTO t1 VALUES('01:01:01','01:01:01'),('02:02:02','02:02:02'),('03:03:03','03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '02:02:02');
+a
+01:01:01
+03:03:03
+DROP TABLE t1;
+#timestamp
+CREATE TABLE t1 ( a timestamp, b timestamp);
+INSERT INTO t1 VALUES('2000-02-02 01:01:01','2000-02-02 01:01:01'),('2000-02-02 02:02:02','2000-02-02 02:02:02'),('2000-02-02 03:03:03','2000-02-02 03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-02 02:02:02');
+a
+2000-02-02 01:01:01
+2000-02-02 03:03:03
+DROP TABLE t1;
+#bit
+CREATE TABLE t1 ( a bit(20), b bit(20));
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a+0 FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+a+0
+1
+3
+DROP TABLE t1;
+#enum
+CREATE TABLE t1 ( a enum('1','2','3'), b enum('1','2','3'));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+a
+1
+3
+DROP TABLE t1;
+#set
+CREATE TABLE t1 ( a set('1','2','3'), b set('1','2','3'));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+a
+1
+3
+DROP TABLE t1;
+#blob
+CREATE TABLE t1 ( a blob, b blob);
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+a
+1
+3
+DROP TABLE t1;
+#geometry
+CREATE TABLE t1 ( a geometry, b geometry);
+INSERT INTO t1 VALUES(POINT(1,1),POINT(1,1)),(POINT(2,2),POINT(2,2)),(POINT(3,3),POINT(3,3));
+SELECT astext(a) FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = POINT(2,2));
+astext(a)
+POINT(1 1)
+POINT(3 3)
+DROP TABLE t1;
+#uncacheable queries test (random and side effect)
+flush status;
+CREATE TABLE t1 (a int);
+INSERT INTO t1 VALUES (2), (4), (1), (3);
+select a, a in (select a from t1) from t1 as ext;
+a a in (select a from t1)
+2 1
+4 1
+1 1
+3 1
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 0
+Subquery_cache_miss 4
+select a, a in (select a from t1 where -1 < rand()) from t1 as ext;
+a a in (select a from t1 where -1 < rand())
+2 1
+4 1
+1 1
+3 1
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 0
+Subquery_cache_miss 4
+select a, a in (select a from t1 where -1 < benchmark(a,100)) from t1 as ext;
+a a in (select a from t1 where -1 < benchmark(a,100))
+2 1
+4 1
+1 1
+3 1
+show status like "subquery_cache%";
+Variable_name Value
+Subquery_cache_hit 0
+Subquery_cache_miss 4
+drop table t1;
+set optimizer_switch='subquery_cache=default';
=== modified file 'mysql-test/r/subselect3.result'
--- a/mysql-test/r/subselect3.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect3.result 2010-05-31 21:25:54 +0000
@@ -105,6 +105,7 @@
Handler_read_rnd_next 5
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
oref a Z
@@ -123,6 +124,7 @@
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
Z
No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
create table t1 (a int, b int, primary key (a));
insert into t1 values (1,1), (3,1),(100,1);
=== modified file 'mysql-test/r/subselect3_jcl6.result'
--- a/mysql-test/r/subselect3_jcl6.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect3_jcl6.result 2010-05-31 21:25:54 +0000
@@ -109,6 +109,7 @@
Handler_read_rnd_next 5
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
oref a Z
@@ -127,6 +128,7 @@
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
Z
No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
create table t1 (a int, b int, primary key (a));
insert into t1 values (1,1), (3,1),(100,1);
=== modified file 'mysql-test/r/subselect_no_mat.result'
--- a/mysql-test/r/subselect_no_mat.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_mat.result 2010-05-31 21:25:54 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='materialization=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_no_opts.result'
--- a/mysql-test/r/subselect_no_opts.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_opts.result 2010-05-31 21:25:54 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='materialization=off,semijoin=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_no_semijoin.result'
--- a/mysql-test/r/subselect_no_semijoin.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_semijoin.result 2010-05-31 21:25:54 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='semijoin=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_sj.result'
--- a/mysql-test/r/subselect_sj.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect_sj.result 2010-05-31 21:25:54 +0000
@@ -202,39 +202,39 @@
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
drop table t0, t1, t2;
drop table t10, t11, t12;
=== modified file 'mysql-test/r/subselect_sj_jcl6.result'
--- a/mysql-test/r/subselect_sj_jcl6.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect_sj_jcl6.result 2010-05-31 21:25:54 +0000
@@ -206,39 +206,39 @@
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
drop table t0, t1, t2;
drop table t10, t11, t12;
=== added file 'mysql-test/t/subquery_cache.test'
--- a/mysql-test/t/subquery_cache.test 1970-01-01 00:00:00 +0000
+++ b/mysql-test/t/subquery_cache.test 2010-05-31 21:25:54 +0000
@@ -0,0 +1,204 @@
+
+set optimizer_switch='subquery_cache=on';
+flush status;
+
+create table t1 (a int, b int);
+insert into t1 values (1,2),(3,4),(1,2),(3,4),(3,4),(4,5),(4,5),(5,6),(5,6),(4,5);
+create table t2 (c int, d int);
+insert into t2 values (2,3),(3,4),(5,6);
+
+--echo #single value subquery test
+select a, (select d from t2 where b=c) + 1 from t1;
+
+show status like "subquery_cache%";
+
+--echo #single value subquery test (PS)
+prepare stmt1 from 'select a, (select d from t2 where b=c) + 1 from t1';
+execute stmt1;
+show status like "subquery_cache%";
+execute stmt1;
+show status like "subquery_cache%";
+deallocate prepare stmt1;
+
+--echo #single value subquery test (SP)
+CREATE PROCEDURE p1() select a, (select d from t2 where b=c) + 1 from t1;
+
+call p1;
+call p1;
+
+drop procedure p1;
+
+--echo #IN subquery test
+flush status;
+
+show status like "subquery_cache%";
+select a, b , b in (select d from t2) as SUBS from t1;
+show status like "subquery_cache%";
+
+insert into t1 values (7,8),(9,NULL);
+select a, b , b in (select d from t2) as SUBS from t1;
+show status like "subquery_cache%";
+
+insert into t2 values (8,NULL);
+select a, b , b in (select d from t2) as SUBS from t1;
+show status like "subquery_cache%";
+
+--echo #IN subquery tesy (PS)
+delete from t1 where a > 6;
+delete from t2 where c > 6;
+
+prepare stmt1 from 'select a, b , b in (select d from t2) as SUBS from t1';
+execute stmt1;
+show status like "subquery_cache%";
+execute stmt1;
+show status like "subquery_cache%";
+
+insert into t1 values (7,8),(9,NULL);
+execute stmt1;
+show status like "subquery_cache%";
+execute stmt1;
+show status like "subquery_cache%";
+
+insert into t2 values (8,NULL);
+execute stmt1;
+show status like "subquery_cache%";
+execute stmt1;
+show status like "subquery_cache%";
+
+deallocate prepare stmt1;
+
+
+--echo #IN subquery tesy (SP)
+delete from t1 where a > 6;
+delete from t2 where c > 6;
+
+CREATE PROCEDURE p1() select a, b , b in (select d from t2) as SUBS from t1;
+
+call p1();
+show status like "subquery_cache%";
+call p1();
+show status like "subquery_cache%";
+
+insert into t1 values (7,8),(9,NULL);
+call p1();
+show status like "subquery_cache%";
+call p1();
+show status like "subquery_cache%";
+
+insert into t2 values (8,NULL);
+call p1();
+show status like "subquery_cache%";
+call p1();
+show status like "subquery_cache%";
+
+drop procedure p1;
+
+
+--echo # test of simple exists
+select a, b , exists (select * from t2 where b=d) as SUBS from t1;
+
+--echo # test of prepared statement exists
+show status like "subquery_cache%";
+prepare stmt1 from 'select a, b , exists (select * from t2 where b=d) as SUBS from t1';
+execute stmt1;
+show status like "subquery_cache%";
+execute stmt1;
+show status like "subquery_cache%";
+deallocate prepare stmt1;
+
+--echo # test of stored procedure exists
+CREATE PROCEDURE p1() select a, b , exists (select * from t2 where b=d) as SUBS from t1;
+call p1;
+call p1;
+drop procedure p1;
+
+--echo #clean up
+drop table t1,t2;
+
+--echo test different types
+--echo #int
+CREATE TABLE t1 ( a int, b int);
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+DROP TABLE t1;
+
+--echo #char
+CREATE TABLE t1 ( a char(1), b char (1));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+DROP TABLE t1;
+
+--echo #decimal
+CREATE TABLE t1 ( a decimal(3,1), b decimal(3,1));
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+DROP TABLE t1;
+
+--echo #date
+CREATE TABLE t1 ( a date, b date);
+INSERT INTO t1 VALUES('1000-01-01','1000-01-01'),('2000-02-01','2000-02-01'),('3000-03-03','3000-03-03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-01');
+DROP TABLE t1;
+
+--echo #datetime
+CREATE TABLE t1 ( a datetime, b datetime);
+INSERT INTO t1 VALUES('1000-01-01 01:01:01','1000-01-01 01:01:01'),('2000-02-02 02:02:02','2000-02-02 02:02:02'),('3000-03-03 03:03:03','3000-03-03 03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-02 02:02:02');
+DROP TABLE t1;
+
+--echo #time
+CREATE TABLE t1 ( a time, b time);
+INSERT INTO t1 VALUES('01:01:01','01:01:01'),('02:02:02','02:02:02'),('03:03:03','03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '02:02:02');
+DROP TABLE t1;
+
+--echo #timestamp
+CREATE TABLE t1 ( a timestamp, b timestamp);
+INSERT INTO t1 VALUES('2000-02-02 01:01:01','2000-02-02 01:01:01'),('2000-02-02 02:02:02','2000-02-02 02:02:02'),('2000-02-02 03:03:03','2000-02-02 03:03:03');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2000-02-02 02:02:02');
+DROP TABLE t1;
+
+--echo #bit
+CREATE TABLE t1 ( a bit(20), b bit(20));
+INSERT INTO t1 VALUES(1,1),(2,2),(3,3);
+SELECT a+0 FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = 2);
+DROP TABLE t1;
+
+--echo #enum
+CREATE TABLE t1 ( a enum('1','2','3'), b enum('1','2','3'));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+DROP TABLE t1;
+
+--echo #set
+CREATE TABLE t1 ( a set('1','2','3'), b set('1','2','3'));
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+DROP TABLE t1;
+
+--echo #blob
+CREATE TABLE t1 ( a blob, b blob);
+INSERT INTO t1 VALUES('1','1'),('2','2'),('3','3');
+SELECT a FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = '2');
+DROP TABLE t1;
+
+--echo #geometry
+CREATE TABLE t1 ( a geometry, b geometry);
+INSERT INTO t1 VALUES(POINT(1,1),POINT(1,1)),(POINT(2,2),POINT(2,2)),(POINT(3,3),POINT(3,3));
+SELECT astext(a) FROM t1 WHERE NOT a IN (SELECT a FROM t1 WHERE b = POINT(2,2));
+DROP TABLE t1;
+
+
+--echo #uncacheable queries test (random and side effect)
+flush status;
+CREATE TABLE t1 (a int);
+INSERT INTO t1 VALUES (2), (4), (1), (3);
+select a, a in (select a from t1) from t1 as ext;
+show status like "subquery_cache%";
+select a, a in (select a from t1 where -1 < rand()) from t1 as ext;
+show status like "subquery_cache%";
+select a, a in (select a from t1 where -1 < benchmark(a,100)) from t1 as ext;
+show status like "subquery_cache%";
+drop table t1;
+
+set optimizer_switch='subquery_cache=default';
=== modified file 'mysql-test/t/subselect3.test'
--- a/mysql-test/t/subselect3.test 2010-03-20 12:01:47 +0000
+++ b/mysql-test/t/subselect3.test 2010-05-31 21:25:54 +0000
@@ -98,10 +98,12 @@
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
show status like '%Handler_read%';
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
=== modified file 'sql/CMakeLists.txt'
--- a/sql/CMakeLists.txt 2010-03-20 12:01:47 +0000
+++ b/sql/CMakeLists.txt 2010-05-31 21:25:54 +0000
@@ -78,7 +78,7 @@
rpl_rli.cc rpl_mi.cc sql_servers.cc
sql_connect.cc scheduler.cc
sql_profile.cc event_parse_data.cc opt_table_elimination.cc
- ds_mrr.cc
+ ds_mrr.cc sql_subquery_cache.cc
${PROJECT_SOURCE_DIR}/sql/sql_yacc.cc
${PROJECT_SOURCE_DIR}/sql/sql_yacc.h
${PROJECT_SOURCE_DIR}/include/mysqld_error.h
=== modified file 'sql/Makefile.am'
--- a/sql/Makefile.am 2010-03-20 12:01:47 +0000
+++ b/sql/Makefile.am 2010-05-31 21:25:54 +0000
@@ -80,7 +80,7 @@
event_data_objects.h event_scheduler.h \
sql_partition.h partition_info.h partition_element.h \
contributors.h sql_servers.h \
- multi_range_read.h
+ multi_range_read.h sql_subquery_cache.h
mysqld_SOURCES = sql_lex.cc sql_handler.cc sql_partition.cc \
item.cc item_sum.cc item_buff.cc item_func.cc \
@@ -130,7 +130,7 @@
sql_servers.cc event_parse_data.cc \
opt_table_elimination.cc \
multi_range_read.cc \
- opt_index_cond_pushdown.cc
+ opt_index_cond_pushdown.cc sql_subquery_cache.cc
nodist_mysqld_SOURCES = mini_client_errors.c pack.c client.c my_time.c my_user.c
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item.cc 2010-05-31 21:25:54 +0000
@@ -28,6 +28,9 @@
const String my_null_string("NULL", 4, default_charset_info);
+static int save_field_in_field(Field *from,my_bool * null_value,
+ Field *to, bool no_conversions);
+
/****************************************************************************/
/* Hybrid_type_traits {_real} */
@@ -2273,6 +2276,13 @@
str->append(str_value);
}
+void Item_bool_cache::print(String *str, enum_query_type query_type)
+{
+ if (null_value)
+ str->append("NULL", 4);
+ else
+ Item_int::print(str, query_type);
+}
Item_uint::Item_uint(const char *str_arg, uint length):
Item_int(str_arg, length)
@@ -3646,12 +3656,17 @@
resolved_item->db_name : "");
const char *table_name= (resolved_item->table_name ?
resolved_item->table_name : "");
+ DBUG_ENTER("mark_as_dependent");
+ DBUG_PRINT("enter", ("Field '%s.%s.%s in select %d resolved in %d",
+ db_name, table_name,
+ resolved_item->field_name, current->select_number,
+ last->select_number));
/* store pointer on SELECT_LEX from which item is dependent */
if (mark_item)
mark_item->depended_from= last;
if (current->mark_as_dependent(thd, last, /** resolved_item psergey-thu
**/mark_item))
- return TRUE;
+ DBUG_RETURN(TRUE);
if (thd->lex->describe & DESCRIBE_EXTENDED)
{
push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_NOTE,
@@ -3661,7 +3676,7 @@
resolved_item->field_name,
current->select_number, last->select_number);
}
- return FALSE;
+ DBUG_RETURN(FALSE);
}
@@ -3698,6 +3713,7 @@
resolving)
*/
SELECT_LEX *previous_select= current_sel;
+
for (; previous_select->outer_select() != last_select;
previous_select= previous_select->outer_select())
{
@@ -3726,6 +3742,7 @@
mark_as_dependent(thd, last_select, current_sel, resolved_item,
dependent);
}
+ return;
}
@@ -4098,6 +4115,9 @@
((ref_type == REF_ITEM ||
ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) : 0));
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
return 0;
}
}
@@ -4113,7 +4133,9 @@
((ref_type == REF_ITEM || ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
0));
-
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
A reference to a view field had been found and we
substituted it instead of this Item (find_field_in_tables
@@ -4215,6 +4237,10 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, rf,
rf);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
+
return 0;
}
else
@@ -4222,6 +4248,9 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex,
this, (Item_ident*)*reference);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
if (last_checked_context->select_lex->having_fix_field)
{
Item_ref *rf;
@@ -5082,39 +5111,48 @@
/**
+ Saves one Fields of an Item of in other Field
+
+ @param from Field to copy value from
+ @param null_value reference on item null_value to set it if it is needed
+ @param to Field to cope value to
+ @param no_conversions how to deal with NULL value (see
+ set_field_to_null_with_conversions())
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
+static int save_field_in_field(Field *from, my_bool *null_value,
+ Field *to, bool no_conversions)
+{
+ int res;
+ if (from->is_null())
+ {
+ (*null_value)= 1;
+ res= set_field_to_null_with_conversions(to, no_conversions);
+ }
+ else
+ {
+ to->set_notnull();
+ res= field_conv(to, from);
+ (*null_value)= 0;
+ }
+ return res;
+}
+
+/**
Set a field's value from a item.
*/
void Item_field::save_org_in_field(Field *to)
{
- if (field->is_null())
- {
- null_value=1;
- set_field_to_null_with_conversions(to, 1);
- }
- else
- {
- to->set_notnull();
- field_conv(to,field);
- null_value=0;
- }
+ save_field_in_field(field, &null_value, to, TRUE);
}
int Item_field::save_in_field(Field *to, bool no_conversions)
{
- int res;
- if (result_field->is_null())
- {
- null_value=1;
- res= set_field_to_null_with_conversions(to, no_conversions);
- }
- else
- {
- to->set_notnull();
- res= field_conv(to,result_field);
- null_value=0;
- }
- return res;
+ return save_field_in_field(result_field, &null_value, to, no_conversions);
}
@@ -5973,6 +6011,9 @@
refer_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
0));
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
view reference found, we substituted it instead of this
Item, so can quit
@@ -6023,6 +6064,9 @@
thd->change_item_tree(reference, fld);
mark_as_dependent(thd, last_checked_context->select_lex,
thd->lex->current_select, fld, fld);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6046,6 +6090,9 @@
DBUG_ASSERT(*ref && (*ref)->fixed);
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, this, this);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ ref);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6312,7 +6359,8 @@
int Item_ref::save_in_field(Field *to, bool no_conversions)
{
int res;
- DBUG_ASSERT(!result_field);
+ if (result_field)
+ return save_field_in_field(result_field, &null_value, to, no_conversions);
res= (*ref)->save_in_field(to, no_conversions);
null_value= (*ref)->null_value;
return res;
=== modified file 'sql/item.h'
--- a/sql/item.h 2010-03-20 12:01:47 +0000
+++ b/sql/item.h 2010-05-31 21:25:54 +0000
@@ -1922,8 +1922,31 @@
virtual void print(String *str, enum_query_type query_type);
Item_num *neg ();
uint decimal_precision() const { return max_length; }
- bool check_partition_func_processor(uchar *bool_arg) { return FALSE;}
- bool check_vcol_func_processor(uchar *arg) { return FALSE;}
+};
+
+
+/**
+ Item represent TRUE/FALSE/NULL for subquery values
+*/
+
+class Item_bool_cache: public Item_int
+{
+public:
+ Item_bool_cache(): Item_int(0, 1)
+ {
+ unsigned_flag= maybe_null= null_value= TRUE;
+ name= (char *)"bool chache";
+ }
+ Item_bool_cache(my_bool val, my_bool null): Item_int(val, 1)
+ {
+ unsigned_flag= maybe_null= TRUE;
+ null_value= null;
+ name= (char *)"bool chache";
+ }
+ Item *clone_item() { return new Item_bool_cache(value, null_value); }
+ uint decimal_precision() const { return 1; }
+ virtual void print(String *str, enum_query_type query_type);
+ void set(my_bool val, my_bool null) {value= test(val); null_value= null;}
};
@@ -3146,7 +3169,8 @@
example(0), used_table_map(0), cached_field(0), cached_field_type(MYSQL_TYPE_STRING),
value_cached(0)
{
- fixed= 1;
+ fixed= 1;
+ maybe_null= 1;
null_value= 1;
}
Item_cache(enum_field_types field_type_arg):
@@ -3154,6 +3178,7 @@
value_cached(0)
{
fixed= 1;
+ maybe_null= 1;
null_value= 1;
}
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item_cmpfunc.cc 2010-05-31 21:25:54 +0000
@@ -1736,6 +1736,15 @@
used_tables_cache|= args[1]->used_tables();
not_null_tables_cache|= args[1]->not_null_tables();
const_item_cache&= args[1]->const_item();
+ DBUG_ASSERT(scache == NULL);
+ if (args[0]->cols() ==1 &&
+ thd->variables.optimizer_switch & OPTIMIZER_SWITCH_SUBQUERY_CACHE &&
+ !(sub->engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ sub->depends_on.push_front((Item**)&cache);
+ scache= new Subquery_cache_tmptable(thd, sub->depends_on, &result);
+ }
fixed= 1;
return FALSE;
}
@@ -1744,10 +1753,26 @@
longlong Item_in_optimizer::val_int()
{
bool tmp;
+ DBUG_ENTER("Item_in_optimizer::val_int");
+
DBUG_ASSERT(fixed == 1);
cache->store(args[0]);
cache->cache_value();
-
+
+ /* check if result is in the cache */
+ if (scache)
+ {
+ Subquery_cache_tmptable::result res;
+ Item *cached_value;
+ res= scache->check_value(&cached_value);
+ if (res == Subquery_cache_tmptable::HIT)
+ {
+ tmp= cached_value->val_int();
+ null_value= cached_value->null_value;
+ DBUG_RETURN(tmp);
+ }
+ }
+
if (cache->null_value)
{
/*
@@ -1818,11 +1843,18 @@
for (uint i= 0; i < ncols; i++)
item_subs->set_cond_guard_var(i, TRUE);
}
- return 0;
+ DBUG_RETURN(0);
}
tmp= args[1]->val_bool_result();
null_value= args[1]->null_value;
- return tmp;
+
+ /* put result in the cache */
+ if (scache)
+ {
+ result.set(tmp, null_value);
+ scache->put_value(&result);
+ }
+ DBUG_RETURN(tmp);
}
@@ -1839,6 +1871,11 @@
Item_bool_func::cleanup();
if (!save_cache)
cache= 0;
+ if (scache)
+ {
+ delete scache;
+ scache= 0;
+ }
DBUG_VOID_RETURN;
}
=== modified file 'sql/item_cmpfunc.h'
--- a/sql/item_cmpfunc.h 2010-03-20 12:01:47 +0000
+++ b/sql/item_cmpfunc.h 2010-05-31 21:25:54 +0000
@@ -215,6 +215,7 @@
class Item_cache;
+class Subquery_cache;
#define UNKNOWN ((my_bool)-1)
@@ -237,6 +238,10 @@
{
protected:
Item_cache *cache;
+ /* Subquery cache */
+ Subquery_cache *scache;
+ /* result representation for the subquery cache */
+ Item_bool_cache result;
bool save_cache;
/*
Stores the value of "NULL IN (SELECT ...)" for uncorrelated subqueries:
@@ -247,7 +252,7 @@
my_bool result_for_null_param;
public:
Item_in_optimizer(Item *a, Item_in_subselect *b):
- Item_bool_func(a, my_reinterpret_cast(Item *)(b)), cache(0),
+ Item_bool_func(a, my_reinterpret_cast(Item *)(b)), cache(0), scache(NULL),
save_cache(0), result_for_null_param(UNKNOWN)
{}
bool fix_fields(THD *, Item **);
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-03-29 14:04:35 +0000
+++ b/sql/item_subselect.cc 2010-05-31 21:25:54 +0000
@@ -34,11 +34,10 @@
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
- engine(0), old_engine(0), used_tables_cache(0), have_to_be_excluded(0),
- const_item_cache(1),
- inside_first_fix_fields(0), done_first_fix_fields(FALSE),
- eliminated(FALSE),
- engine_changed(0), changed(0), is_correlated(FALSE)
+ engine(0), old_engine(0), scache(0), used_tables_cache(0),
+ have_to_be_excluded(0), const_item_cache(1), inside_first_fix_fields(0),
+ done_first_fix_fields(FALSE), eliminated(FALSE), engine_changed(0),
+ changed(0), is_correlated(FALSE)
{
with_subselect= 1;
reset();
@@ -116,6 +115,12 @@
}
if (engine)
engine->cleanup();
+ depends_on.empty();
+ if (scache)
+ {
+ delete scache;
+ scache= 0;
+ }
reset();
value_assigned= 0;
DBUG_VOID_RETURN;
@@ -148,6 +153,8 @@
Item_subselect::~Item_subselect()
{
delete engine;
+ if (scache)
+ delete scache;
}
Item_subselect::trans_res
@@ -746,9 +753,22 @@
void Item_singlerow_subselect::fix_length_and_dec()
{
+ DBUG_ENTER("Item_singlerow_subselect::fix_length_and_dec");
if ((max_columns= engine->cols()) == 1)
{
+ DBUG_PRINT("info", ("one, elements: %u flag %u",
+ (uint)depends_on.elements,
+ (uint)test(thd->variables.optimizer_switch & OPTIMIZER_SWITCH_SUBQUERY_CACHE)));
engine->fix_length_and_dec(row= &value);
+ if (depends_on.elements &&
+ optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE) &&
+ !(engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ DBUG_ASSERT(scache == NULL);
+ scache= new Subquery_cache_tmptable(thd, depends_on, value);
+ DBUG_PRINT("info", ("cache: 0x%lx", (ulong) scache));
+ }
}
else
{
@@ -765,6 +785,7 @@
*/
if (engine->no_tables())
maybe_null= engine->may_be_null();
+ DBUG_VOID_RETURN;
}
uint Item_singlerow_subselect::cols()
@@ -797,77 +818,206 @@
exec();
}
+/**
+ Checks subquery cache for value
+
+ @retval NULL nothing found
+ @retval reference on item representing value found in the cache
+*/
+
+Item *Item_subselect::check_cache()
+{
+ DBUG_ENTER("Item_subselect::check_cache");
+ if (scache)
+ {
+ Subquery_cache_tmptable::result res;
+ Item *cached_value;
+ res= scache->check_value(&cached_value);
+ if (res == Subquery_cache_tmptable::HIT)
+ DBUG_RETURN(cached_value);
+ }
+ DBUG_RETURN(NULL);
+}
+
double Item_singlerow_subselect::val_real()
{
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_real");
DBUG_ASSERT(fixed == 1);
- if (!exec() && !value->null_value)
+
+ if ((cached_value = check_cache()))
+ {
+ double res= cached_value->val_real();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_real();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_real());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
longlong Item_singlerow_subselect::val_int()
{
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_int");
DBUG_ASSERT(fixed == 1);
- if (!exec() && !value->null_value)
+
+ if ((cached_value = check_cache()))
+ {
+ longlong res= cached_value->val_int();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_int();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_int());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
String *Item_singlerow_subselect::val_str(String *str)
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_str");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ String *res= cached_value->val_str(str);
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_str(str);
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_str(str));
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
my_decimal *Item_singlerow_subselect::val_decimal(my_decimal *decimal_value)
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_decimal");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_decimal *res= cached_value->val_decimal(decimal_value);
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_decimal(decimal_value);
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_decimal(decimal_value));
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
bool Item_singlerow_subselect::val_bool()
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_bool");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ bool res= cached_value->val_bool();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_bool();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_bool());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
@@ -952,33 +1102,79 @@
void Item_exists_subselect::fix_length_and_dec()
{
+ DBUG_ENTER("Item_exists_subselect::fix_length_and_dec");
decimals= 0;
max_length= 1;
max_columns= engine->cols();
/* We need only 1 row to determine existence */
unit->global_parameters->select_limit= new Item_int((int32) 1);
+ if (substype() == EXISTS_SUBS && depends_on.elements &&
+ optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE) &&
+ !(engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ DBUG_ASSERT(scache == NULL);
+ scache= new Subquery_cache_tmptable(thd, depends_on, &result);
+ DBUG_PRINT("info", ("cache: 0x%lx", (ulong) scache));
+ }
+ DBUG_VOID_RETURN;
}
double Item_exists_subselect::val_real()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_int");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ double res= cached_value->val_real();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
{
reset();
- return 0;
- }
- return (double) value;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN((double) value);
}
longlong Item_exists_subselect::val_int()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_real");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ longlong res= cached_value->val_int();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
DBUG_ASSERT(fixed == 1);
if (exec())
{
reset();
- return 0;
- }
- return value;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN(value);
}
@@ -997,11 +1193,32 @@
String *Item_exists_subselect::val_str(String *str)
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_str");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ String *res= cached_value->val_str(str);
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
+ {
reset();
+ str->set((ulonglong)0,&my_charset_bin);
+ DBUG_RETURN(str);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
str->set((ulonglong)value,&my_charset_bin);
- return str;
+ DBUG_RETURN(str);
}
@@ -1020,23 +1237,61 @@
my_decimal *Item_exists_subselect::val_decimal(my_decimal *decimal_value)
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_decvimal");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_decimal *res= cached_value->val_decimal(decimal_value);
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
+ {
reset();
+ int2my_decimal(E_DEC_FATAL_ERROR, 0, 0, decimal_value);
+ DBUG_RETURN(decimal_value);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
int2my_decimal(E_DEC_FATAL_ERROR, value, 0, decimal_value);
- return decimal_value;
+ DBUG_RETURN(decimal_value);
}
bool Item_exists_subselect::val_bool()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_real");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_bool res= cached_value->val_bool();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
{
reset();
- return 0;
- }
- return value != 0;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN(value != 0);
}
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-03-29 14:04:35 +0000
+++ b/sql/item_subselect.h 2010-05-31 21:25:54 +0000
@@ -27,6 +27,7 @@
class subselect_hash_sj_engine;
class Item_bool_func2;
class Cached_item;
+class Subquery_cache;
/* base class for subselects */
@@ -57,6 +58,10 @@
subselect_engine *engine;
/* old engine if engine was changed */
subselect_engine *old_engine;
+ /* subquery cache */
+ Subquery_cache *scache;
+ /* null consrtant for caching */
+ Item_null const_null_value;
/* cache of used external tables */
table_map used_tables_cache;
/* allowed number of columns (1 for single value subqueries) */
@@ -67,7 +72,7 @@
bool have_to_be_excluded;
/* cache of constant state */
bool const_item_cache;
-
+
bool inside_first_fix_fields;
bool done_first_fix_fields;
public:
@@ -88,13 +93,21 @@
*/
List<Ref_to_outside> upper_refs;
st_select_lex *parent_select;
-
- /*
+
+ /**
+ List of references on items subquery depends on (externally resolved);
+
+ @note We can't store direct links on Items because it could be
+ substituted with other item (for example for grouping).
+ */
+ List<Item*> depends_on;
+
+ /*
TRUE<=>Table Elimination has made it redundant to evaluate this select
(and so it is not part of QEP, etc)
- */
+ */
bool eliminated;
-
+
/* changed engine indicator */
bool engine_changed;
/* subquery is transformed */
@@ -178,6 +191,8 @@
return trace_unsupported_by_check_vcol_func_processor("subselect");
}
+ Item *check_cache();
+
/**
Get the SELECT_LEX structure associated with this Item.
@return the SELECT_LEX structure associated with this Item
@@ -202,6 +217,7 @@
{
protected:
Item_cache *value, **row;
+
public:
Item_singlerow_subselect(st_select_lex *select_lex);
Item_singlerow_subselect() :Item_subselect(), value(0), row (0) {}
@@ -268,6 +284,8 @@
{
protected:
bool value; /* value of this item (boolean: exists/not-exists) */
+ /* result representation for the subquery cache */
+ Item_bool_cache result;
public:
Item_exists_subselect(st_select_lex *select_lex);
=== modified file 'sql/item_sum.cc'
--- a/sql/item_sum.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item_sum.cc 2010-05-31 21:25:54 +0000
@@ -319,6 +319,7 @@
if (aggr_level >= 0)
{
ref_by= ref;
+ thd->lex->current_select->register_dependency_item(aggr_sel, ref);
/* Add the object to the list of registered objects assigned to aggr_sel */
if (!aggr_sel->inner_sum_func_list)
next= this;
=== modified file 'sql/mysql_priv.h'
--- a/sql/mysql_priv.h 2010-03-20 12:01:47 +0000
+++ b/sql/mysql_priv.h 2010-05-31 21:25:54 +0000
@@ -568,12 +568,13 @@
#define OPTIMIZER_SWITCH_SEMIJOIN 256
#define OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE 512
#define OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN 1024
+#define OPTIMIZER_SWITCH_SUBQUERY_CACHE (1<<11)
#ifdef DBUG_OFF
-# define OPTIMIZER_SWITCH_LAST 2048
+# define OPTIMIZER_SWITCH_LAST (1<<12)
#else
-# define OPTIMIZER_SWITCH_TABLE_ELIMINATION 2048
-# define OPTIMIZER_SWITCH_LAST 4096
+# define OPTIMIZER_SWITCH_TABLE_ELIMINATION (1<<12)
+# define OPTIMIZER_SWITCH_LAST (1<<13)
#endif
#ifdef DBUG_OFF
@@ -588,7 +589,8 @@
OPTIMIZER_SWITCH_MATERIALIZATION | \
OPTIMIZER_SWITCH_SEMIJOIN | \
OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE|\
- OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN)
+ OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN|\
+ OPTIMIZER_SWITCH_SUBQUERY_CACHE)
#else
# define OPTIMIZER_SWITCH_DEFAULT (OPTIMIZER_SWITCH_INDEX_MERGE | \
OPTIMIZER_SWITCH_INDEX_MERGE_UNION | \
@@ -601,7 +603,8 @@
OPTIMIZER_SWITCH_MATERIALIZATION | \
OPTIMIZER_SWITCH_SEMIJOIN | \
OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE|\
- OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN)
+ OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN|\
+ OPTIMIZER_SWITCH_SUBQUERY_CACHE)
#endif
/*
@@ -936,6 +939,7 @@
#ifdef MYSQL_SERVER
#include "sql_servers.h"
#include "opt_range.h"
+#include "sql_subquery_cache.h"
#ifdef HAVE_QUERY_CACHE
struct Query_cache_query_flags
@@ -1269,6 +1273,10 @@
Item *having, ORDER *proc_param, ulonglong select_type,
select_result *result, SELECT_LEX_UNIT *unit,
SELECT_LEX *select_lex);
+
+struct st_join_table *create_index_lookup_join_tab(TABLE *table);
+int join_read_key2(THD *thd, struct st_join_table *tab, TABLE *table,
+ struct st_table_ref *table_ref);
void free_underlaid_joins(THD *thd, SELECT_LEX *select);
bool mysql_explain_union(THD *thd, SELECT_LEX_UNIT *unit,
select_result *result);
@@ -1288,6 +1296,7 @@
bool table_cant_handle_bit_fields,
bool make_copy_field,
uint convert_blob_length);
+bool open_tmp_table(TABLE *table);
void sp_prepare_create_field(THD *thd, Create_field *sql_field);
int prepare_create_field(Create_field *sql_field,
uint *blob_columns,
=== modified file 'sql/mysqld.cc'
--- a/sql/mysqld.cc 2010-03-20 12:01:47 +0000
+++ b/sql/mysqld.cc 2010-05-31 21:25:54 +0000
@@ -305,6 +305,7 @@
"firstmatch","loosescan","materialization", "semijoin",
"partial_match_rowid_merge",
"partial_match_table_scan",
+ "subquery_cache",
#ifndef DBUG_OFF
"table_elimination",
#endif
@@ -325,6 +326,7 @@
sizeof("semijoin") - 1,
sizeof("partial_match_rowid_merge") - 1,
sizeof("partial_match_table_scan") - 1,
+ sizeof("subquery_cache") - 1,
#ifndef DBUG_OFF
sizeof("table_elimination") - 1,
#endif
@@ -404,8 +406,9 @@
static const char *optimizer_switch_str="index_merge=on,index_merge_union=on,"
"index_merge_sort_union=on,"
"index_merge_intersection=on,"
- "index_condition_pushdown=on"
-#ifndef DBUG_OFF
+ "index_condition_pushdown=on,"
+ "subquery_cache=on"
+#ifndef DBUG_OFF
",table_elimination=on";
#else
;
@@ -5872,7 +5875,9 @@
OPT_RECORD_RND_BUFFER, OPT_DIV_PRECINCREMENT, OPT_RELAY_LOG_SPACE_LIMIT,
OPT_RELAY_LOG_PURGE,
OPT_SLAVE_NET_TIMEOUT, OPT_SLAVE_COMPRESSED_PROTOCOL, OPT_SLOW_LAUNCH_TIME,
- OPT_SLAVE_TRANS_RETRIES, OPT_READONLY, OPT_ROWID_MERGE_BUFF_SIZE,
+ OPT_SLAVE_TRANS_RETRIES,
+ OPT_SUBQUERY_CACHE,
+ OPT_READONLY, OPT_ROWID_MERGE_BUFF_SIZE,
OPT_DEBUGGING, OPT_DEBUG_FLUSH,
OPT_SORT_BUFFER, OPT_TABLE_OPEN_CACHE, OPT_TABLE_DEF_CACHE,
OPT_THREAD_CONCURRENCY, OPT_THREAD_CACHE_SIZE,
@@ -7164,7 +7169,7 @@
{"optimizer_switch", OPT_OPTIMIZER_SWITCH,
"optimizer_switch=option=val[,option=val...], where option={index_merge, "
"index_merge_union, index_merge_sort_union, index_merge_intersection, "
- "index_condition_pushdown"
+ "index_condition_pushdown, subquery_cache"
#ifndef DBUG_OFF
", table_elimination"
#endif
@@ -7868,6 +7873,8 @@
{"Ssl_version", (char*) &show_ssl_get_version, SHOW_FUNC},
#endif /* HAVE_OPENSSL */
{"Syncs", (char*) &my_sync_count, SHOW_LONG_NOFLUSH},
+ {"Subquery_cache_hit", (char*) &subquery_cache_hit, SHOW_LONG},
+ {"Subquery_cache_miss", (char*) &subquery_cache_miss, SHOW_LONG},
{"Table_locks_immediate", (char*) &locks_immediate, SHOW_LONG},
{"Table_locks_waited", (char*) &locks_waited, SHOW_LONG},
#ifdef HAVE_MMAP
@@ -8006,6 +8013,7 @@
abort_loop= select_thread_in_use= signal_thread_in_use= 0;
ready_to_exit= shutdown_in_progress= grant_option= 0;
aborted_threads= aborted_connects= 0;
+ subquery_cache_miss= subquery_cache_hit= 0;
delayed_insert_threads= delayed_insert_writes= delayed_rows_in_use= 0;
delayed_insert_errors= thread_created= 0;
specialflag= 0;
=== modified file 'sql/sql_base.cc'
--- a/sql/sql_base.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_base.cc 2010-05-31 21:25:54 +0000
@@ -8062,6 +8062,10 @@
if (*conds)
{
thd->where="where clause";
+ DBUG_EXECUTE("where",
+ print_where(*conds,
+ "WHERE in setup_conds",
+ QT_ORDINARY););
if ((!(*conds)->fixed && (*conds)->fix_fields(thd, conds)) ||
(*conds)->check_cols(1))
goto err_no_arena;
=== modified file 'sql/sql_class.cc'
--- a/sql/sql_class.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_class.cc 2010-05-31 21:25:54 +0000
@@ -3020,6 +3020,7 @@
table_charset= 0;
precomputed_group_by= 0;
bit_fields_as_long= 0;
+ skip_create_table= 0;
DBUG_VOID_RETURN;
}
=== modified file 'sql/sql_class.h'
--- a/sql/sql_class.h 2010-03-20 12:01:47 +0000
+++ b/sql/sql_class.h 2010-05-31 21:25:54 +0000
@@ -2786,12 +2786,17 @@
that MEMORY tables cannot index BIT columns.
*/
bool bit_fields_as_long;
+ /*
+ Whether to create or postpone actual creation of this temporary table.
+ TRUE <=> create_tmp_table will create only the TABLE structure.
+ */
+ bool skip_create_table;
TMP_TABLE_PARAM()
:copy_field(0), group_parts(0),
group_length(0), group_null_parts(0), convert_blob_length(0),
schema_table(0), precomputed_group_by(0), force_copy_fields(0),
- bit_fields_as_long(0)
+ bit_fields_as_long(0), skip_create_table(0)
{}
~TMP_TABLE_PARAM()
{
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_lex.cc 2010-05-31 21:25:54 +0000
@@ -1829,6 +1829,52 @@
}
+/**
+ Registers reference on items on which the subqueries depends
+
+ @param last pointer to last st_select_lex struct, before
+ which all st_select_lex have to be marked as
+ dependent
+ @param dependency reference on the item on which all this
+ subqueries depends
+
+*/
+
+void st_select_lex::register_dependency_item(st_select_lex *last,
+ Item **dependency)
+{
+ SELECT_LEX *s= this;
+ DBUG_ENTER("st_select_lex::register_dependency_item");
+ DBUG_ASSERT(this != last);
+ DBUG_ASSERT(*dependency);
+ do
+ {
+ /* check duplicates */
+ List_iterator_fast<Item*> li(s->master_unit()->item->depends_on);
+ Item **dep;
+ while ((dep= li++))
+ {
+ if ((*dep)->eq(*dependency, FALSE))
+ {
+ DBUG_PRINT("info", ("dependency %s already present",
+ ((*dependency)->name ?
+ (*dependency)->name :
+ "<no name>")));
+ DBUG_VOID_RETURN;
+ }
+ }
+
+ s->master_unit()->item->depends_on.push_back(dependency);
+ DBUG_PRINT("info", ("depends_on: Select: %d added: %s",
+ s->select_number,
+ ((*dependency)->name ?
+ (*dependency)->name :
+ "<no name>")));
+ } while ((s= s->outer_select()) != last && s != 0);
+ DBUG_VOID_RETURN;
+}
+
+
/*
st_select_lex_node::mark_as_dependent mark all st_select_lex struct from
this to 'last' as dependent
@@ -1843,7 +1889,7 @@
bool st_select_lex::mark_as_dependent(THD *thd, st_select_lex *last, Item *dependency)
{
-
+ DBUG_ENTER("st_select_lex::mark_as_dependent");
DBUG_ASSERT(this != last);
/*
@@ -1872,11 +1918,11 @@
Item_subselect *subquery_expr= s->master_unit()->item;
if (subquery_expr && subquery_expr->mark_as_dependent(thd, last,
dependency))
- return TRUE;
+ DBUG_RETURN(TRUE);
} while ((s= s->outer_select()) != last && s != 0);
is_correlated= TRUE;
this->master_unit()->item->is_correlated= TRUE;
- return FALSE;
+ DBUG_RETURN(FALSE);
}
bool st_select_lex_node::set_braces(bool value) { return 1; }
=== modified file 'sql/sql_lex.h'
--- a/sql/sql_lex.h 2010-03-20 12:01:47 +0000
+++ b/sql/sql_lex.h 2010-05-31 21:25:54 +0000
@@ -748,6 +748,7 @@
}
bool mark_as_dependent(THD *thd, st_select_lex *last, Item *dependency);
+ void register_dependency_item(st_select_lex *last, Item **dependency);
bool set_braces(bool value);
bool inc_in_sum_expr();
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2010-05-10 13:46:08 +0000
+++ b/sql/sql_select.cc 2010-05-31 21:25:54 +0000
@@ -151,7 +151,6 @@
static int join_read_system(JOIN_TAB *tab);
static int join_read_const(JOIN_TAB *tab);
static int join_read_key(JOIN_TAB *tab);
-static int join_read_key2(JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref);
static void join_read_key_unlock_row(st_join_table *tab);
static int join_read_always_key(JOIN_TAB *tab);
static int join_read_last_key(JOIN_TAB *tab);
@@ -5209,7 +5208,7 @@
'join->best_positions' contains a complete optimal extension of the
current partial QEP.
*/
- DBUG_EXECUTE("opt", print_plan(join, join->tables,
+ DBUG_EXECUTE("opt", print_plan(join, n_tables,
record_count, read_time, read_time,
"optimal"););
DBUG_RETURN(FALSE);
@@ -7625,6 +7624,40 @@
/**
+ Creates and fills JOIN_TAB for index look up in temporary table
+
+ @param table The table where to look up
+
+ @return JOIN_TAB object or NULL in case of error
+*/
+
+JOIN_TAB *create_index_lookup_join_tab(TABLE *table)
+{
+ JOIN_TAB *tab;
+ DBUG_ENTER("create_index_lookup_join_tab");
+
+ if (!((tab= new JOIN_TAB)))
+ DBUG_RETURN(NULL);
+ tab->read_record.table= table;
+ tab->read_record.file=table->file;
+ /*tab->read_record.unlock_row= rr_unlock_row;*/
+ tab->next_select=0;
+ tab->sorted= 1;
+
+ table->status= STATUS_NO_RECORD;
+ tab->read_first_record= join_read_key;
+ /*tab->read_record.unlock_row= join_read_key_unlock_row;*/
+ tab->read_record.read_record= join_no_more_records;
+ if (table->covering_keys.is_set(tab->ref.key) &&
+ !table->no_keyread)
+ {
+ table->key_read=1;
+ table->file->extra(HA_EXTRA_KEYREAD);
+ }
+ DBUG_RETURN(tab);
+}
+
+/**
Give error if we some tables are done with a full join.
This is used by multi_table_update and multi_table_delete when running
@@ -10778,6 +10811,7 @@
case Item::REF_ITEM:
case Item::NULL_ITEM:
case Item::VARBIN_ITEM:
+ case Item::CACHE_ITEM:
if (make_copy_field)
{
DBUG_ASSERT(((Item_result_field*)item)->result_field);
@@ -11552,7 +11586,8 @@
¶m->recinfo, select_options))
goto err;
}
- if (open_tmp_table(table))
+ DBUG_PRINT("info", ("skip_create_table: %d", (int)param->skip_create_table));
+ if (!param->skip_create_table && open_tmp_table(table))
goto err;
thd->mem_root= mem_root_save;
@@ -11700,16 +11735,17 @@
bool open_tmp_table(TABLE *table)
{
int error;
+ DBUG_ENTER("open_tmp_table");
if ((error= table->file->ha_open(table, table->s->table_name.str, O_RDWR,
HA_OPEN_TMP_TABLE |
HA_OPEN_INTERNAL_TABLE)))
{
table->file->print_error(error,MYF(0)); /* purecov: inspected */
table->db_stat=0;
- return(1);
+ DBUG_RETURN(1);
}
(void) table->file->extra(HA_EXTRA_QUICK); /* Faster */
- return(0);
+ DBUG_RETURN(0);
}
@@ -12540,7 +12576,8 @@
else
{
/* Do index lookup in the materialized table */
- if ((res= join_read_key2(join_tab, sjm->table, sjm->tab_ref)) == 1)
+ if ((res= join_read_key2(join_tab->join->thd, join_tab,
+ sjm->table, sjm->tab_ref)) == 1)
DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
if (res || !sjm->in_equality->val_int())
DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
@@ -13323,61 +13360,62 @@
static int
join_read_key(JOIN_TAB *tab)
{
- return join_read_key2(tab, tab->table, &tab->ref);
+ return join_read_key2(tab->join->thd, tab, tab->table, &tab->ref);
}
-/*
+/*
eq_ref access handler but generalized a bit to support TABLE and TABLE_REF
not from the join_tab. See join_read_key for detailed synopsis.
*/
-static int
-join_read_key2(JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref)
+int join_read_key2(THD *thd, JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref)
{
int error;
+ DBUG_ENTER("join_read_key2");
if (!table->file->inited)
{
table->file->ha_index_init(table_ref->key, tab->sorted);
}
/* TODO: Why don't we do "Late NULLs Filtering" here? */
- if (cmp_buffer_with_ref(tab->join->thd, table, table_ref) ||
+ if (cmp_buffer_with_ref(thd, table, table_ref) ||
(table->status & (STATUS_GARBAGE | STATUS_NO_PARENT | STATUS_NULL_ROW)))
{
if (table_ref->key_err)
{
table->status=STATUS_NOT_FOUND;
- return -1;
+ DBUG_RETURN(-1);
}
/*
Moving away from the current record. Unlock the row
in the handler if it did not match the partial WHERE.
*/
- if (tab->ref.has_record && tab->ref.use_count == 0)
+ if (table_ref->has_record )
+ if (table_ref->use_count == 0)
{
tab->read_record.file->unlock_row();
- tab->ref.has_record= FALSE;
+ table_ref->has_record= FALSE;
}
error=table->file->ha_index_read_map(table->record[0],
table_ref->key_buff,
make_prev_keypart_map(table_ref->key_parts),
HA_READ_KEY_EXACT);
if (error && error != HA_ERR_KEY_NOT_FOUND && error != HA_ERR_END_OF_FILE)
- return report_error(table, error);
+ DBUG_RETURN(report_error(table, error));
if (! error)
{
- tab->ref.has_record= TRUE;
- tab->ref.use_count= 1;
+ table_ref->has_record= TRUE;
+ table_ref->use_count= 1;
}
}
else if (table->status == 0)
{
- DBUG_ASSERT(tab->ref.has_record);
- tab->ref.use_count++;
+ DBUG_ASSERT(table_ref->has_record);
+ table_ref->use_count++;
}
table->null_row=0;
- return table->status ? -1 : 0;
+ DBUG_RETURN(table->status ? -1 : 0);
}
=== added file 'sql/sql_subquery_cache.cc'
--- a/sql/sql_subquery_cache.cc 1970-01-01 00:00:00 +0000
+++ b/sql/sql_subquery_cache.cc 2010-05-31 21:25:54 +0000
@@ -0,0 +1,360 @@
+
+#include "mysql_priv.h"
+#include "sql_select.h"
+
+ulonglong subquery_cache_miss, subquery_cache_hit;
+
+/**
+ Creates structures which we need for index look up
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
+static my_bool createtmp_table_search_structures(THD *thd,
+ TABLE *table,
+ List_iterator_fast<Item> &li,
+ TABLE_REF **ref)
+{
+ /*
+ Create/initialize everything we will need to index lookups into the
+ temptable.
+ */
+ TABLE_REF *tab_ref;
+ KEY *tmp_key; /* The only index on the temporary table. */
+ Item *item;
+ uint tmp_key_parts; /* Number of keyparts in tmp_key. */
+ uint i;
+
+ DBUG_ENTER("createtmp_table_search_structures");
+
+ tmp_key= table->key_info;
+ tmp_key_parts= tmp_key->key_parts;
+
+ if (!(tab_ref= (TABLE_REF*) thd->alloc(sizeof(TABLE_REF))))
+ DBUG_RETURN(TRUE);
+
+ tab_ref->key= 0; /* The only temp table index. */
+ tab_ref->key_length= tmp_key->key_length;
+ if (!(tab_ref->key_buff=
+ (uchar*) thd->calloc(ALIGN_SIZE(tmp_key->key_length) * 2)) ||
+ !(tab_ref->key_copy=
+ (store_key**) thd->alloc((sizeof(store_key*) *
+ (tmp_key_parts + 1)))) ||
+ !(tab_ref->items=
+ (Item**) thd->alloc(sizeof(Item*) * tmp_key_parts)))
+ DBUG_RETURN(TRUE); /* purecov: inspected */
+
+ tab_ref->key_buff2=tab_ref->key_buff+ALIGN_SIZE(tmp_key->key_length);
+ tab_ref->key_err=1;
+ tab_ref->null_rejecting= 1;
+ tab_ref->disable_cache= FALSE;
+ tab_ref->has_record= 0;
+
+ KEY_PART_INFO *cur_key_part= tmp_key->key_part;
+ store_key **ref_key= tab_ref->key_copy;
+ uchar *cur_ref_buff= tab_ref->key_buff;
+
+ for (i= 0; i < tmp_key_parts; i++, cur_key_part++, ref_key++)
+ {
+ item= li++;
+ DBUG_ASSERT(item);
+ tab_ref->items[i]= item;
+ int null_count= test(cur_key_part->field->real_maybe_null());
+ *ref_key= new store_key_item(thd, cur_key_part->field,
+ /* TODO:
+ the NULL byte is taken into account in
+ cur_key_part->store_length, so instead of
+ cur_ref_buff + test(maybe_null), we could
+ use that information instead.
+ */
+ cur_ref_buff + null_count,
+ null_count ? tab_ref->key_buff : 0,
+ cur_key_part->length, tab_ref->items[i]);
+ cur_ref_buff+= cur_key_part->store_length;
+ }
+ *ref_key= NULL; /* End marker. */
+ tab_ref->key_err= 1;
+ tab_ref->key_parts= tmp_key_parts;
+ *ref= tab_ref;
+
+ DBUG_RETURN(FALSE);
+}
+
+
+Subquery_cache_tmptable::Subquery_cache_tmptable(THD *thd,
+ List<Item*> &dependance,
+ Item *value)
+ :cache_table(NULL), table_thd(thd), list(&dependance), val(value),
+ equalities(NULL), inited (0)
+{
+ DBUG_ENTER("Subquery_cache_tmptable::Subquery_cache_tmptable");
+ DBUG_VOID_RETURN;
+};
+
+
+/**
+ Creates equalities expression.
+
+ @note For some type of fields index lookup do not return failure but set
+ pointer on the next record. To check exact match we use expression like:
+ field1=value1 and field2=value2 ...
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
+bool Subquery_cache_tmptable::make_equalities()
+{
+ List<Item> args;
+ List_iterator_fast<Item*> li(*list);
+ Item **ref;
+ Name_resolution_context *cn= NULL;
+ DBUG_ENTER("Subquery_cache_tmptable::make_equalities");
+
+ for (uint i= 1 /* skip result filed */; (ref= li++); i++)
+ {
+ Field *fld= cache_table->field[i];
+ /* Only some field types should be checked after lookup */
+ if (fld->type() == MYSQL_TYPE_VARCHAR ||
+ fld->type() == MYSQL_TYPE_TINY_BLOB ||
+ fld->type() == MYSQL_TYPE_MEDIUM_BLOB ||
+ fld->type() == MYSQL_TYPE_LONG_BLOB ||
+ fld->type() == MYSQL_TYPE_BLOB ||
+ fld->type() == MYSQL_TYPE_VAR_STRING ||
+ fld->type() == MYSQL_TYPE_STRING ||
+ fld->type() == MYSQL_TYPE_NEWDECIMAL ||
+ fld->type() == MYSQL_TYPE_DECIMAL)
+ {
+ if (!cn)
+ {
+ // dummy resolution context
+ cn= new Name_resolution_context();
+ cn->init();
+ }
+ args.push_front(new Item_func_eq(new Item_ref(cn, ref, "", "", FALSE),
+ new Item_field(fld)));
+ }
+ }
+ if (args.elements == 1)
+ equalities= args.head();
+ else
+ equalities= new Item_cond_and(args);
+
+ DBUG_RETURN(equalities->fix_fields(table_thd, &equalities));
+}
+
+
+/**
+ Enumerates all fields in field number order.
+
+ @param arg reference on current field number
+
+ @return field number
+*/
+
+static uint field_enumerator(uchar *arg)
+{
+ return ((uint*)arg)[0]++;
+}
+
+/**
+ Initializes temporary table and index for this cache
+*/
+
+void Subquery_cache_tmptable::init()
+{
+ List_iterator_fast<Item*> li(*list);
+ List_iterator_fast<Item> li_items(items);
+ Item **item;
+ uint field_counter;
+ DBUG_ENTER("Subquery_cache_tmptable::init");
+ DBUG_ASSERT(!inited);
+ inited= TRUE;
+
+ if (!(ULONGLONG_MAX >> (list->elements + 1)))
+ {
+ DBUG_PRINT("info", ("Too many dependencies"));
+ DBUG_VOID_RETURN;
+ }
+
+ cache_table= NULL;
+ while ((item= li++))
+ {
+ DBUG_ASSERT(item);
+ DBUG_ASSERT(*item);
+ DBUG_ASSERT((*item)->fixed);
+ items.push_back((*item));
+ }
+
+ cache_table_param.init();
+ /* dependance items and result */
+ cache_table_param.field_count= list->elements + 1;
+ /* postpone table creation to index description */
+ cache_table_param.skip_create_table= 1;
+
+
+ items.push_front(val);
+ if (!(cache_table= create_tmp_table(table_thd, &cache_table_param,
+ items, (ORDER*) NULL,
+ FALSE, FALSE,
+ ((table_thd->options |
+ TMP_TABLE_ALL_COLUMNS) &
+ ~(OPTION_BIG_TABLES |
+ TMP_TABLE_FORCE_MYISAM)),
+ HA_POS_ERROR,
+ (char *)"subquery-cache-table")))
+ {
+ DBUG_PRINT("error", ("create_tmp_table failed, caching switched off"));
+ DBUG_VOID_RETURN;
+ }
+
+ if (cache_table->s->db_type() != heap_hton)
+ {
+ DBUG_PRINT("error", ("we need only heap table"));
+ goto error;
+ }
+
+ /* first field in the table is result value, so we skip it */
+ li_items++;
+ field_counter=1;
+
+ if (cache_table->alloc_keys(1) ||
+ (cache_table->add_tmp_key(0, items.elements - 1,
+ &field_enumerator,
+ (uchar*)&field_counter) < 0) ||
+ createtmp_table_search_structures(table_thd, cache_table, li_items,
+ &tab_ref) ||
+ !(tab= create_index_lookup_join_tab(cache_table)))
+ {
+ DBUG_PRINT("error", ("creating index failed"));
+ goto error;
+ }
+ cache_table->s->keys= 1;
+ cache_table->s->uniques= 1;
+
+ if (open_tmp_table(cache_table))
+ {
+ DBUG_PRINT("error", ("Opening (creating) temporary table failed"));
+ goto error;
+ }
+
+ if (!(chached_result= new Item_field(cache_table->field[0])))
+ {
+ DBUG_PRINT("error", ("Creating Item_field failed"));
+ goto error;
+ }
+
+ if (make_equalities())
+ {
+ DBUG_PRINT("error", ("Creating equalities failed"));
+ goto error;
+ }
+
+ DBUG_VOID_RETURN;
+
+error:
+ /* switch off cache */
+ free_tmp_table(table_thd, cache_table);
+ cache_table= NULL;
+ DBUG_VOID_RETURN;
+}
+
+
+Subquery_cache_tmptable::~Subquery_cache_tmptable()
+{
+ if (cache_table)
+ free_tmp_table(table_thd, cache_table);
+}
+
+
+/**
+ Checks if current key present in the cache and returns value if it is true
+
+ @param value assigned Item with value from the cache if key
+ is found
+ @return result of the key lookup
+*/
+
+Subquery_cache::result Subquery_cache_tmptable::check_value(Item **value)
+{
+ int res;
+ DBUG_ENTER("Subquery_cache_tmptable::check_value");
+
+ /*
+ We delay cache initialization to get item references which should be
+ used at the moment of query execution. I.e. we store reference on item
+ reference at the moment of class creation but for table creation and
+ index supply structures (join_tab) we need real Items which used at the
+ moment of execution so we can resolve reference only at this point.
+ */
+ if (!inited)
+ init();
+
+ if (cache_table)
+ {
+ DBUG_PRINT("info", ("status: %u has_record %u",
+ (uint)cache_table->status, (uint)tab_ref->has_record));
+ if ((res= join_read_key2(table_thd, tab, cache_table, tab_ref)) == 1)
+ DBUG_RETURN(ERROR);
+ if (res || (equalities && !equalities->val_int()))
+ {
+ subquery_cache_miss++;
+ DBUG_RETURN(MISS);
+ }
+
+ subquery_cache_hit++;
+ *value= chached_result;
+ DBUG_RETURN(Subquery_cache::HIT);
+ }
+ DBUG_RETURN(Subquery_cache::MISS);
+}
+
+
+/**
+ Puts given value in the cache
+
+ @param value Value to put in the cache
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
+my_bool Subquery_cache_tmptable::put_value(Item *value)
+{
+ int error;
+ DBUG_ENTER("Subquery_cache_tmptable::put_value");
+ DBUG_ASSERT(inited);
+
+ if (!cache_table)
+ {
+ DBUG_PRINT("info", ("No table so behave as we successfully put value"));
+ DBUG_RETURN(FALSE);
+ }
+
+ *(items.head_ref())= value;
+ fill_record(table_thd, cache_table->field, items, 1);
+ if (table_thd->is_error())
+ goto err;;
+
+ if ((error= cache_table->file->ha_write_row(cache_table->record[0])))
+ {
+ /* create_myisam_from_heap will generate error if needed */
+ if (cache_table->file->is_fatal_error(error, HA_CHECK_DUP) &&
+ create_internal_tmp_table_from_heap(table_thd, cache_table,
+ cache_table_param.start_recinfo,
+ &cache_table_param.recinfo,
+ error, 1))
+ goto err;
+ }
+ cache_table->status= 0; /* cache_table->record contains an existed record */
+ tab_ref->has_record= TRUE; /* the same as above */
+ DBUG_PRINT("info", ("has_record: TRUE status: 0"));
+
+ DBUG_RETURN(FALSE);
+
+err:
+ free_tmp_table(table_thd, cache_table);
+ cache_table= NULL;
+ DBUG_RETURN(TRUE);
+}
=== added file 'sql/sql_subquery_cache.h'
--- a/sql/sql_subquery_cache.h 1970-01-01 00:00:00 +0000
+++ b/sql/sql_subquery_cache.h 2010-05-31 21:25:54 +0000
@@ -0,0 +1,74 @@
+#ifndef _SQL_SUBQUERY_CACHE_H_
+#define _SQL_SUBQUERY_CACHE_H_
+
+/**
+ Interface for subquery cache
+*/
+
+extern ulonglong subquery_cache_miss, subquery_cache_hit;
+
+class Subquery_cache :public Sql_alloc
+{
+public:
+ enum result {ERROR, HIT, MISS};
+
+ Subquery_cache(){};
+ virtual ~Subquery_cache() {};
+ /**
+ Checks presence of the key (taken from cache owner) and if found return
+ it via value parameter
+ */
+ virtual result check_value(Item **value)= 0;
+ /**
+ Puts value into this cache (key should be taken from cache owner)
+ */
+ virtual my_bool put_value(Item *value)= 0;
+};
+
+struct st_table_ref;
+struct st_join_table;
+//class Item_cache;
+class Item_field;
+
+/**
+ Implementation of subquery cache over temporary table
+*/
+
+class Subquery_cache_tmptable :public Subquery_cache
+{
+public:
+ Subquery_cache_tmptable(THD *thd, List<Item*> &dependance, Item *value);
+ virtual ~Subquery_cache_tmptable();
+ virtual result check_value(Item **value);
+ virtual my_bool put_value(Item *value);
+
+private:
+ void init();
+ bool make_equalities();
+
+ /* tmp table parameters */
+ TMP_TABLE_PARAM cache_table_param;
+ /* temporary table to store this cache */
+ TABLE *cache_table;
+ /* Thread handler for the temporary table */
+ THD *table_thd;
+ /* tab_ref for index search */
+ struct st_table_ref *tab_ref;
+ /* cache of subquery value to avoid evaluating it twice */
+ //Item_cache *value_cache;
+ /* JOIN_TAB for index lookup */
+ st_join_table *tab;
+ /* Chached result */
+ Item_field *chached_result;
+ /* List of references to items */
+ List<Item*> *list;
+ /* List of items */
+ List<Item> items;
+ /* Value Item example */
+ Item *val;
+ /* Expression to check after index lookup */
+ Item *equalities;
+ /* set if structures are inited */
+ bool inited;
+};
+#endif
=== modified file 'sql/table.cc'
--- a/sql/table.cc 2010-03-20 12:01:47 +0000
+++ b/sql/table.cc 2010-05-31 21:25:54 +0000
@@ -20,6 +20,7 @@
#include "sql_trigger.h"
#include <m_ctype.h>
#include "my_md5.h"
+#include "my_bit.h"
/* INFORMATION_SCHEMA name */
LEX_STRING INFORMATION_SCHEMA_NAME= {C_STRING_WITH_LEN("information_schema")};
@@ -5096,6 +5097,115 @@
file->column_bitmaps_signal();
}
+
+/**
+ @brief
+ Allocate space for keys
+
+ @param key_count number of keys to allocate.
+
+ @details
+ Allocate space enough to fit 'key_count' keys for this table.
+
+ @return FALSE space was successfully allocated.
+ @return TRUE an error occur.
+*/
+
+bool TABLE::alloc_keys(uint key_count)
+{
+ DBUG_ASSERT(!s->keys);
+ key_info= s->key_info= (KEY*) alloc_root(&mem_root, sizeof(KEY)*key_count);
+ max_keys= key_count;
+ return !(key_info);
+}
+
+
+/**
+ @brief Adds one key to a temporary table.
+
+ @param key key number.
+ @param key_parts number of fields in the key
+ @param next_field_no function which returns field numbers which
+ should be included in the key
+ @param arg above function argement
+
+ @return <0 an error occur.
+ @return >=0 number of newly added key.
+*/
+
+bool TABLE::add_tmp_key(uint key, uint key_parts,
+ uint (*next_field_no) (uchar *), uchar *arg)
+{
+ DBUG_ASSERT(key < max_keys);
+
+ char buf[NAME_CHAR_LEN];
+ KEY* keyinfo;
+ Field **reg_field;
+ uint i;
+ bool key_start= TRUE;
+ KEY_PART_INFO* key_part_info=
+ (KEY_PART_INFO*) alloc_root(&mem_root, sizeof(KEY_PART_INFO)*key_parts);
+ if (!key_part_info)
+ return TRUE;
+ keyinfo= key_info + key;
+ keyinfo->key_part= key_part_info;
+ keyinfo->usable_key_parts= keyinfo->key_parts = key_parts;
+ keyinfo->key_length=0;
+ keyinfo->algorithm= HA_KEY_ALG_UNDEF;
+ keyinfo->flags= HA_GENERATED_KEY;
+ sprintf(buf, "key%i", key);
+ if (!(keyinfo->name= strdup_root(&mem_root, buf)))
+ return TRUE;
+ keyinfo->rec_per_key= (ulong*) alloc_root(&mem_root,
+ sizeof(ulong)*key_parts);
+ if (!keyinfo->rec_per_key)
+ return TRUE;
+ bzero(keyinfo->rec_per_key, sizeof(ulong)*key_parts);
+ for (i= 0; i < key_parts; i++)
+ {
+ reg_field= field + next_field_no(arg);
+ if (key_start)
+ (*reg_field)->key_start.set_bit(key);
+ key_start= FALSE;
+ (*reg_field)->part_of_key.set_bit(key);
+ (*reg_field)->flags|= PART_KEY_FLAG;
+ key_part_info->null_bit= (*reg_field)->null_bit;
+ key_part_info->null_offset= (uint) ((*reg_field)->null_ptr -
+ (uchar*) record[0]);
+ key_part_info->field= *reg_field;
+ key_part_info->offset= (*reg_field)->offset(record[0]);
+ key_part_info->length= (uint16) (*reg_field)->pack_length();
+ keyinfo->key_length+= key_part_info->length;
+ /* TODO:
+ The below method of computing the key format length of the
+ key part is a copy/paste from opt_range.cc, and table.cc.
+ This should be factored out, e.g. as a method of Field.
+ In addition it is not clear if any of the Field::*_length
+ methods is supposed to compute the same length. If so, it
+ might be reused.
+ */
+ key_part_info->store_length= key_part_info->length;
+
+ if ((*reg_field)->real_maybe_null())
+ key_part_info->store_length+= HA_KEY_NULL_LENGTH;
+ if ((*reg_field)->type() == MYSQL_TYPE_BLOB ||
+ (*reg_field)->real_type() == MYSQL_TYPE_VARCHAR)
+ key_part_info->store_length+= HA_KEY_BLOB_LENGTH;
+
+ key_part_info->type= (uint8) (*reg_field)->key_type();
+ key_part_info->key_type =
+ ((ha_base_keytype) key_part_info->type == HA_KEYTYPE_TEXT ||
+ (ha_base_keytype) key_part_info->type == HA_KEYTYPE_VARTEXT1 ||
+ (ha_base_keytype) key_part_info->type == HA_KEYTYPE_VARTEXT2) ?
+ 0 : FIELDFLAG_BINARY;
+ key_part_info++;
+ }
+ set_if_bigger(s->max_key_length, keyinfo->key_length);
+ s->keys++;
+ return FALSE;
+}
+
+
/**
@brief Check if this is part of a MERGE table with attached children.
=== modified file 'sql/table.h'
--- a/sql/table.h 2010-03-20 12:01:47 +0000
+++ b/sql/table.h 2010-05-31 21:25:54 +0000
@@ -781,6 +781,7 @@
uint temp_pool_slot; /* Used by intern temp tables */
uint status; /* What's in record[0] */
uint db_stat; /* mode of file as in handler.h */
+ uint max_keys; /* Size of allocated key_info array. */
/* number of select if it is derived table */
uint derived_select_number;
int current_lock; /* Type of lock on table */
@@ -913,6 +914,9 @@
*/
inline bool needs_reopen_or_name_lock()
{ return s->version != refresh_version; }
+ bool alloc_keys(uint key_count);
+ bool add_tmp_key(uint key, uint key_parts,
+ uint (*next_field_no) (uchar *), uchar *arg);
bool is_children_attached(void);
};
=== modified file 'storage/maria/ha_maria.cc'
--- a/storage/maria/ha_maria.cc 2010-03-20 12:01:47 +0000
+++ b/storage/maria/ha_maria.cc 2010-05-31 21:25:54 +0000
@@ -995,6 +995,8 @@
{
MARIA_HA *tmp= file;
file= 0;
+ if (!tmp)
+ return 0;
return maria_close(tmp);
}
1
0
[Maria-developers] Rev 2789: Subquery cache for pre-review (MWL#66) in file:///home/bell/maria/bzr/work-maria-5.3-scache2/
by sanja@askmonty.org 31 May '10
by sanja@askmonty.org 31 May '10
31 May '10
At file:///home/bell/maria/bzr/work-maria-5.3-scache2/
------------------------------------------------------------
revno: 2789
revision-id: sanja(a)askmonty.org-20100531212240-qwphnvofu9f0l06l
parent: sergii(a)pisem.net-20100510134608-oyi2vznyghgcrt0x
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-scache2
timestamp: Tue 2010-06-01 00:22:40 +0300
message:
Subquery cache for pre-review (MWL#66)
=== modified file 'libmysqld/Makefile.am'
--- a/libmysqld/Makefile.am 2010-03-20 12:01:47 +0000
+++ b/libmysqld/Makefile.am 2010-05-31 21:22:40 +0000
@@ -80,7 +80,8 @@
sql_tablespace.cc \
rpl_injector.cc my_user.c partition_info.cc \
sql_servers.cc event_parse_data.cc opt_table_elimination.cc \
- multi_range_read.cc opt_index_cond_pushdown.cc
+ multi_range_read.cc opt_index_cond_pushdown.cc \
+ sql_subquery_cache.cc
libmysqld_int_a_SOURCES= $(libmysqld_sources)
nodist_libmysqld_int_a_SOURCES= $(libmysqlsources) $(sqlsources)
=== modified file 'mysql-test/r/index_merge_myisam.result'
--- a/mysql-test/r/index_merge_myisam.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/index_merge_myisam.result 2010-05-31 21:22:40 +0000
@@ -1419,19 +1419,19 @@
#
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='index_merge=off,index_merge_union=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='index_merge_union=on';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,index_merge_sort_union=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=off,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=off,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=4;
ERROR 42000: Variable 'optimizer_switch' can't be set to the value of '4'
set optimizer_switch=NULL;
@@ -1458,21 +1458,21 @@
set optimizer_switch='index_merge=off,index_merge_union=off,default';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
select @@global.optimizer_switch;
@@global.optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set @@global.optimizer_switch=default;
select @@global.optimizer_switch;
@@global.optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
#
# Check index_merge's @@optimizer_switch flags
#
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, c int, filler char(100),
@@ -1582,5 +1582,5 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
drop table t0, t1;
=== modified file 'mysql-test/r/myisam_mrr.result'
--- a/mysql-test/r/myisam_mrr.result 2010-03-11 21:43:31 +0000
+++ b/mysql-test/r/myisam_mrr.result 2010-05-31 21:22:40 +0000
@@ -394,7 +394,7 @@
# - engine_condition_pushdown does not affect ICP
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, key(a));
=== modified file 'mysql-test/r/subselect3.result'
--- a/mysql-test/r/subselect3.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect3.result 2010-05-31 21:22:40 +0000
@@ -105,6 +105,7 @@
Handler_read_rnd_next 5
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
oref a Z
@@ -123,6 +124,7 @@
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
Z
No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
create table t1 (a int, b int, primary key (a));
insert into t1 values (1,1), (3,1),(100,1);
=== modified file 'mysql-test/r/subselect3_jcl6.result'
--- a/mysql-test/r/subselect3_jcl6.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect3_jcl6.result 2010-05-31 21:22:40 +0000
@@ -109,6 +109,7 @@
Handler_read_rnd_next 5
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
oref a Z
@@ -127,6 +128,7 @@
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
Z
No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
create table t1 (a int, b int, primary key (a));
insert into t1 values (1,1), (3,1),(100,1);
=== modified file 'mysql-test/r/subselect_no_mat.result'
--- a/mysql-test/r/subselect_no_mat.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_mat.result 2010-05-31 21:22:40 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='materialization=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_no_opts.result'
--- a/mysql-test/r/subselect_no_opts.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_opts.result 2010-05-31 21:22:40 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='materialization=off,semijoin=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_no_semijoin.result'
--- a/mysql-test/r/subselect_no_semijoin.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/subselect_no_semijoin.result 2010-05-31 21:22:40 +0000
@@ -1,6 +1,6 @@
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='semijoin=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4826,4 +4826,4 @@
set optimizer_switch=default;
show variables like 'optimizer_switch';
Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_sj.result'
--- a/mysql-test/r/subselect_sj.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect_sj.result 2010-05-31 21:22:40 +0000
@@ -202,39 +202,39 @@
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
drop table t0, t1, t2;
drop table t10, t11, t12;
=== modified file 'mysql-test/r/subselect_sj_jcl6.result'
--- a/mysql-test/r/subselect_sj_jcl6.result 2010-03-29 14:04:35 +0000
+++ b/mysql-test/r/subselect_sj_jcl6.result 2010-05-31 21:22:40 +0000
@@ -206,39 +206,39 @@
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,semijoin=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,semijoin=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='default,materialization=off,loosescan=off';
select @@optimizer_switch;
@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch=default;
drop table t0, t1, t2;
drop table t10, t11, t12;
=== modified file 'mysql-test/t/subselect3.test'
--- a/mysql-test/t/subselect3.test 2010-03-20 12:01:47 +0000
+++ b/mysql-test/t/subselect3.test 2010-05-31 21:22:40 +0000
@@ -98,10 +98,12 @@
delete from t2;
insert into t2 values (NULL, 0),(NULL, 0), (NULL, 0), (NULL, 0);
+set optimizer_switch='subquery_cache=off';
flush status;
select oref, a, a in (select a from t1 where oref=t2.oref) Z from t2;
show status like '%Handler_read%';
select 'No key lookups, seq reads: 29= 5 reads from t2 + 4 * 6 reads from t1.' Z;
+set @@optimizer_switch=@save_optimizer_switch;
drop table t1, t2;
=== modified file 'sql/CMakeLists.txt'
--- a/sql/CMakeLists.txt 2010-03-20 12:01:47 +0000
+++ b/sql/CMakeLists.txt 2010-05-31 21:22:40 +0000
@@ -78,7 +78,7 @@
rpl_rli.cc rpl_mi.cc sql_servers.cc
sql_connect.cc scheduler.cc
sql_profile.cc event_parse_data.cc opt_table_elimination.cc
- ds_mrr.cc
+ ds_mrr.cc sql_subquery_cache.cc
${PROJECT_SOURCE_DIR}/sql/sql_yacc.cc
${PROJECT_SOURCE_DIR}/sql/sql_yacc.h
${PROJECT_SOURCE_DIR}/include/mysqld_error.h
=== modified file 'sql/Makefile.am'
--- a/sql/Makefile.am 2010-03-20 12:01:47 +0000
+++ b/sql/Makefile.am 2010-05-31 21:22:40 +0000
@@ -80,7 +80,7 @@
event_data_objects.h event_scheduler.h \
sql_partition.h partition_info.h partition_element.h \
contributors.h sql_servers.h \
- multi_range_read.h
+ multi_range_read.h sql_subquery_cache.h
mysqld_SOURCES = sql_lex.cc sql_handler.cc sql_partition.cc \
item.cc item_sum.cc item_buff.cc item_func.cc \
@@ -130,7 +130,7 @@
sql_servers.cc event_parse_data.cc \
opt_table_elimination.cc \
multi_range_read.cc \
- opt_index_cond_pushdown.cc
+ opt_index_cond_pushdown.cc sql_subquery_cache.cc
nodist_mysqld_SOURCES = mini_client_errors.c pack.c client.c my_time.c my_user.c
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item.cc 2010-05-31 21:22:40 +0000
@@ -28,6 +28,9 @@
const String my_null_string("NULL", 4, default_charset_info);
+static int save_field_in_field(Field *from,my_bool * null_value,
+ Field *to, bool no_conversions);
+
/****************************************************************************/
/* Hybrid_type_traits {_real} */
@@ -2273,6 +2276,13 @@
str->append(str_value);
}
+void Item_bool_cache::print(String *str, enum_query_type query_type)
+{
+ if (null_value)
+ str->append("NULL", 4);
+ else
+ Item_int::print(str, query_type);
+}
Item_uint::Item_uint(const char *str_arg, uint length):
Item_int(str_arg, length)
@@ -3646,12 +3656,17 @@
resolved_item->db_name : "");
const char *table_name= (resolved_item->table_name ?
resolved_item->table_name : "");
+ DBUG_ENTER("mark_as_dependent");
+ DBUG_PRINT("enter", ("Field '%s.%s.%s in select %d resolved in %d",
+ db_name, table_name,
+ resolved_item->field_name, current->select_number,
+ last->select_number));
/* store pointer on SELECT_LEX from which item is dependent */
if (mark_item)
mark_item->depended_from= last;
if (current->mark_as_dependent(thd, last, /** resolved_item psergey-thu
**/mark_item))
- return TRUE;
+ DBUG_RETURN(TRUE);
if (thd->lex->describe & DESCRIBE_EXTENDED)
{
push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_NOTE,
@@ -3661,7 +3676,7 @@
resolved_item->field_name,
current->select_number, last->select_number);
}
- return FALSE;
+ DBUG_RETURN(FALSE);
}
@@ -3698,6 +3713,7 @@
resolving)
*/
SELECT_LEX *previous_select= current_sel;
+
for (; previous_select->outer_select() != last_select;
previous_select= previous_select->outer_select())
{
@@ -3726,6 +3742,7 @@
mark_as_dependent(thd, last_select, current_sel, resolved_item,
dependent);
}
+ return;
}
@@ -4098,6 +4115,9 @@
((ref_type == REF_ITEM ||
ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) : 0));
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
return 0;
}
}
@@ -4113,7 +4133,9 @@
((ref_type == REF_ITEM || ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
0));
-
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
A reference to a view field had been found and we
substituted it instead of this Item (find_field_in_tables
@@ -4215,6 +4237,10 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, rf,
rf);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
+
return 0;
}
else
@@ -4222,6 +4248,9 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex,
this, (Item_ident*)*reference);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
if (last_checked_context->select_lex->having_fix_field)
{
Item_ref *rf;
@@ -5082,39 +5111,48 @@
/**
+ Saves one Fields of an Item of in other Field
+
+ @param from Field to copy value from
+ @param null_value reference on item null_value to set it if it is needed
+ @param to Field to cope value to
+ @param no_conversions how to deal with NULL value (see
+ set_field_to_null_with_conversions())
+
+ @retval FALSE OK
+ @retval TRUE Error
+*/
+
+static int save_field_in_field(Field *from, my_bool *null_value,
+ Field *to, bool no_conversions)
+{
+ int res;
+ if (from->is_null())
+ {
+ (*null_value)= 1;
+ res= set_field_to_null_with_conversions(to, no_conversions);
+ }
+ else
+ {
+ to->set_notnull();
+ res= field_conv(to, from);
+ (*null_value)= 0;
+ }
+ return res;
+}
+
+/**
Set a field's value from a item.
*/
void Item_field::save_org_in_field(Field *to)
{
- if (field->is_null())
- {
- null_value=1;
- set_field_to_null_with_conversions(to, 1);
- }
- else
- {
- to->set_notnull();
- field_conv(to,field);
- null_value=0;
- }
+ save_field_in_field(field, &null_value, to, TRUE);
}
int Item_field::save_in_field(Field *to, bool no_conversions)
{
- int res;
- if (result_field->is_null())
- {
- null_value=1;
- res= set_field_to_null_with_conversions(to, no_conversions);
- }
- else
- {
- to->set_notnull();
- res= field_conv(to,result_field);
- null_value=0;
- }
- return res;
+ return save_field_in_field(result_field, &null_value, to, no_conversions);
}
@@ -5973,6 +6011,9 @@
refer_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
0));
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
view reference found, we substituted it instead of this
Item, so can quit
@@ -6023,6 +6064,9 @@
thd->change_item_tree(reference, fld);
mark_as_dependent(thd, last_checked_context->select_lex,
thd->lex->current_select, fld, fld);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ reference);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6046,6 +6090,9 @@
DBUG_ASSERT(*ref && (*ref)->fixed);
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, this, this);
+ context->select_lex->
+ register_dependency_item(last_checked_context->select_lex,
+ ref);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6312,7 +6359,8 @@
int Item_ref::save_in_field(Field *to, bool no_conversions)
{
int res;
- DBUG_ASSERT(!result_field);
+ if (result_field)
+ return save_field_in_field(result_field, &null_value, to, no_conversions);
res= (*ref)->save_in_field(to, no_conversions);
null_value= (*ref)->null_value;
return res;
=== modified file 'sql/item.h'
--- a/sql/item.h 2010-03-20 12:01:47 +0000
+++ b/sql/item.h 2010-05-31 21:22:40 +0000
@@ -1922,8 +1922,31 @@
virtual void print(String *str, enum_query_type query_type);
Item_num *neg ();
uint decimal_precision() const { return max_length; }
- bool check_partition_func_processor(uchar *bool_arg) { return FALSE;}
- bool check_vcol_func_processor(uchar *arg) { return FALSE;}
+};
+
+
+/**
+ Item represent TRUE/FALSE/NULL for subquery values
+*/
+
+class Item_bool_cache: public Item_int
+{
+public:
+ Item_bool_cache(): Item_int(0, 1)
+ {
+ unsigned_flag= maybe_null= null_value= TRUE;
+ name= (char *)"bool chache";
+ }
+ Item_bool_cache(my_bool val, my_bool null): Item_int(val, 1)
+ {
+ unsigned_flag= maybe_null= TRUE;
+ null_value= null;
+ name= (char *)"bool chache";
+ }
+ Item *clone_item() { return new Item_bool_cache(value, null_value); }
+ uint decimal_precision() const { return 1; }
+ virtual void print(String *str, enum_query_type query_type);
+ void set(my_bool val, my_bool null) {value= test(val); null_value= null;}
};
@@ -3146,7 +3169,8 @@
example(0), used_table_map(0), cached_field(0), cached_field_type(MYSQL_TYPE_STRING),
value_cached(0)
{
- fixed= 1;
+ fixed= 1;
+ maybe_null= 1;
null_value= 1;
}
Item_cache(enum_field_types field_type_arg):
@@ -3154,6 +3178,7 @@
value_cached(0)
{
fixed= 1;
+ maybe_null= 1;
null_value= 1;
}
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item_cmpfunc.cc 2010-05-31 21:22:40 +0000
@@ -1736,6 +1736,15 @@
used_tables_cache|= args[1]->used_tables();
not_null_tables_cache|= args[1]->not_null_tables();
const_item_cache&= args[1]->const_item();
+ DBUG_ASSERT(scache == NULL);
+ if (args[0]->cols() ==1 &&
+ thd->variables.optimizer_switch & OPTIMIZER_SWITCH_SUBQUERY_CACHE &&
+ !(sub->engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ sub->depends_on.push_front((Item**)&cache);
+ scache= new Subquery_cache_tmptable(thd, sub->depends_on, &result);
+ }
fixed= 1;
return FALSE;
}
@@ -1744,10 +1753,26 @@
longlong Item_in_optimizer::val_int()
{
bool tmp;
+ DBUG_ENTER("Item_in_optimizer::val_int");
+
DBUG_ASSERT(fixed == 1);
cache->store(args[0]);
cache->cache_value();
-
+
+ /* check if result is in the cache */
+ if (scache)
+ {
+ Subquery_cache_tmptable::result res;
+ Item *cached_value;
+ res= scache->check_value(&cached_value);
+ if (res == Subquery_cache_tmptable::HIT)
+ {
+ tmp= cached_value->val_int();
+ null_value= cached_value->null_value;
+ DBUG_RETURN(tmp);
+ }
+ }
+
if (cache->null_value)
{
/*
@@ -1818,11 +1843,18 @@
for (uint i= 0; i < ncols; i++)
item_subs->set_cond_guard_var(i, TRUE);
}
- return 0;
+ DBUG_RETURN(0);
}
tmp= args[1]->val_bool_result();
null_value= args[1]->null_value;
- return tmp;
+
+ /* put result in the cache */
+ if (scache)
+ {
+ result.set(tmp, null_value);
+ scache->put_value(&result);
+ }
+ DBUG_RETURN(tmp);
}
@@ -1839,6 +1871,11 @@
Item_bool_func::cleanup();
if (!save_cache)
cache= 0;
+ if (scache)
+ {
+ delete scache;
+ scache= 0;
+ }
DBUG_VOID_RETURN;
}
=== modified file 'sql/item_cmpfunc.h'
--- a/sql/item_cmpfunc.h 2010-03-20 12:01:47 +0000
+++ b/sql/item_cmpfunc.h 2010-05-31 21:22:40 +0000
@@ -215,6 +215,7 @@
class Item_cache;
+class Subquery_cache;
#define UNKNOWN ((my_bool)-1)
@@ -237,6 +238,10 @@
{
protected:
Item_cache *cache;
+ /* Subquery cache */
+ Subquery_cache *scache;
+ /* result representation for the subquery cache */
+ Item_bool_cache result;
bool save_cache;
/*
Stores the value of "NULL IN (SELECT ...)" for uncorrelated subqueries:
@@ -247,7 +252,7 @@
my_bool result_for_null_param;
public:
Item_in_optimizer(Item *a, Item_in_subselect *b):
- Item_bool_func(a, my_reinterpret_cast(Item *)(b)), cache(0),
+ Item_bool_func(a, my_reinterpret_cast(Item *)(b)), cache(0), scache(NULL),
save_cache(0), result_for_null_param(UNKNOWN)
{}
bool fix_fields(THD *, Item **);
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-03-29 14:04:35 +0000
+++ b/sql/item_subselect.cc 2010-05-31 21:22:40 +0000
@@ -34,11 +34,10 @@
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
- engine(0), old_engine(0), used_tables_cache(0), have_to_be_excluded(0),
- const_item_cache(1),
- inside_first_fix_fields(0), done_first_fix_fields(FALSE),
- eliminated(FALSE),
- engine_changed(0), changed(0), is_correlated(FALSE)
+ engine(0), old_engine(0), scache(0), used_tables_cache(0),
+ have_to_be_excluded(0), const_item_cache(1), inside_first_fix_fields(0),
+ done_first_fix_fields(FALSE), eliminated(FALSE), engine_changed(0),
+ changed(0), is_correlated(FALSE)
{
with_subselect= 1;
reset();
@@ -116,6 +115,12 @@
}
if (engine)
engine->cleanup();
+ depends_on.empty();
+ if (scache)
+ {
+ delete scache;
+ scache= 0;
+ }
reset();
value_assigned= 0;
DBUG_VOID_RETURN;
@@ -148,6 +153,8 @@
Item_subselect::~Item_subselect()
{
delete engine;
+ if (scache)
+ delete scache;
}
Item_subselect::trans_res
@@ -746,9 +753,22 @@
void Item_singlerow_subselect::fix_length_and_dec()
{
+ DBUG_ENTER("Item_singlerow_subselect::fix_length_and_dec");
if ((max_columns= engine->cols()) == 1)
{
+ DBUG_PRINT("info", ("one, elements: %u flag %u",
+ (uint)depends_on.elements,
+ (uint)test(thd->variables.optimizer_switch & OPTIMIZER_SWITCH_SUBQUERY_CACHE)));
engine->fix_length_and_dec(row= &value);
+ if (depends_on.elements &&
+ optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE) &&
+ !(engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ DBUG_ASSERT(scache == NULL);
+ scache= new Subquery_cache_tmptable(thd, depends_on, value);
+ DBUG_PRINT("info", ("cache: 0x%lx", (ulong) scache));
+ }
}
else
{
@@ -765,6 +785,7 @@
*/
if (engine->no_tables())
maybe_null= engine->may_be_null();
+ DBUG_VOID_RETURN;
}
uint Item_singlerow_subselect::cols()
@@ -797,77 +818,206 @@
exec();
}
+/**
+ Checks subquery cache for value
+
+ @retval NULL nothing found
+ @retval reference on item representing value found in the cache
+*/
+
+Item *Item_subselect::check_cache()
+{
+ DBUG_ENTER("Item_subselect::check_cache");
+ if (scache)
+ {
+ Subquery_cache_tmptable::result res;
+ Item *cached_value;
+ res= scache->check_value(&cached_value);
+ if (res == Subquery_cache_tmptable::HIT)
+ DBUG_RETURN(cached_value);
+ }
+ DBUG_RETURN(NULL);
+}
+
double Item_singlerow_subselect::val_real()
{
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_real");
DBUG_ASSERT(fixed == 1);
- if (!exec() && !value->null_value)
+
+ if ((cached_value = check_cache()))
+ {
+ double res= cached_value->val_real();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_real();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_real());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
longlong Item_singlerow_subselect::val_int()
{
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_int");
DBUG_ASSERT(fixed == 1);
- if (!exec() && !value->null_value)
+
+ if ((cached_value = check_cache()))
+ {
+ longlong res= cached_value->val_int();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_int();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_int());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
String *Item_singlerow_subselect::val_str(String *str)
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_str");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ String *res= cached_value->val_str(str);
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_str(str);
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_str(str));
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
my_decimal *Item_singlerow_subselect::val_decimal(my_decimal *decimal_value)
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_decimal");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_decimal *res= cached_value->val_decimal(decimal_value);
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_decimal(decimal_value);
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_decimal(decimal_value));
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
bool Item_singlerow_subselect::val_bool()
{
- if (!exec() && !value->null_value)
+ Item *cached_value;
+ bool err;
+ DBUG_ENTER("Item_singlerow_subselect::val_bool");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ bool res= cached_value->val_bool();
+ if ((null_value= cached_value->null_value))
+ {
+ reset();
+ DBUG_RETURN(0);
+ }
+ else
+ DBUG_RETURN(res);
+ }
+
+ if (!(err= exec()) && !value->null_value)
{
null_value= 0;
- return value->val_bool();
+ if (scache)
+ scache->put_value(value);
+ DBUG_RETURN(value->val_bool());
}
else
{
reset();
- return 0;
+ DBUG_PRINT("info", ("error: %u", (uint)err));
+ if (scache && !err)
+ scache->put_value(&const_null_value);
+ DBUG_RETURN(0);
}
}
@@ -952,33 +1102,79 @@
void Item_exists_subselect::fix_length_and_dec()
{
+ DBUG_ENTER("Item_exists_subselect::fix_length_and_dec");
decimals= 0;
max_length= 1;
max_columns= engine->cols();
/* We need only 1 row to determine existence */
unit->global_parameters->select_limit= new Item_int((int32) 1);
+ if (substype() == EXISTS_SUBS && depends_on.elements &&
+ optimizer_flag(thd, OPTIMIZER_SWITCH_SUBQUERY_CACHE) &&
+ !(engine->uncacheable() & (UNCACHEABLE_RAND |
+ UNCACHEABLE_SIDEEFFECT)))
+ {
+ DBUG_ASSERT(scache == NULL);
+ scache= new Subquery_cache_tmptable(thd, depends_on, &result);
+ DBUG_PRINT("info", ("cache: 0x%lx", (ulong) scache));
+ }
+ DBUG_VOID_RETURN;
}
double Item_exists_subselect::val_real()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_int");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ double res= cached_value->val_real();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
{
reset();
- return 0;
- }
- return (double) value;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN((double) value);
}
longlong Item_exists_subselect::val_int()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_real");
+ DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ longlong res= cached_value->val_int();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
DBUG_ASSERT(fixed == 1);
if (exec())
{
reset();
- return 0;
- }
- return value;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN(value);
}
@@ -997,11 +1193,32 @@
String *Item_exists_subselect::val_str(String *str)
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_str");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ String *res= cached_value->val_str(str);
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
+ {
reset();
+ str->set((ulonglong)0,&my_charset_bin);
+ DBUG_RETURN(str);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
str->set((ulonglong)value,&my_charset_bin);
- return str;
+ DBUG_RETURN(str);
}
@@ -1020,23 +1237,61 @@
my_decimal *Item_exists_subselect::val_decimal(my_decimal *decimal_value)
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_decvimal");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_decimal *res= cached_value->val_decimal(decimal_value);
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
+ {
reset();
+ int2my_decimal(E_DEC_FATAL_ERROR, 0, 0, decimal_value);
+ DBUG_RETURN(decimal_value);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
int2my_decimal(E_DEC_FATAL_ERROR, value, 0, decimal_value);
- return decimal_value;
+ DBUG_RETURN(decimal_value);
}
bool Item_exists_subselect::val_bool()
{
+ Item *cached_value;
+ DBUG_ENTER("Item_exists_subselect::val_real");
DBUG_ASSERT(fixed == 1);
+
+ if ((cached_value = check_cache()))
+ {
+ my_bool res= cached_value->val_bool();
+ DBUG_ASSERT(!cached_value->null_value);
+ DBUG_RETURN(res);
+ }
+
if (exec())
{
reset();
- return 0;
- }
- return value != 0;
+ DBUG_RETURN(0);
+ }
+
+ if (scache)
+ {
+ result.set(value, FALSE);
+ scache->put_value(&result);
+ }
+
+ DBUG_RETURN(value != 0);
}
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-03-29 14:04:35 +0000
+++ b/sql/item_subselect.h 2010-05-31 21:22:40 +0000
@@ -27,6 +27,7 @@
class subselect_hash_sj_engine;
class Item_bool_func2;
class Cached_item;
+class Subquery_cache;
/* base class for subselects */
@@ -57,6 +58,10 @@
subselect_engine *engine;
/* old engine if engine was changed */
subselect_engine *old_engine;
+ /* subquery cache */
+ Subquery_cache *scache;
+ /* null consrtant for caching */
+ Item_null const_null_value;
/* cache of used external tables */
table_map used_tables_cache;
/* allowed number of columns (1 for single value subqueries) */
@@ -67,7 +72,7 @@
bool have_to_be_excluded;
/* cache of constant state */
bool const_item_cache;
-
+
bool inside_first_fix_fields;
bool done_first_fix_fields;
public:
@@ -88,13 +93,21 @@
*/
List<Ref_to_outside> upper_refs;
st_select_lex *parent_select;
-
- /*
+
+ /**
+ List of references on items subquery depends on (externally resolved);
+
+ @note We can't store direct links on Items because it could be
+ substituted with other item (for example for grouping).
+ */
+ List<Item*> depends_on;
+
+ /*
TRUE<=>Table Elimination has made it redundant to evaluate this select
(and so it is not part of QEP, etc)
- */
+ */
bool eliminated;
-
+
/* changed engine indicator */
bool engine_changed;
/* subquery is transformed */
@@ -178,6 +191,8 @@
return trace_unsupported_by_check_vcol_func_processor("subselect");
}
+ Item *check_cache();
+
/**
Get the SELECT_LEX structure associated with this Item.
@return the SELECT_LEX structure associated with this Item
@@ -202,6 +217,7 @@
{
protected:
Item_cache *value, **row;
+
public:
Item_singlerow_subselect(st_select_lex *select_lex);
Item_singlerow_subselect() :Item_subselect(), value(0), row (0) {}
@@ -268,6 +284,8 @@
{
protected:
bool value; /* value of this item (boolean: exists/not-exists) */
+ /* result representation for the subquery cache */
+ Item_bool_cache result;
public:
Item_exists_subselect(st_select_lex *select_lex);
=== modified file 'sql/item_sum.cc'
--- a/sql/item_sum.cc 2010-03-20 12:01:47 +0000
+++ b/sql/item_sum.cc 2010-05-31 21:22:40 +0000
@@ -319,6 +319,7 @@
if (aggr_level >= 0)
{
ref_by= ref;
+ thd->lex->current_select->register_dependency_item(aggr_sel, ref);
/* Add the object to the list of registered objects assigned to aggr_sel */
if (!aggr_sel->inner_sum_func_list)
next= this;
=== modified file 'sql/mysql_priv.h'
--- a/sql/mysql_priv.h 2010-03-20 12:01:47 +0000
+++ b/sql/mysql_priv.h 2010-05-31 21:22:40 +0000
@@ -568,12 +568,13 @@
#define OPTIMIZER_SWITCH_SEMIJOIN 256
#define OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE 512
#define OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN 1024
+#define OPTIMIZER_SWITCH_SUBQUERY_CACHE (1<<11)
#ifdef DBUG_OFF
-# define OPTIMIZER_SWITCH_LAST 2048
+# define OPTIMIZER_SWITCH_LAST (1<<12)
#else
-# define OPTIMIZER_SWITCH_TABLE_ELIMINATION 2048
-# define OPTIMIZER_SWITCH_LAST 4096
+# define OPTIMIZER_SWITCH_TABLE_ELIMINATION (1<<12)
+# define OPTIMIZER_SWITCH_LAST (1<<13)
#endif
#ifdef DBUG_OFF
@@ -588,7 +589,8 @@
OPTIMIZER_SWITCH_MATERIALIZATION | \
OPTIMIZER_SWITCH_SEMIJOIN | \
OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE|\
- OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN)
+ OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN|\
+ OPTIMIZER_SWITCH_SUBQUERY_CACHE)
#else
# define OPTIMIZER_SWITCH_DEFAULT (OPTIMIZER_SWITCH_INDEX_MERGE | \
OPTIMIZER_SWITCH_INDEX_MERGE_UNION | \
@@ -601,7 +603,8 @@
OPTIMIZER_SWITCH_MATERIALIZATION | \
OPTIMIZER_SWITCH_SEMIJOIN | \
OPTIMIZER_SWITCH_PARTIAL_MATCH_ROWID_MERGE|\
- OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN)
+ OPTIMIZER_SWITCH_PARTIAL_MATCH_TABLE_SCAN|\
+ OPTIMIZER_SWITCH_SUBQUERY_CACHE)
#endif
/*
@@ -936,6 +939,7 @@
#ifdef MYSQL_SERVER
#include "sql_servers.h"
#include "opt_range.h"
+#include "sql_subquery_cache.h"
#ifdef HAVE_QUERY_CACHE
struct Query_cache_query_flags
@@ -1269,6 +1273,10 @@
Item *having, ORDER *proc_param, ulonglong select_type,
select_result *result, SELECT_LEX_UNIT *unit,
SELECT_LEX *select_lex);
+
+struct st_join_table *create_index_lookup_join_tab(TABLE *table);
+int join_read_key2(THD *thd, struct st_join_table *tab, TABLE *table,
+ struct st_table_ref *table_ref);
void free_underlaid_joins(THD *thd, SELECT_LEX *select);
bool mysql_explain_union(THD *thd, SELECT_LEX_UNIT *unit,
select_result *result);
@@ -1288,6 +1296,7 @@
bool table_cant_handle_bit_fields,
bool make_copy_field,
uint convert_blob_length);
+bool open_tmp_table(TABLE *table);
void sp_prepare_create_field(THD *thd, Create_field *sql_field);
int prepare_create_field(Create_field *sql_field,
uint *blob_columns,
=== modified file 'sql/mysqld.cc'
--- a/sql/mysqld.cc 2010-03-20 12:01:47 +0000
+++ b/sql/mysqld.cc 2010-05-31 21:22:40 +0000
@@ -305,6 +305,7 @@
"firstmatch","loosescan","materialization", "semijoin",
"partial_match_rowid_merge",
"partial_match_table_scan",
+ "subquery_cache",
#ifndef DBUG_OFF
"table_elimination",
#endif
@@ -325,6 +326,7 @@
sizeof("semijoin") - 1,
sizeof("partial_match_rowid_merge") - 1,
sizeof("partial_match_table_scan") - 1,
+ sizeof("subquery_cache") - 1,
#ifndef DBUG_OFF
sizeof("table_elimination") - 1,
#endif
@@ -404,8 +406,9 @@
static const char *optimizer_switch_str="index_merge=on,index_merge_union=on,"
"index_merge_sort_union=on,"
"index_merge_intersection=on,"
- "index_condition_pushdown=on"
-#ifndef DBUG_OFF
+ "index_condition_pushdown=on,"
+ "subquery_cache=on"
+#ifndef DBUG_OFF
",table_elimination=on";
#else
;
@@ -5872,7 +5875,9 @@
OPT_RECORD_RND_BUFFER, OPT_DIV_PRECINCREMENT, OPT_RELAY_LOG_SPACE_LIMIT,
OPT_RELAY_LOG_PURGE,
OPT_SLAVE_NET_TIMEOUT, OPT_SLAVE_COMPRESSED_PROTOCOL, OPT_SLOW_LAUNCH_TIME,
- OPT_SLAVE_TRANS_RETRIES, OPT_READONLY, OPT_ROWID_MERGE_BUFF_SIZE,
+ OPT_SLAVE_TRANS_RETRIES,
+ OPT_SUBQUERY_CACHE,
+ OPT_READONLY, OPT_ROWID_MERGE_BUFF_SIZE,
OPT_DEBUGGING, OPT_DEBUG_FLUSH,
OPT_SORT_BUFFER, OPT_TABLE_OPEN_CACHE, OPT_TABLE_DEF_CACHE,
OPT_THREAD_CONCURRENCY, OPT_THREAD_CACHE_SIZE,
@@ -7164,7 +7169,7 @@
{"optimizer_switch", OPT_OPTIMIZER_SWITCH,
"optimizer_switch=option=val[,option=val...], where option={index_merge, "
"index_merge_union, index_merge_sort_union, index_merge_intersection, "
- "index_condition_pushdown"
+ "index_condition_pushdown, subquery_cache"
#ifndef DBUG_OFF
", table_elimination"
#endif
@@ -7868,6 +7873,8 @@
{"Ssl_version", (char*) &show_ssl_get_version, SHOW_FUNC},
#endif /* HAVE_OPENSSL */
{"Syncs", (char*) &my_sync_count, SHOW_LONG_NOFLUSH},
+ {"Subquery_cache_hit", (char*) &subquery_cache_hit, SHOW_LONG},
+ {"Subquery_cache_miss", (char*) &subquery_cache_miss, SHOW_LONG},
{"Table_locks_immediate", (char*) &locks_immediate, SHOW_LONG},
{"Table_locks_waited", (char*) &locks_waited, SHOW_LONG},
#ifdef HAVE_MMAP
@@ -8006,6 +8013,7 @@
abort_loop= select_thread_in_use= signal_thread_in_use= 0;
ready_to_exit= shutdown_in_progress= grant_option= 0;
aborted_threads= aborted_connects= 0;
+ subquery_cache_miss= subquery_cache_hit= 0;
delayed_insert_threads= delayed_insert_writes= delayed_rows_in_use= 0;
delayed_insert_errors= thread_created= 0;
specialflag= 0;
=== modified file 'sql/sql_base.cc'
--- a/sql/sql_base.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_base.cc 2010-05-31 21:22:40 +0000
@@ -8062,6 +8062,10 @@
if (*conds)
{
thd->where="where clause";
+ DBUG_EXECUTE("where",
+ print_where(*conds,
+ "WHERE in setup_conds",
+ QT_ORDINARY););
if ((!(*conds)->fixed && (*conds)->fix_fields(thd, conds)) ||
(*conds)->check_cols(1))
goto err_no_arena;
=== modified file 'sql/sql_class.cc'
--- a/sql/sql_class.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_class.cc 2010-05-31 21:22:40 +0000
@@ -3020,6 +3020,7 @@
table_charset= 0;
precomputed_group_by= 0;
bit_fields_as_long= 0;
+ skip_create_table= 0;
DBUG_VOID_RETURN;
}
=== modified file 'sql/sql_class.h'
--- a/sql/sql_class.h 2010-03-20 12:01:47 +0000
+++ b/sql/sql_class.h 2010-05-31 21:22:40 +0000
@@ -2786,12 +2786,17 @@
that MEMORY tables cannot index BIT columns.
*/
bool bit_fields_as_long;
+ /*
+ Whether to create or postpone actual creation of this temporary table.
+ TRUE <=> create_tmp_table will create only the TABLE structure.
+ */
+ bool skip_create_table;
TMP_TABLE_PARAM()
:copy_field(0), group_parts(0),
group_length(0), group_null_parts(0), convert_blob_length(0),
schema_table(0), precomputed_group_by(0), force_copy_fields(0),
- bit_fields_as_long(0)
+ bit_fields_as_long(0), skip_create_table(0)
{}
~TMP_TABLE_PARAM()
{
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2010-03-20 12:01:47 +0000
+++ b/sql/sql_lex.cc 2010-05-31 21:22:40 +0000
@@ -1829,6 +1829,52 @@
}
+/**
+ Registers reference on items on which the subqueries depends
+
+ @param last pointer to last st_select_lex struct, before
+ which all st_select_lex have to be marked as
+ dependent
+ @param dependency reference on the item on which all this
+ subqueries depends
+
+*/
+
+void st_select_lex::register_dependency_item(st_select_lex *last,
+ Item **dependency)
+{
+ SELECT_LEX *s= this;
+ DBUG_ENTER("st_select_lex::register_dependency_item");
+ DBUG_ASSERT(this != last);
+ DBUG_ASSERT(*dependency);
+ do
+ {
+ /* check duplicates */
+ List_iterator_fast<Item*> li(s->master_unit()->item->depends_on);
+ Item **dep;
+ while ((dep= li++))
+ {
+ if ((*dep)->eq(*dependency, FALSE))
+ {
+ DBUG_PRINT("info", ("dependency %s already present",
+ ((*dependency)->name ?
+ (*dependency)->name :
+ "<no name>")));
+ DBUG_VOID_RETURN;
+ }
+ }
+
+ s->master_unit()->item->depends_on.push_back(dependency);
+ DBUG_PRINT("info", ("depends_on: Select: %d added: %s",
+ s->select_number,
+ ((*dependency)->name ?
+ (*dependency)->name :
+ "<no name>")));
+ } while ((s= s->outer_select()) != last && s != 0);
+ DBUG_VOID_RETURN;
+}
+
+
/*
st_select_lex_node::mark_as_dependent mark all st_select_lex struct from
this to 'last' as dependent
@@ -1843,7 +1889,7 @@
bool st_select_lex::mark_as_dependent(THD *thd, st_select_lex *last, Item *dependency)
{
-
+ DBUG_ENTER("st_select_lex::mark_as_dependent");
DBUG_ASSERT(this != last);
/*
@@ -1872,11 +1918,11 @@
Item_subselect *subquery_expr= s->master_unit()->item;
if (subquery_expr && subquery_expr->mark_as_dependent(thd, last,
dependency))
- return TRUE;
+ DBUG_RETURN(TRUE);
} while ((s= s->outer_select()) != last && s != 0);
is_correlated= TRUE;
this->master_unit()->item->is_correlated= TRUE;
- return FALSE;
+ DBUG_RETURN(FALSE);
}
bool st_select_lex_node::set_braces(bool value) { return 1; }
=== modified file 'sql/sql_lex.h'
--- a/sql/sql_lex.h 2010-03-20 12:01:47 +0000
+++ b/sql/sql_lex.h 2010-05-31 21:22:40 +0000
@@ -748,6 +748,7 @@
}
bool mark_as_dependent(THD *thd, st_select_lex *last, Item *dependency);
+ void register_dependency_item(st_select_lex *last, Item **dependency);
bool set_braces(bool value);
bool inc_in_sum_expr();
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2010-05-10 13:46:08 +0000
+++ b/sql/sql_select.cc 2010-05-31 21:22:40 +0000
@@ -151,7 +151,6 @@
static int join_read_system(JOIN_TAB *tab);
static int join_read_const(JOIN_TAB *tab);
static int join_read_key(JOIN_TAB *tab);
-static int join_read_key2(JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref);
static void join_read_key_unlock_row(st_join_table *tab);
static int join_read_always_key(JOIN_TAB *tab);
static int join_read_last_key(JOIN_TAB *tab);
@@ -5209,7 +5208,7 @@
'join->best_positions' contains a complete optimal extension of the
current partial QEP.
*/
- DBUG_EXECUTE("opt", print_plan(join, join->tables,
+ DBUG_EXECUTE("opt", print_plan(join, n_tables,
record_count, read_time, read_time,
"optimal"););
DBUG_RETURN(FALSE);
@@ -7625,6 +7624,40 @@
/**
+ Creates and fills JOIN_TAB for index look up in temporary table
+
+ @param table The table where to look up
+
+ @return JOIN_TAB object or NULL in case of error
+*/
+
+JOIN_TAB *create_index_lookup_join_tab(TABLE *table)
+{
+ JOIN_TAB *tab;
+ DBUG_ENTER("create_index_lookup_join_tab");
+
+ if (!((tab= new JOIN_TAB)))
+ DBUG_RETURN(NULL);
+ tab->read_record.table= table;
+ tab->read_record.file=table->file;
+ /*tab->read_record.unlock_row= rr_unlock_row;*/
+ tab->next_select=0;
+ tab->sorted= 1;
+
+ table->status= STATUS_NO_RECORD;
+ tab->read_first_record= join_read_key;
+ /*tab->read_record.unlock_row= join_read_key_unlock_row;*/
+ tab->read_record.read_record= join_no_more_records;
+ if (table->covering_keys.is_set(tab->ref.key) &&
+ !table->no_keyread)
+ {
+ table->key_read=1;
+ table->file->extra(HA_EXTRA_KEYREAD);
+ }
+ DBUG_RETURN(tab);
+}
+
+/**
Give error if we some tables are done with a full join.
This is used by multi_table_update and multi_table_delete when running
@@ -10778,6 +10811,7 @@
case Item::REF_ITEM:
case Item::NULL_ITEM:
case Item::VARBIN_ITEM:
+ case Item::CACHE_ITEM:
if (make_copy_field)
{
DBUG_ASSERT(((Item_result_field*)item)->result_field);
@@ -11552,7 +11586,8 @@
¶m->recinfo, select_options))
goto err;
}
- if (open_tmp_table(table))
+ DBUG_PRINT("info", ("skip_create_table: %d", (int)param->skip_create_table));
+ if (!param->skip_create_table && open_tmp_table(table))
goto err;
thd->mem_root= mem_root_save;
@@ -11700,16 +11735,17 @@
bool open_tmp_table(TABLE *table)
{
int error;
+ DBUG_ENTER("open_tmp_table");
if ((error= table->file->ha_open(table, table->s->table_name.str, O_RDWR,
HA_OPEN_TMP_TABLE |
HA_OPEN_INTERNAL_TABLE)))
{
table->file->print_error(error,MYF(0)); /* purecov: inspected */
table->db_stat=0;
- return(1);
+ DBUG_RETURN(1);
}
(void) table->file->extra(HA_EXTRA_QUICK); /* Faster */
- return(0);
+ DBUG_RETURN(0);
}
@@ -12540,7 +12576,8 @@
else
{
/* Do index lookup in the materialized table */
- if ((res= join_read_key2(join_tab, sjm->table, sjm->tab_ref)) == 1)
+ if ((res= join_read_key2(join_tab->join->thd, join_tab,
+ sjm->table, sjm->tab_ref)) == 1)
DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
if (res || !sjm->in_equality->val_int())
DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
@@ -13323,61 +13360,62 @@
static int
join_read_key(JOIN_TAB *tab)
{
- return join_read_key2(tab, tab->table, &tab->ref);
+ return join_read_key2(tab->join->thd, tab, tab->table, &tab->ref);
}
-/*
+/*
eq_ref access handler but generalized a bit to support TABLE and TABLE_REF
not from the join_tab. See join_read_key for detailed synopsis.
*/
-static int
-join_read_key2(JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref)
+int join_read_key2(THD *thd, JOIN_TAB *tab, TABLE *table, TABLE_REF *table_ref)
{
int error;
+ DBUG_ENTER("join_read_key2");
if (!table->file->inited)
{
table->file->ha_index_init(table_ref->key, tab->sorted);
}
/* TODO: Why don't we do "Late NULLs Filtering" here? */
- if (cmp_buffer_with_ref(tab->join->thd, table, table_ref) ||
+ if (cmp_buffer_with_ref(thd, table, table_ref) ||
(table->status & (STATUS_GARBAGE | STATUS_NO_PARENT | STATUS_NULL_ROW)))
{
if (table_ref->key_err)
{
table->status=STATUS_NOT_FOUND;
- return -1;
+ DBUG_RETURN(-1);
}
/*
Moving away from the current record. Unlock the row
in the handler if it did not match the partial WHERE.
*/
- if (tab->ref.has_record && tab->ref.use_count == 0)
+ if (table_ref->has_record )
+ if (table_ref->use_count == 0)
{
tab->read_record.file->unlock_row();
- tab->ref.has_record= FALSE;
+ table_ref->has_record= FALSE;
}
error=table->file->ha_index_read_map(table->record[0],
table_ref->key_buff,
make_prev_keypart_map(table_ref->key_parts),
HA_READ_KEY_EXACT);
if (error && error != HA_ERR_KEY_NOT_FOUND && error != HA_ERR_END_OF_FILE)
- return report_error(table, error);
+ DBUG_RETURN(report_error(table, error));
if (! error)
{
- tab->ref.has_record= TRUE;
- tab->ref.use_count= 1;
+ table_ref->has_record= TRUE;
+ table_ref->use_count= 1;
}
}
else if (table->status == 0)
{
- DBUG_ASSERT(tab->ref.has_record);
- tab->ref.use_count++;
+ DBUG_ASSERT(table_ref->has_record);
+ table_ref->use_count++;
}
table->null_row=0;
- return table->status ? -1 : 0;
+ DBUG_RETURN(table->status ? -1 : 0);
}
=== modified file 'sql/table.cc'
--- a/sql/table.cc 2010-03-20 12:01:47 +0000
+++ b/sql/table.cc 2010-05-31 21:22:40 +0000
@@ -20,6 +20,7 @@
#include "sql_trigger.h"
#include <m_ctype.h>
#include "my_md5.h"
+#include "my_bit.h"
/* INFORMATION_SCHEMA name */
LEX_STRING INFORMATION_SCHEMA_NAME= {C_STRING_WITH_LEN("information_schema")};
@@ -5096,6 +5097,115 @@
file->column_bitmaps_signal();
}
+
+/**
+ @brief
+ Allocate space for keys
+
+ @param key_count number of keys to allocate.
+
+ @details
+ Allocate space enough to fit 'key_count' keys for this table.
+
+ @return FALSE space was successfully allocated.
+ @return TRUE an error occur.
+*/
+
+bool TABLE::alloc_keys(uint key_count)
+{
+ DBUG_ASSERT(!s->keys);
+ key_info= s->key_info= (KEY*) alloc_root(&mem_root, sizeof(KEY)*key_count);
+ max_keys= key_count;
+ return !(key_info);
+}
+
+
+/**
+ @brief Adds one key to a temporary table.
+
+ @param key key number.
+ @param key_parts number of fields in the key
+ @param next_field_no function which returns field numbers which
+ should be included in the key
+ @param arg above function argement
+
+ @return <0 an error occur.
+ @return >=0 number of newly added key.
+*/
+
+bool TABLE::add_tmp_key(uint key, uint key_parts,
+ uint (*next_field_no) (uchar *), uchar *arg)
+{
+ DBUG_ASSERT(key < max_keys);
+
+ char buf[NAME_CHAR_LEN];
+ KEY* keyinfo;
+ Field **reg_field;
+ uint i;
+ bool key_start= TRUE;
+ KEY_PART_INFO* key_part_info=
+ (KEY_PART_INFO*) alloc_root(&mem_root, sizeof(KEY_PART_INFO)*key_parts);
+ if (!key_part_info)
+ return TRUE;
+ keyinfo= key_info + key;
+ keyinfo->key_part= key_part_info;
+ keyinfo->usable_key_parts= keyinfo->key_parts = key_parts;
+ keyinfo->key_length=0;
+ keyinfo->algorithm= HA_KEY_ALG_UNDEF;
+ keyinfo->flags= HA_GENERATED_KEY;
+ sprintf(buf, "key%i", key);
+ if (!(keyinfo->name= strdup_root(&mem_root, buf)))
+ return TRUE;
+ keyinfo->rec_per_key= (ulong*) alloc_root(&mem_root,
+ sizeof(ulong)*key_parts);
+ if (!keyinfo->rec_per_key)
+ return TRUE;
+ bzero(keyinfo->rec_per_key, sizeof(ulong)*key_parts);
+ for (i= 0; i < key_parts; i++)
+ {
+ reg_field= field + next_field_no(arg);
+ if (key_start)
+ (*reg_field)->key_start.set_bit(key);
+ key_start= FALSE;
+ (*reg_field)->part_of_key.set_bit(key);
+ (*reg_field)->flags|= PART_KEY_FLAG;
+ key_part_info->null_bit= (*reg_field)->null_bit;
+ key_part_info->null_offset= (uint) ((*reg_field)->null_ptr -
+ (uchar*) record[0]);
+ key_part_info->field= *reg_field;
+ key_part_info->offset= (*reg_field)->offset(record[0]);
+ key_part_info->length= (uint16) (*reg_field)->pack_length();
+ keyinfo->key_length+= key_part_info->length;
+ /* TODO:
+ The below method of computing the key format length of the
+ key part is a copy/paste from opt_range.cc, and table.cc.
+ This should be factored out, e.g. as a method of Field.
+ In addition it is not clear if any of the Field::*_length
+ methods is supposed to compute the same length. If so, it
+ might be reused.
+ */
+ key_part_info->store_length= key_part_info->length;
+
+ if ((*reg_field)->real_maybe_null())
+ key_part_info->store_length+= HA_KEY_NULL_LENGTH;
+ if ((*reg_field)->type() == MYSQL_TYPE_BLOB ||
+ (*reg_field)->real_type() == MYSQL_TYPE_VARCHAR)
+ key_part_info->store_length+= HA_KEY_BLOB_LENGTH;
+
+ key_part_info->type= (uint8) (*reg_field)->key_type();
+ key_part_info->key_type =
+ ((ha_base_keytype) key_part_info->type == HA_KEYTYPE_TEXT ||
+ (ha_base_keytype) key_part_info->type == HA_KEYTYPE_VARTEXT1 ||
+ (ha_base_keytype) key_part_info->type == HA_KEYTYPE_VARTEXT2) ?
+ 0 : FIELDFLAG_BINARY;
+ key_part_info++;
+ }
+ set_if_bigger(s->max_key_length, keyinfo->key_length);
+ s->keys++;
+ return FALSE;
+}
+
+
/**
@brief Check if this is part of a MERGE table with attached children.
=== modified file 'sql/table.h'
--- a/sql/table.h 2010-03-20 12:01:47 +0000
+++ b/sql/table.h 2010-05-31 21:22:40 +0000
@@ -781,6 +781,7 @@
uint temp_pool_slot; /* Used by intern temp tables */
uint status; /* What's in record[0] */
uint db_stat; /* mode of file as in handler.h */
+ uint max_keys; /* Size of allocated key_info array. */
/* number of select if it is derived table */
uint derived_select_number;
int current_lock; /* Type of lock on table */
@@ -913,6 +914,9 @@
*/
inline bool needs_reopen_or_name_lock()
{ return s->version != refresh_version; }
+ bool alloc_keys(uint key_count);
+ bool add_tmp_key(uint key, uint key_parts,
+ uint (*next_field_no) (uchar *), uchar *arg);
bool is_children_attached(void);
};
=== modified file 'storage/maria/ha_maria.cc'
--- a/storage/maria/ha_maria.cc 2010-03-20 12:01:47 +0000
+++ b/storage/maria/ha_maria.cc 2010-05-31 21:22:40 +0000
@@ -995,6 +995,8 @@
{
MARIA_HA *tmp= file;
file= 0;
+ if (!tmp)
+ return 0;
return maria_close(tmp);
}
1
0
[Maria-developers] IRC log of hingo and cafuego about the deb packages still named "mysql"
by Henrik Ingo 31 May '10
by Henrik Ingo 31 May '10
31 May '10
---------- Forwarded Message ----------
Subject: IRC log of hingo and cafuego about the deb packages still named "mysql"
Date: Friday 28 May 2010
From: Henrik Ingo <hingo(a)askmonty.org>
To: maria-developers(a)lists.launchpad.net
Archiving this here for later reference.
The background is that current MariaDB packaging (which is based on ourdelta)
still has a few packages called mysql-something instead of mariadb-something.
This works for ourdelta, but since you obviously cannot have 2 identically
named packages in the same repository, this is a showstopper for getting
MariaDB into Debian.
My gut feeling of the below is that this is a bug in apt. If there is one
package called mysql-common (which is one of the problematic packages) and one
called mariadb-common that Provides: mysql-common, then if user chooses to
install mariadb-* and uninstall mysql-*, apt should be happy and let the user
do that.
[12:54:43] <hingo> cafuego?
[12:55:24] <evil_steve> is the six foot something dutch guy in the corner
nursing the fruitiest, girliest drink in the place.
[12:56:20] <capitol> ^^
[13:03:57] <cafuego> hingo: yes?
[13:04:16] <hingo> cafuego: You are the one doing the ourdelta deb packages?
[13:04:30] <cafuego> Yup
[13:08:30] <-- monk-eeee (~monk-
eeee(a)c220-237-92-67.kelvn3.qld.optusnet.com.au) has quit (Quit: Computer has
gone to sleep)
[13:13:06] --> monk-eeee (~monk-
eeee(a)c220-237-92-67.kelvn3.qld.optusnet.com.au) has joined #ourdelta
[13:14:42] <hingo> Oh sorry, I drifted off...
[13:15:27] <hingo> So, I was looking into some old emails and returned to the
fact we ship a "mysql-common" package with MariaDB.
[13:16:00] <hingo> Arjen says this is because some other debian package is
"hard-coded" to depend on that, but he is never able to remember more details.
Do you?
[13:17:01] <hingo> Details such as 1) do you remember which packages in debian
break if we rename it to mariadb-common and 2) since packages do "depends" and
"provides", how is it even possible to depend on the package name so that it
cannot be solved with a provides:?
[13:17:11] <hingo> cafuego ^
[13:37:32] <-- monk-eeee (~monk-
eeee(a)c220-237-92-67.kelvn3.qld.optusnet.com.au) has quit (Quit: Computer has
gone to sleep)
[13:43:18] <cafuego> hingo: ummm... i think it was perl-dbi or somesuch
[13:44:16] <cafuego> hingo: The problem was that Provides can't be versioned,
so the distro pkg always wins.
[13:44:43] <hingo> cafuego: Ok, that makes more sense.
[13:45:01] <cafuego> hingo: A friend suggested sticking an empty mysql-common
package in and upping the epoch on that so the distro always loses :-)
[13:45:38] <hingo> cafuego: So perl-dbi depends always on a specific version.
[13:45:55] <cafuego> hingo: No, but it always grabs the newest version.
[13:46:09] <cafuego> I think, let me check
[13:46:36] <cafuego> wrong pkg
[13:47:38] <cafuego> libdbd-mysql-perl
[13:48:13] <cafuego> On my box that has a versioned depend on
libmysqlclient15off (>= 5.0.27-1)
[13:48:25] <hingo> cafuego: yes, but that is perl-dbi in commonspeak :-)
[13:48:34] <cafuego> ;-)
[13:48:53] <cafuego> So unless I stick libmysqlclient15off in my pkg it'll
always keep the distro version.
[13:49:01] <cafuego> a provides won't do it :-(
[13:49:16] <hingo> cafuego: So not actually mysql-common as such, just that
file?
[13:49:41] <cafuego> Yeah I think so. It's been a while since I worked in it.
[13:49:45] <cafuego> s/in/on/
[13:50:15] <cafuego> The client won't install without libdbd-mysql-perl and
libdbd-mysql-perl won't install wtihout libmysqlclient15off
[13:50:43] --> monk-eeee (~monk-
eeee(a)c220-237-92-67.kelvn3.qld.optusnet.com.au) has joined #ourdelta
[13:51:03] <hingo> cafuego: So why is the package name relevant at all then?
it depends on a file name of a library. Can't the package name be called
anything?
[13:51:15] <cafuego> hingo: sorry?
[13:51:27] <cafuego> hingo: I'm only talking pkg names here.
[13:51:41] <hingo> the package name is mysql-common
[13:51:54] <cafuego> So with my maria 5.1.42-mariadb68 package
[13:53:01] <hingo> mysql-common_5.1.42-mariadb68_all.deb
[13:53:22] <cafuego> mariadb-client-5.1 depends on libdbd-mysql-perl (which is
provided by the distro, and thus I can't edit its depends) depends on
libmysqlclient16 (>= 5.1.21-1)
[13:54:02] <cafuego> if I create libmariadbclient16 witha Provides:
libmysqlclient16 the distro will NOT install that in preference
[13:54:12] <hingo> Ok, I get the libmysqlclient packages. I was speaking about
mysql-common in http://mirror.ourdelta.org/deb/dists/lenny/mariadb-ourdelta/
[13:55:52] <cafuego> AH yep. So something depends on libmysqlclient15off
[13:57:00] <cafuego> I don't think I have a lenny box handy :-/
[13:57:18] <hingo> But libmysqlclient15off is a separate package?
[13:57:22] <cafuego> yes
[13:57:43] <cafuego> that's the name of the pkg provided by the distro, that
would override a Provides in the maria packages
[13:59:38] <hingo> cafuego: I'm confused. mariadb-common does not contain
libmysqlclient15off or any other libmysqlclient.
[13:59:39] <cafuego> There's 134 packages in Lenny that depend on
libmysqlclient15off.
[14:00:02] <hingo> I mean of course mysql-common from the mariadb repo.
[14:00:47] <cafuego> I swear I had a good reason at the time ;-)
[14:01:38] <cafuego> Oh that's right.
[14:01:40] <hingo> Ok. I can see how there could be a similar reason as for
libmysqlclient* problems. You kind of answered my second question anyway.
[14:01:56] <cafuego> Packages depend on libmysqlclient15off and
libmysqlclient15off depends on mysql-common.
[14:02:09] <hingo> ok.
[14:02:24] <hingo> Yes, of course.
[14:02:44] <cafuego> libmysqlclient15off has a versioned depend, so a provides
line in mariadb-common doesn't override that
[14:03:21] <cafuego> I think I got stuck in circular depependency land and
yelled at my machine a lot. Then I decided to just not rename everything :-)
[14:03:36] <hingo> Hmm... I bet that versioned depend isn't really necessary.
It's just a .cnf file there...
[14:04:08] <cafuego> hingo: probably, but it's a distro pkg so I can't change
it without actually providing a package of that name anyway.
[14:04:25] <hingo> cafuego: No, of course.
[14:04:40] <cafuego> So it went into the "currently unfixable" basket
[14:04:45] <cafuego> Well
[14:05:25] <cafuego> It's easily fixable as long as I don't expect users to
want to simply 'aptitude upgrade', but instead download depdns and manually
install them with dpkg.
[14:05:50] <hingo> cafuego: Btw, do you have an idea why RPM based systems
avoid this same problem? For them it works with a Provides?
[14:06:11] <hingo> cafuego: No we of course want to support apt-get/aptitude.
[14:06:33] <hingo> cafuego: We are looking into getting MariaDB into Debian
itself, but then we cannot have 2 packages with the same name.
[14:07:18] <cafuego> Ah yes.
[14:07:46] <cafuego> Well, in *theory* they shouldn't have that awful depend
in squeeze
[14:07:47] <hingo> This smells apt bug to me actually.
--
Henrik Ingo
Project Manager and COO, Monty Program Ab
hingo(a)askmonty.org, skype:henrik.ingo, +358405697354
http://askmonty.org/wiki/index.php/About_Us
What's up with MariaDB?
http://askmonty.org/wiki/index.php/MariaDB
-------------------------------------------------------
--
email: henrik.ingo(a)avoinelama.fi
tel: +358-40-5697354
www: www.avoinelama.fi/~hingo
book: www.openlife.cc
1
0
[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 30 May '10
by worklog-noreply@askmonty.org 30 May '10
30 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 35
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
-=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=-
Status updated.
--- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000
+++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=-
Worked 20 hours (alexi)
Worked 20 hours and estimate 15 hours remain (original estimate unchanged).
-=-=(Serg - Fri, 05 Feb 2010, 14:04)=-=-
Observers changed: Knielsen,Serg
------------------------------------------------------------
-=-=(View All Progress Notes, 32 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each recieved event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events will be processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 30 May '10
by worklog-noreply@askmonty.org 30 May '10
30 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 35
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
-=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=-
Status updated.
--- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000
+++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=-
Worked 20 hours (alexi)
Worked 20 hours and estimate 15 hours remain (original estimate unchanged).
-=-=(Serg - Fri, 05 Feb 2010, 14:04)=-=-
Observers changed: Knielsen,Serg
------------------------------------------------------------
-=-=(View All Progress Notes, 32 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each recieved event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events will be processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 30 May '10
by worklog-noreply@askmonty.org 30 May '10
30 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 35
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
-=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=-
Status updated.
--- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000
+++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=-
Worked 20 hours (alexi)
Worked 20 hours and estimate 15 hours remain (original estimate unchanged).
-=-=(Serg - Fri, 05 Feb 2010, 14:04)=-=-
Observers changed: Knielsen,Serg
------------------------------------------------------------
-=-=(View All Progress Notes, 32 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each recieved event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events will be processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 30 May '10
by worklog-noreply@askmonty.org 30 May '10
30 May '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 35
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
-=-=(Knielsen - Mon, 29 Mar 2010, 10:59)=-=-
Status updated.
--- /tmp/wklog.47.old.27790 2010-03-29 10:59:53.000000000 +0000
+++ /tmp/wklog.47.new.27790 2010-03-29 10:59:53.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Alexi - Thu, 18 Feb 2010, 19:29)=-=-
Worked 20 hours (alexi)
Worked 20 hours and estimate 15 hours remain (original estimate unchanged).
-=-=(Serg - Fri, 05 Feb 2010, 14:04)=-=-
Observers changed: Knielsen,Serg
------------------------------------------------------------
-=-=(View All Progress Notes, 32 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each recieved event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events will be processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0