The handler API in MySQL and MariaDB is operating on a single record at a time. There are some ‘batched’ primitives that in my understanding were developed for the NDB cluster, but generally InnoDB does not use them.

The only nontrivial thing that InnoDB does during reads is the Index Condition Pushdown (ICP) and the end-of-range detection (DsMrr) that was ported from the never-released MySQL 6.0 or 5.2 to MySQL 5.6.

To improve the performance of range scans, InnoDB uses a caching mechanism to prefetch results. Starting with the 4th consecutive read on the same ha_innobase handle, InnoDB would prefetch 8 records into row_prebuilt_t::fetch_cache. There are several drawbacks of this:

The parameters of this cache are hard-coded, and these caches can make wrong guesses (for SELECT * FROM t LIMIT 5, it would unnecessarily read 8 records, possibly causing unnecessary page accesses)
The caches waste memory, especially on partitioned tables, where each partition handle would have its own cache. (The main benefit of ha_innopart or InnoDB native partitioning in MySQL 5.7 is that the fetch_cache of the table is shared between partitions. It is only a benefit of saving memory, because the cache will be emptied when switching partitions.)

As far as I understand, the entire purpose of the prebuilt->fetch_cache is to reduce the cost of acquiring the page latch and repositioning the cursor on each read request. If we did not call mtr_t::commit() between each request, maybe we could remove this cache altogether.

When I was in the InnoDB team at MySQL, I did bring up a few times the idea of eliminating the fetch_cache, but the optimizer developers were in a separate team, and to do this, we would have needed support from both managers. I am hoping that with MariaDB this would merely be a technical challenge to first write a proof-of-concept prototype, and then to polish and test it.

When I discussed this with the late Olav Sandstå some years ago, there was some thinking that we should stop hard-coding the TABLE::record[0] in various parts of the code, and the storage engine should be able to return multiple records to a larger TABLE::record[] buffer. I wonder if a simpler approach would work, and if we could do without any prefetching:

Remove row_sel_dequeue_cached_row_for_mysql(), prebuilt->fetch_cache, and related code.
Add prebuilt->mtr and let it be reused across multiple subsequent calls to row_search_mvcc(), until the query executor stops the current read batch.
A call to btr_pcur_store_position() and mtr_t::commit() will be necessary before any potentially long delay, such as when the result is being sent to a client connection, or when the query executor is going to switch to read records from another table.
The "end of read batch" could be signalled by invoking handler::extra() with a parameter.

With such changes, less copying would take place, and there should be less contention on dict_index_t::lock and the upper-level buf_block_t::lock, due to fewer calls to btr_pcur_restore_position(). While that function does have a fast path ("optimistic restore"), it cannot switch to the next page without a key lookup. Range scans would be more efficient if we can always simply advance to the right leaf page and release the lock on the left page.

I would love to do this in InnoDB. But I think that the effort should be lead from the optimizer side. Someone should first implement the "end of batched read" calls and some DBUG_ASSERT() that would catch any wrong-doing. That is, whenever we start to send data to a client or start to write into a temporary table, or switch to read records from another cursor, we must assert that the "end of batched read" flag is set on all cursor handles. Similarly, at the end and start of a statement, and at commit or rollback, all cursor handles must be at "end of batched read".

Who would take the challenge?

Best regards,

Marko

Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation

DON’T MISS

M|18

MariaDB User Conference

February 26 - 27, 2018

New York City

https://m18.mariadb.com/