Hi Sergei,

On Tue, Sep 10, 2013 at 09:11:16PM +0200, Sergei Golubchik wrote:
Hi, Sergey!
On Sep 10, Sergey Vojtovich wrote:
Hi Sergei,
thanks for looking into this patch. Frankly speaking, I find it a bit questionable too. Below are links that should answer your questions:

What problem do I attempt to solve: https://lists.launchpad.net/maria-developers/msg06118.html
How do I attempt to solve it: https://mariadb.atlassian.net/browse/MDEV-4956
Yes, I've seen and remember both, but they don't answer my question, which was about the specific changes you've made, not about the goal. But ok, see below.
For every statement we acquire a table from the table cache and then release it back to the cache. That involves updating three lists: unused_tables, the per-share used_tables and free_tables. These lists are protected by LOCK_open (see tc_acquire_table() and tc_release_table()).
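To make the pattern concrete, here is a minimal C++ sketch of what is described above, with invented simplified types (this is not the actual tc_acquire_table()/tc_release_table() code): a single global mutex guards the global unused_tables list as well as both per-share lists, so every statement takes it once on acquire and once on release.

  #include <list>
  #include <mutex>

  struct Table;                       // hypothetical stand-in for TABLE
  struct Share {                      // hypothetical stand-in for TABLE_SHARE
    std::list<Table*> used_tables;    // tables currently used by some thread
    std::list<Table*> free_tables;    // cached tables not currently in use
  };

  static std::mutex LOCK_open;             // the single global mutex
  static std::list<Table*> unused_tables;  // global LRU of unused tables

  Table* tc_acquire_table_sketch(Share* s) {
    std::lock_guard<std::mutex> guard(LOCK_open);
    if (s->free_tables.empty())
      return nullptr;                      // caller would open a new TABLE instead
    Table* t = s->free_tables.front();
    s->free_tables.pop_front();            // list update #1 (per-share)
    unused_tables.remove(t);               // list update #2 (global!)
    s->used_tables.push_back(t);           // list update #3 (per-share)
    return t;
  }

  void tc_release_table_sketch(Share* s, Table* t) {
    std::lock_guard<std::mutex> guard(LOCK_open);
    s->used_tables.remove(t);
    s->free_tables.push_back(t);
    unused_tables.push_back(t);            // writes the global list head again
  }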
Why are per-share lists updated under the global mutex? I would have done that already if it gave us a considerable performance gain. Alas, it doesn't solve the CPU cache coherence problem.
Another reason is the global unused_tables list, which cannot be protected by a per-share mutex, so we would have to lock an additional mutex twice per query. The third reason (less important) is that I'm not sure whether TABLE_SHARE::visit_subgraph can be protected by a per-share mutex; I suspect it needs to see a consistent table cache state. The fourth reason (even less important) is that it is quite complex: there are invariants between free_tables, unused_tables and in_use.
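For illustration, a sketch of the rejected split, with hypothetical names (not code from the tree): even with a per-share mutex over used_tables/free_tables, the global unused_tables list still needs its own global mutex, so each acquire and each release pays for an extra lock, and keeping the two lists consistent across two separate critical sections is exactly the kind of invariant problem mentioned above.

  #include <list>
  #include <mutex>

  struct Table;
  struct Share {
    std::mutex lock;                       // hypothetical per-share mutex
    std::list<Table*> used_tables;
    std::list<Table*> free_tables;
  };

  static std::mutex LOCK_unused;           // still a global mutex
  static std::list<Table*> unused_tables;  // still a single shared list head

  Table* acquire_split_sketch(Share* s) {
    Table* t = nullptr;
    {
      std::lock_guard<std::mutex> g(s->lock);     // per-share work is local now...
      if (s->free_tables.empty())
        return nullptr;
      t = s->free_tables.front();
      s->free_tables.pop_front();
      s->used_tables.push_back(t);
    }
    {
      std::lock_guard<std::mutex> g(LOCK_unused); // ...but the global list forces a
      unused_tables.remove(t);                    // second mutex and still writes the
    }                                             // same shared cache lines
    return t;
  }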
Every time we update a global pointer, the corresponding cache lines of sibling CPUs have to be invalidated. This causes expensive memory reads while LOCK_open is held.
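As a generic illustration of that effect (a toy program, nothing to do with the server code): below, several threads repeatedly store into one shared pointer-sized location versus into per-thread, cache-line-aligned locations; on a multi-core box the shared variant is typically far slower purely because every store invalidates the line in the other cores.

  #include <atomic>
  #include <chrono>
  #include <cstdio>
  #include <thread>
  #include <vector>

  static const int kThreads = 8;
  static const int kIters = 5 * 1000 * 1000;

  static std::atomic<void*> shared_head(nullptr);   // one contended location

  struct alignas(64) PaddedHead { std::atomic<void*> p{nullptr}; };
  static PaddedHead local_heads[kThreads];          // one cache line per thread

  template <class F>
  static long long run_ms(F f) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < kThreads; i++)
      threads.emplace_back(f, i);
    for (std::thread& t : threads)
      t.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
  }

  int main() {
    long long shared_ms = run_ms([](int) {
      for (int i = 0; i < kIters; i++)              // every store invalidates the
        shared_head.store(&shared_head,             // cache line in all other cores
                          std::memory_order_relaxed);
    });
    long long local_ms = run_ms([](int id) {
      for (int i = 0; i < kIters; i++)              // each thread writes only its
        local_heads[id].p.store(&local_heads[id],   // own cache line
                                std::memory_order_relaxed);
    });
    std::printf("shared head: %lld ms, per-thread heads: %lld ms\n",
                shared_ms, local_ms);
    return 0;
  }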
Oracle solved this problem by partitioning the table cache, allowing emulation of something like per-CPU lists.
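Roughly, and only in the spirit of MySQL's table_open_cache_instances (the names and the instance count below are illustrative): the cache is split into N independent instances, each with its own mutex and its own lists, and a session maps onto one instance, so unrelated sessions rarely touch the same mutex or the same list-head cache lines.

  #include <list>
  #include <mutex>

  struct Table;

  struct CacheInstance {
    std::mutex lock;                  // per-instance "LOCK_open"
    std::list<Table*> unused_tables;  // per-instance LRU instead of one global list
  };

  static const unsigned N_INSTANCES = 8;
  static CacheInstance instances[N_INSTANCES];

  // Map a session to an instance; a trivial hash of the thread id is enough
  // to spread sessions across instances.
  static CacheInstance& instance_for(unsigned long thread_id) {
    return instances[thread_id % N_INSTANCES];
  }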
We attempted to split LOCK_open logically, and succeeded at everything but these three lists. I attempted a lock-free list for free_tables, but the TPS rate didn't improve.
How did you do the lock-free list, could you show, please?
Please find it attached. It is mixed in with other changes; just search for my_atomic_casptr.
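For readers without the attachment, here is a generic sketch of the technique (not the patch itself): free_tables becomes a lock-free LIFO whose head is updated with compare-and-swap. The patch does this with my_atomic_casptr; the sketch uses std::atomic for the same CAS loop. Note that even without a mutex, every push/pop still writes the one shared head pointer, so the cache-line transfers remain, which would be consistent with the TPS rate not improving.

  #include <atomic>

  struct Table {
    Table* next_free;                // intrusive link for the free list
  };

  struct LockFreeFreeList {
    std::atomic<Table*> head{nullptr};

    void push(Table* t) {
      Table* old = head.load(std::memory_order_relaxed);
      do {
        t->next_free = old;          // old is reloaded by a failed CAS
      } while (!head.compare_exchange_weak(old, t,
                                           std::memory_order_release,
                                           std::memory_order_relaxed));
    }

    Table* pop() {
      Table* old = head.load(std::memory_order_acquire);
      while (old &&
             !head.compare_exchange_weak(old, old->next_free,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed)) {
        // old is reloaded by the failed CAS; retry with the new head
      }
      return old;  // a real implementation must also handle the ABA problem
    }
  };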
What we need is to reduce the number of these expensive memory reads, and there are two solutions: partition these lists or get rid of them. As we agreed not to partition, I'm trying the latter solution.
Well, you can partition the list. With 32 list head pointers. And a thread adding a table only to "this thread's" list. Of course, it's not complete partitioning between CPUs, as any thread can remove a table from any list. But at least there won't be one global list head pointer.
Yes, that's what Oracle did and what we're trying to avoid.
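For reference, a compact sketch of the list partitioning Sergei describes above (the bucket count and the thread-id hash are illustrative choices, not something from the patch): 32 head pointers, each with its own lock; a thread pushes onto "its own" bucket, while eviction may pop from any bucket.

  #include <list>
  #include <mutex>

  struct Table;

  struct UnusedBucket {
    std::mutex lock;
    std::list<Table*> tables;
  };

  static const unsigned N_BUCKETS = 32;
  static UnusedBucket unused_buckets[N_BUCKETS];

  static void push_unused(unsigned long thread_id, Table* t) {
    UnusedBucket& b = unused_buckets[thread_id % N_BUCKETS];  // "this thread's" list
    std::lock_guard<std::mutex> g(b.lock);
    b.tables.push_back(t);
  }

  static Table* pop_any_unused() {
    for (unsigned i = 0; i < N_BUCKETS; i++) {  // any thread may take from any list
      UnusedBucket& b = unused_buckets[i];
      std::lock_guard<std::mutex> g(b.lock);
      if (!b.tables.empty()) {
        Table* t = b.tables.front();
        b.tables.pop_front();
        return t;
      }
    }
    return nullptr;
  }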
Why do I find this patch questionable? It reduces LOCK_open wait time by 30%; to get close to Oracle's wait time, we need to reduce it by 90%. We could remove unused_tables as well, but that would give 60%, not 90%.
Hmm, if you're only interested in optimizing this specific use case - one table, many threads - then yes, maybe. But if you have many tables, then modifying per-share lists under the share's own mutex is, basically, a must.
If you mean the opposite situation, where N threads access N tables (each thread accessing its own table): there should be no problem, because every thread is accessing its own lists. Well, except for unused_tables of course, but it can't be protected by a per-share mutex anyway (tested a few months ago, but we could ask XL to double-check the above case if needed).

Thanks,
Sergey