Hi Sergei,

On Tue, Sep 10, 2013 at 09:11:16PM +0200, Sergei Golubchik wrote:
Hi, Sergey!
On Sep 10, Sergey Vojtovich wrote:
Hi Sergei,
thanks for looking into this patch. Frankly speaking, I find it a bit questionable too. Below are links that should answer your questions:

What problem do I attempt to solve: https://lists.launchpad.net/maria-developers/msg06118.html
How do I attempt to solve it: https://mariadb.atlassian.net/browse/MDEV-4956
Yes, I've seen and remember both, but they don't answer my question, which was about the specific changes you've made, not about the goal. But ok, see below.
For every statement we acquire a table from the table cache and then release it back to the cache. That involves updating three lists: unused_tables, the per-share used_tables and free_tables. These lists are protected by LOCK_open (see tc_acquire_table() and tc_release_table()).
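To make the pattern concrete, here is a minimal C++ sketch of what is described above, with invented simplified types (this is not the actual tc_acquire_table()/tc_release_table() code): a single global mutex guards the global unused_tables list as well as both per-share lists, so every statement takes it once on acquire and once on release.

  #include <list>
  #include <mutex>

  struct Table;                       // hypothetical stand-in for TABLE
  struct Share {                      // hypothetical stand-in for TABLE_SHARE
    std::list<Table*> used_tables;    // tables currently used by some thread
    std::list<Table*> free_tables;    // cached tables not currently in use
  };

  static std::mutex LOCK_open;             // the single global mutex
  static std::list<Table*> unused_tables;  // global LRU of unused tables

  Table* tc_acquire_table_sketch(Share* s) {
    std::lock_guard<std::mutex> guard(LOCK_open);
    if (s->free_tables.empty())
      return nullptr;                      // caller would open a new TABLE instead
    Table* t = s->free_tables.front();
    s->free_tables.pop_front();            // list update #1 (per-share)
    unused_tables.remove(t);               // list update #2 (global!)
    s->used_tables.push_back(t);           // list update #3 (per-share)
    return t;
  }

  void tc_release_table_sketch(Share* s, Table* t) {
    std::lock_guard<std::mutex> guard(LOCK_open);
    s->used_tables.remove(t);
    s->free_tables.push_back(t);
    unused_tables.push_back(t);            // writes the global list head again
  }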
Why are per-share lists updated under the global mutex? I would have done that already if it gave us a considerable performance gain. Alas, it doesn't solve the CPU cache coherence problem.
Another reason is the global unused_tables list, which cannot be protected by a per-share mutex, so we would have to lock an additional mutex twice per query. The third reason (less important) is that I'm not sure whether TABLE_SHARE::visit_subgraph can be protected by a per-share mutex; I suspect it needs to see a consistent table cache state. The fourth reason (even less important) is that it is quite complex: there are invariants between free_tables, unused_tables and in_use.
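For illustration, a sketch of the rejected split, with hypothetical names (not code from the tree): even with a per-share mutex over used_tables/free_tables, the global unused_tables list still needs its own global mutex, so each acquire and each release pays for an extra lock, and keeping the two lists consistent across two separate critical sections is exactly the kind of invariant problem mentioned above.

  #include <list>
  #include <mutex>

  struct Table;
  struct Share {
    std::mutex lock;                       // hypothetical per-share mutex
    std::list<Table*> used_tables;
    std::list<Table*> free_tables;
  };

  static std::mutex LOCK_unused;           // still a global mutex
  static std::list<Table*> unused_tables;  // still a single shared list head

  Table* acquire_split_sketch(Share* s) {
    Table* t = nullptr;
    {
      std::lock_guard<std::mutex> g(s->lock);     // per-share work is local now...
      if (s->free_tables.empty())
        return nullptr;
      t = s->free_tables.front();
      s->free_tables.pop_front();
      s->used_tables.push_back(t);
    }
    {
      std::lock_guard<std::mutex> g(LOCK_unused); // ...but the global list forces a
      unused_tables.remove(t);                    // second mutex and still writes the
    }                                             // same shared cache lines
    return t;
  }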
Every time we update a global pointer, the corresponding cache lines of sibling CPUs have to be invalidated. This causes expensive memory reads while LOCK_open is held.
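As a generic illustration of that effect (a toy program, nothing to do with the server code): below, several threads repeatedly store into one shared pointer-sized location versus into per-thread, cache-line-aligned locations; on a multi-core box the shared variant is typically far slower purely because every store invalidates the line in the other cores.

  #include <atomic>
  #include <chrono>
  #include <cstdio>
  #include <thread>
  #include <vector>

  static const int kThreads = 8;
  static const int kIters = 5 * 1000 * 1000;

  static std::atomic<void*> shared_head(nullptr);   // one contended location

  struct alignas(64) PaddedHead { std::atomic<void*> p{nullptr}; };
  static PaddedHead local_heads[kThreads];          // one cache line per thread

  template <class F>
  static long long run_ms(F f) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < kThreads; i++)
      threads.emplace_back(f, i);
    for (std::thread& t : threads)
      t.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
  }

  int main() {
    long long shared_ms = run_ms([](int) {
      for (int i = 0; i < kIters; i++)              // every store invalidates the
        shared_head.store(&shared_head,             // cache line in all other cores
                          std::memory_order_relaxed);
    });
    long long local_ms = run_ms([](int id) {
      for (int i = 0; i < kIters; i++)              // each thread writes only its
        local_heads[id].p.store(&local_heads[id],   // own cache line
                                std::memory_order_relaxed);
    });
    std::printf("shared head: %lld ms, per-thread heads: %lld ms\n",
                shared_ms, local_ms);
    return 0;
  }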
Oracle solved this problem by partitioning the table cache, allowing emulation of something like per-CPU lists.
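Roughly, and only in the spirit of MySQL's table_open_cache_instances (the names and the instance count below are illustrative): the cache is split into N independent instances, each with its own mutex and its own lists, and a session maps onto one instance, so unrelated sessions rarely touch the same mutex or the same list-head cache lines.

  #include <list>
  #include <mutex>

  struct Table;

  struct CacheInstance {
    std::mutex lock;                  // per-instance "LOCK_open"
    std::list<Table*> unused_tables;  // per-instance LRU instead of one global list
  };

  static const unsigned N_INSTANCES = 8;
  static CacheInstance instances[N_INSTANCES];

  // Map a session to an instance; a trivial hash of the thread id is enough
  // to spread sessions across instances.
  static CacheInstance& instance_for(unsigned long thread_id) {
    return instances[thread_id % N_INSTANCES];
  }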
We attempted to split LOCK_open logically, and succeeded at everything but these three lists. I attempted a lock-free list for free_tables, but the TPS rate didn't improve.
How did you do the lock-free list, could you show, please?
Please find it attached. It is mixed in with other changes; just search for my_atomic_casptr.
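For readers without the attachment, here is a generic sketch of the technique (not the patch itself): free_tables becomes a lock-free LIFO whose head is updated with compare-and-swap. The patch does this with my_atomic_casptr; the sketch uses std::atomic for the same CAS loop. Note that even without a mutex, every push/pop still writes the one shared head pointer, so the cache-line transfers remain, which would be consistent with the TPS rate not improving.

  #include <atomic>

  struct Table {
    Table* next_free;                // intrusive link for the free list
  };

  struct LockFreeFreeList {
    std::atomic<Table*> head{nullptr};

    void push(Table* t) {
      Table* old = head.load(std::memory_order_relaxed);
      do {
        t->next_free = old;          // old is reloaded by a failed CAS
      } while (!head.compare_exchange_weak(old, t,
                                           std::memory_order_release,
                                           std::memory_order_relaxed));
    }

    Table* pop() {
      Table* old = head.load(std::memory_order_acquire);
      while (old &&
             !head.compare_exchange_weak(old, old->next_free,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed)) {
        // old is reloaded by the failed CAS; retry with the new head
      }
      return old;  // a real implementation must also handle the ABA problem
    }
  };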
What we need is to reduce the number of these expensive memory reads, and there are two solutions: partition these lists or get rid of them. As we agreed not to partition, I'm trying the latter solution.
Well, you can partition the list. With 32 list head pointers. And a thread adding a table only to "this thread's" list. Of course, it's not complete partitioning between CPUs, as any thread can remove a table from any list. But at least there won't be one global list head pointer.
Yes, that's what Oracle did and what we're trying to avoid.
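For reference, a compact sketch of the list partitioning Sergei describes above (the bucket count and the thread-id hash are illustrative choices, not something from the patch): 32 head pointers, each with its own lock; a thread pushes onto "its own" bucket, while eviction may pop from any bucket.

  #include <list>
  #include <mutex>

  struct Table;

  struct UnusedBucket {
    std::mutex lock;
    std::list<Table*> tables;
  };

  static const unsigned N_BUCKETS = 32;
  static UnusedBucket unused_buckets[N_BUCKETS];

  static void push_unused(unsigned long thread_id, Table* t) {
    UnusedBucket& b = unused_buckets[thread_id % N_BUCKETS];  // "this thread's" list
    std::lock_guard<std::mutex> g(b.lock);
    b.tables.push_back(t);
  }

  static Table* pop_any_unused() {
    for (unsigned i = 0; i < N_BUCKETS; i++) {  // any thread may take from any list
      UnusedBucket& b = unused_buckets[i];
      std::lock_guard<std::mutex> g(b.lock);
      if (!b.tables.empty()) {
        Table* t = b.tables.front();
        b.tables.pop_front();
        return t;
      }
    }
    return nullptr;
  }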
Why do I find this patch questionable? It reduces LOCK_open wait time by 30%; to get close to Oracle's wait time, we need to reduce it by 90%. We could remove unused_tables as well, but that would give 60%, not 90%.
Hmm, if you're only interested in optimizing this specific use case - one table, many threads - then yes, maybe. But if you have many tables, then modifying per-share lists under the share's own mutex is, basically, a must.
If you mean the opposite situation, where N threads access N tables (each thread accessing its own table): there should be no problem, because every thread is accessing its own lists. Well, except for unused_tables of course, but it can't be protected by a per-share mutex anyway (tested a few months ago, but we could ask XL to double-check the above case if needed).

Thanks,
Sergey