Re: [Maria-developers] [GSoC] self-tuning optimizer
Hi, Anshu! On Jun 18, Anshu Avinash wrote:
Hi serg,
Sorry that I have been out of touch for the last couple of days; I got caught up in some other work.
I have read through the comments you made on the MDEV and have just started implementing the first approach, the approximate method. I have pushed a prototype at: https://github.com/igniting/server/commit/3bd4f5b6fd20d11575207ac36e8a3cc2c9...
Here is what I'm thinking: we add to the array of measurement data after each query and try to solve the system of equations, because we don't know when the system will become solvable.
Eh. To solve a system of linear equations with N unknowns, you need N equations. So, you wait until you get N equations and then solve it. A few thoughts:

1. It might be that out of all the factors you have in your data structures (currently it's 2 + 2*MAX_HA), only a few will actually be used in the queries (for example, only TIME_FOR_COMPARE and read_time for InnoDB). Then you only need two queries to solve for those, not 2 + 2*MAX_HA. But it's an optimization, try to do the basic implementation without it first.

2. You normally store these data (for N queries) in the THD, right? It might be that the thread disconnects before it has done the required number of queries. To avoid throwing the useful data away, you can keep one extra set of data globally: when some THD disconnects and it doesn't have enough data to solve the equations, you append its data to the global set and solve that as a whole. Again, this is an extension, only do it when the basic implementation works.
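To make the above concrete, here is a minimal, self-contained C++ sketch of the per-query collection and exact solve: each query contributes one equation sum_i coeff[i]*factor[i] = measured_time, and once N independent equations are available they are solved with Gaussian elimination. This is not the actual server code; the struct, the function names and the sample numbers in main() are made up for illustration only.

#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

struct Equation {
  std::vector<double> coeff;   /* e.g. #comparisons, #rows read per engine */
  double measured_time;        /* observed execution time of the query */
};

/* Solve the first n collected equations exactly; returns false if the
   system is singular, i.e. the equations are not independent yet. */
static bool solve_exact(std::vector<Equation> eqs, std::vector<double> *out)
{
  size_t n= eqs[0].coeff.size();
  if (eqs.size() < n)
    return false;
  for (size_t col= 0; col < n; col++)          /* elimination with pivoting */
  {
    size_t pivot= col;
    for (size_t row= col + 1; row < n; row++)
      if (std::fabs(eqs[row].coeff[col]) > std::fabs(eqs[pivot].coeff[col]))
        pivot= row;
    if (std::fabs(eqs[pivot].coeff[col]) < 1e-12)
      return false;                            /* not solvable yet */
    std::swap(eqs[col], eqs[pivot]);
    for (size_t row= col + 1; row < n; row++)
    {
      double f= eqs[row].coeff[col] / eqs[col].coeff[col];
      for (size_t k= col; k < n; k++)
        eqs[row].coeff[k]-= f * eqs[col].coeff[k];
      eqs[row].measured_time-= f * eqs[col].measured_time;
    }
  }
  out->assign(n, 0.0);
  for (size_t row= n; row-- > 0;)              /* back substitution */
  {
    double s= eqs[row].measured_time;
    for (size_t k= row + 1; k < n; k++)
      s-= eqs[row].coeff[k] * (*out)[k];
    (*out)[row]= s / eqs[row].coeff[row];
  }
  return true;
}

int main()
{
  /* Two unknowns (say, TIME_FOR_COMPARE and one engine's per-row read
     cost), so two independent equations are enough. */
  std::vector<Equation> collected= { {{1000, 10}, 1.2}, {{200, 50}, 2.1} };
  std::vector<double> factors;
  if (solve_exact(collected, &factors))
    printf("factor[0]=%g factor[1]=%g\n", factors[0], factors[1]);
  return 0;
}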
However, trying to solve the equations after each query might be time consuming. Can we store the entire data set (from all the queries) and fit a best-fitting model instead of solving them exactly? We would do this at disconnect.
I don't have a good answer for that. It's a trade-off. Either you collect exactly N equations and solve them, then collect the next N equations (or get one new one and delete the oldest) and solve that, and so on. Or you collect many equations, more than N, and find a best-fitting solution. That way you do less solving, but you'll need more memory.

I think for now you can do that - use a best-fitting solution. We only have very few factors at the moment. Still, even now you'll need some limit on how many equations you collect. Later we'll see what to do - perhaps we'll need to decrease the limit, or set it exactly to N (which means the solution won't be "best fitting" anymore). But that will not be in this GSoC, not in your project. I'm sure that in this project we won't have enough factors for memory usage to become an issue :)
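For what it's worth, the best-fitting variant is only a small step on top of the sketch above: with M >= N collected equations, minimize the squared error ||A*x - b||^2 by solving the normal equations (A^T A) x = A^T b, which is again an N x N exact solve. The code below reuses the hypothetical Equation struct and solve_exact() from the previous sketch; none of the names are real server code.

/* Find the factors x that best fit M >= N collected equations, in the
   least-squares sense, by building (A^T A) x = A^T b and solving it with
   solve_exact() from the sketch above. */
static bool solve_least_squares(const std::vector<Equation> &eqs,
                                std::vector<double> *out)
{
  size_t n= eqs[0].coeff.size(), m= eqs.size();
  if (m < n)
    return false;                     /* under-determined, keep collecting */
  std::vector<Equation> normal(n);    /* the n x n normal equations */
  for (size_t i= 0; i < n; i++)
  {
    normal[i].coeff.assign(n, 0.0);
    normal[i].measured_time= 0.0;
    for (size_t q= 0; q < m; q++)
    {
      for (size_t j= 0; j < n; j++)
        normal[i].coeff[j]+= eqs[q].coeff[i] * eqs[q].coeff[j];        /* A^T A */
      normal[i].measured_time+= eqs[q].coeff[i] * eqs[q].measured_time; /* A^T b */
    }
  }
  return solve_exact(normal, out);
}

With m == n this degenerates to the exact solution, so the same entry point works whatever limit on the number of stored equations is chosen later.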
Should it be in ~THD() or somewhere else?
Yes, ~THD is fine, just make sure you only do it for real connection THDs, where SELECT queries were issued and the optimizer was run. Not for various internal THDs, like the one you create in cost_factors::init.

For example, as you've noticed, I mentioned in the last comment using the read_time() result and Nrows_estimated (the same applies to scan_time(), of course). So, when ha_read_time() is called, you need to remember (in your data set) the return value of read_time() and Nrows_estimated (that's the last argument of read_time()). This way, you only solve your equations if you have this read_time() and Nrows_estimated (or scan_time()) stored. If not, the optimizer was never run and you don't do anything.

Regards,
Sergei
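To illustrate how these pieces could fit together, here is one more hypothetical sketch, building on the two earlier ones. Thd_tuning_data, record_read_time() and solve_on_disconnect() are invented names, not existing server functions; they only show the "remember the read_time()/scan_time() estimate, solve at disconnect only if something was recorded" idea.

/* Imagined per-connection collection state; think of it as owned by THD. */
struct Optimizer_estimate
{
  double estimated_cost;        /* return value of read_time()/scan_time() */
  double rows_estimated;        /* the Nrows_estimated argument */
};

struct Thd_tuning_data
{
  std::vector<Optimizer_estimate> estimates;
  std::vector<Equation> equations;      /* one per measured query */

  /* Would be called from the ha_read_time() wrapper. */
  void record_read_time(double cost, double nrows_estimated)
  {
    estimates.push_back({cost, nrows_estimated});
  }

  /* Would be called from ~THD, for real connection THDs only (not for
     internal THDs such as the one created in cost_factors::init). */
  void solve_on_disconnect()
  {
    if (estimates.empty())
      return;                           /* optimizer never ran: do nothing */
    std::vector<double> factors;
    if (!solve_least_squares(equations, &factors))
      return;                           /* too little data; this is where the
                                           equations could be appended to a
                                           global set instead */
    /* ... publish the newly fitted cost factors ... */
  }
};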