Hi, Anshu!

On Jun 18, Anshu Avinash wrote:
> Hi serg,
>
> Sorry that I have been out of touch for a day or two; I got caught up in some other work.
>
> I have read through the comments you made on MDEV and I have just started implementing the first approach, the approximate method. I have pushed a prototype at: https://github.com/igniting/server/commit/3bd4f5b6fd20d11575207ac36e8a3cc2c9...
>
> Here is what I'm thinking: after each query we add to the array of measurement data and try to solve the system of equations, because we don't know when the system would become solvable.
Eh. To solve a system of linear equations with N unknowns, you need N equations. So, you wait until you have N equations and then solve the system (roughly as in the sketch below).

A few thoughts:

1. It might be that out of all the factors that you have in your data structures (currently it's 2 + 2 * MAX_HA), only a few are actually used in queries (for example, only TIME_FOR_COMPARE and read_time for InnoDB). Then you only need two queries to solve that, not 2 + 2 * MAX_HA. But this is an optimization; try to do the basic implementation without it first.

2. You normally store these data (for N queries) in the THD, right? It might be that the thread disconnects before it has run the required number of queries. To avoid throwing the useful data away, you can store one extra set of data globally, and when some THD disconnects without enough data to solve the equations, you append its data to the global set and solve that as a whole. Again, this is an extension; only do it when the basic implementation works.
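To make the "wait for N, then solve" idea concrete, here's a minimal standalone sketch - plain C++, not server code, and Equation/CostSolver are just illustrative names. It collects one equation per query and runs Gaussian elimination on the last N of them once enough have been gathered:

#include <cmath>
#include <utility>
#include <vector>

struct Equation {
  std::vector<double> coeff;   // one coefficient per cost factor (size == n_factors)
  double measured_time;        // observed execution time of the query
};

class CostSolver {
  size_t n_factors;
  std::vector<Equation> eqs;
public:
  explicit CostSolver(size_t n) : n_factors(n) {}

  // Called after each query; returns true if the system could be solved.
  bool add_equation(const Equation &e, std::vector<double> &out) {
    eqs.push_back(e);
    if (eqs.size() < n_factors)
      return false;                  // not enough equations yet
    return solve(out);
  }

private:
  // Plain Gaussian elimination with partial pivoting on the last
  // n_factors equations collected.
  bool solve(std::vector<double> &x) {
    size_t n = n_factors;
    std::vector<std::vector<double>> a(n, std::vector<double>(n + 1));
    for (size_t i = 0; i < n; i++) {
      const Equation &e = eqs[eqs.size() - n + i];
      for (size_t j = 0; j < n; j++)
        a[i][j] = e.coeff[j];
      a[i][n] = e.measured_time;
    }
    for (size_t col = 0; col < n; col++) {
      size_t piv = col;
      for (size_t r = col + 1; r < n; r++)
        if (std::fabs(a[r][col]) > std::fabs(a[piv][col]))
          piv = r;
      if (std::fabs(a[piv][col]) < 1e-12)
        return false;                // singular: equations not independent
      std::swap(a[piv], a[col]);
      for (size_t r = col + 1; r < n; r++) {
        double f = a[r][col] / a[col][col];
        for (size_t c = col; c <= n; c++)
          a[r][c] -= f * a[col][c];
      }
    }
    x.assign(n, 0.0);
    for (size_t i = n; i-- > 0; ) {  // back substitution
      double s = a[i][n];
      for (size_t j = i + 1; j < n; j++)
        s -= a[i][j] * x[j];
      x[i] = s / a[i][i];
    }
    return true;
  }
};

Note that the solve can still fail (return false) if the collected equations happen not to be independent - e.g. if N identical queries were run.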
> However, trying to solve the equations after each query might be time consuming. Can we instead store the entire data (from all the queries) and apply a best-fitting model, rather than solving them exactly? We would do this at disconnect.
I don't have a good answer for that; it's a trade-off. Either you collect exactly N equations and solve them, then collect the next N equations (or you get another one and delete the first one) and solve that, and so on. Or you collect more than N equations and find a best-fitting solution. That way you do less solving, but you need more memory.

I think for now you can do that - use a best-fitting solution. We only have very few factors at the moment. Still, even now you'll need some limit on how many equations you collect. Later we'll see what to do - perhaps we'll need to decrease the limit, or set it exactly to N (which means the solution won't be "best fitting" anymore). But that will not be in this GSoC, not in your project. I'm sure that in this project we won't have enough factors for memory usage to become an issue :)
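For the best-fitting variant, the usual choice is ordinary least squares over all stored measurements. Here's a standalone sketch under the same caveats as above (Measurement, MAX_EQS and best_fit() are made-up names): it caps the number of stored equations and solves the normal equations (A^T A) x = A^T b for the over-determined system:

#include <cmath>
#include <utility>
#include <vector>

struct Measurement {
  std::vector<double> coeff;   // factor coefficients for one query (size == n)
  double measured_time;        // observed time of that query
};

static const size_t MAX_EQS = 1000;   // cap on stored equations

// Keep memory bounded: drop the oldest measurement once the limit is reached.
void add_measurement(std::vector<Measurement> &m, const Measurement &e)
{
  if (m.size() >= MAX_EQS)
    m.erase(m.begin());
  m.push_back(e);
}

// Least-squares fit of n cost factors; returns false if there are fewer
// equations than unknowns or the normal equations are singular.
bool best_fit(const std::vector<Measurement> &m, size_t n, std::vector<double> &x)
{
  if (m.size() < n)
    return false;
  // Build the n x n normal equations, augmented with the right-hand side:
  // ata[i][j] = (A^T A)[i][j], ata[i][n] = (A^T b)[i]
  std::vector<std::vector<double>> ata(n, std::vector<double>(n + 1, 0.0));
  for (const Measurement &e : m)
    for (size_t i = 0; i < n; i++) {
      for (size_t j = 0; j < n; j++)
        ata[i][j] += e.coeff[i] * e.coeff[j];
      ata[i][n] += e.coeff[i] * e.measured_time;
    }
  // Gaussian elimination on the (small) n x n system
  for (size_t col = 0; col < n; col++) {
    size_t piv = col;
    for (size_t r = col + 1; r < n; r++)
      if (std::fabs(ata[r][col]) > std::fabs(ata[piv][col]))
        piv = r;
    if (std::fabs(ata[piv][col]) < 1e-12)
      return false;
    std::swap(ata[piv], ata[col]);
    for (size_t r = col + 1; r < n; r++) {
      double f = ata[r][col] / ata[col][col];
      for (size_t c = col; c <= n; c++)
        ata[r][c] -= f * ata[col][c];
    }
  }
  x.assign(n, 0.0);
  for (size_t i = n; i-- > 0; ) {    // back substitution
    double s = ata[i][n];
    for (size_t j = i + 1; j < n; j++)
      s -= ata[i][j] * x[j];
    x[i] = s / ata[i][i];
  }
  return true;
}

The nice property is that the solve only happens once, at disconnect, no matter how many measurements were collected, and the work grows only linearly with the number of measurements (plus the small n x n solve).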
> Should it be in ~THD() or somewhere else?
Yes, ~THD is fine, just make sure you only do it for real connection THDs, where SELECT queries were issued and the optimizer was run - not for the various internal THDs, like the one you create in cost_factors::init.

For example, as you've noticed, I mentioned in the last comment using the read_time() result and Nrows_estimated (the same applies to scan_time(), of course). So, when ha_read_time() is called, you need to remember (in your data set) the return value of read_time() and Nrows_estimated (that's the last argument of read_time()). This way, you only solve your equations if you have this read_time() and Nrows_estimated (or scan_time()) stored. If not - the optimizer was never run and you don't do anything.
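To illustrate that "only solve when the optimizer actually ran" condition, a rough standalone sketch - QueryCostData, ConnectionStats and on_disconnect() are made-up names; in the server the data would live in the THD and the check would sit in ~THD:

#include <vector>

struct QueryCostData {
  double predicted_cost;   // return value of read_time()/scan_time()
  double rows_estimated;   // the Nrows_estimated argument
  double measured_time;    // to be filled in after the query has run
};

struct ConnectionStats {             // would be a member of THD in the server
  std::vector<QueryCostData> data;

  // Called from the read_time()/scan_time() wrapper when the optimizer runs.
  void record(double predicted_cost, double rows_estimated) {
    data.push_back({predicted_cost, rows_estimated, 0.0});
  }

  bool has_data() const { return !data.empty(); }
};

// Called from the connection destructor (~THD in the server).
void on_disconnect(ConnectionStats &stats)
{
  if (!stats.has_data())
    return;    // optimizer never ran for this connection: nothing to do
  // ... append to the global data set and/or solve the equations here ...
}

Regards,
Sergei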