Re: [Maria-developers] [GSoC] self-tuning optimizer
Hi, Anshu!

On Jun 13, Anshu Avinash wrote:
So I have refactored Cost_factors::init() into Cost_factors::init() and Cost_factors::re_init(). The diff is at: https://github.com/igniting/server/commit/41e360309c067d072d788495e4e9480546...
Looks great. I've added one comment there.
I had started writing a test for checking whether loading a new storage engine works or not. I loaded the 'Example' storage engine, but was unable to test the current t1 table on it, as t1 contains a primary key. Also I need to write test cases for 'time_for_compare'. Any hints on that?
Hm. Yes. The Example engine is pretty useless for you, as you cannot create tables in it. You can use the InnoDB plugin instead. But create a separate test file for that. For example (I'll use the short name below, don't commit it):

a.test:
==============================================
if (!$HA_INNODB_SO) {
  skip Need InnoDB plugin;
}
install plugin innodb soname 'ha_innodb';

... do your tests here

uninstall plugin innodb;
==============================================

You will also need an a.opt file:
==============================================
--ignore-builtin-innodb --loose-innodb
==============================================

As for time_for_compare... Here's a universal approach, you can apply it yourself. This time_for_compare is 5 by default, right? Change it to, say, 500, and run the existing tests (start from the main suite, and there from the optimizer tests, e.g. from select.test). If some test fails, there's your query: extract it, simplify it if needed, and put it in your test file. That's exactly what I just did :) And here's a simple test:

create table t1 (a blob, index(a(20)));
create table t2 (a blob, index(a(20)));
insert into t1 values ('one'),('two'),('three'),('four'),('five');
insert into t2 values ('one'),('two'),('three'),('four'),('five');
insert into t2 values ('one'),('two'),('three'),('four'),('five');
drop table t1, t2;

Perhaps you can get it down to one table - I didn't try.
Also, I think I can start with the next part, i.e. collecting statistics. We can have a structure like:
struct Cost_factor_stats
{
  char *cost_factor_name;
  double cost_factor_value;
  double total_queries;
total_queries should be ulonglong, not double.
  double total_query_time;
  double total_query_time_squared;
};
We will be updating these structures after every query, per THD, and writing the data into a persistent table on disconnect. Is this workflow correct?
Almost. You need a Cost_factor_stats structure per cost factor (and per engine, where appropriate). You don't need the name there. You have these structures per THD and globally. Basically, instead of your current

  double time_for_compare_val;

you will have

  Cost_factor_stats time_for_compare_val;

Hmm, perhaps it's better to name the structure simply "Cost_factor":

  Cost_factor time_for_compare;

And you put them all in one structure:

  struct Engine_cost_factors
  {
    Cost_factor read_time;
    Cost_factor scan_time;
  };

  struct Global_cost_factors
  {
    Cost_factor time_for_compare;
    Cost_factor time_for_compare_rowid;
  };

  class Cost_factors
  {
    Global_cost_factors global;
    Engine_cost_factors engine[MAX_HA];
  public:
    double time_for_compare() const { return global.time_for_compare.value; }
    double time_for_compare_rowid() const;
    double read_time(const handler*) const;
    double scan_time(const handler*) const;
  };

Now, you stick this structure in the THD, and also declare one instance globally. On connect you initialize the THD's Cost_factor values from the global Cost_factors object. On disconnect you add the THD's Cost_factor values back to the global Cost_factors object. There's no writing to disk so far, we'll think about it later (you don't want to write to disk on *every* disconnect).
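To make the connect/disconnect bookkeeping concrete, here is a minimal sketch of what the Cost_factor element could look like, assuming the fields proposed above (with total_queries as ulonglong); the add_to() helper and the local typedef are only illustrative, not existing server code:

  /* sketch only: field names follow the proposal above, add_to() is a made-up helper */
  typedef unsigned long long ulonglong;      /* stand-in for the server's typedef */

  struct Cost_factor
  {
    double value;                            /* constant currently used by the optimizer */
    ulonglong total_queries;                 /* number of queries that used this factor */
    double total_query_time;                 /* sum of observed query times */
    double total_query_time_squared;         /* sum of squared query times */

    /* on disconnect: merge this THD's statistics into the global instance */
    void add_to(Cost_factor *global_factor) const
    {
      global_factor->total_queries+= total_queries;
      global_factor->total_query_time+= total_query_time;
      global_factor->total_query_time_squared+= total_query_time_squared;
    }
  };

On connect the per-THD object can simply start as a copy of the global one; keeping the squared sum is what later allows computing the variance (E[t^2] - E[t]^2) without storing individual samples.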
A few more questions:
1. What will we do if the query used more than one cost factor?
2. How will joins be handled?
Yes, a query will certainly use more than one cost factor. In fact, this is something we might want to experiment with. I see two approaches - exact and approximate.

The approximate one is described in the MDEV. Say, you've counted 500 index reads, 20,000 sequential row reads (table scan), some 10,000,000 comparisons, etc., and you know how long the query took. For another query you've counted different numbers, and so on. Basically, you'll have a system of linear equations that one can solve to know how the time was split. The problem with this approach is that the server is doing much more than purely index reads, table scans, and comparisons, so this unaccounted time will be split between the cost factors. I don't know if that's a problem.

The exact approach - we only measure the time spent on the actual operations. Basically, you measure how long *each row read* took, how long *each index read* took, etc. This shows no unaccounted time, but it's more expensive; it might be too expensive to be practically useful - I don't know that either.

These two approaches need different data structures. The first one needs more memory, because you cannot start aggregating before you solve the system of equations. Basically, if you have four constants (time_for_compare, time_for_compare_rowid, read_time, scan_time, only one engine), you'll need four equations, that is, per THD you need to store four measurements:

  struct measurement
  {
    ulong time_for_compare;
    ulong time_for_compare_rowid;
    struct per_engine
    {
      ulong scan_time;
      ulong read_time;
    } per_engine[MAX_HA];
    double total_time;
  };

and in the THD you need

  struct measurement data[4];

Also, note that I'm writing per_engine[MAX_HA], but that's a simplification, good enough for a prototype. In the final code we cannot allocate that much memory per THD, as most of the time we'll have only a few engines, not MAX_HA.
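To illustrate the approximate approach, here is a small self-contained sketch (not server code; the operation counts, times, and names are all made up) that takes four stored measurements, turns them into four linear equations and solves them with Gauss-Jordan elimination to recover the per-operation costs:

  /*
    Each measurement row contributes one equation:
      n_compare*c1 + n_compare_rowid*c2 + n_scan*c3 + n_read*c4 = total_time
    With four unknown cost factors, four independent measurements are enough.
  */
  #include <cmath>
  #include <cstdio>

  static const int N_FACTORS= 4;  /* time_for_compare, time_for_compare_rowid,
                                     scan_time, read_time (single engine) */

  /* Solve a[i][0..3] * x = a[i][4] in place; returns false if the system is singular */
  static bool solve(double a[N_FACTORS][N_FACTORS + 1], double *x)
  {
    for (int col= 0; col < N_FACTORS; col++)
    {
      int best= col;                               /* partial pivoting */
      for (int row= col + 1; row < N_FACTORS; row++)
        if (std::fabs(a[row][col]) > std::fabs(a[best][col]))
          best= row;
      for (int k= 0; k <= N_FACTORS; k++)
      {
        double tmp= a[col][k]; a[col][k]= a[best][k]; a[best][k]= tmp;
      }
      if (std::fabs(a[col][col]) < 1e-12)
        return false;                              /* measurements not independent */
      for (int row= 0; row < N_FACTORS; row++)     /* eliminate this column in all other rows */
      {
        if (row == col)
          continue;
        double f= a[row][col] / a[col][col];
        for (int k= col; k <= N_FACTORS; k++)
          a[row][k]-= f * a[col][k];
      }
    }
    for (int i= 0; i < N_FACTORS; i++)
      x[i]= a[i][N_FACTORS] / a[i][i];
    return true;
  }

  int main()
  {
    /* four made-up measurements: operation counts per query + total query time (seconds) */
    double eq[N_FACTORS][N_FACTORS + 1]=
    {
      /* compares, rowid compares, scanned rows, index reads, total time */
      { 10000000,   500, 20000,   500, 0.75005 },
      {  2000000,  1000,     0, 30000, 3.1001  },
      {   500000,     0, 50000,     0, 0.525   },
      {  8000000,  4000, 10000, 10000, 1.5004  },
    };
    double cost[N_FACTORS];
    if (solve(eq, cost))
      for (int i= 0; i < N_FACTORS; i++)
        printf("cost factor %d: %g seconds per operation\n", i, cost[i]);
    return 0;
  }

Whatever solver is used in the end, the THD has to keep all N measurement rows around until the system is solved, which is exactly why this approach needs more memory than the exact one.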
Regards,
Sergei