Hi, Dammina! On Apr 17, Dammina Sahabandu wrote:
Hi All, I'm Dammina Sahabandu, 3rd year undergraduate of University of Moratuwa(dept. of Computer of Science and Engineering) which has consecutively won GSOC’s most number of participant’s award from 2009 to 2012. I'm willing to contribute for a MariaDB project in this years GSoC. I went through the project ideas page and found the project "Self-Tuning Optimizer" interesting. So can you give me an insight of the project.
Great! I hope, you've read https://mariadb.atlassian.net/browse/MDEV-350 already. So, this is the idea. The optimizer code (in the sql_select.cc, for example) contains lines like best_time= tmp + records/(double) TIME_FOR_COMPARE; (where TIME_FOR_COMPARE is simply 5) or double keys_per_block= (stats.block_size/2.0/len+1); or (stats.data_file_length) / IO_SIZE + 2 (where IO_SIZE is 4096). These magic constants are not universal, they should be different for different engines, different for SSD, etc. This task is about replacing them all with variables, like best_time= tmp + records * time_for_compare; tmp= engine_factor * table->file->keyread_time(...) etc. Then we measure the query execution time (T), number or rows read from every table, etc. And we split the time, approximately, between those, like t1 seconds were spent reading N1 rows from the first table, t2 seconds - reading N2 rows from the second table. t3 seconds - doing N1*N2 comparisons. And we've had cost estimates - optimizer requests them from the engine before executing the query - C1, C2. We assume that engine estimates are perfect - that is, proportional to the actual time, spent on retrieval. t1 = engine_factor * C1 t2 = engine_factor * C2 t3 = N1 * N2 * time_for_compare T = t1 + t2 + t3 One cannot solve the above directly, too many unknowns (lowercase), but we can collect the data from more than one query. This way we collect statistics and extract values for engine_factor (which is different for different storage engines), and time_for_compare (which is the the same for the whole server). And we feed these values back to the optimizer, that can use them to generate more adequate execution plans. This is a continuous process - statistics is always collected while the server is running, and periodically optimizer constants are updated. Regards, Sergei