[Maria-developers] GSoC 2015: Indexes on Virtual Columns and Array Based UDFs

Dear mentors, I am currently pursuing Master's in University of Illinois at Urbana Champaign, USA and completed by B.Tech from Indian Institute of Technology - Delhi (IIT- Delhi) This is regarding GSoC 2015. I am really interested in databases, and I was very excited to see all these projects listed here. The exciting part was that some of the projects are really “hard” as in they have challenged the database community since a long time, and thus it would be very interesting to solve some of these challenges as part of GSoC. I want to discuss 2 projects: A. Indexes on virtual columns Materialization gives us two things: 1. A name to the column which we can use in queries 2. A formal "regular" column which is stored and indexed in the regular fashion - Disadvantage: Extra memory requirements for the materialized column. My initial thoughts on this project are the following: We do need the name of the column which can be used to query. So maybe we can expose a command such as: create virtual_index <name> on <column_name> <expression> What this would do would run a regular query which evaluates expressions (like in WHERE clause) and the feed the result into the indexer. This index can then be stored in the regular fashion. B. Having UDFs returning an array/set There are three approaches that I can think of: 1. Supporting array/set as native datatype inside MariaDB (like int64, double, etc) - This might be hard and touches all levels of stack. 2. Have the array/set pass in serialized form to the above node of query execution and have appropriate deserializer when we want to interpret the result - Coming up with ser/deser strategy might be tough and this would be expensive too. 3. The query execution would be in a Tree structure where each node must be exposing functions like init(), next(), read(int col_index), etc. Maybe we can use this to emulate the evaluation of UDF against row. I think this is the suggestion that is listed in the project. I would like to get some direction on this from the mentors. I would like to discuss these and then decide on one of them. Am I approaching this in the right direction? Can you please point me to the next steps? Thanks Richa

Hi, Richa! On Mar 13, Richa Sehgal wrote:
Dear mentors,
I am currently pursuing Master's in University of Illinois at Urbana Champaign, USA and completed by B.Tech from Indian Institute of Technology - Delhi (IIT- Delhi)
This is regarding GSoC 2015. I am really interested in databases, and I was very excited to see all these projects listed here. The exciting part was that some of the projects are really “hard” as in they have challenged the database community since a long time, and thus it would be very interesting to solve some of these challenges as part of GSoC.
I want to discuss 2 projects:
A. Indexes on virtual columns
Materialization gives us two things:
1. A name to the column which we can use in queries
Not quite. Even non-materialized columns always have names. See https://mariadb.com/kb/en/mariadb/virtual-computed-columns/
2. A formal "regular" column which is stored and indexed in the regular fashion - Disadvantage: Extra memory requirements for the materialized column.
Exactly.
My initial thoughts on this project are the following:
We do need the name of the column which can be used to query. So maybe we can expose a command such as:
create virtual_index <name> on <column_name> <expression>
What this would do would run a regular query which evaluates expressions (like in WHERE clause) and the feed the result into the indexer. This index can then be stored in the regular fashion.
Yes, in a sense. The update_virtual_fields() function calculates values of virtual columns (e.g. if one does SELECT * from table_with_virtual columns). In MyISAM, _mi_make_key() creates a key from the row image. Pure virtual columns are not in the row image, so they cannot be indexed that way. So, in a very simplified way, this project is about calling update_virtual_fields() from _mi_make_key().
B. Having UDFs returning an array/set
There are three approaches that I can think of: ... I would like to discuss these and then decide on one of them. Am I approaching this in the right direction? Can you please point me to the next steps?
I'm afraid, the project was badly formulated. I've rewritten it now to highlight what it is really about. It is not about arrays, the important part is a "result set". See MDEV-5199 Regards, Sergei

Thanks Sergei for the explanation. The modified project for MDEV-5199 (Table Functions) now makes sense - as in the proposals mentioned in the ticket as possible solutions now looks appropriate :) I am looking into the index of virtual columns. One question: can we submit proposals for 2 projects at MariaDB? Now as for Table Functions: As mentioned there are two ways: 1. Materializing the table - this has a disadvantage for extra space. 2. Treating the UDF result as an "abstract table". For eg, the code might have a base class called Table. Then we can create a derived class called UDFTable and pass this instance in the regular query processing. This class would implement all functionalities like Init(), Next(), Read(int col_index), Seek(), etc Consider: Select A where A.x > 10 FROM UDF_TABLE In this case, we will first construct the table UDF_TABLE through regular query - evaluation of the UDF. Then we can create an instance of UDFTable from the table and run normal query processing on this. Is this the right approach? Can you send me the next steps? Thanks Richa On Fri, Mar 13, 2015 at 4:23 AM, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Richa!
Dear mentors,
I am currently pursuing Master's in University of Illinois at Urbana Champaign, USA and completed by B.Tech from Indian Institute of Technology - Delhi (IIT- Delhi)
This is regarding GSoC 2015. I am really interested in databases, and I was very excited to see all these projects listed here. The exciting part was that some of the projects are really “hard” as in they have challenged
On Mar 13, Richa Sehgal wrote: the
database community since a long time, and thus it would be very interesting to solve some of these challenges as part of GSoC.
I want to discuss 2 projects:
A. Indexes on virtual columns
Materialization gives us two things:
1. A name to the column which we can use in queries
Not quite. Even non-materialized columns always have names. See https://mariadb.com/kb/en/mariadb/virtual-computed-columns/
2. A formal "regular" column which is stored and indexed in the regular fashion - Disadvantage: Extra memory requirements for the materialized column.
Exactly.
My initial thoughts on this project are the following:
We do need the name of the column which can be used to query. So maybe we can expose a command such as:
create virtual_index <name> on <column_name> <expression>
What this would do would run a regular query which evaluates expressions (like in WHERE clause) and the feed the result into the indexer. This index can then be stored in the regular fashion.
Yes, in a sense.
The update_virtual_fields() function calculates values of virtual columns (e.g. if one does SELECT * from table_with_virtual columns).
In MyISAM, _mi_make_key() creates a key from the row image.
Pure virtual columns are not in the row image, so they cannot be indexed that way.
So, in a very simplified way, this project is about calling update_virtual_fields() from _mi_make_key().
B. Having UDFs returning an array/set
There are three approaches that I can think of: ... I would like to discuss these and then decide on one of them. Am I approaching this in the right direction? Can you please point me to the next steps?
I'm afraid, the project was badly formulated.
I've rewritten it now to highlight what it is really about. It is not about arrays, the important part is a "result set". See MDEV-5199
Regards, Sergei

Hi, Richa! On Mar 14, Richa Sehgal wrote:
Thanks Sergei for the explanation. The modified project for MDEV-5199 (Table Functions) now makes sense - as in the proposals mentioned in the ticket as possible solutions now looks appropriate :)
I am looking into the index of virtual columns. One question: can we submit proposals for 2 projects at MariaDB?
We don't mind, but rules are set by Google, if they allow it - you can, otherwise you cannot.
Now as for Table Functions:
As mentioned there are two ways: 1. Materializing the table - this has a disadvantage for extra space. 2. Treating the UDF result as an "abstract table". For eg, the code might have a base class called Table. Then we can create a derived class called UDFTable and pass this instance in the regular query processing. This class would implement all functionalities like Init(), Next(), Read(int col_index), Seek(), etc Consider: Select A where A.x > 10 FROM UDF_TABLE
In this case, we will first construct the table UDF_TABLE through regular query - evaluation of the UDF. Then we can create an instance of UDFTable from the table and run normal query processing on this.
Is this the right approach? Can you send me the next steps?
the "abstract table" that can do Next, Read, Seek, etc in MariaDB is represented by the "handler" class. All real storage engines inherit from the handler class. See the SEQUENCE engine, it's in the storage/sequence/sequence.cc This is, exactly, the engine that generates data on the fly, so is a table function, in a sense. But it's a hard-coded function and the way of passing arguments is not very nice. Still, it'll give you an idea what a table function handler might look like. If you want an even simpler storage engines - there are some in our book, "MySQL 5.1 Plugin Development". Regards, Sergei

Dear Sergei and Maria community, Finally I have settled on one project: Indexing over virtual columns. I spent the entire day today trying to compile MariaDB in Ubuntu and reading the code. On reading code side, I spent some time in the /storage/myiasm/ : read fiels like mi_key.cc, mi_create.cc, ha_myisam and so on. Took a brief overview arounf _mi_make_key function. I also looked around update_virtual_fields() function. The code path is still not very clear to me - especially the transition point between sql folders to /storage folders. I chatted on #maria IRC and they gave an idea to run Maria in gdb mode. I tried compiling the code it was almost successful! At the very final step, I got the following error: [ 98%] Built target thr_lock Linking CXX shared library libmysqlclient.so /usr/bin/ld: error: /home/shikhar/mariadb/maria/trunk/libmysql/libmysql_versions.ld:155:9: invalid use of VERSION in input file collect2: ld returned 1 exit status make[2]: *** [libmysql/libmysqlclient.so.18.0.0] Error 1 make[1]: *** [libmysql/CMakeFiles/libmysql.dir/all] Error 2 make: *** [all] Error 2 I am not able to proceed now. I tred googling around, but could not find the fix. Can you please guide me in my next steps? Thanks Richa On Sat, Mar 14, 2015 at 10:46 AM, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Richa!
On Mar 14, Richa Sehgal wrote:
Thanks Sergei for the explanation. The modified project for MDEV-5199 (Table Functions) now makes sense - as in the proposals mentioned in the ticket as possible solutions now looks appropriate :)
I am looking into the index of virtual columns. One question: can we submit proposals for 2 projects at MariaDB?
We don't mind, but rules are set by Google, if they allow it - you can, otherwise you cannot.
Now as for Table Functions:
As mentioned there are two ways: 1. Materializing the table - this has a disadvantage for extra space. 2. Treating the UDF result as an "abstract table". For eg, the code might have a base class called Table. Then we can create a derived class called UDFTable and pass this instance in the regular query processing. This class would implement all functionalities like Init(), Next(), Read(int col_index), Seek(), etc Consider: Select A where A.x > 10 FROM UDF_TABLE
In this case, we will first construct the table UDF_TABLE through regular query - evaluation of the UDF. Then we can create an instance of UDFTable from the table and run normal query processing on this.
Is this the right approach? Can you send me the next steps?
the "abstract table" that can do Next, Read, Seek, etc in MariaDB is represented by the "handler" class. All real storage engines inherit from the handler class.
See the SEQUENCE engine, it's in the storage/sequence/sequence.cc This is, exactly, the engine that generates data on the fly, so is a table function, in a sense. But it's a hard-coded function and the way of passing arguments is not very nice. Still, it'll give you an idea what a table function handler might look like.
If you want an even simpler storage engines - there are some in our book, "MySQL 5.1 Plugin Development".
Regards, Sergei

Hi, Richa! On Mar 15, Richa Sehgal wrote:
Dear Sergei and Maria community,
Finally I have settled on one project: Indexing over virtual columns.
I spent the entire day today trying to compile MariaDB in Ubuntu and reading the code. On reading code side, I spent some time in the /storage/myiasm/ : read fiels like mi_key.cc, mi_create.cc, ha_myisam and so on. Took a brief overview arounf _mi_make_key function. I also looked around update_virtual_fields() function.
The code path is still not very clear to me - especially the transition point between sql folders to /storage folders. I chatted on #maria IRC and they gave an idea to run Maria in gdb mode. I tried compiling the code it was almost successful! At the very final step, I got the following error:
[ 98%] Built target thr_lock Linking CXX shared library libmysqlclient.so /usr/bin/ld: error: /home/shikhar/mariadb/maria/trunk/libmysql/libmysql_versions.ld:155:9: invalid use of VERSION in input file collect2: ld returned 1 exit status make[2]: *** [libmysql/libmysqlclient.so.18.0.0] Error 1 make[1]: *** [libmysql/CMakeFiles/libmysql.dir/all] Error 2 make: *** [all] Error 2
I am not able to proceed now. I tred googling around, but could not find the fix. Can you please guide me in my next steps?
I've just tried to google for "mariadb invalid use of VERSION in input file" and the very first match was the one that explains everything. This is a bug in gold linker, see https://sourceware.org/bugzilla/show_bug.cgi?id=16895 and https://mariadb.atlassian.net/browse/MDEV-5982 Use ld as a workaround. Regards, Sergei

On Sun, Mar 15, 2015 at 7:48 AM, Richa Sehgal <richasehgal2908@gmail.com> wrote:
Dear Sergei and Maria community,
Finally I have settled on one project: Indexing over virtual columns.
I spent the entire day today trying to compile MariaDB in Ubuntu and reading the code. On reading code side, I spent some time in the /storage/myiasm/ : read fiels like mi_key.cc, mi_create.cc, ha_myisam and so on. Took a brief overview arounf _mi_make_key function. I also looked around update_virtual_fields() function.
The code path is still not very clear to me - especially the transition point between sql folders to /storage folders. I chatted on #maria IRC and they gave an idea to run Maria in gdb mode. I tried compiling the code it was almost successful! At the very final step, I got the following error:
[ 98%] Built target thr_lock Linking CXX shared library libmysqlclient.so /usr/bin/ld: error: /home/shikhar/mariadb/maria/trunk/libmysql/libmysql_versions.ld:155:9: invalid use of VERSION in input file collect2: ld returned 1 exit status make[2]: *** [libmysql/libmysqlclient.so.18.0.0] Error 1 make[1]: *** [libmysql/CMakeFiles/libmysql.dir/all] Error 2 make: *** [all] Error 2
I am not able to proceed now. I tred googling around, but could not find the fix. Can you please guide me in my next steps?
I think you hit the ld/gold issue. https://mariadb.atlassian.net/browse/MDEV-5982 Best, Nirbhay
Thanks Richa
On Sat, Mar 14, 2015 at 10:46 AM, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Richa!
On Mar 14, Richa Sehgal wrote:
Thanks Sergei for the explanation. The modified project for MDEV-5199 (Table Functions) now makes sense - as in the proposals mentioned in the ticket as possible solutions now looks appropriate :)
I am looking into the index of virtual columns. One question: can we submit proposals for 2 projects at MariaDB?
We don't mind, but rules are set by Google, if they allow it - you can, otherwise you cannot.
Now as for Table Functions:
As mentioned there are two ways: 1. Materializing the table - this has a disadvantage for extra space. 2. Treating the UDF result as an "abstract table". For eg, the code might have a base class called Table. Then we can create a derived class called UDFTable and pass this instance in the regular query processing. This class would implement all functionalities like Init(), Next(), Read(int col_index), Seek(), etc Consider: Select A where A.x > 10 FROM UDF_TABLE
In this case, we will first construct the table UDF_TABLE through regular query - evaluation of the UDF. Then we can create an instance of UDFTable from the table and run normal query processing on this.
Is this the right approach? Can you send me the next steps?
the "abstract table" that can do Next, Read, Seek, etc in MariaDB is represented by the "handler" class. All real storage engines inherit from the handler class.
See the SEQUENCE engine, it's in the storage/sequence/sequence.cc This is, exactly, the engine that generates data on the fly, so is a table function, in a sense. But it's a hard-coded function and the way of passing arguments is not very nice. Still, it'll give you an idea what a table function handler might look like.
If you want an even simpler storage engines - there are some in our book, "MySQL 5.1 Plugin Development".
Regards, Sergei
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp

Thanks Sergei and Nirbhay! I have managed to install Maria. Then I wanted to start working on understanding virtual columns and index in the code, and decided to run it in gdb. That was pretty tricky until I saw the page for ptrace() permission on Ubuntu. The knowledge base is pretty good :) I put a breakpoint in the functions _mi_make_keys() and update_virtual_fields(). I then ran sql commands on the mariadb terminal to create tables with virtual persistent columns and also inserted values. At each of these, my breakpoint in the function update_virtual_fields was hit, but not at _mi_make_keys(). I Googled a bit, and saw this: MyISAM was the default storage engine from MySQL 3.23 until it was replaced by InnoDB in MariaDB and MySQL 5.5. on this page: https://mariadb.com/kb/en/mariadb/myisam-storage-engine/ I guess that MyISAM is no longer used by default and that's why breakpoint in _mi_make_keys was never hit. So I have two questions to proceed with this GSoC project: 1. In the GSoC, would we still concentrate on MyISAM? If not, can you please guide me and give me the function that constructs the key in InnoDB code? 2. I tried specifying the engine = myiasm on the terminal using this link on mysql http://dev.mysql.com/doc/refman/5.1/en/storage-engine-setting.html But this didn't help. I believe we have to change some setting in the config file? I request you to guide me to the next step. From my side, I am working hard to understand the code base and playing with it in GDB and through terminal. Thanks Richa On Sun, Mar 15, 2015 at 2:31 PM, Nirbhay Choubey <nirbhay@mariadb.com> wrote:
On Sun, Mar 15, 2015 at 7:48 AM, Richa Sehgal <richasehgal2908@gmail.com> wrote:
Dear Sergei and Maria community,
Finally I have settled on one project: Indexing over virtual columns.
I spent the entire day today trying to compile MariaDB in Ubuntu and reading the code. On reading code side, I spent some time in the /storage/myiasm/ : read fiels like mi_key.cc, mi_create.cc, ha_myisam and so on. Took a brief overview arounf _mi_make_key function. I also looked around update_virtual_fields() function.
The code path is still not very clear to me - especially the transition point between sql folders to /storage folders. I chatted on #maria IRC and they gave an idea to run Maria in gdb mode. I tried compiling the code it was almost successful! At the very final step, I got the following error:
[ 98%] Built target thr_lock Linking CXX shared library libmysqlclient.so /usr/bin/ld: error: /home/shikhar/mariadb/maria/trunk/libmysql/libmysql_versions.ld:155:9: invalid use of VERSION in input file collect2: ld returned 1 exit status make[2]: *** [libmysql/libmysqlclient.so.18.0.0] Error 1 make[1]: *** [libmysql/CMakeFiles/libmysql.dir/all] Error 2 make: *** [all] Error 2
I am not able to proceed now. I tred googling around, but could not find the fix. Can you please guide me in my next steps?
I think you hit the ld/gold issue. https://mariadb.atlassian.net/browse/MDEV-5982
Best, Nirbhay
Thanks Richa
On Sat, Mar 14, 2015 at 10:46 AM, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Richa!
Thanks Sergei for the explanation. The modified project for MDEV-5199 (Table Functions) now makes sense - as in the proposals mentioned in
On Mar 14, Richa Sehgal wrote: the
ticket as possible solutions now looks appropriate :)
I am looking into the index of virtual columns. One question: can we submit proposals for 2 projects at MariaDB?
We don't mind, but rules are set by Google, if they allow it - you can, otherwise you cannot.
Now as for Table Functions:
As mentioned there are two ways: 1. Materializing the table - this has a disadvantage for extra space. 2. Treating the UDF result as an "abstract table". For eg, the code might have a base class called Table. Then we can create a derived class called UDFTable and pass this instance in the regular query processing. This class would implement all functionalities like Init(), Next(), Read(int col_index), Seek(), etc Consider: Select A where A.x > 10 FROM UDF_TABLE
In this case, we will first construct the table UDF_TABLE through regular query - evaluation of the UDF. Then we can create an instance of UDFTable from the table and run normal query processing on this.
Is this the right approach? Can you send me the next steps?
the "abstract table" that can do Next, Read, Seek, etc in MariaDB is represented by the "handler" class. All real storage engines inherit from the handler class.
See the SEQUENCE engine, it's in the storage/sequence/sequence.cc This is, exactly, the engine that generates data on the fly, so is a table function, in a sense. But it's a hard-coded function and the way of passing arguments is not very nice. Still, it'll give you an idea what a table function handler might look like.
If you want an even simpler storage engines - there are some in our book, "MySQL 5.1 Plugin Development".
Regards, Sergei
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp

Hi, Richa! On Mar 16, Richa Sehgal wrote:
1. In the GSoC, would we still concentrate on MyISAM? If not, can you please guide me and give me the function that constructs the key in InnoDB code?
No, and no. The idea is to cover as many engines as possible, but at least MyISAM, Aria, InnoDB, and XtraDB. I thought that MyISAM will be the one to start from, because it's much simpler and I know it much better than InnoDB. Regards, Sergei

Hi, I have submitted an initial draft of my proposal covering the project details on Melange. I would be completing it soon, but it would be good to get feedback from the mentors in the initial stages itself. Please let me know. Thanks Richa On Tue, Mar 17, 2015 at 1:19 AM, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Richa!
On Mar 16, Richa Sehgal wrote:
1. In the GSoC, would we still concentrate on MyISAM? If not, can you please guide me and give me the function that constructs the key in
InnoDB
code?
No, and no. The idea is to cover as many engines as possible, but at least MyISAM, Aria, InnoDB, and XtraDB.
I thought that MyISAM will be the one to start from, because it's much simpler and I know it much better than InnoDB.
Regards, Sergei
participants (3)
-
Nirbhay Choubey
-
Richa Sehgal
-
Sergei Golubchik