On 9 May 2009, at 00:35, Michael Widenius wrote:
Hi!
I fixed the maria-developers address so that everyone else can follow and participate in the discussion.
jim> jim> Monty, jim> jim> jim> jim> My T-SQL parser is now complete and a number of people are using jim> jim> the Java, C# and C version. Would you like to have a look at it? jim> jim> Monty> Yes, I would be intrested to know more about this! jim> jim> OK - I have my demo online now. If you visit http://www.temporal-wave.com jim> jim> and take the Online Demos->T-SQL menu option you can enter T-SQL jim> batches and see the parse results. Over the next day or so, I am jim> adding demos of my C# and VB.Net parsers, so if the site goes up and jim> down then just wait a few minutes and try again :-) Once I have jim> evaluated some of the things you mention below then I will let you jim> have a scan at the source code.
ok. I will take a look as soon as I have got through the emails that has batched up during the last 3 weeks.
jim> jim> I think that I could steal a lot of common syntax from this for a jim> new MySQL parser. There are 1100 regression tests culled from the jim> T-SQL language spec and user examples; on my Linux system the parse jim> time, including building and walking the AST is: jim> jim> jim> jim> jimi(50_64)-clean: wc -l *.sql jim> jim> ... jim> jim> 12923 total jim> jim> jim> jim> jimi(50_64)-clean: time tsqlc . >out jim> jim> 0.15user 0.19system 0:00.34elapsed 100%CPU (0avgtext +0avgdata jim> 0maxresident)k jim> jim> 0inputs+232outputs (0major+102772minor)pagefaults 0swaps jim> jim> jim> Which includes the time taken to read each of the 1100 files from jim> the file system and to print a message for each saying the parse was jim> successful of course. jim> jim> Anyway, if you want to take a look at the code, just let me know :-) jim> jim> Monty> The big question is how much work it would be to jim> replace the MySQLparser with your T-SQL parser.
jim> Indeed. Obviously syntactical changes are just slog really and jim> most of that is ensuring that there is enough coverage in regression jim> tests to exercise all the syntactic differences between T-SQL and jim> MySql, or rather, all the syntax of MySQL. A lot of commonality but jim> still a fair bit of work. What is the definitive language reference jim> for MySQL Marina? I can locate reference manuals and ANSI specs of jim> course, but if you have one particular document in mind, then it would jim> be useful to work off that so I can evaluate the syntactical changes I jim> would need to make.
Unfortunately there is no other spec than the sql_yacc.yy file that comes with the MySQL source code.
I have a tool somewhere which can build .dot files from the .yy files in the same way how ANTLRworks draws its syntax diagrams.... Copies of the tool exist somewhere on the old MySQL internal repositories but I am sure I have a backup copy somewhere.
jim> jim> I would like to do this in two steps: jim> jim> a) Replace the mysql_yacc.yy file, with as few changes as possible in jim> the rest of the MySQL source. jim> jim> jim> ANTLR is essentially the same gestalt as bison: one or more source files define the grammar and contain actions in C, triggered at the parse points; These grammar definition files are converted into C source by invoking an external program (in the case of ANTLR it is a Java program) and the resulting .c/.h files are compiled with the rest of the code. jim> jim> I wrote the C output to be machine independent. The idea being that you can generate the .c files on one machine, check that source code in and use it on any other machine without re-generating. This can be useful when targeting systems without Java machines (or ones that don't work very well). So, like many other systems you can have a 'build completely from scratch' option and a 'use pre-generated files option'. jim> jim> In the main then, I suspect that the bigger changes to the MySQL base would be build changes. I say this based upon the assumption that any new parser should build the same AST for a query as the yacc based version does. This means the new parser makes the same calls as the old one and thus the code external to the parser would probably require little, if anything in the way of change (but see b below).
That sounds very good.
Not possible to build the same AST as the yacc/bison based parser because there is no AST. The execution structures are directly built during the parsing phase.
jim> b) Optimze MySQL to be even faster, with the new parser at a base. jim> jim> Obviously there are two aspects, being the speed of the parser itself, which I don't think will be an issue, and then what kind of transformations are done on the resulting AST for optimization purposes. If we use exactly the same AST as MySQL does now, then improving MySQL performance is the same problem domain as it is now. However, perhaps there is opportunity for a new parser to provide more information about the query that would aid optimization. Possibilities here include modifications to the AST, external information structures that current code can use when walking the AST, or even a completely new AST. The latter would probably have quite an effect on the front end of the parse->AST->execution phase of course, so I would think that the first step is to try to achieve your step a with the existing AST as output.
Agree
You will have to implement an all-new AST which then can be used to 'compile' the execution structures. (this was a minor sticking part of an earlier attempt to replace the parser - there were objections to building an AST and adding an additional stage to the execution)
jim> Of course maintaining and changing an ANTLR based grammar is somewhat easier than changing bison/yacc, if for no other reason than ANTLR doesn't need much in the way of code assists for correct parsing, whereas Bison quite often does.
jim> The a) option is quite important to be able to do easy merges with the jim> main MySQL tree. jim> jim> I have time this week to take a look at this. Mechanically I know it is easy enough, but I need to look at the calls that are made from the current yacc/bison parser and see if there is anything in there that would be awkward. The only things I could think of would be if the existing code relies on internal structures of bison rather than more generic structures. That should be easy to define.
The MySQL code doesn't rely on bison structures at all (as far as I know).
AFAIK, MySQL does not use any Bison internals as such. A past project from many years ago, I put into MySQL a PCCTS (pre-ANTLR) LL parser, basically keeping MySQL's custom lexer but it generated Token objects instead suitable for the PCCTS generated parser. The project was never completed because integration of Stored Procedures became the objective for MySQL 5.0 instead of implementation of a new parser. I have asked Stewart Smith to see if he can help make public that old source repository as a Bazaar repository as people may be able to learn from it how to replace the parser without major pain.
Did you get time to do the analyse?
jim> jim> A few other things that may be of note: jim> jim> 1) The ANTLR generated code is free threading so long as any action code that you place withing the parser is also free threading.
In other words, same as with bison
The ANTLR runtime does, of course, require Java. So developers who wish to extend the grammar do have to have the Java Runtime installed as well as the ANTLR jars.
jim> 2) The only system calls it uses is {m|c|re}alloc/free, which the C runtime component can re-direct (say to the MySQL overrides). For completely free threading, those calls also need to be free threading and so they are the only limit to threading of course.
Sounds perfect.
I would *really* like to see a new parser for MariaDB. Anything that we can do to help get this done ?
I too am interested to help if at all possible. Regards, Antony,