On 9 May 2009, at 10:22, Jim Idle wrote:
Antony T Curtis wrote:
On 9 May 2009, at 00:35, Michael Widenius wrote:
jim> Indeed. Obviously syntactical changes are just a slog really, and
jim> most of that is ensuring that there is enough coverage in the
jim> regression tests to exercise all the syntactic differences between
jim> T-SQL and MySQL, or rather, all the syntax of MySQL. A lot of
jim> commonality, but still a fair bit of work. What is the definitive
jim> language reference for MySQL, Marina? I can locate reference
jim> manuals and ANSI specs of course, but if you have one particular
jim> document in mind, then it would be useful to work off that so I
jim> can evaluate the syntactical changes I would need to make.
Unfortunately there is no other spec than the sql_yacc.yy file that
comes with the MySQL source code.
I have a tool somewhere which can build .dot files from the .yy
files, in the same way that ANTLRWorks draws its syntax diagrams...
Copies of the tool exist somewhere on the old MySQL internal
repositories but I am sure I have a backup copy somewhere.
Yes, I added this ability to ANTLR C runtime too. You can see the
effect here:
http://www.temporal-wave.com/index.php?option=com_psrrun&view=psrrun&Itemid=56
What I meant was... feed in the sql_yacc.yy file and output a
human-understandable syntax diagram.
jim>
jim> I would like to do this in two steps:
jim>
jim> a) Replace the sql_yacc.yy file, with as few changes as possible
jim> in the rest of the MySQL source.
jim>
jim>
jim> ANTLR is essentially the same gestalt as bison: one or more
jim> source files define the grammar and contain actions in C,
jim> triggered at the parse points. These grammar definition files are
jim> converted into C source by invoking an external program (in the
jim> case of ANTLR it is a Java program), and the resulting .c/.h
jim> files are compiled with the rest of the code.
jim>
jim> I wrote the C output to be machine independent. The idea being
jim> that you can generate the .c files on one machine, check that
jim> source code in and use it on any other machine without
jim> re-generating. This can be useful when targeting systems without
jim> Java machines (or ones that don't work very well). So, like many
jim> other systems, you can have a 'build completely from scratch'
jim> option and a 'use pre-generated files' option.
jim>
jim> In the main then, I suspect that the bigger changes to the MySQL
jim> base would be build changes. I say this based upon the assumption
jim> that any new parser should build the same AST for a query as the
jim> yacc-based version does. This means the new parser makes the same
jim> calls as the old one, and thus the code external to the parser
jim> would probably require little, if anything, in the way of change
jim> (but see b below).
That sounds very good.
It is not possible to build the same AST as the yacc/bison-based
parser, because there is no AST: the execution structures are built
directly during the parsing phase.
Sure - for "AST", read "execution structures". There didn't really
seem to be any settled nomenclature, but when I discussed this with
some people at Sun a while back they used the terms AST and tree, so
I just copied them :-) So I think I would use the AST I already
build to generate the execution structures that are built currently.
Having an AST means that there is a structure that can later be used
for optimization (should that prove any better than optimizing the
current execution structures).
For a few of us, AST != execution structures... mostly because the
execution structures are very much not "abstract" and were dictated
by historical and implementation issues.
I argued years ago that a good AST would allow better optimization
later, cheaply.
For example, right now, there is a lot of work to try to simplify the
Item trees for boolean expressions. This work would be better done
with ASTs without wasting as much memory.
Agree
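The kind of rewrite being discussed can be sketched in a few lines of
C. This is purely illustrative: the node and function names are
invented and bear no relation to MySQL's actual Item classes; it only
shows why a small abstract tree makes rewrites such as double-negation
elimination and constant folding cheap.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical AST node for boolean expressions; names are
   illustrative, not MySQL's Item hierarchy. */
typedef enum { B_CONST, B_VAR, B_NOT, B_AND } BoolKind;

typedef struct BoolNode {
    BoolKind kind;
    int value;                 /* for B_CONST: 0 or 1 */
    struct BoolNode *l, *r;    /* child for B_NOT (l), children for B_AND */
} BoolNode;

static BoolNode *mk(BoolKind k, int v, BoolNode *l, BoolNode *r) {
    BoolNode *n = malloc(sizeof *n);
    n->kind = k; n->value = v; n->l = l; n->r = r;
    return n;
}

/* One bottom-up pass: NOT(NOT x) => x, AND(TRUE, x) => x,
   AND(FALSE, x) => FALSE. Each rewrite is a local pointer swap. */
static BoolNode *simplify(BoolNode *n) {
    if (!n) return NULL;
    n->l = simplify(n->l);
    n->r = simplify(n->r);
    if (n->kind == B_NOT && n->l->kind == B_NOT)
        return n->l->l;                        /* double negation */
    if (n->kind == B_AND) {
        if (n->l->kind == B_CONST)
            return n->l->value ? n->r : n->l;  /* identity / absorption */
        if (n->r->kind == B_CONST)
            return n->r->value ? n->l : n->r;
    }
    return n;
}
```

Because the tree is only nodes and pointers, the same transformation
on fully initialized execution structures would mean building and
tearing down much heavier objects, which is the memory cost mentioned
above.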
You will have to implement an all-new AST which then can be used to
'compile' the execution structures.
(this was a minor sticking part of an earlier attempt to replace
the parser - there were objections to building an AST and adding an
additional stage to the execution)
Yeah - I just wrote that before I read this paragraph ;-) As I
already have this structure, it can be shown to add pretty much zero
overhead compared to query optimization and execution time. A good
AST mostly simplifies rather than complicates things, of course.
Whether my current AST needs reshaping to optimally generate MySQL's
structures remains to be seen, but it is easy to change if needed.
++Agree
The only tricky issue is to try to generate exactly the same
structures; but in some instances later, there will be scope for
actually creating simpler or more generic structures than what MySQL
currently generates. That should be easy to define.
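The extra "compile" pass proposed in this thread (the parser builds an
AST, then a separate walk emits what the executor consumes) can be
shown with a toy example in C. All names here are invented; MySQL's
real execution structures (Item trees, JOIN, TABLE_LIST) are far
richer. The point is only the separation of stages.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy AST: numbers and additions. */
typedef enum { A_NUM, A_ADD } AstKind;

typedef struct Ast {
    AstKind kind;
    int num;                /* for A_NUM */
    struct Ast *l, *r;      /* for A_ADD */
} Ast;

/* Toy "execution structure": a flat postfix program the executor
   can run without ever looking at the tree again. */
typedef struct { int ops[32]; int vals[32]; int n; } ExecProg;

/* The compile step: walk the AST once, emit the flat program. */
static void compile_ast(const Ast *a, ExecProg *p) {
    if (a->kind == A_NUM) {
        p->ops[p->n] = A_NUM;
        p->vals[p->n] = a->num;
        p->n++;
    } else {
        compile_ast(a->l, p);
        compile_ast(a->r, p);
        p->ops[p->n++] = A_ADD;
    }
}

/* The executor consumes only the compiled program. */
static int run(const ExecProg *p) {
    int stack[32], sp = 0;
    for (int i = 0; i < p->n; i++) {
        if (p->ops[i] == A_NUM) stack[sp++] = p->vals[i];
        else { sp--; stack[sp - 1] += stack[sp]; }
    }
    return stack[0];
}
```

Swapping what compile_ast emits is exactly the freedom discussed
above: the same AST could later be compiled into simpler or more
generic structures without touching the grammar.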
The MySQL code doesn't rely on bison structures at all (as far as
I know).
AFAIK, MySQL does not use any Bison internals as such.
Yes - I could not see any reliance myself.
In a past project from many years ago, I put into MySQL a PCCTS
(pre-ANTLR) LL parser, basically keeping MySQL's custom lexer but
having it generate Token objects suitable for the PCCTS-generated
parser instead. The project was never completed because integration
of Stored Procedures became the objective for MySQL 5.0 instead of
implementation of a new parser.
I have asked Stewart Smith to see if he can help make public that
old source repository as a Bazaar repository as people may be able
to learn from it how to replace the parser without major pain.
Sure - any prior work obviously helps, but I think that the lexing
is already good. I might make changes to keyword detection to become
MySQL-friendly; T-SQL is historically a bit unsure about exactly
when keywords can be used. This would reduce the DFA sizes for the
parser.
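One way keyword detection can be made friendlier is sketched below,
under invented names: the lexer tags each token with a keyword code
but leaves non-reserved keywords usable as identifiers, so the
grammar does not need a separate alternative for every keyword, which
is what inflates the parser's decision (DFA) tables.

```c
#include <assert.h>
#include <strings.h>   /* strcasecmp (POSIX) */
#include <stddef.h>

/* Sketch of context-friendly keyword detection; all names are
   invented, not MySQL's actual lexer symbols. */
typedef enum { KW_NONE, KW_SELECT, KW_COMMENT } KwCode;

typedef struct {
    const char *text;
    KwCode kw;        /* keyword code, if any */
    int reserved;     /* 1 = may never be used as an identifier */
} Token;

static const struct { const char *word; KwCode kw; int reserved; } kwtab[] = {
    { "SELECT",  KW_SELECT,  1 },   /* fully reserved */
    { "COMMENT", KW_COMMENT, 0 },   /* non-reserved in MySQL */
};

/* Lexer-side classification: tag the token, do not hard-reserve it. */
static Token classify(const char *text) {
    Token t = { text, KW_NONE, 0 };
    for (size_t i = 0; i < sizeof kwtab / sizeof kwtab[0]; i++)
        if (strcasecmp(text, kwtab[i].word) == 0) {
            t.kw = kwtab[i].kw;
            t.reserved = kwtab[i].reserved;
            break;
        }
    return t;
}

/* Parser-side query: may this token serve as an identifier here? */
static int usable_as_ident(const Token *t) {
    return t->kw == KW_NONE || !t->reserved;
}
```

With this split, the decision "is COMMENT a keyword or a column name
here?" is a single predicate in the relevant rules rather than extra
alternatives threaded through the whole grammar.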
Did you get time to do the analysis?
jim>
jim> A few other things that may be of note:
jim>
jim> 1) The ANTLR generated code is free-threaded so long as any
jim> action code that you place within the parser is also
jim> free-threaded.
In other words, same as with bison
The ANTLR runtime does, of course, require Java.
Thus, this would be my C runtime, as Java is way too slow for this.
There is no Java required at runtime, only for developers to
generate the C.
So developers who wish to extend the grammar do have to have the
Java Runtime installed as well as the ANTLR jars.
For the grammar, yes, as the ANTLR tool runs in Java. In practice
this isn't really a problem, I think?
Yeah. When I wrote "ANTLR runtime", I meant "ANTLR compiler".
I too am interested to help if at all possible.
One thing that this would need would be a good regression testing
suite. Right now I have 1300 .sql files that exercise the
lexer/parser/AST, but there would need to be more of these, plus
runtime testing and so on.
There will have to be proper unit tests to show that a certain AST
builds the correct required MySQL execution structures (Item trees,
JOIN, TABLE_LIST etc etc).
The best part of such an effort would be "At last! A formal
specification of what the execution structures should look like".
It would be exciting and cool if we could (at last) have a sane
parser... especially as LL parsers generally produce much nicer
error messages for the users.
Regards,
Antony.