ANTLR common pitfalls
Personally I think ANTLR is a great tool, It has a steep learning curve and it has a few quirks. I hope the description here will help you find your problem, understand it and help you fix the issue.
Recognition problems
Recognizes number as ‘1234’ but not as ‘1’:
Example:
grammar number; number : INT; DIGIT : '0'..'9'; INT : DIGIT+;
Explanation, the input such as ‘1’ or ‘4’ is just one char, it will be recognized as a ‘DIGIT’, not as an ‘INT’. You have two options :
– delete the DIGIT rule, and rewrite the INT rule as ” INT:’0′..’9′; ” (works)
– place the ‘fragment’ keyword in front of DIGIT. DIGIT will not be seen as a token.
Grammar Check Errors
The following token definitions are unreachable: INT (AntlrWorks 1.1.7)
The following token definitions can never be matched because prior tokens match the same input: INT (AntlrWorks 1.2) (AntlrWorks 1.3)
Example:
grammar number; DIGIT: '0'..'9'; INT : DIGIT;
What has INT to offer? As is it is a redundant rule. Probably you meant more than one number (thus matching 43) so make it INT : DIGIT+;
Another possibility:
factExpression: Fact fact; fact : ID; propertyExpression: fact Property property; property: ID; NEWLINE : '\r'?'\n'; WS : (' ' | '\t' | '\n' | '\r') { skip(); }; Fact : 'There is ' ARTICLE //added extra space since it needs to be there ; Property : 'has ' ARTICLE //added extra space since it needs to be there ; ID : ('a'..'z'|'A'..'Z')+; ARTICLE : ('a'|'an') ;
Here the error is “The following token…match the same input: ARTICLE”. ID can both match ‘a’ and ‘an’, but in this case ARTICLE is more important. Flip the ID and ARTICLE rule, and it’s fine.
More on Ambiguous rules.
syntax error: codegen: :0:0: unexpected end of subtree (AntlrWorks 1.1.7 & 1.2)
Example:
grammar number; //number : INT | FLOAT; DIGIT : '0'..'9'; INT : DIGIT+; FLOAT : DIGIT* '.' DIGIT+;
You are working with a mixed grammar (both LEXER and PARSER).
You did not include any parser rules, please do so. Uncomment the ‘number’ parser rule.
Another possibility: is that the last line of your grammar is comment, just move it.
*Updated 10-nov-2009 : Added internal link to explain more on Ambiguous rules.;
*Updated 13-oct-2009 : Syntax Highlighting;
*Updated 25-jan-2010 : Small updates
*Updated 17-ock-2014 : Ending support for antlr 3. Thank you, it was fun!
Antlr, AST and rewriting rules.
Onward with my little project, with writing a SVG renderer. It’s intended as a study to get to know ANTLR and SVG.
The good folks of W3.org are so kind to publish a BNF that can be rewritten to an AntLr Grammar.
The goal of this post is to interpret:
M115 285 C115 400 285 400 285 285 C400 285 400 115 285 115 C285 0 115 0 115 115 C0 115 0 285 115 285 z
Into a drawing of some sort.
After reading the BNF it’s clear that the letters stand for commands, and are usually followed by one or more coordinates. EG. ‘M’ is the move to command, and ‘C’ is a curve command, and so on.
The Grammar for SVG will be posted soon in another post. This post is about rewriting.
When using the grammar, that I need to post, the AST output looks like this. My first thought was “so I’ve done all this an this is what I get!?” The Definitive ANTLR Reference: Building Domain-Specific Languages (Pragmatic Programmers) showed me I can rewrite and restructure the tree.
Learn by a Simpler Example:
Let say we want to parse a C# function header in an interface, which kinda looks like:
public void foo(string prm1, out int val1);
the ANTLR grammar would look something like:
grammar functionheader; function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON; argument_list : argument (COMMA argument_list)? ; argument : argument_accessor? type argument_name ; argument_accessor : 'ref'|'out'; access_modifier : 'public'|'private'|'protected'|'internal'; return_type : type|'void'; type : 'string'|'int'|'float'|'double'|'long'; argument_name : NAME; function_name : NAME; //PARSER COMMA : ','; SEMICOLON : ';'; NAME : CHAR(DIGIT|CHAR)+; CHAR : ('a'..'z'|'A'..'Z'|'_'); DIGIT : ('0'..'9'); LPAREN : '('; RPAREN : ')'; WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C') { $channel = HIDDEN; } ;
Rewrite Rules, Omitting items
function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON;
the parser rule above contains some parts that are not really interesting at all. The LPAREN RPAREN and SEMICOLON are not really interesting at all. They are just there to help separate the different aspects from the function header. So when ANTLR has done it’s job there is no need for them any more and we can leave them out with rewriting rules. Thus the rewrite statement looks like:
function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON ->access_modifier? return_type function_name argument_list?;
Resulting in:
You can clearly see the symbols being omitted.
Rewrite Rules, Creating children
Next big rewrite option is to structure the tree in a tree you are comfortable with to walk/interpret in your own code. The symbol that is used is the ‘^’ sign. Also used in to describe the POW function.
There are two way’s of doing it. This first one is to put it in your parser rule like:
function_header : access_modifier? return_type function_name^ LPAREN argument_list? RPAREN SEMICOLON;
making the function_name the root^ (Do not use rewrites ANTLR will throw errors in your face).
The second method is to use it in your rewrite rule (my personal favorite) :
function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON -> ^(function_name access_modifier? return_type argument_list? )
so that is “^(root child child … child)”, you may also write “^(root child..child ^(subroot child..Child) child..child)”. etc..
Rewrite Rules, Adding some extra nodes
To make the AST tree just right, so your code can handle your AST much better you may need to insert some tokens. You have seen some rewriting rules, with those you can insert some ‘dummy’ nodes. Just tell ANTLR you need some extra tokens like this:
tokens { FUNCTION; NAME; MODIFIER; RETURNTYPE; PARAMS; TYPE; ACCESSER; } function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLUMN -> ^(FUNCTION ^(NAME function_name) ^(MODIFIER access_modifier)? ^(RETURNTYPE return_type) ^(PARAMS argument_list)? ) ;
note there is a difference in opt1=”^(PARAMS argument_list)?” and opt2=”^(PARAMS argument_list?)”
opt1 will ommit the PARAMS token from the tree when arument_list is empty. While opt2 will show a PARAMS token, but it won’t have children when the agument_list is empty.
here is a final listing and a sample output of the AST:
grammar functionheader; options { output=AST; } tokens { FUNCTION; NAME; MODIFIER; RETURNTYPE; PARAMS; TYPE; ACCESSER; } function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLUMN -> ^(FUNCTION ^(NAME function_name) ^(MODIFIER access_modifier)? ^(RETURNTYPE return_type) ^(PARAMS argument_list)? ) ; argument_list : argument (COMMA argument_list)? -> argument argument_list? ; argument : argument_accesser? type argument_name -> ^(argument_name (ACCESSER argument_accesser)? TYPE type ) ; argument_accesser : 'ref'|'out'; access_modifier : 'public'|'private'|'protected'|'internal'; return_type : type|'void'; type : 'string'|'int'|'float'|'double'|'long'; argument_name : NAME; function_name : NAME; //PARSER COMMA : ','; SEMICOLUMN : ';'; NAME : CHAR(DIGIT|CHAR)+; CHAR : ('a'..'z'|'A'..'Z'|'_'); DIGIT : ('0'..'9'); LPAREN : '('; RPAREN : ')'; WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C') { $channel = HIDDEN; } ;
this file contains the grammar.
Please, if you have questions or comments, let us know.
Simple Calculator with Antlr.
For quite some time I’ve been interested in creating scripting languages to make existing applications more flexible than thought possible. After writing a couple of parsers I’ve stumbled upon ANTLR (ANother Tool for Language Recognition) which is a Lexer and Parser all in one.
Although I’m aware of the existence of a couple of other tools, such as “yacc, lex,gold parser,.. ” this is the first tool I’m going to try out. I see a appliance to read svg files, and generate code (c#/Java/vb/ada) that will draw the figure.
Here is a preview of what this sample aims to:
But first things first, I longed to create a simple calculator, helping me to asses the possibilities of ANTLR, with a little help from :The Definitive ANTLR Reference: Building Domain-Specific Languages (Pragmatic Programmers) (Rather helpfull), I’ve managed the job quite well.
Since I’ve seen some questions on the Internet requesting a simple example for a simple calculator grammar, here it is:
grammar SimpleCalc; tokens { PLUS = '+' ; MINUS = '-' ; MULT = '*' ; DIV = '/' ; RPAREN = ')' ; LPAREN = '(' ; ASSIGN = '=' ; } /*---------------- * PARSER RULES *----------------*/ prog : stat+; stat : expr NEWLINE | ID ASSIGN expr NEWLINE | NEWLINE; //Do nothing expr : multExpr ((PLUS | MINUS )multExpr)*; multExpr: atom ((MULT | DIV) atom )*; atom : INT | ID | LPAREN expr RPAREN; /*---------------- * LEXER RULES *----------------*/ ID : ('a'..'z'|'A'..'Z')+; INT : '0'..'9'+; NEWLINE : '\r'?'\n'; WS : (' '|'\t'|'\n'|'\r')+;
The grammar above can handle expressions such as:
1+2*3
a=4-5
c=a*(6/2)
Now one has to add actions to the rules these will look like:
atom returns[int value] <==Tell Antlr this rule is implemented as a function, returning an int.
: INT {$value = int.Parse($INT.text);} <==The return value '$value' is assign the parsed as an integer string value of int.
Now all you need to do is fill every thing in. When an ID follow by an assignment (a=6) is encounter it will be placed in a Dictionary.
After adding the appropriate actions the calculator is finished, it works internally with an int, so all divisions are integer divisions!. Adaptation for doubles will be no problem at all, but is left to you.
grammar SimpleCalc3; options { language=CSharp; } tokens { PLUS = '+' ; MINUS = '-' ; MULT = '*' ; DIV = '/' ; RPAREN = ')' ; LPAREN = '(' ; ASSIGN = '=' ; } @members { public static System.Collections.Generic.Dictionary IDTable = new System.Collections.Generic.Dictionary(); public static void Main(string[] args) { Console.WriteLine("type '@' to quit..."); string line = ""; while (true) { line = Console.ReadLine(); if (line.Contains("@")) break; SimpleCalc3Lexer lex = new SimpleCalc3Lexer(new ANTLRStringStream(line+Environment.NewLine)); CommonTokenStream tokens = new CommonTokenStream(lex); SimpleCalc3Parser parser = new SimpleCalc3Parser(tokens); try { parser.prog(); } catch (RecognitionException e) { Console.Error.WriteLine(e.StackTrace); } } } } /*---------------- * PARSER RULES *----------------*/ prog : stat+; stat : expr NEWLINE {System.Console.WriteLine($expr.value);} | ID ASSIGN expr NEWLINE {IDTable[$ID.Text] =$expr.value;} | NEWLINE; //Do nothing expr returns[int value] : a=multExpr {$value = $a.value;} ( PLUS b=multExpr {$value+=$b.value;} | MINUS b=multExpr{$value-=$b.value;})*; multExpr returns[int value] : a=atom {$value = $a.value;} ( MULT b=atom {$value*=$b.value;} | DIV b=atom{$value/=$b.value;})*; atom returns[int value] : INT {$value = int.Parse($INT.text);} | ID {if (IDTable.ContainsKey($ID.Text)){$value = IDTable[$ID.Text];}else{System.Console.WriteLine("ID does not exist");}} | LPAREN expr RPAREN {$value = $expr.value;}; /*---------------- * LEXER RULES *----------------*/ ID : ('a'..'z'|'A'..'Z')+; INT : '0'..'9'+; NEWLINE : '\r'?'\n'; WS : (' '|'\t'|'\n'|'\r')+ {Skip();};
If you desire a working example, please download this file, Included are the .net binaries (.dll) from the antlr’s website. When your enviroment path is set correctly to the c# compiler, you will be able to run make.cmd to create the console application.
If you want to download the ANTLR binaries yourself, get them at the original site, look for “ANTLR Tool Binaries”.
enjoy!
*Update 24-aug-2008: Added the grammars to the zip.
*Update 27-apr-2009: Added sample movie.
*Update 13-oct-2009: Syntax Highlighting
Please, if you have questions or comments, let us know.
DataDriven Testing with Excel
Validating code with losts of scenario’s gets easier with validating the function with Excel. All you need is Excel.
Start of by naming the variables in the first row, one variable per column. Add an extra column for the result value (or more if needed). Fill out the variables and results.
[TestMethod] [DeploymentItem("bonusCalculation.xlsx")] [DataSource("System.Data.OleDb", "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=bonusCalculation.xlsx;Extended Properties=Excel 12.0;Persist Security Info=False", "bonus$", DataAccessMethod.Sequential)] public void BonusCalcualtion() { double result = Utils.GetSquareRoot(2); double omzet = (double)TestContext.DataRow ["Omzet"]; double winst = (double)TestContext.DataRow["Winst"]; double bonus = (double)TestContext.DataRow["Bonus"]; Assert.AreEqual(bonus, Utils.BonusCalculation(omzet, winst)); }
When you do not have Office 2007, but need to run the test with the .xlsx you might want to download the data driver :
at Microsoft
If you prefer using an older fileformat of excel, compatible with MS Excel 2003 and 2000, use:
[TestMethod] [DeploymentItem("bonusCalculation.xls")] [DataSource("System.Data.OleDb", "Provider=Microsoft.Jet.OLEDB.4.0;Data Source='Calculator.xls';Persist Security Info=False;Extended Properties='Excel 8.0'", "bonus$", DataAccessMethod.Sequential)] { //.... }
FireFox 3, thank you for your effort!
Today gmt+2 19:00 mozilla released their new version of their browser. I’ve been using it since version 1.0 so I’ve been quite happy seeing a new version come to be.
Let’s see how they are doing on web browser compatibility.
Well nice job on that one, let’s see how you have done on the acid3?
Well almost, will you reach this milestone with version 3.1? I’m not complaining thou, since the introduction of firefox the whole browser compatibility has reopend, and actully came to the common non-it/webdeveloper population (the other 97.3%).
Thank you so much for makeing Firefox, and please continue doing what you all do best!