Posts Tagged ‘ANTLR’

Ambiguity: Lexer rules

How to handle ambiguous rules?

Remember the “The following token definitions can never be matched because prior tokens match the same input”.

If you’ll like to match “INT” as a token and a bunch of other chars as another token.

WRONG!

ID 	:	('a'..'z'|'A'..'Z')+;
TYPE 	:	'INT';

RIGHT!

TYPE 	:	'INT';
ID 	:	('a'..'z'|'A'..'Z')+;

But this comes with consequences. There is NO WAY ID could be matched with ‘INT’ as a value, since that will match to the TYPE rule.

Now here is why that may not matter…
a function header in C# :

public int int (string string, object object)

“,.. well if you know C# or JAVA,, you’ll see that this is not a problem…

But it just might be a problem for you.

I’m a C# developer,.. and C# solves this ambiguity with a ‘@‘.
like so:

public static void @static(int @int, object @new, float @const)

This is of course not the final word on this topic.. I’ll post more here.. I’t would help if you’ll post a comment like hurry up already.. ūüôā

cheers.

About WhiteSpace and Antlr

Are you unsure if you need to include the space in your rules?

Most grammars will have this LEXER rule in their grammar:

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

Now you’ll like a rule that needs to match “a sample string” (quotes included)? How will Antlr respond to the space?

or

You want to parse a command line like parameter alike string such as “operation /option1 /option2” (quotes not included).

grammar test20091014;

//thanks to AntlrWorks 1.3 for it's useful grammar wizard.

prog	:	STRING+ OPTION*;

OPTION	:	'/' STRING;

STRING
    :  '"' ( ESC_SEQ| ~('\\'|'"') )* '"'
    |	ID
    ;

fragment ID  :	('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

Parse Tree:

input: str /opt1

input: str /opt1

"str with space" /"some option"

"str with space" /"some option"

"str with space" another space /"some option"

"str with space" another space /"some option"

Conclusion:

No explicit need to include the WS in your lexer rule. “WS” will result in a token and splits input such as¬† “a b c d” (quote not include.

Lexer At work, a tricky sample

I was asked to help out with a grammar. It’s part of the “Package Manager Specification” see gentoo’s spec.

The spec allow to add dependencies, the syntax for this is : [operator][category]/[package name]-[version]

operator operator’s valid values are : ‘<‘ ‘<=’ ‘=’ ‘~’ ‘>=’ ‘>’
category 2.1.1 Category Names
A category name may contain any of the characters [A-Za-z0-9+_.-]. It must not begin with a hyphen or a
dot.
Note: A hyphen is not required because of the virtual category. Usually, however, category names will
contain a hyphen.
package name 2.1.2 Package Names
A package name may contain any of the characters [A-Za-z0-9+_-]. It must not begin with a hyphen, and
must not end in a hyphen followed by one or more digits.
version 2.2 Version Specifications
The package manager must not impose fixed limits upon the number of version components. Package managers
should indicate or reject any version that is invalid according to these rules.
A version starts with the number part, which is in the form [0-9]+(n.[0-9]+)* (a positive integer, followed
by zero or more dot-prefixed positive integers).
This may optionally be followed by one of [a-z] (a lowercase letter).

In the spec there are more requirements, but it will make this example so much more harder to understand.

now the trick is that package name (or pn for short), can not end in a hyphen followed by one or more digits’. Thus valids could be:

foo
foo-bla
foo1-bla-

but I’m not sure about the following.

foo-12a-12, if one follows the spec, it ends with -12.. so one could cut that of but you will still be left with -12a ,” a hyphen followed by one or more digits“. I would like to know it that is valid or not. Please tell me! I’m not sure.

I would like to know if this is valid: foo-123-foo? (does not end in.. I know)
I would like to know if this is valid:foo-12-12a

since category and pn and version are much alike and share much the same tokens it seemed logical to create parser rules instead of lexer rules.

dep    :    OPERATOR catagory SLASH pn HYPHEN version;

where OPERATOR is defined as:

OPERATOR:¬†¬†¬† ‘<‘|'<=’|’=’|’~’|’>=’|’>’;

and SLASH and HYPHEN :

HYPHEN¬†¬†¬† :¬†¬†¬† ‘-‘;
SLASH¬†¬†¬† :¬†¬†¬† ‘/’;

now here comes the difficult stuff:

Both Category and Pn share a lot of the same tokens. Category may contain a ‘.’¬† and pn may not. So if we would lex :
‘foo-bar’
would that result in¬† a CATEGORY token or a PN token? The answer is that Antlr would not stand by this and tell you that “the following token definitions can never be matched because prior tokens match the same input”. If there is no dot in the characters being lexed it could well have been a pn. In short both rules are quite ambigous, but it get’s wores. How do we tell the lexer the differences between a version and the pn if only characters are used that are valid for both rules. The answer is we need to do this ourselfs. The lexer will happely continue eating valid chars and thus eating characters in the version portion. It would have been easier if the final hypen would have been a character not in the range of valid chars of eiter rule.

On the other hand CATEGORY always comes after OPERATOR and PN always after SLASH. We can uses this to our advantage by using a construct named ‘semantic predicates’. We are going to use this to influence the lexer.

When the lexer is done wit operator it will set a flag telling it to match the tokens as with the category rules, and after the slash the flag will be set again to match tokes that apply to the package name rules. Resulting in “OPERATOR:¬†¬†¬† (‘<‘|'<=’|’=’|’~’|’>=’|’>’ ){category=true;};” and “SLASH¬†¬†¬† :¬†¬†¬† ‘/’¬† {category=false;};”.

a boolean will be set to true or to false. Which has still to be defined at the lexer by setting (basic programming rule:”declare what you want to use”) :

@lexer::members
{
boolean category=false;
}

Now Category and pn share common tokens, these are defined as follows:
fragment GENERIC    :    (UPPER|LOWER|UNDERSCORE|DIGIT);

(i use fragment so GENERIC won’t be seen as a individual token, it can be used over and over again.)

Next we are putting both

STR:
GENERIC
(
{category}?=>(GENERIC|PLUS|DOT|HYPHEN)*
|
{!category}?=>(GENERIC|PLUS|HYPHEN)*
)

So if we now set dep to : “dep¬†¬†¬† :¬†¬†¬† OPERATOR category SLASH pn; // HYPHEN VERSION;” (skipping the version) and test it:

>=cat.a.gory/packag-eName

Input was “>=cat.a.gory/packag-eName” Input was fully taken correctly.

>=cat.a.gory/packag.eName

Input was “>=cat.a.gory/packag.eName” The lexer stopped at “packag” since ‘.’ was not a valid token for package name.

Now we can add version to it, thus changing dep to : “dep¬†¬†¬† :¬†¬†¬† OPERATOR category SLASH pn HYPHEN VERSION;” ,¬† and try “>=cat.a.gory/packageName-12a” for inpu, telling it’s version 12.

Version not matched

Whoa what just happend?!! package name included the version!

The cause to this is because the “-12a” part can be consumed by the package name rule. And the “can not end with hyphen followed by digit” rule was not implemented.

How to implement can not end? Well currently Antlr does not support such a construct as in may not end in (not to be confused by the ‘~’ operator, meaning match anything until …). I’m going to use the input.mark, input.rewind and input release commands.¬† When mark is called, you get an id. With this id you can rewind back to this mark. Any tokens matched in between will become consumable once again, it’s like reversing time. Input.Release(id) does not rewind, but cleans up the administration.

The goal is to match “packageName-12a”, with a marker @ the HYPEN. Then we check if the rule “can not end in hypen.. digit” is true or false. If true, we will simply rewind and the that’s that.¬† So the matched “packageName-12a” will end up in “packageName” (with “-12a” rewinded, left for other rules).

Every time a hyphen comes by i’ll mark it, if we already had a mark, release it.

This is the final result:

Comments welcome!

//options
//{
//language = CSharp;
//}

//http://www.gentoo.org/proj/en/qa/pms.xml

/*
2.1.1 Category Names
A category name may contain any of the characters [A-Za-z0-9+_.-]. It must not begin with a hyphen or a
dot.
Note: A hyphen is not required because of the virtual category. Usually, however, category names will
contain a hyphen.
*/

/*
2.1.2 Package Names
A package name may contain any of the characters [A-Za-z0-9+_-]. It must not begin with a hyphen, and
must not end in a hyphen followed by one or more digits.
*/

/*
2.2 Version Specifications
The package manager must not impose fixed limits upon the number of version components. Package managers
should indicate or reject any version that is invalid according to these rules.
A version starts with the number part, which is in the form [0-9]+(n.[0-9]+)* (a positive integer, followed
by zero or more dot-prefixed positive integers).
This may optionally be followed by one of [a-z] (a lowercase letter).
*/

@lexer::members
{
boolean category=false;

//c#
//bool category=false;

}

dep    :    OPERATOR category SLASH pn HYPHEN VERSION;

HYPHEN¬†¬†¬† :¬†¬†¬† ‘-‘;
SLASH¬†¬†¬† :¬†¬†¬† ‘/’¬† {category=false;};
VERSION    :    DIGIT+ (DOT DIGIT+)* LOWER?;
OPERATOR:¬†¬†¬† (‘<‘|'<=’|’=’|’~’|’>=’|’>’) {category=true;};

category:    STR;
pn    :    STR;

STR
@init
{
int _mrk=-1;
//    int start = CharIndex; //c#
int start = getCharIndex(); //JAVA
}
:
//c#    {int _mrk=-1;int start = getCharIndex;}

GENERIC
(
{category}?=>(GENERIC|PLUS|DOT|HYPHEN)*
|
{!category}?=>
(

(GENERIC|PLUS|( {if (_mrk!=-1){input.release(_mrk);} {_mrk = input.mark();} }  HYPHEN))*
)
)
/*
Post Check: If this is a Package name and it ends in eg. -12 just take that off.
*/
{
if (!category)
{

String s = input.substring(start, input.getCharPositionInLine()-1);
String[] s2 = s.split(“-“);
if ( Character.isDigit(s2[s2.length-1].charAt(0)) && s2.length>1)
{
input.rewind(_mrk); _mrk = -1; //rewind (altering this tokens, leave the rest to be matched.)
}
else
{
if (_mrk != -1) { input.release(_mrk); } //release, do not alter token.
}

}
}//end of post check.
;

fragment UNDERSCORE¬†¬†¬† :¬†¬†¬† ‘_’;
fragment UPPER¬†¬†¬† :¬†¬†¬† ‘A’..’Z’;
fragment DIGIT¬†¬†¬† :¬†¬†¬† ‘0’..’9′;
fragment LOWER¬†¬†¬†¬† :¬†¬†¬† ‘a’..’z’;
fragment DOT¬†¬†¬† :¬†¬†¬† ‘.’;
fragment PLUS¬†¬†¬† :¬†¬†¬† ‘+’;
fragment GENERIC    :    (UPPER|LOWER|UNDERSCORE|DIGIT);

WS¬†¬†¬†¬† : (‘ ‘|’t’ | ‘r’ | ‘n’)+ {$channel=HIDDEN;};

LICENSE: All source code in this post comes with ABSOLUTELY NO WARRANTY, This is free software, and you are welcome to redistribute it under certain conditions, see GPL 2.

ANTLR common pitfalls

Personally I think ANTLR is a great tool, It has a steep learning curve and it has a few quirks. I hope the description here will help you find your problem, understand it and help you fix the issue.

Recognition problems

Recognizes number as ‘1234’ but not as ‘1’:
Example:

grammar number;

number :	INT;
DIGIT	:	'0'..'9';
INT	:	DIGIT+;

Explanation, the input such as ‘1’ or ‘4’ is just one char, it will be recognized as a ‘DIGIT’, not as an ‘INT’. You have two options :
– delete the DIGIT rule, and rewrite the INT rule as ” INT:’0′..’9′; ” (works)
– place the ‘fragment’ keyword in front of DIGIT. DIGIT will not be seen as a token.

Grammar Check Errors

The following token definitions are unreachable: INT (AntlrWorks 1.1.7)
The following token definitions can never be matched because prior tokens match the same input: INT (AntlrWorks 1.2) (AntlrWorks 1.3)

Example:

grammar number;

DIGIT:	'0'..'9';
INT	:	DIGIT;

What has INT to offer? As is it is a redundant rule. Probably you meant more than one number (thus matching 43) so make it INT : DIGIT+;

Another possibility:

factExpression:	Fact fact;
fact 	:	ID;

propertyExpression:	fact Property property;

property:	ID;

NEWLINE	:	'\r'?'\n';
WS	:	(' ' | '\t' | '\n' | '\r') { skip(); };

Fact	: 'There is ' ARTICLE //added extra space since it needs to be there
;

Property
:	'has '  ARTICLE //added extra space since it needs to be there
;

ID 	:	('a'..'z'|'A'..'Z')+;

ARTICLE
:	('a'|'an')
;

Here the error is “The following token…match the same input: ARTICLE”. ID can both match ‘a’ and ‘an’, but in this case ARTICLE is more important. Flip the ID and ARTICLE rule, and it’s fine.
More on Ambiguous rules.


syntax error: codegen: :0:0: unexpected end of subtree (AntlrWorks 1.1.7 & 1.2)
Example:

grammar number;
//number 	:	INT | FLOAT;
DIGIT	 :	'0'..'9';
INT	:	DIGIT+;
FLOAT	:	DIGIT* '.' DIGIT+;

You are working with a mixed grammar (both LEXER and PARSER).
You did not include any parser rules, please do so. Uncomment the ‘number’ parser rule.
Another possibility: is that the last line of your grammar is comment, just move it.

*Updated 10-nov-2009 : Added internal link to explain more on Ambiguous rules.;
*Updated 13-oct-2009 : Syntax Highlighting;
*Updated 25-jan-2010 : Small updates
*Updated 17-ock-2014 : Ending support for antlr 3. Thank you, it was fun!

Antlr, AST and rewriting rules.

Onward with my little project, with writing a SVG renderer. It’s intended as a study to get to know ANTLR and SVG.

The good folks of W3.org are so kind to publish a BNF that can be rewritten to an AntLr Grammar.

The goal of this post is to interpret:

M115 285 C115 400 285 400 285 285 C400 285 400 115 285 115 C285 0 115 0 115 115 C0 115 0 285 115 285 z

Into a drawing of some sort.

After reading the BNF it’s clear that the letters stand for commands, and are usually followed by one or more coordinates. EG. ‘M’ is the move to command, and ‘C’ is a curve command, and so on.

The Grammar for SVG will be posted soon in another post. This post is about rewriting.

AST tree SVG Path

AST tree SVG Path

When using the grammar, that I need to post, the AST output looks like this. My first thought was “so I’ve done all this an this is what I get!?” The Definitive ANTLR Reference: Building Domain-Specific Languages (Pragmatic Programmers) showed me I can rewrite and restructure the tree.

Learn by a Simpler Example:

Let say we want to parse a C# function header in an interface, which kinda looks like:

public void foo(string prm1, out int val1);

the ANTLR grammar would look something like:


grammar functionheader; 

function_header
	: access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON;

 argument_list
	:	argument (COMMA  argument_list)?
	;  

argument : argument_accessor? type argument_name
	; 

argument_accessor
	:	'ref'|'out';

access_modifier
	:	 'public'|'private'|'protected'|'internal'; 

return_type
	:	type|'void';

type	:	'string'|'int'|'float'|'double'|'long'; 

argument_name
	:	NAME; 

function_name
	:	NAME;

//PARSER

COMMA	:	',';

SEMICOLON
	:	';'; 

NAME	:	CHAR(DIGIT|CHAR)+; 

CHAR 	:	('a'..'z'|'A'..'Z'|'_');

DIGIT
	:	('0'..'9'); 

LPAREN	:	'(';

RPAREN	:	')';

WHITESPACE
	: ( '\t' | ' ' | '\r' | '\n'| '\u000C') 	{ $channel = HIDDEN; } ;

Rewrite Rules, Omitting items

function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON;

the parser rule above contains some parts that are not really interesting at all. The LPAREN RPAREN and SEMICOLON are not really interesting at all. They are just there to help separate the different aspects from the function header. So when ANTLR has done it’s job there is no need for them any more and we can leave them out with rewriting rules. Thus the rewrite statement looks like:

function_header :
access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON
->access_modifier? return_type function_name argument_list?;

Resulting in:

ast tree function before rewrite

ast tree function before rewrite

ast tree function after rewrite

ast tree function after rewrite

You can clearly see the symbols being omitted.

Rewrite Rules, Creating children

Next big rewrite option is to structure the tree in a tree you are comfortable with to walk/interpret in your own code. The symbol that is used is the ‘^’ sign. Also used in to describe the POW function.

There are two way’s of doing it. This first one is to put it in your parser rule like:

function_header
: access_modifier? return_type function_name^ LPAREN argument_list? RPAREN SEMICOLON;
Result of rewrite rule

Result of rewrite rule

making the function_name the root^  (Do not use rewrites ANTLR will throw errors in your face).

The second method is to use it in your rewrite rule (my personal favorite) :

function_header
: access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON
 -> ^(function_name access_modifier? return_type argument_list? )
rewrite rule

rewrite rule

so that is “^(root child child … child)”, you may also write “^(root child..child ^(subroot child..Child) child..child)”. etc..

Rewrite Rules, Adding some extra nodes

To make the AST tree just right, so your code can handle your AST much better you may need to insert some tokens. You have seen some rewriting rules, with those you can insert some ‘dummy’ nodes. Just tell ANTLR you need some extra tokens like this:

tokens
{
FUNCTION;
NAME;
MODIFIER;
RETURNTYPE;
PARAMS;
TYPE;
ACCESSER;
}
function_header
: access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLUMN
-> ^(FUNCTION ^(NAME function_name) ^(MODIFIER access_modifier)? ^(RETURNTYPE return_type) ^(PARAMS argument_list)? )
;

note there is a difference in opt1=”^(PARAMS argument_list)?” and opt2=”^(PARAMS argument_list?)”

opt1 will ommit the PARAMS token from the tree when arument_list is empty. While opt2 will show a PARAMS token, but it won’t have children when the agument_list is empty.

here is a final listing and a sample output of the AST:

The AST tree fully manipulated.

The AST tree fully manipulated.

grammar functionheader;
options
{
    output=AST;
}
tokens
{
	FUNCTION;
	NAME;
	MODIFIER;
	RETURNTYPE;
	PARAMS;
	TYPE;
	ACCESSER;
}
function_header
	: access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLUMN
	-> ^(FUNCTION ^(NAME function_name) ^(MODIFIER access_modifier)? ^(RETURNTYPE return_type) ^(PARAMS argument_list)? )
	;
argument_list
	:	argument (COMMA  argument_list)?
	->	argument argument_list?
	; 
argument : argument_accesser? type argument_name
	-> ^(argument_name (ACCESSER argument_accesser)? TYPE type )
	;
argument_accesser
	:	'ref'|'out';
access_modifier
	:	 'public'|'private'|'protected'|'internal';
return_type
	:	type|'void';
type	:	'string'|'int'|'float'|'double'|'long';
argument_name
	:	NAME;
function_name
	:	NAME;
//PARSER
COMMA	:	',';
SEMICOLUMN
	:	';';
NAME	:	CHAR(DIGIT|CHAR)+;
CHAR 	:	('a'..'z'|'A'..'Z'|'_');
DIGIT
	:	('0'..'9');
LPAREN	:	'(';
RPAREN	:	')';
WHITESPACE
	: ( '\t' | ' ' | '\r' | '\n'| '\u000C') 	{ $channel = HIDDEN; } ;

this file contains the grammar.

Please, if you have questions or comments, let us know.