Antlr, AST and rewriting rules.

Onward with my little project, with writing a SVG renderer. It’s intended as a study to get to know ANTLR and SVG.

The good folks of W3.org are so kind to publish a BNF that can be rewritten to an AntLr Grammar.

The goal of this post is to interpret:

M115 285 C115 400 285 400 285 285 C400 285 400 115 285 115 C285 0 115 0 115 115 C0 115 0 285 115 285 z

Into a drawing of some sort.

After reading the BNF it’s clear that the letters stand for commands, and are usually followed by one or more coordinates. EG. ‘M’ is the move to command, and ‘C’ is a curve command, and so on.

The Grammar for SVG will be posted soon in another post. This post is about rewriting.

AST tree SVG Path

AST tree SVG Path

When using the grammar, that I need to post, the AST output looks like this. My first thought was “so I’ve done all this an this is what I get!?” The Definitive ANTLR Reference: Building Domain-Specific Languages (Pragmatic Programmers) showed me I can rewrite and restructure the tree.

Learn by a Simpler Example:

Let say we want to parse a C# function header in an interface, which kinda looks like:

public void foo(string prm1, out int val1);

the ANTLR grammar would look something like:


grammar functionheader; 

function_header
	: access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON;

 argument_list
	:	argument (COMMA  argument_list)?
	;  

argument : argument_accessor? type argument_name
	; 

argument_accessor
	:	'ref'|'out';

access_modifier
	:	 'public'|'private'|'protected'|'internal'; 

return_type
	:	type|'void';

type	:	'string'|'int'|'float'|'double'|'long'; 

argument_name
	:	NAME; 

function_name
	:	NAME;

//PARSER

COMMA	:	',';

SEMICOLON
	:	';'; 

NAME	:	CHAR(DIGIT|CHAR)+; 

CHAR 	:	('a'..'z'|'A'..'Z'|'_');

DIGIT
	:	('0'..'9'); 

LPAREN	:	'(';

RPAREN	:	')';

WHITESPACE
	: ( '\t' | ' ' | '\r' | '\n'| '\u000C') 	{ $channel = HIDDEN; } ;

Rewrite Rules, Omitting items

function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON;

the parser rule above contains some parts that are not really interesting at all. The LPAREN RPAREN and SEMICOLON are not really interesting at all. They are just there to help separate the different aspects from the function header. So when ANTLR has done it’s job there is no need for them any more and we can leave them out with rewriting rules. Thus the rewrite statement looks like:

function_header :
access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON
->access_modifier? return_type function_name argument_list?;

Resulting in:

ast tree function before rewrite

ast tree function before rewrite

ast tree function after rewrite

ast tree function after rewrite

You can clearly see the symbols being omitted.

Rewrite Rules, Creating children

Next big rewrite option is to structure the tree in a tree you are comfortable with to walk/interpret in your own code. The symbol that is used is the ‘^’ sign. Also used in to describe the POW function.

There are two way’s of doing it. This first one is to put it in your parser rule like:

function_header
: access_modifier? return_type function_name^ LPAREN argument_list? RPAREN SEMICOLON;
Result of rewrite rule

Result of rewrite rule

making the function_name the root^  (Do not use rewrites ANTLR will throw errors in your face).

The second method is to use it in your rewrite rule (my personal favorite) :

function_header
: access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON
 -> ^(function_name access_modifier? return_type argument_list? )
rewrite rule

rewrite rule

so that is “^(root child child … child)”, you may also write “^(root child..child ^(subroot child..Child) child..child)”. etc..

Rewrite Rules, Adding some extra nodes

To make the AST tree just right, so your code can handle your AST much better you may need to insert some tokens. You have seen some rewriting rules, with those you can insert some ‘dummy’ nodes. Just tell ANTLR you need some extra tokens like this:

tokens
{
FUNCTION;
NAME;
MODIFIER;
RETURNTYPE;
PARAMS;
TYPE;
ACCESSER;
}
function_header
: access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLUMN
-> ^(FUNCTION ^(NAME function_name) ^(MODIFIER access_modifier)? ^(RETURNTYPE return_type) ^(PARAMS argument_list)? )
;

note there is a difference in opt1=”^(PARAMS argument_list)?” and opt2=”^(PARAMS argument_list?)”

opt1 will ommit the PARAMS token from the tree when arument_list is empty. While opt2 will show a PARAMS token, but it won’t have children when the agument_list is empty.

here is a final listing and a sample output of the AST:

The AST tree fully manipulated.

The AST tree fully manipulated.

grammar functionheader;
options
{
    output=AST;
}
tokens
{
	FUNCTION;
	NAME;
	MODIFIER;
	RETURNTYPE;
	PARAMS;
	TYPE;
	ACCESSER;
}
function_header
	: access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLUMN
	-> ^(FUNCTION ^(NAME function_name) ^(MODIFIER access_modifier)? ^(RETURNTYPE return_type) ^(PARAMS argument_list)? )
	;
argument_list
	:	argument (COMMA  argument_list)?
	->	argument argument_list?
	; 
argument : argument_accesser? type argument_name
	-> ^(argument_name (ACCESSER argument_accesser)? TYPE type )
	;
argument_accesser
	:	'ref'|'out';
access_modifier
	:	 'public'|'private'|'protected'|'internal';
return_type
	:	type|'void';
type	:	'string'|'int'|'float'|'double'|'long';
argument_name
	:	NAME;
function_name
	:	NAME;
//PARSER
COMMA	:	',';
SEMICOLUMN
	:	';';
NAME	:	CHAR(DIGIT|CHAR)+;
CHAR 	:	('a'..'z'|'A'..'Z'|'_');
DIGIT
	:	('0'..'9');
LPAREN	:	'(';
RPAREN	:	')';
WHITESPACE
	: ( '\t' | ' ' | '\r' | '\n'| '\u000C') 	{ $channel = HIDDEN; } ;

this file contains the grammar.

Please, if you have questions or comments, let us know.

7 Responses to “Antlr, AST and rewriting rules.”

  • Nicolò Chieffo says:

    Hello, I’m looking for a really small help.

    I’m building an AST with rewrite rules, everything is clear to me except one thing.
    This is a part of my grammar
    CHAR : ‘0’..’9′ | ‘a’..’z’ | ‘A’..’Z’;
    DASH : ‘-‘;
    str : CHAR (DASH | CHAR)+ -> ^(OPTION str);

    When the node OPTION (a token defined previously) is created, I’d like to have only 1 child with the hole ‘str’ as text. Instead I get n childs, one per character. Is this possible?
    Thanks

  • Floris says:

    The number of nodes is directly related to the number of tokens.
    Tokes are created at with the lexer rules (the uppercase stuff like CHAR and DASH).
    your rule CHAR (DASH | CHAR)+; should match at least two tokens : CHAR and DASH or CHAR and CHAR

    so you’ll need a token that combines the ‘CHAR (DASH | CHAR)+’ into one token.
    That happens to be quite simple:
    STR : CHAR (DASH | CHAR)+;
    there done.
    so if your input is “abc” this would result in a single token [str]
    *Warning, common mistake* What happens if your input is “a”, well that would result in a [char] token.
    You don’t want that thus, you’ll have to add a keyword to CHAR and DASH, namely fragment
    That will allow you to use CHAR and DASH in other lexer rules, but will never be matched as a token.

    so once more, one node? Answer: one token!

    grammar example20090925;

    options
    {
    output=AST;
    }

    tokens
    {
    OPTION;
    }

    str : STR->^(OPTION STR);
    STR
    : CHAR (DASH | CHAR)+;
    fragment CHAR
    : ‘0’..’9′ | ‘a’..’z’ | ‘A’..’Z’;
    fragment DASH
    : ‘-‘;

  • Scott says:

    Hey,

    I’ve got small question regarding the layout of my AST. Basically I’m trying to make my tree read in each line of a file and then store each of these lines in a different ‘Dummy’ node. Then have all of the tokens from that line stored as siblings in that node. I have found this easy enough to do with my simpler functions like:

    sectionnumber:
    (SECTION NUMBER)

    -> ^(DUMMY SECTION NUMBER);

    but when it comes to situations where i have a choice for tokens such as for

    (INSTR_END
    | INSTR_END_BLK
    | INSTR_IN
    | INSTR_MPP
    | INSTR_MPS
    | INSTR_MRD
    | INSTR_OUT_BLK
    | INSTR_RET )

    i have only been able to successfully create the AST if i use this method:

    singleinstr:
    (INSTR_END -> ^(DUMMY INSTR_END)
    | INSTR_END_BLK -> ^(DUMMY INSTR_END_BLK)
    | INSTR_IN -> ^(DUMMY INSTR_IN)
    | INSTR_MPP -> ^(DUMMY INSTR_MPP)
    | INSTR_MPS -> ^(DUMMY INSTR_MPS)
    | INSTR_MRD -> ^(DUMMY INSTR_MRD)
    | INSTR_OUT_BLK -> ^(DUMMY INSTR_OUT_BLK)
    | INSTR_RET -> ^(DUMMY INSTR_RET));

    The problem that i have with this is that i have some much much bigger functions and my interpreter doesn’t like it if i break them up for some reason and for me to do it this way will take forever. Basically, is there is a simpler way to achieve the same result by having it in this kind of format?

    (INSTR_END
    | INSTR_END_BLK
    | INSTR_IN
    | INSTR_MPP
    | INSTR_MPS
    | INSTR_MRD
    | INSTR_OUT_BLK
    | INSTR_RET )
    ->
    ^(DUMMY // This inserts the ‘DUMMY’ node
    (INSTR_END
    | INSTR_END_BLK
    | INSTR_IN
    | INSTR_MPP
    | INSTR_MPS
    | INSTR_MRD
    | INSTR_OUT_BLK
    | INSTR_RET )
    );

    Any assistance with this matter would be much appreciated.
    Thanks,
    Scott

  • Floris says:

    I’m suggesting that you use a parent rule for “singleinstr”
    so in short:
    parentrule : singleinstr* -> ^(DUMMY singleinstr)*;
    //your singleinstr will be simple
    singleinstr:
    INSTR_END
    |INSTR_END_BLK
    |INSTR_IN,… etc.;

    This will result in the same ast structure!

  • Scott says:

    Thanks for that that it worked perfectly.
    I have another small problem that i was wondering if you could give me a hand on. Basically what i am trying to do is concatenate a series of characters together in the following way:

    tokens { }

    SECTION: ‘SECTION’ WHITESPACE* DIGIT+;
    fragment DIGIT: ‘0’..’9′;
    WHITESPACE: (‘\t’ |’ ‘ |’\r’ |’\n’ |’\u000C’ )+ { $channel = HIDDEN; } ;

    The reason for this is that I’m parsing “SECTION 1” for example, is to make it easier to deal with in the interpreter. The problem is that I’m not sure if the number of white spaces is consistent so I’ve been trying to remove them to get “SECTION1” but have been unable to do so. Here are the various ways that I’ve modified the ‘SECTION’ token to try and remove them but none have worked, any ideas?

    SECTION: ‘SECTION’ (WHITESPACE!)* DIGIT+;
    SECTION: ‘SECTION’ (WHITESPACE*)! DIGIT+;
    SECTION: ‘SECTION’ WHITESPACE*! DIGIT+;
    SECTION: ‘SECTION’ WHITESPACE!* DIGIT+;
    SECTION: ‘SECTION’ ((WHITESPACE)*)! DIGIT+;
    SECTION: ‘SECTION’ ((WHITESPACE)!)* DIGIT+;
    SECTION: ( ‘SECTION’ WHITESPACE* DIGIT+ ) -> ( ‘SECTION’ DIGIT+);

    Thanks again,
    Scott

  • admin says:

    you are making section a token!!, I would have made it a parser rule, so simply write “SECTION :” to “section : ‘SECTION’ DIGIT+;” . See my post on whitespace why you can omit the WHITESPACE rule.

    But if you MUST have section as a token concatenated as you described you can use the following grammar as an example.

    grammar sample20100125;

    prog : SECTION;

    SECTION :
    a=’SECTION’ WS* b=INT {setText($a.text + $b.text);} ;

    INT : DIGIT+;

    fragment
    DIGIT: ‘0’..’9′;

    WS : ( ‘ ‘
    | ‘\t’
    | ‘\r’
    | ‘\n’
    )
    {$channel=HIDDEN;}
    ;

    is this clear enough for you?
    regards

    Floris

  • Nguyen says:

    Hi Floris,

    I’ve read this topic and the previous topic for a few times, but I dont know exactly how I can do with my project. Please help me.

    I use c# to load input text file, the output are some arrays due to its characteristic. For example, after spliting the text file, I have
    “student,noun
    pupil,noun
    study,verb
    play,verb
    good,adj

    the output is Noun array: student,pupil ; Verb array: study,play ; Adj array: good

    How can I match with the grammar file to make these elements of the array into the TOKENS like:
    NOUN: ‘student’ | ‘pupil’ ;
    VERB: ‘study’ | ‘play’ ;
    ADJ: ‘good’ ;

    Please help me with this. Thank you anyway.

Leave a Reply