ANTLR, AST and rewrite rules.
Onward with my little project: writing an SVG renderer. It’s intended as a study to get to know ANTLR and SVG.
The good folks of W3.org are kind enough to publish a BNF that can be rewritten into an ANTLR grammar.
The goal of this post is to interpret:
M115 285 C115 400 285 400 285 285 C400 285 400 115 285 115 C285 0 115 0 115 115 C0 115 0 285 115 285 z
Into a drawing of some sort.
After reading the BNF it’s clear that the letters stand for commands, and that they are usually followed by one or more coordinates. For example, ‘M’ is the moveto command, ‘C’ is a curveto command, and so on.
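To make the "command letter followed by coordinates" idea concrete, here is a minimal Python sketch that splits the path string above into commands and their numbers. It is not the ANTLR grammar this series builds towards, just a regex-based illustration. The names `split_path`, `COMMAND_RE` and `NUMBER_RE` are made up for this example, and it assumes plain decimal numbers (no exponent notation).

```python
import re

# One command letter followed by everything up to the next letter.
# Assumption: numbers never contain letters (no "1e5" style exponents).
COMMAND_RE = re.compile(r"([A-Za-z])([^A-Za-z]*)")
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")

PATH = ("M115 285 C115 400 285 400 285 285 C400 285 400 115 285 115 "
        "C285 0 115 0 115 115 C0 115 0 285 115 285 z")


def split_path(data):
    """Return a list of (command, [numbers]) pairs from SVG path data."""
    commands = []
    for letter, args in COMMAND_RE.findall(data):
        numbers = [float(n) for n in NUMBER_RE.findall(args)]
        commands.append((letter, numbers))
    return commands


if __name__ == "__main__":
    for letter, numbers in split_path(PATH):
        print(letter, numbers)
```

The moveto command ends up with one coordinate pair, each curveto with three pairs, and the closing ‘z’ with none.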
The Grammar for SVG will be posted soon in another post. This post is about rewriting.
When using the grammar (which I still need to post), the AST output looks like this. My first thought was “so I’ve done all this and this is what I get!?” The Definitive ANTLR Reference: Building Domain-Specific Languages (Pragmatic Programmers) showed me that I can rewrite and restructure the tree.
Learn by a Simpler Example:
Let’s say we want to parse a C# function header in an interface, which looks something like:
public void foo(string prm1, out int val1);
The ANTLR grammar would look something like:
grammar functionheader;

function_header
    : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON
    ;
argument_list
    : argument (COMMA argument_list)?
    ;
argument
    : argument_accessor? type argument_name
    ;
argument_accessor : 'ref'|'out';
access_modifier   : 'public'|'private'|'protected'|'internal';
return_type       : type|'void';
type              : 'string'|'int'|'float'|'double'|'long';
argument_name     : NAME;
function_name     : NAME;

// LEXER
COMMA      : ',';
SEMICOLON  : ';';
NAME       : CHAR(DIGIT|CHAR)+;
CHAR       : ('a'..'z'|'A'..'Z'|'_');
DIGIT      : ('0'..'9');
LPAREN     : '(';
RPAREN     : ')';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C') { $channel = HIDDEN; } ;
Rewrite Rules, Omitting items
function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON;
The parser rule above contains some parts that are not really interesting: LPAREN, RPAREN and SEMICOLON. They are only there to separate the different parts of the function header. Once ANTLR has done its job there is no need for them any more, and we can leave them out with a rewrite rule. The rule with its rewrite looks like:
function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON -> access_modifier? return_type function_name argument_list?;
Resulting in:
You can clearly see the symbols being omitted.
Rewrite Rules, Creating children
The next big rewrite option is to restructure the tree into a shape you are comfortable walking and interpreting in your own code. The symbol used for this is the ‘^’ sign, the same one often used to write the power function.
There are two ways of doing it. The first one is to put it directly in your parser rule, like:
function_header : access_modifier? return_type function_name^ LPAREN argument_list? RPAREN SEMICOLON;
This makes function_name the root (note the ^ after it). Do not combine this operator with a -> rewrite in the same alternative; ANTLR will throw errors in your face.
The second method is to use it in your rewrite rule (my personal favorite) :
function_header : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON -> ^(function_name access_modifier? return_type argument_list? ) ;
So the pattern is “^(root child child … child)”. You may also nest it: “^(root child..child ^(subroot child..child) child..child)”, and so on.
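If the “^(root child … child)” notation feels abstract, it may help to see it as a nested data structure. The Python sketch below is not ANTLR code; it models a tree node as a `(text, children)` tuple and prints it in the parenthesised form that ANTLR-style tree dumps use. The helper names `leaf` and `to_string_tree` and the sample node texts are assumptions made for illustration only.

```python
def leaf(text):
    """A node without children."""
    return (text, [])


def to_string_tree(node):
    """Render a (text, children) node as '(root child child ...)'."""
    text, children = node
    if not children:
        return text
    parts = [text] + [to_string_tree(child) for child in children]
    return "(" + " ".join(parts) + ")"


# Models ^(root child child): the function name is the root, the rest hang below it.
rooted = ("foo", [leaf("public"), leaf("void")])

# Models ^(root child ^(subroot child child) child): a subtree among the children.
nested = ("root", [leaf("a"), ("subroot", [leaf("b"), leaf("c")]), leaf("d")])

if __name__ == "__main__":
    print(to_string_tree(rooted))   # (foo public void)
    print(to_string_tree(nested))   # (root a (subroot b c) d)
```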
Rewrite Rules, Adding some extra nodes
To get the AST just right, so that your code can handle it more easily, you may need to insert some extra tokens. With the rewrite rules you have seen so far you can insert ‘dummy’ nodes. Just tell ANTLR which extra tokens you need, like this:
tokens {
    FUNCTION;
    NAME;
    MODIFIER;
    RETURNTYPE;
    PARAMS;
    TYPE;
    ACCESSER;
}

function_header
    : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON
      -> ^(FUNCTION ^(NAME function_name) ^(MODIFIER access_modifier)? ^(RETURNTYPE return_type) ^(PARAMS argument_list)? )
    ;
Note that there is a difference between opt1 = “^(PARAMS argument_list)?” and opt2 = “^(PARAMS argument_list?)”.
opt1 omits the PARAMS token from the tree when argument_list is empty, while opt2 keeps a PARAMS token that simply has no children when argument_list is empty.
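The difference between the two options is easy to miss, so here is a small Python model of it. Again, this is only a sketch of the resulting tree shapes, not ANTLR code; `params_opt1` and `params_opt2` are hypothetical names, and a node is modelled as a `(text, children)` tuple.

```python
def params_opt1(arguments):
    """Model of ^(PARAMS argument_list)? : the whole subtree is optional."""
    if not arguments:
        return []                          # no PARAMS node at all
    return [("PARAMS", list(arguments))]


def params_opt2(arguments):
    """Model of ^(PARAMS argument_list?) : only the children are optional."""
    return [("PARAMS", list(arguments))]   # PARAMS node is always present


if __name__ == "__main__":
    print(params_opt1([]))          # []
    print(params_opt2([]))          # [('PARAMS', [])]
    print(params_opt1(["prm1"]))    # [('PARAMS', ['prm1'])]
    print(params_opt2(["prm1"]))    # [('PARAMS', ['prm1'])]
```

Which one you want depends on your tree walker: with opt1 it has to cope with a missing PARAMS node, with opt2 it can always expect one and just iterate over zero children.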
Here is the final listing and a sample output of the AST:
grammar functionheader;

options { output=AST; }

tokens {
    FUNCTION;
    NAME;
    MODIFIER;
    RETURNTYPE;
    PARAMS;
    TYPE;
    ACCESSER;
}

function_header
    : access_modifier? return_type function_name LPAREN argument_list? RPAREN SEMICOLON
      -> ^(FUNCTION ^(NAME function_name) ^(MODIFIER access_modifier)? ^(RETURNTYPE return_type) ^(PARAMS argument_list)? )
    ;
argument_list
    : argument (COMMA argument_list)?
      -> argument argument_list?
    ;
argument
    : argument_accesser? type argument_name
      -> ^(argument_name (ACCESSER argument_accesser)? TYPE type )
    ;
argument_accesser : 'ref'|'out';
access_modifier   : 'public'|'private'|'protected'|'internal';
return_type       : type|'void';
type              : 'string'|'int'|'float'|'double'|'long';
argument_name     : NAME;
function_name     : NAME;

// LEXER
COMMA      : ',';
SEMICOLON  : ';';
NAME       : CHAR(DIGIT|CHAR)+;
CHAR       : ('a'..'z'|'A'..'Z'|'_');
DIGIT      : ('0'..'9');
LPAREN     : '(';
RPAREN     : ')';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C') { $channel = HIDDEN; } ;
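The whole point of the restructuring is that your own code can find things by name instead of by position. As a rough illustration, the Python sketch below walks a hand-written tree for `public void foo(string prm1, out int val1);`. The tree literal is an assumption about the overall shape (FUNCTION at the root with NAME, MODIFIER, RETURNTYPE and PARAMS below it); the details under each argument are left out on purpose, and `child`, `first_text` and `describe` are hypothetical helpers, not part of any ANTLR runtime.

```python
# Assumed tree shape, written by hand; each node is (text, children).
TREE = ("FUNCTION", [
    ("NAME", [("foo", [])]),
    ("MODIFIER", [("public", [])]),
    ("RETURNTYPE", [("void", [])]),
    ("PARAMS", [("prm1", []), ("val1", [])]),  # argument details omitted
])


def child(node, text):
    """Return the first child subtree whose root text matches, or None."""
    for sub in node[1]:
        if sub[0] == text:
            return sub
    return None


def first_text(node):
    """Text of the first child of node, or None if node is missing or empty."""
    return node[1][0][0] if node and node[1] else None


def describe(function):
    """Collect the interesting parts of a FUNCTION tree into a dict."""
    params = child(function, "PARAMS")        # may be absent (opt1 style)
    return {
        "name": first_text(child(function, "NAME")),
        "modifier": first_text(child(function, "MODIFIER")),
        "return_type": first_text(child(function, "RETURNTYPE")),
        "params": [p[0] for p in params[1]] if params else [],
    }


if __name__ == "__main__":
    print(describe(TREE))
```

Because MODIFIER and PARAMS are optional subtrees in the grammar, the walker treats a missing node as "no modifier" or "no parameters" rather than as an error.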
This file contains the grammar.
Please, if you have questions or comments, let us know.