ANTLR common pitfalls
Personally I think ANTLR is a great tool, but it has a steep learning curve and a few quirks. I hope the descriptions here help you find your problem, understand it and fix it.
Recognition problems
Recognizes number as ‘1234’ but not as ‘1’:
Example:
grammar number;
number : INT;
DIGIT : '0'..'9';
INT : DIGIT+;
Explanation: an input such as ‘1’ or ‘4’ is just one character, so it is recognized as a ‘DIGIT’, not as an ‘INT’. You have two options:
– delete the DIGIT rule, and rewrite the INT rule as INT : '0'..'9'+; (works)
– place the ‘fragment’ keyword in front of DIGIT. DIGIT will then no longer be seen as a token of its own.
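As a minimal sketch of the second option, this is the same example grammar with only the ‘fragment’ keyword added:

```
grammar number;

number : INT;

// 'fragment' turns DIGIT into a helper for other lexer rules;
// the lexer no longer emits DIGIT as a token of its own.
fragment DIGIT : '0'..'9';

INT : DIGIT+;
```

With this change both ‘1’ and ‘1234’ should come out of the lexer as INT.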
Grammar Check Errors
The following token definitions are unreachable: INT (AntlrWorks 1.1.7)
The following token definitions can never be matched because prior tokens match the same input: INT (AntlrWorks 1.2) (AntlrWorks 1.3)
Example:
grammar number;
DIGIT: '0'..'9';
INT : DIGIT;
What does INT have to offer? As it stands it is a redundant rule, because it matches exactly the same input as DIGIT. You probably meant more than one digit (thus matching 43), so make it INT : DIGIT+;
Another possibility:
factExpression: Fact fact;
fact : ID;
propertyExpression: fact Property property;
property: ID;
NEWLINE : '\r'?'\n';
WS : (' ' | '\t' | '\n' | '\r') { skip(); };
Fact : 'There is ' ARTICLE //added extra space since it needs to be there
;
Property : 'has ' ARTICLE //added extra space since it needs to be there
;
ID : ('a'..'z'|'A'..'Z')+;
ARTICLE : ('a'|'an') ;
Here the error is “The following token…match the same input: ARTICLE”. ID can match both ‘a’ and ‘an’, but in this case ARTICLE is more important. Put the ARTICLE rule before the ID rule, and it’s fine.
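A short sketch of the reordering described above; only the order of the two rules from the example changes:

```
// ARTICLE now comes first, so when both rules match the same text
// ('a' or 'an') the lexer prefers ARTICLE.
ARTICLE : ('a'|'an') ;
ID      : ('a'..'z'|'A'..'Z')+;
```

Keep in mind the trade-off: a standalone ‘a’ or ‘an’ in the input is now always an ARTICLE token and never an ID.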
More on Ambiguous rules.
syntax error: codegen: :0:0: unexpected end of subtree (AntlrWorks 1.1.7 & 1.2)
Example:
grammar number;
//number : INT | FLOAT;
DIGIT : '0'..'9';
INT : DIGIT+;
FLOAT : DIGIT* '.' DIGIT+;
You are working with a combined grammar (both LEXER and PARSER), but you did not include any parser rules. Please add one: here, uncomment the ‘number’ parser rule.
Another possibility is that the last line of your grammar is a comment. Just move it.
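For illustration, a version of the example with the parser rule restored might look like this. Marking DIGIT as a fragment is my own addition here, to avoid the recognition problem from the first section:

```
grammar number;

// A combined grammar needs at least one parser rule (lowercase name).
number : INT | FLOAT;

// Assumption: DIGIT is only a helper, so it is declared as a fragment.
fragment DIGIT : '0'..'9';

INT   : DIGIT+;
FLOAT : DIGIT* '.' DIGIT+;
```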
*Updated 10-nov-2009 : Added internal link to explain more on Ambiguous rules.;
*Updated 13-oct-2009 : Syntax Highlighting;
*Updated 25-jan-2010 : Small updates
*Updated 17-oct-2014 : Ending support for antlr 3. Thank you, it was fun!
Clear and valuable information ; I had forgotten the ‘fragment’ keyword which can be particularly handy at times.
A big thanks.
Do you have any good tips on implementing the following:
I have a rule that needs to match until the next rule or whitespace, depending on where it is called from, so I am using dynamic scopes. This code just doesn’t seem to be working:
pn : (alphanum|'+'|'_') (options { greedy=false; } : pn_char)* (pn_char pn_follows)=>pn_char {!pn_end.matcher($pn.text).matches()}?;
//the end test could probably be proper antlr stuff but this was easier
pn_char: alphanum|'+'|'_'|'-';
pn_follows
scope VersionScope;
:
( {$VersionScope::needs_version}?=>version_spec ) (WS|EOF);
@Betelgeuse, a post to describe your problem.
“The following token definitions can never be matched because prior tokens match the same input”.
I fell over this one when I made the n00b mistake of not realizing that parser rules *must* begin with a lowercase letter and lexer rules must begin with an uppercase letter.
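A tiny sketch of that naming convention, using a made-up grammar (the names ‘naming’, ‘sum’, INT and WS are purely illustrative):

```
grammar naming;

// Parser rule: the name starts with a lowercase letter.
sum : INT ('+' INT)* ;

// Lexer rules: the names start with an uppercase letter.
INT : '0'..'9'+ ;
WS  : (' ' | '\t')+ { skip(); } ;
```

If ‘sum’ were written as ‘Sum’, ANTLR would treat it as a lexer rule, which is how the “can never be matched” message can show up unexpectedly.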
Thanks for offering the help Floris! I’m one of the newbies to ANTLR who is pulling his hair out to get his issues resolved. A painful experience indeed, but I hope that in the end I will be rewarded big.
Let me first explain a bit about my situation: I’m working with ANTLR from within eclipse using the eclipse ANTLR plugin. I wrote a grammar (of course) and tested it in the “Interpreter” window and it all looks fine and the tree looked great. I then wrote a test to print the tree but that failed with some error messages. Thanks for your help!
Here is the grammar:
————————————————————————————
grammar T;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
}
tokens {
NEGATE;
}
@header {
package a.b.c.T.parser;
}
@lexer::header {
package a.b.c.T.parser;
}
codeFragment
: statement+ EOF!
;
statement
: block
| unterminated LINE_END
| ifstatement
| LINE_END
;
unterminated
: expression
| assignmentStatement
;
ifstatement
: 'if' parExpression statement ('else if' parExpression statement)* (options {k=1; backtrack=false;}:'else' statement)?
;
parExpression
: '('! expression ')'!
;
block returns [String r]
: '{' statement* '}'
;
assignmentStatement
: IDENT (‘=’^ | ‘ NEGATE
;
mult
: unary (('*'^ | '/'^) unary)*
;
add
: mult (('+'^ | '-'^) mult)*
;
relation
: add (('=='^ | '<'^ | '<='^ | '>'^ | '>='^) add)*
;
expression
: relation (('&'^ | '&&'^ | '|'^ | '||'^) relation)*
;
INTEGER: '0' | '1'..'9' DIGIT* 'L'? | '0x' DIGIT+ 'L'?;
IDENT: LETTER (LETTER | DIGIT)*;
WS: (' ' | '\t' | '\f')+ {$channel=HIDDEN;};
LINE_END: ('#' .*)? {$channel=HIDDEN;} ('\r'? '\n');
fragment DIGIT: '0'..'9';
fragment LETTER: ('a'..'z' | 'A'..'Z');
————————————————————————————
Here is the test I’m running against in the “Interpreter” window:
————————————————————————————
if (a > 1) {
## x not near 0: calculate in the obvious way.
s <-b
}else if (d==1) {
d<-2
} else{
## x too close to 0 for sin(x) / x to work. Use power series instead.
s <- 1
term <- 1
if(c 1) {\r\n” +
" ## x not near 0: calculate in the obvious way.\r\n" +
” s <-b\r\n" +
"}else if (d==1) {\r\n" +
" d<-2\r\n" +
"} else{\r\n" +
" ## x too close to 0 for sin(x) / x to work. Use power series instead.\r\n" +
" s <- 1\r\n" +
" term <- 1\r\n" +
" if(c < 1) break\r\n" +
" }\r\n" +
"\r\n" +
" ## Value returned is value of last expression: s in this case\r\n" +
" s\r\n" +
"");
TLexer lexer = new TLexer(charStream);
TokenStream tokenStream = new CommonTokenStream(lexer);
TParser parser = new TParser(tokenStream);
codeFragment_return fragment = parser.codeFragment();
CommonTree tree = fragment.tree;
System.out.println(tree.toStringTree());
}
————————————————————————————
Or simply put, I guess my issue comes down to how to deal with statement endings. The statements in my grammar include some that must be terminated by either a newline or a “}”. The issue is that the “}” is already consumed by a block statement, as in block : '{' statement* '}'. How should this kind of problem be solved in general?
Hi, I’m from Colombia so my English isn’t perfect, but I’m going to try to explain myself. I have this grammar in ANTLR.
grammar punto;
prog
: declist;
z : dec;
declist : dec (/*nada*/| z);
dec : vardec
| fundec;
vardec : typespe VAR (PUNYCO| LCORC LITERAL RCORC);
typespe : TERINT
| TERVOID;
fundec : typespe VAR LPAR params RPAR compstmt;
params : paramlist
| TERVOID;
x : COMA param (/*nada*/ |x);
paramlist
: param (/*nada*/ |x);
param : typespe VAR (/*nada*/ |LCORC RCORC);
compstmt: LLLAVE localdec statlist RLLAVE;
y : vardec (/*nada*/| y);
localdec: /*epsilon*/
| y;
w : statment (/*nada*/|w);
statlist: /*epsilon*/
| w;
statment: exprstmt
| compstmt
| selectstmt
| iterationstmt
| returnstmt;
exprstmt: expres PUNYCO
| PUNYCO;
selectstmt
: TERIF LPAR expres RPAR statment (/*nada*/|TERELSE statment) ;
iterationstmt
: TERWHILE LPAR expres RPAR statment;
returnstmt
: RETURN (PUNYCO|expres);
expres : VAR IGUAL expres
| simpexpres;
simpexpres
: addexpres (RELOP addexpres | /*nada*/);
r : ADDOP term (/*nada*/| r) ;
addexpres
: term (/*nada*/| r) ;
s : MULOP factor (/*nada*/| s);
term : MULOP factor (/*nada*/| s);
factor : LPAR expres RPAR
| VAR
| call;
call : VAR LPAR args RPAR;
args : arglist
| /*epsilon*/;
l : COMA expres +(/*nada*/| l) ;
arglist : expres (/*nada*/| l) ;
PUNYCO : ';';
LCORC : '[';
RCORC : ']';
LPAR : '(';
RPAR : ')';
LLLAVE : '{';
RLLAVE : '}';
MULOP : ('*'|'/');
ADDOP : ('+'|'-');
RELOP : ('<=' | '<' | '>' | '>=' | '==' | '!=');
IGUAL : '=';
PUYCO : ';';
COMA : ',';
TERINT : 'int';
TERVOID : 'void';
TERIF : 'if';
TERELSE : 'else';
TERWHILE: 'while';
RETURN : 'return';
VAR : ('a'..'z'|'A'..'Z')(('a'..'z'|'A'..'Z')|'0'..'9')*;
LITERAL : ('1'..'9') ('0'..'9')*;
And I need the lexer code, but it keeps giving me an error: “The following token definitions can never be matched because prior tokens match the same input: PUYCO”. When I looked for the mistake I found that these symbols are shown in red: “selectstmt”, “s”, “term”. I don’t really know what to do now. I would appreciate your help.
You have 2 problems here:
1) Empty alternatives: (/*nada*/| y)
Just remove the ‘nada’ comment, including the or sign ‘|’. If you want an optional subrule, just add ‘?’: rule : subrule1 subrule2 (optionalrule value)?;
2) You cannot have two LEXER rules that match exactly the same input (here PUNYCO and PUYCO both match ‘;’).
Your fixed grammar:
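To make point 1 concrete, here is a small sketch using the paramlist and x rules from the question. The second variant with ‘*’ is my own suggestion and not part of the original grammar:

```
// Before: optionality expressed by an empty alternative
// paramlist : param (/*nada*/ | x);
// x         : COMA param (/*nada*/ | x);

// After: the same optionality expressed with '?'
paramlist : param x? ;
x         : COMA param x? ;

// Alternative (assumption: the helper rule x is not needed elsewhere):
// paramlist : param (COMA param)* ;
```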
—
grammar punto;
prog
: declist;
z : dec;
declist : dec (z)?;
dec : vardec
| fundec;
vardec : typespe VAR (PUNYCO| LCORC LITERAL RCORC);
typespe : TERINT
| TERVOID;
fundec : typespe VAR LPAR params RPAR compstmt;
params : paramlist
| TERVOID;
x : COMA param (x)?;
paramlist
: param (x)?;
param : typespe VAR (LCORC RCORC)?;
compstmt: LLLAVE localdec statlist RLLAVE;
y : vardec (y)?;
localdec: /*epsilon*/
| y;
w : statment (w)?;
statlist: /*epsilon*/
| w;
statment: exprstmt
| compstmt
| selectstmt
| iterationstmt
| returnstmt;
exprstmt: expres PUNYCO
| PUNYCO;
selectstmt
: TERIF LPAR expres RPAR statment (TERELSE statment)? ;
iterationstmt
: TERWHILE LPAR expres RPAR statment;
returnstmt
: RETURN (PUNYCO|expres);
expres : VAR IGUAL expres
| simpexpres;
simpexpres
: addexpres (RELOP addexpres)?;
r : ADDOP term (r)? ;
addexpres
: term (r)? ;
s : MULOP factor (s)?;
term : MULOP factor (s)?;
factor : LPAR expres RPAR
| VAR
| call;
call : VAR LPAR args RPAR;
args : arglist
| /*epsilon*/;
l : COMA expres +(l)? ;
arglist : expres (l)? ;
PUNYCO : ';';
//PUYCO : ';';
LCORC : '[';
RCORC : ']';
LPAR : '(';
RPAR : ')';
LLLAVE : '{';
RLLAVE : '}';
MULOP : ('*'|'/');
ADDOP : ('+'|'-');
RELOP : ('<=' | '<' | '>' | '>=' | '==' | '!=');
IGUAL : '=';
COMA : ',';
TERINT : 'int';
TERVOID : 'void';
TERIF : 'if';
TERELSE : 'else';
TERWHILE: 'while';
RETURN : 'return';
VAR : ('a'..'z'|'A'..'Z')(('a'..'z'|'A'..'Z')|'0'..'9')*;
LITERAL : ('1'..'9') ('0'..'9')*;
Hi!
I have problems with this grammar:
http://www.antlr.org/grammar/1202750770887/vhdl.g
I have error when I check grammar: “The following token definitions can never be matched because prior tokens match the same input INTEGER, LETTER, …”.
Can you tell me, is this grammar correct, or may be I do something wrong. Thanks.
Your lexer rules BASE and HEXDIGIT are never used; remove them.
You are not using LETTER and DIGIT in your parser rules, only in your lexer rules, so you should make them fragments, like so:
fragment LETTER
: 'a'..'z' | 'A'..'Z'
;
fragment DIGIT
: '0'..'9'
;
fragment INTEGER
: DIGIT ( '_' | DIGIT )*
;
Good luck!
Thank you very much!
I have a question about taking input that may happen to match a token. If, in your example, you wanted to do something like “There is a an”, there is a Mismatched Token Exception (in the interpreter) because ‘an’ matches a token. I realize in this example that would be silly, but I’m fighting this in a larger file I’m trying to use for a cli. I have a bunch of tokens defined for the various input language fields, but if the user wants to set something to a word that just happens to match one of the tokens, there is an exception. Is there a way in your example to have ID match anything, even if the input matches a token? That is a way to ignore the tokens for this particular input and just accept what’s there regardless if it matches anything?
Hi,
I have this frequent error reported in the console: error(208): ct.g:126:1: The following token definitions can never be matched because prior tokens match the same input: CHAR
Even with the AntLR book I am not able to point at the error. Any suggestion would be appreciated…
Here is a short example for testing either the sugar or the res_section rules. Every row starting with a number corresponds to an input for the residue rule. The 2nd character refers to monosac_specification when it is b, or to substit_specification when it is s:
RES
1b:b-dglc-HEX-1:5
2b:b-dglc-HEX-1:5
3b:b-dglc-HEX-1:5
4b:b-dglc-HEX-1:5
5b:b-dglc-HEX-1:5
6s:n-acetyl
7b:b-dglc-HEX-1:5
And the grammar:
grammar ct; /* a grammar for carbohydrates in GlycoCT format */
sugar : res_section /* residues */
(lin_section)? /* linkages */
;
res_section : RES WS (residue )+ ;
lin_section : LIN WS /*(linkage)+*/ ;
residue : INTEGER
residue_specification
(WS)?
(SEMICOLON)?
;
residue_specification
:
( monosac_specification
| substit_specification
| repeat_residue_specification
//| inchi_specification // TODO later
)
;
monosac_specification
:
MONOSAC_DECLARATION
COLON
monosaccharide
;
substit_specification
:
SUBSTIT_DECLARATION
COLON
substituent
;
repeat_residue_specification
:
REPEAT_DECLARATION
CHAR // seems to always be the letter ‘r’
INTEGER // index of the repeat substructure referenced
;
substituent //ex : ‘(r)-lactate’
:
(
(LPARENTHESIS)?
(CHAR)?
(RPARENTHESIS)?
(HYPHEN)?
IDENTIFIER
)
;
monosaccharide
: anomer
HYPHEN
stereo
stem
HYPHEN
monosac_superclass
HYPHEN
monosac_ring_closure
(monosac_substituents_or_modifications)*
;
monosac_ring_closure
: terminus_position
COLON
terminus_position
;
terminus_position
: INTEGER
| HYPHEN INTEGER
| UNKNOWN_TERMINUS
;
monosac_substituents_or_modifications
:
PIPE
t1=INTEGER
( COMMA t2=INTEGER )?
COLON
monosac_modification
;
monosac_modification : IDENTIFIER ;
anomer : CHAR ;
stereo : CHAR ;
stem : IDENTIFIER ;
monosac_superclass : IDENTIFIER ;
COLON : ':' ;
COMMA : ',' ;
HYPHEN : '-' ;
PLUS : '+' ;
EQUALS : '=' ;
PIPE : '|' ;
SEMICOLON : ';' ;
LPARENTHESIS : '(' ;
RPARENTHESIS : ')' ;
UNKNOWN_TERMINUS : '?' | 'x' ;
MONOSAC_DECLARATION : 'b' ;
SUBSTIT_DECLARATION : 's' ;
INCHI_DECLARATION : 'i' ;
REPEAT_DECLARATION : 'r' ;
LINKAGE_TYPE_IDENTIFIER : LETTER ;
LINKAGE_TERMINUS_DECLARATION : INTEGER LINKAGE_TYPE_IDENTIFIER;
RES : 'RES';
LIN : 'LIN';
PRO : 'PRO';
REP : 'REP';
STA : 'STA';
ISO : 'ISO';
AGL : 'AGL';
INTEGER : ('1'..'9') ('0'..'9')* | '0' ;
IDENTIFIER : (LETTER)+ ;
CHAR : LETTER ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
WS : ( ' '
| '\t'
| (( '\r' '\n' ) | '\n')
) {$channel=HIDDEN;}
;
Thank you for sharing your case.
Hi Floris,
Could you please help me remove the ambiguities in my grammar?
filtre : expression (('AND' | 'OR')^ expression)*
;
expression
: filtreDimension
| item
| listeComptes
| '('! expression ')'!
;
filtreDimension
: '<' d=dimension '>' '=' '"' v=valeur '"' -> ^(FILTRE $d $v)
;
item
: '{' i=itemCode ',f}' -> ^(itemCode $i)
;
listeComptes
: '[' c+=compte (',' c+=compte)* ']' -> ^(COMPTES $c*)
;
dimension : ID;
itemCode : ID;
valeur : VALEUR;
compte : COMPTE;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
COMPTE : ('0'..'9')+;
VALEUR : '#'?('a'..'z'|'A'..'Z'|'0'..'9'|'_')+;
WS : ' '+ {$channel=Hidden;};
When I match the grammar against the input:
<titi> = "yes" AND <toto> = "no"
I get the following AST:
AND
—- FILTRE
——– titi
——– mismatched token
—- FILTRE
——– toto
——– mismatched token
How would you improve the lexer rules?
Thanks in advance,
Mark
I am having the most frustrating time and would love anyone to take a look at this. If you can help, you should invest some time in putting a donate button on your site!
(I could strip it down to one parser rule of just ICHAR+ EOF and it does the same thing, but I think this shows more of what I am doing.)
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
//PCT_CONTAINS : 'pct_contains'; //Here or below ICHAR cause the same issue
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
PCT_CONTAINS is a keyword in my language and is used as a built in function. It is not used here but having it defined is creating the error.
When I have it parse “pct_female” it comes back with a nasty error
line 1:4 no viable alternative at character u’f’
and then returns “ct_female”.
I understand basically what is happening and have tried a lot of things including various flavors of predication, but everything ends the same.
How is this even an issue to begin with? Many people would like a language with a keyword like ‘while’ but with the ability to make a variable like ‘whiledoingstuff’.
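One common way around this in ANTLR 3 is to let a single lexer rule match the whole identifier instead of one token per letter. The sketch below assumes that field segments are plain identifiers; the rule name IDENT is made up, and the rule is looser than ICHAR+ (USCORE ICHAR+)* because it also accepts doubled or trailing underscores:

```
fieldsegments
    : fieldsegment (DOT fieldsegment)*
    ;

fieldsegment
    : IDENT
    ;

// Keyword listed first, so the exact text 'pct_contains' is still
// tokenized as PCT_CONTAINS rather than IDENT.
PCT_CONTAINS : 'pct_contains';

// Hypothetical rule: one token per identifier, underscores included.
IDENT : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'_')* ;

DOT : '.';
WS  : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
```

Because IDENT can be longer than one character, the lexer has to look past the shared prefix ‘pct_’ before choosing a rule, so ‘pct_female’ should end up as a single IDENT. With one-letter ICHAR tokens, nothing except PCT_CONTAINS can begin with ‘pc’, so the lexer commits to the keyword and then fails at the ‘f’.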