Lexer At work, a tricky sample

I was asked to help out with a grammar. It’s part of the “Package Manager Specification” see gentoo’s spec.

The spec allow to add dependencies, the syntax for this is : [operator][category]/[package name]-[version]

operator operator’s valid values are : ‘<‘ ‘<=’ ‘=’ ‘~’ ‘>=’ ‘>’
category 2.1.1 Category Names
A category name may contain any of the characters [A-Za-z0-9+_.-]. It must not begin with a hyphen or a
dot.
Note: A hyphen is not required because of the virtual category. Usually, however, category names will
contain a hyphen.
package name 2.1.2 Package Names
A package name may contain any of the characters [A-Za-z0-9+_-]. It must not begin with a hyphen, and
must not end in a hyphen followed by one or more digits.
version 2.2 Version Specifications
The package manager must not impose fixed limits upon the number of version components. Package managers
should indicate or reject any version that is invalid according to these rules.
A version starts with the number part, which is in the form [0-9]+(n.[0-9]+)* (a positive integer, followed
by zero or more dot-prefixed positive integers).
This may optionally be followed by one of [a-z] (a lowercase letter).

In the spec there are more requirements, but it will make this example so much more harder to understand.

now the trick is that package name (or pn for short), can not end in a hyphen followed by one or more digits’. Thus valids could be:

foo
foo-bla
foo1-bla-

but I’m not sure about the following.

foo-12a-12, if one follows the spec, it ends with -12.. so one could cut that of but you will still be left with -12a ,” a hyphen followed by one or more digits“. I would like to know it that is valid or not. Please tell me! I’m not sure.

I would like to know if this is valid: foo-123-foo? (does not end in.. I know)
I would like to know if this is valid:foo-12-12a

since category and pn and version are much alike and share much the same tokens it seemed logical to create parser rules instead of lexer rules.

dep    :    OPERATOR catagory SLASH pn HYPHEN version;

where OPERATOR is defined as:

OPERATOR:    ‘<‘|'<=’|’=’|’~’|’>=’|’>’;

and SLASH and HYPHEN :

HYPHEN    :    ‘-‘;
SLASH    :    ‘/’;

now here comes the difficult stuff:

Both Category and Pn share a lot of the same tokens. Category may contain a ‘.’  and pn may not. So if we would lex :
‘foo-bar’
would that result in  a CATEGORY token or a PN token? The answer is that Antlr would not stand by this and tell you that “the following token definitions can never be matched because prior tokens match the same input”. If there is no dot in the characters being lexed it could well have been a pn. In short both rules are quite ambigous, but it get’s wores. How do we tell the lexer the differences between a version and the pn if only characters are used that are valid for both rules. The answer is we need to do this ourselfs. The lexer will happely continue eating valid chars and thus eating characters in the version portion. It would have been easier if the final hypen would have been a character not in the range of valid chars of eiter rule.

On the other hand CATEGORY always comes after OPERATOR and PN always after SLASH. We can uses this to our advantage by using a construct named ‘semantic predicates’. We are going to use this to influence the lexer.

When the lexer is done wit operator it will set a flag telling it to match the tokens as with the category rules, and after the slash the flag will be set again to match tokes that apply to the package name rules. Resulting in “OPERATOR:    (‘<‘|'<=’|’=’|’~’|’>=’|’>’ ){category=true;};” and “SLASH    :    ‘/’  {category=false;};”.

a boolean will be set to true or to false. Which has still to be defined at the lexer by setting (basic programming rule:”declare what you want to use”) :

@lexer::members
{
boolean category=false;
}

Now Category and pn share common tokens, these are defined as follows:
fragment GENERIC    :    (UPPER|LOWER|UNDERSCORE|DIGIT);

(i use fragment so GENERIC won’t be seen as a individual token, it can be used over and over again.)

Next we are putting both

STR:
GENERIC
(
{category}?=>(GENERIC|PLUS|DOT|HYPHEN)*
|
{!category}?=>(GENERIC|PLUS|HYPHEN)*
)

So if we now set dep to : “dep    :    OPERATOR category SLASH pn; // HYPHEN VERSION;” (skipping the version) and test it:

>=cat.a.gory/packag-eName

Input was “>=cat.a.gory/packag-eName” Input was fully taken correctly.

>=cat.a.gory/packag.eName

Input was “>=cat.a.gory/packag.eName” The lexer stopped at “packag” since ‘.’ was not a valid token for package name.

Now we can add version to it, thus changing dep to : “dep    :    OPERATOR category SLASH pn HYPHEN VERSION;” ,  and try “>=cat.a.gory/packageName-12a” for inpu, telling it’s version 12.

Version not matched

Whoa what just happend?!! package name included the version!

The cause to this is because the “-12a” part can be consumed by the package name rule. And the “can not end with hyphen followed by digit” rule was not implemented.

How to implement can not end? Well currently Antlr does not support such a construct as in may not end in (not to be confused by the ‘~’ operator, meaning match anything until …). I’m going to use the input.mark, input.rewind and input release commands.  When mark is called, you get an id. With this id you can rewind back to this mark. Any tokens matched in between will become consumable once again, it’s like reversing time. Input.Release(id) does not rewind, but cleans up the administration.

The goal is to match “packageName-12a”, with a marker @ the HYPEN. Then we check if the rule “can not end in hypen.. digit” is true or false. If true, we will simply rewind and the that’s that.  So the matched “packageName-12a” will end up in “packageName” (with “-12a” rewinded, left for other rules).

Every time a hyphen comes by i’ll mark it, if we already had a mark, release it.

This is the final result:

Comments welcome!

//options
//{
//language = CSharp;
//}

//http://www.gentoo.org/proj/en/qa/pms.xml

/*
2.1.1 Category Names
A category name may contain any of the characters [A-Za-z0-9+_.-]. It must not begin with a hyphen or a
dot.
Note: A hyphen is not required because of the virtual category. Usually, however, category names will
contain a hyphen.
*/

/*
2.1.2 Package Names
A package name may contain any of the characters [A-Za-z0-9+_-]. It must not begin with a hyphen, and
must not end in a hyphen followed by one or more digits.
*/

/*
2.2 Version Specifications
The package manager must not impose fixed limits upon the number of version components. Package managers
should indicate or reject any version that is invalid according to these rules.
A version starts with the number part, which is in the form [0-9]+(n.[0-9]+)* (a positive integer, followed
by zero or more dot-prefixed positive integers).
This may optionally be followed by one of [a-z] (a lowercase letter).
*/

@lexer::members
{
boolean category=false;

//c#
//bool category=false;

}

dep    :    OPERATOR category SLASH pn HYPHEN VERSION;

HYPHEN    :    ‘-‘;
SLASH    :    ‘/’  {category=false;};
VERSION    :    DIGIT+ (DOT DIGIT+)* LOWER?;
OPERATOR:    (‘<‘|'<=’|’=’|’~’|’>=’|’>’) {category=true;};

category:    STR;
pn    :    STR;

STR
@init
{
int _mrk=-1;
//    int start = CharIndex; //c#
int start = getCharIndex(); //JAVA
}
:
//c#    {int _mrk=-1;int start = getCharIndex;}

GENERIC
(
{category}?=>(GENERIC|PLUS|DOT|HYPHEN)*
|
{!category}?=>
(

(GENERIC|PLUS|( {if (_mrk!=-1){input.release(_mrk);} {_mrk = input.mark();} }  HYPHEN))*
)
)
/*
Post Check: If this is a Package name and it ends in eg. -12 just take that off.
*/
{
if (!category)
{

String s = input.substring(start, input.getCharPositionInLine()-1);
String[] s2 = s.split(“-“);
if ( Character.isDigit(s2[s2.length-1].charAt(0)) && s2.length>1)
{
input.rewind(_mrk); _mrk = -1; //rewind (altering this tokens, leave the rest to be matched.)
}
else
{
if (_mrk != -1) { input.release(_mrk); } //release, do not alter token.
}

}
}//end of post check.
;

fragment UNDERSCORE    :    ‘_’;
fragment UPPER    :    ‘A’..’Z’;
fragment DIGIT    :    ‘0’..’9′;
fragment LOWER     :    ‘a’..’z’;
fragment DOT    :    ‘.’;
fragment PLUS    :    ‘+’;
fragment GENERIC    :    (UPPER|LOWER|UNDERSCORE|DIGIT);

WS     : (‘ ‘|’t’ | ‘r’ | ‘n’)+ {$channel=HIDDEN;};

LICENSE: All source code in this post comes with ABSOLUTELY NO WARRANTY, This is free software, and you are welcome to redistribute it under certain conditions, see GPL 2.

Leave a Reply