Tiger Lexical Analyser

This project implements the lexical analysis stage of a compiler for the Tiger language. It uses JavaCC, which reads my grammar specification and converts it into a parser written in Java. At this stage no parsing is done; the input is only tokenised.

Comment Recognition

I used the same method to recognise comments as demonstrated in lectures.

When the analyser sees “/*”, a SKIP rule matches it and increments commentNesting.

It then enters the state IN_COMMENT, whose rules are also SKIP rules. Each nested comment opening increments commentNesting and each comment closing decrements it. The comment text itself is matched by “<~[]>” and thrown away.

It stays in the IN_COMMENT state until commentNesting is back to 0, which means it has reached the end of the outermost comment, and then returns to the default state.

It has now successfully skipped comments.
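
The counting logic described above can be modelled outside JavaCC. The following is only a sketch in plain Java, not the generated code: the class name CommentSkipper and the method skipComments are made up for illustration, and only the counter name commentNesting comes from the grammar.

```java
/** Hypothetical stand-alone model of the nested-comment counter; not the generated JavaCC code. */
class CommentSkipper {

    /** Returns the input with all (possibly nested) comments removed. */
    static String skipComments(String input) {
        StringBuilder out = new StringBuilder();
        int commentNesting = 0; // same role as the counter in the grammar
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("/*", i)) {
                commentNesting++;              // comment opening: one level deeper
                i += 2;
            } else if (commentNesting > 0 && input.startsWith("*/", i)) {
                commentNesting--;              // comment closing: one level back out
                i += 2;
            } else {
                if (commentNesting == 0) {
                    out.append(input.charAt(i)); // outside any comment: keep the character
                }
                i++;                           // inside a comment the character is dropped
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(skipComments("a /* x /* y */ z */ b"));
    }
}
```

Text is kept again only once the counter is back at 0, which corresponds to returning from IN_COMMENT to the default state.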

Reserved Word, Punctuation & Operators Recognition

This is simply a list of declarations, one for each reserved word, punctuation symbol and operator, and is self-explanatory.

Identifiers and Integers Recognition

I have defined identifiers as follows:

 < IDENTIFIER : <LETTER> (<LETTER>|<DIGIT>|"_")* > 

This means an identifier is a letter followed by zero or more letters, digits or underscores (_).

Integers are defined as:

 <INTEGER : (<DIGIT>)+ > 

This means an integer is one or more digits.

LETTER and DIGIT are defined as private regular expressions:

 < #LETTER : ["A"-"Z", "a"-"z"] >
 < #DIGIT : ["0"-"9"] >
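
As a quick way to sanity-check these two definitions outside JavaCC, the same shapes can be written with java.util.regex. This is a sketch only: the class TokenShapes and its methods are hypothetical, and the patterns are my translation of the grammar rules above rather than anything JavaCC produces.

```java
import java.util.regex.Pattern;

/** Hypothetical checker mirroring the IDENTIFIER and INTEGER rules with java.util.regex. */
class TokenShapes {
    // A letter followed by zero or more letters, digits or underscores.
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");
    // One or more digits.
    static final Pattern INTEGER = Pattern.compile("[0-9]+");

    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    static boolean isInteger(String s) {
        return INTEGER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("count_1")); // starts with a letter
        System.out.println(isIdentifier("_count"));  // starts with an underscore
        System.out.println(isInteger("007"));
    }
}
```

Note that an identifier may not start with an underscore or a digit, since the first symbol must be a LETTER.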

String Recognition

This was the most complicated token to recognise. I will show and explain the complete code here.

 MORE : /* Strings */  {  "\""  : WITHIN_STRING  }

MORE is one of the four kinds of regular expression production (alongside TOKEN, SPECIAL_TOKEN and SKIP). The kind specifies what to do when a regular expression has been successfully matched.

The JavaCC documentation defines it as:

Continue (to whatever the next state is) taking the matched string along.  This string will be a prefix of the new matched string.

So first the analyser sees an opening quote, matches it against this MORE rule (“), and enters the WITHIN_STRING state. Whatever follows that quote is then matched against the regular expressions in the MORE block below.

< WITHIN_STRING > MORE :
{
        <	~["\\","\""]  >
 	|	<  "\\" (["n", "t", "\\", "\""]
        |	<DIGIT><DIGIT><DIGIT>
        |	"^"["A"-"Z", "a"-"z"]) > 

 	/* When we encounter whitespace between two \'s,
        remove it, including the \'s */
 	|	< "\\" ([" ","\t","\n","\r"])+ "\\" >
 	{
 		image.delete(image.length() - lengthOfMatch, image.length());
 	}
	: WITHIN_STRING
}

Most of it is obvious. ~["\\","\""] matches any single character other than a backslash or a double quote, which covers the printable characters allowed by the Tiger specification.

 <  "\\" (["n", "t", "\\", "\""]|<DIGIT><DIGIT><DIGIT>|"^"["A"-"Z", "a"-"z"]) > 

The above handles escape sequences such as “\n” and “\t”, control characters such as CTRL-C and CTRL-Z (written “\^C” and “\^Z”), and three-digit ASCII character codes. Note: I realise ASCII character codes shouldn’t go past 255, but for simplicity I decided not to enforce this, as I doubt there are any marks going for it.
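
For completeness, the range check that the grammar leaves out could be done in a lexical action or a later pass. The sketch below is not part of the project: the class AsciiEscape and the method isValidDecimalEscape are hypothetical names, and it assumes the three digits have already been separated from the leading backslash.

```java
/** Hypothetical range check for a \ddd escape; the submitted grammar does not do this. */
class AsciiEscape {

    /** True if the text is exactly three decimal digits with a value of at most 255. */
    static boolean isValidDecimalEscape(String digits) {
        return digits.matches("[0-9]{3}") && Integer.parseInt(digits) <= 255;
    }

    public static void main(String[] args) {
        System.out.println(isValidDecimalEscape("065"));
        System.out.println(isValidDecimalEscape("256"));
    }
}
```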

The most interesting part of this is where it matches a run of white space between two backslashes and removes it from the token. This is achieved by attaching an action to that match.

The code to remove the white space is as follows:

 image.delete(image.length() - lengthOfMatch, image.length()); 

lengthOfMatch is a handy variable that contains the length of the current match only (it is not cumulative over successive MORE matches).

Finally, the rule below is matched when the closing quote is met. Its action assigns the value of image to matchedToken.image, which is in turn passed back as the token's value.
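
To see what the delete call does in isolation, here is a small sketch in plain Java. The class ImageTrim and the method dropLastMatch are hypothetical, and modelling image as a StringBuilder is an assumption about the buffer type; only the delete expression itself is taken from the grammar.

```java
/** Hypothetical demonstration of trimming the most recent match off the end of 'image'. */
class ImageTrim {

    /** Removes the last lengthOfMatch characters from image and returns what is left. */
    static String dropLastMatch(StringBuilder image, int lengthOfMatch) {
        // Same expression as in the grammar action: cut the tail that was just matched.
        image.delete(image.length() - lengthOfMatch, image.length());
        return image.toString();
    }

    public static void main(String[] args) {
        // Accumulated so far: opening quote, "abc", then backslash, space, newline, space, backslash.
        StringBuilder image = new StringBuilder("\"abc\\ \n \\");
        int lengthOfMatch = 5; // the backslash-whitespace-backslash run is 5 characters long
        System.out.println(dropLastMatch(image, lengthOfMatch));
    }
}
```

Because lengthOfMatch covers only the current match, the text accumulated by earlier MORE matches (the opening quote and abc here) is left untouched.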

< WITHIN_STRING > TOKEN :
{
	/* Assign the altered 'image' to matchedToken.image */
	< STRING : "\"" >	/* the token name STRING is assumed here */
	{
		matchedToken.image = image.toString();
	}
	: DEFAULT
}

Printing Tokens

In order to print out the tokens, I have taken advantage of the generated constants file, in my case ‘TigerTokeniserConstants.java’. I used these constants as an index for an array which stores an English description for each token type. After we call getNextToken(), the analyser can then simply refer to this array to print the token name/description and then the lexeme/value.
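
The lookup can be sketched as follows. Everything here is illustrative: the class TokenPrinter, the constant values and the description strings are hypothetical stand-ins for what lives in the generated constants file and in my own array.

```java
/** Hypothetical model of the description lookup; constants and strings are stand-ins. */
class TokenPrinter {
    // Stand-ins for the integer constants JavaCC generates for each token kind.
    static final int EOF = 0;
    static final int IDENTIFIER = 1;
    static final int INTEGER = 2;
    static final int STRING = 3;

    // English description for each token kind, indexed by the constant's value.
    static final String[] DESCRIPTIONS = {
        "end of file", "identifier", "integer", "string"
    };

    /** Formats one token as "description: lexeme". */
    static String describe(int kind, String image) {
        return DESCRIPTIONS[kind] + ": " + image;
    }

    public static void main(String[] args) {
        System.out.println(describe(IDENTIFIER, "foo"));
        System.out.println(describe(INTEGER, "42"));
    }
}
```

In the real analyser the kind and image would come from the Token returned by getNextToken().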

Encountering Errors

The analyser throws an exception when it reaches something illegal.

Any other character we don’t recognise is funnelled into the OTHER token:

TOKEN : /* Anything else that we don't recognise */
{
     < OTHER : ~[] >
}

Requirements

Files & Running

To generate the parser, you must first install JavaCC, download the grammar file (TigerTokeniser.jj), and then run:

javacc TigerTokeniser.jj
javac *.java
java TigerTokeniser someTextFile