UofT ECE 467 (2022 Fall) - Lab 1



The website and starter code are being pre-emptively updated!

I don't know if I'll be teaching the course again, but if I wait until next year I'll forget feedback I've received. Note that more test cases have been added (if you pull the new starter code); those are for the (potential) future!


Introduction

The first task of a compiler is tokenizing its input, also known as lexical analysis. Tokens are typically defined by regular expressions, which can be converted to finite automata (state machines). As such, a lexer-generator takes a set of tokens with their defining regexes and produces a lexer, which turns a sequence of characters into a sequence of tokens. Commonly, “producing a lexer” means generating source code, in some target language, that implements the lexer specified by the tokens and their regexes. re2c is one such lexer-generator, and it is the one we will be using.
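As a toy illustration (not the lab's starter code), a self-contained re2c input file can look like the following sketch; running re2c over it (e.g. re2c input.re.cpp -o output.cpp) replaces the /*!re2c ... */ block with C++ code implementing the corresponding state machine.

#include <cstdio>

// Toy scanner: returns 1 if the NUL-terminated input starts with a run of
// digits, and 0 otherwise. YYCURSOR is the pointer the generated code
// advances through the input; YYMARKER supports backtracking. Both are
// re2c's default names (configurable via re2c:define).
static int scan(const char *YYCURSOR) {
    const char *YYMARKER;
    /*!re2c
        re2c:yyfill:enable = 0;
        re2c:define:YYCTYPE = char;

        [0-9]+ { return 1; }
        *      { return 0; }
    */
}

int main() {
    std::printf("%d\n", scan("42"));  // prints 1
    std::printf("%d\n", scan("ab"));  // prints 0
}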

The re2c manual will be useful for this lab. While you don’t need much of it to complete these tasks, reading more of the documentation will give you a better understanding of how these tools work.

Lab Description

Your tasks for this lab will be to:

1. declare the tokens listed under Token Listing below in src/parser.y;
2. write the corresponding regexes and lexer rules in src/lexer.re.cpp; and
3. implement location tracking in src/lexer.re.cpp (see Location Tracking below).

The re2c input file in the starter code is src/lexer.re.cpp. The re2c manual includes everything you need to know about this file. In particular, see the section on regular expression syntax.

Additionally, we bring your attention to a few things.

A lexer produces a sequence of tokens from an input sequence of characters. These tokens are processed and “given structure to” by a parser. As such, the lexer and parser must use the same representation for tokens. In Lab 2, we will be using the parser-generator Bison. Tokens are declared in the Bison input file (src/parser.y), and their definitions are written to a header file. The re2c input file (and its output source file) will #include this header for the token definitions. When you run make, the parser header is written to build/src/parser.y.hpp (a corresponding .y.cpp file exists in the same directory). You may want to look through this header for a better understanding of how it works.
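Concretely, the top of src/lexer.re.cpp pulls in the generated header along these lines (a sketch; the exact include path depends on the include directories the build sets up):

// Bison-generated token definitions (written to build/src/parser.y.hpp by make).
#include "parser.y.hpp"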

In particular, the tokens are defined in an enum token_kind_type namespaced inside a struct token inside the yy::parser class. The type of these enum values is typedef’d to yy::parser::token_kind_type, and you would access a token HI as yy::parser::token::HI. You define tokens in src/parser.y by entering a line like the following.

%token <std::string> HI

Here, std::string is the type of the data associated with the HI token. You can specify any C++ type here. This data is required for tokens such as number literals. For tokens that don’t require accompanying data, such as a PLUS operator token, you may omit the <type> part of the token declaration. Note that it’s generally easier to store the associated lexeme as a plain string, and parse it later (e.g. to verify that an integer is within bounds).

Often, different tokens have different types of associated data; for example, an integer literal token may have an int value, whereas a string literal token may have a std::string value. Traditionally in Bison, these values were represented using a C union, and most tutorials you find online still use this old C union. Unfortunately, limitations of C unions prevent more complex C++ types such as std::string or std::unique_ptr from being used. Bison later added support for variants: tagged unions which expose a type-safe interface. The Bison manual’s coverage of variants is a bit scattered; in particular, we point you to here and here.
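In a variant-based setup, the relevant parts of a .y file look something like the following sketch (the starter code already configures this for you; IDENTIFIER and PLUS are illustrative token names):

%define api.value.type variant
%token <std::string> IDENTIFIER  /* carries its lexeme as associated data */
%token PLUS                      /* no associated data needed */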

Token Listing

The tokens your compiler will handle are listed below. It is your job to come up with the corresponding regexes.

The test cases for marking are all provided; none of the tests ever try to differentiate the relative precedence of the relational operators.

identifier: a letter or underscore, followed by 0 or more letters, numbers, or underscores
integer: a decimal integer (no hex, binary, or octal literals)
float: a C floating point literal as described here. You do not need to implement hex literals or suffixes, or ' separators
lparen, rparen: left and right parentheses: (, )
lbrace, rbrace: left and right curly braces: {, }
rel_eq, rel_ne, rel_lt, rel_le, rel_gt, rel_ge: relational operators: ==, !=, <, <=, >, >=
plus, minus, star, slash, percent: arithmetic operators: +, -, *, /, %
log_and, log_or, log_not: logical operators: &&, ||, !
if, else, while, break, continue, return: control-flow keywords
comma, semicolon: miscellaneous punctuation
assign: single =
comment: single-line // comment. Simply ignore (discard) the comment until the end of the line (no “comment” token is needed for the parser)
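To make the shape of these rules concrete, here is a hedged sketch of a few of them in re2c syntax (the token names are illustrative; use whichever names you declare in src/parser.y, and adapt the actions to the starter code's scaffolding). Note that re2c matches the longest possible lexeme, so <= lexes as a single rel_le token rather than as rel_lt followed by assign:

/*!re2c
    [a-zA-Z_][a-zA-Z0-9_]* { return yy::parser::token::IDENTIFIER; }
    "<="                   { return yy::parser::token::REL_LE; }
    "<"                    { return yy::parser::token::REL_LT; }
    // discard single-line comments: no token is produced
    "//" [^\n]*            { continue; }
*/

The continue in the comment rule assumes the generated scanner runs inside a loop, as is typical; depending on how the starter code handles its input buffer, you may also need to stop the comment rule from running past the end-of-input sentinel.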

Location Tracking

Instead of trying to keep track of the line and column number during lexing, it is easier to record the offset of each newline in the file (the input byte string) once at the beginning, then use those newline offsets to compute the line and column number for any given offset into the file. You can populate the newline offsets in the constructor for Lexer in src/lexer.re.cpp, then fill in the computation of line and column numbers in Lexer::location in the same file. Note that line and column numbers should start at 1. The end offset in the starter code is exclusive; i.e. the end line/column will point to the character after the token.
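The computation itself can be a binary search over the recorded newline offsets. The following is a minimal sketch under assumed names (newline_offsets as a sorted std::vector<std::size_t> of the byte offset of every '\n'; the starter code's actual members and the signature of Lexer::location differ):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Map a byte offset into the file to a 1-based (line, column) pair.
// newline_offsets must hold the offset of every '\n', in increasing order.
std::pair<std::size_t, std::size_t> line_col(
    const std::vector<std::size_t> &newline_offsets, std::size_t offset) {
    // The number of newlines strictly before `offset` is the 0-based line
    // number, so add 1 for 1-based numbering.
    auto it = std::lower_bound(newline_offsets.begin(), newline_offsets.end(), offset);
    std::size_t line = static_cast<std::size_t>(it - newline_offsets.begin()) + 1;
    // The column is measured from the character after the previous newline
    // (or from the start of the file on line 1), starting at 1.
    std::size_t col = (line == 1) ? offset + 1 : offset - *(it - 1);
    return {line, col};
}

Since the end offset in the starter code is exclusive, calling this with the end offset directly yields the line/column of the character just past the token, as required.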

Test Cases

The test cases used for marking are provided in the starter code’s test directory. See the comment at the top of each file for a description.

Marking is based on the exit code of the process: 0 for success, nonzero for failure. After running your compiler, you can run echo $? in your shell to check the exit code of the previous process.

Last updated: 2022-12-23 09:56:36 -0500.