UofT ECE 467 (2020 Fall) - Lab 1

Welcome to the ECE 467 labs homepage for 2020 Fall. See here for the latest version of the course.

Navigation


References


Corrections for Lab 3

Introduction

The first task of a compiler is tokenizing its input, also known as lexical analysis. Tokens are typically defined by regular expressions, which can be converted to finite automata (a state machine). As such, a lexer generator takes a list of tokens with their associated regexes, and produces a lexer which sequences input to the set of tokens. Flex is one of such lexer generators. Your task in this lab will be to:

  1. familiarize yourself with the code base/structure,
  2. come up with the regexes for the tokens your compile will handle, and
  3. implement column tracking for tokens (line tracking is implemented by default in Flex)

The Flex manual will be useful for this lab. While you don't need much to complete this lab, reading more of the documentation on your own will give you a better understanding of how these tools work.

Lab Description

The Flex input file in the starter code is src/lexer.l. The Flex manual contains everything you need to know about this file. We bring your attention to a few things.

A lexer produces a stream of tokens from an input string. These tokens are processed by a parser. As such, the lexer and parser must use the same representation for tokens. With Flex and Bison, Bison defines the tokens, which get written to a parser header file, and the Flex generated lexer #includes this header for the token definitions. When you run make, the parser header is written to build/src/parser.hpp (a corresponding cpp file exists in the same directory). You may want to look through this file for a better understanding of how it works.

In particular, the tokens are defined in an enum token_kind_type namespaced inside a struct token inside the yy::parser class. The type of these enum values is typedef'd to yy::parser::token_kind_type, and you would access a token HI with yy::parser::token::HI. You define tokens in the src/parser.y file by entering a line like %token <std::string> HI. Here, std::string is the type of the data associated with the HI token. You can specify any C++ type here. Tokens such as numbers or strings require this extra data for compilation. For tokens that don't require accompanying data, such as a PLUS operator token, you may omit the <type> part of the token declaration.

Often, different tokens have different types of associated data; for example, an integer token would have an int, a string token would have a char*. Traditionally in Bison, this was implemented using a C style union. Most tutorials you find online will use this old C style union. Unfortunately, limitations of C style unions prevented you from using more complex C++ types such as std::string or std::unique_ptr. A newer feature of Bison is variants, essentially a type-safe union implemented using smart meta-programming tricks. Unfortunately, the Bison manual on this is a bit scattered; in particular, we point you to here and here.

The starter lexer.l includes macros for simplifying the creation of tokens in the lexer. In particular, if you define a token HI with type std::string, you can use the TOK macro like return TOK(HI, <column number>, my_string) to return the corresponding token. For a token defined with no associated data, such as the built in YYUNDEF type symbolizing an invalid token, you can call it simply as TOK(YYUNDEF, <column number>).

In other words, you should have a %token declaration in your parser.y file for each type of token, listed below. Certain tokens will require you store additional data. When Flex matches a rule, the matched text is available inside a char* yytext of length int yyleng.

Token Listing
The tokens your compiler will handle are listed below. It is your job to come up with the corresponding regexes. When you declare these tokens in your parser.y file, the names should have a TOK_ prefix and be all uppercase. This will simplify proper output later on.
Name Description
identifier letter or underscore, followed by 0 or more letters, numbers, or underscores
integer a decimal integer (no hex, binary, or octal literals)
float a C floating point literal as described here. You do not need to implement hex literals or suffixes.
true, false boolean literals
lparen, rparen left and right parenthesis
lbrace, rbrace left and right curly braces
eq, ne, lt, gt, le, ge (in)equality relational operators
plus, minus, star, slash basic binary arithmetic operators
log_and, log_or logical and (&&), logical or (||)
if, while, for, break, continue, return control-flow keywords
comma, semicolon, colon, question_mark punctuation
assign single equals sign
Update for Lab 3: plus_assign, minus_assign, star_assign, slash_assign augmented assignments
Update for Lab 3: type the literal tokens "int", "float", "bool", and "void"
Location Tracking

The starter lexer.l has a LOC macro which creates the location object used for tokens. This is used by the TOK macro. Due to the structure of control flow, the TOK macro takes a column number as its second parameter. Flex already tracks line information in an int yylineno. However, it does not track columns. Using Flex's extra data feature for reentrant scanners, the starter code contains macros for getting, setting, and incrementing a column counter. To implement column tracking, you should use the SET_COLUMN and INC_COLUMN macros. Other notes:

Execution

The usage for your compiler is ece467c <lab #> <input file>.

Output

If you named your tokens properly, there should be nothing else to do. The driver in src/main.cpp which calls lex in src/compiler.cpp handles printing the type of tokens lexed. You do not need to change anything for error handling; the starter code error handling routine is expected. For this lab, lines with output or error will be grep'd, and compared against the desired output.

Test Cases

The following test cases are provided for you:

Submission

Please fill out your student number, first name, and last name at the top of src/main.cpp. Include a doc.txt that summarizes your contributions. Additionally, include any major design/implementation decisions; this will be used to give part marks if there are failing test cases. Please name your project directory ece467-<student number 1>-<student number 2> before compressing it (tar -czvf ece467-<student numbers>.tar.gz --exclude build ece467-<student numbers>). Submit the resulting archive on Querqus.