The first task of a compiler is tokenizing its input, also known as lexical analysis. Tokens are typically defined by regular expressions, which can be converted to finite automata (state machines). As such, a lexer generator takes a list of tokens with their associated regexes and produces a lexer that converts its input into a sequence of those tokens. Flex is one such lexer generator. Your task in this lab is to declare the tokens listed below in `parser.y` and write Flex rules in `lexer.l` that produce them.
The Flex manual will be useful for this lab. While you don't need much to complete this lab, reading more of the documentation on your own will give you a better understanding of how these tools work.
The Flex input file in the starter code is `src/lexer.l`. The Flex manual contains everything you need to know about this file. We bring your attention to a few things.
A lexer produces a stream of tokens from an input string. These tokens are processed by a parser, so the lexer and parser must use the same representation for tokens. With Flex and Bison, Bison defines the tokens, which get written to a parser header file, and the Flex-generated lexer `#include`s this header for the token definitions.
When you run `make`, the parser header is written to `build/src/parser.hpp` (a corresponding `.cpp` file exists in the same directory). You may want to look through this file for a better understanding of how it works. In particular, the tokens are defined in an enum `token_kind_type`, namespaced inside a struct `token` inside the `yy::parser` class. The type of these enum values is `typedef`'d to `yy::parser::token_kind_type`, and you would access a token `HI` with `yy::parser::token::HI`.
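For orientation, the relevant part of the generated header has roughly the shape sketched below. This is a simplified illustration of the structure described above, not the literal generated file; the enumerator value is made up.

```cpp
// Simplified sketch of the token definitions in build/src/parser.hpp.
// The real generated file contains much more; the enumerator value here
// is illustrative only.
namespace yy {
  class parser {
  public:
    struct token {
      enum token_kind_type {
        HI = 258   // one enumerator per %token declaration in parser.y
      };
    };
    typedef token::token_kind_type token_kind_type;
  };
}

// A token named HI is therefore referenced as yy::parser::token::HI,
// and its type is yy::parser::token_kind_type.
```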
You define tokens in the `src/parser.y` file by entering a line like `%token <std::string> HI`. Here, `std::string` is the type of the data associated with the `HI` token; you can specify any C++ type here. Tokens such as numbers or strings require this extra data for compilation. For tokens that don't require accompanying data, such as a `PLUS` operator token, you may omit the `<type>` part of the token declaration. Often, different tokens have different types of associated data; for example, an integer token would carry an `int`, while a string token would carry a `char*`.
Traditionally in Bison, this was implemented using a C-style union, and most tutorials you find online still use it. Unfortunately, the limitations of C-style unions prevent you from using more complex C++ types such as `std::string` or `std::unique_ptr`. A newer feature of Bison is variants, essentially a type-safe union implemented using metaprogramming tricks. Unfortunately, the Bison manual's coverage of this is a bit scattered; in particular, we point you to here and here.
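Putting the pieces together, the token section of `parser.y` might look something like the sketch below. The variant-related directives are standard Bison, but the starter file may already set them up for you, and the token names and types here are only illustrative; check the starter code before copying anything.

```
/* Hypothetical excerpt from parser.y; adapt to the starter file's setup. */
%require "3.2"
%language "c++"
%define api.value.type variant

%code requires {
  #include <string>   /* types used in %token declarations must be visible here */
}

/* Tokens carrying data specify a C++ type in angle brackets. */
%token <std::string> TOK_IDENTIFIER
%token <int>         TOK_INTEGER
%token <float>       TOK_FLOAT

/* Tokens with no associated data omit the <type> part. */
%token TOK_PLUS TOK_MINUS TOK_STAR TOK_SLASH TOK_ASSIGN
```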
The starter `lexer.l` includes macros for simplifying the creation of tokens in the lexer. In particular, if you define a token `HI` with type `std::string`, you can use the `TOK` macro like `return TOK(HI, <column number>, my_string)` to return the corresponding token. For a token defined with no associated data, such as the built-in `YYUNDEF` type symbolizing an invalid token, you can call it simply as `TOK(YYUNDEF, <column number>)`. In other words, you should have a `%token` declaration in your `parser.y` file for each type of token listed below.
Certain tokens will require you to store additional data. When Flex matches a rule, the matched text is available as `char* yytext` with length `int yyleng`.
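For example, a rule for a token that carries data can build its value from `yytext` before handing it to the `TOK` macro. The sketch below uses the illustrative token names from the `parser.y` sketch above, and `col` is only a placeholder for however the starter code obtains the current column (see the column-tracking discussion further down).

```
    /* Hypothetical rules-section excerpt from lexer.l; token names and the
       `col` placeholder are assumptions, not part of the starter code. */
[a-zA-Z_][a-zA-Z0-9_]*    { return TOK(TOK_IDENTIFIER, col, std::string(yytext, yyleng)); }
[0-9]+                    { return TOK(TOK_INTEGER, col, std::stoi(yytext)); }
```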
When declaring your tokens in the `parser.y` file, the names should have a `TOK_` prefix and be all uppercase. This will simplify proper output later on.
Name | Description |
---|---|
identifier | letter or underscore, followed by 0 or more letters, numbers, or underscores |
integer | a decimal integer (no hex, binary, or octal literals) |
float | a C floating point literal as described here. You do not need to implement hex literals or suffixes. |
true, false | boolean literals |
lparen, rparen | left and right parenthesis |
lbrace, rbrace | left and right curly braces |
eq, ne, lt, gt, le, ge | (in)equality relational operators |
plus, minus, star, slash | basic binary arithmetic operators |
log_and, log_or | logical and (&&), logical or (||) |
if, while, for, break, continue, return | control-flow keywords |
comma, semicolon, colon, question_mark | punctuation |
assign | single equals sign |
Update for Lab 3: plus_assign, minus_assign, star_assign, slash_assign | augmented assignments |
Update for Lab 3: type | the literal tokens "int", "float", "bool", and "void" |
The starter `lexer.l` has a `LOC` macro which creates the location object used for tokens; this is used by the `TOK` macro. Due to the structure of control flow, the `TOK` macro takes a column number as its second parameter. Flex already tracks line information in an `int yylineno`, but it does not track columns. Using Flex's extra data feature for reentrant scanners, the starter code contains macros for getting, setting, and incrementing a column counter. To implement column tracking, you should use the `SET_COLUMN` and `INC_COLUMN` macros.
Other notes:

- For `yyterminate` to work properly, the location should point to the first character after the token being returned.
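A rough sketch of how the column macros might be used in `lexer.l` rules is shown below. The exact arguments of `SET_COLUMN` and `INC_COLUMN`, whether columns start at 0 or 1, and how the current column is read when calling `TOK` are all defined by the starter code, so treat this only as an illustration of the idea that the column is advanced past the matched text before the token's location is built.

```
    /* Hypothetical column-tracking rules; macro arguments and the `col`
       placeholder must be adapted to the starter code's actual definitions. */
\n          { SET_COLUMN(1); }        /* newline: reset the column counter       */
[ \t\r]+    { INC_COLUMN(yyleng); }   /* whitespace: advance past matched text   */
"+"         {
              INC_COLUMN(yyleng);     /* advance first, so the stored location    */
              return TOK(TOK_PLUS, col);  /* points just past the returned token  */
            }
```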
The usage for your compiler is `ece467c <lab #> <input file>`.
If you named your tokens properly, there should be nothing else to do. The driver in `src/main.cpp`, which calls `lex` in `src/compiler.cpp`, handles printing the type of tokens lexed. You do not need to change anything for error handling; the starter code's error handling routine is expected. For this lab, lines containing `output` or `error` will be grep'd and compared against the desired output.
The following test cases are provided for you:
Please fill out your student number, first name, and last name at the top of `src/main.cpp`. Include a `doc.txt` that summarizes your contributions. Additionally, include any major design/implementation decisions; this will be used to award partial marks if there are failing test cases. Please name your project directory `ece467-<student number 1>-<student number 2>` before compressing it (`tar -czvf ece467-<student numbers>.tar.gz --exclude build ece467-<student numbers>`). Submit the resulting archive on Quercus.