Lab 2 - UofT ECE 467 (2022 Fall)

I don't know if I'll be teaching the course again, but if I wait until next year I'll forget feedback I've received. Note that more test cases have been added (if you pull the new starter code); those are for the (potential) future!

Introduction

After tokenizing its input, a compiler deduces the structure of the code by parsing. The structure of a language is defined by a grammar. Traditionally, programming languages were specified using context-free grammars, although newer grammar definitions such as parsing expression grammars exist. We will be using GNU Bison to generate a bottom-up LALR(1) parser from a context-free grammar. Bison has been used for many programming languages and tools, including GCC, which used it for C++ until 2004 (version 3.4). (It then switched to a hand-written recursive descent parser, due to the complex nature of the C++ language.)

Lab Description

The Bison input file in the starter code is src/parser.y. The Bison manual contains everything you need to know about this file. In particular, I recommend reading the following sections.

In Lab 1, you should have defined the tokens for your language in your Bison file. The tokens in lexical analysis are known as terminals in a grammar. Productions are rules which define nonterminals as a sequence of symbols. A symbol is simply a terminal or nonterminal. Multiple productions for a single nonterminal may exist. Productions are typically associated with actions which execute when the production rule is matched.

Language Constructs

Both types and names are just identifier tokens, but may be distinguished for clarity.

You should support the following top-level declarations, where parameters may be a (possibly empty) comma-separated list of type name (e.g. int x).

// function declaration
type name(parameters);

// function definition
type name(parameters) {
	// statements
}

if (expression) {
	// statements
}

if (expression) {
	// statements
} else {
	// statements
}

if (expression) {
	// statements
} else if (expression) {
	// statements
} // possibly more `else if`s, possibly ending in one `else`

while (expression) {
	// statements
}

type name = expression;

name = expression;

expression;

continue;

break;

return expression;

You should support the following expressions, where arg is a (possibly empty) comma-separated list of expressions, and unary_op just needs to support arithmetic negation (optionally: logical negation):

INTEGER

FLOAT

name

(expression)

(type) expression

expression binary_op expression

unary_op expression

name(arguments)

Precedence and Associativity

If you directly implement the grammar above, you should end up with conflicts in your parser, due to ambiguous precedence and associativity in expression binary_op expression. You should implement precedence and associativity as in C: link. Note that you should do this by rewriting the grammar, not by using the %left, %right, %nonassoc, %precedence, or %prec directives that Bison supports. See the below section for an in-depth guide on rewriting grammars for precedence and associativity.

There should be a src/parser.y.output file in your build directory; see the Understanding Your Parser for more information about this file.

Conflict Resolution

Consider the following numbered grammar for boolean expressions, which is similar to the grammar definition for (binary) expressions.

1. bexpr := bexpr bop bexpr
2. bexpr := NOT bexpr
3. bexpr := LPAREN bexpr RPAREN
4. bexpr := TRUE
5. bexpr := FALSE
6. bop := OR
7. bop := AND

Given an input of TRUE OR TRUE OR TRUE, what are the possible (right-most) derivations? (Recall that if a left-most or right-most derivation is unique, the other is unique too.) The following derivation produces a right-associative binary expression, producing the parse tree (denoted by parentheses) TRUE OR (TRUE OR TRUE).

bexpr
bexpr bop bexpr (1)
bexpr bop bexpr bop bexpr (1)
bexpr bop bexpr bop TRUE (4)
bexpr bop bexpr OR TRUE (6)
bexpr bop TRUE OR TRUE (4)
bexpr OR TRUE OR TRUE (6)
TRUE OR TRUE OR TRUE (4)

The following derivation produces a left-associative binary expression, producing the parse tree (TRUE OR TRUE) OR TRUE.

bexpr
bexpr bop bexpr (1)
bexpr bop TRUE (4)
bexpr OR TRUE (6)
bexpr bop bexpr OR TRUE (1)
bexpr bop TRUE OR TRUE (4)
bexpr OR TRUE OR TRUE (6)
TRUE OR TRUE OR TRUE (4)

Since two right-most derivations exist, the grammar is ambiguous. Now consider the following rewritten grammar.

1. bexpr := bexpr bop bexpr'
2. bexpr := bexpr'
3. bexpr' := NOT bexpr
4. bexpr' := LPAREN bexpr RPAREN
5. bexpr' := TRUE
6. bexpr' := FALSE
7. bop := OR
8. bop := AND

Now, only the following derivation is possible, producing the left-associative parse tree (TRUE OR TRUE) OR TRUE.

bexpr
bexpr bop bexpr' (1)
bexpr bop TRUE (5)
bexpr OR TRUE (7)
bexpr bop bexpr' OR TRUE (1)
bexpr bop TRUE OR TRUE (5)
bexpr OR TRUE OR TRUE (7)
bexpr' OR TRUE OR TRUE (2)
TRUE OR TRUE OR TRUE (5)

Why does this work (why must the binary expression be left-associative)? Because the grammar has been “factored” so that the recursion can only apply to the left side (only the left term can contain nested expressions). A grammar for right-associative binary expressions can be written similarly.

However, there is still a slight problem. Consider the input NOT TRUE OR TRUE. It can be derived as (NOT TRUE) OR TRUE (desired, as unary operators typically have higher precedence than binary operators) the following way.

bexpr
bexpr bop bexpr' (1)
bexpr bop TRUE (5)
bexpr OR TRUE (7)
bexpr' OR TRUE (2)
NOT bexpr OR TRUE (3)
NOT bexpr' OR TRUE (2)
NOT TRUE OR TRUE (5)

But it can also be derived as NOT (TRUE OR TRUE) (not desired) the following way.

bexpr
bexpr' (2)
NOT bexpr (3)
NOT bexpr bop bexpr' (1)
NOT bexpr bop TRUE (5)
NOT bexpr OR TRUE (7)
NOT bexpr' OR TRUE (2)
NOT TRUE OR TRUE (5)

Try changing production 3 to bexpr' := NOT bexpr' (notice the ' on the right hand side), and producing the derivations. You should see that NOT “must bind” to bexpr', which does not contain binary expressions. In other words, it is nolonger possible to have NOT (TRUE OR TRUE) (where these parentheses denote the parse tree). To get the same expression, the parentheses in the grammar from production 4 must be used: NOT LPAREN TRUE OR TRUE RPAREN (you should verify that this works).

This grammar is still not quite what we desire; both OR and AND are left-associative with the same precedence, meaning TRUE OR TRUE AND TRUE parses as (TRUE OR TRUE) AND TRUE, and TRUE AND TRUE OR TRUE parses as (TRUE AND TRUE) OR TRUE. As in the C spec, we want AND to take precedence over OR, so TRUE OR TRUE AND TRUE should parse as TRUE OR (TRUE AND TRUE). Similar to how we rewrote the grammar to “force” it to be expanded a certain way (left-associatively), we can rewrite the grammar to enforce precedence. Consider the following grammar, taken from exercise 4.2.2g (page 207) of the textbook.

1. bexpr := bexpr OR bterm
2. bexpr := bterm
3. bterm := bterm AND bfactor
4. bterm := bfactor
5. bfactor := NOT bfactor
6. bfactor := LPAREN bexpr RPAREN
7. bfactor := TRUE
8. bfactor := FALSE

Firstly, note that the structure of productions 1-4 are similar to the previous productions 1-2 (bexpr := bexpr bop bexpr' | bexpr'); this format gives the left-associative structure. However, the grammar has been additionally factored to “force” binary expressions containing AND to be nested inside binary expressions containing OR; in other words, AND has higher precedence than OR, as desired. If you are not sure about this, try producing derivations for various input strings.

So we have seen how a grammar can be rewritten to enforce precedence and associativity. You can use this to eliminate the ambiguities in the provided grammar.

Location Tracking

Section 3.5 of the Bison manual describes how location tracking works in Bison. Note that the location data structure is slightly different for C++ (by default Bison generates parsers in C); see section 10.1.5. Note that the starter code for the lexer does not track end positions for tokens. You do not need to implement this behaviour, although reading the required documentation and implementing it is a good exercise.

Test Cases

The test cases used for marking are provided in the starter code’s test directory. See the comment at the top of each file for a description.

It is recommended that you create your own test cases to avoid bugs showing up in lab 4.

UofT ECE 467 (2022 Fall) - Lab 2

The website and starter code is being pre-emptively updated!