UofT ECE 467 (2020 Fall) - Lab 2

Welcome to the ECE 467 labs homepage for 2020 Fall. See here for the latest version of the course.

Navigation


References


Corrections

As noticed by some students, name is not part of expression, meaning variables cannot be used in expressions. This will be fixed in lab 3.

Also, augmented assignments (e.g. +=, -=, etc.) should be their own tokens. Again, this will be fixed for lab 3, and the edge cases will not be tested/marked.

Corrections for Lab 3

Introduction

After tokenizing its input, a compiler deduces the structure of the code by parsing. The structure of a language is defined by a grammar. Traditionally, programming languages were specified using context-free grammars, although newer grammar definitions such as parsing expression grammars exist. We will be using Bison to generate a bottom-up LALR(1) parser from a context-free grammar. Bison has been used for many programming languages and tools, including GCC, which used it for C++ until 2004 (version 3.4) (it then switched to a hand-written recursive descent parser, due to the complex nature of the C++ language). Your task in this lab will be to:

  1. implement the grammar described below, translating the extended backus-naur form to basic basic backus-naur form
  2. fix the parsing state conflicts that arise if you directly implement the grammar shown

The Bison manual will be useful for this lab. While you don't need much to complete this lab, reading more of the documentation on your own will give you a better understanding of how these tools work. In particular, you should check out the section on C++ parsers (10.1).

Lab Description

The Bison input file in the starter code is src/parser.y. The Bison manual contains everything you need to know about this file. In Lab 1, you should have defined the tokens for your language in your Bison file. The tokens in lexical analysis are known as terminals in a grammar. Productions are rules which define nonterminals as a sequence of symbols. A symbol is simply a terminal or nonterminal. Multiple productions for a single nonterminal may exist. Productions are typically associated with actions which execute when the production rule is matched. Later you will use these actions to build an AST, but for this lab, you will simply print the rules matched.

The Grammar
The following grammar is in EBNF form. However, Bison only accepts basic BNF form. The conversion is relatively straightforward; you are to perform it yourself. However, this grammar intentionally produces parsing conflicts. You are to find and fix them; there should be no parser conflicts building your submitted code. In particular, you should read about precedence in sections 5.3 and 5.4 of the Bison manual. The precedence of your language should be the same as C: reference. Terminals come from the tokens defined in Lab 1. The start symbol is root.
root
function_list
function_list
function+
function
function_decl semicolon
function_defn
function_decl
type name lparen parameter_list rparen
function_defn
function_decl block
type
identifier
Update for Lab 3: replace this with the new type token
name
identifier
parameter_list
(declaration (comma declaration)*)?
block
lbrace suite rbrace
suite
(statement semicolon)*
declaration
type name
statement
single_statement semicolon
compound_statement
single_statement
declaration assign expression
name assign expression
name binary_op assign expression (Remove for Lab 3)
Update for Lab 3: name augmented_assign expression
break
continue
return expression?
expression
expression
Update for Lab 3: name
true | false
integer
float
binary_expression
unary_expression
relational_expression
ternary_expression
cast_expression
function_call
lparen expression rparen
compound_statement
if lparen expression rparen block
for lparen single_statement? semicolon expression? semicolon single_statement? rparen block
while lparen expression rparen block
binary_expression
expression binary_op expression
unary_expression
unary_op expression
relational_expression
expression relational_op expression
binary_op
plus
minus
star
slash
log_and
log_or
bit_and
bit_or
bit_xor
unary_op
minus
relational_op
eq
ne
lt
gt
le
ge
ternary_expression
expression question_mark expression colon expression
cast_expression
lparen type rparen expression
function_call
name lparen (expression (comma expression)*)? rparen
Update for Lab 3: augmented_assign
plus_assign
minus_assign
star_assign
slash_assign
Location Tracking

Section 3.5 of the Bison manual describes how location tracking works in Bison. Note that the location data structure is slightly different for C++ (by default Bison generates parsers in C); see section 10.1.5. Note that the starter code for the lexer does not track end positions for tokens. You do not need to implement this behaviour, although reading the required documentation and implementing it is a good exercise.

Conflict Resolution (by looking at derivations)

Consider the following numbered grammar for boolean expressions, similar to the grammar definition for expression and binary_expression above.

1. bexpr := bexpr bop bexpr
2. bexpr := NOT bexpr
3. bexpr := LPAREN bexpr RPAREN
4. bexpr := TRUE
5. bexpr := FALSE
6. bop := OR
7. bop := AND
				
Given an input of TRUE OR TRUE OR TRUE, what are the possible (right-most) derivations? (Recall that if a left-most or right-most derivation is unique, the other is unique too.) The following derivation produces a right-associative binary expression, producing the parse tree (denoted by parentheses) TRUE OR (TRUE OR TRUE).
bexpr
bexpr bop bexpr (1)
bexpr bop bexpr bop bexpr (1)
bexpr bop bexpr bop TRUE (4)
bexpr bop bexpr OR TRUE (6)
bexpr bop TRUE OR TRUE (4)
bexpr OR TRUE OR TRUE (6)
TRUE OR TRUE OR TRUE (4)
				
The following derivation produces a left-associative binary expression, producing the parse tree (TRUE OR TRUE) OR TRUE.
bexpr
bexpr bop bexpr (1)
bexpr bop TRUE (4)
bexpr OR TRUE (6)
bexpr bop bexpr OR TRUE (1)
bexpr bop TRUE OR TRUE (4)
bexpr OR TRUE OR TRUE (6)
TRUE OR TRUE OR TRUE (4)
				
Since two right-most derivations exist, the grammar is ambiguous. Now consider the following rewritten grammar.
1. bexpr := bexpr bop bexpr'
2. bexpr := bexpr'
3. bexpr' := NOT bexpr
4. bexpr' := LPAREN bexpr RPAREN
5. bexpr' := TRUE
6. bexpr' := FALSE
7. bop := OR
8. bop := AND
				
Now, only the following derivation is possible, producing the left-associative parse tree (TRUE OR TRUE) OR TRUE.
bexpr
bexpr bop bexpr' (1)
bexpr bop TRUE (5)
bexpr OR TRUE (7)
bexpr bop bexpr' OR TRUE (1)
bexpr bop TRUE OR TRUE (5)
bexpr OR TRUE OR TRUE (7)
bexpr' OR TRUE OR TRUE (2)
TRUE OR TRUE OR TRUE (5)
				
Why does this work (why must the binary expression be left-associative)? Because the grammar has been "factored" so that the recursion can only apply to the left side (only the left term can contain nested expressions). A grammar for right-associative binary expressions can be written similarly.

However, there is a slight problem. Consider the input NOT TRUE OR TRUE. It can be derived as (NOT TRUE) OR TRUE (desired, as unary operators typically have higher precedence than binary operators) the following way.

bexpr
bexpr bop bexpr' (1)
bexpr bop TRUE (5)
bexpr OR TRUE (7)
bexpr' OR TRUE (2)
NOT bexpr OR TRUE (3)
NOT bexpr' OR TRUE (2)
NOT TRUE OR TRUE (5)
				
It can also be derived as NOT (TRUE OR TRUE) (not desired) the following way.
bexpr
bexpr' (2)
NOT bexpr (3)
NOT bexpr bop bexpr' (1)
NOT bexpr bop TRUE (5)
NOT bexpr OR TRUE (7)
NOT bexpr' OR TRUE (2)
NOT TRUE OR TRUE (5)
				
Try changing production 3 to bexpr' := NOT bexpr', and producing the derivations. You should see that NOT "must bind" to bexpr', which does not contain binary expressions. In other words, it is nolonger possible to have NOT (TRUE OR TRUE) (where these parentheses denote the parse tree). To get the same expression, the parentheses in the grammar from production 4 must be used: NOT LPAREN TRUE OR TRUE RPAREN (you should verify that this works).

This grammar is still not quite what we desire; both OR and AND are left-associative with the same precedence, meaning TRUE OR TRUE AND TRUE parses as (TRUE OR TRUE) AND TRUE, and TRUE AND TRUE OR TRUE parses as (TRUE AND TRUE) OR TRUE. As in the C spec, we want AND to take precedence over OR, so TRUE OR TRUE AND TRUE should parse as TRUE OR (TRUE AND TRUE). Similar to how we rewrote the grammar to "force" it to be expanded a certain way (left-associatively), we can rewrite the grammar to enforce precedence. Consider the following grammar, taken from exercise 4.2.2g (page 207) of the textbook, which we also looked at in tutorial.

1. bexpr := bexpr OR bterm
2. bexpr := bterm
3. bterm := bterm AND bfactor
4. bterm := bfactor
5. bfactor := NOT bfactor
6. bfactor := LPAREN bexpr RPAREN
7. bfactor := TRUE
8. bfactor := FALSE
				
Firstly, note that the structure of productions 1-4 are similar to the previous productions 1-2 (bexpr := bexpr bop bexpr' | bexpr'); this format gives the left-associative structure. However, the grammar has been additionally factored to "force" binary expressions containing AND to be nested inside binary expressions containing OR; in other words, AND has higher precedence than OR, as desired. If you are not sure about this, try producing derivations for various input strings.

So we have seen how a grammar can be rewritten to enforce precedence and associativity. You can use this to eliminate the ambiguities in the provided grammar.

Conflict Resolution (by looking at LR parsing)

You can view the debug info and states of your parser in build/src/parser.output. The file will be quite large, but you will need to read through it closely.

When you build your parser, it will tell you if there are conflicts. If you implemented the grammar rules exactly as defined above, you should have some shift/reduce conflicts. Near the top of parser.output, you should see a list of states (numbered starting from 0) and the conflicts they contain. Further down, you should see a detailed description of each state.

Suppose you wrote this simple grammar:

root
expr
expr
expr bin_op expr
TOK_IDENTIFIER
TOK_INTEGER
bin_op
TOK_PLUS
TOK_MINUS
If you built it, your parser.output should contain something like State 9 conflicts: 2 shift/reduce. So we can scroll/search down to State 9 and find:
State 9

    2 expr: expr . bin_op expr
    2     | expr bin_op expr .

    TOK_PLUS   shift, and go to state 6
    TOK_MINUS  shift, and go to state 7

    TOK_PLUS   [reduce using rule 2 (expr)]
    TOK_MINUS  [reduce using rule 2 (expr)]
    $default   reduce using rule 2 (expr)

    bin_op  go to state 8
				
Let us analyze each block.
    2 expr: expr . bin_op expr
    2     | expr bin_op expr .
				
The 2 means it is the 2nd rule; you can see the numbered rules near the top of parser.output under Grammar. The contents of the rule is displayed, along with a dot. The dot means "where" the parser is in terms of matching this/these rules. So in this case, it means the parser has just finished seeing/producing an expr (as the dot comes after expr in both cases). However, in the first case the parser has just seen the first expr of this rule, in the second case the parser has just seen the second expr of this rule. Your parser may be in the middle of matching "multiple" or different rules, as it is a state machine which has potentially combined states.
    TOK_PLUS   shift, and go to state 6
    TOK_MINUS  shift, and go to state 7
				
What this means is that given TOK_PLUS as a lookahead, the parser wants to shift to state 6. Similarly, given TOK_MINUS as a lookahead, the parser wants to shift to state 7.
    TOK_PLUS   [reduce using rule 2 (expr)]
    TOK_MINUS  [reduce using rule 2 (expr)]
    $default   reduce using rule 2 (expr)
				
Here, given TOK_PLUS as the next symbol, the parser wants to reduce using rule 2. Similarly, the parser wants to reduce using rule 2 if it sees a TOK_MINUS. There are conflicts as for both of these symbols, the parser does not know whether it should shift or reduce.
    bin_op  go to state 8
				
When the parser performs a reduction and pops off states from the stack, assuming it ends up at this state, it should go to state 8 if the next symbol is a bin_op (in this case, anything else would be an error).

Note that you will likely have to examine the nearby/related states to understand why the conflict exists, and how to rewrite the grammar to fix it.

Operator Precedence (for LR parsing)

The page on operator precedence, contextual precedence, and understanding your parser from the Bison manual will be useful. As a quick reference:

Specifically, precedence and associativity can be used to solve shift/reduce conflicts. Tokens/terminals get a precedence from the order they are defined (unless %token is used, in which case it has no precedence or associativity). A production gets the precedence of its right-most terminal, if present. With a shift-reduce conflict, the parser compares the precedence of the lookahead token, with the precedence of the rule to reduce by: If the precedences of the lookahead and the production are the same, the parser looks at the associativity of the lookahead. If the lookahead token is left-associative, it reduces. If the lookahead is right-associative, it shifts. One can verify that for binary operators, these actions correspond to the "expected thing".

Output

For each rule matched, print a line in the form:
target := a b c
where a, b, c are the corresponding symbols. If you have an empty rule (whose production body should be specified by %empty in Bison), please print target := %empty. You do not need to change anything for error handling; the starter code error handling routine is expected. As not everyone will have the same grammar, the output will be inspected manually. However, please adhere to the rules above. Additionally, you should ensure the there are no parsing conflicts building your parser (when you run make after touching parser.y).

Test Cases

The following test cases are provided for you. The output has been omitted as it essentially shows you the grammar used. The most important thing is that the success cases succeed (no error message, return code 0 - i.e. ./src/ece467c 2 success-0.c && echo $? should print 0), and the failure test cases should produce a line like [error] parser error at 4.6: syntax error.

Submission

Same as Lab 1.