UofT ECE 467 (2020 Fall)

Lab Description

The Bison input file in the starter code is src/parser.y. The Bison manual contains everything you need to know about this file. In Lab 1, you should have defined the tokens for your language in your Bison file. The tokens in lexical analysis are known as terminals in a grammar. Productions are rules which define nonterminals as a sequence of symbols. A symbol is simply a terminal or nonterminal. Multiple productions for a single nonterminal may exist. Productions are typically associated with actions which execute when the production rule is matched. Later you will use these actions to build an AST, but for this lab, you will simply print the rules matched.

The Grammar

The following grammar is in EBNF form. However, Bison only accepts basic BNF form. The conversion is relatively straightforward; you are to perform it yourself. However, this grammar intentionally produces parsing conflicts. You are to find and fix them; there should be no parser conflicts building your submitted code. In particular, you should read about precedence in sections 5.3 and 5.4 of the Bison manual. The precedence of your language should be the same as C: reference. Terminals come from the tokens defined in Lab 1. The start symbol is root.

root: function_list
function_list: function+
function: function_decl semicolon; function_defn
function_decl: type name lparen parameter_list rparen
function_defn: function_decl block
type: identifier; Update for Lab 3: replace this with the new type token
name: identifier
parameter_list: (declaration (comma declaration)*)?
block: lbrace suite rbrace
suite: (statement ~~semicolon~~)*
declaration: type name
statement: single_statement semicolon; compound_statement
single_statement: declaration assign expression; name assign expression; name binary_op assign expression (Remove for Lab 3); Update for Lab 3: name augmented_assign expression; break; continue; return expression?; expression
expression: Update for Lab 3: name; true | false; integer; float; binary_expression; unary_expression; relational_expression; ternary_expression; cast_expression; function_call; lparen expression rparen
compound_statement: if lparen expression rparen block; for lparen single_statement? semicolon expression? semicolon single_statement? rparen block; while lparen expression rparen block
binary_expression: expression binary_op expression
unary_expression: unary_op expression
relational_expression: expression relational_op expression
binary_op: plus; minus; star; slash; log_and; log_or; ~~bit_and~~; ~~bit_or~~; ~~bit_xor~~
unary_op: minus
relational_op: eq; ne; lt; gt; le; ge
ternary_expression: expression question_mark expression colon expression
cast_expression: lparen type rparen expression
function_call: name lparen (expression (comma expression)*)? rparen
Update for Lab 3: augmented_assign: plus_assign; minus_assign; star_assign; slash_assign

Location Tracking

Section 3.5 of the Bison manual describes how location tracking works in Bison. Note that the location data structure is slightly different for C++ (by default Bison generates parsers in C); see section 10.1.5. Note that the starter code for the lexer does not track end positions for tokens. You do not need to implement this behaviour, although reading the required documentation and implementing it is a good exercise.

Conflict Resolution (by looking at derivations)

Consider the following numbered grammar for boolean expressions, similar to the grammar definition for expression and binary_expression above.

1. bexpr := bexpr bop bexpr
2. bexpr := NOT bexpr
3. bexpr := LPAREN bexpr RPAREN
4. bexpr := TRUE
5. bexpr := FALSE
6. bop := OR
7. bop := AND

Given an input of TRUE OR TRUE OR TRUE, what are the possible (right-most) derivations? (Recall that if a left-most or right-most derivation is unique, the other is unique too.) The following derivation produces a right-associative binary expression, producing the parse tree (denoted by parentheses) TRUE OR (TRUE OR TRUE).

bexpr
bexpr bop bexpr (1)
bexpr bop bexpr bop bexpr (1)
bexpr bop bexpr bop TRUE (4)
bexpr bop bexpr OR TRUE (6)
bexpr bop TRUE OR TRUE (4)
bexpr OR TRUE OR TRUE (6)
TRUE OR TRUE OR TRUE (4)

The following derivation produces a left-associative binary expression, producing the parse tree (TRUE OR TRUE) OR TRUE.

bexpr
bexpr bop bexpr (1)
bexpr bop TRUE (4)
bexpr OR TRUE (6)
bexpr bop bexpr OR TRUE (1)
bexpr bop TRUE OR TRUE (4)
bexpr OR TRUE OR TRUE (6)
TRUE OR TRUE OR TRUE (4)

Since two right-most derivations exist, the grammar is ambiguous. Now consider the following rewritten grammar.

1. bexpr := bexpr bop bexpr'
2. bexpr := bexpr'
3. bexpr' := NOT bexpr
4. bexpr' := LPAREN bexpr RPAREN
5. bexpr' := TRUE
6. bexpr' := FALSE
7. bop := OR
8. bop := AND

Now, only the following derivation is possible, producing the left-associative parse tree (TRUE OR TRUE) OR TRUE.

bexpr
bexpr bop bexpr' (1)
bexpr bop TRUE (5)
bexpr OR TRUE (7)
bexpr bop bexpr' OR TRUE (1)
bexpr bop TRUE OR TRUE (5)
bexpr OR TRUE OR TRUE (7)
bexpr' OR TRUE OR TRUE (2)
TRUE OR TRUE OR TRUE (5)

Why does this work (why must the binary expression be left-associative)? Because the grammar has been "factored" so that the recursion can only apply to the left side (only the left term can contain nested expressions). A grammar for right-associative binary expressions can be written similarly.

However, there is a slight problem. Consider the input NOT TRUE OR TRUE. It can be derived as (NOT TRUE) OR TRUE (desired, as unary operators typically have higher precedence than binary operators) the following way.

bexpr
bexpr bop bexpr' (1)
bexpr bop TRUE (5)
bexpr OR TRUE (7)
bexpr' OR TRUE (2)
NOT bexpr OR TRUE (3)
NOT bexpr' OR TRUE (2)
NOT TRUE OR TRUE (5)

It can also be derived as NOT (TRUE OR TRUE) (not desired) the following way.

bexpr
bexpr' (2)
NOT bexpr (3)
NOT bexpr bop bexpr' (1)
NOT bexpr bop TRUE (5)
NOT bexpr OR TRUE (7)
NOT bexpr' OR TRUE (2)
NOT TRUE OR TRUE (5)

Try changing production 3 to bexpr' := NOT bexpr', and producing the derivations. You should see that NOT "must bind" to bexpr', which does not contain binary expressions. In other words, it is nolonger possible to have NOT (TRUE OR TRUE) (where these parentheses denote the parse tree). To get the same expression, the parentheses in the grammar from production 4 must be used: NOT LPAREN TRUE OR TRUE RPAREN (you should verify that this works).

This grammar is still not quite what we desire; both OR and AND are left-associative with the same precedence, meaning TRUE OR TRUE AND TRUE parses as (TRUE OR TRUE) AND TRUE, and TRUE AND TRUE OR TRUE parses as (TRUE AND TRUE) OR TRUE. As in the C spec, we want AND to take precedence over OR, so TRUE OR TRUE AND TRUE should parse as TRUE OR (TRUE AND TRUE). Similar to how we rewrote the grammar to "force" it to be expanded a certain way (left-associatively), we can rewrite the grammar to enforce precedence. Consider the following grammar, taken from exercise 4.2.2g (page 207) of the textbook, which we also looked at in tutorial.

1. bexpr := bexpr OR bterm
2. bexpr := bterm
3. bterm := bterm AND bfactor
4. bterm := bfactor
5. bfactor := NOT bfactor
6. bfactor := LPAREN bexpr RPAREN
7. bfactor := TRUE
8. bfactor := FALSE

Firstly, note that the structure of productions 1-4 are similar to the previous productions 1-2 (bexpr := bexpr bop bexpr' | bexpr'); this format gives the left-associative structure. However, the grammar has been additionally factored to "force" binary expressions containing AND to be nested inside binary expressions containing OR; in other words, AND has higher precedence than OR, as desired. If you are not sure about this, try producing derivations for various input strings.

So we have seen how a grammar can be rewritten to enforce precedence and associativity. You can use this to eliminate the ambiguities in the provided grammar.

Conflict Resolution (by looking at LR parsing)

You can view the debug info and states of your parser in build/src/parser.output. The file will be quite large, but you will need to read through it closely.

When you build your parser, it will tell you if there are conflicts. If you implemented the grammar rules exactly as defined above, you should have some shift/reduce conflicts. Near the top of parser.output, you should see a list of states (numbered starting from 0) and the conflicts they contain. Further down, you should see a detailed description of each state.

Suppose you wrote this simple grammar:

root: expr
expr: expr bin_op expr; TOK_IDENTIFIER; TOK_INTEGER
bin_op: TOK_PLUS; TOK_MINUS

If you built it, your parser.output should contain something like State 9 conflicts: 2 shift/reduce. So we can scroll/search down to State 9 and find:

State 9

    2 expr: expr . bin_op expr
    2     | expr bin_op expr .

    TOK_PLUS   shift, and go to state 6
    TOK_MINUS  shift, and go to state 7

    TOK_PLUS   [reduce using rule 2 (expr)]
    TOK_MINUS  [reduce using rule 2 (expr)]
    $default   reduce using rule 2 (expr)

    bin_op  go to state 8

Let us analyze each block.

    2 expr: expr . bin_op expr
    2     | expr bin_op expr .

The 2 means it is the 2nd rule; you can see the numbered rules near the top of parser.output under Grammar. The contents of the rule is displayed, along with a dot. The dot means "where" the parser is in terms of matching this/these rules. So in this case, it means the parser has just finished seeing/producing an expr (as the dot comes after expr in both cases). However, in the first case the parser has just seen the first expr of this rule, in the second case the parser has just seen the second expr of this rule. Your parser may be in the middle of matching "multiple" or different rules, as it is a state machine which has potentially combined states.

    TOK_PLUS   shift, and go to state 6
    TOK_MINUS  shift, and go to state 7

What this means is that given TOK_PLUS as a lookahead, the parser wants to shift to state 6. Similarly, given TOK_MINUS as a lookahead, the parser wants to shift to state 7.

    TOK_PLUS   [reduce using rule 2 (expr)]
    TOK_MINUS  [reduce using rule 2 (expr)]
    $default   reduce using rule 2 (expr)

Here, given TOK_PLUS as the next symbol, the parser wants to reduce using rule 2. Similarly, the parser wants to reduce using rule 2 if it sees a TOK_MINUS. There are conflicts as for both of these symbols, the parser does not know whether it should shift or reduce.

    bin_op  go to state 8

When the parser performs a reduction and pops off states from the stack, assuming it ends up at this state, it should go to state 8 if the next symbol is a bin_op (in this case, anything else would be an error).

Note that you will likely have to examine the nearby/related states to understand why the conflict exists, and how to rewrite the grammar to fix it.

Operator Precedence (for LR parsing)

The page on operator precedence, contextual precedence, and understanding your parser from the Bison manual will be useful. As a quick reference:

%token defines a token without precedence or associativity
The following directives define tokens with precedence, where the later definitions have higher precedence:
- %left and %right define left and right associative operators
- %precedence defines a token with precedence only; the parser generated may have a conflict with some default resolution, in which case the parser-generator will warn you about this conflict at compile time
- %nonassoc defines a token which is non-associative; the parser produces a syntax error at runtime if it encounters the same operator twice "in a row"

Specifically, precedence and associativity can be used to solve shift/reduce conflicts. Tokens/terminals get a precedence from the order they are defined (unless %token is used, in which case it has no precedence or associativity). A production gets the precedence of its right-most terminal, if present. With a shift-reduce conflict, the parser compares the precedence of the lookahead token, with the precedence of the rule to reduce by:

if either precedence is undefined, it is a conflict (Bison will issue a warning, but will generate a parser that shifts by default);
if the precedence of the lookahead is higher, it shifts;
if the precedence of the production is higher, it reduces;
if the precedences are the same, it uses the associativity rules.

If the precedences of the lookahead and the production are the same, the parser looks at the associativity of the lookahead. If the lookahead token is left-associative, it reduces. If the lookahead is right-associative, it shifts. One can verify that for binary operators, these actions correspond to the "expected thing".

Output

For each rule matched, print a line in the form:
target := a b c
where a, b, c are the corresponding symbols. If you have an empty rule (whose production body should be specified by %empty in Bison), please print target := %empty. You do not need to change anything for error handling; the starter code error handling routine is expected. As not everyone will have the same grammar, the output will be inspected manually. However, please adhere to the rules above. Additionally, you should ensure the there are no parsing conflicts building your parser (when you run make after touching parser.y).

Test Cases

The following test cases are provided for you. The output has been omitted as it essentially shows you the grammar used. The most important thing is that the success cases succeed (no error message, return code 0 - i.e. ./src/ece467c 2 success-0.c && echo $? should print 0), and the failure test cases should produce a line like [error] parser error at 4.6: syntax error.

Submission

Same as Lab 1.

UofT ECE 467 (2020 Fall) - Lab 2

Navigation

References

Corrections

Corrections for Lab 3

Introduction