
1

9. Parsing

2/24/2018

John Roberts

2 provide context, define parsing (again)


Overview
• Parsing

• Grammars

• X Language Grammar

• Building ASTs for Grammar

• Parser Source Code

3
Parsing

[Diagram: Tokens (stream of lexical units) → Parsing (Grammars), recursive descent processing → Abstract Syntax Tree (AST)]

• Recall that lexical analysis is the process by which the source code is examined and tokens are generated
4
Parsing

[Diagram: Tokens (stream of lexical units) → Parsing (Grammars), recursive descent processing → Abstract Syntax Tree (AST)]

• Check that the tokens are assembled in the correct fashion according to the grammar

• Check the structure and syntax of the user’s program

• Build the Abstract Syntax Tree (AST) to represent the program

• The AST is a representation of the source program in a form that is convenient for subsequent processing

5
Overview
• Parsing

• Grammars

• X Language Grammar

• Building ASTs for Grammar

• Parser Source Code

6 This is what you’ll spend a few weeks on in Theory and Programming Language Design

Grammars
• A set of production rules for strings in a formal language

• Describes how to form strings from the language’s alphabet that are valid according to the language’s syntax

• Wiki link
7
Grammars

• A grammar G consists of

• A finite set N of nonterminal symbols that is disjoint with the strings formed from G

• A finite set T of terminal symbols that is disjoint from N

• A finite set P of production (rewriting) rules

• A distinguished symbol S that is the start symbol

• G = ( N, T, P, S )

8
Nonterminals

• G = ( N, T, P, S )

• Describe the main structural components (syntactic categories) of the language

• blocks

• programs

• expressions

• etc.

9
Terminals

• G = ( N, T, P, S )

• The tokens used by the programmer, and determined by the Lexer
10
Production Rules

• G = ( N, T, P, S )

• Rules that describe what the nonterminals may be rewritten into

11
Production Rules

• Example: S → if E then BLOCK else BLOCK

• This category describes if statements

• S represents statements in the user program; S is the rule’s left hand side (LHS)

• if E then BLOCK else BLOCK is the rule’s right hand side (RHS)

• S is in the set of N (nonterminals)

• The RHS is a string of symbols composed from those symbols in N (nonterminals) and T (terminals)

12
Start symbol

• G = ( N, T, P, S )

• The special start symbol S is a symbol in N (nonterminals) used to start the rewriting process (also called the derivation)

• We may continue rewriting as long as the produced string has nonterminals
13
Sample Derivation

• A grammar is defined as:

S → aS | bBS | c

B → bB | b

• Note that terminals are typically in lower case, while nonterminals are in upper case

• A sample derivation is:

S → bBS → bbBS → bbbS → bbbaS → bbbaaS → bbbaac
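
The derivation above can be replayed mechanically. The following is a minimal sketch, not part of the course code: the class name DerivationDemo and its helpers are made up for illustration, and it assumes single-character symbols with upper case nonterminals, as in the sample grammar. Each step rewrites the leftmost occurrence of the named nonterminal with the chosen right-hand side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class DerivationDemo {
    // Rewrite the leftmost occurrence of a nonterminal with the given right-hand side.
    static String rewrite(String form, char nonterminal, String rhs) {
        int at = form.indexOf(nonterminal);
        if (at < 0) {
            throw new IllegalArgumentException(nonterminal + " does not occur in " + form);
        }
        return form.substring(0, at) + rhs + form.substring(at + 1);
    }

    // Replay the sample derivation and return every sentential form in order.
    static List<String> derive() {
        // Each step: { nonterminal to rewrite, chosen right-hand side }
        String[][] steps = {
            {"S", "bBS"}, {"B", "bB"}, {"B", "b"}, {"S", "aS"}, {"S", "aS"}, {"S", "c"}
        };
        List<String> forms = new ArrayList<>();
        String form = "S";
        forms.add(form);
        for (String[] step : steps) {
            form = rewrite(form, step[0].charAt(0), step[1]);
            forms.add(form);
        }
        return forms;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" => ", derive()));
    }
}
```

The rewriting stops once the string contains no nonterminals, which is exactly the stopping condition stated on the Start symbol slide.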

14
Overview
• Parsing

• Grammars

• X Language Grammar

• Building ASTs for Grammar

• Parser Source Code

15 D = Declarations, S = Statements, * = zero or more, ? = zero or one

Grammar for X

Production                              Category

PROGRAM → ‘program’ BLOCK               program

BLOCK → ‘{’ D* S* ‘}’                   block

D → TYPE NAME                           decl

D → TYPE NAME FUNHEAD BLOCK             functionDecl

TYPE → ‘int’

TYPE → ‘boolean’

FUNHEAD → ‘(’ (D list ‘,’)? ‘)’         formals


16 E = Expression

Grammar for X

Production                                  Category

S → ‘if’ E ‘then’ BLOCK ‘else’ BLOCK        if

S → ‘while’ E BLOCK                         while

S → ‘return’ E                              return

S → BLOCK

S → NAME ‘=’ E                              assign

17 SE = Simple Expression

Grammar for X

Production                Category

E → SE

E → SE ‘==’ SE            ==

E → SE ‘!=’ SE            !=

E → SE ‘<’ SE             <

E → SE ‘<=’ SE            <=

18
Grammar for X

Production                Category

SE → T

SE → SE ‘+’ T             +

SE → SE ‘-’ T             -

SE → SE ‘|’ T             |
19 T = Term

Grammar for X

Production                Category

T → F

T → T ‘*’ F               *

T → T ‘/’ F               /

T → T ‘&’ F               &

20 F = Factor

Grammar for X

Production                              Category

F → ‘(’ E ‘)’

F → NAME

F → <int>

F → NAME ‘(’ (E list ‘,’)? ‘)’          call

NAME → <id>

21 IDENTIFIER → LETTER (CHARACTER)*
   CHARACTER → LETTER | _ | … | $
   LETTER → a | b | c | … | z
   DECIMAL_LITERAL → [1 | 2 | … | 9] ( 0 | … | 9)*

Grammar for X

• <id> stands for any valid identifier (what would a production look like for this?)

• <int> stands for any integer (what would a production look like for this?)
22
Sample Derivation for X Grammar

• A sample derivation is:



PROGRAM → program BLOCK →

program { D S } →

program { TYPE NAME S } →

program { int NAME S } →

program { int a S } →

program { int a NAME = E } →

program { int a a = E } →

program { int a a = SE } →

program { int a a = T } →

program { int a a = F } →

program { int a a = 5 }

23
Overview
• Parsing

• Grammars

• X Language Grammar

• Building ASTs for Grammar

• Parser Source Code

24
Parser Notes

• There will be one (recursive) procedure for each nonterminal to process the right-hand-side of each syntax rule

• An AST subtree will be built to represent the right-hand-side (the tree to be built is described by the grammar)

• If the user’s program doesn’t conform to the grammar, then a syntax error will be reported

• When we look at the code, note that the procedure for the Program rule will build the AST for the entire program
25
Building ASTs for Grammar

• PROGRAM → ‘program’ BLOCK

Build a program tree with one child for the BLOCK tree

• BLOCK → ‘{’ D* S* ‘}’

Build a block tree with children for the declarations (D) and statements (S)

26
Building ASTs for Grammar

• D → TYPE NAME FUNHEAD BLOCK



Build a functionDecl tree with four children:

• TYPE - Return type of the function

• NAME - Name of the function

• FUNHEAD - Function formals

• BLOCK - Function body

27
Building ASTs for Grammar

• FUNHEAD → ‘(’ (D list ‘,’)? ‘)’

Build the tree describing the formal parameters for the corresponding function declaration

• There is one tree per formal declaration (D list ‘,’)

• The Ds are separated by commas (indicated by list)


28
Building ASTs for Grammar

• F → NAME ‘(’ (E list ‘,’)? ‘)’

Build a call tree with one child for the function name, and one child for each actual argument expression E

29
ASTs built from Source

program { int i int j
   i = i + j + 7
   j = write(i)
}

• Lexical analysis produces the following tokens

program leftBrace intType <id:i> intType <id:j>

<id:i> assign <id:i> plus <id:j> plus <int:7>

<id:j> assign <id:write> leftParen <id:i> rightParen

rightBrace

• Assignment 2 adds some tokens into the set of tokens recognized during lexical analysis

• Assignment 3 will add these into the grammar, and provide a textual (next slide) and graphical (following slide) display of the generated ASTs

30
ASTs built from Source

program { int i int j
   i = i + j + 7
   j = write(i)
}

program leftBrace intType <id:i> intType <id:j>

<id:i> assign <id:i> plus <id:j> plus <int:7>

<id:j> assign <id:write> leftParen <id:i> rightParen

rightBrace

1: Program
  2: Block
    5: Decl
      3: IntType
      4: Id: i
    8: Decl
      6: IntType
      7: Id: j
    10: Assign
      9: Id: i
      14: AddOp: +
        12: AddOp: +
          11: Id: i
          13: Id: j
        15: Int: 7
    17: Assign
      16: Id: j
      19: Call
        18: Id: write
        20: Id: i
31
ASTs built from Source

program { int i int j
   i = i + j + 7
   j = write(i)
}

program leftBrace intType <id:i> intType <id:j>

<id:i> assign <id:i> plus <id:j> plus <int:7>

<id:j> assign <id:write> leftParen <id:i> rightParen

rightBrace

[Graphical AST, shown here as an indented tree]

program
  block
    decl
      int
      i
    decl
      int
      j
    assign
      i
      +
        +
          i
          j
        7
    assign
      j
      call
        write
        i

32 Time permitting
Build an AST

program {boolean j int i


int factorial(int n) {
if (n < 2) then
{ return 1 }
else
{return n*factorial(n-1) }
}
while (1==1) {
i = write(factorial(read()))
}
}

• Answer: https://gist.github.com/jrob8577/7401c004e6c78040a15e9944e2536db2

33
Overview
• Parsing

• Grammars

• X Language Grammar

• Building ASTs for Grammar

• Parser Source Code


34 Assignment 2 covered these
Compiler Packages

Package          Function of Package

lexer            lexical analyzer - scan the source program and output tokens

lexer.setup      automatically generate classes Sym and TokenType for lexer

35 We’ll discuss ast and parser packages tonight

Compiler Packages

Package          Function of Package

parser           analyze tokens, check syntax, build AST

ast              Abstract Syntax Tree classes - representation of source program for efficient processing

visitor          Used when walking (visiting) the AST - for printing, constraining, generating the code

compiler         Compiler main program

36
Compiler Packages

Package          Function of Package

constrain        visit the AST and check type constraints; decorate AST to hook up variable references to their declaration

codegen          visit the decorated AST and generate bytecodes

interpreter      Virtual machine, etc. used to execute the byte codes generated for the source program
37
Review Java util classes

• ArrayList - Resizable array implementation of the List interface

• Iterator interface - An iterator over a collection. Allows the caller to remove elements from the underlying collection during the iteration (used in ASTVisitors - next week’s topic)

• EnumSet - A specialized Set implementation for use with enumerated types (Tokens)
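
As a quick refresher on the Iterator point, here is a minimal sketch (the class IteratorRemoveDemo is made up for this example and is not part of the compiler source). It removes elements during iteration through Iterator.remove(), which is the supported way to do so; calling remove on the list itself inside the loop would risk a ConcurrentModificationException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class, for illustration only.
final class IteratorRemoveDemo {
    // Return a copy of the input with every even number removed while iterating.
    static List<Integer> dropEvens(List<Integer> input) {
        List<Integer> numbers = new ArrayList<>(input);
        Iterator<Integer> it = numbers.iterator();
        while (it.hasNext()) {
            if (it.next() % 2 == 0) {
                it.remove(); // removes the element most recently returned by next()
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(dropEvens(Arrays.asList(1, 2, 3, 4, 5)));
    }
}
```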

38 kids irritates me, should be children. don’t be cute, be concise.


AST - Information at each node

public abstract class AST {


// kids of the current node
protected ArrayList<AST> kids;
// used for identifying the node when printing
// (see printed AST's given above)
protected int nodeNum;
// used during the constraining phase
protected AST decoration;
// used during the code generation phase
protected String label = "";

39
AST - Methods (abbreviated)
    // return the ith kid of this node (kids are indexed from 1)
    public AST getKid(int i) {
        if ((i <= 0) || (i > kidCount())) {
            return null;
        }
        return kids.get(i - 1);
    }

    // number of kids from this node
    public int kidCount() {
        return kids.size();
    }

    // return a Collection of the kids of this node
    public ArrayList<AST> getKids() {
        return kids;
    }

    // add a new kid to this node, return this node
    public AST addKid(AST kid) {
        kids.add(kid);
        return this;
    }
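
Two details of these methods are easy to miss: getKid counts from 1, and addKid returns the parent rather than the kid. The sketch below uses a stand-in class, MiniAst, which is an assumption made for this example and only mirrors the kid-handling methods shown above; it is not the course's AST class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the AST class; only kid handling is modeled.
class MiniAst {
    private final String name;
    private final List<MiniAst> kids = new ArrayList<>();

    MiniAst(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    int kidCount() {
        return kids.size();
    }

    // Kids are indexed from 1; out-of-range requests yield null.
    MiniAst getKid(int i) {
        if (i <= 0 || i > kidCount()) {
            return null;
        }
        return kids.get(i - 1);
    }

    // Returns this node (not the kid), which is what makes chained calls possible.
    MiniAst addKid(MiniAst kid) {
        kids.add(kid);
        return this;
    }

    public static void main(String[] args) {
        MiniAst decl = new MiniAst("Decl")
            .addKid(new MiniAst("IntType"))
            .addKid(new MiniAst("Id: i"));
        System.out.println(decl.kidCount());        // both kids hang off Decl
        System.out.println(decl.getKid(1).name());  // first kid
        System.out.println(decl.getKid(0));         // there is no kid 0
    }
}
```

Because addKid returns the node it was called on, a chain of addKid calls attaches every kid to the same parent.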
40 How do the tree implementations differ? IdTree adds information specific to Ids (stack frame offsets). Check out the others and reason through differences…
AST Package
• AST (Abstract)
• ProgramTree

• IdTree

• RelOpTree

• Note that each kind of AST is described in a subclass of the abstract AST class - these are just a few so we can run through a parser simulation

41 Read the code, trace execution, step through with debugger - we’re going to start by reading the code through a trace of a simple program
Parser
• Before we get started - what are some strategies for figuring out how the Parser works?

42 Read the code, trace execution, step through with debugger

Parser

• Before we get started - what are some strategies for figuring out how the Parser works?

• We’re going to start by reading the code through a trace of a simple program

program { int i
int f( int j ) { int i return j + 5 }
i = f( 7 )
}
43 we’re gonna see currentToken in the next few slides, as we scan…
Parser - members

public class Parser {

private Token currentToken;


private Lexer lex;
private EnumSet<Tokens> relationalOps = EnumSet.of(
Tokens.Equal, Tokens.NotEqual, Tokens.Less, Tokens.LessEqual
);
private EnumSet<Tokens> addingOps = EnumSet.of(
Tokens.Plus, Tokens.Minus, Tokens.Or
);
private EnumSet<Tokens> multiplyingOps = EnumSet.of(
Tokens.Multiply, Tokens.Divide, Tokens.And
);
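
If EnumSet is unfamiliar, the following minimal sketch shows the pattern the Parser members rely on. The enum Kind is a made-up subset of token kinds for this example, not the generated Tokens enum.

```java
import java.util.EnumSet;

// Hypothetical demo class, for illustration only.
final class EnumSetDemo {
    // Made-up subset of token kinds; the real Tokens enum is generated by lexer.setup.
    enum Kind { Plus, Minus, Or, Multiply, Divide, And, Identifier }

    static final EnumSet<Kind> ADDING_OPS = EnumSet.of(Kind.Plus, Kind.Minus, Kind.Or);

    // Membership test, the same shape as addingOps.contains(kind) in the Parser.
    static boolean isAddingOp(Kind kind) {
        return ADDING_OPS.contains(kind);
    }

    public static void main(String[] args) {
        System.out.println(isAddingOp(Kind.Minus));
        System.out.println(isAddingOp(Kind.Multiply));
    }
}
```

Grouping operators into sets like this lets one parser method handle every operator at a given precedence level.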

44 Instantiate the Lexer given the sourceProgram input, and scan the first token (recall currentToken is a private member of the Parser class). Constructor’s job is to place the class in a valid, default state - for the Parser, this means with the first token set
Parser - Constructor

/**
 * Construct a new Parser;
 *
 * @param sourceProgram - source file name
 * @exception Exception - thrown for any problems at startup (e.g. I/O)
 */
public Parser(String sourceProgram) throws Exception {
    try {
        lex = new Lexer(sourceProgram);
        scan();
    } catch (Exception e) {
        System.out.println("********exception*******" + e.toString());
        throw e;
    }
}

private void scan() {
    currentToken = lex.nextToken();
    if (currentToken != null) {
        currentToken.print(); // debug printout
    }
    return;
}

45
Parser - execute

/**
* Execute the parse command
*
* @return the AST for the source program
* @exception Exception - pass on any type of exception raised
*/
public AST execute() throws Exception {
try {
return rProgram();
} catch (SyntaxError e) {
e.print();
throw e;
}
}
46 We need to check out expect next, but first, why are we invoking rBlock here? (because grammar - more on this in a minute)
Parser - rProgram

/**
* <pre>
* Program -> 'program' block ==> program
* </pre>
*
* @return the program tree
* @exception SyntaxError - thrown for any syntax error
*/
public AST rProgram() throws SyntaxError {
// note that rProgram actually returns a ProgramTree; we use the
// principle of substitutability to indicate it returns an AST
AST t = new ProgramTree();
expect(Tokens.Program);
t.addKid(rBlock());
return t;
}

47 Here’s where we do syntax validation - in rProgram, we expect( Tokens.Program ). If we do not have a Program token here, the program is invalid, and we throw a SyntaxError. Note that expect also advances currentToken to the next token (via scan)
Parser - expect

private void expect(Tokens kind) throws SyntaxError {
    if (isNextTok(kind)) {
        scan();
        return;
    }
    throw new SyntaxError(currentToken, kind);
}

private boolean isNextTok(Tokens kind) {
    if ((currentToken == null) || (currentToken.getKind() != kind)) {
        return false;
    }
    return true;
}
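
To see the check-then-advance behavior in isolation, here is a minimal sketch under stated assumptions: ExpectDemo, its Kind enum, and the use of IllegalStateException in place of SyntaxError are all made up for this example. It recognizes only the token sequence for an empty program.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class; the real Parser works on Token objects from the Lexer.
final class ExpectDemo {
    enum Kind { Program, LeftBrace, RightBrace }

    private final Iterator<Kind> tokens;
    private Kind current; // null once the input is exhausted

    ExpectDemo(List<Kind> input) {
        tokens = input.iterator();
        scan(); // prime the first token, as the Parser constructor does
    }

    private void scan() {
        current = tokens.hasNext() ? tokens.next() : null;
    }

    // Consume the current token if it has the expected kind; otherwise fail.
    void expect(Kind kind) {
        if (current != kind) {
            throw new IllegalStateException("Expected: " + kind + ", found: " + current);
        }
        scan();
    }

    // Accepts exactly the token sequence: program { }
    static boolean accepts(List<Kind> input) {
        ExpectDemo p = new ExpectDemo(input);
        try {
            p.expect(Kind.Program);
            p.expect(Kind.LeftBrace);
            p.expect(Kind.RightBrace);
            return p.current == null; // nothing may follow the closing brace
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts(Arrays.asList(Kind.Program, Kind.LeftBrace, Kind.RightBrace)));
        System.out.println(accepts(Arrays.asList(Kind.LeftBrace, Kind.RightBrace)));
    }
}
```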

48
Parser - So far

• Create an instance of Parser

• Call execute, which builds the ProgramTree (via rProgram)

• Check that the next token is the program token and scan past it, reporting a SyntaxError if the scanned token doesn’t match the program token

• Call rBlock to check the syntax of the block that should follow, and get the BlockTree that it returns, adding it as a child of the ProgramTree
49
Parser

• The grammar describes the structure of all the phrases in the language (it’s also known as a phrase structure grammar)

• Each nonterminal describes the structure of a set of phrases

• Example: the S nonterminal describes the set of statement phrases

• Example: the Program nonterminal describes the high level overall program structure

50
Operation of each Parser method

• When called, the token being scanned should be the start of one of the phrases described by the corresponding nonterminal

• The method will check the stream of tokens derived from the user input stream to ensure they are syntactically correct, as described by the rule

• As the method checks the user token stream, it advances the scanner. When it finishes scanning the phrase(s) for the method, the scanner will be advanced to the first token JUST AFTER the phrase

51
Operation of each Parser method, continued

• As the method checks the stream of tokens for syntactic correctness, it builds the AST as required.

• In this case, it builds an AST with root ProgramTree, and adds all the children for the sub phrases (recall that this rule only has one type of phrase with only one subphrase - block: PROGRAM → ‘program’ BLOCK)

• The method returns the AST just built

• If the method detects a syntax error, it will throw a SyntaxError exception
52
SyntaxError Exception class
class SyntaxError extends Exception {
private static final long serialVersionUID = 1L;
private Token tokenFound;
private Tokens kindExpected;

/**
* record the syntax error just encountered
*
* @param tokenFound is the token just found by the parser
* @param kindExpected is the token we expected to find based on the current
* context
*/
public SyntaxError(Token tokenFound, Tokens kindExpected) {
this.tokenFound = tokenFound;
this.kindExpected = kindExpected;
}

void print() {
System.out.println("Expected: "+ kindExpected);
return;
}
}

53
Parser - rBlock

• Recall that we left off in the rProgram method, which calls the rBlock method

• Production: BLOCK → ‘{’ D* S* ‘}’


public AST rBlock() throws SyntaxError {
expect(Tokens.LeftBrace);
AST t = new BlockTree();
while (startingDecl()) { // get decls
t.addKid(rDecl());
}
while (startingStatement()) { // get statements
t.addKid(rStatement());
}
expect(Tokens.RightBrace);
return t;
}

54
Parser - rBlock

• Check for the left brace and scan past it - report a SyntaxError if not found

• Next, we expect 0 or more Ds (declarations) followed by 0 or more Ss (statements)

• Repeatedly check if the next token can start a D; if so, parse it via rDecl and add the resulting Decl AST as a child

• Do the same for Ss

• Check for the closing right brace


55 As a style note, when I see the pattern: if( boolean expression ) return true else return false, I replace with return boolean expression - would make this a little clearer IMHO
Parser - startingDecl, startingStatement

boolean startingDecl() {
    if (isNextTok(Tokens.Int) || isNextTok(Tokens.BOOLean)) {
        return true;
    }
    return false;
}

boolean startingStatement() {
    if (isNextTok(Tokens.If) || isNextTok(Tokens.While) ||
        isNextTok(Tokens.Return) || isNextTok(Tokens.LeftBrace) ||
        isNextTok(Tokens.Identifier))
    {
        return true;
    }
    return false;
}
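
The style note can be illustrated with a minimal sketch. StartingDeclDemo and its Kind enum are assumptions made for this example, and the methods take the next token kind as a parameter instead of reading parser state. The first method returns the boolean expression directly; the second uses an EnumSet, which is one possible way to keep a longer membership test readable, not how the course code does it.

```java
import java.util.EnumSet;

// Hypothetical demo class, for illustration only.
final class StartingDeclDemo {
    // Made-up subset of token kinds for this example.
    enum Kind { Int, BOOLean, If, While, Return, LeftBrace, Identifier, RightBrace }

    // "if (cond) return true; return false;" collapsed into "return cond;"
    static boolean startingDecl(Kind next) {
        return next == Kind.Int || next == Kind.BOOLean;
    }

    private static final EnumSet<Kind> STATEMENT_STARTS =
        EnumSet.of(Kind.If, Kind.While, Kind.Return, Kind.LeftBrace, Kind.Identifier);

    // Same idea, with the set of starting tokens named explicitly.
    static boolean startingStatement(Kind next) {
        return next != null && STATEMENT_STARTS.contains(next);
    }

    public static void main(String[] args) {
        System.out.println(startingDecl(Kind.Int));
        System.out.println(startingStatement(Kind.RightBrace));
    }
}
```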

56
Parser - rDecl

• Declarations result in two possible productions:

• D → TYPE NAME

• D → TYPE NAME FUNHEAD BLOCK

• Note that rDecl may return either a DeclTree (for the first production) or a FunctionDeclTree (for the second production)

57 Point out isNextTok, instead of expect - if left paren, then FunctionDeclTree. Point out cascading dot operator - addKid returns the node it was called on, so we’re adding an rType generated child to the FunctionDeclTree, and then adding an rName generated child to that same FunctionDeclTree. Note not clean code - I prefer explicit else blocks, and whitespace to help parse this…
Parser - rDecl

public AST rDecl() throws SyntaxError {
    AST t, t1;
    t = rType();
    t1 = rName();
    if (isNextTok(Tokens.LeftParen)) {
        // function - note naming of predicate
        // isNextTok does not scan past the LeftParen, just checks for it
        t = (new FunctionDeclTree()).addKid(t).addKid(t1);
        t.addKid(rFunHead());
        t.addKid(rBlock());
        return t;
    }
    t = (new DeclTree()).addKid(t).addKid(t1);
    return t;
}
58 type is gonna be important for assignment 3 - why? (we added some type tokens…)
Parser - rType, rName
public AST rType() throws SyntaxError {
AST t;
if (isNextTok(Tokens.Int)) {
t = new IntTypeTree();
scan();
} else {
expect(Tokens.BOOLean);
t = new BoolTypeTree();
}
return t;
}

public AST rName() throws SyntaxError {


AST t;
if (isNextTok(Tokens.Identifier)) {
t = new IdTree(currentToken);
scan();
return t;
}
throw new SyntaxError(currentToken, Tokens.Identifier);
}

59 List of 0 or more decl’s, separated by commas, all in parens


Parser - rFunHead

public AST rFunHead() throws SyntaxError {


AST t = new FormalsTree();
expect(Tokens.LeftParen);
if (!isNextTok(Tokens.RightParen)) {
do {
t.addKid(rDecl());
if (isNextTok(Tokens.Comma)) {
scan();
} else {
break;
}
} while (true);
}
expect(Tokens.RightParen);
return t;
}
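
The "list of 0 or more items, separated by commas, all in parens" shape recurs in the grammar (formals and call arguments), so it is worth seeing on its own. This is a minimal sketch with made-up names (CommaListDemo, parseList) that works on plain strings instead of Token objects and throws IllegalStateException where the real parser would throw SyntaxError.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, for illustration only.
final class CommaListDemo {
    private final List<String> tokens;
    private int pos = 0;

    CommaListDemo(List<String> tokens) {
        this.tokens = tokens;
    }

    private boolean isNext(String text) {
        return pos < tokens.size() && tokens.get(pos).equals(text);
    }

    private void expect(String text) {
        if (!isNext(text)) {
            throw new IllegalStateException("Expected: " + text + " at token " + pos);
        }
        pos++;
    }

    // An item is any single token that is not punctuation.
    private String parseItem() {
        if (pos >= tokens.size() || isNext("(") || isNext(")") || isNext(",")) {
            throw new IllegalStateException("Expected an item at token " + pos);
        }
        return tokens.get(pos++);
    }

    // Parses '(' (item list ',')? ')' and returns the items in order.
    List<String> parseList() {
        List<String> items = new ArrayList<>();
        expect("(");
        if (!isNext(")")) {
            while (true) {
                items.add(parseItem());
                if (!isNext(",")) {
                    break;
                }
                pos++; // scan past the comma and expect another item
            }
        }
        expect(")");
        return items;
    }

    static List<String> parse(String... tokens) {
        return new CommaListDemo(Arrays.asList(tokens)).parseList();
    }

    public static void main(String[] args) {
        System.out.println(parse("(", ")"));
        System.out.println(parse("(", "a", ",", "b", ")"));
    }
}
```

Note that a trailing comma is rejected: after scanning a comma the loop insists on another item, just as rFunHead calls rDecl again.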

60 This rule indicates we should pick as many Ts as possible, Ts are left associative
Parser - rSimpleExpr
• SE → T

• SE → SE ‘+’ T

• SE → SE ‘-‘ T

• SE → SE ‘|’ T

public AST rSimpleExpr() throws SyntaxError {


AST t, kid = rTerm();
while ((t = getAddOperTree()) != null) {
t.addKid(kid);
t.addKid(rTerm());
kid = t;
}
return kid;
}
61
Parser - rSimpleExpr

• Consider the derivation:



SE → SE + T → SE + T + T → SE + T + T + … + T

where + could be any of the adding operators (+ - |)

• We want to find any number of Ts with any of the adding operators between them

• The adding operators should be performed in left-to-right order (left associative)
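
The loop in rSimpleExpr produces left-associative trees because the tree built so far becomes the left operand of the next operator. The sketch below makes that visible by building a fully parenthesized string instead of an AST. LeftAssocDemo and its methods are made up for this example, and a term is simplified to a single operand token.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, for illustration only.
final class LeftAssocDemo {
    private static final List<String> ADDING_OPS = Arrays.asList("+", "-", "|");

    private final List<String> tokens;
    private int pos = 0;

    LeftAssocDemo(List<String> tokens) {
        this.tokens = tokens;
    }

    // A term is a single operand token in this sketch.
    private String term() {
        if (pos >= tokens.size() || ADDING_OPS.contains(tokens.get(pos))) {
            throw new IllegalStateException("Expected an operand at token " + pos);
        }
        return tokens.get(pos++);
    }

    // SE -> T (addOp T)*, rendered as a parenthesized string rather than a tree.
    String simpleExpr() {
        String left = term();
        while (pos < tokens.size() && ADDING_OPS.contains(tokens.get(pos))) {
            String op = tokens.get(pos++);
            String right = term();
            // Everything parsed so far becomes the LEFT operand of this operator.
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }

    static String parse(String... tokens) {
        return new LeftAssocDemo(Arrays.asList(tokens)).simpleExpr();
    }

    public static void main(String[] args) {
        System.out.println(parse("i", "+", "j", "+", "7"));
    }
}
```

For i + j + 7 the grouping puts i + j innermost, which matches the nested AddOp nodes in the AST printed earlier for the sample program.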

62
Associativity (Wiki)

• Associativity is only needed when the operators in an expression have the same precedence.

• Usually + and − have the same precedence.

• Consider the expression 7 − 4 + 2. The result could be either (7 − 4) + 2 = 5 or 7 − (4 + 2) = 1.

• The former result corresponds to the case when + and − are left-associative, the latter to when + and − are right-associative.
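
The two readings of 7 − 4 + 2 can be checked with a short sketch. AssociativityDemo is a made-up class for this example; it evaluates the same operand and operator sequence once grouping from the left and once grouping from the right.

```java
// Hypothetical demo class, for illustration only.
final class AssociativityDemo {
    private static int apply(int a, char op, int b) {
        switch (op) {
            case '+': return a + b;
            case '-': return a - b;
            default: throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    // Group from the left: ((a op b) op c) ...
    static int evalLeft(int[] operands, char[] ops) {
        int acc = operands[0];
        for (int i = 0; i < ops.length; i++) {
            acc = apply(acc, ops[i], operands[i + 1]);
        }
        return acc;
    }

    // Group from the right: (a op (b op c)) ...
    static int evalRight(int[] operands, char[] ops) {
        int acc = operands[operands.length - 1];
        for (int i = ops.length - 1; i >= 0; i--) {
            acc = apply(operands[i], ops[i], acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] operands = {7, 4, 2};
        char[] ops = {'-', '+'};
        System.out.println(evalLeft(operands, ops));  // (7 - 4) + 2
        System.out.println(evalRight(operands, ops)); // 7 - (4 + 2)
    }
}
```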

63
Associativity (Wiki)

• In order to reflect normal usage, addition, subtraction, multiplication, and division operators are usually left-associative, while an exponentiation operator (if present) is right-associative

• Any assignment operators are also typically right-associative.

• To prevent cases where operands would be associated with two operators, or no operator at all, operators with the same precedence must have the same associativity.
64 Picking back up after review of associativity…
Parser - getAddOperTree

private AST getAddOperTree() {


Tokens kind = currentToken.getKind();
if (addingOps.contains(kind)) {
AST t = new AddOpTree(currentToken);
scan();
return t;
} else {
return null;
}
}

65 Will be in slide deck printout in iLearn, as well as in source code on github


But wait, there’s more

• We don’t need these for the program we’re analyzing, so skipping for the sake of time. Please review:

• rExpr

• rTerm

• getMultOperTree

• getRelationTree

• rFactor

• rStatement

66
Parser
program { int i
int f( int j ) { int i return j + 5 }
i = f( 7 )
}

• Result: https://gist.github.com/jrob8577/d4581f99d5006c944ecc88238f391276
