Parsing

1
9. Parsing
2/24/2018
John Roberts
2 provide context, define parsing (again)

Overview
• Parsing
• Grammars
• X Language Grammar
• Building ASTs for Grammar
• Parser Source Code
3
Parsing
Parsing (Grammars)
recursive descent
Tokens processing
(stream of lexical units)
Abstract Syntax Tree (AST)
• Recall that lexical analysis is the process by which the source code is examined and tokens are generated
4
Parsing
Parsing (Grammars)
recursive descent
Tokens processing
(stream of lexical units)
Abstract Syntax Tree (AST)
• Check that the tokens are assembled in the correct fashion

according to the grammar
• Check the structure and syntax of user’s program
• Build the Abstract Syntax Tree (AST) to represent the program
• AST is a representation of the source program in a form that is

convenient for subsequent processing
5
Overview
• Parsing
• Grammars
6 This is what you’ll spend a few weeks on in Theory and Programming Language Design
Grammars
• A set of production rules for strings in a formal language
• Describes how to form strings from the language’s

alphabet that are valid according to the language’s syntax
• Wiki link
7
Grammars
• A grammar G consists of
• A finite set N of nonterminal symbols that is disjoint from the strings formed from G
• A finite set T of terminal symbols that is disjoint from N
• A finite set P of production (rewriting) rules
• A distinguished symbol S that is the start symbol
• G = ( N, T, P, S )
8
Nonterminals
• G = ( N, T, P, S )
• Describe the main structural components (syntactic

categories) of the language
• blocks
• programs
• expressions
• etc.
9
Terminals
• G = ( N, T, P, S )
• The tokens used by the programmer, and determined by

the Lexer
10
Production Rules
• G = ( N, T, P, S )
• Rules that describe what the nonterminals may be

rewritten into
11
Production Rules
• Example: S → if E then BLOCK else BLOCK
• This category describes if statements
• S represents statements in the user program; S is the rule’s left

hand side (LHS)
• if E then BLOCK else BLOCK is the rule’s right hand side (RHS)
• S is in the set of N (nonterminals)
• The RHS is a string of symbols composed from those symbols in

N (nonterminals) and T (terminals)
12
Start symbol
• G = ( N, T, P, S )
• The special start symbol S is a symbol in N

(nonterminals) used to start the rewriting process (also
called the derivation)
• We may continue rewriting as long as the produced

string has nonterminals
13
Sample Derivation
• A grammar is defined as: 

S → aS | bBS | c 
B → bB | b
• Note that terminals are typically in lower case, while

nonterminals are in upper case
• A sample derivation is: 

S → bBS → bbBS → bbbS → bbbaS → bbbaaS →
bbbaac
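To make the derivation concrete, here is a minimal sketch of a recognizer for this toy grammar, written in the recursive descent style used later in the deck. The class and method names are hypothetical and not part of the course code. It assumes that consuming every consecutive b for B is sufficient, which holds for this particular grammar.

```java
// Hypothetical recognizer for the toy grammar:
//   S → aS | bBS | c
//   B → bB | b
final class ToyGrammarRecognizer {
    private final String input;
    private int pos = 0; // index of the next unread character

    private ToyGrammarRecognizer(String input) {
        this.input = input;
    }

    // True when the whole string can be derived from S.
    static boolean accepts(String s) {
        ToyGrammarRecognizer r = new ToyGrammarRecognizer(s);
        return r.parseS() && r.pos == s.length();
    }

    // Current character, or '\0' at end of input.
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    // S → aS | bBS | c
    private boolean parseS() {
        switch (peek()) {
            case 'a':
                pos++;
                return parseS();
            case 'b':
                pos++;
                return parseB() && parseS();
            case 'c':
                pos++;
                return true;
            default:
                return false;
        }
    }

    // B → bB | b, i.e. one or more b's
    private boolean parseB() {
        if (peek() != 'b') {
            return false;
        }
        while (peek() == 'b') {
            pos++;
        }
        return true;
    }
}
```

With this sketch, `ToyGrammarRecognizer.accepts("bbbaac")` returns true for the string derived above, while `"bc"` is rejected because B needs at least one b.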
14
Overview
• Parsing
• Grammars
15 D = Declarations, S = Statements, * = zero or more, ? = zero or one
Grammar for X
Production                                  Category
PROGRAM → ‘program’ BLOCK                   program
BLOCK → ‘{’ D* S* ‘}’                       block
D → TYPE NAME                               decl
D → TYPE NAME FUNHEAD BLOCK                 functionDecl
TYPE → ‘int’
TYPE → ‘boolean’
FUNHEAD → ‘(’ (D list ‘,’)? ‘)’             formals

16 E = Expression
Grammar for X
Production                                  Category
S → ‘if’ E ‘then’ BLOCK ‘else’ BLOCK        if
S → ‘while’ E BLOCK                         while
S → ‘return’ E                              return
S → BLOCK
S → NAME ‘=’ E                              assign
17 SE = Simple Expression
Grammar for X
Production                                  Category
E → SE
E → SE ‘==’ SE                              =
E → SE ‘!=’ SE                              !=
E → SE ‘<’ SE                               <
E → SE ‘<=’ SE                              <=
18
Grammar for X
Production                                  Category
SE → T
SE → SE ‘+’ T                               +
SE → SE ‘-’ T                               -
SE → SE ‘|’ T                               |
19 T = Term
Grammar for X
Production                                  Category
T → F
T → T ‘*’ F                                 *
T → T ‘/’ F                                 /
T → T ‘&’ F                                 &
20 F = Factor
Grammar for X
Production                                  Category
F → ‘(’ E ‘)’
F → NAME
F → <int>
F → NAME ‘(’ (E list ‘,’)? ‘)’              call
NAME → <id>
21 IDENTIFIER → LETTER (CHARACTER)*
CHARACTER → LETTER | _ | … | $
LETTER → a | b | c | … | z
DECIMAL_LITERAL → [1 | 2 | … | 9] ( 0 | … | 9)*
Grammar for X
• <id> stands for any valid identifier (what would a production look like for this?)
• <int> stands for any integer (what would a production look like for this?)
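One way to read the productions in the notes for this slide is as regular expressions. The sketch below uses `java.util.regex.Pattern`; the class name is hypothetical. It assumes LETTER covers only a to z, and that CHARACTER is limited to LETTER, `_` and `$`, the alternatives actually spelled out. Whatever the "…" in CHARACTER stands for is deliberately left out.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the slide's token productions read as regular expressions.
final class XTokenPatterns {
    // IDENTIFIER → LETTER (CHARACTER)*
    // Assumption: CHARACTER is only LETTER, '_' or '$' here; the elided
    // alternatives in the slide are not guessed at.
    static final Pattern IDENTIFIER = Pattern.compile("[a-z][a-z_$]*");

    // DECIMAL_LITERAL → [1 | ... | 9] (0 | ... | 9)*
    // A nonzero digit followed by any number of digits.
    static final Pattern DECIMAL_LITERAL = Pattern.compile("[1-9][0-9]*");

    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    static boolean isDecimalLiteral(String s) {
        return DECIMAL_LITERAL.matcher(s).matches();
    }
}
```

Read this way, the DECIMAL_LITERAL production as written does not derive a lone `0`, since the first digit must be 1 through 9.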
22
Sample Derivation for X Grammar
• A sample derivation is: 

PROGRAM → program BLOCK → 
program { D S } → 
program { TYPE NAME S } → 
program { int NAME S } → 
program { int a S } → 
program { int a NAME = E } → 
program { int a a = E } → 
program { int a a = SE } → 
program { int a a = T } → 
program { int a a = F } → 
program { int a a = 5 }
23
Overview
• Parsing
• Grammars
24
Parser Notes
• There will be one (recursive) procedure for each

nonterminal to process the right-hand-side of each
syntax rule
• An AST subtree will be built to represent the right-hand-

side (the tree to be built is described by the grammar)
• If the user’s program doesn’t conform to the grammar,

then a syntax error will be reported
• When we look at the code, note that the procedure for

the Program rule will build the AST for the entire program
25
Building ASTs for Grammar
• PROGRAM → ‘program’ BLOCK 

Build a program tree with one child for the BLOCK tree
• BLOCK → ‘{‘ D* S* ‘}’ 

Build a block tree with children for the declarations (D)
and statements (S)
26
• D → TYPE NAME FUNHEAD BLOCK 

Build a functionDecl tree with four children:
• TYPE - Return type of the function
• NAME - Name of the function
• FUNHEAD - Function formals
• BLOCK - Function body
27
• FUNHEAD → ‘(‘ (D list ‘,’)? ‘)’ 

Build the tree describing the formal parameters for the
corresponding function declaration
• There is one tree per formal declaration (D list ‘,’)
• The Ds are separated by commas (indicated by list)

28
• F → NAME ‘(‘ (E list ‘,’)? ‘)’ 

Build a call tree with one child for the function name, and
one child for each actual argument expression E
29
ASTs built from Source
program { int i int j
  i = i + j + 7
  j = write(i)
}
• Lexical analysis produces the following tokens

program leftBrace intType <id:i> intType <id:j>
<id:i> assign <id:i> plus <id:j> plus <int:7>
<id:j> assign <id:write> leftParen <id:i> rightParen
rightBrace
• Assignment 2 adds some tokens into the set of tokens recognized

during lexical analysis
• Assignment 3 will add these into the grammar, and provide a textual
(next slide) and graphical (following slide) display of the generated
ASTs

Textual display of the AST for the source above
1: Program
  2: Block
    5: Decl
      3: IntType
      4: Id: i
    8: Decl
      6: IntType
      7: Id: j
    10: Assign
      9: Id: i
      14: AddOp: +
        12: AddOp: +
          11: Id: i
          13: Id: j
        15: Int: 7
    17: Assign
      16: Id: j
      19: Call
        18: Id: write
        20: Id: i
[Figure: graphical display of the same AST]
program
  block
    decl (int, i)
    decl (int, j)
    assign (i, + (+ (i, j), 7))
    assign (j, call (write, i))
32 Time permitting
Build an AST
program {boolean j int i

int factorial(int n) {
if (n < 2) then
{ return 1 }
else
{return n*factorial(n-1) }
}
while (1==1) {
i = write(factorial(read()))
}
}
• Answer: https://gist.github.com/jrob8577/7401c004e6c78040a15e9944e2536db2
33
Overview
• Parsing
• Grammars

34 Assignment 2 covered these
Compiler Packages
Package        Function of Package
lexer          lexical analyzer - scan the source program and output tokens
lexer.setup    automatically generate classes Sym and TokenType for lexer
35 We’ll discuss ast and parser packages tonight
Compiler Packages
Package        Function of Package
parser         analyze tokens, check syntax, build AST
ast            Abstract Syntax Tree classes - representation of source program for efficient processing
visitor        Used when walking (visiting) the AST - for printing, constraining, generating the code
compiler       Compiler main program
36
Compiler Packages
Package        Function of Package
constrain      visit the AST and check type constraints; decorate AST to hook up variable references to their declaration
codegen        visit the decorated AST and generate bytecodes
interpreter    Virtual machine, etc. used to execute the byte codes generated for the source program
37
Review Java util classes
• ArrayList - Resizable array implementation of the List

interface
• Iterator interface - An iterator over a collection. Allows

the caller to remove elements from the underlying
collection during the iteration (used in ASTVisitors -
next week’s topic)
• EnumSet - A specialized Set implementation for use with

enumerated types (Token)
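As a quick refresher on these three utilities, the sketch below filters a list of token kinds down to the adding operators. The `Kind` enum and the class name are hypothetical stand-ins, not the course's token types.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.List;

// Hypothetical review of ArrayList, Iterator and EnumSet.
final class UtilReview {
    // Stand-in for the course's token enum.
    enum Kind { PLUS, MINUS, OR, MULTIPLY }

    // EnumSet: a Set implementation specialised for enum constants.
    static final EnumSet<Kind> ADDING_OPS = EnumSet.of(Kind.PLUS, Kind.MINUS, Kind.OR);

    // Returns a copy of kinds with everything except adding operators removed.
    static List<Kind> keepAddingOps(List<Kind> kinds) {
        List<Kind> result = new ArrayList<>(kinds); // resizable copy of the input
        Iterator<Kind> it = result.iterator();
        while (it.hasNext()) {
            if (!ADDING_OPS.contains(it.next())) {
                it.remove(); // removes the element just returned by next()
            }
        }
        return result;
    }
}
```

Removing through the iterator, rather than through the list itself, is what keeps the loop valid while the underlying collection shrinks.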
38 kids irritates me, should be children. don’t be cute, be concise.

AST - Information at each node
public abstract class AST {
    // kids of the current node
    protected ArrayList<AST> kids;
    // used for identifying the node when printing
    // (see printed AST's given above)
    protected int nodeNum;
    // used during the constraining phase
    protected AST decoration;
    // used during the code generation phase
    protected String label = "";
39
AST - Methods (abbreviated)
    // return the ith kid of this node (kids are indexed from 1)
    public AST getKid(int i) {
        if ( (i <= 0) || (i > kidCount())) {
            return null;
        }
        return kids.get(i - 1);
    }

    // number of kids from this node
    public int kidCount() {
        return kids.size();
    }

    // return a Collection of the kids of this node
    public ArrayList<AST> getKids() {
        return kids;
    }

    // add a new kid to this node, return this node
    public AST addKid(AST kid) {
        kids.add(kid);
        return this;
    }
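The methods above can be tried in isolation with a small stand-in class. This is a hypothetical sketch, not the course's `AST` class. It keeps the 1-based `getKid` and the chaining `addKid`, and adds an indented pre-order rendering for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the AST node described above.
class SketchNode {
    private final String label;
    private final List<SketchNode> kids = new ArrayList<>();

    SketchNode(String label) {
        this.label = label;
    }

    // Adds a kid and returns this node, so calls can be chained.
    SketchNode addKid(SketchNode kid) {
        kids.add(kid);
        return this;
    }

    // Kids are indexed from 1; an out-of-range index yields null.
    SketchNode getKid(int i) {
        if (i <= 0 || i > kids.size()) {
            return null;
        }
        return kids.get(i - 1);
    }

    int kidCount() {
        return kids.size();
    }

    String getLabel() {
        return label;
    }

    // Pre-order rendering, two spaces of indentation per tree level.
    String render() {
        StringBuilder sb = new StringBuilder();
        render(sb, 0);
        return sb.toString();
    }

    private void render(StringBuilder sb, int depth) {
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(label).append('\n');
        for (SketchNode kid : kids) {
            kid.render(sb, depth + 1);
        }
    }
}
```

Because `addKid` returns the node it was called on, `new SketchNode("Decl").addKid(type).addKid(name)` attaches both children to the same Decl node.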
40 How do the tree implementations differ? IdTree adds information specific to Ids (stack frame offsets) Check out the others and reason through differences…
AST Package
• AST (Abstract)
• ProgramTree
• IdTree
• RelOpTree
• Note that each kind of AST is described in a subclass of

the abstract AST class - these are just a few so we can
run through a parser simulation
41 Read the code, trace execution, step through with debugger - we’re going to start by reading the code through a trace of a simple program
Parser
• Before we get started - what are some strategies for figuring out how the Parser works?
42 Read the code, trace execution, step through with debugger

Parser
• Before we get started - what are some strategies for

figuring out how the Parser works?
• We’re going to start by reading the code through a trace

of a simple program
program { int i
int f( int j ) { int i return j + 5 }
i = f( 7 )
}
43 we’re gonna see currentToken in the next few slides, as we scan…
Parser - members
public class Parser {
    private Token currentToken;
    private Lexer lex;

    private EnumSet<Tokens> relationalOps = EnumSet.of(
        Tokens.Equal, Tokens.NotEqual, Tokens.Less, Tokens.LessEqual
    );
    private EnumSet<Tokens> addingOps = EnumSet.of(
        Tokens.Plus, Tokens.Minus, Tokens.Or
    );
    private EnumSet<Tokens> multiplyingOps = EnumSet.of(
        Tokens.Multiply, Tokens.Divide, Tokens.And
    );
44 Instantiate the Lexer given the sourceProgram input, and scan the first token (recall currentToken is a private member of the Parser class). Constructor’s job is to place the class in a valid, default state - for the Parser, this means with the first token set
Parser - Constructor
    /**
     * Construct a new Parser;
     *
     * @param sourceProgram - source file name
     * @exception Exception - thrown for any problems at startup (e.g. I/O)
     */
    public Parser(String sourceProgram) throws Exception {
        try {
            lex = new Lexer(sourceProgram);
            scan();
        } catch (Exception e) {
            System.out.println("********exception*******" + e.toString());
            throw e;
        }
    }

    private void scan() {
        currentToken = lex.nextToken();
        if (currentToken != null) {
            currentToken.print(); // debug printout
        }
    }
45
Parser - execute
/**
* Execute the parse command
*
* @return the AST for the source program
* @exception Exception - pass on any type of exception raised
*/
    public AST execute() throws Exception {
        try {
            return rProgram();
        } catch (SyntaxError e) {
            e.print();
            throw e;
        }
    }
46 We need to check out expect next, but first, why are we invoking rBlock here? (because grammar - more on this in a minute)
Parser - rProgram
/**
* <pre>
* Program -> 'program' block ==> program
* </pre>
*
* @return the program tree
* @exception SyntaxError - thrown for any syntax error
*/
    public AST rProgram() throws SyntaxError {
        // note that rProgram actually returns a ProgramTree; we use the
        // principle of substitutability to indicate it returns an AST
        AST t = new ProgramTree();
        expect(Tokens.Program);
        t.addKid(rBlock());
        return t;
    }
47 Here’s where we do syntax validation - in rProgram, we expect( Tokens.Program ). If we do not have a Program token here, the program is invalid, and we throw a SyntaxError
Parser - expect
    private void expect(Tokens kind) throws SyntaxError {
        if (isNextTok(kind)) {
            scan(); // note that this also advances currentToken to the next token
            return;
        }
        throw new SyntaxError(currentToken, kind);
    }

    private boolean isNextTok(Tokens kind) {
        if ((currentToken == null) || (currentToken.getKind() != kind)) {
            return false;
        }
        return true;
    }
48
Parser - So far
• Create an instance of Parser
• Call execute, which builds the ProgramTree (via

rProgram)
• Check that the next token is the program token and scan
past it, reporting a SyntaxError if the scanned token
doesn’t match the program token
• Call rBlock to check the syntax of the block that should

follow, and get the BlockTree that it returns, adding it as
a child of the ProgramTree
49
Parser
• The grammar describes the structure of all the phrases in

the language (it’s also known as a phrase structure
grammar)
• Each nonterminal describes the structure of a set of

phrases
• Example: the S nonterminal describes the set of

statement phrases
• Example: the Program nonterminal describes the high

level overall program structure
50
Operation of each Parser method
• When called, the token being scanned should be the start

of one of the phrases described by the corresponding
nonterminal
• The method will check the stream of tokens derived from

the user input stream to ensure they are syntactically
correct, as described by the rule
• As the method checks the user token stream, it advances

the scanner. When it finishes scanning the phrase(s) for
the method, the scanner will be advanced to the first
token JUST AFTER the phrase
51
Operation of each Parser method, continued
• As the method checks the stream of tokens for syntactic

correctness, it builds the AST as required.
• In this case, it builds an AST with root ProgramTree,

and adds all the children for the sub phrases (recall that
this rule only has one type of phrase with only one
subphrase - block: PROGRAM → ‘program’ BLOCK)
• The method returns the AST just built
• If the method detects a syntax error, it will throw a

SyntaxError exception
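The contract described on the last two slides can be sketched with a miniature parser over plain string tokens. Everything here is hypothetical: the class names, the string-based tokens, the simplified rule BLOCK → '{' BLOCK* '}', and the unchecked exception. The course's `SyntaxError` is a checked `Exception`. Trees are returned as parenthesised strings rather than AST objects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of the parser contract; tokens are plain strings.
final class MiniParser {
    // Unchecked only to keep the sketch short.
    static final class MiniSyntaxError extends RuntimeException {
        MiniSyntaxError(String found, String expected) {
            super("Expected: " + expected + ", found: " + found);
        }
    }

    private final List<String> tokens;
    private int pos = 0; // index of the current token

    MiniParser(String... tokens) {
        this.tokens = Arrays.asList(tokens);
    }

    // Current token, or null at end of input.
    private String current() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    // Check the current token and scan past it, or report a syntax error.
    private void expect(String kind) {
        if (kind.equals(current())) {
            pos++;
            return;
        }
        throw new MiniSyntaxError(current(), kind);
    }

    // PROGRAM → 'program' BLOCK
    String parseProgram() {
        expect("program");
        return "(program " + parseBlock() + ")";
    }

    // BLOCK → '{' BLOCK* '}'   (simplified: nested blocks only)
    String parseBlock() {
        expect("{");
        StringBuilder tree = new StringBuilder("(block");
        while ("{".equals(current())) { // '{' can start a nested block
            tree.append(' ').append(parseBlock());
        }
        expect("}");
        return tree.append(')').toString();
    }

    // Index of the first token just after everything parsed so far.
    int position() {
        return pos;
    }
}
```

Each method starts on the first token of its phrase, builds its tree while checking syntax, and leaves `position()` just after the phrase.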
52
SyntaxError Exception class
class SyntaxError extends Exception {
    private static final long serialVersionUID = 1L;
    private Token tokenFound;
    private Tokens kindExpected;

    /**
     * record the syntax error just encountered
     *
     * @param tokenFound is the token just found by the parser
     * @param kindExpected is the token we expected to find based on the current
     * context
     */
    public SyntaxError(Token tokenFound, Tokens kindExpected) {
        this.tokenFound = tokenFound;
        this.kindExpected = kindExpected;
    }

    void print() {
        System.out.println("Expected: " + kindExpected);
    }
}
53
Parser - rBlock
• Recall that we left off in the rProgram method, which

calls the rBlock method
• Production: BLOCK → ‘{‘ D* S* ‘}’

    public AST rBlock() throws SyntaxError {
        expect(Tokens.LeftBrace);
        AST t = new BlockTree();
        while (startingDecl()) { // get decls
            t.addKid(rDecl());
        }
        while (startingStatement()) { // get statements
            t.addKid(rStatement());
        }
        expect(Tokens.RightBrace);
        return t;
    }
54
Parser - rBlock
• Check for the left brace and scan past it - report a

SyntaxError if not found
• Next, we expect 0 or more Ds (declarations) followed by

0 or more Ss (statements)
• Repeatedly check whether the next token can start a D; if so, parse it and add a Decl AST via rDecl
• Do the same for Ss
• Check for the closing right brace

55 As a style note, when I see the pattern: if( boolean expression ) return true else return false, I replace with return boolean expression - would make this a little clearer IMHO
Parser - startingDecl, startingStatement
    boolean startingDecl() {
        if( isNextTok(Tokens.Int) || isNextTok(Tokens.BOOLean) ) {
            return true;
        }
        return false;
    }

    boolean startingStatement() {
        if( isNextTok(Tokens.If) || isNextTok(Tokens.While) ||
            isNextTok(Tokens.Return) || isNextTok(Tokens.LeftBrace) ||
            isNextTok(Tokens.Identifier))
        {
            return true;
        }
        return false;
    }
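The style note for this slide, returning the boolean expression directly, might look like the sketch below. It goes one step further and keeps the starting tokens in `EnumSet`s, in the same spirit as the parser's operator sets. The `Tokens` enum here is a hypothetical subset, and the predicates take the token kind as a parameter so the example stands alone.

```java
import java.util.EnumSet;

// Hypothetical rewrite of the starting-token predicates.
final class StartingPredicates {
    // Stand-in subset of the course's Tokens enum.
    enum Tokens { Int, BOOLean, If, While, Return, LeftBrace, Identifier, RightBrace }

    static final EnumSet<Tokens> DECL_STARTERS =
        EnumSet.of(Tokens.Int, Tokens.BOOLean);

    static final EnumSet<Tokens> STATEMENT_STARTERS =
        EnumSet.of(Tokens.If, Tokens.While, Tokens.Return,
                   Tokens.LeftBrace, Tokens.Identifier);

    // kind is null at end of input, in which case nothing can start.
    static boolean startingDecl(Tokens kind) {
        return kind != null && DECL_STARTERS.contains(kind);
    }

    static boolean startingStatement(Tokens kind) {
        return kind != null && STATEMENT_STARTERS.contains(kind);
    }
}
```

Each predicate is now a single return statement, and adding a new starting token means editing one set rather than extending a chain of `||` tests.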
56
Parser - rDecl
• Declarations result in two possible productions:
• D → TYPE NAME
• D → TYPE NAME FUNHEAD BLOCK
• Note that rDecl may return either a DeclTree (for the

first production) or a FunctionDeclTree (for the
second production)
57 Point out isNextTok, instead of expect - if left paren, then FunctionDeclTree. Point out cascading dot operator - addKid returns the node it was called on, so we’re adding an rType generated child to the FunctionDeclTree, and then adding an rName generated child to that same FunctionDeclTree. Note not clean code - I prefer explicit else blocks, and whitespace to help parse this…
Parser - rDecl
    public AST rDecl() throws SyntaxError {
        AST t, t1;
        t = rType();
        t1 = rName();
        if (isNextTok(Tokens.LeftParen)) {
            // function - note naming of predicate
            // isNextTok does not scan past the LeftParen, just checks for it
            t = (new FunctionDeclTree()).addKid(t).addKid(t1);
            t.addKid(rFunHead());
            t.addKid(rBlock());
            return t;
        }
        t = (new DeclTree()).addKid(t).addKid(t1);
        return t;
    }
58 type is gonna be important for assignment 3 - why? (we added some type tokens…)
Parser - rType, rName
    public AST rType() throws SyntaxError {
        AST t;
        if (isNextTok(Tokens.Int)) {
            t = new IntTypeTree();
            scan();
        } else {
            expect(Tokens.BOOLean);
            t = new BoolTypeTree();
        }
        return t;
    }

    public AST rName() throws SyntaxError {
        AST t;
        if (isNextTok(Tokens.Identifier)) {
            t = new IdTree(currentToken);
            scan();
            return t;
        }
        throw new SyntaxError(currentToken, Tokens.Identifier);
    }
59 List of 0 or more decl’s, separated by commas, all in parens

Parser - rFunHead
    public AST rFunHead() throws SyntaxError {
        AST t = new FormalsTree();
        expect(Tokens.LeftParen);
        if (!isNextTok(Tokens.RightParen)) {
            do {
                t.addKid(rDecl());
                if (isNextTok(Tokens.Comma)) {
                    scan();
                } else {
                    break;
                }
            } while (true);
        }
        expect(Tokens.RightParen);
        return t;
    }
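The `( … list ‘,’)?` pattern in rFunHead, an optional comma-separated list inside parentheses, can be isolated in a small sketch. The class name, the string tokens, and the use of `IllegalArgumentException` in place of a syntax error are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of parsing '(' (ITEM list ',')? ')' over string tokens.
final class CommaListSketch {
    static List<String> parseParenList(String... tokens) {
        List<String> toks = Arrays.asList(tokens);
        List<String> items = new ArrayList<>();
        int pos = expect(toks, 0, "(");
        if (!")".equals(peek(toks, pos))) {      // empty list is allowed
            while (true) {
                String item = peek(toks, pos);
                if (item == null || item.equals(",") || item.equals(")")) {
                    throw new IllegalArgumentException("item expected at token " + pos);
                }
                items.add(item);
                pos++;
                if (",".equals(peek(toks, pos))) {
                    pos++;                        // scan past the comma, expect another item
                } else {
                    break;                        // no comma: the list is finished
                }
            }
        }
        expect(toks, pos, ")");
        return items;
    }

    // Token at pos, or null at end of input.
    private static String peek(List<String> toks, int pos) {
        return pos < toks.size() ? toks.get(pos) : null;
    }

    // Returns the position after the expected token, or fails.
    private static int expect(List<String> toks, int pos, String kind) {
        if (!kind.equals(peek(toks, pos))) {
            throw new IllegalArgumentException("Expected: " + kind);
        }
        return pos + 1;
    }
}
```

As in rFunHead, a comma commits the parser to another item, so a trailing comma such as `( a , )` is rejected.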
60 This rule indicates we should pick as many Ts as possible; the adding operators between them are left associative
Parser - rSimpleExpr
• SE → T
• SE → SE ‘+’ T
• SE → SE ‘-‘ T
• SE → SE ‘|’ T
    public AST rSimpleExpr() throws SyntaxError {
        AST t, kid = rTerm();
        while ((t = getAddOperTree()) != null) {
            t.addKid(kid);
            t.addKid(rTerm());
            kid = t;
        }
        return kid;
    }
61
Parser - rSimpleExpr
• Consider the derivation: 

SE → SE + T → SE + T + T → SE + T + T + … + T 
where + could be any of the adding operators (+ - |)
• We want to find any number of Ts with any of the adding

operators between them
• The adding operators should be performed in left-to-right

order (left associative)
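The loop in rSimpleExpr produces left-associative trees because each new operator takes the tree built so far as its left child. A minimal sketch of the same loop over string tokens, with the tree rendered as a parenthesised string, is shown below. The class name is hypothetical, and the input is assumed to alternate term, operator, term; malformed input is not handled.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the SE → SE op T loop, op in { +, -, | }.
final class LeftAssocSketch {
    // Assumes tokens alternate term, operator, term, ... and end on a term.
    static String parseSimpleExpr(String... tokens) {
        List<String> toks = Arrays.asList(tokens);
        int pos = 0;
        String kid = toks.get(pos++);                 // first T
        while (pos < toks.size() && isAddOp(toks.get(pos))) {
            String op = toks.get(pos++);              // adding operator
            String right = toks.get(pos++);           // next T
            // the tree built so far becomes the left child of the new operator
            kid = "(" + kid + " " + op + " " + right + ")";
        }
        return kid;
    }

    private static boolean isAddOp(String t) {
        return t.equals("+") || t.equals("-") || t.equals("|");
    }
}
```

For the tokens `i + j + 7` this yields `((i + j) + 7)`, the same shape as the AddOp nodes in the AST for `i = i + j + 7` shown earlier.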
62
Associativity (Wiki)
• Associativity is only needed when the operators in an

expression have the same precedence.
• Usually + and - have the same precedence.
• Consider the expression 7 − 4 + 2. The result could be

either (7 − 4) + 2 = 5 or 7 − (4 + 2) = 1.
• The former result corresponds to the case when + and

− are left-associative, the latter to when + and - are
right-associative.
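The two readings of 7 − 4 + 2 can be checked by evaluating the same operands and operators with left and right grouping. The sketch below is illustrative only; the class and method names are hypothetical, and it handles just + and -.

```java
// Hypothetical demonstration of left versus right associativity.
final class AssociativityDemo {
    // operands.length must be ops.length + 1; ops contains '+' or '-'.
    // Groups from the left: ((a op b) op c) ...
    static int evalLeft(int[] operands, char[] ops) {
        int acc = operands[0];
        for (int i = 0; i < ops.length; i++) {
            acc = apply(acc, ops[i], operands[i + 1]);
        }
        return acc;
    }

    // Groups from the right: a op (b op (c ...))
    static int evalRight(int[] operands, char[] ops) {
        int acc = operands[operands.length - 1];
        for (int i = ops.length - 1; i >= 0; i--) {
            acc = apply(operands[i], ops[i], acc);
        }
        return acc;
    }

    private static int apply(int left, char op, int right) {
        switch (op) {
            case '+':
                return left + right;
            case '-':
                return left - right;
            default:
                throw new IllegalArgumentException("unknown operator: " + op);
        }
    }
}
```

With operands `{7, 4, 2}` and operators `{'-', '+'}`, `evalLeft` gives 5 and `evalRight` gives 1, matching the two results above.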
63
Associativity (Wiki)
• In order to reflect normal usage, addition, subtraction,

multiplication, and division operators are usually left-
associative, while an exponentiation operator (if present)
is right-associative
• Any assignment operators are also typically right-

associative.
• To prevent cases where operands would be associated

with two operators, or no operator at all, operators with
the same precedence must have the same associativity.
64 Picking back up after review of associativity…
Parser - getAddOperTree
    private AST getAddOperTree() {
        Tokens kind = currentToken.getKind();
        if (addingOps.contains(kind)) {
            AST t = new AddOpTree(currentToken);
            scan();
            return t;
        } else {
            return null;
        }
    }
65 Will be in slide deck printout in iLearn, as well as in source code on github

But wait, there’s more
• We don’t need these for the program we’re analyzing, so skipping

for the sake of time. Please review:
• rExpr
• rTerm
• getMultOperTree
• getRelationTree
• rFactor
• rStatement
66
Parser
program { int i
int f( int j ) { int i return j + 5 }
i = f( 7 )
}
• Result: https://gist.github.com/jrob8577/d4581f99d5006c944ecc88238f391276
