1

8. Lexical Analysis

2/24/2018

John Roberts

2
Overview
• Lexical Analysis

• Assignment Two

• Project Code Overview

• Lexing

3
Lexical Analysis

• Read a stream of characters that make up the source program, and create a stream of tokens by combining the characters appropriately

• Tokens are sometimes also referred to as lexical units, or lexemes

• Example: the characters 't', 'h', 'e', 'n' will be combined to build the then token

• Example: the characters '1', '2', '4', '7' will be combined to form an integer token with a value of 1247
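
A minimal sketch of the two examples above, assuming a simplified grouping rule (letters start a word, digits start an integer). The class name CharGrouper and the "word:"/"int:" labels are hypothetical and are not part of the course lexer package.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: groups characters into word and integer tokens.
class CharGrouper {
    static List<String> group(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace separates tokens and is discarded
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < source.length() && Character.isLetterOrDigit(source.charAt(i))) {
                    i++;
                }
                tokens.add("word:" + source.substring(start, i));
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < source.length() && Character.isDigit(source.charAt(i))) {
                    i++;
                }
                // Assumes the digit sequence fits in an int
                int value = Integer.parseInt(source.substring(start, i));
                tokens.add("int:" + value);
            } else {
                tokens.add("char:" + c); // anything else is passed through one character at a time
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(group("then 1247")); // prints [word:then, int:1247]
    }
}
```
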
6
The Lexer

• We will be working with the lexer package

• Recall that the responsibility is to generate Tokens


7
Token Categories (generating tokens)

Category        Tokens
Reserved Words  program int boolean if then else while return
Identifiers     <the same as Java identifiers>
Integers        <a sequence of digits>
Operators       = == != < <= + - * / | &
Separators      { } ( ) ,
Comments        // until end of line
Whitespace      <spaces> <newlines> and other Java whitespace characters

We’ll see how we use this shortly


8
tokens file

• The tokens are defined in a tokens file

• Each line in the file will have two strings:

• The symbolic constant we will use in the compiler for the token

• The actual token

Program program
Int int
BOOLean boolean
If if
Then then
Else else
While while
Function function
Return return
Identifier <id>
INTeger <int>
LeftBrace {
RightBrace }
LeftParen (
RightParen )
Comma ,
Assign =
Equal ==
NotEqual !=
Less <
LessEqual <=
Plus +
Minus -
Or |
And &
Multiply *
Divide /
Comment //
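
A rough sketch of how the two strings on each line could be read, assuming they are separated by whitespace. TokensFileSketch and parse are hypothetical names; this is not the code in TokenSetup.java.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: reads "SymbolicConstant actualToken" pairs.
class TokensFileSketch {
    // Maps symbolic constant -> actual token text, preserving file order.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            // Split on the first run of whitespace only
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected two strings: " + line);
            }
            entries.put(parts[0], parts[1]);
        }
        return entries;
    }

    public static void main(String[] args) {
        Map<String, String> entries =
            parse(List.of("Program program", "LessEqual <=", "Comment //"));
        System.out.println(entries.get("LessEqual")); // prints <=
    }
}
```
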

9
Token Setup

• TokenSetup.java will read the tokens file, and automatically generate the files Tokens.java and TokenType.java

• The Tokens enum is actually a class - you can add methods, instance fields, and a constructor that can only be used to construct the enumerated values

• Values are accessed as Tokens.If, etc.
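
To illustrate the point that a Java enum is a class, here is a small sketch with an instance field, a constructor, and a method. SketchTokens and its text field are assumptions made for illustration; the generated Tokens.java may contain only the constant names.

```java
// Hypothetical enum, not the generated Tokens.java.
enum SketchTokens {
    Program("program"),
    If("if"),
    Assign("=");

    private final String text; // instance field

    // Enum constructors are implicitly private: they can only be
    // used to construct the enumerated values listed above.
    SketchTokens(String text) {
        this.text = text;
    }

    String text() {
        return text;
    }
}

class SketchTokensDemo {
    public static void main(String[] args) {
        System.out.println(SketchTokens.If.text()); // prints if
    }
}
```
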


10
TokenSetup.java

• Examine code to ensure we understand how it works

• Execute TokenSetup and inspect Tokens.java and TokenType.java

11
SourceReader.java

• Examine code to ensure we understand how it works

• Note that we will be updating this file to generate better output (which we’ll see in a minute when we run Lexer)

12
Token.java

• Each Token contains four pieces of information

• String of Token found in source

• TokenType

• Starting column from source file

• Ending column

• The first two items are grouped as a Symbol
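
One possible shape for these four pieces of information, as a sketch only. The nested class names, the Kind enum, and the field names leftColumn and rightColumn are assumptions and may not match Token.java or Symbol.java.

```java
// Hypothetical sketch of the Token/Symbol grouping.
class TokenSketch {
    enum Kind { Program, LeftBrace, Identifier } // stand-in for the token type

    // Groups the first two items: the source string and its type.
    static final class Symbol {
        final String text;
        final Kind kind;

        Symbol(String text, Kind kind) {
            this.text = text;
            this.kind = kind;
        }
    }

    // Adds the starting and ending columns to a Symbol.
    static final class Token {
        final Symbol symbol;
        final int leftColumn;
        final int rightColumn;

        Token(Symbol symbol, int leftColumn, int rightColumn) {
            this.symbol = symbol;
            this.leftColumn = leftColumn;
            this.rightColumn = rightColumn;
        }

        @Override
        public String toString() {
            return "Token( Symbol( \"" + symbol.text + "\", " + symbol.kind + " ), "
                + leftColumn + ", " + rightColumn + " )";
        }
    }

    public static void main(String[] args) {
        Token t = new Token(new Symbol("program", Kind.Program), 1, 7);
        System.out.println(t);
    }
}
```
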


13
Symbol.java

Note: we’ve seen this hash pattern before…

• String from the source, and TokenType

• All Strings (corresponding to tokens) found in the source program will be placed into the hash table in the Symbol class (the Symbol table)

• Before we begin, we place all Tokens in the Symbol hash table

• Each String will (should) be inserted exactly once

14
Symbol.java example

program { int j int k
 j = j + k
}

Token( Symbol( "program", Tokens.Program ), 1, 7 )
Token( Symbol( "{", Tokens.LeftBrace ), 9, 9 )
Token( Symbol( "int", Tokens.Int ), 11, 13 )
Token( Symbol( "j", Tokens.Identifier ), 15, 15 )
Token( Symbol( "int", Tokens.Int ), 17, 19 )
Token( Symbol( "k", Tokens.Identifier ), 21, 21 )
Token( Symbol( "j", Tokens.Identifier ), 2, 2 )
Token( Symbol( "=", Tokens.Assign ), 4, 4 )
Token( Symbol( "j", Tokens.Identifier ), 6, 6 )
Token( Symbol( "+", Tokens.Plus ), 8, 8 )
Token( Symbol( "k", Tokens.Identifier ), 10, 10 )
Token( Symbol( "}", Tokens.RightBrace ), 1, 1 )

• Symbol( String s, Tokens kind ) - insert s into the hash table with value given by kind; if the entry is already in the table, then just return the entry

15
Symbol.java example

• Note that we repeated a Symbol three times: Symbol( "j", Tokens.Identifier )

• For efficiency, we only want to create one instance of each Symbol, so we use the hash table to check if the Symbol has already been created. If so, re-use it; if not, create a new instance.

• Logic encapsulated in Symbol class
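
The re-use logic described above can be sketched as a static factory backed by a HashMap. This is an illustration under assumptions: InternSketch and the Kind enum are hypothetical, and the real Symbol.symbol may differ in its details.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "one instance per String".
class InternSketch {
    enum Kind { Identifier, Int } // stand-in for the token type

    static final class Symbol {
        // The symbol table: source text -> the single Symbol for that text
        private static final Map<String, Symbol> symbols = new HashMap<>();

        final String text;
        final Kind kind;

        // Private, so callers must go through symbol(...)
        private Symbol(String text, Kind kind) {
            this.text = text;
            this.kind = kind;
        }

        static Symbol symbol(String text, Kind kind) {
            Symbol existing = symbols.get(text);
            if (existing != null) {
                return existing; // already created: re-use it
            }
            Symbol created = new Symbol(text, kind);
            symbols.put(text, created);
            return created;
        }
    }

    public static void main(String[] args) {
        Symbol first = Symbol.symbol("j", Kind.Identifier);
        Symbol second = Symbol.symbol("j", Kind.Identifier);
        System.out.println(first == second); // prints true
    }
}
```
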



17
Performing Lexical Analysis

• Prior to processing the user’s program, we’ll create Symbol instances for all reserved words, operators, etc. so we can find them later (see TokenType.java)

• Once the lexer starts processing the user’s program, the only new symbols that will be created (added to the hash map) will be identifiers and numbers - all other symbols would have been created before

18
Initializing

• Insert all tokens into a HashMap<String, Symbol>

• The tokens HashMap in TokenType holds all of the known Token/Symbol pairs, e.g.

tokens.put( Tokens.Program, Symbol.symbol( "program", Tokens.Program ) );

• Each of these is stored in the symbol table as it is generated (see implementation for Symbol.symbol)

• At this point, Symbol.symbols.get( "program" ) yields Symbol( "program", Tokens.Program )
19
Lexing

• Scan the program line by line (character by character), and insert symbols not already in the symbols table (identifiers and ints)

• If we look up an identifier in the symbols:

• Reserved word (e.g. program) - found and Symbol( "program", Tokens.Program ) returned

• User id not already in symbols - we don’t find it, so we put a new entry and return the new Symbol

• User id already in symbols - return the entry
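
The three lookup cases above can be sketched with a table that is pre-loaded with reserved words. LookupSketch, lookupWord, and the Kind enum are hypothetical names, and the use of computeIfAbsent is a choice made for this sketch rather than something the lexer package is known to do.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of identifier lookup against a pre-loaded table.
class LookupSketch {
    enum Kind { Program, If, Identifier } // stand-in for the token type

    static final class Symbol {
        final String text;
        final Kind kind;

        Symbol(String text, Kind kind) {
            this.text = text;
            this.kind = kind;
        }
    }

    static final Map<String, Symbol> symbols = new HashMap<>();

    static {
        // Reserved words are inserted before any source is scanned
        symbols.put("program", new Symbol("program", Kind.Program));
        symbols.put("if", new Symbol("if", Kind.If));
    }

    // Reserved word or known id: returns the existing entry.
    // Unknown id: inserts a new Identifier entry and returns it.
    static Symbol lookupWord(String word) {
        return symbols.computeIfAbsent(word, w -> new Symbol(w, Kind.Identifier));
    }

    public static void main(String[] args) {
        System.out.println(lookupWord("program").kind);           // prints Program
        System.out.println(lookupWord("j").kind);                 // prints Identifier
        System.out.println(lookupWord("j") == lookupWord("j"));   // prints true
    }
}
```
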

20
Lexing

• If we look up other tokens in symbols:

• Numbers - put new entry, if not already there

• Not found - don’t do anything

• e.g. = vs. == vs. !=, / vs. // - these are either one or two character tokens

• e.g. x =abc + y - we can key on the character =, and save the a for the start of the next token (the abc identifier)
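
A sketch of the one-versus-two character decision, assuming we peek at the next character and fall back to the single character when the pair is not a known token. OperatorSketch and operatorAt are hypothetical names; the token sets are taken from the Token Categories slide.

```java
import java.util.Set;

// Hypothetical sketch: choose between a one- and two-character token.
class OperatorSketch {
    static final Set<String> TWO_CHAR = Set.of("==", "!=", "<=", "//");
    static final Set<String> ONE_CHAR = Set.of("=", "<", "+", "-", "*", "/", "|", "&");

    // Returns the token text starting at pos, or null if none matches.
    static String operatorAt(String source, int pos) {
        // Try the two-character token first
        if (pos + 1 < source.length()) {
            String two = source.substring(pos, pos + 2);
            if (TWO_CHAR.contains(two)) {
                return two;
            }
        }
        // Otherwise only the first character is consumed; the next
        // character is left as the start of the following token
        String one = source.substring(pos, pos + 1);
        return ONE_CHAR.contains(one) ? one : null;
    }

    public static void main(String[] args) {
        System.out.println(operatorAt("x =abc + y", 2)); // prints =
        System.out.println(operatorAt("x == y", 2));     // prints ==
    }
}
```

In the x =abc + y case the call consumes only the = at index 2, so the a at index 3 is still available to begin the abc identifier.
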