
COMPILER CONSTRUCTION

Principles and Practice

Kenneth C. Louden
San Jose State University

1. INTRODUCTION

2. SCANNING
3. CONTEXT-FREE GRAMMARS AND PARSING
4. TOP-DOWN PARSING
5. BOTTOM-UP PARSING

6. SEMANTIC ANALYSIS
7. RUNTIME ENVIRONMENT
8. CODE GENERATION

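Main References:
(1) 《编译原理及实践》 (Compiler Construction: Principles and Practice), Kenneth C. Louden (USA); translated by Feng Boqin, Feng Lan, et al.; China Machine Press.
(2) 《编译程序原理与技术》 (Principles and Techniques of Compilers), Li Gansheng and Wang Huamin, eds.; Tsinghua University Press.
(3) 《程序设计语言:编译原理》 (Programming Languages: Principles of Compilation), Chen Huowang et al., eds.; National Defense Industry Press.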

Chapter 1 Introduction

Emphasis:
History of compilers
Programs related to compilers
The translation process
Major data structures of a compiler
Other related issues
Bootstrapping and porting

What Are Compilers, and Why Study Them?

Compilers: computer programs that translate one language into another, from a source language (input) to a target language (output).

Source program --> compiler --> Target program

Source language: a high-level language such as C or C++
Target language: object code or machine code (machine instructions)

Purposes of learning about compilers:
1. Basic knowledge (theoretical techniques such as automata theory)
2. Tools and practical experience for designing and programming an actual compiler

Additional uses of compiling techniques:
developing command interpreters and interface programs

TINY: the language used for discussion in the text

C-Minus: consists of a small but sufficiently complex subset of C.
It is more extensive than TINY and suitable for a class project.

1.1 A brief history of compilers
1. In the late 1940s, the stored-program computer was pioneered by John von Neumann.
Programs were written in machine language, for example (Intel 8x86 processors in IBM PCs):
c7 06 0000 0002
which means: move the number 2 to location 0000 (in hexadecimal).

2. Assembly language: numeric codes were replaced by symbolic forms.
MOV X, 2
Assembler: translates the symbolic codes and memory locations of assembly language into the corresponding numeric codes.
Defects of assembly language:
difficult to read, write, and understand;
dependent on the particular machine.

3. The FORTRAN language and its compiler: developed between 1954 and 1957 by a team at IBM led by John Backus.
This was the first compiler to be developed.

4. Noam Chomsky studied the structure of natural language.
This led to the classification of languages according to the complexity of their grammars and the power of the algorithms needed to recognize them.

Four levels of grammars (the Chomsky hierarchy): type 0, type 1, type 2, and type 3
Type 0: unrestricted grammar, recognized by a Turing machine
Type 1: context-sensitive grammar
Type 2: context-free grammar, the most useful for programming languages
Type 3: regular (right-linear) grammar, equivalent to regular expressions and finite automata

5. Parsing problems: studied in the 1960s and 1970s
Code improvement techniques (optimization techniques): improve the efficiency of the code that a compiler generates.
Compiler-compilers (more properly, parser generators): automate only one part of the compilation process.
Yacc was written in 1975 by Steve Johnson for the UNIX system.
Lex was written in 1975 by Mike Lesk.

6. Recent advances in compiler design:
Application of more sophisticated algorithms for inferring and/or simplifying the information contained in a program (together with the development of more sophisticated programming languages that allow this kind of analysis).
Development of standard windowing environments (interactive development environments, IDEs).

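1.2 Programs related to compilers
1. Interpreters: another kind of language translator.
An interpreter executes the source program immediately instead of generating object code.
Whether an interpreter or a compiler is used depends on the language in use and the situation.
Interpreters: often used for BASIC, LISP, and so on.
Compilers: preferred when speed of execution is a primary consideration.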

2. Assemblers
An assembler is a translator from assembly language to object code.

3. Linkers
Collect code that was separately compiled or assembled in different object files into a single file.
Connect the code for standard library functions.
Connect resources supplied by the operating system of the computer.

4. Loaders
Relocatable code: code whose addresses are not completely fixed.
Loaders resolve all relocatable addresses relative to a given starting address.

5. Preprocessors
Preprocessors delete comments, include other files, and perform macro substitutions.

6. Editors
Produce a standard file for the compiler to read (structure-based editors also take the structure of the program into account).

7. Debuggers
Determine execution errors in a compiled program.

8. Profilers
Collect statistics on the behavior of an object program during execution.
Typical statistics: the number of times each procedure is called and the percentage of execution time spent in each procedure.

9. Project managers
Coordinate the files being worked on by different people.
sccs (source code control system) and rcs (revision control system) are project manager programs on Unix systems.

1.3 The translation process
The phases of a compiler:

Source code
  --> scanner                --> tokens
  --> parser                 --> syntax tree
  --> semantic analyzer      --> annotated tree
  --> source code optimizer  --> intermediate code
  --> code generator         --> target code
  --> target code optimizer  --> target code

Auxiliary components that interact with the phases: literal table, symbol table, error handler.

1. The scanner
Lexical analysis: the input is a stream of characters; the output is a sequence of tokens.
a[index] = 4 + 2
Tokens: a, [, index, ], =, 4, +, 2

The tasks of the scanner: recognize tokens, enter identifiers into the symbol table, and enter literals into the literal table.
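
The token list above can be reproduced with a very small hand-written scanner. The following is a minimal sketch in C, not the scanner of the TINY compiler: the names Token, TokenKind, and next_token are hypothetical, every non-alphanumeric character is treated as a one-character symbol, and no symbol table or literal table is maintained.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token kinds for this sketch (hypothetical names, not TINY's). */
typedef enum { TOK_ID, TOK_NUM, TOK_SYMBOL, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    char text[32];          /* the lexeme, truncated if too long */
} Token;

/* Return the next token starting at *src and advance *src past it. */
Token next_token(const char **src)
{
    assert(src != NULL && *src != NULL);
    Token tok = { TOK_END, "" };
    const char *p = *src;

    while (isspace((unsigned char)*p))      /* skip white space */
        p++;
    const char *start = p;

    if (*p == '\0') {
        tok.kind = TOK_END;                 /* end of input */
    } else if (isalpha((unsigned char)*p)) {
        tok.kind = TOK_ID;                  /* identifier: letter, then letters or digits */
        while (isalnum((unsigned char)*p))
            p++;
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUM;                 /* number: one or more digits */
        while (isdigit((unsigned char)*p))
            p++;
    } else {
        tok.kind = TOK_SYMBOL;              /* any other single character */
        p++;
    }

    size_t len = (size_t)(p - start);
    if (len > sizeof tok.text - 1)
        len = sizeof tok.text - 1;
    memcpy(tok.text, start, len);
    tok.text[len] = '\0';
    *src = p;
    return tok;
}
```

Calling next_token repeatedly on the string "a[index] = 4 + 2" yields the eight tokens listed above, followed by a TOK_END token.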

2. The parser
Determines the structure of the program.
Input: the sequence of tokens
Output: a parse tree or a syntax tree
A syntax tree is a condensation of the information contained in the parse tree.

Parse tree for a[index] = 4 + 2 (children are indented under their parent):

expression
  assign-expression
    expression
      subscript-expression
        expression
          identifier a
        [
        expression
          identifier index
        ]
    =
    expression
      additive-expression
        expression
          number 4
        +
        expression
          number 2

3. The semantic analyzer
Static semantics: properties of a program that cannot be conveniently expressed as syntax and analyzed by the parser, but that can be determined prior to execution.
For example: declarations, type checking, and data types

Dynamic semantics: properties that can be determined only by executing the program; they cannot be determined by a compiler.

Annotated syntax tree for a[index] = 4 + 2 (each node is followed by its type):

assign-expression
  subscript-expression : integer
    identifier a : array of integer
    identifier index : integer
  additive-expression : integer
    number 4 : integer
    number 2 : integer

4. The source code optimizer
Source-level optimization:
4 + 2 => 6 (constant folding)

Intermediate code: any internal representation of the source code used by the compiler (a syntax tree, three-address code, four-address code, and so on).

Three-address code:
t = 4 + 2
a[index] = t

Optimization in two steps:
1. t = 6
a[index] = t
2. a[index] = 6
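
Constant folding, as in the 4 + 2 example, can be carried out directly on a syntax tree. The sketch below is one possible way to do it, assuming a simplified tree with only number, variable, and addition nodes; the names Node, NodeKind, and fold_constants are hypothetical, and integer overflow is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* A simplified expression tree (hypothetical, for illustration only). */
typedef enum { NODE_NUM, NODE_VAR, NODE_ADD } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;                   /* meaningful when kind == NODE_NUM */
    struct Node *left, *right;   /* meaningful when kind == NODE_ADD */
} Node;

/* Fold constant additions bottom-up, rewriting the tree in place.
   Returns 1 if the node is a constant after folding, 0 otherwise. */
int fold_constants(Node *n)
{
    assert(n != NULL);
    if (n->kind == NODE_NUM)
        return 1;
    if (n->kind == NODE_VAR)
        return 0;

    /* Fold both operands first, even if one of them is not constant. */
    int left_const = fold_constants(n->left);
    int right_const = fold_constants(n->right);

    if (left_const && right_const) {
        n->value = n->left->value + n->right->value;
        n->kind = NODE_NUM;      /* the addition node becomes a number */
        n->left = NULL;
        n->right = NULL;
        return 1;
    }
    return 0;
}
```

Applied to an addition node whose children are the numbers 4 and 2, the function turns that node into the number 6; an addition with a variable operand is left as an addition.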

5. The code generator
Input: intermediate code (IR)
Output: machine code, that is, code for the target machine

6. The target code optimizer
Improves the target code generated by the code generator.
Tasks: choosing addressing modes to improve performance,
replacing slow instructions by faster ones,
eliminating redundant or unnecessary operations.

Before optimization:
MOV R0, index
MUL R0, 2
MOV R1, &a
ADD R1, R0
MOV *R1, 6

After optimization:
MOV R0, index
SHL R0
MOV &a[R0], 6
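
Replacing MUL R0, 2 by SHL R0 is an instance of replacing a slow instruction by a faster one. The following sketch shows how such a rewrite might look as a pass over a list of instructions. The instruction representation (Opcode, Instr) and the function reduce_strength are assumptions made for this illustration; unlike the SHL R0 above, the shift count is stored explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* A toy instruction format (hypothetical, for illustration only). */
typedef enum { OP_MOV, OP_MUL, OP_SHL, OP_ADD } Opcode;

typedef struct {
    Opcode op;
    int reg;        /* destination register number */
    int operand;    /* immediate operand */
} Instr;

/* Replace each "MUL r, 2^k" (k >= 1) by "SHL r, k".
   Returns the number of instructions rewritten. */
int reduce_strength(Instr *code, int count)
{
    assert(code != NULL || count == 0);
    int changed = 0;
    for (int i = 0; i < count; i++) {
        int c = code[i].operand;
        /* c is a power of two greater than 1 when exactly one bit is set */
        if (code[i].op == OP_MUL && c > 1 && (c & (c - 1)) == 0) {
            int shift = 0;
            while ((1 << shift) < c)
                shift++;
            code[i].op = OP_SHL;
            code[i].operand = shift;
            changed++;
        }
    }
    return changed;
}
```

With this representation, MUL R0, 2 becomes a shift by 1 and a multiplication by 8 becomes a shift by 3, while a multiplication by 6 is left unchanged.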

1.4 Major data structures in a compiler
1. Tokens:
each token is a value of an enumerated data type that represents the set of tokens
2. The syntax tree:
each node is a record whose fields represent the information collected by the parser and the semantic analyzer
3. The symbol table:
information associated with identifiers: functions, variables, constants, and data types.
The symbol table interacts with the scanner, the parser, the semantic analyzer, the optimizer, and the code generator.
Operations: insertion, deletion, and access.
4. The literal table:
stores constants and strings
needs quick insertion and lookup, but need not allow deletions
5. Intermediate code:
kept as an array of text strings, a temporary text file, or a linked list of structures.
6. Temporary files:
hold the products of intermediate steps
for example: backpatching addresses during code generation
if x = 0 then ... else ...
Code: CMP x, 0
JNE NEXT      (the location of NEXT is not yet known here)
...
NEXT:
...
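
Backpatching means emitting a forward jump with an unknown target and filling the target in once its location is known. The sketch below illustrates the idea with an in-memory code buffer rather than a temporary file. All names (CodeBuffer, emit, backpatch, gen_if) and the instruction format are assumptions made for this illustration, and the body of the then-part is represented only by placeholder instructions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CODE 64
#define NO_TARGET (-1)

/* A toy instruction format (hypothetical, for illustration only). */
typedef enum { I_CMP, I_JNE, I_OTHER } Op;

typedef struct {
    Op op;
    int target;     /* jump target for I_JNE, NO_TARGET otherwise */
} Instr;

typedef struct {
    Instr code[MAX_CODE];
    int count;      /* number of instructions emitted so far */
} CodeBuffer;

/* Append one instruction and return its index in the buffer. */
int emit(CodeBuffer *buf, Op op, int target)
{
    assert(buf != NULL && buf->count < MAX_CODE);
    buf->code[buf->count].op = op;
    buf->code[buf->count].target = target;
    return buf->count++;
}

/* Fill in the target of a previously emitted jump. */
void backpatch(CodeBuffer *buf, int at, int target)
{
    assert(buf != NULL && at >= 0 && at < buf->count);
    buf->code[at].target = target;
}

/* Generate code for "if x = 0 then <body>", where the body is
   body_len placeholder instructions. Returns the index of NEXT. */
int gen_if(CodeBuffer *buf, int body_len)
{
    emit(buf, I_CMP, NO_TARGET);                /* CMP x, 0 */
    int jump = emit(buf, I_JNE, NO_TARGET);     /* JNE NEXT, target unknown */
    for (int i = 0; i < body_len; i++)
        emit(buf, I_OTHER, NO_TARGET);          /* code for the then-part */
    backpatch(buf, jump, buf->count);           /* NEXT is the next free slot */
    return buf->count;
}
```

With a three-instruction body starting from an empty buffer, the JNE at index 1 is patched to jump to index 5, the first location after the body.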

1.5 Other issues in compiler structure
Viewing the compiler's structure from different angles:
1. Analysis and synthesis
analysis: lexical analysis, syntax analysis, semantic analysis (and part of optimization)
synthesis: code generation (and part of optimization)

2. Front end and back end
The phases are separated according to whether they depend on the source language or on the target language.
the front end (depends on the source language): the scanner, parser, semantic analyzer, and intermediate code synthesis
the back end (depends on the target language): the code generator and some optimization

Source code --> Front end --> Intermediate code --> Back end --> Target code

Advantage: portability

3. Passes
passes: the compiler may process the entire source program several times
the initial pass: constructs a syntax tree or intermediate code from the source
a pass may consist of several phases.

One compiler with three passes:
scanning and parsing;
semantic analysis and source-level optimization;
code generation and target-level optimization.

4. Language definition and compilers
The language definition and the compiler are closely related.
The semantics of the language may be given a formal definition in mathematical terms; one common method is denotational semantics.
The structure and behavior of the runtime environment of the language also affect compiler construction.

5. Compiler options and interfaces
A compiler interfaces with the operating system and provides options to the user for various purposes.

1.6 Bootstrapping and porting
Host language: the language in which the compiler itself is written.

Compiler for language A written in language B  +  existing compiler for language B  -->  running compiler for language A

Consider the following situations:
(1) The existing compiler for B runs on the target machine.
(2) The existing compiler for B runs on a machine different from the target machine.
(3) How were the first compilers written, when no compilers existed yet?
At first, compilers were written in machine language.
Today, a compiler is written in another language.

T-diagram: a compiler written in language H that translates language S into language T is drawn as a T, with S and T on the crossbar and H on the stem (H is expected to be the same as T):

S ------- T
     H

Below, [S -> T, written in H] abbreviates this T-diagram.

T-diagrams can be combined in two ways:

(1) [A -> B, written in H]  +  [B -> C, written in H]  =  [A -> C, written in H]
On the same machine H, a compiler from A to C can be obtained by combining the compiler from A to B with the compiler from B to C.

(2) [A -> B, written in H]  compiled by  [H -> K, written in M]  =  [A -> B, written in K]
A compiler from H to K is used to translate the implementation language of another compiler from H to K.
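
The two ways of combining T-diagrams can be stated as simple rules on triples (source, target, host). The sketch below models them in C as a way to check such combinations mechanically. The type TDiagram and the functions compose and translate_host are hypothetical names introduced for this illustration, and languages and machines are reduced to single characters.

```c
#include <assert.h>
#include <stddef.h>

/* A T-diagram: a compiler from `source` to `target`, written in `host`. */
typedef struct {
    char source;
    char target;
    char host;
} TDiagram;

/* First combination: A->B and B->C, both on the same host, give A->C.
   Returns 1 and stores the result in *out if the diagrams fit together. */
int compose(TDiagram first, TDiagram second, TDiagram *out)
{
    assert(out != NULL);
    if (first.target != second.source || first.host != second.host)
        return 0;
    out->source = first.source;
    out->target = second.target;
    out->host = first.host;
    return 1;
}

/* Second combination: compile the compiler `prog` (written in prog.host)
   with `with`, a compiler whose source language is prog.host. The result
   translates the same languages as `prog` but is written in with.target. */
int translate_host(TDiagram prog, TDiagram with, TDiagram *out)
{
    assert(out != NULL);
    if (prog.host != with.source)
        return 0;
    *out = prog;
    out->host = with.target;
    return 1;
}
```

For example, translate_host applied to a compiler from A to B written in H and a compiler from H to K yields a compiler from A to B written in K, matching the second combination above.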

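The solution to the first situation mentioned above:

[A -> H, written in B]  compiled by  [B -> H, written in H]  =  [A -> H, written in H]

The solution to the second situation mentioned above:

[A -> H, written in B]  compiled by  [B -> K, written in K]  =  [A -> H, written in K]

The result is a cross compiler: it runs on K but generates code for H.

The issue of a blunder of circularity:

[S -> T, written in S]

It is common to write a compiler in its own source language.

The solution to the third situation mentioned above (bootstrapping):

Step 1:
[A -> H, written in A]  compiled by  [A -> H, written in H]  =  [A -> H, written in H]
(compiler written in its own language A)  compiled by  (compiler in machine language)  =  (running but inefficient compiler)

Step 2:
[A -> H, written in A]  compiled by  [A -> H, written in H]  =  [A -> H, written in H]
(compiler written in its own language A)  compiled by  (running but inefficient compiler)  =  (final version of the compiler)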

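Solution to the porting problem:
To port the compiler from the old host H to the new host K, use the old compiler to produce a cross compiler, then recompile the compiler source with the cross compiler to generate the new compiler.

Step 1:
[A -> K, written in A]  compiled by  [A -> H, written in H]  =  [A -> K, written in H]
(compiler source code retargeted to K)  compiled by  (original compiler)  =  (cross compiler)

Step 2:
[A -> K, written in A]  compiled by  [A -> K, written in H]  =  [A -> K, written in K]
(compiler source code retargeted to K)  compiled by  (cross compiler)  =  (retargeted compiler)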

1.7 The TINY sample language and compiler
Language TINY: used as a running example (as a source language)
Target language: assembly language for the TM machine

The TINY language

The features of a program in TINY:
1. a program is a sequence of statements separated by semicolons
2. no procedures and no declarations
3. all variables are integer variables
4. two control statements: if (with an optional else part) and repeat
5. read and write statements
6. comments are enclosed in curly brackets and cannot be nested
7. expressions are Boolean and integer arithmetic expressions: Boolean expressions compare two arithmetic expressions using < or = and appear only as tests in control statements; arithmetic expressions use +, -, *, /, parentheses, constants, and variables.

One sample program in TINY: the factorial function

read x; { input an integer }
if 0 < x then { don't compute if x <= 0 }
  fact := 1;
  repeat
    fact := fact * x;
    x := x - 1
  until x = 0;
  write fact { output factorial of x }
end
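
For readers more familiar with C, the behavior of the sample program can be mirrored as follows. This is only an illustrative rendering, not output of the TINY compiler: the function name tiny_factorial is hypothetical, and read and write are replaced by a parameter and an output pointer. Note that repeat ... until runs its body at least once, which corresponds to a do-while loop with the negated test.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the TINY sample program. If x > 0, stores the factorial of x
   in *fact and returns 1; otherwise stores nothing and returns 0,
   just as the TINY program writes nothing when x <= 0. */
int tiny_factorial(int x, int *fact)
{
    assert(fact != NULL);
    if (0 < x) {
        *fact = 1;
        do {                      /* repeat */
            *fact = *fact * x;
            x = x - 1;
        } while (x != 0);         /* until x = 0 */
        return 1;
    }
    return 0;
}
```

For an input of 5 the function stores 120; for an input of 0 it stores nothing, since the test 0 < x fails.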

The TINY compiler

C files: globals.h, util.h, scan.h, parse.h, symtab.h, analyze.h, code.h, cgen.h
main.c, util.c, scan.c, parse.c, symtab.c, analyze.c, code.c, cgen.c

Four passes:
1. the scanner and the parser
2. semantic analysis: constructing the symbol table
3. semantic analysis: type checking
4. the code generator

main.c drives these passes. The central code is as follows:

syntaxTree = parse();              /* pass 1: scanning and parsing */
buildSymtab(syntaxTree);           /* pass 2: build the symbol table */
typeCheck(syntaxTree);             /* pass 3: type checking */
codeGen(syntaxTree, codefile);     /* pass 4: code generation */

The TM Machine

The target language: the assembly language of the TM machine

The TM machine has some of the properties of Reduced Instruction Set Computers (RISC):
1. all arithmetic and testing must take place in registers.
2. the addressing modes are extremely limited.

The simulator of the TM machine can directly execute the assembly files.
