David Silver
Lecture 8: Integrating Learning and Planning
Outline
1 Introduction
2 Model-Based Reinforcement Learning
3 Integrated Architectures
4 Simulation-Based Search
Introduction
Model-Free RL
No model
Learn value function (and/or policy) from experience
Model-Based RL
Learn a model from experience
Plan value function (and/or policy) from model
[Figures: the agent-environment loop over state S_t, action A_t, reward R_t — in model-free RL the value function/policy is learned from this real interaction; in model-based RL the same loop runs through a learned model]
Model-Based Reinforcement Learning

Model-Based RL
[Figure: the model-based RL loop — acting produces real experience, a model is learned from that experience, and planning over the model updates the value function/policy]
Advantages of Model-Based RL
Advantages:
Can efficiently learn model by supervised learning methods
Can reason about model uncertainty
Disadvantages:
First learn a model, then construct a value function ⇒ two sources of approximation error
Learning a Model
What is a Model?
A model M_η is a representation of an MDP, parameterized by η; it predicts the next state and reward:
S_{t+1} ~ P_η(S_{t+1} | S_t, A_t)
R_{t+1} = R_η(R_{t+1} | S_t, A_t)
Model Learning
Goal: estimate model M_η from experience {S_1, A_1, R_2, ..., S_T}
This is a supervised learning problem:
S_1, A_1 → R_2, S_2
S_2, A_2 → R_3, S_3
...
S_{T-1}, A_{T-1} → R_T, S_T
Examples of Models

Table Lookup Model
Model is an explicit MDP, P̂, R̂
Count visits N(s,a) to each state-action pair:

P̂^a_{s,s'} = 1/N(s,a) Σ_{t=1}^{T} 1(S_t, A_t, S_{t+1} = s, a, s')

R̂^a_s = 1/N(s,a) Σ_{t=1}^{T} 1(S_t, A_t = s, a) R_t
Alternatively
At each time-step t, record experience tuple ⟨S_t, A_t, R_{t+1}, S_{t+1}⟩
To sample the model, randomly pick a tuple matching ⟨s, a, ·, ·⟩
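As an illustration (not from the lecture), here is a minimal Python sketch of such a table lookup model; the class name TableLookupModel and its interface are assumptions:

import random
from collections import defaultdict

class TableLookupModel:
    """Table lookup model (hypothetical name): counts N(s, a),
    empirical transition/reward estimates, and tuple sampling."""

    def __init__(self):
        self.counts = defaultdict(int)           # N(s, a)
        self.next_counts = defaultdict(int)      # N(s, a, s')
        self.reward_sums = defaultdict(float)    # sum of rewards observed at (s, a)
        self.tuples = defaultdict(list)          # <S_t, A_t, R_{t+1}, S_{t+1}> keyed by (s, a)

    def update(self, s, a, r, s_next):
        """Record one real transition."""
        self.counts[(s, a)] += 1
        self.next_counts[(s, a, s_next)] += 1
        self.reward_sums[(s, a)] += r
        self.tuples[(s, a)].append((r, s_next))

    def p_hat(self, s, a, s_next):
        """Estimated P̂^a_{s,s'} = N(s, a, s') / N(s, a)."""
        return self.next_counts[(s, a, s_next)] / self.counts[(s, a)]

    def r_hat(self, s, a):
        """Estimated R̂^a_s = mean observed reward at (s, a)."""
        return self.reward_sums[(s, a)] / self.counts[(s, a)]

    def sample(self, s, a):
        """Alternative view: pick a random stored tuple matching <s, a, ., .>."""
        return random.choice(self.tuples[(s, a)])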
AB Example

Planning with a Model
Given a model M_η = ⟨P_η, R_η⟩
Solve the MDP ⟨S, A, P_η, R_η⟩
Using favourite planning algorithm
Value iteration
Policy iteration
Tree search
...
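A hedged sketch of one such planner: value iteration run against the learned table model, assuming the hypothetical p_hat/r_hat interface above and that every state-action pair has been visited at least once:

def value_iteration(model, states, actions, gamma=0.9, n_sweeps=100):
    """Plan a value function from a learned table model (sketch)."""
    V = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        for s in states:
            # Bellman optimality backup through the *estimated* MDP
            V[s] = max(
                model.r_hat(s, a)
                + gamma * sum(model.p_hat(s, a, s2) * V[s2] for s2 in states)
                for a in actions
            )
    return V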
Sample-Based Planning
Use the model only to generate samples. Sample experience from the model:
S_{t+1} ~ P_η(S_{t+1} | S_t, A_t)
R_{t+1} = R_η(R_{t+1} | S_t, A_t)
Apply model-free RL to the samples, e.g. Monte-Carlo control, Sarsa, Q-learning
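A minimal sketch of sample-based planning under the same assumed model interface: one-step Q-learning applied to transitions sampled from the model rather than from the real environment:

import random
from collections import defaultdict

def sample_based_planning(model, states, actions,
                          alpha=0.1, gamma=0.9, n_samples=10_000):
    """Q-learning on experience sampled from the model, not the environment."""
    Q = defaultdict(float)
    for _ in range(n_samples):
        s, a = random.choice(states), random.choice(actions)
        r, s_next = model.sample(s, a)        # simulated experience from the model
        td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q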
Integrated Architectures
Dyna
Real and Simulated Experience
We consider two sources of experience.
Real experience, sampled from the environment (the true MDP):
S' ~ P^a_{s,s'}
R = R^a_s
Simulated experience, sampled from the model (the approximate MDP):
S' ~ P_η(S' | S, A)
R = R_η(R | S, A)
Model-Free RL
No model
Learn value function (and/or policy) from real experience
Model-Based RL (using Sample-Based Planning)
Learn a model from real experience
Plan value function (and/or policy) from simulated experience
Dyna
Learn a model from real experience
Learn and plan value function (and/or policy) from real and simulated experience
Dyna Architecture
[Figure: the Dyna architecture — real experience updates the value function/policy directly and also updates the model; planning updates the value function/policy from the model]
Dyna-Q Algorithm
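The algorithm box did not survive extraction; below is a hedged Python sketch of tabular Dyna-Q in the spirit of Sutton and Barto: each real step performs (a) direct RL, (b) model learning, and (c) n planning backups on remembered state-action pairs. The env.reset()/env.step() interface and a deterministic environment are assumptions:

import random
from collections import defaultdict

def dyna_q(env, actions, n_planning=5, alpha=0.1, gamma=0.95,
           epsilon=0.1, n_episodes=100):
    """Tabular Dyna-Q sketch. `env` is assumed to expose reset() -> s
    and step(a) -> (s', r, done), in the style of common RL toolkits."""
    Q = defaultdict(float)
    model = {}                                # deterministic model: (s, a) -> (r, s')

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)
            # (a) Direct RL: Q-learning on real experience
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, a2)] for a2 in actions)
                                  - Q[(s, a)])
            # (b) Model learning: remember the observed transition
            model[(s, a)] = (r, s_next)
            # (c) Planning: n Q-learning backups on simulated experience
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, a2)] for a2 in actions)
                                        - Q[(ps, pa)])
            s = s_next
    return Q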
Simulation-Based Search
Forward Search
Forward search algorithms select the best action by lookahead
They build a search tree with the current state s_t at the root
Using a model of the MDP to look ahead
Evaluate each candidate action by simulation, then select the current (real) action with maximum value:
a_t = argmax_{a∈A} Q(s_t, a)
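A hedged sketch of this simple Monte-Carlo search: simulate K rollouts per action from s_t under a fixed simulation policy, evaluate each action by mean return, and act greedily. The model.sample interface, the fixed rollout depth, and the absence of terminal-state handling are simplifying assumptions:

import random

def mc_search(model, s_t, actions, sim_policy, K=100, gamma=1.0, max_depth=50):
    """Simple Monte-Carlo search sketch: evaluate each action by the
    mean return of K simulated rollouts, then act greedily."""
    def rollout(s, a):
        ret, discount = 0.0, 1.0
        for _ in range(max_depth):            # fixed depth; no terminal handling
            r, s = model.sample(s, a)         # one simulated step from the model
            ret += discount * r
            discount *= gamma
            a = sim_policy(s)                 # fixed simulation policy thereafter
        return ret

    Q = {a: sum(rollout(s_t, a) for _ in range(K)) / K for a in actions}
    return max(Q, key=Q.get)                  # a_t = argmax_a Q(s_t, a)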
Monte-Carlo Search
Rules of Go
Usually played on a 19x19 board; 13x13 and 9x9 are also common
Simple rules, complex strategy
Black and white place stones alternately
Surrounded stones are captured and removed
The player with more territory wins the game
MCTS in Go
Position Evaluation in Go
How good is a position s? Evaluate by the expected game outcome from s (win = 1, loss = 0)
Monte-Carlo Evaluation in Go
[Figure: from the current position, simulate games to completion; four example rollouts give outcomes 1, 1, 0, 0, so the Monte-Carlo value estimate is 2/4]
[Figure: ranks of computer Go programs, Jul 2006 to Jan 2011 — Monte-Carlo programs such as Zen and Erica climb from kyu ranks to 3-4 dan, while traditional programs such as GnuGo and Aya (marked *) remain around 5-10 kyu]
Temporal-Difference Search
Simulation-based search
Using TD instead of MC (bootstrapping)
MC tree search applies MC control to sub-MDP from now
TD search applies Sarsa to sub-MDP from now
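A hedged sketch of TD search under the same assumed interfaces: simulate episodes from the current real state s_t and update Q(s, a) with a Sarsa backup at every simulated step, bootstrapping instead of waiting for the episode outcome:

import random
from collections import defaultdict

def td_search(model, s_t, actions, n_sims=1000, alpha=0.1,
              gamma=1.0, epsilon=0.1, max_depth=50):
    """TD search sketch: apply Sarsa to the sub-MDP rooted at s_t,
    using simulated experience from the model."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_sims):
        s, a = s_t, eps_greedy(s_t)
        for _ in range(max_depth):            # fixed depth; no terminal handling
            r, s_next = model.sample(s, a)    # simulate one step from the model
            a_next = eps_greedy(s_next)
            # Sarsa backup: bootstrap from Q(s', a') instead of the full return
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return max(actions, key=lambda a: Q[(s_t, a)])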
MC vs. TD search
TD Search
Dyna-2
In Dyna-2, the agent stores two sets of feature weights: a long-term memory, updated from real experience, and a short-term (working) memory, updated from simulated experience by TD search
Results of TD search in Go
Questions?