Reinforcement Learning: AI = RL
RL in a nutshell:
- At each step t, the agent:
  - Receives state s_t
  - Receives scalar reward r_t
  - Executes action a_t
- The environment:
  - Receives action a_t
  - Emits state s_t
  - Emits scalar reward r_t
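
As a concrete illustration, here is a minimal sketch of this interaction loop in Python; the coin-flip environment and random agent are invented for illustration and are not part of the slides.

```python
import random

# Toy environment and agent (assumptions for illustration): at each step t
# the agent executes a_t, then receives the next state s_t and scalar r_t.
class CoinFlipEnv:
    def reset(self):
        self.t = 0
        return self.t                        # initial state

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == random.randint(0, 1) else 0.0
        return self.t, reward, self.t >= 10  # s_t, r_t, episode done?

env = CoinFlipEnv()
state, done, total = env.reset(), False, 0.0
while not done:
    action = random.randint(0, 1)            # agent executes action a_t
    state, reward, done = env.step(action)   # receives s_t and reward r_t
    total += reward
print("return:", total)
```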
Approaches to RL
- Policy-based RL: search directly for the optimal policy, the policy achieving maximum future reward.
- Value-based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy.
- Model-based RL: build a model of the environment and plan (e.g. by lookahead) using that model.
Bellman Equation
- The optimal value function obeys the Bellman equation:

  $Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$
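
To make the recursion concrete, here is a toy sketch of repeated Bellman backups on a hypothetical two-state, two-action deterministic MDP; all names and numbers are illustrative.

```python
import numpy as np

# Toy deterministic MDP (an assumption for illustration): two states, two
# actions; model[(s, a)] = (next_state, reward).
model = {(0, 0): (0, 0.0), (0, 1): (1, 1.0),
         (1, 0): (0, 0.0), (1, 1): (1, 1.0)}
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(100):                    # repeated backups converge to Q*
    for (s, a), (s_next, r) in model.items():
        # Bellman backup: Q*(s,a) = r + gamma * max_a' Q*(s', a')
        Q[s, a] = r + gamma * Q[s_next].max()

print(Q)                                # rows: states, columns: actions
```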
Deep Q-Learning
- Represent the value function by a deep Q-network with weights w: Q(s, a, w) ≈ Q*(s, a).
- Define the objective as the mean-squared error in Q-values:

  $L(w) = \mathbb{E}\left[ \Big( \underbrace{r + \gamma \max_{a'} Q(s', a', w)}_{\text{target}} - Q(s, a, w) \Big)^2 \right]$

- Optimise the objective end-to-end by SGD, using the gradient $\partial L(w) / \partial w$.
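
Here is a minimal sketch of one such SGD step, using a linear Q-function in place of a deep network so the gradient can be written by hand; the feature and parameter names are assumptions for illustration.

```python
import numpy as np

# Linear stand-in for the deep Q-network (an assumption so the gradient is
# explicit): Q(s, a, w) = w[a] @ phi(s), one weight row per action.
rng = np.random.default_rng(0)
n_features, n_actions, gamma, lr = 4, 2, 0.99, 0.01
w = np.zeros((n_actions, n_features))

def q_values(phi):
    return w @ phi                      # one Q-value per action

def q_learning_step(phi, a, r, phi_next):
    target = r + gamma * q_values(phi_next).max()   # bootstrapped target
    td_error = target - q_values(phi)[a]
    w[a] += lr * td_error * phi         # SGD on the squared error; dQ/dw[a] = phi
    return td_error

phi, phi_next = rng.normal(size=4), rng.normal(size=4)
print(q_learning_step(phi, a=0, r=1.0, phi_next=phi_next))
```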
Deep Q-Networks
DQN stabilises deep value-based RL in three ways:
- Experience replay: sample transitions from a replay memory D to break correlations in the data.
- Frozen target network: avoid oscillations by breaking correlations between the Q-network and its target.
- Clipped/normalised rewards: keep gradients robust.

  $L(w) = \mathbb{E}_{s,a,r,s' \sim D}\left[ \Big( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \Big)^2 \right]$

where $w^-$ denotes the frozen target-network weights.
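
A sketch of the first two stabilisers follows; the class and the target-sync comment are illustrative assumptions, not the DeepMind implementation.

```python
import random
from collections import deque

# Illustrative replay memory: storing and uniformly re-sampling transitions
# (s, a, r, s') breaks the temporal correlations in the training data.
class ReplayMemory:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

mem = ReplayMemory()
for t in range(100):
    mem.store(t, t % 4, 1.0, t + 1)     # toy transitions
batch = mem.sample(32)                  # decorrelated minibatch

# Frozen target network: every C steps copy the online weights into the
# target weights, w_minus <- w, so that the target
# r + gamma * max_a' Q(s', a', w_minus) stays fixed between syncs and the
# Q-network cannot chase its own updates.
```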
DQN in Atari
- End-to-end learning of Q(s, a) from raw pixels, with the same network architecture and hyperparameters across all games
DQN Demo
DQN component ablation (game scores):

Game            Q-learning   Q-learning   Q-learning   Q-learning
                             + Target Q   + Replay     + Replay
                                                       + Target Q
Breakout              3           10          241          317
Enduro               29          142          831         1006
River Raid         1453         2868         4103         7447
Seaquest            276         1003          823         2894
Space Invaders      302          373          826         1089
Normalized DQN
- Adaptively normalises the value range so the network can learn from the true (unclipped) reward signal
Gorila Architecture
[Architecture diagram: actors select actions by argmax_a Q(s, a; θ) in the environment and store (s, a, r, s') transitions in a replay memory; learners sample transitions, compute the DQN loss against a target Q network, and send gradients with respect to the loss to a sharded parameter server (shards K-1, K, K+1); updated parameters are synced back to the Q networks. In bundled mode an actor and a learner share the same machine. Updates are applied with AdaGrad optimisation.]
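
Below is a toy sketch of the parameter-server pattern in the diagram above, with the AdaGrad update written out; class and variable names are assumptions, not the actual Gorila code.

```python
import numpy as np

# Toy parameter-server shard (names and sizes are assumptions): learners send
# gradients, each shard applies an AdaGrad update to its slice of the weights,
# and the updated slices are synced back to the actors' Q networks.
class ParamShard:
    def __init__(self, size, lr=0.01, eps=1e-8):
        self.w = np.zeros(size)
        self.g2 = np.zeros(size)            # running sum of squared gradients
        self.lr, self.eps = lr, eps

    def apply_gradient(self, grad):         # AdaGrad update rule
        self.g2 += grad ** 2
        self.w -= self.lr * grad / (np.sqrt(self.g2) + self.eps)

shards = [ParamShard(100) for _ in range(4)]      # sharded parameter vector
for shard in shards:
    shard.apply_gradient(np.random.randn(100))    # "gradient wrt loss" arrows

w_synced = np.concatenate([s.w for s in shards])  # "sync" back to Q networks
```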
Gorila Results
[Plot: number of games in which Gorila beats the highest DQN score ("GAMES BEATING HIGHEST", y-axis 0-40) against training time in days.]
Deterministic Actor-Critic
- Critic: update the value weights w by Q-learning against the policy $\pi(s, u)$:

  $\frac{\partial L(w)}{\partial w} = \mathbb{E}_{s,a,r,s' \sim D}\left[ \Big( r + \gamma\, Q(s', \pi(s', u), w) - Q(s, a, w) \Big) \frac{\partial Q(s, a, w)}{\partial w} \right]$

- Actor: update the policy weights u in the direction that increases Q:

  $\frac{\partial J(u)}{\partial u} = \mathbb{E}_{s,a,r,s' \sim D}\left[ \frac{\partial Q(s, a, w)}{\partial a} \frac{\partial \pi(s, u)}{\partial u} \right]$
[Lillicrap et al.]
DDAC Demo
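
Here is a sketch of both updates with deliberately simple linear parameterisations, so each partial derivative in the equations above is a one-liner; the toy forms of Q and π are assumptions.

```python
import numpy as np

# Toy parameterisations (assumptions): linear critic Q(s,a) = w_s@s + w_a*a
# and linear actor pi(s) = u@s, so every partial derivative is explicit.
rng = np.random.default_rng(0)
dim, gamma, lr = 3, 0.99, 0.01
w_s, w_a = np.zeros(dim), 0.0           # critic weights w
u = rng.normal(size=dim)                # actor weights u

def Q(s, a):
    return w_s @ s + w_a * a

def pi(s):
    return u @ s

s, a, r, s2 = rng.normal(size=dim), 0.5, 1.0, rng.normal(size=dim)

# Critic: move Q(s,a,w) towards the target r + gamma * Q(s', pi(s'), w).
td = r + gamma * Q(s2, pi(s2)) - Q(s, a)
w_s += lr * td * s                      # td * dQ/dw_s
w_a += lr * td * a                      # td * dQ/dw_a

# Actor: follow dQ/da * dpi/du, evaluated at a = pi(s).
u += lr * w_a * s                       # dQ/da = w_a, dpi/du = s
```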
Model-Based RL
- Learn a transition model of the environment: $p(r, s' \mid s, a)$
- Plan using the transition model, e.g. lookahead search over action sequences:

[Diagram: lookahead tree branching over left/right actions at each step.]
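
A sketch of planning by exhaustive lookahead with a learned model follows; the hand-written `model` dictionary is an assumed stand-in for a learned p(r, s' | s, a).

```python
# model[(state, action)] = (reward, next_state); an assumed learned model.
model = {('s0', 'left'):  (0.0, 's1'), ('s0', 'right'): (1.0, 's2'),
         ('s1', 'left'):  (5.0, 's3'), ('s1', 'right'): (0.0, 's4'),
         ('s2', 'left'):  (0.0, 's5'), ('s2', 'right'): (2.0, 's6')}

def plan(state, depth):
    """Return the best (value, first action) by exhaustive lookahead."""
    if depth == 0 or (state, 'left') not in model:
        return 0.0, None
    best = max(
        (model[(state, a)][0] + plan(model[(state, a)][1], depth - 1)[0], a)
        for a in ('left', 'right'))
    return best

print(plan('s0', depth=2))          # -> (5.0, 'left'): left then left
```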
Deep Models
(Gregor et al.)
DARN Demo
Challenges of Model-Based RL
- Compounding errors: errors in the transition model compound over a trajectory, so by the end of a long rollout the predicted rewards can be completely wrong.
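
As a back-of-the-envelope illustration: if each model step carries a 5% relative error (an assumed figure), the accumulated error grows multiplicatively with rollout depth.

```python
# Assumed per-step relative model error of 5%; the accumulated error over a
# rollout grows multiplicatively with depth, quickly swamping the prediction.
per_step_error = 0.05
for depth in (1, 10, 50):
    compounded = (1 + per_step_error) ** depth - 1
    print(f"depth {depth:3d}: ~{compounded:.0%} accumulated error")
# depth 1: ~5%,  depth 10: ~63%,  depth 50: ~1047%
```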
Deep Learning in Go
Monte-Carlo search
- Monte-Carlo search (MCTS) simulates future trajectories
- Builds a large lookahead search tree with millions of positions
- State-of-the-art 19×19 Go programs use MCTS
- e.g. the first strong Go program, MoGo
(Gelly et al.)
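
Here is a self-contained sketch of the core Monte-Carlo idea behind MCTS, using a toy race-to-21 game as a stand-in for Go; the game and all names are assumptions.

```python
import random

# Toy stand-in for Go (an assumption): players alternately add 1-3 to a
# running total; whoever reaches 21 first wins. Monte-Carlo evaluation
# simulates random playouts after each candidate move and averages wins.
def playout(total, to_move):
    while total < 21:
        total += random.randint(1, 3)   # random play to the end of the game
        to_move = 1 - to_move
    return 1 - to_move                  # the player who just moved wins

def mc_value(move, me, n=2000):
    wins = sum(playout(move, 1 - me) == me for _ in range(n))
    return wins / n                     # estimated win rate after `move`

best = max((1, 2, 3), key=lambda m: mc_value(m, me=0))
print("Monte-Carlo move choice:", best)
```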
Convolutional Networks
- 12-layer convnet trained to predict expert moves
- Raw convnet (looking at 1 position, no search at all)
- Equals the performance of MoGo with a 10^5-position search tree
(Maddison et al.)
Move-prediction accuracy:

Program                   Accuracy
Human 6-dan               52%
12-Layer ConvNet          55%
8-Layer ConvNet*          44%
Prior state-of-the-art    31-39%

*Clarke & Storkey

Winning rate of the convnet against search-based programs:

Program         Winning rate
GnuGo           97%
MoGo (100k)     46%
Pachi (10k)     47%
Pachi (100k)    11%
Conclusion
Questions?
"The only stupid question is the one you never ask." (Rich Sutton)