
LAHOUAOUI and Pr Cherki DAOUI
Department of Computer Science
Sultan Moulay Slimane University
Faculty of Sciences and Techniques, Béni-Mellal
Edu.lah.moh@gmail.com
daouic@yahoo.com

Abstract - We present in this paper the notion of reinforcement learning, together with different learning methods and algorithms (dynamic programming, the Monte Carlo method and temporal difference methods). We focus on learning methods based on temporal differences, such as Q-learning and some of its variants. We describe the robot that was built to experimentally test the validity of some reinforcement learning algorithms in the case of two-legged robotic walking. We conclude with the results of a comparison between two reinforcement learning algorithms.

Keywords - Learning, reinforcement, Q-learning, Sarsa, robotics, dynamic programming.

I. INTRODUCTION

Developed since the 1980s, reinforcement learning is an automatic control method that does not require knowledge of the system model. The advantage of this approach is the ability to build a controller capable of learning to control an unknown system without having to specify how the task should be performed. The controller learns by trial and error, that is, from experiments.

Reinforcement learning occurs daily in our everyday lives, whether we walk, learn a new programming language, or practice a sport.

Reinforcement learning techniques are particularly useful in the field of artificial intelligence and in particular in mobile robotics. Indeed, they allow an agent to acquire a desired behavior by exploiting a simple reinforcement signal that penalizes or rewards the actions of the agent in its environment. Through this trial-and-error process, the agent gradually improves its behavior in order to maximize its gains.

In the late 1950s, an approach developed by Richard BELLMAN [1] used the concept of the state of the dynamic system and of the value function to define a functional equation to be solved, called the Bellman equation. Reinforcement learning is the result of the encounter between experimental psychology and computational neuroscience.

II. REINFORCEMENT LEARNING MODEL

Reinforcement learning defines a type of interaction between the agent and the environment. From a real situation "s" in the environment, the agent chooses and executes an action "a", which causes a transition to the state "s'". In return it receives a reinforcement signal "r": negative (a penalty) if the action leads to a failure, or positive (a reward) if the action is beneficial. The agent then uses this signal to improve its strategy, i.e. the sequence of its actions, in order to maximize the accumulation of its future rewards. The interaction between the agent and the environment is represented by the diagram of FIG. 1.

Figure 1: Reinforcement learning: agent / environment interaction diagram
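The interaction loop of FIG. 1 can be made concrete with the following minimal Python sketch; the Environment and Agent classes and their transition and reward rules are generic placeholders assumed only for illustration, not the robot controller described later in the paper.

    # Minimal sketch of the agent / environment interaction of FIG. 1 (illustrative only).
    class Environment:
        def reset(self):
            return "s0"                                   # initial situation s

        def step(self, state, action):
            # return the next state s' and the reinforcement signal r
            s_next = "s1" if action == "a1" else "s0"
            r = 1.0 if s_next == "s1" else -1.0           # reward if beneficial, penalty otherwise
            return s_next, r

    class Agent:
        def choose_action(self, state):
            return "a1"                                   # strategy to be improved from the rewards

        def learn(self, state, action, reward, next_state):
            pass                                          # use the signal r to improve the strategy

    env, agent = Environment(), Agent()
    s = env.reset()
    for _ in range(10):
        a = agent.choose_action(s)                        # the agent chooses and executes an action a
        s_next, r = env.step(s, a)                        # transition to s' with reinforcement r
        agent.learn(s, a, r, s_next)                      # the signal is used to improve the strategy
        s = s_next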
III. MARKOV DECISION PROCESSES

Most reinforcement learning algorithms fit within the framework of Markov decision processes.

A Markov decision process (MDP) is defined as a quadruple (S, A, T, R) where:

- S is a set of states: the states characterize the situation of the agent and the environment at every moment.

- A is a set of actions: the agent chooses one of the possible actions at each instant t. Each state of the state space is associated with a set of possible actions of the action space; this relation is represented by FIG. 2.

Figure 2: Relationship between states and actions

- T is the transition probability: T is a transition function defining a probability distribution over future states.

- R is the reinforcement signal: R is a reward function that associates an immediate reward with each transition.

- Policy and value function: a policy represents the choice of an action to be performed in a given state. In order to determine what a good policy is, one needs a criterion that measures the quality of a policy. The most common criterion is what is called the value function.

The Bellman optimality equation: the value function of the optimal policy is denoted V*. By definition, we have:

V*(s) = max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V*(s') ]

where γ ∈ [0, 1) is the discount factor.

IV. DYNAMIC PROGRAMMING METHOD (DP)

We are now interested in algorithms that allow an agent in an MDP to learn an optimal policy. In the case of dynamic programming, the transition probabilities T and the reward function R are assumed to be known.

- The value iteration algorithm

The value iteration algorithm is a dynamic programming algorithm described by Bellman [2] which consists in an iterative improvement of the value of each state of the MDP.

Algorithm 1: Value iteration
    initialise V0; n ← 0
    repeat
        for every s ∈ S do
            Vn+1(s) ← max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) Vn(s') ]
        end for
        n ← n + 1
    until ||Vn+1 − Vn|| < ε
    for every s ∈ S do
        π(s) ← argmax_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) Vn(s') ]
    end for
    return Vn, π
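As a concrete illustration of Algorithm 1, the following Python sketch runs value iteration on a small two-state MDP stored as dictionaries; the transition table, rewards, discount factor and stopping threshold are illustrative assumptions, not values taken from the experiments described later.

    # Minimal value iteration sketch for a small illustrative MDP.
    # T[s][a] is a list of (next_state, probability), R[s][a] an immediate reward.
    GAMMA, EPSILON = 0.9, 1e-6    # assumed discount factor and stopping threshold

    T = {
        "s0": {"a0": [("s0", 0.5), ("s1", 0.5)], "a1": [("s1", 1.0)]},
        "s1": {"a0": [("s0", 1.0)],              "a1": [("s1", 1.0)]},
    }
    R = {
        "s0": {"a0": 0.0, "a1": 1.0},
        "s1": {"a0": 0.0, "a1": 2.0},
    }

    def q_value(V, s, a):
        """One-step lookahead: R(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
        return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])

    def value_iteration():
        V = {s: 0.0 for s in T}                                  # V0
        while True:
            V_new = {s: max(q_value(V, s, a) for a in T[s]) for s in T}
            if max(abs(V_new[s] - V[s]) for s in T) < EPSILON:   # ||Vn+1 - Vn|| < epsilon
                break
            V = V_new
        policy = {s: max(T[s], key=lambda a: q_value(V_new, s, a)) for s in T}
        return V_new, policy

    if __name__ == "__main__":
        V, pi = value_iteration()
        print(V, pi)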

V. MONTE CARLO METHOD

We have seen in the previous section that there exist methods for solving the Bellman optimality equation which require the value function associated with the policy to be computed at each iteration. Since these dynamic programming algorithms require knowledge of the transition probabilities and the reward functions, methods have been developed to estimate as well as possible the value function V of a fixed policy π from transitions simulated with this policy only.

The so-called Monte Carlo approach is conventionally preferred; it amounts to simulating a large number of trajectories originating from each state s of S and estimating V(s) by averaging the returns observed on each of these trajectories. For each experiment performed, the agent memorizes the transitions it has made and the rewards it has received. It then updates an estimate of the value of the states visited, associating to each of them the share of the reward that belongs to it. In the course of these experiments, the estimated value associated with each state converges to the exact value of the state for the policy being followed. The main contribution of the Monte Carlo methods lies in the technique that makes it possible to estimate the value of a state from several successive cumulative reward values associated with this state during separate trajectories.

We are only interested here in episodic tasks. A trajectory is the sequence of states visited from an initial state to a terminal state. Given a trajectory and the returns associated with each transition, one can observe the return that follows each passage through a state of the trajectory. If we have a large number of passages through a state and a large number of trajectories, the average of the returns observed for that state tends towards its true average.
Algorithm 2: Monte Carlo
    Require: a policy π
    for every state s do
        for j = 1 … N do
            Generate a trajectory following the policy π from the state s0 = s.
            We denote by {st, at, rt} the sequence of (state, action, immediate return) triplets of this trajectory, and we assume the trajectory has bounded size T.
            v(s0, j) ← Σt rt
        end for
        V(s) ← (1/N) Σj v(s, j)
    end for

The Monte Carlo methods thus make it possible to estimate the value function of a policy by updating some of its components at the end of each observed trajectory. These methods require a large number of episodes.
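The following sketch implements the Monte Carlo evaluation of Algorithm 2 in Python on a toy episodic task; the corridor environment, the slip probability, the number of episodes and the discount factor are assumptions made only for illustration.

    import random

    # First-visit Monte Carlo evaluation of a fixed policy, in the spirit of Algorithm 2.
    # The corridor environment (states 0..4, terminal state 4, reward 1 on reaching it,
    # 10% chance of slipping backwards) is a toy assumption used only for illustration.
    TERMINAL = 4

    def step(s, a):
        """Toy dynamics: move by a (+1/-1), but slip in the opposite direction with probability 0.1."""
        if random.random() < 0.1:
            a = -a
        s_next = max(0, min(TERMINAL, s + a))
        return s_next, (1.0 if s_next == TERMINAL else 0.0)

    def generate_trajectory(policy, s0, max_len=100):
        """Follow the policy from s0 and record the rewards, as in Algorithm 2."""
        rewards, s = [], s0
        for _ in range(max_len):
            s, r = step(s, policy(s))
            rewards.append(r)
            if s == TERMINAL:
                break
        return rewards

    def mc_evaluate(policy, states, n_episodes=2000, gamma=0.9):
        V = {}
        for s0 in states:
            returns = []
            for _ in range(n_episodes):
                g = 0.0
                for r in reversed(generate_trajectory(policy, s0)):
                    g = r + gamma * g                 # discounted return observed from s0
                returns.append(g)
            V[s0] = sum(returns) / len(returns)       # V(s0) = average of the observed returns
        return V

    print(mc_evaluate(lambda s: +1, states=[0, 1, 2, 3]))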
VI. TEMPORAL DIFFERENCE METHODS (TD)

TD (Temporal Difference learning) methods are combinations of the ideas of the MC and DP methods and were developed by R. Sutton [3]. Monte Carlo methods estimate value functions without relying on a model, but on an experimental phase consisting of a large number of episodes.

Similarly, TD methods can learn from experience without the need for a model of the environment, but without waiting for the end of each episode. Like the DP method, the TD method computes new estimates from previous estimates during the episode.

- The TD algorithm

Algorithm 3: TD(0)
    Require: a policy π
    V ← 0
    for each episode do
        initialise the initial state s0
        t ← 0
        repeat
            send the action at = π(st)
            observe rt and st+1
            V(st) ← V(st) + α [ rt + γ V(st+1) − V(st) ]
            t ← t + 1
        until st ∈ F (the set of terminal states)
    end for
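A possible Python rendering of Algorithm 3 is sketched below; the environment interface (a step function returning the next state, the reward and a termination flag), the toy corridor used in the demo and the values of α and γ are assumptions, not specifications from the paper.

    import random
    from collections import defaultdict

    # TD(0) evaluation of a fixed policy, following Algorithm 3.
    def td0_evaluate(policy, step, start_states, n_episodes=500, alpha=0.1, gamma=0.9):
        V = defaultdict(float)                       # V <- 0
        for _ in range(n_episodes):                  # for each episode
            s = random.choice(start_states)          # initialise the initial state s0
            done = False
            while not done:                          # repeat ... until st is terminal
                a = policy(s)                        # send the action at = pi(st)
                s_next, r, done = step(s, a)         # observe rt and st+1
                # TD(0) update: V(st) <- V(st) + alpha [rt + gamma V(st+1) - V(st)]
                V[s] += alpha * (r + gamma * V[s_next] - V[s])
                s = s_next
        return dict(V)

    def demo_step(s, a):
        """Toy corridor 0..4: terminal at 4, reward 1 when it is reached (assumption)."""
        s_next = max(0, min(4, s + a))
        return s_next, (1.0 if s_next == 4 else 0.0), s_next == 4

    print(td0_evaluate(lambda s: +1, demo_step, start_states=[0, 1, 2, 3]))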
- Q-Learning Algorithm

Q-learning was proposed by Watkins in 1989 and taken up by S. Sehad [4] as a reinforcement learning method for an MDP whose evolution model is unknown. It is an off-policy method.

The TD(0) algorithm evaluates a policy. We must now seek to improve it and reach an optimal policy. One could apply the policy improvement technique already encountered; however, we adopt here another strategy that provides a more efficient algorithm, the Q-Learning algorithm.

Algorithm 4: Q-Learning
    Q(s, a) ← 0, ∀(s, a) ∈ (S, A)
    for each episode do
        initialise the initial state s0
        t ← 0
        repeat
            choose the action at to send
            observe rt and st+1
            Q(st, at) ← Q(st, at) + α [ rt + γ maxa Q(st+1, a) − Q(st, at) ]
            t ← t + 1
        until st ∈ F
    end for
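Algorithm 4 can be sketched in Python as follows; the ε-greedy exploration and the values of α, γ and ε are illustrative assumptions (the paper does not fix the exploration strategy), and the step argument is the same kind of environment function as in the previous sketch.

    import random
    from collections import defaultdict

    # Tabular Q-learning sketch following Algorithm 4.
    def q_learning(actions, step, start_state, n_episodes=1000,
                   alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                                    # Q(s, a) <- 0 for all (s, a)
        for _ in range(n_episodes):
            s, done = start_state, False                          # initialise the initial state s0
            while not done:
                # choose the action to send (epsilon-greedy over the current estimates)
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda b: Q[(s, b)])
                s_next, r, done = step(s, a)                      # observe rt and st+1
                best_next = max(Q[(s_next, b)] for b in actions)  # max_a Q(st+1, a)
                # Q-learning update
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q

For example, it can be driven by the toy corridor of the TD(0) sketch above with q_learning(actions=[-1, +1], step=demo_step, start_state=0).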


VII. SARSA ALGORITHM

The Q-learning algorithm is called off-policy because the value of the next state is estimated without taking into account the action actually performed. The SARSA algorithm is called on-policy because it uses the actions chosen by the decision method to estimate the value of the next state s'.

Algorithm 5: SARSA
    Q(s, a) ← 0, ∀(s, a) ∈ (S, A)
    for each episode do
        initialise the initial state s0
        t ← 0
        choose the action at to send
        repeat
            send at
            observe rt and st+1
            choose the action at+1 to send
            Q(st, at) ← Q(st, at) + α [ rt + γ Q(st+1, at+1) − Q(st, at) ]
            t ← t + 1
        until st ∈ F
    end for
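A matching sketch of Algorithm 5 is given below; as before, the ε-greedy choice and the constants are assumptions made for the example, and the function shares the same environment interface as the Q-learning sketch, so the two can be compared directly.

    import random
    from collections import defaultdict

    # Tabular SARSA sketch following Algorithm 5.
    def sarsa(actions, step, start_state, n_episodes=1000,
              alpha=0.1, gamma=0.9, epsilon=0.1):
        def choose(Q, s):
            """Epsilon-greedy action selection over the current estimates."""
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda b: Q[(s, b)])

        Q = defaultdict(float)                                # Q(s, a) <- 0 for all (s, a)
        for _ in range(n_episodes):
            s, done = start_state, False                      # initialise the initial state s0
            a = choose(Q, s)                                  # choose the action a0 to send
            while not done:
                s_next, r, done = step(s, a)                  # send at, observe rt and st+1
                a_next = choose(Q, s_next)                    # choose the action at+1 to send
                # SARSA update uses the action actually chosen in st+1
                Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
                s, a = s_next, a_next
        return Q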
VIII. CRAWLER ROBOT

The robot has two arms, each controlled by a servomotor, which allows it to move forwards or backwards by changing the direction of rotation of the servomotors; the robot also carries an Arduino board and an ultrasonic sensor.

Figure 3: A model of the crawler robot with its two joints gx and gy.

We built a robot with two arms, as described in Figure 3, that will learn to walk on its own just by interacting with its environment.

In the case of discrete positions and small angles of movement of the joints, the state space of the robot arm can be approximated by a grid. To move forward, the robot must repeatedly perform a cycle of movements, as shown in Figure 4 or in the sequence shown in Table 1.

Figure 4: The 5 × 5 grid model (left) and a cyclic walking policy (right). The cycle states are labeled as …

Table 1: Four states of an optimal simple policy.

In order to learn an optimal policy π* which will allow the robot to walk forward, we use the Q-learning algorithm. This algorithm essentially works by assigning a value to each state-action pair (s, a); the state-action value Q(s, a) is an estimate of the expected cumulative reward.

The robot is made up of two arms that give it the ability to walk forward, and each arm is attached to a servo motor that allows the angle of the arm to be changed in order to move from one state to another.
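As an illustration of how this learning problem might be encoded, the sketch below represents the state as the pair of discretized joint angles (gx, gy) on a 5 × 5 grid, as described above, and applies the Q-learning update; the action set, the reward signal (forward displacement of the robot) and the hardware helpers set_angles and measure_position are hypothetical placeholders, not details specified in the paper.

    import random
    from itertools import product

    # Sketch of a possible encoding of the crawler's learning problem (assumptions only).
    N_STEPS = 5
    STATES = list(product(range(N_STEPS), range(N_STEPS)))      # (gx, gy) joint positions
    ACTIONS = [(+1, 0), (-1, 0), (0, +1), (0, -1)]              # move one joint by one step

    def set_angles(gx, gy):
        """Placeholder: drive the two servomotors through the Arduino board (hardware-specific)."""
        pass

    def measure_position():
        """Placeholder: forward-position reading, e.g. from the ultrasonic sensor."""
        return 0.0

    def apply_action(state, action):
        gx, gy = state
        dx, dy = action
        clamp = lambda v: max(0, min(N_STEPS - 1, v))
        return (clamp(gx + dx), clamp(gy + dy))

    alpha, gamma, epsilon = 0.2, 0.9, 0.1                       # illustrative values
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = (0, 0)
    for _ in range(5000):
        if random.random() < epsilon:                           # epsilon-greedy exploration
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state = apply_action(state, action)
        before = measure_position()
        set_angles(*next_state)                                 # physically move the arms
        reward = measure_position() - before                    # forward displacement as reward
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state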
IX. EXPERIMENTAL RESULTS

We present in this part some experimental results obtained from programs written in Python. We compare the best-known reinforcement learning algorithms, SARSA and Q-Learning, with respect to the number of iterations carried out before convergence, the behavior of the algorithms on grids of different sizes, and the average learning time.
[Bar chart: number of iterations before convergence for Q-Learning and SARSA on grids of size 7×7, 10×10 and 15×15.]

Figure 5: Difference in the number of iterations between SARSA and Q-learning in relation to the size of the environment.

FIG. 5 shows the difference in the number of iterations performed before the convergence of these two algorithms; it is clear that the Q-Learning algorithm brings a significant reduction in the number of iterations.

Figure 6: Comparison between Q-learning and SARSA according to reward and episode.

In this figure we compare the two algorithms with respect to the reward accumulated in each iteration; we see that Q-learning needs fewer iterations to maximize the reward.

We note that Q-Learning and SARSA are very similar. In SARSA, the update formula follows exactly the choice of the action (which is not always optimal). In Q-Learning, the update formula uses the optimal value of the possible actions after the next state, whereas in SARSA the action chosen after the next state may not be optimal. This small difference between SARSA and Q-Learning makes Q-Learning a little more effective than SARSA.

In SARSA, the agent starts in state 1, performs action 1 and obtains a reward (reward 1). It is now in state 2, performs another action (action 2) and obtains the reward of that state (reward 2) before it goes back and updates the value of action 1 performed in state 1.

On the other hand, in Q-Learning the agent starts in state 1, performs action 1 and gets a reward (reward 1), then looks at the maximum possible reward for an action in state 2 and uses only that to update the value of executing action 1 in state 1.

So the difference lies in how the future reward is taken into account. In Q-Learning it is simply the highest value of the actions that can be taken from state 2, whereas in SARSA it is the value of the action actually taken.

This means that SARSA takes into account the control policy by which the agent is moving and integrates it into its update of the action values, whereas Q-Learning simply assumes that the optimal policy is being followed.
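This distinction can be summarised by the bootstrap target used in the two update rules; the following minimal sketch, with illustrative names, shows the only line that changes between the two algorithms.

    # Schematic comparison of the two bootstrap targets (names are illustrative).
    def q_learning_target(Q, s_next, r, actions, gamma=0.9):
        """Bootstrap on the best action available in st+1 (assumes the greedy policy is followed)."""
        return r + gamma * max(Q[(s_next, b)] for b in actions)

    def sarsa_target(Q, s_next, a_next, r, gamma=0.9):
        """Bootstrap on the action at+1 actually chosen by the behaviour policy."""
        return r + gamma * Q[(s_next, a_next)]

    # Both algorithms then apply the same correction:
    #   Q[(s, a)] += alpha * (target - Q[(s, a)])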

X. CONCLUSION

We have now seen how reinforcement learning works, with its strengths and its weaknesses, as well as the differences between this learning method and the others. There is, however, no single learning method that is better than all the others: the effectiveness of a learning method depends essentially on how it is used and on the type of problem to be handled.

XI. REFERENCES

[1] L. Matignon (2009). Synthèse d'agents adaptatifs et coopératifs par apprentissage par renforcement. Application à la commande d'un système distribué de micromanipulation.

[2] R. Bellman (1957). Dynamic Programming. Princeton University Press.

[3] R. S. Sutton and A. G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, Bradford Book, 322 p. ISBN 0-262-19398-1.

[4] S. Sehad (1996). Contribution à l'étude et au développement de modèles connexionnistes à apprentissage par renforcement : application à l'acquisition de comportements adaptatifs. Thèse génie informatique et traitement du signal. Montpellier : Université de Montpellier II, 112 p.
