Department of Informatics
Edu.lah.moh@gmail.com
daouic@yahoo.com
Sultan Moulay Slimane University
Faculty of Sciences and Techniques Bni-Mellal
Reinforcement learning techniques are particularly useful in the field of artificial intelligence and in particular in mobile robotics. Indeed, they allow an agent to acquire a desired behavior by exploiting a simple reinforcement signal that penalizes or rewards the agent's actions in its environment. Through this trial-and-error process, the agent gradually improves its behavior so as to maximize its gains.

In the late 1950s, an approach developed by Richard Bellman [1] used the concept of the state of a dynamic system and of the value function to define an equation.

Figure 1: Reinforcement learning: agent/environment interaction diagram

III. MARKOV DECISION PROCESSES

Most reinforcement learning algorithms fall within the framework of Markov decision processes. A Markov decision process (MDP) is defined as a quadruple "S, A, T, R" where:
- S is a set of states;
- A is a set of actions;
- T is a transition function;
- R is a reward function.

Algorithm 1: value iteration

The average of the returns observed for each state tends towards the true average.

Repeat
    execute action a_t
    observe r_t and s_t+1
    choose the next action a_t+1
    t ← t + 1
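As a concrete illustration of the value-iteration scheme named in Algorithm 1, the sketch below iterates the Bellman optimality backup to a fixed point on a small MDP. The two-state model, its transition probabilities, rewards, and hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Value iteration on a toy 2-state MDP (all names and numbers are
# illustrative). T[s][a] is a list of (next_state, probability) pairs,
# R[s][a] is an immediate reward.

GAMMA = 0.9   # discount factor
THETA = 1e-6  # convergence threshold

S = [0, 1]
A = [0, 1]
T = {
    0: {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)],           1: [(1, 0.6), (0, 0.4)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}

V = {s: 0.0 for s in S}
while True:
    delta = 0.0
    for s in S:
        # Bellman optimality backup:
        # V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
        best = max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])
                   for a in A)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:  # stop when no state value moved noticeably
        break

print(V)
```

The loop stops once the largest change over all states falls below the threshold, at which point V satisfies the Bellman optimality equation to within that tolerance.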
Figure 5: Difference in the number of iterations between SARSA and Q-Learning in relation to the size of the environment (x-axis: environment size, grids 7x7, 10x10, 15x15; y-axis: number of iterations)

Figure 5 represents the difference in the number of iterations performed before these two algorithms converge; it is clear that the Q-Learning algorithm brings a significant reduction in the number of iterations.

On the other hand, in Q-Learning the agent starts in state 1, performs action 1 and gets a reward (reward 1), then looks at the maximum possible reward for an action in state 2, and uses only that to update the value of action 1 in state 1.

So the difference lies in how the future reward is obtained. In Q-Learning it is simply the highest action value attainable from state 2, while in SARSA it is the value of the action that was actually taken. This means that SARSA takes into account the control policy the agent is following and integrates it into its update of action values, whereas Q-Learning simply assumes that an optimal policy is followed.

X. CONCLUSION

We have now been able to see how reinforcement learning works, its qualities as well as its defects, and the difference between this learning method and the others. But no single learning method is better than all the others: the effectiveness of learning, or of its method of application, depends essentially on its use and on the type of task to be handled.
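The contrast between the two update rules discussed above can be sketched in code. The chain environment, its size, and the hyperparameters below are illustrative assumptions (not the paper's grid worlds): SARSA bootstraps on the action actually chosen next, while Q-Learning bootstraps on the greedy maximum.

```python
import random

# Tabular SARSA vs Q-Learning on a toy chain task (environment and
# hyperparameters are illustrative, not the paper's 7x7/10x10/15x15 grids).

N_STATES, N_ACTIONS = 5, 2
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

def step(s, a):
    """Deterministic chain: action 1 moves right, action 0 stays.
    Reward 1.0 on reaching the terminal (last) state."""
    s2 = min(s + a, N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def eps_greedy(Q, s):
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[s][a])

def train(use_sarsa, episodes=200):
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        a = eps_greedy(Q, s)
        while s != N_STATES - 1:
            s2, r = step(s, a)
            a2 = eps_greedy(Q, s2)
            if use_sarsa:
                # SARSA (on-policy): bootstrap on the action a2
                # actually chosen in the next state.
                target = r + GAMMA * Q[s2][a2]
            else:
                # Q-Learning (off-policy): bootstrap on the best action
                # in the next state, assuming optimal play from there.
                target = r + GAMMA * max(Q[s2])
            Q[s][a] += ALPHA * (target - Q[s][a])
            s, a = s2, a2
    return Q

random.seed(0)
q_sarsa = train(use_sarsa=True)
q_qlearn = train(use_sarsa=False)
print(q_sarsa[0], q_qlearn[0])
```

The only line that differs between the two algorithms is the target: both follow the same epsilon-greedy behavior policy, which is precisely why SARSA's action values reflect that policy while Q-Learning's reflect the greedy one.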
XI. REFERENCES
[1] Laetitia Matignon (2009). Synthèse d'agent