I. INTRODUCTION
TWO-LEGGED WALKING ROBOTS have a strong attractive appeal due to their resemblance to human
beings. Consequently, some major research institutions and
private companies have started to develop bipedal (two-legged)
robots, which has led to sophisticated machines [14], [8].
To enable economically viable commercialization (e.g. for
entertainment), the challenge is now to reduce the design
complexity of these early successes, in search of the ideal set
of characteristics: stability, simplicity, and energy efficiency.
A promising idea for the simultaneous reduction of complexity and energy consumption, while maintaining or even
increasing the stability, is McGeer's concept of passive dynamic walking [9]. On a shallow slope, a system consisting of
two legs with well-chosen mass properties can already show
stable and sustained walking [6]. No actuators or controls are
necessary, as the swing leg moves at its natural frequency.
Using McGeer's concept as a starting point, we realized a
number of 2D and 3D mechanical prototypes of increasing
complexity [24], [22], [5]. These prototypes are all powered
by hip actuation, and their control is extremely
simple: a foot switch per leg triggers a change in the desired hip
angle, resulting in a swing of the opposite leg.
Although passive dynamics combined with this simple
controller already stabilizes the effect of small disturbances,
larger disturbances, such as an uneven floor, quickly lead to
failures [15]. Also, the simple controller does not guarantee
optimal efficiency or speed. Consequently, in this paper we
elaborate on the introduction of more complex controllers.
B. Mechanical prototype
The combination of passive dynamics and hip actuation has
resulted in multiple prototypes made at the Delft Biorobotics
Laboratory. The most recent 2D model is Meta (Fig. 1), which
is the subject of this study. This prototype is a 2D walker
consisting of 7 body parts (an upper body, two upper legs,
two lower legs and two feet). It has a total of 5 degrees
of freedom, located in the hip joint, the two knee joints and the two
ankle joints. The upper body is connected to the upper legs by
a bisecting hip mechanism, which passively keeps the upper
body at the intermediate angle of the two legs [22].
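In other words, writing phi_l and phi_r for the absolute angles of the two upper legs and phi_b for the absolute angle of the upper body (symbols ours, for illustration), the mechanism enforces the kinematic constraint

    phi_b = (phi_l + phi_r) / 2,

so the body orientation passively tracks the mean of the leg angles.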
The system is powered by a DC motor that is located at
the hip. This actuator is connected to the hip joint through
a compliant element, based on the concept of Series Elastic
Actuation, first introduced by the MIT Leg Lab [13]. By
measuring the elongation of this compliant element, the hip
joint can be force controlled. The compliance ensures
that the actuator's output impedance is low, which makes it
possible to replicate passive dynamic motions. It also ensures
that the actuator performs well in the presence of
impacts. This actuator construction allows us to apply a desired
torque pattern up to a maximum torque of around 10 Nm with
a bandwidth of around 20 Hz. These properties should allow
the reinforcement learning based controller to be implemented
in practice in the near future.
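As a rough illustration of how such a series elastic actuator can be force controlled, the sketch below estimates the transmitted torque from the measured spring elongation and closes a simple proportional force loop. The stiffness value, gain, function names and the velocity-setpoint output are our own assumptions, not the prototype's actual parameters or control structure.

    # Minimal sketch of series-elastic force control (all values assumed).

    SPRING_STIFFNESS = 300.0  # [Nm/rad] assumed stiffness of the compliant element
    KP_FORCE = 5.0            # assumed proportional gain of the force loop

    def transmitted_torque(spring_elongation):
        """Torque actually applied to the hip joint, from the spring elongation."""
        return SPRING_STIFFNESS * spring_elongation

    def force_control_step(desired_torque, spring_elongation):
        """One control step: drive the transmitted torque to the desired value."""
        torque_error = desired_torque - transmitted_torque(spring_elongation)
        return KP_FORCE * torque_error  # motor-side command (e.g. a velocity setpoint)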
The prototype is fully autonomous, running on lithium ion
polymer batteries. The control platform is a PC/104 stack with
a 400 MHz processor, and the controllers are implemented
through the Matlab Simulink xPC Target environment. The
angles of all 5 joints, as well as the elongation of the actuator's
compliant element, are measured in real time using incremental
encoders. In addition to these sensors, there are two switches underneath the feet to detect foot contact.
The knee and ankle joints are both fully passive, but the
knee joint can be locked to keep the knee extended whenever
the robot is standing on the corresponding leg.
The prototype can walk based on a fairly simple control
algorithm. The hip angle is PD controlled towards a constant
reference hip angle. When the foot switch of the current swing
leg makes contact (so that this leg becomes the new stance leg), the
reference angle is inverted, effectively pulling the new swing
leg forward. Simultaneously, the knee latch of the new swing
leg is released briefly. Then the system simply waits for the new
swing leg's foot switch to make contact, assuming that knee
extension takes place before heel contact.
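A minimal sketch of this algorithm is given below; the reference angle, PD gains and the latch-release callback are illustrative assumptions, not the prototype's tuned values.

    # Sketch of the foot-switch-triggered hip controller (values assumed).

    REF_HIP_ANGLE = 0.5   # [rad] assumed constant reference inter-leg hip angle
    KP, KD = 30.0, 1.0    # assumed PD gains

    class SimpleWalkController:
        def __init__(self):
            self.reference = REF_HIP_ANGLE

        def update(self, hip_angle, hip_velocity, swing_foot_contact, release_knee_latch):
            """Return the desired hip torque for one control step."""
            if swing_foot_contact:
                # Heel strike: the swing leg becomes the stance leg. Invert the
                # reference to pull the new swing leg forward, and briefly
                # release its knee latch so the knee can flex.
                self.reference = -self.reference
                release_knee_latch()
            # PD control of the hip angle towards the (constant) reference.
            error = self.reference - hip_angle
            return KP * error - KD * hip_velocity

A caller would invoke update() every control cycle, e.g. controller.update(0.3, 0.0, False, lambda: None).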
Fig. 3. Schematic of the walker model, showing for each segment its mass m, moment of inertia I, vertical CoM distance c and horizontal CoM offset w (prefixes: b = body, u = upper leg, l = lower leg, f = foot), the segment lengths ul, ll and fl, the foot radius fr, the horizontal foot offset fh, and the knee (k) and ankle (a) joints.
(Figure: block diagram of the learning setup, with Trainer and Learning components that connect either to the Simulator or to the Robot.)
                            body    upper leg  lower leg   foot
  mass m [kg]               8       0.7        0.7         0.1
  mom. of inertia I [kgm2]  0.11    0.005      0.005       0.0001
  length l [m]              0.45    0.3        0.3         0.06
  vert. dist. CoM c [m]     0.2     0.15       0.15        0
  hor. offset CoM w [m]     0.02    0          0           0.015
  foot radius fr [m]        -       -          -           0.02
  foot hor. offset fh [m]   -       -          -           0.015

Fig. 4. Mass and geometry parameter values of the walker model of Fig. 3.
be missed (see Fig. 5). The learning run ends either when the
robot falls (ground contact of the head, knees or hip) or when
it has made 16 footsteps. The discount factor was set to 1.0,
since time does not play a role in this learning problem; in
order to keep the expected total (undiscounted) sum of rewards
bounded, the maximum number of footsteps is limited to 16.
To learn a stable walking motion, the following rewarding
scheme was chosen: A positive reward is given when a footstep
occurs, while a negative reward is given per time step when the
hip moves backward. A footstep does not count if the hip angle
exceeds 1.2 rad, to avoid rewarding overly stretched steps. This
scheme leaves large freedom in the actual actuation pattern,
since there is not one single best way to finish 16 footsteps when
disturbances are small or zero. This kind of reward structure
leads to a walking motion very quickly, often within 30 minutes of
simulation time, sometimes within 3 minutes, depending on the
random initialization of the Q-values and the amount of
exploration that was set. Inherently, in all learning problems in
this paper, a tradeoff is made between robustness against
disturbances and the goal set by the rewards, simply because the
total expected return is higher when the full run of 16 footsteps
is finished. Although the disturbances are self-induced by
exploration, an irregular gait and/or the initial condition, the
states outside the optimal walking motion may equally well be
reached because of external disturbances.
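A sketch of this scheme, including the episode termination rule, could look as follows. The reward magnitudes are placeholders (the text only states their signs); the 1.2 rad cap and the 16-footstep limit are from the text, and the fall test follows the ground-contact condition above.

    MAX_FOOTSTEPS = 16
    MAX_HIP_ANGLE = 1.2  # [rad] wider footsteps earn no reward (from the text)

    def stable_walking_reward(footstep_made, hip_angle, hip_moves_backward):
        """Per-time-step reward; the +/-1.0 magnitudes are placeholders."""
        reward = 0.0
        if footstep_made and abs(hip_angle) <= MAX_HIP_ANGLE:
            reward += 1.0   # positive reward for each (non-overstretched) footstep
        if hip_moves_backward:
            reward -= 1.0   # negative reward per time step of backward hip motion
        return reward

    def episode_done(robot_fell, footsteps_made):
        # With gamma = 1.0, capping the episode at 16 footsteps keeps the
        # expected undiscounted return bounded.
        return robot_fell or footsteps_made >= MAX_FOOTSTEPS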
F. Minimizing cost of transport
To minimize the specific cost of transport (C.o.T.), defined
as the amount of energy used per unit of transported system
weight (m·g) per distance traveled, the following rewarding
scheme was chosen: a reward of +200/m proportional to
the length of the footstep when a footstep occurs, a reward
of -8.3/J·s proportional to the motor work done per time
step, and a reward of -333/s every time step that the hip
moves backward. The first reward is largely deterministic,
because the angles of both upper legs define the length of
the footstep, provided that both feet are touching the floor
and that the length of both legs is constant. The second
reward is completely deterministic, being calculated from the
angular velocities of both upper legs (which are part of the
state space) and the hip motor torque (chosen as the action).
Again, no discounting is used (γ = 1.0). The optimal policy
will be the one that maximizes the tradeoff between making
large footsteps and spending little motor work.
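In code, this scheme could be sketched as follows; dt is an assumed controller time step, and the reading of the work penalty's units is ours.

    def cost_of_transport_reward(footstep_made, step_length_m, motor_work_J,
                                 hip_moves_backward, dt):
        """Per-time-step reward for the minimum-C.o.T. scheme (values from the text)."""
        reward = 0.0
        if footstep_made:
            reward += 200.0 * step_length_m  # +200/m, proportional to footstep length
        reward -= 8.3 * motor_work_J         # motor work done during this time step
        if hip_moves_backward:
            reward -= 333.0 * dt             # -333/s while the hip moves backward
        return reward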
Fig. 5. Average learning curves of the 'Efficient', 'Fast' and 'Fast and efficient' walkers, plotted against learning time [min].
TABLE II

                              Optimization   Optimization   Optimization on
                              on speed       on CoT         speed and CoT
  avg. forward speed [m/s]      0.554          0.526          0.540
  max. forward speed [m/s]      0.582          0.549          0.566
  avg. cost of transport [-]    0.175          0.102          0.121
  min. cost of transport [-]    0.120          0.078          0.090
G. Maximizing speed
To maximize the forward speed, the following rewarding
scheme was chosen: a reward of 150/m proportional to the
length of the footstep when a footstep occurs, a reward of
-56/s every time step, and a reward of -333/s every time
step when the hip moves backward. Again, no discounting
is used, although time does play a role in this optimization
problem: the reward should decrease linearly with time, not
exponentially as is the case with discounting, which is why a
constant penalty per time step is used instead. Fig. 5 shows
the average learning curve of 50 learning episodes (different
random seeds) when optimizing on the maximum forward speed
of the hip. The average and maximum forward speed can be
found in Table II.
H. Maximizing speed and minimizing cost of transport
Both previous reward structures can be blended. All rewards
together (proportional footstep-length reward, motor work
penalty, time step penalty, backward movement penalty)
produce a tradeoff between minimum C.o.T. and maximum
forward speed. This tradeoff depends on the exact values
of the rewards for motor work, time steps and footstep length.
In our test, we used the following reward scheme: a reward
of 350/m proportional to the length of the footstep when
a footstep occurs, a reward of -8.3/J·s proportional to the
motor work done every time step, a reward of -56/s every
time step, and a reward of -333/s every time step when the
hip moves backward. Fig. 5 shows the average learning curve
of 50 learning episodes (different random seeds), optimizing
on minimum cost of transport as well as maximum average
forward speed. The average and maximum forward velocity,
as well as the average and minimum cost of transport, can be
found in Table II.
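Seen together, all three schemes are instances of one parameterized reward that differs only in its coefficients; the sketch below collects them (coefficients from the text, the structure and names are ours).

    # Reward coefficients per scheme: (footstep [1/m], work [1/J], time [1/s],
    # backward [1/s]); values taken from the reward schemes described above.
    SCHEMES = {
        "efficient":          (200.0, 8.3,  0.0, 333.0),
        "fast":               (150.0, 0.0, 56.0, 333.0),
        "fast_and_efficient": (350.0, 8.3, 56.0, 333.0),
    }

    def blended_reward(scheme, footstep_made, step_length_m, motor_work_J,
                       hip_moves_backward, dt):
        """Per-time-step reward for any of the three optimization criteria."""
        k_step, k_work, k_time, k_back = SCHEMES[scheme]
        reward = -k_work * motor_work_J - k_time * dt
        if footstep_made:
            reward += k_step * step_length_m
        if hip_moves_backward:
            reward -= k_back * dt
        return reward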
I. Learning curve, random initialization and ranking
In general the robot learns to walk very quickly, as Fig. 5
shows. A stable walking motion is often found within 20
minutes. In order to verify that the robot is not stuck in a local
optimum (i.e., the C.o.T. might still drop suddenly at some later
point), the simulations need to be run for quite some time. We
performed tests with simulation times of 15 hours, which showed
no further performance change, indicating convergence.
Due to the random initialization, not all learning attempts
are equally successful; some random seeds never lead to
a result. Optimizing on minimum cost of transport failed to
converge once in our 50 test runs. The walkers even
develop their own character: initially, some
walkers develop a preference for a short step with their
left leg and a large step with their right leg, some dribble, and
some tend to walk like Russian elite soldiers. Due to the built-in
exploration and the optimization (e.g. towards efficiency),
such odd behaviors mostly disappear in the long run. A ranking of
all results on performance makes it possible to select the best
walkers as download candidates for the real robot.
J. Robustness and adaptivity