I. INTRODUCTION
TWO-LEGGED WALKING ROBOTS have a strong attractive appeal due to their resemblance to human
beings. Consequently, some major research institutions and
private companies have started to develop bipedal (two-legged)
robots, which has led to sophisticated machines [14], [8].
To enable economically viable commercialization (e.g. for
entertainment), the challenge is now to reduce the design
complexity of these early successes, in search of the ideal set
of characteristics: stability, simplicity, and energy efficiency.
A promising idea for the simultaneous reduction of complexity and energy consumption, while maintaining or even
increasing the stability, is McGeer's concept of passive dynamic walking [9]. On a shallow slope, a system consisting of
two legs with well-chosen mass properties can already show
stable and sustained walking [6]. No actuators or controls are
necessary, as the swing leg moves at its natural frequency.
Using McGeer's concept as a starting point, we realized a
number of 2D and 3D mechanical prototypes of increasing
complexity [24], [22], [5]. These prototypes are all powered
by hip actuation, and their control is extremely
simple: a foot switch per leg triggers a change in the desired hip
angle, resulting in a swing of the opposite leg.
Although passive dynamics combined with this simple
controller already stabilizes the effect of small disturbances,
larger disturbances, such as an uneven floor, quickly lead to
failures [15]. Also, the simple controller does not guarantee
optimal efficiency or speed. Consequently, in this paper we
elaborate on the introduction of more complex controllers.
B. Mechanical prototype
The combination of passive dynamics and hip actuation has
resulted in multiple prototypes made at the Delft Biorobotics
Laboratory. The most recent 2D model is Meta (Fig. 1), which
is the subject of this study. This prototype is a 2D walker
consisting of 7 body parts (an upper body, two upper legs,
two lower legs and two feet). It has a total of 5 degrees
of freedom, located in the hip joint, the two knee joints and the two
ankle joints. The upper body is connected to the upper legs by
a bisecting hip mechanism, which passively keeps the upper
body at the intermediate angle of the two legs [22].
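In other words, writing phi_l and phi_r for the absolute angles of the two upper legs and phi_b for the absolute angle of the upper body (symbols ours, for illustration), the mechanism enforces the kinematic constraint

    phi_b = (phi_l + phi_r) / 2,

so the body orientation passively tracks the mean of the leg angles.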
The system is powered by a DC motor that is located at
the hip. This actuator is connected to the hip joint through
a compliant element, based on the concept of Series Elastic
Actuation, first introduced by the MIT Leg Lab [13]. By
measuring the elongation of this compliant element, the hip
joint can be force controlled. The compliance ensures
that the actuator's output impedance is low, which makes it
possible to replicate passive dynamic motions. It also ensures
that the actuator performs well in the presence of
impacts. This actuator construction allows us to apply a desired
torque pattern up to a maximum torque of around 10 Nm with
a bandwidth of around 20 Hz. These properties should allow
the reinforcement learning based controller to be implemented
in practice in the near future.
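As a rough illustration of how such a series elastic actuator can be force controlled, the sketch below estimates the transmitted torque from the measured spring elongation and closes a simple proportional force loop. The stiffness value, gain, function names and the velocity-setpoint output are our own assumptions, not the prototype's actual parameters or control structure.

    # Minimal sketch of series-elastic force control (all values assumed).

    SPRING_STIFFNESS = 300.0  # [Nm/rad] assumed stiffness of the compliant element
    KP_FORCE = 5.0            # assumed proportional gain of the force loop

    def transmitted_torque(spring_elongation):
        """Torque actually applied to the hip joint, from the spring elongation."""
        return SPRING_STIFFNESS * spring_elongation

    def force_control_step(desired_torque, spring_elongation):
        """One control step: drive the transmitted torque to the desired value."""
        torque_error = desired_torque - transmitted_torque(spring_elongation)
        return KP_FORCE * torque_error  # motor-side command (e.g. a velocity setpoint)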
The prototype is fully autonomous, running on lithium ion
polymer batteries. The control platform is a PC/104 stack with
a 400 MHz processor, and the controllers are implemented
through the Matlab Simulink xPC Target environment. The
angles of all 5 joints, as well as the elongation of the actuator's
compliant element, are measured in real time using incremental
encoders. In addition to these sensors, there are two switches underneath the feet to detect foot contact.
The knee and ankle joints are both fully passive, but the
knee joint can be locked to keep the knee extended whenever
the robot is standing on the corresponding leg.
The prototype can walk based on a fairly simple control
algorithm. The hip angle is PD controlled towards a constant
reference hip angle. When the foot switch of the current swing
leg makes contact (so that this leg becomes the new stance leg), the
reference angle is inverted, effectively pulling the new swing
leg forward. Simultaneously, the knee latch of the new swing
leg is released briefly. Then the system simply waits for the new
swing leg's foot switch to make contact, assuming that knee
extension takes place before heel contact.
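A minimal sketch of this algorithm is given below; the reference angle, PD gains and the latch-release callback are illustrative assumptions, not the prototype's tuned values.

    # Sketch of the foot-switch-triggered hip controller (values assumed).

    REF_HIP_ANGLE = 0.5   # [rad] assumed constant reference inter-leg hip angle
    KP, KD = 30.0, 1.0    # assumed PD gains

    class SimpleWalkController:
        def __init__(self):
            self.reference = REF_HIP_ANGLE

        def update(self, hip_angle, hip_velocity, swing_foot_contact, release_knee_latch):
            """Return the desired hip torque for one control step."""
            if swing_foot_contact:
                # Heel strike: the swing leg becomes the stance leg. Invert the
                # reference to pull the new swing leg forward, and briefly
                # release its knee latch so the knee can flex.
                self.reference = -self.reference
                release_knee_latch()
            # PD control of the hip angle towards the (constant) reference.
            error = self.reference - hip_angle
            return KP * error - KD * hip_velocity

A caller would invoke update() every control cycle, e.g. controller.update(0.3, 0.0, False, lambda: None).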
Fig. 3. Schematic of the walker model, showing for each segment its mass m, moment of inertia I, vertical CoM distance c and horizontal CoM offset w (prefixes: b = body, u = upper leg, l = lower leg, f = foot), the segment lengths ul, ll and fl, the foot radius fr, the horizontal foot offset fh, and the knee (k) and ankle (a) joints.
(Figure: block diagram of the learning setup, with Trainer and Learning components that connect either to the Simulator or to the Robot.)
                            body    upper leg  lower leg   foot
  mass m [kg]               8       0.7        0.7         0.1
  mom. of inertia I [kgm2]  0.11    0.005      0.005       0.0001
  length l [m]              0.45    0.3        0.3         0.06
  vert. dist. CoM c [m]     0.2     0.15       0.15        0
  hor. offset CoM w [m]     0.02    0          0           0.015
  foot radius fr [m]        -       -          -           0.02
  foot hor. offset fh [m]   -       -          -           0.015

Fig. 4. Mass and geometry parameter values of the walker model of Fig. 3.
be missed (see Fig. 5). The learning run ends either when the
robot falls (ground contact of the head, knees or hip) or when
it has made 16 footsteps. The discount factor was set to 1.0,
since time does not play a role in this learning problem; in
order to keep the expected total (undiscounted) sum of rewards
bounded, the maximum number of footsteps is limited to 16.
To learn a stable walking motion, the following rewarding
scheme was chosen: A positive reward is given when a footstep
occurs, while a negative reward is given per time step when the
hip moves backward. A footstep does not count if the hip angle
exceeds 1.2 rad, to avoid rewarding overly stretched steps. This
scheme leaves large freedom in the actual actuation pattern,
since there is not one single best way to finish 16 footsteps when
disturbances are small or zero. This kind of reward structure
leads to a walking motion very quickly, often within 30 minutes of
simulation time, sometimes within 3 minutes, depending on the
random initialization of the Q-values and the amount of
exploration that was set. Inherently, in all learning problems in
this paper, a tradeoff is made between robustness against
disturbances and the goal set by the rewards, simply because the
total expected return is higher when the full run of 16 footsteps
is finished. Although the disturbances are self-induced by
exploration, an irregular gait and/or the initial condition, the
states outside the optimal walking motion may equally well be
reached because of external disturbances.
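A sketch of this scheme, including the episode termination rule, could look as follows. The reward magnitudes are placeholders (the text only states their signs); the 1.2 rad cap and the 16-footstep limit are from the text, and the fall test follows the ground-contact condition above.

    MAX_FOOTSTEPS = 16
    MAX_HIP_ANGLE = 1.2  # [rad] wider footsteps earn no reward (from the text)

    def stable_walking_reward(footstep_made, hip_angle, hip_moves_backward):
        """Per-time-step reward; the +/-1.0 magnitudes are placeholders."""
        reward = 0.0
        if footstep_made and abs(hip_angle) <= MAX_HIP_ANGLE:
            reward += 1.0   # positive reward for each (non-overstretched) footstep
        if hip_moves_backward:
            reward -= 1.0   # negative reward per time step of backward hip motion
        return reward

    def episode_done(robot_fell, footsteps_made):
        # With gamma = 1.0, capping the episode at 16 footsteps keeps the
        # expected undiscounted return bounded.
        return robot_fell or footsteps_made >= MAX_FOOTSTEPS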
F. Minimizing cost of transport
To minimize the specific cost of transport (C.o.T.), defined
as the amount of energy used per unit of transported system
weight (m·g) per distance traveled, the following rewarding
scheme was chosen: a reward of +200/m proportional to
the length of the footstep when a footstep occurs, a reward
of -8.3/J·s proportional to the motor work done per time
step, and a reward of -333/s every time step that the hip
moves backward. The first reward is largely deterministic,
because the angles of both upper legs define the length of
the footstep, provided that both feet are touching the floor
and that the length of both legs is constant. The second
reward is completely deterministic, being calculated from the
angular velocities of both upper legs (which are part of the
state space) and the hip motor torque (chosen as the action).
Again, no discounting is used (γ = 1.0). The optimal policy
will be the one that maximizes the tradeoff between making
large footsteps and spending little motor work.
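In code, this scheme could be sketched as follows; dt is an assumed controller time step, and the reading of the work penalty's units is ours.

    def cost_of_transport_reward(footstep_made, step_length_m, motor_work_J,
                                 hip_moves_backward, dt):
        """Per-time-step reward for the minimum-C.o.T. scheme (values from the text)."""
        reward = 0.0
        if footstep_made:
            reward += 200.0 * step_length_m  # +200/m, proportional to footstep length
        reward -= 8.3 * motor_work_J         # motor work done during this time step
        if hip_moves_backward:
            reward -= 333.0 * dt             # -333/s while the hip moves backward
        return reward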
Fig. 5. Average learning curves of the 'Efficient', 'Fast' and 'Fast and efficient' walkers, plotted against learning time [min].
TABLE II

                              Optimization   Optimization   Optimization on
                              on speed       on CoT         speed and CoT
  avg. forward speed [m/s]      0.554          0.526          0.540
  max. forward speed [m/s]      0.582          0.549          0.566
  avg. cost of transport [-]    0.175          0.102          0.121
  min. cost of transport [-]    0.120          0.078          0.090
G. Maximizing speed
To maximize the forward speed, the following rewarding
scheme was chosen: a reward of 150/m proportional to the
length of the footstep when a footstep occurs, a reward of
-56/s every time step, and a reward of -333/s every time
step when the hip moves backward. Again, no discounting
is used, although time does play a role in this optimization
problem: the reward should decrease linearly with time, not
exponentially as is the case with discounting, which is why a
constant penalty per time step is used instead. Fig. 5 shows
the average learning curve of 50 learning episodes (different
random seeds) when optimizing on the maximum forward speed
of the hip. The average and maximum forward speed can be
found in Table II.
H. Maximizing speed and minimizing cost of transport
Both previous reward structures can be blended. All rewards
together (proportional footstep-length reward, motor work
penalty, time step penalty, backward movement penalty)
produce a tradeoff between minimum C.o.T. and maximum
forward speed. This tradeoff depends on the exact values
of the rewards for motor work, time steps and footstep length.
In our test, we used the following reward scheme: a reward
of 350/m proportional to the length of the footstep when
a footstep occurs, a reward of -8.3/J·s proportional to the
motor work done every time step, a reward of -56/s every
time step, and a reward of -333/s every time step when the
hip moves backward. Fig. 5 shows the average learning curve
of 50 learning episodes (different random seeds), optimizing
on minimum cost of transport as well as maximum average
forward speed. The average and maximum forward velocity,
as well as the average and minimum cost of transport, can be
found in Table II.
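Seen together, all three schemes are instances of one parameterized reward that differs only in its coefficients; the sketch below collects them (coefficients from the text, the structure and names are ours).

    # Reward coefficients per scheme: (footstep [1/m], work [1/J], time [1/s],
    # backward [1/s]); values taken from the reward schemes described above.
    SCHEMES = {
        "efficient":          (200.0, 8.3,  0.0, 333.0),
        "fast":               (150.0, 0.0, 56.0, 333.0),
        "fast_and_efficient": (350.0, 8.3, 56.0, 333.0),
    }

    def blended_reward(scheme, footstep_made, step_length_m, motor_work_J,
                       hip_moves_backward, dt):
        """Per-time-step reward for any of the three optimization criteria."""
        k_step, k_work, k_time, k_back = SCHEMES[scheme]
        reward = -k_work * motor_work_J - k_time * dt
        if footstep_made:
            reward += k_step * step_length_m
        if hip_moves_backward:
            reward -= k_back * dt
        return reward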
I. Learning curve, random initialization and ranking
In general the robot learns to walk very quickly, as Fig. 5
shows. A stable walking motion is often found within 20
minutes. In order to verify that the robot is not stuck in a local
optimum (i.e., the C.o.T. might still drop suddenly at some later
point), the simulations need to be run for quite some time. We
performed tests with simulation times of 15 hours, which showed
no further performance change, indicating convergence.
Due to the random initialization, not all learning attempts
are equally successful; some random seeds never lead to
a result. Optimizing on minimum cost of transport failed to
converge once in our 50 test runs. The walkers even
develop their own character: initially, some
walkers develop a preference for a short step with their
left leg and a large step with their right leg, some dribble, and
some tend to walk like Russian elite soldiers. Due to the built-in
exploration and the optimization (e.g. towards efficiency),
such odd behaviors mostly disappear in the long run. A ranking of
all results on performance makes it possible to select the best
walkers as download candidates for the real robot.
J. Robustness and adaptivity