
Experiences with running MATLAB jobs on a power-saving Condor Pool
Ian C. Smith
University of Liverpool Condor Pool
Contains around 300 machines running the University's Managed
Windows (XP) Service.

Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and an 80 GB
disk, configured with two job slots per machine.

Software updates via a weekly re-imaging process.

Single combined submit host / central manager running on a Sun
V440 SMP server.

Restricted access to submit host for registered Condor users.

Currently running Condor 7.0.2 (moving to 7.2.x soon).

Policy is to run jobs only after at least 10 minutes of inactivity and
with a low load average during office hours, and at any time outside
office hours.
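
The run policy above can be read as a simple predicate. The following is a minimal sketch, not the pool's actual Condor configuration: the load threshold and the office-hours window are assumed values (the slide only says "low load average" and "office hours"), and the names `MachineState` and `may_start_job` are hypothetical.

```python
from dataclasses import dataclass

IDLE_THRESHOLD_MIN = 10            # from the slide: at least 10 minutes of inactivity
LOAD_THRESHOLD = 0.3               # assumed value; the slide only says "low"
OFFICE_START, OFFICE_END = 9, 17   # assumed office hours (weekdays only)


@dataclass
class MachineState:
    idle_minutes: float   # time since last keyboard/mouse activity
    load_average: float   # load not caused by Condor itself


def in_office_hours(weekday: int, hour: int) -> bool:
    """weekday: 0 = Monday ... 6 = Sunday; hour: 0-23."""
    return weekday < 5 and OFFICE_START <= hour < OFFICE_END


def may_start_job(state: MachineState, weekday: int, hour: int) -> bool:
    """Outside office hours jobs may always run; during office hours
    the machine must be idle long enough and lightly loaded."""
    if not in_office_hours(weekday, hour):
        return True
    return (state.idle_minutes >= IDLE_THRESHOLD_MIN
            and state.load_average < LOAD_THRESHOLD)
```

In a real pool this logic would live in the execute hosts' START expression rather than in a script; the sketch only makes the two cases explicit.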
MATLAB advantages
Originally developed for linear algebra algorithm development but
now contains many built-in functions geared to different disciplines
divided into toolboxes.

Intuitive interactive environment allows rapid code development.

Simple but powerful file I/O: save <filename>, load <filename>
(useful for checkpointing).

Allows users to create their own functions stored as M-files.

Standalone applications can be built from M-files:


can run on platforms without MATLAB installed
do not need a licence to be able to run
can include all toolbox functions

APIs available for FORTRAN and C codes (MEX files)


MATLAB disadvantages
Even standalone applications can run slower than equivalent C or
FORTRAN implementations.

Standalone applications aren't quite what they may seem:


more than just an .exe: several files need to be packaged and deployed
need access to MATLAB run-time libraries usually via MATLAB Component
Runtime (150 MB self-extracting .exe)
luckily we have MATLAB pre-installed on all PCs in Condor pool (originally
used a network drive)

Run-time errors can be difficult to trace when MATLAB jobs are
run under Condor:
need to run under Condor on local PC
configure with USE_VISIBLE_DESKTOP=True to see pop-up messages

Jobs submitted in a UNIX environment but code developed under
Windows.
Minor MATLAB irritations

Output files occasionally go missing:


specify all required files using transfer_output_files
identify problem jobs with condor_q -held
resubmit with condor_release -all
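
As an illustration of listing every required output file explicitly, here is a minimal sketch that renders a Condor submit description. The executable and file names are hypothetical; the submit commands themselves (`should_transfer_files`, `when_to_transfer_output`, `transfer_output_files`, `queue`) are standard ones, but the pool's real submit files may well differ.

```python
def render_submit(executable, output_files, n_jobs):
    """Return the text of a vanilla-universe submit description that
    names every output file to be brought back to the submit host."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        # Comma-separated list: files not named here may not come back.
        "transfer_output_files = " + ", ".join(output_files),
        "log = job.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(render_submit("run_model.exe", ["results.mat", "state.mat"], 100))
```

The generated text would be saved to a file and passed to condor_submit.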

Jobs sometimes run forever:


use condor_vacate to move job to another machine
less of a problem during term time as jobs usually get evicted by logins

Difficult to reproduce these problems:


happen quite rarely (fewer than about 1 in 1000 jobs)
many jobs based on stochastic methods
MATLAB Research Applications

Predicting the spread of avian influenza outbreaks in poultry
flocks (Veterinary Clinical Science).

Modelling of E. coli propagation in dairy cattle (Veterinary Clinical
Science).

Testing of parallel genetic algorithms in a complex classification
system (Electrical Engineering and Electronics).

Simulation of the infection of a bacterial cell by a virus
(Mathematical Sciences).

Modelling the effects of radiotherapy on normal tissue using 3D
voxel arrays (Medical Imaging and Radiotherapy).
Power-saving at Liverpool
Around 2 000 centrally managed PCs across campus used to be left
powered up overnight, at weekends and during vacations.

Original power-saving policy was to power off machines after 30
minutes of inactivity; machines now hibernate after 10 minutes of
inactivity.

Policy has reduced wasteful inactivity time by ~200 000 to 250 000
hours per week (equivalent to 20-25 MWh) leading to an estimated
saving of approx. £125 000 p.a.
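
The figures above can be cross-checked with a little arithmetic. This is a rough sketch: the function names are hypothetical and the 52-week year is an assumption. It shows that the quoted hours and energy imply an average draw of about 100 W per idle machine.

```python
def implied_power_watts(energy_mwh, machine_hours):
    """Average power per machine implied by an energy total (MWh)
    spread over a number of machine-hours."""
    return energy_mwh * 1_000_000 / machine_hours   # 1 MWh = 1 000 000 Wh


def annual_energy_mwh(weekly_mwh, weeks=52):
    """Scale a weekly saving to a year (assumes 52 comparable weeks)."""
    return weekly_mwh * weeks


if __name__ == "__main__":
    for hours, mwh in [(200_000, 20), (250_000, 25)]:
        print(hours, "machine-hours/week ->",
              implied_power_watts(mwh, hours), "W per machine,",
              annual_energy_mwh(mwh), "MWh per year")
```

Both ends of the quoted range give the same per-machine figure, which suggests the energy estimate was derived from the hours using a flat per-PC power rating.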

Makes extensive use of the PowerMAN system from Data Synergy,
comprising:
service which forces machines into a low-power state and reports machine
activity to Management Reporting Platform
Management Reporting Platform - central server from where usage stats
can be retrieved and viewed via a web browser
Adapting Condor for use with power-saving PCs
Two main problems:
how to ensure Condor jobs are not evicted by hibernating/powered-off PCs
how to wake up dormant PCs to run Condor jobs on-demand

Originally used a Microsoft system service to power down PCs after
30 min inactivity:
runs .bat file which checks if a user is logged in and shuts machine down if
not
doesn't detect owner of Condor job as a logged-in user
need to check for presence of condor_exe.bat

PowerMAN service now prevents job eviction:


can provide PowerMAN with a list of protected programs
ensures that system remains active if a protected program is running
include condor_starter process as a protected program (only present while
a Condor job is running).
Adapting Condor for use with power-saving PCs
Wake-on-LAN (WoL) used to bring hibernating machines back to full
power:
NICs must remain powered up during hibernation/power-off
NICs must be capable of waking machines on receipt of a magic packet
network must be able to route magic packets
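
For reference, a magic packet is 6 bytes of 0xFF followed by the target MAC address repeated 16 times, normally sent as a UDP broadcast. The sketch below builds and sends one using only the Python standard library. The broadcast address and port 9 are conventional defaults assumed here, not details from the slides, and the MAC address in the example is made up.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC such as
    '00:11:22:33:44:55' or '00-11-22-33-44-55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)   # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Send the packet as a UDP broadcast (address and port are assumed
    defaults; routed networks usually need a subnet-directed broadcast)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


if __name__ == "__main__":
    send_magic_packet("00:11:22:33:44:55")   # made-up MAC address
```

Whether the packet reaches a hibernating PC still depends on the NIC and routing conditions listed above.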

A cron job on the submit host examines the state of the queue (condor_q)
and pool (condor_status):
if more idle jobs in queue than Unclaimed machines then need to wake up
hibernating machines
find number of powered-up machines in each teaching centre
(classroom)
estimate the number of hibernating machines in each teaching centre from total
number of machines in each
sort centres from highest number of available machines to lowest
wake up centres in turn until sufficient machines woken to meet the demand (or
all centres woken up)
MAC addresses of machines are stored in files sorted according to teaching
centre (needed for Wake-on-LAN)
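
The wake-up steps above can be sketched as a small planning function. This is an illustration under stated assumptions, not the actual cron script: the function name and centre names are hypothetical, "available machines" is taken to mean the estimated number of hibernating machines in a centre, and jobs, slots and machines are treated as interchangeable units.

```python
def plan_wakeups(idle_jobs, unclaimed, totals, powered_up):
    """Return the teaching centres to wake, in order.

    idle_jobs  -- number of idle jobs in the queue (from condor_q)
    unclaimed  -- number of Unclaimed machines (from condor_status)
    totals     -- {centre: total machines installed in that centre}
    powered_up -- {centre: machines currently visible in the pool}
    """
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []   # enough Unclaimed machines already

    # Only an estimate: a missing machine may be broken, not hibernating.
    hibernating = {c: max(totals[c] - powered_up.get(c, 0), 0) for c in totals}

    # Largest estimated gain first, so fewer centres need to be woken.
    order = sorted(hibernating, key=hibernating.get, reverse=True)

    to_wake, woken = [], 0
    for centre in order:
        if woken >= shortfall:
            break
        if hibernating[centre] == 0:
            continue
        to_wake.append(centre)          # whole centre is woken at once
        woken += hibernating[centre]
    return to_wake


if __name__ == "__main__":
    totals = {"CentreA": 30, "CentreB": 50, "CentreC": 20}       # hypothetical
    powered_up = {"CentreA": 10, "CentreB": 10, "CentreC": 20}
    print(plan_wakeups(60, 10, totals, powered_up))
```

Each selected centre would then be woken by sending magic packets to the MAC addresses stored in that centre's file.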
Automatic wake up issues
Assumes that any job can run on any machine:
users cannot choose particular teaching centres or machines in their job
Requirements
ideally, pool needs to be homogeneous
errors in Requirements specification can cause severe problems
(machines repeatedly wake up then hibernate)
cron now includes a sanity check for this

Large clusters of jobs can cause the Condor scheduler to become
overloaded:
condor_q times out so cron cannot determine queue state
only a transient problem: load eventually drops off and condor_q
responds again

Can only estimate number of hibernating machines in each centre

May wake up more machines than needed


[Figure: Automatic wake up in action: Condor pool machine statistics]
[Figure: Automatic wake up in action: PowerMAN statistics]
Recent and Future Developments

Recently moved to a policy of hibernating machines after
10 minutes of inactivity:
submit host / central manager needs to work harder to get jobs running
before recently woken machines go back to hibernation
move execute hosts from Owner to Unclaimed state after just 5 minutes idle
update activity timer every 1 minute (default is 5 minutes)
increase number of scheduler and negotiator cycles using
SCHEDD_INTERVAL=60, NEGOTIATOR_INTERVAL=60
around 25 % of machines still hibernate after first wake-up
see a ramp up in machines running Condor jobs over about an hour
little impact on Condor users
energy wastage offset by savings with user logouts
Recent and Future Developments
Migrating to Condor 7.2 shortly
Has some interesting power-management features
Automatic power-down on execute hosts could provide a useful safety net
but PowerMAN likely to remain primary power management tool
Can retain records of ClassAds of machines in low-power state
could be useful in matchmaking jobs to powered-down machines
matchmaking logic already in Condor
nice if Condor could use this to provide a list of machines to wake-up on demand
... and wake them up with condor_wakeup ?
would like to ensure that powered-down machines are still out there (not broken,
permanently turned off, not listening etc)
also useful to see powered-off machines represented in condor_status output
Couple of extra wishes
allow jobs to claim all slots on a machine (useful if they have large memory
requirements)
provide a logged-in user machine ClassAd attribute
Further Information

http://www.liv.ac.uk/e-science/condor
i.c.smith@liverpool.ac.uk
