
Handbook of Research

on Systems Biology
Applications in Medicine
Andriani Daskalaki
Max Planck Institute for Molecular Genetics, Berlin, Germany

Volume I

Medical Information Science Reference


Hershey New York

Director of Editorial Content: Kristin Klinger
Senior Managing Editor: Jennifer Neidig
Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Typesetter: Sean Woznicki
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by


Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@igi-global.com
Web site: http://www.igi-global.com
and in the United Kingdom by
Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanbookstore.com
Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by
any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does
not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Handbook of research on systems biology applications in medicine / Andriani Daskalaki, editor.
p. ; cm.
Includes bibliographical references and index.
Summary: This book highlights the use of systems approaches including genomic, cellular, proteomic, metabolomic, bioinformatics,
molecular, and biochemical, to address fundamental questions in complex diseases like cancer and diabetes, but also in ageing--Provided by
publisher.
ISBN 978-1-60566-076-9 (h/c)
1. Biological control systems--Handbooks, manuals, etc. 2. Medicine--Research--Handbooks, manuals, etc. I. Daskalaki, Andriani.
[DNLM: 1. Systems Biology--methods. 2. Models, Theoretical. QU 26.5 H2367 2009]
R852.H36 2009
610.72--dc22
2008020863
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book set is original material. The views expressed in this book are those of the authors, but not necessarily of
the publisher.

If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating
the library's complimentary electronic access to this publication.

Editorial Advisory Board

Ralf Herwig
Max Planck Institute for Molecular Genetics,
Germany
Chuchuka Enwemeka
New York Institute of Technology, USA
Christoph Wierling
Max Planck Institute for Molecular Genetics,
Germany
Elisabeth Maschke-Dutz
Max Planck Institute for Molecular Genetics,
Germany
James Adjaye
Max Planck Institute for Molecular Genetics,
Germany
Athina Lazakidou
University of Piraeus, Greece
Sofia Kossida
Academy of Athens, Greece
Melpomeni Lazakidou
General Hospital Salzburg, Austria
Anastasia Kastania
Athens University, Greece

List of Contributors

Abdeljaoued-Tej, Ines / ESSAI-UR Algorithmes et Structures, Tunisia............................................ 377


Adolphs, Julia / Freie Universität Berlin, Germany.......................................................................... 573
Ahmed, Jessica / Charité Universitaetsmedizin Berlin, Germany..................................................... 423
Albrecht, Daniela / Leibniz Institute for Natural Product Research and Infection Biology Hans-Knoell-Institute (HKI), Germany.......................................................................................... 403
Alfieri, Roberta / CNR - Institute for Biomedical Technologies, Italy and CILEA, Italy................... 476
Argyropoulos, Christos / University of Pittsburgh Medical Center, USA......................................... 221
Bagos, Pantelis G. / University of Central Greece, and University of Athens, Greece.............. 167, 182
Baumann, Marc / Biomedicum, Helsinki University, Finland........................................................... 694
Benkahla, Alia / Institut Pasteur de Tunis, Tunisia............................................................................ 377
Benovoy, David / McGill University and Genome Québec Innovation Centre, Canada................... 262
Beuthan, Jürgen / Charité Universitaetsmedizin Berlin, Germany................................................... 673
Boutayeb, Abdesslam / Faculté des sciences Oujda-Morocco, Morocco.................................. 798, 809
Brakhage, Axel A. / Leibniz Institute for Natural Product Research and Infection Biology Hans-Knoell-Institute (HKI), Germany and Friedrich Schiller University (FSU), Germany........ 403
Brazhe, Alexey R. / Technical University of Denmark, Denmark and Moscow State University,
Russia.............................................................................................................................................. 656
Brazhe, Nadezda A. / Technical University of Denmark, Denmark and Moscow State University,
Russia.............................................................................................................................................. 656
Bryan, Kenneth / University College Dublin (UCD), Ireland........................................................... 826
Cho, Kwang-Hyun / Korea Advanced Institute of Science and Technology (KAIST), Korea.............. 11
Clevert, Djork-Arné / Charité Universitaetsmedizin Berlin, Germany and Johannes Kepler
University Linz, Austria................................................................................................................... 251
Cunningham, Pádraig / University College Dublin (UCD), Ireland................................................ 826
Daskalaki, Andriani / Max Planck Institute for Molecular Genetics, Germany............................... 643
Daskalakis, Antonis / University of Patras, Greece........................................................................... 221
de Bono, Bernard / European Bioinformatics Institute, UK and University of Malta, Malta........... 714
de Carvalho Lima Lobato, Ana Katerine / Federal University of Rio Grande do Norte, Brazil
and Potiguar University, Brazil....................................................................................................... 458
Dellagi, Koussay / Institut Pasteur de Tunis, Tunisia......................................................................... 377
Derouich, Mohamed / Faculté des sciences Oujda-Morocco, Morocco.................................... 798, 809
Desai, Prerak / Utah State University, USA....................................................................................... 278

Deuflhard, Peter / Zuse Institute Berlin, Germany; Freie Universität Berlin, Germany; and DFG
Research Center Matheon, Germany............................................................................................... 759
DiFranco, Matthew / University College Dublin (UCD), Ireland..................................................... 826
Dressler, Cathrin / Laser- und Medizin-Technologie GmbH, Berlin, Germany................................. 673
Esposti, F. / Politecnico di Milano, Italy............................................................................................. 541
Evelo, C.T.A. / University of Maastricht, The Netherlands................................................................ 339
Flack, L.K. / University of Queensland, Australia............................................................................. 209
Foley, Ross / University College Dublin (UCD), Ireland................................................................... 826
Gallagher, William M. / University College Dublin (UCD), Ireland................................................ 826
Georgiev, G. / Institute of Mechanics and Biomechanics, Bulgaria..................................................... 27
Ghazal, Peter / University of Edinburgh Medical School, Scotland and Centre for Systems
Biology at Edinburgh, Scotland.......................................................................................................... 1
Gillies, Duncan / Imperial College London, UK................................................................................ 516
Gopalakrishnan, Vanathi / University of Pittsburgh, USA............................................................... 126
Guizani-Tabbane, Lamia / Institut Pasteur de Tunis, Tunisia........................................................... 377
Guthke, Reinhard / Leibniz Institute for Natural Product Research and Infection Biology Hans-Knoell-Institute (HKI), Germany.......................................................................................... 403
Hache, Hendrik / Max Planck Institute for Molecular Genetics, Germany....................................... 497
Hamblin, Michael R. / Massachusetts General Hospital - Boston, USA; Harvard Medical School,
USA; and Harvard-MIT Division of Health Sciences and Technology, USA.................................. 588
Hamodrakas, Stavros J. / University of Athens, Greece........................................................... 167, 182
Hossbach, Julia / Charité Universitaetsmedizin Berlin, Germany..................................................... 423
Kleffe, Jürgen / Charité Universitaetsmedizin Berlin, Germany....................................................... 291
Kniemeyer, Olaf / Leibniz Institute for Natural Product Research and Infection Biology Hans-Knoell-Institute (HKI), Germany........................................................................................... 403
Kossida, Sophia / Biomedical Research Foundation of the Academy of Athens, Greece and
Biomedicum, Helsinki University, Finland...................................................................................... 694
Kotev, V. / Institute of Mechanics and Biomechanics, Bulgaria........................................................... 27
Kowald, Axel / Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, Germany.......... 312
Kuznetsov, Andrew / Freiburg University, Germany........................................................................... 97
Kwan, Tony / McGill University and Genome Québec Innovation Centre, Canada......................... 262
Lambin, P. / GROW Research Institute, University of Maastricht, The Netherlands........................ 339
Maffezzoli, A. / Politecnico di Milano, Italy....................................................................................... 541
Majewski, Jacek / McGill University and Genome Québec Innovation Centre, Canada.................. 262
Makrantonaki, Evgenia / Dessau Medical Center, Germany and Charité Universitaetsmedizin
Berlin, Germany............................................................................................................................... 331
Maksimov, Georgy V. / Moscow State University, Russia................................................................. 656
Maschke-Dutz, Elisabeth / Max Planck Institute for Molecular Genetics, Germany......................... 74
Mavituna, Ferda / The University of Manchester, UK....................................................................... 458
McLachlan, G.J. / University of Queensland, Australia.................................................................... 209
Meinel, Thomas / Max Planck Institute for Molecular Genetics, Germany...................................... 143
Mendoza, Luis / Universidad Nacional Autónoma de México, Mexico............................................. 530
Milanesi, Luciano / CNR - Institute for Biomedical Technologies, Italy........................................... 476
Miled, Slimane Ben / Institut Pasteur de Tunis, Tunisia and ENIT-LAMSIN, Tunisia...................... 377
Minet, Olaf / Charité Universitaetsmedizin Berlin, Germany............................................................ 673
Mishra, Alok / Imperial College London, UK.................................................................................... 516

Moschopoulos, Charalampos / Biomedical Research Foundation of the Academy of Athens, Greece....694


Mosekilde, Erik / Technical University of Denmark, Denmark......................................................... 656
Mulrane, Laoighse / University College Dublin (UCD), Ireland...................................................... 826
Munoz-Hernandez, Raul / The University of Manchester, UK......................................................... 458
Nikiforidis, George / University of Patras, Greece............................................................................ 221
Nikolov, S. / Institute of Mechanics and Biomechanics, Bulgaria........................................................ 27
Numata, Jorge / Freie Universität Berlin, Germany.......................................................................... 731
Orešič, Matej / VTT Technical Research Centre of Finland, Finland................................................ 354
Pavlov, Alexey N. / Saratov State University, Russia......................................................................... 656
Petrov, V. / Institute of Mechanics and Biomechanics, Bulgaria.......................................................... 27
Pham, Tuan D. / James Cook University, Australia........................................................................... 117
Preissner, Robert / Charité Universitaetsmedizin Berlin, Germany.................................................. 423
Rasche, Axel / Max-Planck-Institute for Molecular Genetics, Germany................................... 251, 361
Reinecke, Isabel / Zuse Institute Berlin, Germany............................................................................. 759
Rexhepaj, Elton / University College Dublin (UCD), Ireland........................................................... 826
Sakellaropoulos, George / University of Patras, Greece................................................................... 221
Santiago-Cortés, Elizabeth / Universidad Nacional Autónoma de México, Mexico........................ 530
Seigneuric, R. / GROW Research Institute, University of Maastricht, The Netherlands................... 339
Sgourakis, Nikolaos G. / Rensselaer Polytechnic Institute, USA...................................................... 167
Signorini, M.G. / Politecnico di Milano, Italy.................................................................................... 541
Sosnovtseva, Olga V. / Technical University of Denmark, Denmark................................................. 656
Sreenath, Sree / Case Systems Biology Initiative, Case Western Reserve University, USA................. 11
Starmans, M.H.W. / GROW Research Institute, University of Maastricht, The Netherlands........... 339
Stier, Heike / Charité Universitaetsmedizin Berlin, Germany............................................................ 291
Theodosiou, Athina / Biomedical Research Foundation of the Academy of Athens, Greece............. 694
van Erk, A. / University of Maastricht, The Netherlands................................................................... 339
van Riel, N.A.W. / Eindhoven University of Technology, The Netherlands....................................... 339
Vicini, Paolo / University of Washington, USA................................................................................... 556
Vidal-Puig, Antonio / Institute of Metabolic Science, Addenbrooke's Hospital, UK......................... 354
Watson, R. William / University College Dublin (UCD), Ireland..................................................... 826
Weimer, Bart / Utah State University, USA........................................................................................ 278
Wellstead, Peter / The Hamilton Institute, National University of Maynooth, Ireland........................ 11
Wolkenhauer, Olaf / University of Rostock, Germany......................................................................... 11
Wouters, B.G. / GROW Research Institute, University of Maastricht, The Netherlands................... 339
Wrede, Paul / Charité Universitaetsmedizin Berlin, Germany.......................................... 291, 423, 438
Wruck, Wasco / Max Planck Institute for Molecular Genetics, Germany......................................... 239
Zabarylo, Urszula / Charité Universitaetsmedizin Berlin, Germany................................................ 673
Zouboulis, Christos C. / Dessau Medical Center, Germany and Charité Universitaetsmedizin
Berlin, Germany............................................................................................................................... 331

Table of Contents

Foreword . .......................................................................................................................................xxxiii
Preface . ............................................................................................................................................ xxxv
Acknowledgment . ................................................................................................................................ xl
Volume I
Section I
Basic Concepts in Medical Systems Biology
Chapter I
Pathway Biology Approach to Medicine ................................................................................................ 1

Peter Ghazal, University of Edinburgh Medical School, Scotland and Centre for Systems

Biology at Edinburgh, Scotland
Chapter II
Systems and Control Theory for Medical Systems Biology . ............................................................... 11

Peter Wellstead, The Hamilton Institute, National University of Maynooth, Ireland

Sree Sreenath, Case Systems Biology Initiative, Case Western Reserve University, USA

Kwang-Hyun Cho, Korea Advanced Institute of Science and Technology (KAIST), Korea

Olaf Wolkenhauer, University of Rostock, Germany
Chapter III
Mathematical Description of Time Delays in Pathways Cross Talk . ................................................... 27

S. Nikolov, Institute of Mechanics and Biomechanics, Bulgaria

V. Petrov, Institute of Mechanics and Biomechanics, Bulgaria

V. Kotev, Institute of Mechanics and Biomechanics, Bulgaria

G. Georgiev, Institute of Mechanics and Biomechanics, Bulgaria
Chapter IV
Deterministic Modeling in Medicine .................................................................................................... 74

Elisabeth Maschke-Dutz, Max Planck Institute for Molecular Genetics, Germany

Chapter V
Synthetic Biology as a Proof of Systems Biology ................................................................................ 97

Andrew Kuznetsov, Freiburg University, Germany
Section II
Advanced Computational Methods for Systems Biology
Chapter VI
Computational Models for the Analysis of Modern Biological Data ................................................. 117

Tuan D. Pham, James Cook University, Australia
Chapter VII
Computer Aided Knowledge Discovery in Biomedicine ................................................................... 126

Vanathi Gopalakrishnan, University of Pittsburgh, USA
Section III
Genomics and Bioinformatics for Systems Biology
Chapter VIII
Function and Homology of Proteins Similar in Sequence: Phylogenetic Profiling . .......................... 143

Thomas Meinel, Max Planck Institute for Molecular Genetics, Germany
Chapter IX
Computational Methods for the Prediction of GPCRs Coupling Selectivity ..................................... 167

Nikolaos G. Sgourakis, Rensselaer Polytechnic Institute, USA

Pantelis G. Bagos, University of Central Greece, and University of Athens, Greece

Stavros J. Hamodrakas, University of Athens, Greece
Chapter X
Bacterial β-Barrel Outer Membrane Proteins: A Common Structural Theme Implicated
in a Wide Variety of Functional Roles ................................................................................................ 182

Pantelis G. Bagos, University of Central Greece, and University of Athens, Greece

Stavros J. Hamodrakas, University of Athens, Greece
Section IV
Experimental Techniques for Systems Biology
Chapter XI
Clustering Methods for Gene-Expression Data .................................................................................. 209

L.K. Flack, University of Queensland, Australia

G.J. McLachlan, University of Queensland, Australia

Chapter XII
Uncovering Fine Structure in Gene Expression Profile by Maximum Entropy Modeling
of cDNA Microarray Images and Kernel Density Methods ............................................................... 221

George Sakellaropoulos, University of Patras, Greece

Antonis Daskalakis, University of Patras, Greece

George Nikiforidis, University of Patras, Greece

Christos Argyropoulos, University of Pittsburgh Medical Center, USA
Chapter XIII
Gene Expression Profiling with the BeadArray™ Platform ............................................................... 239

Wasco Wruck, Max Planck Institute for Molecular Genetics, Germany
Chapter XIV
The Affymetrix GeneChip® Microarray Platform .............................................................................. 251

Djork-Arné Clevert, Charité Universitaetsmedizin Berlin, Germany and Johannes Kepler

University Linz, Austria

Axel Rasche, Max-Planck-Institute for Molecular Genetics, Germany
Chapter XV
Alternative Isoform Detection Using Exon Arrays . ........................................................................... 262

Jacek Majewski, McGill University and Genome Québec Innovation Centre, Canada

David Benovoy, McGill University and Genome Québec Innovation Centre, Canada

Tony Kwan, McGill University and Genome Québec Innovation Centre, Canada
Chapter XVI
Gene Expression in Microbial Systems for Growth and Metabolism ................................................ 278

Prerak Desai, Utah State University, USA

Bart Weimer, Utah State University, USA
Chapter XVII
Alternative Splicing and Disease ........................................................................................................ 291

Heike Stier, Charité Universitaetsmedizin Berlin, Germany

Paul Wrede, Charité Universitaetsmedizin Berlin, Germany

Jürgen Kleffe, Charité Universitaetsmedizin Berlin, Germany
Section V
Systems Biology and Aging
Chapter XVIII
Mathematical Modeling of the Aging Process . .................................................................................. 312

Axel Kowald, Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, Germany

Chapter XIX
The Sebaceous Gland: A Model of Hormonal Aging ......................................................................... 331

Evgenia Makrantonaki, Dessau Medical Center, Germany and Charité Universitaetsmedizin

Berlin, Germany

Christos C. Zouboulis, Dessau Medical Center, Germany and Charité Universitaetsmedizin

Berlin, Germany
Section VI
Systems Biology Applications in Medicine
Chapter XX
Systems Biology Applied to Cancer Research ................................................................................... 339

R. Seigneuric, GROW Research Institute, University of Maastricht, The Netherlands

N.A.W. van Riel, Eindhoven University of Technology, The Netherlands

M.H.W. Starmans, GROW Research Institute, University of Maastricht, The Netherlands

A. van Erk, University of Maastricht, The Netherlands

C.T.A. Evelo, University of Maastricht, The Netherlands

B.G. Wouters, GROW Research Institute, University of Maastricht, The Netherlands

P. Lambin, GROW Research Institute, University of Maastricht, The Netherlands
Chapter XXI
Systems Biology Strategies in Studies of Energy Homeostasis In Vivo ............................................. 354

Matej Orešič, VTT Technical Research Centre of Finland, Finland

Antonio Vidal-Puig, Institute of Metabolic Science, Addenbrooke's Hospital, UK
Chapter XXII
Approaching Type 2 Diabetes Mellitus by Systems Biology ............................................................. 361

Axel Rasche, Max-Planck-Institute for Molecular Genetics, Germany
Chapter XXIII
Systems Biology and Infectious Diseases .......................................................................................... 377

Alia Benkahla, Institut Pasteur de Tunis, Tunisia

Lamia Guizani-Tabbane, Institut Pasteur de Tunis, Tunisia

Ines Abdeljaoued-Tej, ESSAI-UR Algorithmes et Structures, Tunisia

Slimane Ben Miled, Institut Pasteur de Tunis, Tunisia and ENIT-LAMSIN, Tunisia

Koussay Dellagi, Institut Pasteur de Tunis, Tunisia

Chapter XXIV
Systems Biology of Human-Pathogenic Fungi . ................................................................................. 403

Daniela Albrecht, Leibniz Institute for Natural Product Research and Infection Biology
Hans-Knoell-Institute (HKI), Germany

Reinhard Guthke, Leibniz Institute for Natural Product Research and Infection Biology
Hans-Knoell-Institute (HKI), Germany

Olaf Kniemeyer, Leibniz Institute for Natural Product Research and Infection Biology
Hans-Knoell-Institute (HKI), Germany

Axel A. Brakhage, Leibniz Institute for Natural Product Research and Infection Biology
Hans-Knoell-Institute (HKI), Germany and Friedrich Schiller University (FSU), Germany

Volume II
Section VII
Systems Biology and Drug Design
Chapter XXV
Development of Specific Gamma Secretase Inhibitors ...................................................................... 423

Jessica Ahmed, Charité Universitaetsmedizin Berlin, Germany

Julia Hossbach, Charité Universitaetsmedizin Berlin, Germany

Paul Wrede, Charité Universitaetsmedizin Berlin, Germany

Robert Preissner, Charité Universitaetsmedizin Berlin, Germany
Chapter XXVI
In Machina Systems for the Rational De Novo Peptide Design ......................................................... 438

Paul Wrede, Charité Universitaetsmedizin Berlin, Germany
Chapter XXVII
Applications of Metabolic Flux Balancing in Medicine . ................................................................... 458

Ferda Mavituna, The University of Manchester, UK

Raul Munoz-Hernandez, The University of Manchester, UK

Ana Katerine de Carvalho Lima Lobato, Federal University of Rio Grande do Norte, Brazil

and Potiguar University, Brazil
Section VIII
Data Integration and Data Mining
Chapter XXVIII
Multi-Level Data Integration and Data Mining in Systems Biology . ................................................ 476

Roberta Alfieri, CNR - Institute for Biomedical Technologies, Italy and CILEA, Italy

Luciano Milanesi, CNR - Institute for Biomedical Technologies, Italy

Chapter XXIX
Methods for Reverse Engineering of Gene Regulatory Networks ..................................................... 497

Hendrik Hache, Max Planck Institute for Molecular Genetics, Germany
Chapter XXX
Data Integration for Regulatory Gene Module Discovery . ................................................................ 516

Alok Mishra, Imperial College London, UK

Duncan Gillies, Imperial College London, UK
Chapter XXXI
Discrete Networks as a Suitable Approach for the Analysis of Genetic Regulation .......................... 530

Elizabeth Santiago-Cortés, Universidad Nacional Autónoma de México, Mexico

Luis Mendoza, Universidad Nacional Autónoma de México, Mexico
Chapter XXXII
Investigating the Collective Behavior of Neural Networks:
A Review of Signal Processing Approaches . ..................................................................................... 541

A. Maffezzoli, Politecnico di Milano, Italy

F. Esposti, Politecnico di Milano, Italy

M.G. Signorini, Politecnico di Milano, Italy
Chapter XXXIII
The System for Population Kinetics: Open Source Software for Population Analysis ...................... 556

Paolo Vicini, University of Washington, USA
Section IX
Systems Biology in Photochemical Processes
Chapter XXXIV
Photosynthesis: How Proteins Control Excitation Energy Transfer ................................................... 573

Julia Adolphs, Freie Universität Berlin, Germany
Chapter XXXV
Photodynamic Therapy: A Systems Biology Approach . .................................................................... 588

Michael R. Hamblin, Massachusetts General Hospital - Boston, USA; Harvard Medical

School, USA; and Harvard-MIT Division of Health Sciences and Technology, USA
Chapter XXXVI
Modeling of Porphyrin Metabolism with PyBioS .............................................................................. 643

Andriani Daskalaki, Max Planck Institute for Molecular Genetics, Germany

Section X
Modeling Cellular Physiology
Chapter XXXVII
Interference Microscopy for Cellular Studies . ................................................................................... 656

Alexey R. Brazhe, Technical University of Denmark, Denmark and Moscow State University,

Russia

Nadezda A. Brazhe, Technical University of Denmark, Denmark and Moscow State

University, Russia

Alexey N. Pavlov, Saratov State University, Russia

Georgy V. Maksimov, Moscow State University, Russia

Erik Mosekilde, Technical University of Denmark, Denmark

Olga V. Sosnovtseva, Technical University of Denmark, Denmark
Chapter XXXVIII
Fluorescence Imaging of Mitochondrial Long-Term Depolarization in Cancer Cells
Exposed to Heat-Stress ....................................................................................................................... 673

Cathrin Dressler, Laser- und Medizin-Technologie GmbH, Berlin, Germany

Olaf Minet, Charité Universitaetsmedizin Berlin, Germany

Urszula Zabarylo, Charité Universitaetsmedizin Berlin, Germany

Jürgen Beuthan, Charité Universitaetsmedizin Berlin, Germany
Section XI
Tools for Molecular Networks
Chapter XXXIX
Protein Interactions and Diseases ....................................................................................................... 694

Athina Theodosiou, Biomedical Research Foundation of the Academy of Athens, Greece

Charalampos Moschopoulos, Biomedical Research Foundation of the Academy of Athens, Greece

Marc Baumann, Biomedicum, Helsinki University, Finland

Sophia Kossida, Biomedical Research Foundation of the Academy of Athens, Greece

and Biomedicum, Helsinki University, Finland
Chapter XL
The Breadth and Depth of BioMedical Molecular Networks: The Reactome Perspective ................ 714

Bernard de Bono, European Bioinformatics Institute, UK and University of Malta, Malta

Section XII
Mathematical Modeling Approaches
Chapter XLI
Entropy and Thermodynamics in Biomolecular Simulation .............................................................. 731

Jorge Numata, Freie Universität Berlin, Germany
Chapter XLII
Model Development and Decomposition in Physiology .................................................................... 759

Isabel Reinecke, Zuse Institute Berlin, Germany

Peter Deuflhard, Zuse Institute Berlin, Germany; Freie Universität Berlin, Germany;

and DFG Research Center Matheon, Germany
Chapter XLIII
A Pandemic Avian Influenza Mathematical Model ............................................................................ 798

Mohamed Derouich, Faculté des sciences Oujda-Morocco, Morocco

Abdesslam Boutayeb, Faculté des sciences Oujda-Morocco, Morocco
Chapter XLIV
Dengue Fever: A Mathematical Model with Immunization Program . ............................................... 809

Mohamed Derouich, Faculté des sciences Oujda-Morocco, Morocco

Abdesslam Boutayeb, Faculté des sciences Oujda-Morocco, Morocco
Section XIII
Data Processing in Histopathology
Chapter XLV
Automated Image Analysis Approaches in Histopathology ............................................................... 826

Ross Foley, University College Dublin (UCD), Ireland

Matthew DiFranco, University College Dublin (UCD), Ireland

Kenneth Bryan, University College Dublin (UCD), Ireland

Elton Rexhepaj, University College Dublin (UCD), Ireland

Laoighse Mulrane, University College Dublin (UCD), Ireland

R. William Watson, University College Dublin (UCD), Ireland

Pádraig Cunningham, University College Dublin (UCD), Ireland

William M. Gallagher, University College Dublin (UCD), Ireland

Detailed Table of Contents

Foreword . .......................................................................................................................................xxxiii
Preface . ............................................................................................................................................ xxxv
Acknowledgment . ................................................................................................................................ xl
Volume I
Section I
Basic Concepts in Medical Systems Biology
Chapter I
Pathway Biology Approach to Medicine ................................................................................................ 1

Peter Ghazal, University of Edinburgh Medical School, Scotland and Centre for Systems

Biology at Edinburgh, Scotland
Systems biology provides a new approach to studying, analyzing and ultimately controlling biological
processes. Biological pathways represent a key sub-system level of organization that seamlessly perform
complex information processing and control tasks. The aim of pathway biology is to map and understand
the cause-effect relationships and dependencies associated with the complex interactions of biological
networks and systems. Drugs that therapeutically modulate the biological processes of disease are often
developed with limited knowledge of the underlying complexity of their specific targets. Considering the
combinatorial complexity from the outset might help identify potential causal relationships that could
lead to a better understanding of the drug-target biology as well as provide new biomarkers for modeling diagnosis and treatment response in patients. This chapter discusses the use of a pathway biology
approach to modeling biological processes, providing a new framework for experimental medicine in
the post-genomic era.
Chapter II
Systems and Control Theory for Medical Systems Biology . ............................................................... 11

Peter Wellstead, The Hamilton Institute, National University of Maynooth, Ireland

Sree Sreenath, Case Systems Biology Initiative, Case Western Reserve University, USA

Kwang-Hyun Cho, Korea Advanced Institute of Science and Technology (KAIST), Korea

Olaf Wolkenhauer, University of Rostock, Germany

In this chapter the authors describe systems and control theory concepts for systems biology and the
corresponding implications for medicine. The context for a systems approach to the life sciences is
outlined, followed by a brief history of systems and control theory. The technical aspects of systems
and control theory are then described in a way oriented toward their biological and medical application.
This description is then used as a reference base against which to indicate specific areas where systems
and control theory aspects of systems biology have strong medical implications.
Chapter III
Mathematical Description of Time Delays in Pathways Cross Talk . ................................................... 27

S. Nikolov, Institute of Mechanics and Biomechanics, Bulgaria

V. Petrov, Institute of Mechanics and Biomechanics, Bulgaria

V. Kotev, Institute of Mechanics and Biomechanics, Bulgaria

G. Georgiev, Institute of Mechanics and Biomechanics, Bulgaria
In this chapter the authors investigate how the inclusion of time delay alters the dynamic properties of (a)
a delayed protein cross talk model, (b) a time delay model of RNA silencing (RNA interference), and (c) time
delay in the ERK and STAT interaction. The consequences of a time delay on the dynamics of those systems
are analyzed using Hopf's theorem and Lyapunov-Andronov theory. The analytical calculations predict
that time delay acts as a key bifurcation parameter, which is confirmed by numerical simulations.
Chapter IV
Deterministic Modeling in Medicine .................................................................................................... 74

Elisabeth Maschke-Dutz, Max Planck Institute for Molecular Genetics, Germany
This chapter describes the basic mathematical methods for the deterministic kinetic modeling of biochemical systems. Mathematical analysis methods, the respective algorithms, and appropriate tools and
resources, as well as established standards for data exchange, model representations, and definitions
are presented. The methods comprise time-course simulations, steady-state search, parameter scanning,
and metabolic control analysis among others. An application is demonstrated using a test-case model
that describes parts of the extrinsic apoptosis pathway and a small example network demonstrates an
implementation of metabolic control analysis.
Chapter V
Synthetic Biology as a Proof of Systems Biology ................................................................................ 97

Andrew Kuznetsov, Freiburg University, Germany
Biologists have used a reductionist approach to investigate the essence of life. In recent years, scientific
disciplines have merged with the aim of studying life on a global scale in terms of molecules and their
interactions. Based on high-throughput measurements, Systems Biology adopts mathematical modeling and computational simulation to reconstruct natural biological systems. Synthetic Biology seeks to
engineer artificial biological systems starting from standard molecular compounds coding in DNA. Can
Systems and Synthetic Biology be combined with the idea of creating a new science, SYS Biology, that
will not demarcate natural and artificial realities? What will this approach bring to medicine?

Section II
Advanced Computational Methods for Systems Biology
Chapter VI
Computational Models for the Analysis of Modern Biological Data ................................................. 117

Tuan D. Pham, James Cook University, Australia
Computational models have been playing a significant role for the computer-based analysis of biological
and biomedical data. Given the recent availability of genomic sequences, microarray gene expression,
and proteomic data, there is an increasing demand for developing and applying advanced computational
techniques for exploring these types of data. For example, functional interpretation of gene expression
data, deciphering of how genes and proteins work together in pathways and networks, and extracting and
analyzing phenotypic features of mitotic cells for high throughput screening of novel anti-mitotic drugs.
Successful applications of advanced computational algorithms to solving modern life-science problems
will make significant impacts on several important and promising issues related to genomic medicine,
molecular imaging, and the scientific knowledge of the genetic basis of diseases. This chapter reviews
the fusion of engineering, computer science, and information sciences with biology and medicine, to
address some latest technical developments in the computational analyses of modern biological data:
microarray gene expression data, mass spectrometry data, and bioimaging.
Chapter VII
Computer Aided Knowledge Discovery in Biomedicine ................................................................... 126

Vanathi Gopalakrishnan, University of Pittsburgh, USA
This chapter provides a perspective on three important collaborative areas in systems biology research
that represent biological problems of clinical significance. The first area deals with macromolecular crystallization, which is a crucial step in protein structure determination. The second deals with proteomic
biomarker discovery from high-throughput mass spectral technologies, while the third area is protein
structure prediction and complex fold recognition from sequence and prior knowledge of structure properties. For each area, successful case studies are revisited from the perspective of computer-aided knowledge
discovery using machine learning and statistical methods. Information about protein sequence, structure,
and function is slowly accumulating in standardized forms within databases. Methods are needed to
maximize the use of this prior information for prediction and analysis purposes. This chapter provides
insights into such methods by which information available in existing databases can be processed and
combined with systems biology expertise to expedite biomedical discoveries.
Section III
Genomics and Bioinformatics for Systems Biology
Chapter VIII
Function and Homology of Proteins Similar in Sequence: Phylogenetic Profiling . .......................... 143

Thomas Meinel, Max Planck Institute for Molecular Genetics, Germany

The calculation of sequence similarity is an easily feasible way to compute protein comparisons. The
comparison of complete proteomes touches one of the earliest topics in bioinformatics: the biologically meaningful organization of proteins in protein families. Several approaches that interpret function
or evolutionary aspects of proteins from sequence similarity are reviewed. In particular, this reflects
the arsenal of techniques introduced until now. Phylogenetic profiling, a method that compares a set of
genes or proteins by their presence or absence across a given set of organisms, is also presented in this
chapter. Proteins in a functional context, for example, a pathway or a protein complex, are represented by
identical or similar phylogenetic profiles. The detection of functional contexts by phylogenetic profiling
is playing a prospective role as an analytic tool also in systems biology. Already established tools for
phylogenetic profiling as well as particular biological examples based on the SYSTERS protein family
data set are presented.
Chapter IX
Computational Methods for the Prediction of GPCRs Coupling Selectivity ..................................... 167

Nikolaos G. Sgourakis, Rensselaer Polytechnic Institute, USA

Pantelis G. Bagos, University of Central Greece, and University of Athens, Greece

Stavros J. Hamodrakas, University of Athens, Greece
GPCRs comprise a wide and diverse class of eukaryotic transmembrane proteins with well-established
pharmacological significance. As a consequence of recent genome projects, there is a wealth of information at the sequence level that lacks any functional annotation. These receptors, often quoted as orphan
GPCRs, could potentially lead to novel drug targets. However, typical experiments that aim at elucidating
their function are hampered by the lack of knowledge on their selective coupling partners at the interior
of the cell, the G-proteins. Up-to-date computational efforts to predict properties of GPCRs have been
focused mainly on the ligand-binding specificity, while the aspect of coupling has been less studied.
Here, we present the main motivations, drawbacks, and results from the application of bioinformatics
techniques to predict the coupling specificity of GPCRs to G-proteins, and discuss the application of
the most successful methods in both experimental works that focus on a single receptor and large-scale
genome annotation studies.
Chapter X
Bacterial β-Barrel Outer Membrane Proteins: A Common Structural Theme Implicated
in a Wide Variety of Functional Roles ................................................................................................ 182

Pantelis G. Bagos, University of Central Greece, and University of Athens, Greece

Stavros J. Hamodrakas, University of Athens, Greece
β-barrel outer membrane proteins constitute the second and less studied class of transmembrane proteins.
They are present exclusively in the outer membrane of Gram-negative bacteria and presumably in the
outer membrane of mitochondria and chloroplasts. During the last few years, remarkable advances have
been made towards the understanding of their functional and structural features. It is now well-known that
β-barrels are performing a large variety of biologically important functions for the bacterial cell. Such
functions include acting as specific or non-specific channels, receptors for various compounds, enzymes,
translocation channels, structural and adhesion proteins. These functional roles are of great importance
for the survival of the bacterial cell under various environmental conditions or for the pathogenic properties
expressed by these organisms. We review in this chapter the currently available literature regarding
the structure and function of bacterial outer membrane proteins. We emphasize the functional diversity
expressed by a common structural motif such as the β-barrel, and we provide evidence from the current
literature for dozens of newly discovered families of transmembrane β-barrels.
Section IV
Experimental Techniques for Systems Biology
Chapter XI
Clustering Methods for Gene-Expression Data .................................................................................. 209

L.K. Flack, University of Queensland, Australia

G.J. McLachlan, University of Queensland, Australia
Clustering methods are used to place items in natural patterns or convenient groups. They can be used to
place genes into clusters, with the genes placed in clusters having similar expression patterns across the
tissue samples of interest. They can also be used to cluster tissues into groups on the basis of their gene
profiles. Some of the methods used are hierarchical agglomerative clustering, k-means clustering, self-organizing
maps, and model-based methods. This chapter's focus is on using mixtures of multivariate
normal distributions to provide model-based clusterings of tissue samples and of genes.
Chapter XII
Uncovering Fine Structure in Gene Expression Profile by Maximum Entropy Modeling
of cDNA Microarray Images and Kernel Density Methods ............................................................... 221

George Sakellaropoulos, University of Patras, Greece

Antonis Daskalakis, University of Patras, Greece

George Nikiforidis, University of Patras, Greece

Christos Argyropoulos, University of Pittsburgh Medical Center, USA
The presentation and interpretation of microarray-based genome-wide gene expression profiles as
complex biological entities are considered to be problematic due to their featureless, dense nature. Furthermore, microarray images are characterized by significant background noise, but the effects of the
latter on the holistic interpretation of gene expression profiles remain under-explored. We hypothesize
that a framework combining Bayesian methodology for background adjustment in microarray images
with model-free modeling tools, may serve the dual purpose of data and model reduction, exposing
hitherto hidden features of gene expression profiles. Within the proposed framework, microarray image
restoration and noise adjustment is facilitated by a class of prior Maximum Entropy distributions. The
resulting gene expression profiles are non-parametrically modeled by kernel density methods, which not
only normalize the data, but facilitate the generation of reduced mathematical descriptions of biological
variability as mixture models.
Chapter XIII
Gene Expression Profiling with the BeadArray™ Platform ............................................................... 239

Wasco Wruck, Max Planck Institute for Molecular Genetics, Germany

This chapter describes the application of the BeadArray™ technology for gene expression profiling. It
introduces the BeadArray™ technology, shows possible approaches for data analysis, and demonstrates
to the reader how the technology performs in comparison to alternative microarray platforms. With this
technique, results of high quality can be achieved so that many researchers consider employing it for
their projects. It can be expected that it will gain a lot in importance in the future. The author hopes that
this résumé will introduce researchers to this novel way of performing gene expression experiments,
thus giving them a profound base for judging which technology to employ.
Chapter XIV
The Affymetrix GeneChip® Microarray Platform .............................................................................. 251

Djork-Arné Clevert, Charité Universitaetsmedizin Berlin, Germany and Johannes Kepler

University Linz, Austria

Axel Rasche, Max-Planck-Institute for Molecular Genetics, Germany
Readers will find a quick introduction, with recommendations, to the preprocessing of Affymetrix
GeneChip® microarrays. In the rapidly growing field of microarray gene expression, the Affymetrix
GeneChip® arrays in particular are an established technology that has been on the market for over ten years.
Used in biomedical research, the mass of information demands statistics for its analysis. This chapter
presents the particular design of GeneChip® arrays, where much research has already been invested and
some validation resources for the comparison of the methods are available. For a basic understanding
of the preprocessing, we emphasize the steps, namely background correction, normalization, perfect
match correction, and summarization, coupled with alternative probe-gene assignments. Combined with a
recommendation of successful methods, a first use of the new technology becomes possible.
Chapter XV
Alternative Isoform Detection Using Exon Arrays . ........................................................................... 262

Jacek Majewski, McGill University and Genome Québec Innovation Centre, Canada

David Benovoy, McGill University and Genome Québec Innovation Centre, Canada

Tony Kwan, McGill University and Genome Québec Innovation Centre, Canada
Eukaryotic genes have the ability to produce several distinct products from a single genomic locus. Recent
developments in microarray technology allow monitoring of such isoform variation at a genome-wide
scale. In our research, we have used Affymetrix Exon Arrays to detect variation in alternative splicing,
initiation of transcription, and poly-adenylation among humans. We demonstrated that such variation
is common in human populations and has an underlying genetic component. Here, we use our study to
illustrate the use of Exon Arrays to detect alternative isoforms, to outline the analysis involved, and to
point out potential problems that may be encountered by researchers using this technology.
Chapter XVI
Gene Expression in Microbial Systems for Growth and Metabolism ................................................ 278

Prerak Desai, Utah State University, USA

Bart Weimer, Utah State University, USA
Systems biology is increasingly underpinning our concept of microbial physiology. However, the tools
needed for this approach produce such large data sets that we become paralyzed trying to link the data

with the biological interpretation. Often, microbiologists are forced to use unfamiliar statistical tools that
require computer science skills that are beyond our experience. Therefore, the analysis phase prohibits
the full integration of the tools associated with the burgeoning genome sequences that are publicly available. Mining the genomes for hidden gems of metabolic content is on the verge of exploding with new
tools for metabolic flux predictions. However, experimental evidence to verify the models is not keeping
pace. Merging bioinformatics with -omics tools to verify the metabolic models will be highlighted.
The goal of this chapter is to provide an overview of -omics tools to study microbial metabolism that
is accessible to a newcomer to microbial systems biology, yet provide some new information linking growth with
genetic regulation that will appeal to experienced physiologists. A systems biology context
will be the underpinnings of this submission to link the growth (cell division) and survival (non-culturable) with metabolism and metabolic changes.
Chapter XVII
Alternative Splicing and Disease ........................................................................................................ 291

Heike Stier, Charité Universitaetsmedizin Berlin, Germany

Paul Wrede, Charité Universitaetsmedizin Berlin, Germany

Jürgen Kleffe, Charité Universitaetsmedizin Berlin, Germany
Alternative splicing is an important part of the regular process of gene expression. It controls time and
tissue dependent expression of specific splice forms and depends on the correct function of about 60
splicing factor proteins of which many are the product of alternative splicing itself. It is therefore not
surprising that even minor sequence disturbances can cause mis-spliced gene products with pathological effects. We survey some common diseases which can be traced back to a malfunction of alternative
splicing, including cystic fibrosis, beta-thalassemia, spinal muscular atrophy, and cancer. Cancer also
often results from mis-spliced splicing factors themselves, leading to randomly spliced, non-functional isoforms
of several genes.
Section V
Systems Biology and Aging
Chapter XVIII
Mathematical Modeling of the Aging Process . .................................................................................. 312

Axel Kowald, Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, Germany
Aging is a complex biological phenomenon that practically affects all multicellular eukaryotes. It is
manifested by an ever increasing mortality risk, which finally leads to the death of the organism. Modern
hygiene and medicine have led to an amazing increase in average life expectancy over the last 150 years.
However, the underlying biochemical mechanisms of the aging process are still poorly understood. Nevertheless, a better understanding of these mechanisms is increasingly important, since the growing fraction of
elderly people in the population confronts our society with completely new and challenging problems.
The aim of this chapter is to provide an overview of the aging process, discuss how it relates to system
biological concepts, and explain how mathematical modeling can improve our understanding of biochemical processes involved in the aging process. We concentrate on the modeling of stochastic effects

that become important when the number of involved entities (molecules, organelles, cells) is very small
and the reaction rates are low. This is the case for the accumulation of defective mitochondria, which we
describe mathematically in detail. In recent years, several tools became available for stochastic modeling,
therefore we also provide a brief description of the most important ones. Of course, mitochondria are
not the only target of modeling efforts in aging research. Therefore, the chapter concludes with a brief
survey of other interesting computational models in this field of research.
Chapter XIX
The Sebaceous Gland: A Model of Hormonal Aging ......................................................................... 331

Evgenia Makrantonaki, Dessau Medical Center, Germany and Charité Universitaetsmedizin

Berlin, Germany

Christos C. Zouboulis, Dessau Medical Center, Germany and Charité Universitaetsmedizin

Berlin, Germany
This chapter introduces an in vitro model as a means of studying human hormonal aging. For this purpose, human sebaceous gland cells were maintained under a hormone-substituted environment consisting of growth factors and sex steroids in concentrations corresponding to those circulating in young
and postmenopausal women. The authors suggest that hormone decline, occurring with age, may play
a significant role not only in the maintenance of skin homeostasis, but also in the initiation of aging.
Furthermore, skin, the largest organ of the body, offers an alternative approach to understanding the
molecular mechanisms underlying the aging process.
Section VI
Systems Biology Applications in Medicine
Chapter XX
Systems Biology Applied to Cancer Research ................................................................................... 339

R. Seigneuric, GROW Research Institute, University of Maastricht, The Netherlands

N.A.W. van Riel, Eindhoven University of Technology, The Netherlands

M.H.W. Starmans, GROW Research Institute, University of Maastricht, The Netherlands

A. van Erk, University of Maastricht, The Netherlands

C.T.A. Evelo, University of Maastricht, The Netherlands

B.G. Wouters, GROW Research Institute, University of Maastricht, The Netherlands

P. Lambin, GROW Research Institute, University of Maastricht, The Netherlands
Complex diseases such as cancer have multiple origins and are therefore difficult to understand and
cure. Highly parallel technologies such as DNA microarrays are now available. They provide a data
deluge which needs to be mined for relevant information and integrated with existing knowledge at different scales. Systems Biology is a recent field which intends to overcome these challenges by combining
different disciplines and providing an analytical framework. Some of these challenges are discussed in
this chapter.

Chapter XXI
Systems Biology Strategies in Studies of Energy Homeostasis In Vivo ............................................. 354

Matej Orešič, VTT Technical Research Centre of Finland, Finland

Antonio Vidal-Puig, Institute of Metabolic Science, Addenbrooke's Hospital, UK
In this chapter the authors report on their experience with analysis and modeling of data obtained from
studies of animal models related to obesity and metabolic syndrome. The complex interactions of genetic
and environmental factors contributing to the failure of energy balance that lead to obesity, as well as
tight systemic regulation to maintain energy homeostasis, require application of the systems biology
strategy at the physiological level. In vivo systems offer the possibility not only of investigating the effects of specific genetic modifications or treatments in selected tissues and organs, but also of elucidating
compensatory allostatic mechanisms induced to maintain the homeostasis of the whole system. A key
challenge for systems biology is to characterize different systems responses in the context of activated
pathways. One possible strategy is based on reconstruction of tissue specific pathways using lipidomics,
or metabolomics in general, in combination with proteomic and transcriptomic profiles. This approach
was applied to an obese mouse model and revealed activation of multiple liver pathways that may lead
to metabolic products that may impair insulin sensitivity.
Chapter XXII
Approaching Type 2 Diabetes Mellitus by Systems Biology ............................................................. 361

Axel Rasche, Max-Planck-Institute for Molecular Genetics, Germany
New computational and experimental prospects have emerged in the search for insight into, and a cure for, an ancient malady afflicting millions of people. Type 2 diabetes mellitus (T2DM) is a complex disease with a
network of interactions among several tissues and a multifactorial pathogenesis. Research conducted in
humans and in multiple animal models has so far focused strongly on genetics. High-throughput experimentation techniques, such as microarrays, provide new tools to extend current knowledge. By integrating
these results, the aim is to develop a systems biology model that assists diagnosis and treatment. Besides
experimentation techniques, platforms, and general concepts of this new field of biology and
medicine, this chapter connects these concepts with a current medical challenge. It outlines present
results and envisions a possible avenue toward the comprehension of T2DM.
Chapter XXIII
Systems Biology and Infectious Diseases .......................................................................................... 377

Alia Benkahla, Institut Pasteur de Tunis, Tunisia

Lamia Guizani-Tabbane, Institut Pasteur de Tunis, Tunisia

Ines Abdeljaoued-Tej, ESSAI-UR Algorithmes et Structures, Tunisia

Slimane Ben Miled, Institut Pasteur de Tunis, Tunisia and ENIT-LAMSIN, Tunisia

Koussay Dellagi, Institut Pasteur de Tunis, Tunisia
This chapter reports a variety of molecular biology, informatics, and mathematical methods that model
the cell response to pathogens. First, the authors describe the main steps of the immune response, then list the
high-throughput biotechnologies generating a wealth of information on the infected cell, and some of the
immune-related databases. Last, they explain how to extract meaningful information from these sources.
The modeling aspect is divided into modeling molecular interaction and regulatory networks, through
dynamic Boolean and Bayesian models, and modeling biochemical networks and regulatory networks,
through differential/difference equations. The interdisciplinary approach explains how to construct a
model that mimics the cell's dynamics and can predict the evolution and the outcome of infection.
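
As a generic illustration of the differential-equation models mentioned here, and not of the authors' own models, the sketch below integrates a two-variable pathogen/immune-effector system; the model form and all parameter values are assumptions chosen only for illustration.

import numpy as np
from scipy.integrate import solve_ivp

# Illustrative ODE sketch of a pathogen/immune-effector interaction.
# Parameter values and the model form are assumptions, not from the chapter.
r, k, a, d = 0.6, 0.05, 0.02, 0.1    # growth, killing, activation, decay rates

def infection(t, y):
    P, I = y                      # pathogen load, immune effector level
    dP = r * P - k * P * I        # pathogen grows and is killed by effectors
    dI = a * P - d * I            # effectors are induced by the pathogen and decay
    return [dP, dI]

sol = solve_ivp(infection, (0.0, 100.0), [1.0, 0.1], t_eval=np.linspace(0, 100, 201))
print("final pathogen load:", sol.y[0, -1])

Solving such a system for different parameter sets is one simple way to explore how the balance between pathogen growth and immune activation shapes the predicted outcome of infection.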
Chapter XXIV
Systems Biology of Human-Pathogenic Fungi . ................................................................................. 403

Daniela Albrecht, Leibniz Institute for Natural Product Research and Infection Biology
Hans-Knoell-Institute (HKI), Germany

Reinhard Guthke, Leibniz Institute for Natural Product Research and Infection Biology
Hans-Knoell-Institute (HKI), Germany

Olaf Kniemeyer, Leibniz Institute for Natural Product Research and Infection Biology
Hans-Knoell-Institute (HKI), Germany

Axel A. Brakhage, Leibniz Institute for Natural Product Research and Infection Biology
Hans-Knoell-Institute (HKI), Germany and Friedrich Schiller University (FSU), Germany
This chapter describes a holistic approach in order to understand the molecular biology and infection
process of human-pathogenic fungi. It comprises the whole process of analyzing transcriptomic and
proteomic data. Starting with the biological background, information on Aspergillus fumigatus and Candida albicans, two of the most important fungal pathogens, is given. Afterwards, techniques to create
transcriptome and proteome data are described. The chapter continues by explaining methods for data
processing and analysis. It shows the need for, and problems with data integration as well as the role
of standards, ontologies, and databases. General aspects of these three major topics are explained and
connected to the research on human-pathogenic fungi. Finally, the near future of this research topic is
highlighted. This chapter aims to provide an overview of analyses of data from different cellular levels
of human-pathogenic fungi. It describes their integration and the application of systems biology methodologies.

Volume II
Section VII
Systems Biology and Drug Design
Chapter XXV
Development of Specific Gamma Secretase Inhibitors ...................................................................... 423

Jessica Ahmed, Charité Universitaetsmedizin Berlin, Germany

Julia Hossbach, Charité Universitaetsmedizin Berlin, Germany

Paul Wrede, Charité Universitaetsmedizin Berlin, Germany

Robert Preissner, Charité Universitaetsmedizin Berlin, Germany
Secretases are aspartic proteases, which specifically trim important, medically relevant targets like the
amyloid precursor protein (APP) or the Notch receptor. Therefore, changes in their activity can lead to
severe diseases such as Alzheimer's disease, caused by aggregation of peptidic fragments. On the other hand, the
secretases are interesting targets for molecular therapy of multiple myeloma, because in this case the
over-expressed Notch receptor does not attain its native conformation until cleavage by the
gamma secretase occurs. In this chapter, the authors focus on a novel methodology of structure-based
drug development that is feasible without prior knowledge of the target structure: analogy modelling.
This combination of similarity screening, fold recognition, ligand-supported modelling, and docking is
illustrated by example for the structure of the membranal gamma secretase and specific inhibitors.
Chapter XXVI
In Machina Systems for the Rational De Novo Peptide Design ......................................................... 438

Paul Wrede, Charité Universitaetsmedizin Berlin, Germany
Peptides fulfill many tasks in controlling and regulating cellular functions. They are key molecules in
systems biology. There is a great demand in science and industry for a fast search of innovative peptide structures. In this chapter, a combination of a computer-based guided search of novel peptides in
sequence space with their biological experimental validation is introduced. The computer-based search
uses an evolutionary algorithm, including artificial neural networks as the fitness function and a mutation operator, called the PepHarvester. Optimization occurs over 100 iterations. This system, called
DARWINIZER, is applied in the de novo design of neutralizing peptides against autoantibodies from
DCM (dilatative cardiomyopathy) patients. Another approach is the optimization of peptide sequences
by an ant colony optimization process. This biologically-oriented system identified several novel weak
binding T-cell epitopes.
Chapter XXVII
Applications of Metabolic Flux Balancing in Medicine . ................................................................... 458

Ferda Mavituna, The University of Manchester, UK

Raul Munoz-Hernandez, The University of Manchester, UK

Ana Katerine de Carvalho Lima Lobato, Federal University of Rio Grande do Norte, Brazil

and Potiguar University, Brazil
Recently, metabolic flux analysis (MFA) has attracted great interest among researchers in metabolic engineering. The objective of MFA is to identify the factors and mechanisms responsible for improving cell
metabolism or properties. This approach has been widely used for the quantification of intracellular
fluxes in the metabolism of bacteria, yeast, filamentous fungi, and animal cells. The ease of formulation,
versatility in use, and general spectrum of application have made the metabolic flux analysis method a
potential approach for the analysis of metabolic physiology and the design of optimal bioprocesses. The
chapter is divided into six topics: Streptomyces and antibiotic production; review of metabolic engineering
and metabolic flux analysis; basic concepts of metabolic flux analysis; reconstruction of Streptomyces
metabolism; metabolic modelling; and applications of metabolic flux analysis for antibiotic production
in Streptomyces.
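
For readers new to flux balancing, the following sketch shows the core computation on a toy network: at (pseudo-)steady state the stoichiometric matrix S and the flux vector v satisfy S v = 0, so unknown internal fluxes can be estimated from a few measured exchange fluxes. The network, the measured values, and the reaction names are invented purely for illustration.

import numpy as np

# Toy metabolic flux balancing: S @ v = 0 at steady state.
# Metabolites (rows):  A, B
# Reactions (columns): v1: -> A,  v2: A -> B,  v3: B -> product,  v4: B -> byproduct
S = np.array([[ 1, -1,  0,  0],
              [ 0,  1, -1, -1]], dtype=float)

measured = {0: 10.0, 3: 2.0}          # v1 (uptake) and v4 (byproduct) are measured
unknown  = [j for j in range(S.shape[1]) if j not in measured]

# Split S into measured and unknown parts and solve S_u @ v_u = -S_m @ v_m.
S_m = S[:, list(measured)]
S_u = S[:, unknown]
v_m = np.array(list(measured.values()))
v_u, *_ = np.linalg.lstsq(S_u, -S_m @ v_m, rcond=None)

v = np.zeros(S.shape[1])
v[list(measured)] = v_m
v[unknown] = v_u
print("flux vector:", v)              # expected: v2 = 10, v3 = 8
print("residual S @ v:", S @ v)       # should be ~0 at steady state

Real MFA problems are far larger and often under-determined, so they additionally use constraints or an optimization objective, but the balance S v = 0 remains the central relation.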
Section VIII
Data Integration and Data Mining
Chapter XXVIII
Multi-Level Data Integration and Data Mining in Systems Biology . ................................................ 476

Roberta Alfieri, CNR - Institute for Biomedical Technologies, Italy and CILEA, Italy

Luciano Milanesi, CNR - Institute for Biomedical Technologies, Italy

The large number of specialized resources available can be daunting, especially for experimental researchers
who are exploring gene, protein, and pathway data for the first time. This
chapter aims to highlight the complexity of systems biology data, providing an overview of
data integration and mining approaches in the context of systems biology, with specific examples
from the Cell Cycle database and the simulation of cell cycle models.
Chapter XXIX
Methods for Reverse Engineering of Gene Regulatory Networks ..................................................... 497

Hendrik Hache, Max Planck Institute for Molecular Genetics, Germany
In this chapter we discuss and compare different methods and applications for reverse engineering
of gene regulatory networks developed in recent years. Inferring gene networks from different kinds
of experimental data is a challenging task that emerged especially with the development of high-
throughput technologies. Various computational methods based on diverse principles were introduced
to identify new regulations among genes. The mathematical aspects of the models are highlighted and
the applications are mentioned.
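
As a minimal, generic example of one class of reverse-engineering ideas (a correlation-based "relevance network", not any specific method compared in the chapter), the sketch below connects genes whose expression profiles are strongly correlated; the synthetic data and the threshold are assumptions for illustration.

import numpy as np

# Relevance-network sketch: link two genes when their expression profiles are
# strongly correlated across samples.  All data below are synthetic.
rng = np.random.default_rng(0)
n_genes, n_samples = 6, 40
expr = rng.normal(size=(n_genes, n_samples))
expr[1] = 0.8 * expr[0] + 0.2 * rng.normal(size=n_samples)   # gene 1 tracks gene 0
expr[4] = -0.7 * expr[3] + 0.3 * rng.normal(size=n_samples)  # gene 4 anti-correlates with gene 3

corr = np.corrcoef(expr)          # gene-by-gene Pearson correlation matrix
threshold = 0.6                   # assumed cut-off; in practice chosen e.g. by permutation tests
edges = [(i, j, round(corr[i, j], 2))
         for i in range(n_genes) for j in range(i + 1, n_genes)
         if abs(corr[i, j]) >= threshold]
print("inferred undirected edges (gene_i, gene_j, correlation):", edges)

The methods discussed in the chapter go well beyond this, for example by inferring directed or probabilistic regulatory structures, but the input (an expression matrix) and the output (a network of putative regulations) are of the same kind.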
Chapter XXX
Data Integration for Regulatory Gene Module Discovery . ................................................................ 516

Alok Mishra, Imperial College London, UK

Duncan Gillies, Imperial College London, UK
This chapter introduces the techniques that have been used to identify genetic regulatory modules
by integrating data from various sources. Data relating to the functioning of individual genes can be
drawn from many different and diverse experimental techniques. Each piece of data provides information
on a specific aspect of the cell regulation process. The chapter argues that integration of these diverse
types of data is essential in order to identify biologically relevant regulatory modules. A concise review
of the different integration techniques is presented, together with a critical discussion of their pros and
cons. A very large number of research papers have been published on this topic, and the authors hope
that this chapter will present the reader with a high-level view of the area, elucidating the research issues
and underlining the importance of data integration in modern bioinformatics.
Chapter XXXI
Discrete Networks as a Suitable Approach for the Analysis of Genetic Regulation .......................... 530

Elizabeth Santiago-Cortés, Universidad Nacional Autónoma de México, Mexico

Luis Mendoza, Universidad Nacional Autónoma de México, Mexico
Biological systems are composed of multiple interacting elements; in particular, genetic regulatory networks are formed by genes and their interactions mediated by transcription factors. The establishment of
such networks is critical to guarantee the reliability of transcriptional performance in any organism. The
study of genetic regulatory networks as dynamical systems is a helpful methodology to understand the
transcriptional behavior of the genome. From a number of theoretical studies, it is known that networks
present a complex dynamical behavior that includes stability, redundancy, homeostasis, and multistationarity. In this chapter we present some particular biological processes modeled as discrete networks
to show that the theoretical properties of networks have a clear biological interpretation.
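
A tiny, hypothetical example of the discrete-network idea: three genes with assumed Boolean rules, updated synchronously. Enumerating all states exposes the fixed points, the discrete counterpart of the multistationarity mentioned above; the regulatory rules are invented for illustration only.

from itertools import product

# Three-gene synchronous Boolean network with assumed, illustrative rules.
def update(state):
    a, b, c = state
    return (
        a and not c,        # A stays on unless repressed by C
        a or b,             # B is activated by A or sustains itself
        b and not a,        # C is activated by B but repressed by A
    )

# Exhaustively check all 2**3 states for fixed points (steady states).
fixed_points = [s for s in product([False, True], repeat=3) if update(s) == s]
print("fixed points (A, B, C):", fixed_points)

Even this toy network has more than one fixed point, which is how discrete models capture the coexistence of alternative stable expression patterns in a single genotype.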

Chapter XXXII
Investigating the Collective Behavior of Neural Networks:
A Review of Signal Processing Approaches . ..................................................................................... 541

A. Maffezzoli, Politecnico di Milano, Italy

F. Esposti, Politecnico di Milano, Italy

M.G. Signorini, Politecnico di Milano, Italy
In this chapter, the authors review the main methods, approaches, and models for the analysis of neuronal network data. In particular, the analysis concerns data from neurons cultivated on Micro Electrode Arrays
(MEA), a technology that allows the analysis of large ensembles of cells over long recording periods.
The goal is to introduce the reader to the MEA technology and its significance in both theoretical and
practical aspects of neurophysiology. The chapter analyzes two different approaches to MEA data
analysis: statistical methods, mainly addressed to the description of network activity, and system
theory methods, more dedicated to network modeling. Finally, the authors present two original methods,
introduced independently. The first method involves innovative techniques to globally quantify
the degree of synchronization and inter-dependence across the entire neural network. The second is a new
geometrical transformation performing very fast whole-network analysis; this method is useful for singling out collective network behaviors at a low computational cost. The chapter aims
to provide an overview of methods dedicated to the quantitative analysis of neural network activity
measured through MEA technology. Until now, many efforts have been devoted to the biological aspects of this
problem without taking into account the computational and methodological signal processing questions.
This is precisely what we tried to do with our contribution, which we hope can be a starting point for an
interdisciplinary cooperative research approach.
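
For orientation only, and not as one of the two original methods mentioned above, a very simple global synchronization index for multi-electrode data can be computed by binning each electrode's spike train and averaging the pairwise correlations of the binned counts; the spike trains, bin size, and parameters below are synthetic assumptions.

import numpy as np

# Simple global synchronization index for simulated multi-electrode recordings.
rng = np.random.default_rng(42)
n_electrodes, duration, bin_size = 8, 60.0, 0.1          # seconds
common_bursts = np.sort(rng.uniform(0, duration, 30))    # shared network bursts

def spike_train(jitter=0.02, extra=50):
    # Each electrode fires around the shared burst times plus independent noise spikes.
    return np.sort(np.concatenate([
        common_bursts + rng.normal(0, jitter, common_bursts.size),
        rng.uniform(0, duration, extra),
    ]))

bins = np.arange(0, duration + bin_size, bin_size)
counts = np.array([np.histogram(spike_train(), bins=bins)[0] for _ in range(n_electrodes)])

corr = np.corrcoef(counts)                               # electrode-by-electrode correlations
upper = corr[np.triu_indices(n_electrodes, k=1)]         # off-diagonal pairs only
print(f"mean pairwise correlation (synchronization index): {upper.mean():.2f}")

A value near one indicates tightly locked firing across the array, while values near zero indicate largely independent activity; more refined measures address time lags, bursts, and non-stationarity.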
Chapter XXXIII
The System for Population Kinetics: Open Source Software for Population Analysis ...................... 556

Paolo Vicini, University of Washington, USA
Population kinetic analysis (population kinetics) is an increasingly important tool for modeling and
analyzing biomedical kinetic (that is, time-dependent, or time series) data affected by an unfavorable
signal-to-noise ratio and relatively short duration. This chapter describes the philosophy behind the System for Population Kinetics (SPK),
its components, and its current implementation as a Web service available at
http://spk.rfpk.washington.edu. The SPK is first and foremost an open source product, and as such, it builds on the availability of
many open source tools. This in turn allows for a very flexible modular structure and rapid deployment
of new features and user documentation. With the open source release of the SPK, it is our hope that
this software tool will turn into a collaborative effort spanning many user communities and developers
associated with population kinetic analysis.
Section IX
Systems Biology in Photochemical Processes
Chapter XXXIV
Photosynthesis: How Proteins Control Excitation Energy Transfer ................................................... 573

Julia Adolphs, Freie Universität Berlin, Germany

This chapter introduces the theory of optical spectra and excitation energy transfer of light harvesting
complexes in photosynthesis. The light energy absorbed by protein bound pigments in these complexes
is transferred via an exciton mechanism to the photosynthetic reaction center where it drives the photochemical reactions. The protein holds the pigments in optimal orientation for excitation energy transfer
and creates an energy sink by shifting the local transition energies of the pigments. In this way, the
excitation energy is directed with high efficiency (close to 100%) to the reaction center. In the present chapter, this energy transfer is studied theoretically. Based on crystal structure data, the excitonic
couplings are calculated, also taking into account the polarizability of the protein. The local transition
energies are obtained by two independent methods and are used to predict the orientation of the FMO
protein relative to the reaction center.
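
For orientation, the simplest estimate of the excitonic coupling between two pigments is the point-dipole approximation written below; the chapter's calculations go beyond this, including the polarizability of the protein, so the formula is only a hedged reference point.

V_{mn} \approx \frac{f}{4\pi\varepsilon_0}\left(
  \frac{\vec{\mu}_m \cdot \vec{\mu}_n}{R_{mn}^{3}}
  - \frac{3\,(\vec{\mu}_m \cdot \vec{R}_{mn})(\vec{\mu}_n \cdot \vec{R}_{mn})}{R_{mn}^{5}}
\right)

Here \vec{\mu}_m and \vec{\mu}_n are the transition dipole moments of pigments m and n, \vec{R}_{mn} is the vector connecting their centres, and f is an effective factor accounting for screening by the protein environment.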
Chapter XXXV
Photodynamic Therapy: A Systems Biology Approach . .................................................................... 588

Michael R. Hamblin, Massachusetts General Hospital - Boston, USA; Harvard Medical

School, USA; and Harvard-MIT Division of Health Sciences and Technology, USA
This chapter focuses on studies of photodynamic therapy (PDT) that have employed a systems biology approach. Many cell pathways
and signaling systems are engaged after PDT, and although many of these cellular changes have been
elucidated by traditional biochemical and cell biology techniques, the newer '-omics' technologies
are increasingly being brought to bear on this problem. In particular, these technologies involve the use
of gene-expression micro-arrays. We will cover protective responses induced by PDT that include activation of transcription factors, heat shock proteins, antioxidant enzymes, and antiapoptotic pathways.
Elucidation of these mechanisms might result in the design of more effective combination strategies to
improve the antitumor efficacy of PDT.
Chapter XXXVI
Modeling of Porphyrin Metabolism with PyBioS .............................................................................. 643

Andriani Daskalaki, Max Planck Institute for Molecular Genetics, Germany
Physiological and biochemical evidence indicates that flow of substrates into the porphyrin pathway is
controlled by the synthesis of δ-aminolevulinic acid (ALA), the first committed precursor of the porphyrin pathway. The basis of the selectivity of ALA-based photodynamic therapy (PDT) or photodynamic diagnosis (PDD) has been correlated with the metabolic rate of the cells or with the differential
expression of enzymes along the heme biosynthetic pathway. Although light is required to trigger the
synthesis of ALA and the differentiation of chloroplasts (Reinbothe and Reinbothe, 1996), a feedback
inhibition of ALA synthesis by an end product of the porphyrin pathway is thought to be involved in
the regulation of influx into the pathway. Both the nature of the product and the mechanism involved in
effecting feedback inhibition remain unknown. Thus, the modeling of the porphyrin pathway may fill
this void and allow researchers to address this question of long-standing importance.
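
Purely to illustrate how such end-product feedback inhibition can be written down and explored, the sketch below couples ALA synthesis, repressed by a generic end product through a Hill term, to the downstream pathway. As the abstract states, the actual inhibitory product and mechanism are unknown, so every species, rate law, and parameter here is an assumption.

import numpy as np
from scipy.integrate import solve_ivp

# Hedged sketch of feedback-inhibited ALA synthesis: ALA (A) production is
# repressed by a generic end product (P) of the porphyrin pathway.
v_syn, K, n = 1.0, 0.5, 2.0      # maximal synthesis rate, inhibition constant, Hill coefficient
k_conv, k_use = 0.2, 0.1         # ALA -> end product conversion, end-product consumption

def porphyrin(t, y):
    A, P = y
    dA = v_syn / (1.0 + (P / K) ** n) - k_conv * A   # synthesis repressed by P
    dP = k_conv * A - k_use * P                      # P made from A, then consumed
    return [dA, dP]

sol = solve_ivp(porphyrin, (0.0, 200.0), [0.0, 0.0], t_eval=np.linspace(0, 200, 401))
print("steady-state ALA and end product:", sol.y[0, -1], sol.y[1, -1])

Fitting such a model to measurements and comparing alternative candidate inhibitors is the kind of question a pathway model like the one described in the chapter is meant to help answer.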

Section X
Modeling Cellular Physiology
Chapter XXXVII
Interference Microscopy for Cellular Studies . ................................................................................... 656

Alexey R. Brazhe, Technical University of Denmark, Denmark and Moscow State University,

Russia

Nadezda A. Brazhe, Technical University of Denmark, Denmark and Moscow State

University, Russia

Alexey N. Pavlov, Saratov State University, Russia

Georgy V. Maksimov, Moscow State University, Russia

Erik Mosekilde, Technical University of Denmark, Denmark

Olga V. Sosnovtseva, Technical University of Denmark, Denmark
This chapter describes the application of interference microscopy and double-wavelet analysis to the
non-invasive study of cell structure and function. We present different techniques of phase and interference microscopy and discuss how variations in the intrinsic optical properties of a cell can be related to
the intracellular processes. Particular emphasis is given to the newly developed phase modulation laser
interference microscope. We show how this setup, combined with wavelet analysis of the obtained data
series, can be applied to live cell imaging and to investigate the rhythmic intracellular processes and
their mutual interactions. We hope that the discussion will contribute to the understanding and learning
of new methods for non-invasive investigation of intracellular processes.
Chapter XXXVIII
Fluorescence Imaging of Mitochondrial Long-Term Depolarization in Cancer Cells
Exposed to Heat-Stress ....................................................................................................................... 673

Cathrin Dressler, Laser- und Medizin-Technologie GmbH, Berlin, Germany

Olaf Minet, Charité Universitaetsmedizin Berlin, Germany

Urszula Zabarylo, Charité Universitaetsmedizin Berlin, Germany

Jürgen Beuthan, Charité Universitaetsmedizin Berlin, Germany
This chapter deals with the stress response of mitochondria to heat, which is the central agent of thermotherapy. Thermotherapies function by inducing lethal heat inside target tissues. Spatial and temporal
instabilities of temperature distributions in targets require optimized treatment protocols and reliable
temperature-control methods during thermotherapies. Since solid cancers present predominant targets
to thermotherapy, we analyzed hyperthermic stress-induced effects on mitochondrial transmembrane
potentials in breast cancer cells (MX1). Heat sensitivities and stress reactions may differ greatly among tissue species and tissue dignities; it is therefore very important to investigate tissue-specific
stress responses systematically. Even though this chapter contributes only a small piece to the
elucidation of systemic cellular heat-stress mechanisms, it may help to deepen the basic
knowledge about systemic stress responses. In addition, the data presented here might support the optimization
of treatment protocols applied during thermotherapy, particularly LITT and hyperthermia.

Section XI
Tools for Molecular Networks
Chapter XXXIX
Protein Interactions and Diseases ....................................................................................................... 694

Athina Theodosiou, Biomedical Research Foundation of the Academy of Athens, Greece

Charalampos Moschopoulos, Biomedical Research Foundation of the Academy of Athens, Greece

Marc Baumann, Biomedicum, Helsinki University, Finland

Sophia Kossida, Biomedical Research Foundation of the Academy of Athens, Greece

and Biomedicum, Helsinki University, Finland
The direct connection of proteomics with human diseases is now unquestionable, and proteomics has
become a field of great research interest. In this chapter, the authors present a detailed description
of the nature of protein interactions and describe the most important methodologies that are being
used for their detection. They review the mechanisms leading to diseases that involve protein
interactions, and refer to specific diseases such as Huntington's disease and cancer. Finally, they give an
overview of the most popular computational methods used for the prediction or treatment
of these diseases.
Chapter XL
The Breadth and Depth of BioMedical Molecular Networks: The Reactome Perspective ................ 714

Bernard de Bono, European Bioinformatics Institute, UK and University of Malta, Malta
From a genetic perspective, disease can be interpreted in terms of a variation in molecular sequence or
expression (dose) that impairs normal physiological function. To understand thoroughly the knock-on
effect such pathological changes may have, it is crucial to map out the physiological relationships that affected genes maintain with their functional neighbors. The goal of the Reactome project is to build such
a network knowledgebase for all human genes. Constructing a map of such extent and scope requires a
considerable range of expertise, so this project collaborates with field experts to integrate their pathway
knowledge into a single quality-checked human model. This resource dataset is systematically cross-referenced to major molecular and literature databases, and is accessible to the community in a number
of well-established formats. As an evolving network systems resource, Reactome is also starting to
provide increasingly powerful and robust tools to investigate tissue-specific biology and steer targeted
drug design.
Section XII
Mathematical Modeling Approaches
Chapter XLI
Entropy and Thermodynamics in Biomolecular Simulation .............................................................. 731

Jorge Numata, Freie Universität Berlin, Germany
Thermodynamics is one of the best established notions in science. Some recent work in biomolecular
modeling has sacrificed its rigor in favor of trendy empirical methods. Even in cases when physics-based
energy functions are used, entropy is forgotten or left for later versions. This text gives an overview of
the utility of a more rigorous treatment of thermodynamics at the molecular level to understand protein
folding and receptor-ligand binding. It begins with an intuitive explanation of thermodynamics: enthalpy
is the quantity of energy, while entropy stands for its quality. Recent advances in entropy from information theory and physical chemistry are outlined as they apply to biological thermodynamics. A reliable
calculation of equilibrium constants for elementary reactions among biochemical metabolites and kinetic
rates of enzymes from first principles would be an invaluable advance for the field of systems biology.
The methods presented in this chapter carry such potential.
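
The textbook relations behind this programme can be stated compactly; they are standard thermodynamics and information theory, not results specific to the chapter:

\Delta G = \Delta H - T\,\Delta S, \qquad
K_{\mathrm{eq}} = \exp\!\left(-\frac{\Delta G^{\circ}}{RT}\right), \qquad
S = -k_B \sum_i p_i \ln p_i

Here the enthalpy change ΔH measures the quantity of energy, the entropy term TΔS its quality in the sense used above, and a reliable standard free energy change ΔG° for an elementary reaction directly yields its equilibrium constant; the last expression is the Gibbs/Shannon form of entropy that links the physical and information-theoretic viewpoints.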
Chapter XLII
Model Development and Decomposition in Physiology .................................................................... 759

Isabel Reinecke, Zuse Institute Berlin, Germany

Peter Deuflhard, Zuse Institute Berlin, Germany; Freie Universität Berlin, Germany;

and DFG Research Center Matheon, Germany
This chapter presents some model development concepts for mathematical modeling in physiology, as well as
a graph-theoretical decomposition technique that simplifies parameter estimation.
These methods are presented on the basis of a complex mathematical model for the human menstrual
cycle. First, some modeling fundamentals are introduced and applied to the model development of the
human menstrual cycle. Then it is shown how a complex mathematical model in physiology can be
handled if a large number of parameters are used in the model and most of the parameter values are not
known. A method is presented to divide the model into smaller, disjoint model parts in order
to simplify parameter estimation. At the same time, it is shown how this technique works in the case of
the human menstrual cycle. The principles for model development and decomposition can be used for
other physiological models as well.
Chapter XLIII
A Pandemic Avian Influenza Mathematical Model ............................................................................ 798

Mohamed Derouich, Faculté des sciences Oujda-Morocco, Morocco

Abdesslam Boutayeb, Faculté des sciences Oujda-Morocco, Morocco
Worldwide, seasonal outbreaks of influenza affect millions of people, killing about 500,000 individuals
every year. Human influenza viruses are classified into three serotypes: A, B, and C. Only influenza
A viruses can infect and multiply in avian species. During the last decades, important avian influenza
epidemics have occurred. So far, the epidemics among birds have been transmitted to humans, but the
most feared problem is the risk of pandemics that may be caused by person-to-person transmission.
The present mathematical model deals with the dynamics of human infection by avian influenza both
in birds and in humans. Stability analysis is carried out and the behavior of the disease is illustrated by
simulation with different parameter values.
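
As a generic illustration of this kind of compartmental model, and not of the specific system analysed in the chapter, the sketch below couples SI dynamics in a bird population to SIR dynamics in humans, with transmission to humans from infected birds only; the compartments and all parameter values are assumptions.

import numpy as np
from scipy.integrate import solve_ivp

# Hedged bird-to-human avian influenza sketch (no person-to-person spread).
beta_b, mu_b = 0.4, 0.1        # bird-to-bird transmission, bird removal rate
beta_bh, gamma_h = 0.05, 0.2   # bird-to-human transmission, human recovery rate

def model(t, y):
    Sb, Ib, Sh, Ih, Rh = y
    new_bird_cases  = beta_b  * Sb * Ib
    new_human_cases = beta_bh * Sh * Ib       # humans are infected by birds only
    return [-new_bird_cases,
            new_bird_cases - mu_b * Ib,
            -new_human_cases,
            new_human_cases - gamma_h * Ih,
            gamma_h * Ih]

y0 = [0.99, 0.01, 1.0, 0.0, 0.0]              # fractions of each population
sol = solve_ivp(model, (0.0, 120.0), y0, t_eval=np.linspace(0, 120, 241))
print("final fraction of humans ever infected:", sol.y[3, -1] + sol.y[4, -1])

Adding a person-to-person transmission term to such a model is the natural way to explore the pandemic scenario feared in the abstract, and stability analysis of its equilibria tells whether an introduced infection dies out or spreads.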
Chapter XLIV
Dengue Fever: A Mathematical Model with Immunization Program . ............................................... 809

Mohamed Derouich, Faculté des sciences Oujda-Morocco, Morocco

Abdesslam Boutayeb, Faculté des sciences Oujda-Morocco, Morocco

Dengue fever is a re-emergent disease affecting more than 100 countries. Its incidence has increased
fourfold since 1970, and nearly half the world population is now at risk. In the present chapter, a mathematical model with immunization is proposed to simulate the succession of two epidemics with variable human populations. Stability analysis of the equilibrium points is carried out and simulations are given for
different parameter settings.
Section XIII
Data Processing in Histopathology
Chapter XLV
Automated Image Analysis Approaches in Histopathology ............................................................... 826

Ross Foley, University College Dublin (UCD), Ireland

Matthew DiFranco, University College Dublin (UCD), Ireland

Kenneth Bryan, University College Dublin (UCD), Ireland

Elton Rexhepaj, University College Dublin (UCD), Ireland

Laoighse Mulrane, University College Dublin (UCD), Ireland

R. William Watson, University College Dublin (UCD), Ireland

Pádraig Cunningham, University College Dublin (UCD), Ireland

William M. Gallagher, University College Dublin (UCD), Ireland
The field of histopathology has encountered a key transition point, with the progressive move towards
the use of digital slides and automated image analysis approaches. This chapter discusses the various
methods and techniques involved in the automation of image analysis in histopathology. Important
concepts and techniques are explained in the five main areas of workflow within image analysis in
histopathology: data acquisition, the digital image, image pre-processing, segmentation, and machine
learning. Furthermore, examples of the application of these concepts and techniques in histopathological research are then given.
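
A minimal sketch of the segmentation step in such a workflow, using a generic scikit-image sample picture as a stand-in for a digital slide; the threshold choice, size filter, and extracted features are illustrative assumptions, not the chapter's methods.

from skimage import data, filters, measure

# Segmentation and per-object feature extraction on a placeholder image.
image = data.coins()                            # stand-in for a scanned slide region

threshold = filters.threshold_otsu(image)       # global Otsu threshold
binary = image > threshold                      # foreground mask (e.g. nuclei/objects)
labels = measure.label(binary)                  # connected-component labelling

# Simple per-object features that a machine-learning step could consume.
features = [(region.label, region.area, round(region.eccentricity, 2))
            for region in measure.regionprops(labels) if region.area > 50]
print(f"{len(features)} objects detected; first few:", features[:3])

In a real histopathology pipeline these features, computed over many annotated slides, become the input to the machine-learning stage mentioned above.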


Foreword

Systems biology has been in the focus of intense public and private research in recent years evoking
high expectations and hopes with regard to the solution of emerging problems in the health care sector.
The notion of systems biology is rather broad, agglomerating mathematical and computational methods,
experimental techniques and biomedical applications. At its core, systems biology aims at the explanation of physiology and disease from the level of interacting components such as molecular pathways,
regulatory networks, cells, organs and, ultimately, the entire organism. This is complementary to the
single-protein (single-target) approach that had been the primary research paradigm for a long time, condensed
in the term 'reductionism'. However, with the increasing amount and heterogeneity of data generated
by modern experimental techniques and the increasing power of computational hardware, it has become
evident that such reductionism can no longer be maintained as the primary research paradigm.
The essential tool of systems biology research is the computer. With the use of computational methods systems biology aims at an understanding of biological processes using mathematical models of
different granularity. The purpose of these models is the generation of in silico predictions, for example
on the state of a particular disease or the effect of the therapy on the individual patient. With the use of
data integration methods systems biology utilises a comprehensive experimental read-out on different
levels of cellular information in order to fit the parameters in these models. And, finally, at the intersection of several key research disciplines systems biology links these mathematical models to practically
relevant research questions and contributes to the generation and testing of hypotheses and the planning
of experiments. These new approaches are about to revolutionize our knowledge on disease mechanisms
and on the interpretation of data from high-throughput technologies.
Systems biology approaches are necessary in several respects. First, with respect to the increasing
complexity of research, it is very likely that systems biology contributes to the formulation and solution
of new paradigms able to describe the underlying complex biological problems. Secondly, with respect
to the increasing complexity of experimental techniques, new problems arise that must be solved by
integration rules. For example, in practice several laboratories often work with different experimental techniques on the same research question. A fundamental challenge is thus to search through
the exhaustive set of data and extract meaningful information. Thirdly, with respect to these increasing
demands on the mathematical modelling it becomes more and more evident that the development of
computational modelling approaches itself must be connected much more closely to the experimental observations in order to prove the usefulness and relevant predictive power of these approaches.
Having acknowledged that systems biology holds such high promises for future biomedical research,
I am delighted to write the Foreword to this Handbook of Research on Systems Biology Applications
in Medicine, as its scope and content provide both students and researchers from various disciplines
with a broad introduction to systems biology methodologies and show their usefulness in a multitude
of applications.

The book approaches systems biology on a rather practical level from two directions: an experimental and a methodological one.
The experimental path contributes articles that highlight applications in important problem domains,
for example human diseases such as cancer, type-2 diabetes mellitus, infectious diseases, influenza and
ageing, among others, as well as in specific processes such as apoptosis and photosynthesis and with
respect to specific experimental techniques such as chip analyses, interference microscopy, protein-protein interactions, synthetic biology, de novo peptide design and photodynamic therapy. The reader
approaching the book from this path will find in-depth descriptions of these biological phenomena, of
the practical problems in analysing these phenomena along with a description of computational solutions for these analyses.
The methodological path contributes articles that describe multivariate statistical methods such as
clustering, gene expression analysis, normalisation methods as well as analysis methods for kinetic
models such as metabolic flux analysis and metabolic control analysis, among others, and, additionally, data
integration methods in terms of experimental data, pathways and mathematical models. The reader
approaching the book from this path will find introduction and description of relevant computational
methods and a demonstration of how these methods are applied to practical problems. Taken together, both
paths allow researchers from different disciplines to catch a common basis.
In summary, by presenting such a broad mixture of articles the book gives relevant insights into the
different research disciplines that are touched by systems biology such as mathematics, biology, chemistry, medicine and information theory. Students and researchers from these fields might get interested
in a specialisation towards this new discipline. On the other hand, researchers coming from the systems
biology field will get valuable information on real practical problems and potential approaches to these
problems that could benefit from systems biology methods. Thus, the book gives both sides a good
starting point to walk on further on this fascinating new road.
Ralf Herwig
Group Leader, Max-Planck-Institute for Molecular Genetics
March 2008
Ralf Herwig studied physics and mathematics at the Technical and the Free University Berlin and finished his PhD in 2001
on clustering methods for gene expression data. He was awarded the Heinz Billing Prize for Scientific Computation of
the Max Planck Society in 1999 and was an honor student of the American Academy of Achievement in 2000. Since 2001 he
has been a group leader at the Max-Planck-Institute for Molecular Genetics. His research focuses on multivariate statistical methods,
data integration systems and computational modelling. Ralf Herwig has contributed to 50 scientific publications and was coauthor of the first textbook on systems biology in 2005.


Preface

Systems biology integrates theoretical and experimental research and applies to various areas related to
medicine. However, little is known in the medical profession about the theories and techniques
behind systems biology.
In the future, as systems biology techniques progress, it may become possible to study complex
diseases at a multitude of levels within the cell, from transcriptional changes, to changes in metabolic
flux through genetic pathways.
Complete genome sequencing of hundreds of pathogenic and model organisms in the past decade
has provided the information required for studies of gene function. Functional genomics and proteomics
approaches, when combined with computational biology and the emerging discipline of systems biology,
finally allow us to begin comprehensive mapping of cellular and molecular networks and pathways.
However, one of the main difficulties we still face is how best to integrate these disparate data sources
and use them to better understand, diagnose, and treat biological systems during disease.
The systems biology approach of integrating protein expression data with clinical data such as histopathology, clinical functional measurements, medical imaging scores, patient demographics, and clinical
outcome provides a powerful tool for linking biomarker expression with biological processes that can
be segmented and linked to disease presentation.
Systems biology and new technologies enable predictive and preventative medicine. This biology is
revolutionizing the field of medical research and creating a new breed of medical researchers. Systems
biology yields insights that may aid in the treatment of cancer by combining different disciplines and
providing an analytical framework. The existence of heterogeneity of treatment effects is apparent when
evaluating patient response to a drug in clinical trials and in clinical practice. Adverse drug reactions
are being linked to enzymatic deficiencies or mutations. Therefore one of the great challenges for 21st
century medicine is to deliver effective therapies that allow clinicians to choose the correct drug, dose,
or intervention for any patient before the start of therapy. (Meyer and Zanger, 1997; Eichelbaum and
Burk, 2001; Srivastava, 2003; Nicholson, 2006).
The creation of detailed maps of signaling networks, that is, of linkages between various pathways of genes
and proteins that resemble complicated wiring diagrams (Hahn and Weinberg, 2002), provides
a better understanding of the disease at a molecular level. Many diseases can be explained by defects
in pathways, and new treatments often involve finding drugs that correct those defects. This approach
can result in a more individualized, and potentially more effective, approach to diagnosis and treatment.
Some of these challenges, as well as the development of systems biology techniques and platforms for
translating genomic and pathway research into clinical healthcare, are discussed in this handbook.
Systems biology provides us with a common language for both describing and modelling the integrated action of regulatory networks at many levels of biological organization from the subcellular
through the cell, tissue, and organ, right up to the whole organism. Molecular epidemiology concerns
the measurement of the fundamental biochemical factors that underlie population disease demography

and understanding the health of different nations. This subject naturally lends itself to systems biology approaches. Hence, systems biology is certain to play a major role in the future of both the development
of personalized medicine and in molecular epidemiological studies.
To access the latest research related to the applications of systems biology in medicine, I decided
one year ago to launch a handbook project where researchers from all over the world would assist me in
providing the necessary coverage of each respective discipline in systems biology. The primary objective of this project was to define the technologies, terms, and acronyms related to the systems biology
and its medical applications.
The handbook will highlight the use of systems approaches including genomic, cellular, proteomic,
metabolomic, bioinformatic, molecular, and biochemical approaches, to address fundamental questions in complex
diseases like cancer and diabetes, as well as in ageing.

Organisation of the Handbook


The handbook is roughly divided into 13 sections.
Section I, Basic Concepts in Medical Systems Biology, introduces the readers to some basic concepts
in the field of systems biology, and the systematic study of complex interactions in biological systems
in order to understand better the entirety of processes that happen in diseases such as cancer and diabetes.
A cellular network can be modeled mathematically using methods coming from chemical kinetics and
control theory. Due to the large number of parameters, variables and constraints in cellular networks,
numerical and computational techniques are often used. This section contains five chapters. Chapters I and
II introduce the basic concepts in medical systems biology. Chapter I discusses the use of a pathway biology approach to modelling biological processes, providing a new framework for experimental medicine.
Chapter II presents principles used in medical systems biology and describes systems and control theory
concepts for systems biology and the corresponding implications for medicine. Chapter III describes the
inclusion of time delay in pathway cross talk models. Chapter IV explains how deterministic modelling
is applied to systems biology. Chapter V introduces Synthetic Biology, as an engineering approach to
Systems Biology.
Section II, Advanced Computational Methods for Systems Biology, serves as a comprehensive introduction to advanced
computational methods supporting systems biology research. Chapter VI describes technical developments in the
computational analyses of modern biological data: microarray gene expression data, mass spectrometry
data, and bioimaging. Chapter VII provides a perspective on three important collaborative areas in systems
biology research, macromolecular crystallization, proteomic biomarker discovery from high-throughput
mass spectral technologies and protein structure prediction and complex fold recognition.
Section III, Genomics and Bioinformatics for Systems Biology, provides examples of genomics
and bioinformatics applications supporting systems biology research. Chapter VIII describes methods
for sequence similarity calculation as well as detection of functional contexts by phylogenetic profiling.
Chapter IX provides insight into the application of computational tools to calculate the coupling specificity of important receptors like G-protein coupled receptors (GPCRs), which could present novel drug
targets. Chapter X describes the importance of identifying bacterial β-barrel outer membrane proteins
in the completely sequenced genomes as these proteins could serve as potential targets for drugs or
vaccines.
Section IV describes Experimental Techniques for Systems Biology. Clustering methods are used
to study specific problems in genomics, such as the analysis of time-course experiments. Therefore,

Chapter XI is focused on model-based clustering of tissue samples and of genes. Chapter XII proposes
a novel theoretical framework for data and model reduction of gene expression profiles generated by
microarray experiments. Chapter XIII introduces the BeadArray™ technology for gene expression profiling, shows possible approaches for data analysis, and demonstrates to the reader how the technology
performs in comparison to alternative microarray platforms. The authors in Chapter XIV provide a basic
understanding of gene expression data processing with the Affymetrix technology. Chapter XV illustrates the use of Exon Arrays to detect alternative isoforms, and points out potential problems that may
be encountered by researchers using this technology. Chapter XVI focuses on microbial metabolism
from a systems biology perspective. Chapter XVII introduces the aspects of alternative splicing in human disease, and its investigation by means of computational large-scale analyses.
In Section V, Systems Biology and Aging, Chapter XVIII provides an overview of the aging process,
discusses how it relates to systems biological concepts, and explains how mathematical modelling can
improve our understanding of the biochemical processes involved in aging. In addition, Chapter XIX introduces an in vitro model as a means of studying human hormonal aging.
Section VI, Systems Biology Applications in Medicine, focuses on the topics most relevant to applications of systems biology in medical problems. Chapter XX discusses how systems biology, by combining different disciplines and providing an analytical framework, can be applied to cancer research. In Chapter XXI the authors
report on their experience with analysis and modelling of data obtained from studies of animal models
related to obesity and metabolic syndrome. Chapter XXII describes modelling approaches in Type 2
Diabetes mellitus. The authors in Chapter XXIII list the high-throughput biotechnologies generating a
wealth of information on the infected cell, as well as some of the immune-related databases, and finally explain
how to extract meaningful information from these sources. Chapter XXIV describes the integration of
data from different cellular levels of human-pathogenic fungi and the application of systems biology
methodologies.
Section VII, Systems Biology and Drug Design, provides a thorough overview of novel methodologies in medical research. Chapter XXV focuses on a novel methodology of structure-based drug
development feasible without prior knowledge of the target structure: analogy modelling. Chapter XXVI
introduces a combination of a computer-based guided search of novel peptides in sequence space with
their biological experimental validation. Chapter XXVII describes metabolic modelling and applications
of metabolic flux analysis for antibiotic production.
Section VIII, Data Integration and Data Mining, confers an understanding of data integration processes in systems biology. Chapter XXVIII describes data integration and data mining techniques in the
context of systems biology studies. Chapter XXIX compares different methods and applications for reverse
engineering of gene regulatory networks developed in recent years. Data relating to the functioning of
individual genes can be drawn from many different and diverse experimental techniques. Chapter XXX
introduces the techniques that have been used to identify genetic regulatory modules by integrating data from various sources. Chapter XXXI presents some particular biological processes modeled as
discrete networks to show that the theoretical properties of networks have a clear biological interpretation. In Chapter XXXII the authors review the main methods, approaches, and models for the analysis of
neuronal network data. Chapter XXXIII describes the philosophy behind the SPK, a tool for population
kinetic analysis (population kinetics) used for modelling and analyzing biomedical kinetic data, along with its components
and its current implementation as a web service.
The evolution of photosynthesis is driven by selection of genes for photochemical energy conversion that is robust and yet effective in a fluctuating light environment. The dynamic regulation of
photosynthesis relies on an interplay of multiple sensory, transmission and executive modules that can
be studied by the tools of systems biology (Csete and Doyle 2002). Therefore, Section IX, Systems
Biology in Photochemical Processes is focused on Photochemistry and its applications, highlighting

the new understanding of the genetics of PDT. Chapter XXXIV introduces the theory of optical spectra
and excitation energy transfer of light harvesting complexes in photosynthesis. Chapter XXXV focuses
on studies of Photodynamic Therapy (PDT) that have employed a systems biology approach. Both the
nature of the product and the mechanism involved in effecting feedback inhibition remain unknown. Thus,
the modelling of the porphyrin pathway introduced in Chapter XXXVI may fill this void and allow researchers
to address these questions of importance.
Section X, Modeling Cellular Physiology, deals with the study of cellular microstructures in biology by microscopy. Chapter XXXVII describes the application of interference microscopy and double-wavelet analysis to the non-invasive study of cell structure and function. Chapter XXXVIII deals with
the stress response of mitochondria to heat, which is the central agent of thermotherapy.
Section XI, Tools for Molecular Networks, includes two chapters. Chapter XXXIX reviews the
mechanisms leading to diseases that involve protein interactions and refers to specific diseases such as
Huntington's disease and cancer. Chapter XL introduces the network systems resource Reactome, which is
also starting to provide powerful and robust tools to investigate tissue-specific biology and steer targeted
drug design.
Section XII, Mathematical Modeling Approaches, includes mathematical modeling that allows us
to link epidemiology, physiology and physics to systems biology. Chapter XLI gives an overview of
the utility of a more rigorous treatment of thermodynamics at the molecular level to understand protein
folding and receptor-ligand binding. Chapter XLII introduces some modelling fundamentals that are
applied to the model development of the human menstrual cycle. Chapter XLIII describes a mathematical model which deals with the dynamics of avian influenza infection both in birds and in
humans. Chapter XLIV presents a mathematical model with immunization to simulate the
succession of two epidemics with variable human populations.
Section XIII, Data Processing in Histopathology, presents automated image analysis approaches
which can serve as a valuable aide to clinical pathologists and systems biology researchers in the domain
of histopathology. Chapter XLV discusses some of the most important techniques and gives examples
of their use in the area to date.
The handbook covers basic biological and mathematical concepts important for systems biology.
The chapters are oriented to describe the relation between basic science and medical issues. Topics that
are covered in this handbook are (1) the foundations of systems biology and (2) the pathophysiology of complex
diseases and the systems biology approach to therapy.
The Handbook of Research on Systems Biology Applications in Medicine contains over three hundred
pages of information and more than one hundred figures. Besides the traditional text, this information source also has a glossary of terms and definitions, contributions from more than 90 international
experts, in-depth analysis of issues, concepts, new trends, and advanced technologies. This handbook
allows the inclusion of more than 100 high-quality illustrations. While providing the information that
is critical to an understanding of the basics of systems biology, this edition focuses more directly and
extensively than ever on applications of medical systems biology.
The diverse and comprehensive coverage of multiple disciplines in the field of systems biology
in this handbook will contribute to a better understanding of all topics, research, and discoveries in this
evolving, significant field of study. This handbook provides information for science and biotechnology researchers, as well as medical doctors, helping them obtain a greater understanding of the concepts, issues,
problems, trends, challenges and opportunities related to this field of study.
In shaping this book, I committed myself to making the textbook as useful as possible to students and
advanced researchers coping with the demands of modern medical research. I hope this will make the Handbook of Research on Systems Biology Applications in Medicine a helpful tool, not only for the student

who needs an expert source of basic knowledge in systems biology, but also for the advanced medical
researcher who needs clear, concise, and balanced information on which to conduct his or her research.
Thanks to a very hard-working advisory editorial board of scientists, excellent authors who accepted
our invitations, and a very efficient publisher providing clear procedures and practices for a quality
production, readers may now enjoy chapters on some of the major ideas that have concerned systems
biology applications in medicine.
Andriani Daskalaki
Max Planck Institute, Berlin, Germany
July 2008

References
Csete, M. E., & Doyle, J. C. (2002). Reverse engineering of biological complexity. Science, 295, 1664–1669.
Eichelbaum, M., & Burk, O. (2001). CYP3A genetics in drug metabolism. Nat Med, 7, 285–288.
Hahn, W. C., & Weinberg, R. A. (2002). A subway map of cancer pathways. Nature Reviews Cancer.
Klipp, E., Herwig, R., Kowald, A., Wierling, C., & Lehrach, H. (2005). Systems biology in practice: Concepts, implementation and applications. Wiley-VCH.
Meyer, U. A., & Zanger, U. M. (1997). Molecular mechanisms of genetic polymorphisms of drug metabolism. Annu Rev Pharmacol Toxicol, 37, 269–296.
Nicholson, J. K. (2006). Global systems biology, personalized medicine and molecular epidemiology. Mol Syst Biol, 2, 52.
Srivastava, P. (2003). Drug metabolism and individualized medicine. Curr Drug Metab, 4, 33–44.


Acknowledgment

The editor sincerely acknowledges the help of all persons involved in the collation and review process
of this handbook, without whose support the project would not have been satisfactorily completed. Deep
appreciation and gratitude is due to Prof. Dr. Hans Lehrach, Director of the Department of Vertebrate
Genomics (Max Planck Institute for Molecular Genetics), for giving me the opportunity to work in the field of systems biology
and for generating the idea of this book.
I have received generous encouragement and assistance from all staff members of the bioinformatics group. Special thanks go to Dr. Ralf Herwig for his generous encouragement and assistance.
I wish to express my appreciation to my colleagues, who, as experts in their fields, have helped us with
constructive criticism and helpful suggestions. I acknowledge especially the contributions of the following individuals: Prof. Peter Wellstead, Christoph Wierling, and Elisabeth Maschke-Dutz.
Most of the authors of chapters included in this handbook also served as referees for chapters written by other authors. Thanks go to all those who provided constructive and comprehensive reviews.
However, some of the reviewers must be mentioned as their reviews set the benchmark. Reviewers who
provided the most comprehensive, critical and constructive comments include: Dr. Athina Lazakidou
from the University of Piraeus, Dr. Sophia Kossida from the Foundation of Biomedical Research; and
Dr. Cathrin Dressler from Laser- und Medizin-Technologie GmbH, Berlin.
Special thanks also go to the publishing team at IGI Global, whose contributions throughout the
whole process from inception of the initial idea to final publication have been invaluable. In particular
to Julia Mosemann, who continuously prodded via e-mail to keep the project on schedule, and to
Mehdi Khosrow-Pour, whose enthusiasm motivated me to initially accept his invitation for taking on
this project.
I would also like to thank Dr. Sophia Kossida, who read a semi-final draft of the manuscript and provided helpful suggestions for enhancing its content. And last but not least, my father, Dimitrios Daskalaki,
for his unfailing support and encouragement during the months it took to give birth to this book.
In closing, I wish to thank all of the authors for their insights and excellent contributions to this
handbook.
Andriani Daskalaki, PhD
Max Planck Institute, Berlin, Germany
July 2008

Section I

Basic Concepts in Medical


Systems Biology

Chapter I

Pathway Biology Approach


to Medicine
Peter Ghazal
University of Edinburgh Medical School, Scotland,
and Centre for Systems Biology at Edinburgh, Scotland

ABSTRACT
Systems biology provides a new approach to studying, analyzing, and ultimately controlling biological
processes. Biological pathways represent a key sub-system level of organization that seamlessly perform
complex information processing and control tasks. The aim of pathway biology is to map and understand
the cause-effect relationships and dependencies associated with the complex interactions of biological
networks and systems. Drugs that therapeutically modulate the biological processes of disease are often
developed with limited knowledge of the underlying complexity of their specific targets. Considering the
combinatorial complexity from the outset might help identify potential causal relationships that could
lead to a better understanding of the drug-target biology as well as provide new biomarkers for modelling diagnosis and treatment response in patients. This chapter discusses the use of a pathway biology
approach to modelling biological processes and providing a new framework for experimental medicine
in the post-genomic era.

INTRODUCTION
An increasing number of biological experiments and more recently clinical based studies are being
conducted using large-scale genomic, proteomic and metabolomic techniques which generate high-dimensional data sets. Such approaches require the adoption of both hypothesis and data driven strategies
in the analysis and interpretation of results. In particular, data-mining and pattern recognition methodologies have proven particularly useful in this field. The increasing amount of information available from
high-throughput experiments has initiated a move from focussed, single gene and protein investigations

to the study of multiple component interactions. Vitally, when the output from high dimensional data
is integrated with the wealth of information from previously published investigations, the assembly of
known and novel network characteristics is possible. The cause-effect association and annotation of
multiple genes and gene products by such methods can aptly be described as pathway biology.
Pathway data is variously categorised in terms of metabolic pathways, molecular interactions, gene regulatory networks and signalling pathways, which are often represented differently and in isolation.
In recent years there has been an increasing effort in representing biological pathways using computer
science based methodologies. These efforts include databases that aim to curate pathways, such as KEGG
(Kanehisa et al., 2006), Reactome (Joshi-Tope et al., 2005), aMAZE (Lemer et al., 2004) or PATIKA
(Dogrusoz et al., 2005); databases of experimentally and computationally derived protein interactions,
such as MPPI (Pagel et al., 2005) and DIP (Salwinski et al., 2004); or tools that aim to extract pathway
information from the scientific literature, examples include PathwayAssist (Nikitin et al., 2003) and
Ingenuity Pathway Analysis (Ingenuity Systems). However, a common factor in all these resources is
the need to describe pathways visually in order to understand their complexity.
In this regard, a biological interaction network notation based on electronic circuitry diagrams provided an informative approach, aiming to give a compact description of the entity relationships in a pathway (Kohn, 1999; Kohn et al., 2006). Kitano et al. have extended this approach by introducing a simple notational system for state transitions (Funahashi et al., 2003) for the representation of process flow in signalling pathways (Kitano, 2003; Kitano et al., 2005). While considerable advances are being made, there will be an increasing need to ensure that such pathway descriptions remain intuitive to biologists in general, involving logic and, in particular, an integrative view of molecular interactions, gene regulatory networks and signalling pathways, while maintaining compliance with a more formal description of biological processes. This has been the primary motivation behind the Edinburgh Process Notation, which uses and extends core aspects of the Kitano notation (Moodie et al., 2006). Collectively, these efforts have been, and are, part of a community-wide effort to develop graphical notation standards (www.sbgn.org). To date much of this work has focussed on experimental cell-signalling systems, and little work has been conducted on physiological systems or using clinical data.
Here, we discuss a practical guide to a pathway biology approach in medicine. The intended aim is
to help translate systems biology from bench science to medical research and ultimately toward clinical
use. We particularly put forward the use of logic as a strategy for studying pathways and present arguments for the suitability of logical models for the analysis of clinically derived data.

Literature and data mining: stamp collecting


The first task at hand is the acquisition of pathway information. In this regard, information relating to the components and interactions of pathways needs to be compiled, integrated and visualised using research synthesis methodology (Cooper and Hedges, 1994). This generally follows a four-stage process:
i. A literature review should be undertaken to identify relevant pathway components and interactions. This can be performed using standard Entrez PubMed queries involving keywords, author searches and the use of Boolean operators. A variety of tools can also be used to facilitate this process, e.g. PDQ Wizard (Grimes et al., 2006); a minimal query sketch is given after this list. A manual review of the resultant articles is an essential undertaking to ensure relevance and accuracy. The literature set should be classified on whether interactions can be attributed to a species (human or mouse), cell type (macrophage or dendritic cell) and the technique used to identify the interaction.
ii. The next step is to extend the literature search using a data-mining approach. This can use a combination of open resources and proprietary software. These resources include, but are not limited to, KEGG (www.genome.jp/kegg/), HPRD (www.hprd.org), Chilibot (www.chilibot.net), Pathway Assist (www.ariadnegenomics.com) and Ingenuity Pathway Analysis (Ingenuity Systems, USA), and can be used to reinforce established molecular interactions and identify new signalling components or interactions. In general they provide an online resource of curated, text-mined and experimental information on protein-protein and protein-gene interactions for producing networks of molecular interaction.
iii. A graphical representation of the components of the pathway can be achieved with a variety of packages, e.g. yEd (www.yworks.com), EPE (Sorokin et al., 2006) and Cell Designer (Funahashi et al., 2003), preferably using the SBGN notation. It should be noted that the resulting diagrams represent a consensus view of a pathway and should not be taken as a canonical pathway.
iv. Lastly, a database model for the storage, analysis and sharing of pathway data is essential and in certain cases will need to be linked to anonymised clinical data. Curation of pathway information into a database allows for data browsing, querying, analysis and downloading. Several approaches to the storage and analysis of network interaction data have been published (Bader et al., 2003; Hermjakob et al., 2004; Joshi-Tope et al., 2005); however, this area is still in its infancy and will likely see further development in the future. Most notably, none to date have been shown to be extensible to the integration of clinical data. In this regard it is important to note that clinical data from subjects should be stored in separate databases that are fully compliant with confidentiality and data protection law.
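As a concrete illustration of step (i), the short Python sketch below runs a Boolean PubMed query through NCBI Entrez and retrieves abstracts for manual review. It is a minimal sketch only: it assumes the Biopython package is available, and the query terms, retmax value and e-mail address are illustrative placeholders rather than anything prescribed in this chapter.

```python
# Minimal sketch of step (i): querying PubMed via NCBI Entrez to seed a
# pathway literature set. Query terms and e-mail address are placeholders.
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI requires a contact address

# Boolean query combining a pathway, a cell type and a species of interest
query = '("NF-kappa B"[Title/Abstract]) AND (macrophage[Title/Abstract]) AND (human[MeSH Terms])'

handle = Entrez.esearch(db="pubmed", term=query, retmax=50)
record = Entrez.read(handle)
handle.close()

pmids = record["IdList"]
print(f"{record['Count']} articles match; first {len(pmids)} PMIDs retrieved")

# Fetch abstracts for manual review and classification (species, cell type, technique)
handle = Entrez.efetch(db="pubmed", id=",".join(pmids), rettype="abstract", retmode="text")
abstracts = handle.read()
handle.close()
print(abstracts[:500])  # inspect the first few records
```

The retrieved records would then be reviewed manually and classified as described in step (i) before being passed to the data-mining and visualisation steps.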

Figure 1.


Modelling: formalizing dependencies


The above exercise, particularly the graphical representation, provides an excellent starting point for capturing known knowledge around pathways. However, these diagrams remain essentially pictorial in nature and include a varied range of fine- and coarse-grained information. As such they are not yet sufficiently developed to move seamlessly toward mathematical translation. Thus a key challenge for systems-level studies of pathways remains how best to model them, and this is not an insignificant step. In general, pathway modelling requires a representation that captures the cause-effect relationships associated with each interaction, but that is, in the first instance, conceptually and computationally simple. With a simple representation, scanning pathways for the upstream interactions that control downstream behaviour becomes both tractable and computationally efficient.
Several modelling schemes have been proposed as suitable representations for pathways. Ordinary and partial differential equations have proven very successful in modelling metabolic pathways (Kell, 2006; Fell, 1992), although they require precise information on concentrations and reaction rates, which is largely unknown for most pathways. Approaches that deal with sparser or more variable data, such as stochastic schemes, capture very well the random nature of individual molecular events (McAdams, 1997; Arkin, 1998), but they describe the certainty that accompanies large numbers of proteins very inefficiently. Petri nets have also been considered and, whilst the formulation is computationally simple, it is also conceptually complex (Hofestadt, 1998; Kuffner, 2000). Stochastic Petri nets (Goss & Peccoud, 1998) and hybrid Petri nets (Matsuno et al., 2000) have been proposed as extensions. Pi calculus has been used to describe pathways as a symbolic construction to which quantitative models can be fitted (Pinney et al., 2003; Regev et al., 2004). Finally, a particularly tractable approach is the use of logic to describe pathways. In this regard the application of Boolean logic as a modelling scheme is receiving increasing interest (see Watterson et al., 2008).
Thus logical graphical models can provide an ideal first base for abstracting a range of formal mathematical models of a pathway. Ultimately these models will require standards for exchange, such as SBML, and version control (Hucka, 2003). Formal representation of a biologically validated model would follow next. Formal models permit the in silico testing of 'what if' scenarios, and these predictions should go through a cycle of experimental verification. This cycle enables formal models to build on an iterative process of refinement.
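To make the appeal of Boolean logic concrete, the sketch below simulates a toy three-node pathway under synchronous Boolean updating. The wiring (a ligand activating a kinase, which activates a transcription factor that feeds back negatively on the kinase) is purely illustrative and is not taken from any of the pathways discussed in this chapter.

```python
# Minimal sketch of a Boolean (logic) pathway model with synchronous updates.
# The three-node wiring is illustrative only, not a published pathway.

def update(state):
    """One synchronous update of the toy network."""
    ligand, kinase, tf = state["ligand"], state["kinase"], state["tf"]
    return {
        "ligand": ligand,             # treated as a fixed external input
        "kinase": ligand and not tf,  # activated by ligand, inhibited by TF feedback
        "tf":     kinase,             # activated by the kinase
    }

# Simulate from a resting state with the ligand switched on
state = {"ligand": True, "kinase": False, "tf": False}
trajectory = [state]
for _ in range(6):
    state = update(state)
    trajectory.append(state)

for t, s in enumerate(trajectory):
    print(t, {k: int(v) for k, v in s.items()})
# The negative feedback produces a sustained kinase/TF oscillation,
# a qualitative behaviour obtained without any rate constants.
```

Even such a crude logical model exposes the qualitative consequences of the wiring, which is exactly the level of description that is usually recoverable from the literature-derived diagrams.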

Experimentation: singular pathway approach


For reasons of focus and tractability, the above pathway biology approach necessarily takes the position of building a consensus around a single pathway. This has the advantage of defining a sub-system, making for a more efficient approach to both experimental and computational investigation. However, an assumption is that the current state of the literature reflects a sufficient level of understanding of the biology. Thus, while building a graphical representation of a biological pathway from the literature, caution should be taken as to whether the literature represents the true underlying biology. Accordingly, it is important to consider how best to biologically validate or experimentally test the constructed pathways. Since the assembled pathways consist of multiple components, gene- or protein-analytic approaches would be expensive and time consuming. For this reason reliable multiplex high-throughput technologies (such as RNAi or microarrays) would be the method of choice. For example, a microarray approach would be to identify 'active' connected regions of the network that show statistically significant changes in expression under different experimental conditions. For this purpose the method of Ideker and colleagues (2002) can be used together with simulation studies for rigorous statistical testing against random networks (a minimal scoring sketch is given below). Importantly, the results from such statistical and simulation analyses can be informative in revealing both node and edge behaviour, and should fit the anticipated biological response. If the results of such analysis fit the biology poorly, this may indicate that the reconstructed pathway is not yet suitable for validation purposes.
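In the spirit of the Ideker et al. (2002) approach, the sketch below aggregates per-gene expression-change z-scores over a candidate connected subnetwork and compares the aggregate against subnetworks of the same size drawn at random, giving an empirical significance for the 'active' region. The gene names, z-scores and candidate subnetwork are invented for illustration; the original method additionally uses simulated annealing to search for high-scoring subnetworks.

```python
# Minimal sketch of scoring an 'active' subnetwork against random subnetworks.
# Gene names, z-scores and the candidate subnetwork are illustrative only.
import random
import math

# Per-gene z-scores for differential expression (would come from microarray analysis)
z = {"g%d" % i: random.gauss(0, 1) for i in range(200)}
for gene in ("g1", "g2", "g3", "g4"):   # pretend these four connected genes respond strongly
    z[gene] += 2.5

def subnet_score(genes):
    """Aggregate z-score of a gene set: z_A = sum(z_i) / sqrt(k)."""
    return sum(z[g] for g in genes) / math.sqrt(len(genes))

candidate = ["g1", "g2", "g3", "g4"]     # a connected region suggested by the pathway map
observed = subnet_score(candidate)

# Empirical null: score random gene sets of the same size
null = [subnet_score(random.sample(list(z), len(candidate))) for _ in range(10000)]
p_value = sum(s >= observed for s in null) / len(null)

print(f"observed score = {observed:.2f}, empirical p = {p_value:.4f}")
```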
A poor fit might suggest that other key components of the pathway are missing. In this case consideration should be given to further experimental testing, ideally involving a combination of high-throughput exploration and hypothesis-based mechanistic studies. Even in the case of a good fit it is likely that there exist many unknown or poorly characterised components of the pathway. Here, a range of unbiased comprehensive screens can be applied, from traditional genetic screens to more recent molecular, biochemical and functional loss-of-function and gain-of-function screens. Exemplar approaches include, but are not restricted to, proteomic studies such as co-complex purification coupled with mass spectrometry, yeast two-hybrid protein interaction screens, genome-wide RNAi assays and forward genetic approaches. These approaches have been covered extensively in a number of excellent reviews on systems biology. Ultimately these assays should lead on to hypothesis testing of the mechanism of action of pathway components.
Mechanism-based evaluation of pathways can be readily extrapolated to the analysis of clinical samples. For these studies it is important that care is taken in defining precisely the clinical phenotype of the subjects to be studied, and the use of robust clinical and laboratory standard operating procedures is essential.

Clinical data: tractability

A key constraint in clinical research is in obtaining a sufficient number of subjects, especially for the control groups, and in the amount and availability of sample material outside the medical treatment path of the patients. Thus full consideration, including ethical issues, has to be given to what and how many samples are available for a particular study. Most importantly, it is essential that the clinical phenotype is clearly defined and minimises potential heterogeneity in the patient population. There can also be obvious limitations with regard to obtaining invasive biopsy specimens, and the preferred option, for both recruitment purposes and the patients, would be non-invasive sampling. A relatively benign non-invasive specimen that can be highly informative is whole blood. Here relatively small but readily accessible samples can be taken and used for systemic measurements of disease condition and treatment response (e.g. Smith et al., 2007). For these reasons whole blood sampling is an area receiving increasing interest and is amenable to transcriptomic, proteomic and metabolic screening.
In order to ensure the level of security and confidentiality expected for the storage of this sensitive
data, the following steps are recommended. The data should be kept on a server protected by security
measures at a physical, network and application level. The data should be adequately backed up and
archived. Only authorized collaborators should be allowed to use the system, subject to effective authentication as well as auditing. Approval from the local Clinical Data Protection Officer and senior medical officer (following Caldicott Guardian principles) should be sought to comply with the confidentiality and data protection requirements. Ultimately, for the purpose of integrating pathway data, clinical records and high-throughput data, a medical bioinformatics (MBI) system will need to be established, requiring application tools to migrate anonymised data from the clinical databases to the pathway database.

Conceptual challenge of multiple pathway dependencies in medicine

As a consequence of measuring patient samples using a systems biology approach with high-throughput technologies, multiple cellular and molecular pathways are recorded simultaneously. This raises a major scientific limitation and sets the challenge of how to analyse and understand the underlying cross-talk and dependencies in the myriad of potentially interrelated networks and pathways.
While a clear way forward is yet to be defined, we would like to propose a conceptual framework aimed at classifying multiple pathway interrelationships and dependencies. In this regard we consider three types of pathway interdependency that might help in the future toward a functional classification of clinical data. The types of pathway interrelationship can be labelled as process, sharing and fit:

•	In the process type, an output or activity of one pathway produces a resource or input used by another pathway. For instance, the sterol biosynthesis pathway leads to the production of certain oxy-sterols that form ligands for the activation of the Liver X Receptor (LXR) pathway.
•	In the case of sharing, a single resource or pathway is used as an input by a range of different activities or pathways. For example, the Janus Kinase-Signal Transducers and Activators of Transcription (JAK-STAT) signalling pathway is activated by more than 50 cytokines or growth factors.
•	In the case of a pathway fit, multiple different pathway activities produce a single or common resource. For example, a wide range of cytokine, chemokine and pathogen recognition signalling pathways lead to the activation of Nuclear Factor Kappa B (NFKB), a central transcription factor for the regulation of both innate and adaptive immunity.

All of these types of interrelated activities can be coordinated in a hierarchical, non-hierarchical or emergent way, and are likely to involve cells, genes, proteins and/or metabolites.
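One simple way to make these three categories operational is to record, for each pathway, its required inputs and produced outputs, and then classify the resource-level relationships between pathways automatically. The sketch below does this for invented pathway names and resources; it is a bookkeeping illustration of the process/sharing/fit labels, not a validated classification tool.

```python
# Minimal sketch: classifying inter-pathway relationships as process, sharing or fit.
# Pathway names and resources are illustrative placeholders.
from collections import defaultdict

pathways = {
    "sterol_biosynthesis": {"inputs": {"acetyl-CoA"},  "outputs": {"oxysterols"}},
    "LXR":                 {"inputs": {"oxysterols"},  "outputs": {"lipid_genes"}},
    "JAK-STAT":            {"inputs": {"cytokine"},    "outputs": {"NFKB_activation"}},
    "pathogen_recognition":{"inputs": {"cytokine"},    "outputs": {"NFKB_activation"}},
}

# process: an output of one pathway is an input of another
process = [(a, b) for a in pathways for b in pathways if a != b
           and pathways[a]["outputs"] & pathways[b]["inputs"]]

# sharing: one resource is consumed as an input by several pathways
consumers = defaultdict(list)
for name, p in pathways.items():
    for r in p["inputs"]:
        consumers[r].append(name)
sharing = {r: ps for r, ps in consumers.items() if len(ps) > 1}

# fit: several pathways produce the same resource
producers = defaultdict(list)
for name, p in pathways.items():
    for r in p["outputs"]:
        producers[r].append(name)
fit = {r: ps for r, ps in producers.items() if len(ps) > 1}

print("process:", process)
print("sharing:", sharing)
print("fit:", fit)
```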

Concluding Remarks
We hope this report provides a useful starter's guide to the application of pathway biology and how it can be extended to clinical research. Ultimately, we anticipate that a comprehensive understanding of pathway structures will also allow us to predict potential disease indications, or the therapeutic complications and side effects that a treatment could incur. Accordingly, pathway biology has the future potential to contribute toward a new foundation for the next generation of medicine and clinical practice. Here we envision the use of clinical data and current best evidence, coupled with pathway knowledge, to make decisions about the care of individual patients.


Acknowledgment
I would like to thank all my colleagues in the Division of Pathway Medicine and our sponsors including
the Wellcome Trust, EU FP6 programme, BBSRC and MRC.

References
Akutsu, T., Miyano, S., & Kuhara S. (1999). Identification of genetic networks from a small number of
gene expression patterns under the Boolean network model. Pac Symp Biocomp, 4, 17.
Akutsu, T., Miyano, S., & Kuhara, S. (2000). Inferring qualitative relations in genetic networks and
metabolic pathways. Bioinformatics, 16, 727.
Albert, R., Jeong, H., & Barabasi, A.-L. (2000). Error and attack tolerance of complex networks. Nature,
406, 378.
Albert, R. & Othmer, H. (2003). The topology of the regulatory interactions predicts the expression
pattern of the Drosphilia segment polarity genes. Journal of Theoretical Biology, 223, 1.
Arkin, A., Ross, J., & McAdams, H.H. (1997). Proc Natl Acad Sci USA, 94, 814.
Arkin A., Ross, J., & McAdams H.H. (1998). Stochastic kinetic analysis of developmental pathway
bifurcation in phage lambda-infected escherichia coli cells. Genetics, 149, 1633.
Calvano, S.E., Xiao, W., Richards, D.R., Felciano, R.M., Baker, H.V., Cho, R.J., Chen, R.O., Brownstein,
B.H., Cobb, J.P., Tschoeke, S.K., Miller-Graziano, C., Moldawer, L.L., Mindrinos, M.N., Davis, R.W.,
Tompkins, R.G., Lowry, S.F. A network-based analysis of systemic inflammation in humans. Nature,
437, 1032.
Cooper, H., & Hedges, L.V. (1994). The handbook of research synthesis. New York: Russell Sage
Foundation.
Dogrusoz, U., Erson, E.Z., Giral, E., Demir, E., Babur, O., Cetintas, A., & Colak, R. (2005). PATIKAWeb: A Web interface for analyzing biological pathways through advanced querying and visualization. Bioinformatics, bti776.
Fell, D. (1992). Metabolic control analysis: a survey of its theoretical and experimental development,
Biochem J, 286, 313.
Kauffman, S. (1969). Metabolic stability and epigenesis in randomly constructed genetic nets. Journal
of theoretical biology, 22, 437.
Kaufman, M., Andris, F., & Leo, O. (1999). A logical analysis of T cell activation and anergy. Proc Nat
Acad Sci USA, 96, 3894.
Kuffner, R., Zimmer, R., & Lengauer, T. (2000). Pathway analysis in metabolic databases via differential
metabolic display. Bioinformatics, 16, 825.


Funahashi, A., Tanimura, N., Morohashi, M., & Kitano, H. (2003). Cell designer: A process diagram
editor for gene-regulatory and biochemical networks. BIOSILICO, 1, 159.
Glass, K., & Kauffman, S. (1973). The logical analysis of continuous, non-linear biochemical control
networks. Journal of Theoretical Biology, 39, 103.
Goss, P., & Peccoud, J. (1998). Quantitative modeling of stochastic systems in molecular biology by
using stochastic Petri Nets. Proc Natl Acad Sci USA, 95, 6750
Grimes, G.R., Wen, T.Q., Mewissen, M., Baxter, R.M., Moodie, S., Beattie, J.S., & Ghazal, P. (2006).
PDQ Wizard: Automated prioritization and characterization of gene and protein lists using biomedical
literature. Bioinformatics, 22, 2055-2057.
Hofestadt, R., & Thelen, S. (1998). Quantitative modeling of biochemical networks. In Silico Biology, 1(1), 39.
Huang, S. (1999). Gene expression profiling, genetic networks and cellular states: An integrating concept for tumorigenesis and drug discovery. Journal of Molecular Medicine, 77, 469.
Hucka, M. (2003). The systems biology markup language (SBML): A medium for representation and
exchange of biochemical network models. Bioinformatics, 19, 524.
Husmeier, D. (2003). Reverse engineering of genetic networks with Bayesian networks. Biochemical
Society Transactions, 31(6), 1516.
Husmeier, D. (2003). Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19, 2271.
Ideker, T., Ozier, O., Schwikowski, B., & Siegel, A.F. (2002). Discovering regulatory and signalling
circuits in molecular interaction networks. Bioinformatics, 18, S233-S240.
Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., & Barabasi, A.L. (2000). The large-scale organization
of metabolic networks. Nature, 407, 651-4.
Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath,
G.R., Wu, G.R., Matthews, L., Lewis, S., Birney, E., & Stein, L. (2005). Reactome: A knowledgebase
of biological pathways. Nucl. Acids Res., 33, D428-432.
Liang, S., Fuhrman, S., & Somogyi, R. (1998). REVEAL: A general reverse engineering algorithm for
inference of genetic network architectures. Pac Symp Biocomput, 3, 18.
Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K.F., Itoh, M., Kawashima, S., Katayama, T., Araki,
M., & Hirakawa, M. (2006). From genomics to chemical genomics: New developments in KEGG. Nucl.
Acids Res., 34, D354-357.
Kell, D. (2006). Systems biology, metabolic modelling and metabolomics in drug discovery and development. Drug Discovery Today, 11, 1085.
Kirkpatrick, S., Gelatt, C.D., & Vecchi, M.P. (1983). Optimization by simulated annealing. Science,
220, 671-680.


Kitano, H. (2003). A graphical notation for biochemical networks. Biosilico, 1, 169.


Kitano, H., Funahashi, A., Matsuoka, Y., Oda, K. (2005). Using process diagrams for the graphical
representation of biological networks. Nat Biotechnol, 23, 961-966.
Kohn, K.W. (1999) Molecular interaction map of the mammalian cell cycle control and DNA repair
systems. Mol. Biol. Cell, 10, 2703-2734.
Kohn, K.W., Aladjem, M.I., Weinstein, J.N., & Pommier, Y. (2006) Molecular interaction maps of bioregulatory networks: A general rubric for systems biology. Mol. Biol. Cell, 17, 1-13.
Laubenbacher, R. & Sigler, B. (2004). A computational algebra approach to the reverse engineering of
gene regulatory networks. Journal of Theoretical Biology, 229, 523.
Lemer, C., Antezana, E., Couche, F., Fays, F., Santolaria, X., Janky, R.S., Deville, Y., Richelle, J., &
Wodak, S.J. (2004). The aMAZE LightBench: A Web interface to a relational database of cellular processes. Nucl. Acids Res., 32, D443-448.
Matsuno, H., Doi, A., Nagasaki, M., & Miyano, S. (2000). Hybrid Petri Net representation of gene
regulatory network. Pacific Symposium on Biocomputing, 5, 341. Singapore: World Scientific Press.
Mendoza, L., Thieffry, D., & Alvarez-Buylla, E. (1999). Genetic control of flower morphogenesis in
Arabidopsis thaliana: a logical analysis. Bioinformatics, 15, 593.
Moodie, S., Sorokin, A., Goryanin, I., & Ghazal, P. (2006). A graphical notion to describe the logical
interactions of biological pathways. Journal of Integrative Bioinformatics, 3.
Nikitin, A., Egorov, S., Daraselia, N., & Mazo, I. (2003). Pathway studio--The analysis and navigation
of molecular networks. Bioinformatics, 19, 2155-2157.
Pagel, P., Kovac, S., Oesterheld, M., Brauner, B., Dunger-Kaltenbach, I., Frishman, G., Montrone, C.,
Mark, P., Stumpflen, V., Mewes, H.-W., Ruepp, A., & Frishman, D. (2005). The MIPS mammalian
protein-protein interaction database. Bioinformatics, 21, 832-834.
Pal. R., Ivanov, I., Datta, A., Bittner, M., & Dougherty, E. (2005). Generating Boolean networks with a
prescribed attractor structure. Bioinformatics, 21, 4021.
Pinney, J., Westhead, D., & McConkey, G. (2003). Petri Net representations in systems biology. Biochem
Soc Trans, 31, 1513.
Regev, A., & Shapiro, E. (2004). The pi-calculus as an abstraction for biomolecular systems. Modelling
in Molecular Biology, Springer.
Rual, J.-F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., Berriz, G.F., Gibbons,
F.D., Dreze, M., Ayivi-Guedehoussou, N., Klitgord, N., Simon, C., Boxem, M., Milstein, S., Rosenberg,
J., Goldberg, D.S., Zhang, L.V., Wong, S.L., Franklin, G., Li, S., Albala, J.S., Lim, J., Fraughton, C.,
Llamosas, E., Cevik, S., Bex, C., Lamesch, P., Sikorski, R.S., Vandenhaute, J., Zoghbi, H.Y., Smolyar, A.,
Bosak, S., Sequerra, R., Doucette-Stamm, L., Cusick, M.E., Hill, D.E., Roth, F.P., & Vidal, M. (2005).
Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437, 1173.


Salwinski, L., Miller, C.S., Smith, A.J., Pettit, F.K., Bowie, J.U., & Eisenberg, D. (2004). The database
of interacting proteins: 2004 update. Nucl. Acids Res., 32, D449-451.
[SBGN, 2007] Systems Biology Graphical Notation Level 1 Specification, www.sbgn.org
Shmulevich, I., Dougherty, E., Kim, S., & Zhang, W. (2002). Probabilistic Boolean networks: A rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18, 261.
Shmulevich, I., Dougherty, E., & Zhang, W. (2002). From Boolean to probabilistic Boolean networks
as models of genetic regulatory networks. Proc IEEE, 90(11), 1778.
Smith, C.L., Dickinson, P., Forster, T., Khondoker, M., Craigon, M., Ross, A.J., Storm, P., Burgess, S.,
Lacaze, P., Stenson, B.J., & Ghazal, P. (2007). Quantitative assessment of human whole blood RNA as
a potential biomarker for infectious disease. Analyst, 132, 1200-1209.
Yook, S.-H., Oltvai, Z.N., & Barabasi, A.-L. (2004). Functional and topological characterization of protein interaction networks. Proteomics, 4, 928-942.
Sorokin, A., Paliy, K., Selkov, A., Demin, O., Dronov, S., Ghazal, P., & Goryanin, I. (2006). The pathway
editor: A tool for managing complex biological networks. IBM J. Res. Dev. 50, 561-573.
Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F.H., Goehler, H., Stroedicke, M., Zenkner,
M., Schoenherr, A., & Koeppen, S. (2005). A human protein-protein interaction network: A resource
for annotating the proteome. Cell, 122, 957.
Watts, D.J., & Strogatz, S.H. (1998). Collective dynamics of 'small-world' networks. Nature, 393, 440.
Watterson, S., Marshall, S., & Ghazal, P. (2008). Logic models of pathway biology. Drug Discovery
Today, 13, 447-456.
Uetz, P., Dong, Y-A., Zeretzke, C., Atzler, C., Baiker, A., Berger, B., Rajagopala, S.V., Roupelieva, M.,
Rose, D., Fossum, E., & Haas, J. (2006). Herpesviral protein networks and their interaction with the
human proteome. Science, 311, 239-242.
Werhli, A. & Husmeier, D. (2007). Reconstructing gene regulatory networks with Bayesian networks by
combining expression data with multiple sources of prior knowledge. Statistical Application in Genetics
and Molecular Biology, 6, 15


Chapter II
Systems and Control Theory for Medical Systems Biology

Peter Wellstead
The Hamilton Institute, National University of Ireland Maynooth, Ireland

Sree Sreenath
Case Systems Biology Initiative, Case Western Reserve University, USA

Kwang-Hyun Cho
Korea Advanced Institute of Science and Technology (KAIST), Korea

Olaf Wolkenhauer
University of Rostock, Germany

ABSTRACT
In this chapter the authors describe systems and control theory concepts for systems biology and the
corresponding implications for medicine. The context for a systems approach to the life sciences is
outlined, followed by a brief history of systems and control theory. The technical aspects of systems and
control theory are then described in a way oriented toward their biological and medical application.
This description is then used as a reference base against which to indicate specific areas where systems
and control theory aspects of systems biology have strong medical implications. Specifically, two systems
biology projects are described as examples of where methods from systems and control theory play an
important role.

INTRODUCTION
In this chapter the authors describe experiences gained working at the interface between the biological/medical sciences and the physical/engineering systems sciences. In doing so we attempt to convey


the contributions that the physical, mathematical and engineering sciences have made, and will continue
to make, to innovations in biology and medicine. In this context we stress the role played by systems
and control theory in the development of general principles for biological systems, and in particular
the understanding of dynamical phenomena in biology and medicine. According to our experiences,
systems methods are influencing the biology research sector through a series of evolutionary scientific
steps, as follows:


•	Stage 1: High-throughput biochemical instrumentation was (and continues to be) developed to provide rapid measurement and generation of data.
•	Stage 2: To meet the need to process the data generated in stage 1, data processing methods are being developed to extract information from very large data records.
•	Stage 3: The information from stage 2 is used to calibrate mathematical models with which to visualise an underlying biological process. This is the current evolutionary state in systems biology.
•	Stage 4: Control and systems theory are applied to the mathematical models of stage 3 to provide understanding of biological behaviour and underlying principles.

In summary, the sequence goes from:

measurement → data → information → visualisation → understanding.


The current state of the art is that the value of in-silico simulation of biological phenomena is becoming appreciated. Even so, most biological measurement techniques are designed to collect static data,
whereas time course data is required to develop mathematical models for visualising system dynamics
by in-silico simulation. It is not always appreciated that, as a result of poor data, the calibration and
structural correctness of mathematical models is often suspect. Likewise, there is currently little appreciation of the fundamental importance of control and systems theory in understanding biological
and physiological phenomena and principles.
On the other hand, the role of systems and control theory is clearly established in the medical community through the understanding that it gives to physiological function. Under the historical influence
of Claude Bernard's ideas, as embodied in Cannon's concept of homeostasis (Bayliss, 1966; Cannon, 1932), feedback control is central to many aspects of current medical understanding, although this is usually intuitive and non-theoretical in nature (Tortora, 2003). Since Cannon's work in the 1930s, other
researchers have expanded upon the homeostatic feedback principle (Sterling, 2004) in its specific
medical and physiological contexts. In the meantime however, systems and control theory has expanded
scientifically and progressed to become a mature scientific discipline with fundamental relevance to
all areas of scientific endeavour. Throughout this 70-year period of separate development, the medical
concepts of control systems and the mathematical tools of control and systems theory have diverged. The
aim of this chapter is to reconnect the medical ideas of feedback with mainstream theory by explaining
areas where control and systems theory can contribute. We consider this to be vitally important to our
scientific futures. For, as indicated above and documented in the recent report Systems Biology: a vision
for engineering and medicine (Royal Academy, 2007), the use of systems theory and control concepts
will be essential to our understanding of biological systems for medicine.

BRIEF HISTORY OF CONTROL AND SYSTEMS THEORY

Control and systems theory have their origins in the 1700s, with practical devices designed to regulate speed in wind, water and steam energy sources. In the hurly-burly of the Industrial Revolution, design changes to improve the performance of feedback regulators rapidly outstripped their designers' ability to predict their dynamical behaviour. However, as the new invention spread to astronomical instruments, scientists such as Airy and Maxwell became interested. In this context, the appearance of Maxwell's paper 'On Governors' (Maxwell, 1867) was a pivotal point in control and systems theory. It gave a mathematical basis for understanding the stability of physical systems and made the mathematical analysis of engineering systems respectable. This process of technical and theoretical development expanded rapidly with industrial growth in the 1800s. This era is ably described in the historical account of Bennett (1979).
The need for analysis and design methods for electronic amplifiers, just as for mechanical regulators before, led to the next crucial development in control and systems theory: systematic stability analysis and design in the frequency domain. The 1930s framework set by Nyquist (1932), Bode (1945) and Shannon (1948) informed theoretical developments for twenty-five years, until the needs of aerospace led to an alternative time-domain framework for the theory of signals and systems (Bennett, 1993). The 1950s saw the beginning of a golden age of theoretical and technical advances in control and systems theory, an age during which research became an international endeavour, with theoretical developments emerging rapidly from numerous independent sources. It resulted in what we have now: control and systems theory as a mature discipline covering the analysis and design of feedback systems, optimal control theory, multivariable systems theory, modelling, system identification and much more. The following paragraphs outline these as they relate to systems biology for medicine.

OUTLINE OF CONTROL AND SYSTEMS THEORY METHODS

Control and systems theory, together with methods from communication theory, offers a unified structure within which to mathematically represent, analyse, understand and potentially modify the dynamical behaviour of systems and signals. The principle of homeostasis means that control is accepted in physiology and medicine (Tortora, 2003), and there are books on basic control and systems theory in medicine (Riggs, 1976). However, in the years since Cannon's work, control and systems theory has developed into a complete theoretical and analytical framework for understanding the behaviour of dynamical systems. In the following we outline the components of this framework.

Mathematical Models for Control and Systems Analysis


The primary purpose of mathematical models is to allow the dynamical behaviour of systems to be
analysed for design, performance prediction and control (Crandall, 1968). For linear systems a classical
representation of system dynamics is the transfer function model. Such models, developed from the
analysis of frequency response systems, have theoretical foundations in Laplace transform methods
and complex variables (Smith, 1966). Transfer function methods are not generally used in systems biology applications because they are designed for linear systems. The nonlinear dynamics encountered


in biochemical reactions is catered for in the alternative form of control and systems model - the state
space model (Friedland, 2005).
The state space representation is based upon sets of coupled first-order differential equations such
as occur naturally in mathematical models of metabolic pathways and cell signalling. Moreover, the
state space form accommodates nonlinear dynamical features such as those that appear routinely in
biology/physiology. The control and analysis of state space systems is however only fully developed
for the case of linear systems. A range of theoretical methods for stability and performance analysis of
nonlinear state space equations exists (Freeman and Kokotovic, 1996) and there are methods especially
tuned for state space structures found in the life sciences.
In the following, and unless stated, we will assume that a state space model form is required or
used. (Note that hybrid models that combine continuous dynamics (e.g. reaction kinetics) and logical
processes (e.g. in gene regulation), and/or pure time delays (e.g. transport processes) are more complex
to analyse.).
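As a concrete illustration of the state space form, the sketch below writes a small nonlinear two-state model (a substrate converted by Michaelis-Menten kinetics into a product that is degraded linearly) as coupled first-order differential equations and integrates it numerically. The reaction scheme and all rate constants are generic placeholders chosen for illustration only; SciPy is assumed to be available.

```python
# Minimal sketch: a nonlinear state space model dx/dt = f(x, u) integrated numerically.
# The two-state reaction scheme and all rate constants are illustrative placeholders.
import numpy as np
from scipy.integrate import solve_ivp

Vmax, Km, k_deg, u_in = 1.0, 0.5, 0.3, 0.2   # assumed parameter values

def f(t, x):
    s, p = x                                  # states: substrate and product concentrations
    ds = u_in - Vmax * s / (Km + s)           # constant input minus Michaelis-Menten conversion
    dp = Vmax * s / (Km + s) - k_deg * p      # conversion minus first-order degradation
    return [ds, dp]

sol = solve_ivp(f, (0.0, 50.0), y0=[0.0, 0.0], t_eval=np.linspace(0, 50, 200))

print("final state (s, p):", sol.y[:, -1])    # approaches the steady state of the model
```

Exactly the same pattern (a vector of states, a right-hand-side function, a numerical integrator) scales to the much larger signalling and metabolic models discussed later in this chapter.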

Modelling Methods and Computer Based Simulation


The methods for deriving and developing mathematical models of dynamical systems have been thoroughly developed by the control and systems community (Wellstead, 1979; MacFarlane, 1970). These
physical systems methods, suitably augmented by biochemical reaction methods (Cornish-Bowden,
2004), provide a conceptual basis and practical framework for constructing the differential equation/state
space models needed to describe dynamical phenomena in cell biology and physiology. The rapid and
easy visualisation of the dynamical behaviour of these models is a basic tool in control and systems
analysis and there is a wide range of numerical implementations of dynamical systems simulators for
digital computers. A de facto standard for such simulation (and originally developed by control and
dynamics experts) is MATLAB (Higham and Higham, 2005). This software has numerous accessories
for control and systems analysis and allows the development of specialised toolboxes. One such toolbox
is the Systems Biology Toolbox, which is specifically for state space models of signalling and metabolic
pathways (Schmidt and Jirstrand, 2006).
The ability to easily visualise the behaviour of a complex dynamical system from its mathematical
model has had a particular impact in medicine and physiology (Hunter and Borg, 2003). The multiscale
methods needed by such areas however require more general modelling software to allow spatial as well
as temporal information to model organ and tissue function (Hunter et. al. 2006). In the same spirit, the
simulation of metabolic systems and in particular in-silico pharmacokinetics and pharmacodynamics
have profited from dynamical modelling methods founded in systems and control theory.

System Identification and Data Analysis


Writing down the equations that describe a metabolic process, signalling pathway, gene regulation
system or physiological process is only one aspect of mathematical modelling in systems biology. The
more demanding aspects are determining (a) whether the model structure is valid (structure identification)
or (b) determining the numerical value of the various model parameters (parameter estimation). These
two tasks are part of the area known in control and systems theory as system identification. A good view
of basic ideas in systems identification is the classic engineering text (Eykhoff, 1974), while statistical
time series approaches are covered in an accessible manner elsewhere (Box and Jenkins, 1970).


System identification is key to creating good models in systems biology for medicine, since it is commonplace for coefficients of models to be unknown, and in many cases the structure of the pathways is
uncertain. Indeed some of the most biologically interesting contributions in systems biology have been
where system identification has shown that signalling pathways based on biological deduction were
wrong (Swameye et al., 2003). Likewise, the use of parameters from the literature is a poor compromise
when they can be directly estimated for the experimental situation in hand.
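A minimal parameter-estimation sketch is given below: synthetic noisy time-course data are generated from a first-order decay model, and the rate constant and initial concentration are then recovered by nonlinear least squares. The model, 'true' parameter values and noise level are invented for illustration; real applications would of course fit the pathway model at hand to experimental time-course data.

```python
# Minimal sketch of parameter estimation (system identification) by least squares.
# The decay model, true parameters and noise level are illustrative placeholders.
import numpy as np
from scipy.optimize import curve_fit

def model(t, x0, k):
    """First-order decay x(t) = x0 * exp(-k t)."""
    return x0 * np.exp(-k * t)

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 25)                              # sampling times of the time course
y = model(t, 2.0, 0.45) + rng.normal(0, 0.05, t.size)   # synthetic noisy measurements

popt, pcov = curve_fit(model, t, y, p0=[1.0, 0.1])      # initial guesses for x0 and k
perr = np.sqrt(np.diag(pcov))                           # approximate standard errors

print("estimated x0 = %.3f +/- %.3f" % (popt[0], perr[0]))
print("estimated k  = %.3f +/- %.3f" % (popt[1], perr[1]))
```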

Identifiability

The degree to which a system can be determined from measured data is termed identifiability (Wellstead, 1975), and it is an important concept. Specifically, in some cases it is impossible to identify a system uniquely from the available measured data. There are two aspects to this. The first relates to the structure of the system, whereby for certain interrelations in the system it is impossible to unambiguously determine particular parameters or distinguish causality. Feedback structures of the kind found in medical physiology are one such form that gives identifiability problems, in distinguishing between forward signal transmission and feedback transmission paths. The second aspect concerns whether the system modes are sufficiently excited to allow identification from the measured data. This is of particular relevance to medical systems biology, where data is often unsuitable for identification because of limitations on what input perturbations are possible. The topic of experiment design is relevant here (Zarrop, 1979).
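The structural side of identifiability can be illustrated with a deliberately simple example: if two rate constants enter a model only as a product, no amount of data on that model's output can separate them. The sketch below, with invented parameter values, shows two different parameter pairs producing numerically indistinguishable trajectories.

```python
# Minimal sketch of structural non-identifiability: in dx/dt = -(k1 * k2) * x
# only the product k1*k2 is identifiable from x(t), not k1 and k2 separately.
# Parameter values are illustrative placeholders.
import numpy as np

def trajectory(k1, k2, x0=1.0, t=np.linspace(0, 5, 50)):
    return x0 * np.exp(-(k1 * k2) * t)      # analytic solution of the linear model

x_a = trajectory(k1=0.2, k2=3.0)            # product = 0.6
x_b = trajectory(k1=1.2, k2=0.5)            # product = 0.6 again

print("maximum difference between the two outputs:",
      np.max(np.abs(x_a - x_b)))            # effectively zero (floating-point noise only)
```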

Random Processes in Control and Systems Theory


Of equal importance to the application of systems identification itself is the thorough analysis and validation of the time course data associated with experimental procedures. All measured data is subject
to error and the correct treatment of such data from dynamical processes is the aim of random data
analysis. A comprehensive and practical reference to these methods is (Bendat and Piersol, 2000). The
key issue here is that all measurement processes are subject to distortion (both systematic and random).
Likewise the underlying processes that drive the system may themselves be stochastic in nature and
require characterisation with the tools of probabilistic data analysis (Papoulis and Pillai, 2002).

Control System Basics


Feedback Structures
Control and systems theory gives us a deep understanding of feedback in its practical and theoretical
aspects. From the application of feedback in engineering systems and machines, we have developed a
complete theory of feedback in linear dynamical systems. This has been mainly applied to designing
and building devices that depend upon closed loop feedback control for their performance. In systems
biology, the knowledge won in technological development helps us to understand the role of feedback
loops in biological processes. In particular, experience of design of technological control systems allows
analogies to be found in living systems for such principles as design for regulation against disturbances
(c.f. homeostasis), set-point tracking and feedforward compensation. The better understanding of
biological, metabolic and physiological function that this affords allows us to predict consequences of


interventions that disturb physiological and biological loops. This is particularly true where the system
has complex crosstalk between interacting channels in which multi-input, multi-output control system
analysis (Skogestad and Postlethwaite, 1996) can predict the unusual responses that can occur under
feedback conditions.
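As a small numerical analogue of disturbance regulation (homeostasis), the sketch below compares an open-loop system with the same system under proportional negative feedback when a constant disturbance arrives. The single-state model, gains and disturbance size are illustrative placeholders only.

```python
# Minimal sketch: proportional negative feedback rejecting a constant disturbance,
# compared with the open-loop response. All values are illustrative placeholders.
import numpy as np
from scipy.integrate import solve_ivp

setpoint, a, Kp, d = 1.0, 0.5, 4.0, 0.8       # natural decay a, feedback gain Kp, disturbance d

def open_loop(t, x):
    return [-a * x[0] + d]                    # no corrective action

def closed_loop(t, x):
    u = Kp * (setpoint - x[0])                # proportional control toward the set point
    return [-a * x[0] + d + u]

t_eval = np.linspace(0, 20, 100)
x_ol = solve_ivp(open_loop, (0, 20), [setpoint], t_eval=t_eval).y[0]
x_cl = solve_ivp(closed_loop, (0, 20), [setpoint], t_eval=t_eval).y[0]

print("steady state without feedback: %.2f" % x_ol[-1])   # drifts to d/a = 1.6
print("steady state with feedback:    %.2f" % x_cl[-1])   # stays near the set point
```

The closed-loop variable stays close to its set point despite the disturbance, which is the qualitative behaviour captured by the homeostasis analogy in the text.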

Stability and Transient Response


The analysis of time course (transient) behaviour is basic to control and systems theory, and its transfer
to the understanding of biological dynamics is of paramount value. There is a wide range of theoretical
tools in this area, but the most immediately relevant are drawn from state space analysis. They offer
insights into the possible convergence points (e.g. steady state) of systems, stability properties and
transient performance. Stability analysis of nonlinear systems is particularly relevant to systems biology
(Slotine, 1991), since all but the most basic of mathematical models of biological systems are nonlinear
with potentially complex dynamical behaviour. Thus stability methods based upon Lyapunov analysis
(Bacciotti and Rosier, 2005) and the special structures that occur in life science systems provide insights
into behaviour (Angeli and Sontag, 2003).

System Properties
In understanding how to control technological systems, control theorists have formalised a number of
mathematical properties that are important to the understanding of systems in general. Key among these
properties for systems biology in medicine are observability, controllability, sensitivity and robustness.
As follows:

•	Observability. In broad terms, this property relates to the ability to determine the states of a system from measurements at its outputs. In systems biology for medicine, the states are concentrations of chemicals in a metabolic, signalling or physiological network. Concentrations that cannot be determined from the available data are said to be unobservable.
•	Controllability. Similarly in broad terms, controllability is the ability of a control mechanism to manipulate all the states of a system from the inputs. This has implications for therapies that aim to adjust biochemical levels by external controls: an uncontrollable state cannot be adjusted in this way. Likewise, in certain signalling pathways it may not be possible from the inputs (e.g. receptor channels) to modify certain chemical concentrations; they are then said to be uncontrollable. (A small numerical check of these two properties is sketched after this list.)
•	Sensitivity. This property is important in understanding the constraints on the performance of a dynamical system. It relates to the sensitivity of the overall dynamical system to variations in different parts of the system. Thus changes in certain parameters (e.g. kinetic coefficients in a pathway model) may have a big influence on the observable system performance, implying a large sensitivity; others may have only a small impact, implying insensitivity.
•	Robustness. This property is related to sensitivity, in the sense that one purpose of negative feedback is to make a system robust (that is, insensitive) to variations in certain parameters or variables. For example, one purpose of homeostatic loops is to make the metabolism robust against external variations; a side result is that it will be sensitive to others, as part of a robustness/sensitivity trade-off (Dorato, 1998). Robustness is recognised as an important biological principle, albeit with a slightly less formal definition (Kitano, 2007).
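For linear state space models these properties can be checked numerically by rank tests on the controllability matrix [B, AB, A^2 B, ...] and the observability matrix [C; CA; CA^2; ...]. The sketch below applies the test to a small invented three-state cascade; the A, B and C matrices are illustrative placeholders, not a published pathway model.

```python
# Minimal sketch: rank tests for controllability and observability of a linear
# state space model dx/dt = A x + B u, y = C x. The three-state cascade is
# an invented example used purely for illustration.
import numpy as np

A = np.array([[-1.0,  0.0,  0.0],
              [ 1.0, -0.5,  0.0],
              [ 0.0,  0.8, -0.2]])          # x1 -> x2 -> x3 cascade with decay

def controllable(A, B):
    n = A.shape[0]
    ctrb = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(n)])
    return np.linalg.matrix_rank(ctrb) == n

def observable(A, C):
    n = A.shape[0]
    obsv = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(n)])
    return np.linalg.matrix_rank(obsv) == n

B_top = np.array([[1.0], [0.0], [0.0]])      # input enters at the top of the cascade
B_end = np.array([[0.0], [0.0], [1.0]])      # input enters only at the end
C_end = np.array([[0.0, 0.0, 1.0]])          # only the last concentration is measured
C_top = np.array([[1.0, 0.0, 0.0]])          # only the first concentration is measured

print("input at x1 -> controllable:", controllable(A, B_top))   # True
print("input at x3 -> controllable:", controllable(A, B_end))   # False: x1, x2 unreachable
print("measure x3  -> observable: ", observable(A, C_end))      # True
print("measure x1  -> observable: ", observable(A, C_top))      # False: x2, x3 unseen
```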

Types of Control Systems


Related to the properties exhibited by a control system are the purpose that it is intended to serve and the objective or procedure used in its design. Typically, a control system may have the objective that a system variable should follow some external variation within certain limits of accuracy and speed; this is what is termed classical servo/reference tracking control. Other control systems are set up to maximise (or minimise) some objective function in an optimal way; this is termed optimal control (Bryson and Ho, 1969). When random disturbances predominate, the controller design objective is focussed upon the disturbances; this is termed stochastic control and is treated as a branch of optimal control but, more importantly, uses Kalman filtering (Kalman and Bucy, 1961). The Kalman/Bucy framework is of importance to systems biology since it provides a state space framework for understanding and analysing random processes. A further controller type that is relevant to systems biology is coordinating control, where the control mechanism must combine the dual function of regulating local behaviour with coordination (Mesarovic et al., 2004). Beyond this there are many more specialist control systems that are particular to specific system forms and applications. The book by Goodwin, Graebe and Salgado (2001) is a modern text that covers almost all control design methods and types.

EXAMPLES
In this section we give two examples of systems and control theory as they occur in medical systems
biology research. The examples are both concerned with the investigation of special control structures
within a system, where the existence of the structure is medically informative. In Example 1, we
examine interactions between signalling pathways and the implications for positive feedback to trigger
critical state transitions. In Example 2 we examine organisational structures in a complex biological process that may have simplifying implications for investigations of biomarkers and molecular targets.

Example 1: The Dynamical Role of Crosstalk Between Wnt and ERK Pathways in Tumorigenesis
In this example, we consider the use of a mathematical state space model of two cell signalling pathways (Wnt and ERK) to investigate their interaction and its implication for cancer studies. Specifically, this is an example of how interaction (or crosstalk) between systems can be studied using systems and control theory tools, specifically state space modelling, simulation and structural analysis, to identify hidden feedback loops. The Wnt pathway conveys a signal from Wnt to β-catenin such that the β-catenin level increases through the inhibition of GSK-3β, which normally induces the ubiquitination of β-catenin. The increased β-catenin translocates into the nucleus and induces the expression of various oncogenes by forming a complex with TCF (the abnormal increase of β-catenin commonly occurs in colorectal cancers) (Behrens, 2005). On the other hand, the ERK pathway conveys a signal from growth factors such as EGF and PDGF to ERK through the Raf-1→MEK→ERK cascade. The finally activated ERK (ERKpp) also induces the expression of various proliferation genes (ERK mutants are commonly observed in about 30% of all human cancers).


Figure 1. (A) Crosstalks in the Wnt/ERK pathways (Wnt, Raf-1, MEK, ERK, GSK-3β, β-catenin, TCF) and the hidden positive feedback loop formed by these crosstalks. (B) Phase diagrams of ERKpp (nM) against β-catenin for a normal status, with stimulations over 100 min and 500 min durations, respectively. (C) Phase diagrams of ERKpp (nM) against β-catenin for an abnormal status, with stimulations over 100 min and 500 min durations, respectively.
The Wnt and ERK signalling pathways are usually considered independent, but there are reports of crosstalk between them. These include the direct activation of the ERK pathway by Wnt, the activation of Raf-1 through an unknown molecule X which is induced by the β-catenin/TCF complex (Yun et al., 2005; Rottinger, 2004), and the inhibition of GSK-3β by ERKpp (Almeida et al., 2005; Ding et al., 2005). If these crosstalks are taken together, then a positive feedback loop is revealed embedded in the Wnt/ERK pathways, as illustrated in Figure 1(A). The systems biology question then arises of the role of this hidden positive feedback loop (Kwon & Cho, 2007; Kim, Kwon, et al., 2007). In a normal status, the signalling molecules become activated by external stimulation to respond to environmental changes and then return to their original inactivated states as the stimulation ceases (Figure 1(B)). However, if there are mutations and the hidden positive feedback is enhanced by such mutations (abnormal status), then the system can sustain the activated states even after the external stimulation has disappeared (Figure 1(C)). In other words, the hidden positive feedback loop in the Wnt/ERK pathways can induce an irreversible state change that leads to an oncogenic status.
The state space mathematical model used to produce Figure 1(B) and (C) is shown in Box 1 (Kim, Rath, et al., 2007). The symbols Raf-1, Wnt, MEK, ERK, β-catenin, TCF, etc. are system states corresponding to the concentrations of the corresponding proteins in the signalling pathways, and the mathematical model was implemented and simulated in MATLAB.
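To illustrate, in a few lines of code, the reversible-versus-irreversible switching behaviour seen in Figure 1(B) and (C), the Python sketch below simulates a single lumped 'pathway activity' with an assumed Hill-type positive feedback term. This is emphatically not the Kim/Rath et al. model: the feedback strengths, the Hill form and the stimulus protocol are invented purely as a qualitative caricature of how a strengthened hidden positive feedback loop can lock the system in the activated state.

```python
# Minimal qualitative caricature of Figure 1(B)/(C): one lumped activity x with
# an assumed Hill-type positive feedback term. NOT the Kim/Rath et al. model;
# all parameter values and functional forms are illustrative placeholders.
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, x, k_fb, t_stim):
    stim = 1.0 if t < t_stim else 0.0            # transient external stimulation (Wnt-like)
    feedback = k_fb * x[0]**2 / (1.0 + x[0]**2)  # assumed hidden positive feedback loop
    return [stim + feedback - x[0]]              # production minus first-order decay

def final_activity(k_fb, t_stim=100.0, t_end=500.0):
    sol = solve_ivp(rhs, (0.0, t_end), [0.0], args=(k_fb, t_stim), max_step=1.0)
    return sol.y[0, -1]

print("normal (weak feedback, k_fb=1.5):  activity after stimulus = %.2f"
      % final_activity(1.5))   # returns to the low (inactive) state
print("mutant (strong feedback, k_fb=2.5): activity after stimulus = %.2f"
      % final_activity(2.5))   # remains locked in the high (active) state
```

With the weak feedback the activity relaxes back to zero once the stimulus is removed; with the stronger feedback the same transient stimulus leaves the system permanently switched on, mirroring the irreversible, oncogenic state change described above.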

Example 2: Using a Complex Systems Biology Approach to Understand Cellular Signaling Behavior in Acute Myelogenous Leukemia (AML)
In this example we illustrate a systems framework developed to search for the coordinating control
principles mentioned previously (Mesarovic et al., 2004). In a multilevel, hierarchical system, the task of
a coordinator in the upper level is to harmonize the lower-level subsystems by influencing their functions


Box 1.

d[Raf-1]/dt = Wnt + [X] - [Raf-1]
d[MEK]/dt = [Raf-1] - [MEK]
d[ERK]/dt = [MEK] - [ERK]
d[GSK-3β]/dt = Const - Wnt - [ERK] - [GSK-3β]
d[β-catenin]/dt = Const - [GSK-3β] - [β-catenin]
d[β-catenin/TCF]/dt = [β-catenin] - [β-catenin/TCF]
d[X]/dt = [β-catenin/TCF] - [X]
such that the overall system goal is advanced or attained (Sreenath et al., 2007). We demonstrate here
a systematic approach to identify a coordinator in a signalling pathway. The identification of a coordinator in a signalling pathway helps in narrowing down molecular targets for further biological study
or as biomarkers (for diagnosis). This example examines a conserved pathway, Janus Kinase Signal
Transducer and Activator of Transcription (JAK-STAT), that has been implicated in Acute Myelogenous
Leukemia (AML) (Yu & Jove, 2004). The pathway is induced by a small protein, Interleukin-3 (IL3),
that affects cell growth and differentiation (Rane & Reddy, 2002). In a healthy cell, IL3 causes transient activation of the STAT5 isoform (i.e. STAT5 phosphorylation), whereas in AML, STAT5 is constitutively
(continuously) active (Yu & Jove, 2004).
Our starting point is a mathematical model (including numerical values of nominal parameters) described by Yamada et al. (2003), which has a different receptor complex but the same downstream mechanism. To better describe the IL3-induced JAK-STAT5 mechanism, the receptor complex was modified to be compatible with the IL3 ligand. The biochemical reactions were transformed into nonlinear differential equations (Sreenath et al., 2007), resulting in a state space model with 49 states, 118 parameters, 2 outputs and IL3 as an input. We estimated 3 receptor complex parameters using semi-quantitative data (Chen et al., 2004) that expresses the relative amount of each biochemical species in terms of intensity. The biochemical reactions were further modularised using a hybrid method, and represented in the block diagram of Figure 2.
A series of in silico experiments was performed to identify a subsystem with the coordinator characteristics (Sreenath et al., 2007). Assuming that each hierarchical system has an overall system objective, a coordinator (at a higher level) should display the following properties: (i) the lower-level subsystems are functionally independent; (ii) the coordinator can change the lower-level subsystems' functioning; and (iii) the coordinator can change the lower-level subsystem behaviour such that the overall system objective is satisfied.


Figure 2. Block diagram of the modularized JAK-STAT5 system

Figure 3. In silico simulation results. (A) Nominal behavior. (B, C) Knockdown (elimination) of the indicated biochemical (SHP2 and SOCS, respectively)

Results show that eliminating the negative regulators SHP2 or SOCS causes a behaviour category different from the nominal behaviour (Figure 3). This implies that SHP2 or SOCS modules are
candidate coordinators since they are capable of changing the system behaviour. If SHP2 subsystem
is a coordinator, the first coordination condition is not satisfied because of the dependency of SOCS
on STAT subsystem. To test if SOCS subsystem is indeed a coordinator, a pathological condition was
simulated and the parameters varied within the SOCS subsystem until the system was within its nominal
behaviour category (Figure 4). This confirmed the existence of parameters in the SOCS subsystem that
promote the overall pathway behaviour to return to normal physiological conditions (Figure 5). Thus, by

20

Systems and Control Theory for Medical Systems Biology

Figure 4. In silico simulation results with the parameters of example 2 varied within the SOCS subsystem
until the system was within its nominal behaviour category

Figure 5. The SOCS subsystem (module) as a coordinator, with the JAK-STAT5 system represented as
a multilevel hierarchical system.

identifying SOCS subsystem as a coordinator in-silico experiments the number of the molecular drug
targets is potentially reduced to twelve from 118 parameters a factor of ten reduction.

C ONC LUSI ON
Systems and control theory concepts have been crucial to the development of technological systems from
the Industrial Revolution to date. Indeed, modern day technology depends completely upon theoretical
methods of systems and control for its function. The practical evidence is that living systems also use

21

Systems and Control Theory for Medical Systems Biology

systems and control methods in a structured way to, for example, (a) regulate and organise their performance, (b) achieve certain objectives and (c) resist unwanted external change. Moreover, the systems
and control analysis of biological and physiological processes suggests that nature has evolved methods
that are remarkably similar to systems and control theory principles used by engineers in technological
applications. This in turn means that we can plausibly hope that systems and control theory analysis can
be used in biology and to tease out underlying operational principles that can be of use in medicine. The
homeostatic principle is the shining of example of such a principle with general application and Example
2 in this chapter presents the essence of another more modern principle with similar potential.

R eferences
Academy of Medical Sciences and the Royal Academy of Engineering. (2007). Systems biology: A vision for engineering and medicine. London: Royal Academy of Engineering.
Almeida, M., Han, L., Bellido, T., Manolagas, S. C., & Kousteni, S. (2005). Wnt proteins prevent apoptosis
of both uncommitted osteoblast progenitors and differentiated osteoblasts by beta-catenin-dependent
and -independent signaling cascades involving Src/ERK and phosphatidylinositol 3-kinase/AKT. J Biol
Chem, 280, 41342-41351.
Angeli, D., & Sontag, E. D. (2003). Monotone control systems. IEEE Trans. Automatic Control, 48,
1684-1698.
Bacciotti, A., & Rosier, L. (2005). Liapunov functions and stability in control theory. Berlin: Springer
Verlag.
Bayliss, L.E. (1966). Living control systems. London: English University Press.
Behrens, J. (2005). The role of the Wnt signalling pathway in colorectal tumorigenesis. Biochem Soc
Trans. 33, 672-675.
Bendat, J. S., & Piersol, A. G. (2000). Random data: Analysis and measurement procedures. New York:
John Wiley & Sons.
Bennett, S. (1979). A history of control engineering: 1800-1930. London: Peter Peregrinus.
Bennett, S. (1993). A history of control engineering: 1930-1955. London: Peter Peregrinus.
Bode, H. W. (1945). Network analysis and feedback amplifier design. New York: Van Nostrand.
Box, G. E. P., & Jenkins, G. M. (1970). Times series analysis, forecasting and control. New York:
Holden Day.
Bryson, A., & Ho. YC. (1969). Applied optimal control. London: Taylor Francis
Cannon, W.B. (1932). The wisdom of the body. Chicago: W. W. Norton Press.
Chen, Y., Yu, W., Bunting, K. D., & Qu, C.-K. (2004). A negative role of SHP2 tyrosine phosphatase in
growth factor-dependent hematopoietic cell survival. Oncogene, 23(20), 3659-3669.

22

Systems and Control Theory for Medical Systems Biology

Crandall, S. H., Karnopp, D. C., Kurtz, E. F., & Pridmore Brown, D.C. (1968). Dynamics of mechanical
and electromechanical systems. New York: McGraw Hill.
Cornish-Bowden, A. (2004). Fundamentals of enzyme kinetics. New York: Portland Press.
Ding, Q., Xia, W., Liu, JC., Yang, JY., Lee, DF., Xia, J. et. al. (2005). Erk associates with and primes
GSK-3beta for its inactivation resulting in upregulation of beta-catenin. Mol Cell, 19. 159-170.
Dorato, P. (1998). Non-fragile controller design: an overview. Proceedings of the American Control
Conference, 2829-2831.
Eykhoff, P. (1974). System identification. New York: John Wiley & Sons.
Freeman, F. A., & Kokotovic, P. V. (1996). Robust nonlinear control design. Berlin: Birkhauser Verlag.
Friedland, B. (2003). Control systems design: State space methods. New York: Dover Books.
Goodwin, G. C., Graebe, S. F., & Salgado, M. E. (2001). Control system design. New Jersey: Prentice
Hall.
Higham, D. J., & Higham, N. J. (2005). Matlab Guide. New York: Society for Industrial and Applied
Mathematics (SIAM).
Hunter, P. J., & Borg, T. K. (2003). Integration from proteins to organs: The physiome project. Nature
Reviews Molecular Cell Biology, 4, 237-243
Hunter, P. J., Li, W.W., McCulloch, A. D., & Noble, D. (2006). Multiscale modelling. Computer, 39,
48-54.
Kalman, R. E., & Bucy, R. (1961). New results in linear filtering and prediction theory. Trans. ASME
J. Basic Engineering, 35, 2-34.
Kitano, H. (2007). A robustness-based approach to systems-oriented drug design. Nature Reviews Drug
Discovery, advance online publication, 23 February | doi :10.1038/nrd2195.
Kim, D., Kwon, Y.-K., & Cho, K-H. (2007). Coupled positive and negative feedback circuits form an
essential building block of cellular signaling pathways. BioEssays, 29 (1), 85-90.
Kim, D., Rath, O., Kolch, W., Cho. K-H. (2007). A hidden oncogenic positive feedback loop caused
by crosstalk between Wnt and ERK Pathways. Oncogene, (online publication 22 January 2007; doi:
10.1038/sj.onc.1210230).
Kwon, Y.-K., & Cho, K.-H. (2007). Boolean dynamics of biological networks with multiple coupled
feedback loops. Biophysical Journal, 92 (8), 2975-2981.
MacFarlane, A. G. J. (1970). Dynamical system models. London: Longman.
Maxwell, J. C. (1867). On governors. Proceedings of the Royal Society, 16, 270-283.
Mesarovic, M. D., Sreenath, S. N., & Keene, J. (2004). Search for organising principles: Understanding
in systems biology. IEE Systems Biology, 1(1), 19-27

23

Systems and Control Theory for Medical Systems Biology

Nyquist, H. (1932), Regeneration theory. Bell Systems Technical Journal, 11, 126-147.
Papoulis, A., & Pillai, S. U. (2002). Probability, random variables and stochastic processes. New York:
McGraw Hill.
Rane, S. G., & Reddy, E. P. (2002). JAKs, STATs, and Src kinases in hematopoiesis. Oncogene, 21,
3334-3358
Riggs, D. S. (1976). Control theory and physiological feedback mechanisms. New York: Robert Krieger
Publishing.
Rottinger, E., Besnardeau, L., & Lepage, T. (2004). A Raf/MEK/ERK signaling pathwayis required for
development of the sea urchin embryo micromere lineage through phosphorylation of the transcription
factor Ets. Development, 131, 1075-1087.
Schmidt, H., & Jirstrand, M. (2006). Systems biology toolbox for MATLAB: A computational platform
for research in Systems Biology. Bioinformatics, 22 \(4), 514-515.
Shannon, H. (1932). The mathematical theory of communication. Bell System Technical Journal, 27,
379-623.
Skogestad, S., & Postlethwaite, I. (1996). Multivariable feedback control. London: John Wiley and
Sons.
Slotine, J-J., and Li, W. (1991). Applied nonlinear control. London: Prentice Hall.
Smith, M. G. (1966). Laplace transform theory. New York: Van Nostrand Reinhold.
Sreenath, S. N., Soebiyanto, R. P., Mesarovic, M. D., & Wolkenhauer, O. (2007). Coordination of crosstalk between MAPK-PKC Pathways: An Exploratory Study. IET Systems Biology, 1(1), 33-40.
Sterling, P. (2004). Principles of allostasis: Optimal design, predictive regulation, pathophysiology and
rational therapeutics. In J. Schulkin (Ed). Allostasis, Homeostasis, and the Cost of Adaptation, (pp. 3245), Cambrige: Cambridge University Press.
Swameye, I., Mller, T.G., Timmer, J., Sandra, O., & Klingmller, U. (2003). Identification of nucleocytoplasmic cycling as a remote sensor in cellular signaling by data-based modeling. Proc. National
Academy of Science, 100, 1028-1033.
Tortura, G. J., & Grabowski, S. R. (2003). Principles of anatomy and physiology. New York: John Wiley
and Son.
Wellstead, P. E., & Edmunds, J. M. (1975). Least squares identification of closed-loop systems. International Journal of Control, 21(4), 689-699.
Wellstead, P. E. (1979). Introduction to physical system modelling. Oxford: Academic Press.
Yamada, S., Shiono, S., Joo, A., & Yoshimura, A. (2003). Control mechanism of JAK/STAT signal
transduction. FEBS Lett., 534, 190-196.
Yu, H., & Jove, R. (2004). The STATs of cancer - New molecular targets come of age. Nat. Rev. Cancer,
4, 97-105.

24

Systems and Control Theory for Medical Systems Biology

Yun, MS., Kim, SE., Jeon, SH., Lee, JS., & Choi, KY. (2005). Both ERK and Wnt/beta-catenin pathways
are involved in Wnt3a-induced proliferation. J. Cell. Sci., 118, 313322.
Zarrop, M. B. (1979). Optimal experiment design for dynamic system identification. Lecture notes in
Control and Inference Science (21). Berlin: Springer.

K ey T erms
Closed Loop Feedback Control: This is the process of continuously measuring the output of a system and using a modified version of the measured output at the systems input so as to alter the overall
performance of the system.
Control Theory: The set of mathematical techniques used to analyse and design control systems.
Data Analysis: The analysis of time course data from a system in order to understand the nature of
the signal generating mechanisms associated with a system. These are often unwanted noise or errors
in the process and are used to modify or correct the mathematical model.
Dynamical System: An assembly of components or sequence of reactions whose performance can
only be completely described by a study of its behaviour over time.
Feedback: The technique of monitoring information from one part of a system and using it to modify
a system element at some point prior to the monitoring point. If the monitored information is used to
add to the system element it is positive feedback, if it is used to subtract from the system element it is
negative feedback.
Frequency Domain: The name given to a mathematical space into which mathematical models are
transformed for systems and control studies using harmonic analysis of the time course data. This is
highly suited for linear systems in medical systems biology most systems are non-linear.
In-silico Simulation: The use of a special computer programme to solve the equations of a mathematical model and produce a set of plots of model parameters over time.
Linearity: Is the property of a system where if two inputs sequences Xa and Xb produce responses Ya
and Y b, then Xa+Xb will produce the response Ya+Y b . The system is said to be linear most biological
and medical systems do not satisfy this criteria and are said to be non-linear.
Mathematical Model: A set of equations, usually ordinary differential equations, the solution of
which gives the time course behaviour of a dynamical system. The set of equations for example 1 is an
example of a mathematical model.
MATLAB: The name of a widely used proprietory software package that is especially suited to
the simulation of dynamical system models and their analysis. It is produced by MathWorks Inc. It is
adapted from a public domain package of the same name public domain equivalents are available as
Octave and Scilab.
Pharmacokinetics: This refers to the dynamical mechanism by which a drug is absorbed, and
processed by the body

25

Systems and Control Theory for Medical Systems Biology

Pharmacodynamics: This refers to the analysis of the biochemical and physiological effects of drugs
and the mechanisms in which they work.
Stability Analysis: That part of systems and control theory which is used to study and predict the
stability or instability characteristics of a system from a knowledge of the mathematical model.
State Space: The name given to the mathematical space into which mathematical models are put
for systems and control studies using temporal analysis of the time course data. State space (or time
domain) analysis is suitable for linear or non-linear systems analysis. This is therefore highly suited to
medical and systems biological analysis.
System Identification: The analysis of time course data from a system in order to deduce the nature
of the system and the values of parameters that could be used in a mathematical model to reproduce
the time course data in simulation.
Systems Theory: The set of mathematical techniques used to analyse and understand the (dynamical) behaviour of systems.
Transfer Function: The name given to the frequency domain representation of a functional system
module with distinct input and output points.

26

27

Chapter III

Mathematical Description of
Time Delays in Pathways Cross
Talk
S. Nikolov
Institute of Mechanics and Biomechanics, Bulgaria
V. Petrov
Institute of Mechanics and Biomechanics, Bulgaria
V. Kotev
Institute of Mechanics and Biomechanics, Bulgaria
G. Georgiev
Institute of Mechanics and Biomechanics, Bulgaria

abstract
In this chapter we investigate how the inclusion of time delay alters the dynamic properties of (a) delayed protein cross talk model, (b) time delay model of RNA silencing (also known as RNA interference),
and (c) time delay in ERK and STAT interaction. The consequences of a time delay on the dynamics
of those systems are analysed using Hopfs theorem and Lyapunov-Andronov theory. Our analytical
calculations predict that time delay acts as a key bifurcation parameter. This is confirmed by numerical
simulations.

INTR OD UCTI ON
The aim of this review is to give an extended analytical consideration of the role of time delay in the
behaviour associated with dynamical models: (i) delayed protein cross talk model; (ii) time delay model
of RNA silencing and (iii) time delay in ERK and STAT interaction.
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

Mathematical Description of Time Delays in Pathways Cross Talk

Some of the results presented here are obtained and published in the papers (Nikolov, Kotev, Georgiev, & Petrov, 2006; Nikolov, Kotev, & Petrov, 2006a; Nikolov, Kotev, & Petrov, 2006b; Nikolov, Vera,
Wolkenhauer, Yankulova, & Petrov, 2007; Nikolov & Petrov, 2007; Nikolov, Vera, Kotev, Wolkenhauer,
& Petrov, 2008), but new considerations and improvements are also made. The investigations, conducted
on time-delay mathematical models, examine how the time-delay influences the processes of protein
synthesis, the RNA silencing and the interaction of the ERK and STAT proteins. Using the LyapunovAndronov theory and the Hopf theorem, the bifurcation values of the time delay are discovered, the
zones of stability and instability are determined, and from there the zones of norm and pathology
(cancer) for each process. Thus, the greatest advantage of such an approach is revealed, namely the
theoretical forecast (prediction) of various diseases, including cancer.

D ynamical A spects of Protein C ross T alk and T ime D elay

The notion cross talk is introduced in the intracellular kinetics to denote the mutual interaction between
signalling pathways (Wolkenhauer, Ullah, Wellstead, & Cho, 2005; Wolkenhauer, Streenath, Wellstead,
Ullah, & Cho, 2005). It is realized by corresponding cross talk of the pathways elements, i.e. proteins. So
the study of protein cross talk is necessary step in investigating the nature of pathways cross talk. The
last is also necessary to analyze more complex networks of pathways. In some cases authors talk just
about protein interactions having in view interaction between at least two proteins (Pircher, Petersen,
Gustafsson, & Haldosen, 1999). On the other hand, in terms of systems theory, the protein interaction
can be defined as feedback between two proteins. Thats why this type of interaction is also called
feedback loop.
Let us consider a simple hypothetical interaction between two proteins X and Y presented by the
following kinetic equations:

dx
= k1 y
dt

(1.1.1)

dy
= k2 x k1 y
dt

(1.1.2)

where x and y are the concentrations of the proteins X and Y respectively. The kinetic sense of the system
(1.1.1)-(2) consists in the following two processes: (i) The protein
Y spontaneously degrades in a protein X with a reaction rate constant k1; (ii) The protein X activates
the degradation of protein Y with a proportionality coefficient k2. The system (1.1.1)-(2) can be written
in the form of following linear oscillator with attenuation:

d2x
dx
+ k1 + k1k2 x = 0
2
dt
dt

(1.1.3)

Here the rate constant k1 plays role of a friction coefficient of the oscillator.
Let us further suppose that the protein X needs some time to activate the degradation of protein Y.
That means the rate of degradation

dy (t )
of Y in the moment t, is proportional to the concentration x(t
dt

) of X in a previous moment t . Thus instead of (1.1.1)-(2), we should write:

28

Mathematical Description of Time Delays in Pathways Cross Talk

dx
= k1 y
dt

dy
= k2 x(t ) k1 y
dt

(1.1.4)
(1.1.5)

The equations (1.1.4)-(5) are called differential equations with time delay.
If the time delay is sufficiently small, we can apply appropriate procedure to analyze the role of
in the qualitative behavior of the system (1.1.4)-(5). For this purpose we develop the function x(t )
in Taylors expansion:

x(t ) = x(t )

2
d 2 x(t )
dx(t )
+
+ ...
dt
2 dt 2

(1.1.6)

and retain only the first, second and third (i.e. up to quadratic one with respect to ) terms.
It is well known that, replacing x(t ) with higher order approximation of a Taylors series is not
better than applying lower order approximation (Elsgoltz, 1957; Driver, 1977). The reason of this paradoxical phenomenon consists in the circumstance that the higher order terms in (1.1.6) are in principle
not small. For example, by applying a step-by-step method (Elsgoltz, 1957; Driver, 1977) to solve a time
delay system in Taylors presentation with higher order terms, a small parameter at the higher derivatives appears. On the contrary, in lower order terms approximation similar parameter does not appear.
How detailed the statement of this circumstance is depends on the order of the system. In the present
two-order case, Taylors series can be applied only up to quadratic approximation. Certainly, linear
approximation is also admissible.
Further we replace (1.1.4) in the linear approximation of (1.1.6) and write:

x(t ) = x(t ) k1 y

(1.1.7)

After substituting (1.1.7) into (1.1.4)-(5), the system becomes:

dx
= k1 y
dt

(1.1.8)

dy
= k2 x(t ) + k2 y k1 y
dt

(1.1.9)

We consider the last system is approximately equivalent to (1.1.4)-(5) when is sufficiently small.
In a form of linear oscillator with attenuation this system looks like:

d2x
dx
+ (k1 k2 ) + k1k2 x = 0
2

dt
dt

(1.1.10)

It is seen that the time delay in this case creates a negative friction in the oscillator. The corresponding phase plot of (1.1.10) could become of type focus if it was a knot before. That means the

29

Mathematical Description of Time Delays in Pathways Cross Talk

attenuation decreases and oscillator behavior approaches to harmonic one. When = k1/k2 the resultant
friction disappears and the oscillator is harmonic. For > k1/k2 the amplitude of the oscillations amplifies
with the time. If there are appropriate nonlinear terms in (1.1.1)-(2), (respectively (1.1.10)), the amplitude
amplification could be restricted and stable self-oscillations would be observed.

Mathematical Modelling and C oncept D ynamical D iseases


It is generally accepted that signalling and cell function are both dynamic processes. Analysis of such
systems is primarily a matter of finding the number of steady states, their nature (stable/unstable) and
of characterising the transitions between them (structural stability analysis). Pathways, understood as
biochemical reaction networks, are complex systems. The complexity arises from both the presence of
feedback loops in the cell (Wolkenhauer, Streenath, Wellstead, Ullah, & Cho, 2005), a relatively large
number of molecules involved, and the nonlinear nature of interactions between molecules (Swameye,
Mueller, Timmer, Sandra, & Klingmueller, 2003).
The concept dynamical diseases is introduced by (Glass & Mackey, 1988) and it is used to characterize an anomalous temporal organization. The indication of a dynamical disease is the change in
the dynamics of a given variable. Three types of qualitative changes in the dynamics are possible: 1)
variables (which in norm dont change) under the impact of weak random fluctuations (disturbances) lead
to periodical oscillations with a big amplitude. Thus, in the regulation of a given physiological system
(which in norm doesnt possess rhythmical processes) regular oscillations may appear; 2) new periodicities may appear (occur) in a periodical process; 3) rhythmical processes may disappear and be replaced
with a constant or aperiodical (chaotic) dynamics. The construction of theoretical (mathematical) models
of physiological systems is a powerful tool of understanding the physiological dynamics. Certainly, the
modelling should have a concrete application in the experimental and the clinical systems. The basic
advantage of the theoretical models of the dynamical diseases is that they allow for the execution of
systematic manipulations that are impossible to be carried out in practice. The shortcoming of these
models is the existence of a dynamics (behaviour) similar to that of the clinically observed disease, but
due to other factors. The cause is hidden in the limited possible number of the types of bifurcations of
the stable equilibrium or the oscillating states, whereupon it is possible to achieve a qualitatively similar dynamics in various ways. Nevertheless, the investigations of the dynamical diseases with the help
of theoretical models play a significant role. It consists in the development of practical methods in the
diagnostics of pathological diseases and a choice of rational therapeutic strategies for their treatment.
With the advance in the technologies and the design of the experiments that generate quantitative
data sets, there is an increasing demand for mathematical tools to elucidate the dynamic behavior of
cells. For example, (i) the influence of time delays, that are a consequence of transport, in particular
processes between the nucleus and the cytoplasm in a cell are particularly relevant to the cyclic changes
observed in experiments (Wang, Zhou, Jing, & Chen, 2004; Chen, Wang, Kobayashi, & Aihara, 2004);
(ii) the use of S-systems, that are a wider framework to model complex biochemical interactions with
significant computational and analytical advantages (Voit, 2000). Oscillations have long been known
in metabolic pathways (Heinrich & Schuster, 1996; Heinrich, Neel, & Rapoport, 2002) and are now
also considered an important aspect of cell signalling (Nikolov, Yankulova, Nikolova, & Petrov, 2006;
Nikolov, Yankulova, Wolkenhauer, & Petrov, 2007).
In further detail, dynamical processes in the cell may play determining role in distinguishing norm
and pathology of cell functions. This is due to the fact that the corresponding physiological rhythms of

30

Mathematical Description of Time Delays in Pathways Cross Talk

the cell present a set of various dynamical processes depending on multiple parameters. The normal cell
is characterised by complex ensemble of rhythms in the different sub-systems, some time distinguishing by more or less expressed aperiodicity. Against this pseudo-chaotic back-ground, some abnormal
periodicity could appear, what would be an indication of pathology (in our case carcinoma). A sign of
dynamical carcinoma would appear in the form of an unregulated cell division cycle. If we have a theory
for controlling cell cycle dynamics, with this expanding knowledge, it would become increasingly apparent how the carcinoma can be medically treated. In recent years, studies of the Cell Division Cycle
(CDC) have uncovered many of the genes and proteins that drive and regulate cell division (Elledge,
1996; King., Deshaies, Peters, & Kirschner, 1996; Murray & Kirschner, 1989; Sherr, 1996; Stillman,
1996). Typically, such manipulations could be achieved by introducing mutations into the genes that
regulate the cycle. However, these mutations usually result in uncontrolled cell division or complete
suppression of cell division, or cause the cell commit fatal errors during the cell cycle. That is why it is
necessary to develop methods for gaining more precise control of the CDC by using our understanding
of the dynamics of the CDC oscillator. Specifically, we need to model a mechanism that can stop and
restart cell division, modulate the frequency of cell division and control the size of dividing cells (possibly breast ones). This control scheme would require only the expression of a protein that binds to and
inhibits any one of the CDC proteins. Because similar control scheme would be general and requiring
only the expression of a single protein, it would provide a practical means for tuning the characteristics
of the cell cycle in vivo.

Physiological and B iochemical A spects of T ime D elay and F eedback Loops:


T ime D elay
In the analysis of gene regulatory networks one major difficulty can be often encountered (Bratsun,
Volfson, Tsimring, & Hasty, 2005 ). It is the vast separation of time scales between what are typically
the fast reactions (dimerisation, protein-DNA binding/unbinding) and the slow reactions (transcription,
translation, degradation). In this regards, it is important to note that transcriptional and translational
processes are not just slow but also are compound multistage reactions involving the sequential assembly of long molecules. Thus, these processes should be connected with a certain characteristic mean
delay time. When delays in biochemical reactions are small compared with other significant time scales
characterizing the genetic system, one can safely ignore them in simulations. Furthermore, time delays
usually do not affect the quasiequilibrium behavior of gene regulatory networks or the mean values of
corresponding observables. However, if indeed the time delays are of the order of other processes or
longer, and the feedback loops associated with these delays are strong. Taking the delays into account
can be crucial for description of transient processes.
The fact that delayed-induced stochastic oscillations can occur during transcriptional regulation
is supported by recent studies of circadian oscillations in Neurospora, Drosophila, and others. It is
widely accepted now that these oscillations are caused by delays in certain elements of gene regulation
networks (Schepper, Klinkenberg, Pennartz, & Van Pelt, 1999; Lema, Golombek, & Echave, 2000;
Smolen, Baxter, & Byrne, 2001). It is plausible that the role of time delays in circadian rhythms has
come to light because the delays in the corresponding reactions are particularly long (several hours)
in comparison with other characteristic times of the system. Hence, the shorter delays present in other
systems also can have a significant impact on dynamics but they may be more difficult to detect with
currently available experimental methods.

31

Mathematical Description of Time Delays in Pathways Cross Talk

The Influence of Feedbacks


The feedback can be defined as a capability of the system to adapt its exit in response to its control.
Feedbacks are used in biology at the regulation of cell growth by inhibitor factors, produced by the cell
itself, and the representation of theoretical models of the pattern structure (Glass & Mackey, 1988; Turing, 1952; Gierer & Meinhardt, 1972; Meinhardt, 1994; Murray, 2002; Miguez, Izus, & Minuzuri, 2006).
They can be of two types: positive and negative. Negative feedback (NFB) occursr when a signal is
caused by the expression of its inhibitor, serving for the decay or the limitation of the signal (see Figure
1). Positive feedback (PFB) (or autocatalysis) occurs when more inhibitors, or other molecules, which
amplify the initial signal and lead to the stabilization of the amplitude, or the increase in the signals
duration, cause a signal. The wide use of feedbacks and the change in their succession makes them an
important factor in cell signals regulation.
The time to complete transcription and translation introduces time delay in the following differential
equations:
.

x1 = k1 f (x2 ) 1 x1 ,
.

x 2 = k2 x1

2 2

(1.3.1)

where k1,k 2 are the production rate constants, 1,2 are the degradation rate constants and
x1 (t ) = x1 (t 1 ), 1 > 0; x2 (t ) = x2 (t 2 ), 2 > 0 are time delays. It is well known that time delay
feedback system (1.3.1) may exhibit oscillatory behaviour and negative feedback is important for homeostasis, maintenance of system near a desired state (Thomas & dAri, 1990; de Jong, 2002).
Usually, NFB is used for limiting the signals time length. The simplest kind is, when the signal excites
its regulator, which, in its turn, at the reaching of a definite threshold, interrupts the signal, which, from a
dynamical point of view, can be related with the appearance of a bifurcation, or stability of the equilibrium
state. An example can be given with the control of the cytokine signalisation in the JAK-STAT signal
pathway (Starr & Tracy, 1997; Swameye, Mueller, Timmer, Sandra, & Klingmueller, 2003; Timmer,

Figure 1. Negative feedback system. In this case the gene encodes a protein inhibiting its own expression

mRNA
2

protein

32

Mathematical Description of Time Delays in Pathways Cross Talk

Mueller, Swameye, Sandra & Klingmueller, 2004), where the regulator is the erythropoietin (EpoR A).
JAK are soluble tyrosine kinases, which bind themselves with the cytokine receptors, transducing the
signal to the STAT proteins (Sasaki et al., 2001). Beyond a definite EpoR A concentration (threshold),
this pathway becomes unstable (which could be explained with the formation of new structures). After
a reverse decrease in its concentration, the pathway becomes stable related with homeostasis. After
1997, the family of discovered cytokine-inducible proteins (such as SOCS, SSI, JAB and CIS) (Freeman,
2000), which impede the determining of the cytokine signalisation, increases. These proteins take part
in NFB. The physiological importance of the negative feedback can be revealed through SOCS1, which
causes and inhibits an interferon signal. When the gene for SOCS1 is exhausted, a lethal outcome is
reached. Other transgenetic experiments related with the exhaustion of a given gene show that SOCS
is necessary for the control of erythropoiesis. The control of the erythropoietin reply is accomplished
by means of NFB. SOCS3 takes part also in the NFB control of metabolic factors, such as: leptin and
the hormone growth.
Compared to NFB, PFB is less used in the signal kinetics control. It has to do with the lengthening
of the signal time duration.
Besides for the signal time length control, feedbacks are also applied for spatial regulation control
(Hoffman, Levchenko, Scott, & Baltimore, 2002). An example can be given with the expression of the
homeotic genes, controlling the development in live organisms. Their specific expression has to be
determined within the organisms development process. This occurs long after the signal is localized,
and is expressed as late as in the death. NFB contributes to its determining in many cases. This antiregulation of the homeotic genes appears, as they activate their own transcription, causing stability of
their initially expressed structure.
The including of feedbacks in the cellular molecular structure is related with:
1.
2.
3.
4.
5.

PFB and NFB create a left-right symmetry in the vertebrate embryos.


PFB can coordinate separate signals.
NFB can limit the arrangement of the connections.
NFB generates stability. This homeostatic function of NFB has an important role in the cells
defence from uncontrolled growth and development, leading to cancer formations.
PFB can generate instability, very often expressed in tumours. For instance, the RAS-MAP pathway is activated by an EGF receptor (hyperactive in cancer formations), which is induced in it.

In Figure 2, a more complex feedback system is shown. In this case the genes encodes a protein
activating synthesis of another protein inhibiting expression of gene: positive and negative feedback.

A ndronov-H opf B ifurcation C alculations in D elayed S ystems: G eneral T heory


Time Delayed Systems
Delay differential equations (DDEs) are infinite-dimensional systems which find application in many
phenomena of physical and biological interest. It is well known that dynamical systems with distributed
delay are more general than those with discrete delay. This is because the distributed delay becomes
discrete when the delay kernel is a delta function at a certain time. Dynamical systems with distributed
time delay have been found in population dynamics and neural networks (Belair & Dufour, 1996; Go-

33

Mathematical Description of Time Delays in Pathways Cross Talk

Figure 2. More complex feedback system. Now, genes encodes a protein activating synthesis of another
protein inhibiting expression of gene, i.e. it is an example for positive and negative feedback.
gene b

mRNAb

Protein B

gene a

mRNAa

Protein A

palsamy & Leung, 1997). In both biological and artificial neural networks, time delays arise as a result
of the finite processing time of information. Usually, fixed time delays in models of delayed feedback
systems can sufficiently approximate simple circuits having only a small number of cells. However,
due to the spatial nature of the dynamical system resulting from the parallel pathways of a variety of
system states, it is desirable to model them using distributed delays.
In dynamical systems with delay the rate of change of the present state depends on the past state of the
system (Orosz, 2004). Time development of these systems can be described by the following DDE:
.

x (t ) = H (xt ,

(1.4.1)

where the overdot denotes the differentiation with respect to time t, the state variable is x : R R n ,
while the function xt : R X R n is defined by the shift xt ( ) = x (t + ), [ , 0]. Here we note that
+
the length of the delay R is assumed to be finite. The nonlinear functional H : X R n R R n
acts on the function space X R n of R R n functions. For the sake of simplicity, we consider a scalar
bifurcation parameter, that is, R, and assume that H is a near-zero functional in xt for any :
H (0,

34

) = 0

(1.4.2)

Mathematical Description of Time Delays in Pathways Cross Talk

Hence, DDE (1.4.1) has the trivial solution:


x(t) = 0

(1.4.3)

which exists for all the values of the bifurcation parameter . Since the function space X R n is infinitedimensional and the dimension of the DDE (1.4.1) phase space also becomes infinite.
Using a particular form for the functional H, we obtain the equation:
.
0
x (t ) = h d

( ) (x (t + ));

(1.4.4)

n n
n
n
where h, : R R R , h (0, ) = 0, and the n n matrix : [ , 0] R is a function of the
variation .
The measure can be concentrated on some particular values:

( ) = ( )+ (

i =1

) I

(1.4.5)

where i (0, ], i = 1, ..., m, m N , the non-delayed term is formally separated from the delayed terms,
and the n n identity matrix is indicated by I. Substituting measure (1.4.5) into (1.4.4) results in:
.

x (t ) = h

(1.4.6)

), ..., x (t m ); )

(1.4.7)

(x (t )), (x (t ));
i

i =1

that is:
.

x (t ) = f (x (t ), x (t

where f : R n ... R n R R n and f (0, 0, ..., 0,


m delays. If m = 1, then we have the form:
.

x (t ) = f (x (t ), x (t

) = 0. Thus, (1.4.7) is the general form of DDE with

); )

(1.4.8)

Stability and Bifurcations


According to the Riesz Representation Theorem, the linearization of functional H with respect to xt
is defined by a Stieltjes integral, that is the variational system of (1.4.1) is given as:
.

x (t ) =

d (

) x (t + )

Note that (1.4.9) can also be obtained from (1.4.4) by considering


part of the function h.

(1.4.9)

(x ) = x and taking the linear

35

Mathematical Description of Time Delays in Pathways Cross Talk

Similarly to the case of linear ODEs, one may substitute the trial solution x(t ) = k c t into Eq. (1.4.9)
with a constant vector k C n characteristic exponent C. It results in the characteristic equation:
L( ,

) = det

) = 0

(1.4.10)

which has infinitely many solutions for the characteristic exponent .


The trivial solution (1.4.3) of the nonlinear DDE (1.4.1) is asymptotically stable (which in Lyapunov
sense is stable, too) for the bifurcation parameter if all the infinitely many characteristic exponents
are situated on the left hand side of the imaginary axis. Andronov-Hopf bifurcation takes place at the
critical parameter value b if there exists a complex conjugate pair of pure imaginary characteristic
exponents:
1,2

( b ) = i

(1.4.11)

In the parameter space of the DDE, the corresponding stability boundaries are described by the
so-called L-curves:
R(

) = Re L (i ),

S(

) = Im L (i )

(1.4.12)

that are parameterised by the frequency R + referring to the imaginary part of the critical characteristic exponents (1.4.11). Since Eq. (1.4.10) has infinitely many solutions for , an infinite-dimensional
version of the Routh-Hurwitz criterion is needed to decide on which side of the L-curves the steady
state is stable or unstable. According to (Kolnanovskii & Nosov, 1986; Stepan, 1989) these kind of
criteria can be determined by calculating complex integrals around the characteristic exponents. Thus,
in case when not only one but two pairs of pure imaginary characteristic exponents (with two different
frequencies) coexist at b then a co-dimension two double Andronov-Hopf bifurcation occurs. If, a zero
exponent and a pair of pure imaginary exponent coexist at b then a fold bifurcation occurs together
with an Andronov-Hopf bifurcation (Sieber & Krauskopf, 2004).
These is another condition for the existence of an Andronov-Hopf bifurcation: the critical characteristic exponents 1,2 (1.4.11) have to cross the imaginary axis with a non-zero speed due to the variation
of the bifurcation parameter :

d
(
Re 1,2
d

L ( ;
= Re

) / L (

) 0

(1.4.13)

where the first equality can be verified by implicit differentiation of the characteristic function (1.4.10). The
above conditions (1.4.11) and (1.4.13) can be checked using the variational system (1.4.9). Contrarily, the
super- or subcritical nature of the Andronov-Hopf bifurcation , i.e., the stability and estimated amplitudes
of the periodic solutions arising about the stable or unstable trivial solution (1.4.3) can be determined
only by the investigation of the third degree power series of the original nonlinear DDE (1.4.1).

36

Mathematical Description of Time Delays in Pathways Cross Talk

TIME DE LAY DYNAMICA L MODE LLING OF PHYSI OLOGICA L SYSTEMS


N ORM AND PATH OLOGY

IN

D elayed Protein C ross T alk Model


Time delay emerges in some cases as a constitutive property of pathways (Timmer, Mueller, Swameye,
Sandra, & Klingmueller, 2004). The nature of time delay in biochemical system models is twofold. In
some cases the time delay is related to processes that take an intrinsic discrete time to be accomplished
(for instance, the synthesis of mRNA), while in other cases it is a consequence of the modelling approach used, in which complex sequences of events, which are not represented in detail, provoke the
emergence of an apparent time delay. The processes related to gene expression induce very often time
delays in the biochemical systems. Smolen et al. (2001) describes a time delay associated to the translocation of proteins and mRNAs between 50 and 100 minutes. Rateitschak et al. (2007) describes a
time delay for gene transcription between 10 and 40 minutes. Finally, in Swameye et al. (2003) a time
delay around seven minutes is defined and estimated for nucleocytoplasmic shuttling, which describes
the delay associated with a pool of processes not described in detail in the model. Various authors have
previously considered biochemical oscillators with time delay (Fall, Marland, Wagner, & Tyson, 2002;
Nikolov, Kotev, Georgiev, & Petrov, 2006; Pircher, Petersen, Gustafsson, & Haldosen, 1999; Rateitschak
& Wolkenhauer, 2007). Their analyses show that the introduction of a large enough time delay can
sometimes change the unique equilibrium of the system and induce periodic solutions (self-oscillation),
which arise from the equilibrium through an Andronov-Hopf bifurcation.
The aim of this section is to elucidate how the dynamics of the interaction between pathways via
cross talk is affected by the time delay associated with gene expression. The role that cooperativity of
the end product repression could play in the stability of the system is also analysed.

Case Study
In order to study the role of time delay in the cross talk between enzymes and repressors in protein
synthesis, we use the model proposed by Jacob and Monod (Jacob & Monod, 1961) for the activation
of the operon lactose in E. coli, which is considered a classical case study of gene regulatory networks.
Figure 3 shows the structure of the operon and how the presence or absence of lactose induces different
responses of the system.
In essence, the system acts as a feedback loop where the lac repressor protein, y3, controls the synthesis of the enzyme b-galactosidase, y2, through the repression of the mRNA, y1, production. At the
same time y3 is regulated by the effect of b-galactosidase, y2, in the reduction of lactose concentration.
In a simple form, the system can be represented by the following mathematical model in time delayed
differential equations (Nikolov, Kotev, Georgiev, & Petrov, 2006; Nikolov, Vera, Kotev, Wolkenhauer,
& Petrov, 2008):
dy1
k1
=
k4 y1 ,
dt k2 + k3 y3p1
dy
dt
dy3
= k7 y2 k8 y3 .
dt

2
= k5 y1 (t ) k6 y2 ,

(2.1.1)

37

Mathematical Description of Time Delays in Pathways Cross Talk

Figure 3. Scheme of the lactose operon in E. coli proposed by Jacob and Monod (Jacob & Monod,
1961). The system contains three structural genes (Lac Z, Lac Y and Lac A) and a regulatory gene (Lac
I). In absence of lactose (a) the regulatory gene induce the production of a repressor, y3, which blocks
the activation of the structural genes, and therefore the synthesis of b-galactosidase, y2. However, in
presence of lactose (b), the repressor associates to lactose and changes its configuration; afterwards,
the repressor is not able to block the activation of the structural genes, which allows the synthesis of
mRNA, y1. After a series of structural processes including translocation to the cytosol and conformational
changes, y1 induces the synthesis of b-galactosidase and lactose permease. Finally, these translocation
and conformational changes provoke a time delay in the transmission of the signal between the nucleus
and cytosol ().

(a)

(b)

where ki (i = 1,2,..,8) are the kinetic rate constants and p1 measures the cooperativity of the end product
repression. There are several mathematical models available in the literature describing the lac operon
(Chen, Wang, Kobayashi, & Aihara, 2004; Thomas, 1998; Yildirim & Mackey, 2003; Yildirim, Santillan,
Horike, & Mackey, 2004. We have chosen this particular model because of its structural simplicity and
reduced number of equations, which allow the use of analytical tools for our study. In the present work,

38

Mathematical Description of Time Delays in Pathways Cross Talk

we consider two representative values for p1 : a) p1 = 1.0 , which means non-cooperative repression
(Nikolov, Kotev, Georgiev, & Petrov, 2006) and b) p1 = 2.0 , which represents cooperative repression.
The delayed function k5 y1 (t ) encodes the assumption that the rate of enzyme synthesis is proportional to the mRNA concentration in the nucleus, y1, in the moment (t ). Tau, , represents the time
that takes the translocation of the mRNA to the cytosol and their configurational changes required to
start the synthesis of b-galactosidase, y2. Figure 4 contains a scheme of the model discussed with the
feedback loop and mechanism clearly indicated.
The fixed points of the system represented by Equation 2.1.1 are defined by the following set of
algebraic equations, including the rate constants of the model:
_ p1 +1

y3

kk k
k2 _
y3 1 5 7 = 0,
k3
k3 k4 k6 k8

y1 =

k6 k8 _
y3 ,
k5 k 7

y2 =

k8 _
y3
k7

(2.1.2)

The first equation in (2.1.2) has always only one real positive root, which ensures that the system
has only one physiologically feasible fixed point. In case of non-cooperativity, the equation describing
the stationary values of y3 is simpler:

Figure 4. Simplified scheme of the lactose operon in E. coli. Legend: y1: mRNA; y2: b-galactosidase;
y3: lac repressor protein. Dashed arrows represent activation, while dashed lines with a bar represent
inhibition. Solid lines represent synthesis (when starting with a circle and curved line, ) or degradation
(). The clock symbol represents a time-delay in the process. In absence of lactose, y3 is accumulated
repressing the production of mRNA, y1, and ultimately the synthesis of b-galactosidase, y2. When there
is a significant concentration of lactose, the formation of the complex lactose-lac repressor reduces the
available free y3. In contrast, an increase in the concentration of b-galactosidase reduces the intracellular concentration of lactose, which provokes an increase in the effective amount of y3.

39

Mathematical Description of Time Delays in Pathways Cross Talk

_
k
4k k k
1 k
y3 = 2 + 2 + 1 5 7
2 k3
k
k
3 k 4 k6 k8
3

(2.1.3)

Andronov-Hopf Bifurcation of the Time Delay Model


In this section, we consider the system when the cooperativity in the end-product repression, p1 , is

equal to one and all constants of the model are real positive numbers. Let E = y1 , y2 , y3 denote the
_

equilibrium point of the system. We use Andronov-Hopf bifurcation analysis and consider the time delay
as a bifurcation parameter. The first step is to obtain the characteristic equation for the linearisation
of the system near the equilibrium E. Let us consider a small perturbation about the equilibrium level
defined as:
_

y1 = y1 + x

y2 = y2 + y

y3 = y3 + z

In the general case, the function

k1
=
k2 + k3 y3p1

(2.1.4)

k1
can be written as a MacLaurin series:
k 2 + k 3 y3p1

k1
k1
k k
=
= 1 1 3

+ k3
k3

+ 1

k
+ 3

2
2

k
3

3
3

+ ...

(2.1.5)

_ p1

where = k2 + k3 y 3 and is a polynomial of z. If we take only linear, square and cubic terms from
(2.1.5) and after substitution of (2.1.4) into differential equations (2.1.1) we have:

kk
k k2
k k3
dx
= k4 x 1 23 z + 1 33 z 2 1 43 z 3 ,
dt
dy
= k5 x k 6 y ,
dt
dz
= k7 y k8 z

dt

(2.1.6)

where = k2 + k3 y3 when p1=1. If we neglect terms of second and third order, the stability matrix leads
to the following characteristic equation:
3

+p

+ q + r1 = A

(2.1.7)

where
p = k4 + k6 + k8 , q = k4 k6 + k4 k8 + k6 k8 ,
r1 = k4 k6 k8 , A =

40

k1k3 k5 k7
2

(2.1.8)

Mathematical Description of Time Delays in Pathways Cross Talk

This characteristic equation (2.1.7), which is a transcendental equation, cannot be solved analytically
and has an indefinite number of roots (Elsgolz & Norkin, 1974; Khan & Greenhagh, 1999). In essence,
we have two main tools besides a direct numerical investigation: linear stability analysis, which is valid
in case of small time delays, and the Hopf bifurcation theorem (Cai, 2005). In the following section we
analyse both cases.

Effects of Small Time Delays Using Linear Stability Analysis


For a small time delay (i.e., 1 min) the method of linear stability analysis is a very convenient approach to find the bifurcation point of the system. In this case let 1 ; then the characteristic
equation (2.1.7) becomes:
3

+p

+ (q A

+ r = 0

(2.1.9)

where r = r1 + A . By applying the Hopf bifurcation theorem and the Routh-Hurwitz criteria, an Andronov-Hopf bifurcation occurs at a value = b where the following conditions are satisfied:

p > 0, q A
R = p (q A

b
b

> 0, r > 0,

) r = 0

(2.1.10)

Let us define the function h, which represent the characteristic function in case of small time delays:
h( ,

)=

+p

+ (q A

+ r

(2.1.11)

If we evaluate the roots of h at = b, we obtain the following values that represent the eigenvalues
of the system in the approximation of small time delay:
1

= p = (k4 + k6 + k8 ) < 0,

2,3

= ik = i q A

(2.1.12)

where i is the imaginary unit. In order to clarify the properties of the system in the Hopf bifurcation
( = b), we analyse the derivative of the eigenvalues around this value of time delay. If we differentiate
implicitly h ( ( ), ) we obtain:
h h
d
=
=
/

d
3

A
+ 2 p + q A

(2.1.13)

Then, we evaluate the required derivatives of h at b. The two roots 2 and 3 are complex complementary and therefore have identical real part. Thus, the result for 2 is identical to the result for 3. In
the particular case of 2, we obtain:

41

Mathematical Description of Time Delays in Pathways Cross Talk

( b ) = ikA (3k 2 + q A b 2 pki )


L2 + I 2

(2.1.14)

where L = 3k 2 + q A b , I = 2 pk . The real part of (2.1.14) has the form:

d (
Re 2
d

) = 2 pk 2 A > 0

L2 + I 2

(2.1.15)

The inequality stated in (2.1.15) is sufficient to ensure that the real part of the eigenvalue 2() at
= b has a positive slope. In this case, the use of the Hopf bifurcation theorem predicts that the system
will have a limit cycle for a time delay with the critic value = b when the approximation for small
time delays is used.

Effects of Large Time Delays Using Hopf Bifurcation Analysis


For larger time delays , the linear stability analysis of the previous section is no longer valid and we
need to use an alternative approach. If we define = m + in and rewrite (2.1.7) in terms of its real and
imaginary parts we obtain:

m3 3mn 2 + pm 2 pn 2 + qm + r1 + A m cos n = 0,
3m 2 n n3 + 2 pmn + qn + A m sin n = 0

(2.1.16)

In order to find the first bifurcation point, we set m = 0 in the equations. Then, the above two equations reduce to the following:

pn 2 + r1 + A cos n = 0,
n3 + qn + A sin n = 0

(2.1.17)

These two equations in (2.1.17) can be solved numerically, leading to (nb0 , b0 ), the first bifurcation
point. The subsequent bifurcation points (nb, b ) satisfy the following relation:
nb

= nb0

0
b

+ 2l , l =1,2,...

(2.1.18)

We can guarantee that (2.1.17) has at least one positive root, that is, there is at least one bifurcation
point of the system. By squaring the two equations in (2.1.17), adding them, and using properties of
trigonometric functions, it follows that:

n 6 + (p 2 2q )n 4 + (q 2 2 pr1 )n 2 + r12 A2 = 0

(2.1.19)

Here we note that this is a cubic equation in n 2 that describes the stability of the system. If the condition r12 < A 2 is satisfied, the left-hand side of the equation is positive for large values of n 2 and negative
for n = 0 . This means that (2.1.19) has at least one positive real root. Moreover, when we introduce the

42

Mathematical Description of Time Delays in Pathways Cross Talk

variable z = nb2 (2.1.19) reduces to:

g (z ) = z 3 + (p 2 2q )z 2 + (q 2 2 pr1 )z + r12 A2 = 0
The derivative of this equation in z has the following value:

g ' (z ) = 3z 2 + 2 (p 2 2q )z + q 2 2 pr1
'
The interesting point is that g (z ) = b > 0 if nb is the least positive simple root of the equation (2.1.19).
This property will be used in the following demonstration. Let us denote the characteristic equation
without linear approximation:

H( ,

)=

+p

+ q + r1 + A

Again, in order to clarify the properties of the system in the studied fixed point ( = b), we analyse
the derivative of the eigenvalues around this value of time delay:
A
2
+ 2 p + q A

H H
d
=
=
/

d
3

If we evaluate the real part of this equation at the fixed point = b and set
dm
d

d
= Re

(2.1.20)

= inb, we obtain:

nb2 3nb4 + 2 (p 2 2q )nb2 2 pr1 + q 2


L12 + I12

where L1 = 3nb2 + q A b cos nb b and I1 = 2 pnb + A b sin nb b. The expression between brackets in the
previous equation coincides with g ' (z ) = 3z 2 + 2 (p 2 2q )z + q 2 2 pr1, and then we can guarantee that
g ' (z ) = b > 0. This means that this derivative is positive:

dm
d

d
= Re

nb2 g ' (nb2 )


L12 + I12

>0

(2.1.21)

Under these assumptions and according to the Hopf bifurcation theorem (Marsden & McCracken,
1976), we can guarantee that an Andronov-Hopf bifurcation occurs as passes through b (Cai, 2005).
The system presents an Andronov-Hopf bifurcation in the case of long time delay. Under reasonably
generic assumptions, we can expect to see a small amplitude limit cycle emerging from the fixed point
when the value of the time delay changes, which will provoke oscillations of the concentration of proteins
and mRNA of the operon with reduced amplitude around a steady-state solution. Only with the information generated until now, it is not possible to decide whether these oscillations will be a sustained or
a transient response of the system in a small area near the boundary of stability. Hence, it is necessary
to calculate the so-called first Lyapunov value at the boundary of stability region R=0 of the system
(2.1.1) to determine: i) the character (stable or unstable) of equilibrium state at R=0; ii) the stability (or
instability) of this limit cycle at transition from R<0 to R>0.

43

Mathematical Description of Time Delays in Pathways Cross Talk

Bifurcation Analysis Using First Lyapunov Value


Here, we study the advantages of the use of the first Lyapunov value (Andronov, Witt, & Chaikin, 1966;
Bautin, 1984, Nikolov, 2004, Nikolov & Petrov, 2004; Shilnikov, Shilnikov, Turaev, & Chua, 2001) for
investigating in detail the qualitative properties of the bifurcation behaviour for the system with respect
to a time delay. For the purposes of our analysis, we derive an approximation of the model (2.1.1) (for
p1 = 1; 2 ) in which y1(t ) is expanded as a Taylors series of the time delay:

dy1
k1
=
k4 y1 ,
dt k2 + k3 y3p1
dy2
= k5 y1 (t
dt

) k6 y2 k5 y1

dy3
= k7 y2 k8 y3
dt

y1 +

y1 k6 y2 ,
2
2 ..

(2.1.22)

where we have retained only the first, second and third terms (i.e. up to quadratic one with respect to
). Hence, the role of time delay in the qualitative behaviour of system (2.1.1) can be analysed. It is well
known that replacing y1(t ) with higher order approximations of a Taylor series is not better than
applying lower order approximation (Elsgoltz, 1957; Driver, 1977). The reason for this paradoxical
phenomenon consists in the circumstance that the higher order terms in the second equation of (2.1.22)
are, in principle, not small. For example, by applying a step by step method (Elsgoltz, 1957; Driver,
1977) to solve a time delay system in Taylors presentation with higher order terms, a small parameter
at the higher derivatives appears. On the contrary, in lower order terms approximation a similar parameter does not appear. The detailed statement of this circumstance depends on the order of system.
In the present case, Taylors series can be applied only up to quadratic approximation. Certainly, linear
approximation is also admissible.
Following (Bautin, 1984; Nikolov, 2004; Nikolov, 2005), we calculate the so-called first Lyapunov
value L1 (see the appendix in (Nikolov & Petrov, 2004) or for a detailed discussion (Andronov, Witt,
& Chaikin, 1966; Shilnikov, Shilnikov, Turaev, & Chua, 2001)) at the boundary of the stability region
R=0 of the system (2.1.22). In accordance with the Lyapunov-Andronov theory we have: i) the sign of
the Lyapunovs value determines the character (stable or unstable) of the equilibrium state at R = 0;
ii) the character of the equilibrium state, at R = 0, qualitatively determines the reconstruction of the
phase portrait (including the stability or instability of the limit cycle) at the transition from R < 0 to
R > 0 (Andronov, Witt, & Chaikin, 1966; Bautin, 1984). When the system without cooperativity is
considered, it is not difficult to obtain that the approximate canonical form of (2.1.22):

dx
= k4 x c1 z + c2 z 2 c3 z 3 ,
dt
dy
= c4 x k6 y + c5 z c6 z 2 + c7 z 3 ,
dt
dz1
= k7 y k8 z

dt

44

(2.1.23)

Mathematical Description of Time Delays in Pathways Cross Talk

where:

c1 =
c5 =

k1k3

, c2 =

k1k3 k5

k1k32
3

, c6 =

, c3 =

k1k32 k5
3

k1k33
4

, c4 = k5 (1 + k4 ),

, c7 =

k1k33 k5
4

(2.1.24)

Hence, the Routh-Hurwitz conditions for stability of the steady state, defined by (2.1.2), are:

p = k4 + k6 + k8 > 0

(2.1.25)

q = k4 k6 + k4 k8 + k6 k8 c5 k7 > 0

(2.1.26)

r = k4 k6 k8 + c1c4 k7 c5 k4 k7 > 0

(2.1.27)

R = pq r = (k6 + k8 )(k42 + k6 k8 + k4 (k6 + k8 ) c5 k7 ) c1c4 k7 > 0

(2.1.28)

When conditions (2.1.27) or (2.1.28) are not valid, the steady state (2.1.3) becomes unstable. This
means that there are several different values for some of the rate constants, k i (i = 1, 2, 3, ..., 8), and time
delay, , that make the coefficients R and r pass through bifurcation boundaries in the parametric space
(p > 0, q > 0, r ), (p > 0, q > 0, R ) or (k1 , k 2 , ..., k8 ) where the steady state (2.1.3) could change its
character (stable or unstable):
R = pq r = 0

(2.1.29)

r = 0

(2.1.30)

Following (Bautin, 1984), we call L1 the first Lyapunov value at boundary R = 0 and l1 at the
boundary r = 0 . After accomplishing some algebraic operations we calculate the first Lyapunov value
at the boundary R = 0 . Thus, we obtain:

L1 (

0 )=

31

4q

4
32
2
0

q ( pB1 + B3 ) + (3 p 2 + 8q )B2 B1 B2

32

(2.1.31)

where:

B1 = 2 (
0 = det

'
31 2

c ), B2 =

'
32 6

11

12

21

22

23

31

32

33

31 6

c , B3 = 3 (

23 2

23 3

13 7

13

(2.1.32)

45

Mathematical Description of Time Delays in Pathways Cross Talk

Here we note that:


11

= c1 (k4 + k8 ),

31

= (k6 + k8 )(k4 + k8 ),

12

22

= c4 k8 ,

= (k6 + k8 ) q ,

23

= c4 q ,

32

21

= c1c4 + c5 (k6 + k8 ) ,

= c4 k7 ,
33

13

= k4 (k6 + k8 ),

=0

(2.1.33)

(2.1.34)

Consequently, (2.1.32) follows from:


'
31

11

22

12

21

'
32

11

32

12

31

From (2.1.31) it is easy to see that in this case the first Lyapunov value can be positive or negative,
i.e. hard and soft loss stability can take place. Therefore, time delay is a key factor in the bifurcation
behaviour of the model (2.1.22). In other words, in the case of safe boundaries, L1 < 0 , a slow drift of
the parameters back into the stability region brings a system back into the original response, whereas
in the dangerous case, L1 > 0 , this is generally impossible. Obviously, safe and dangerous boundaries
are distinguished mainly by the stability or instability of the corresponding equilibrium state, or periodic trajectory, on the boundary (Shilnikov, Shilnikov, Turaev, & Chua, 2001). Here we could note that
at the boundary of stability r = 0 (positive feedback loop), two cases occur (Bautin, 1984; Shilnikov,
Shilnikov, Turaev, & Chua, 2001):
If l1 is different to zero, then in case of a transition from negative values to positive ones the equilibrium state becomes unstable double point, the system has irreversible behaviour and the boundary
r = 0 is dangerous. Also, from the sign of the added condition:

* = p 2 q 2 + 4q 3

(2.1.35)

We can have two cases: (i.1) if * < 0 , then the equilibrium state becomes saddle-knot; (i.2) if * > 0 ,
then the equilibrium state becomes saddle-focus.
If the first Lyapunovs value l1 is zero, then the equilibrium state is stable.
In the terms of model (2.1.22), using (2.1.29), we obtain the bifurcation value of time delay, b:
b

(k4 k6 + k4 k8 + k6 k8 )
k1k3 k5 k7

k4 k6 k8 2
1
1 +

k4 + k6 + k8
k1k3 k5 k7

(2.1.36)

As a consequence of our analysis, we can predict that a limit cycle will emerge if the time delay is
higher than b, while the cycle limit will vanish if the time delay is smaller. In other words, as a result
of the evidence obtained through (2.1.25)-(2.1.28) and (2.1.36) we may conclude that in this case the
time delay has a destabilizing role because it changes drastically the properties of the system when pass
through the bifurcation point provoking the emergence of a cycle limit. For time delays longer than the
bifurcation value b, the lactose operon would present sustained oscillations with coupled periodic variations on the concentration of both proteins and the mRNA. In contrast, a time delay smaller than b will
provoke only transient oscillations of the species integrating the operon around a stable steady-state.

46

Mathematical Description of Time Delays in Pathways Cross Talk

Numerical Analysis
The values chosen for the parameters and used in the numerical analysis were selected according to
(Jacob & Monod, 1961; Bliss, Painter, & Marr, 1982; Pircher, Petersen, Gustafsson, & Haldosen, 1999;
Fall, Marland, Wagner, & Tyson, 2002; Timmer, Mueller, Swameye, Sandra, & Klingmueller U., 2004).
The analytical results stated in previous sections permit us to predict how the properties of the system
vary when the parameters in the model are modified. In Figure 5 we show the dependence on k3 and k5
of the critic value of time delay in which the system suffers the transition to a stable limit cycle, b.
The value of b is quite sensitive to changes in both parameters, especially k3. For example, a change
of 50% in the value of k3 multiplies by three the value of the critical time delay, and a change of 100%
in k3 increases the value of the critical time delay up to 27 minutes. This means that for very high values
on k3, the transition to stable limit cycles will not appear in the system because it requires too long,
biologically unfeasible, time delays (much longer than half an hour).
In order to compare the predictions with numerical results, the governing equations of the model,
represented by (1), were solved numerically using MATLAB (Mathworks, 2007). The use of (2.1.36)
permitted us to compute a predicted value for the critic time delay of b = 4.17 minwhen we fixed the
parameters k1 to k8. In Figure 6 we illustrate the dependence of the model behaviour on the value for
the time delay (around this predicted critic value b ). Figure 6a shows a simulation of the dynamics
for y1, y2 and y3 when time delay is lower than the critic value b ( = 2.5). After several physiologically
acceptable fluctuations, the variables describing mRNA, y1, enzyme, y2 and repressor, y3 approach to
constant values that describe a steady-state of the system. In other words, in this case the Routh-Hurwitz
condition for stability is valid and the system (2.1.1) lies in a stable zone of its parametric space. Notice
that in this case R = 0.5168 > 0 . On the other hand, Figure 6b depicts the dynamics for the case of time
delays higher than the critic value ( = 6.5) where R is negative ( R = 0.2397 ). Mathematically, this

Figure 5. Prediction of values of b (as function of k 3 and k 5 ) using analytical formula (2.1.36)

0


Taub

0

0

0



.

0.

.
0.
K5

.
0.

K3

47

Mathematical Description of Time Delays in Pathways Cross Talk

state corresponds to loss of stability. According to the analytical results we can conclude that a stable
limit cycle (self-oscillations) occurs.
Unfortunately, the information supplied by Figure 6b cannot fully illustrate the self-oscillation behaviour of the dynamical model (2.1.1). Hence, in Figures 7a and 7b we show the self-oscillation solutions for y1 and y3. Although the stable limit cycle occurs, which coincides with the predictions of our
qualitative analysis, we can see that the amplitudes of the oscillations for = 6.5 min are very small for
the different variables (approximately 1% of the total values in the analysed state).
These results shown in Figure 7 need additional discussion. The existence of cooperativity in protein-gene interactions results in large changes in activation with small changes in protein concentration
(Wolkenhauer, Sreenath, Wellstead, Ullah, & Cho, 2005. On the other hand, the minimum cooperativity
required for oscillations becomes small when the length of the feedback loop increases. Since in the
case analysed we supposed low cooperativity and a short length feedback loop, the limit cycle has small
amplitude (Fall, Marland, Wagner, & Tyson, 2002; Agnati, Tarakanov, & Guidolin, 2005). On the other
Figure 6. Stable (a) and unstable (b) solution of the system (1) (p1=1) at = 2.5 min and = 6.5 min
y1

(a)

y1

y3

(b)

y3

Figure 7. Periodic solutions for mRNA (a) and repressor (b) at = 6.5. The amplitude of the oscillation
is much reduced and it represents only 1% of the values of both variables in the analysed state

y1

y3

(a)

48

(b)

Mathematical Description of Time Delays in Pathways Cross Talk

hand, the case when cooperativity is considered (p1 = 2) is shown in Figure 8. For the same values of
the rate constants from k1 to k8 and time delay (see (2.1.37)) the stable limit cycle also occurs but now
the system (1) has a periodic solution with amplitude much larger than the previous case (amplitude of
the oscillation around 100% of the average value in different variables).
In Figures 9a and 9b L1(0) (calculated on the boundary of stability R = 0 in (2.1.31)) is shown for
different values of the bifurcation parameters , k5 and k3 (which are respectively, the rate of synthesis
of b-galactosidase, y2, and the rate of repression of mRNA, y1, by lac repressor protein, y3). In Figure
9a (k1 = 248, k2 = 0.05, k3 = 1.1, k4 = 0.3, k6 = k7 = 0.2 and k8 = 1.1) both and k5 are considered bifurcation parameters. In this case, L1(0) is only negative, which ensures soft stability loss. In Figure 9b
(k1 = 248, k2 = 0.05, k4 = 0.3, k5 = 0.1, k6 = k7 = 0.2 and k8 = 1.1) k3 and are the bifurcation parameters
considered. The first Lyapunov value is also negative, and soft stability loss also takes place. This

Figure 8. Periodic (self-oscillation) solutions (a), (b) (c) and phase portrait (d) of the system (2.1.1) (p1=2)
at k1 = 250, k2 = 0.05; k3 = 1.1, k4 = 0.3, k5 = 0.4, k6 = 0.62, k7 = 0.2, k8 = 0.65,
= 6.5

y2

y1

(a)

(b)

y3

(c)

(d)

49

Mathematical Description of Time Delays in Pathways Cross Talk

behaviour is shown in Figures 9c-g. A stable limit cycle with small amplitude occurs, which is in accordance with the assumption of no cooperativity (p1 = 1). However, the system (2.1.22) has a very long
transient regime (Figures 9d-e).
Figure 10 demonstrates the dependence of the oscillation magnitude on the cooperativity p1 (considering the same numerical values of rate constants and the time delay than in case represented in Figure 9).
Comparing Figure 9d-g and Figures 10a-b, we conclude that for a larger value of cooperativity (p1 = 2)
the oscillation magnitudes are also larger and their frequency (during 400 min the system makes 5
spikes) is smaller than the value obtained at p1 = 1.

Figure 9. Analysis of the graphs of L1 versus the bifurcation parameters k5 [0.001,15], [8.,8.4897]
(a) and k3 [1,10],
[8.01, 8.05] (b). Stable solutions for k5 = 0.1 and = 2, small time-delay (figure
9c), and a sustained oscillation with reduced amplitude at k5 = 0.1 and = 8.5 (Figures 9d-g)

(a)

y1
y2
y3

y1
y2
y3

(c)

50

(b)

(d)

Mathematical Description of Time Delays in Pathways Cross Talk

Figure 9. (continued)

y1
y2
y3

y1

(e)

(f)

y2

(g)

Time Delay Model of RNA Silencing


Some previous results about numerical investigations of time delay model of RNA silencing were presented in Paper D5.2 of project COSBICS. Also, in the Paper D4.3 of same project a retroviral based
expression system for the effective delivery of shRNAs (small hairpin RNAs) which can induce RNAi
response in BaF3 cells is established. There, it is proposed design of optimal RNA hairpin construct
by employing the web application E-RNAi (Alziman, Horn, & Bourtos, 2004). As a result silencing
efficiencies up to 37% of CIS protein levels are achieved. Here, we give the final form of this theoretical investigation.
RNA interference (RNAi) is a relatively new mechanism for selectively silencing of genes in a variety of organisms, including plants, fungi and worms (Boese, Scaringe, & Marshall, 2003; Berezhna,

51

Mathematical Description of Time Delays in Pathways Cross Talk

Figure 10. Periodic solution (a) and phase portrait (b) of the system (2.1.28) when: p1 = 2, = 8.5,
k1 = 248, k2 = 0.05, k3 = 1.1, k4 = 0.3, k6 = k7 = 0.2 and k8 = 1.1

y1
y2
y3

(a)

(b)

Supekova, Supek, Schultz, & Deniz, 2006; Hannon, 2002). It is rapidly advancing as both a target tool
of validation in drug discovery and as a potential therapeutic one. As systematic or therapeutic application RNAi can be triggered by the delivery of double-stranded RNA (dsRNA), small hairpin RNA
(shRNA) or micro RNA (miRNA) (Storz, 2002; Hobert, 2004). The specificity of siRNA consists in
the fact that together with the corresponding proteins it forms the RNA-induced silencing complex
(RISC). The last is capable of recognizing the target mRNA by hibridization and induces endocleolytic
cleavage (Figure 11).
As it is known, genes provide cells with instructions for making specific proteins that are encoded
by that gene. By silencing a gene we can refer to stopping or reducing significantly the production of
the specified protein encoded by the target gene (Rozema & Lewis, 2003). The cell makes a copy of the
gene that encodes for the particular protein in order to initiate the protein production. It is important to
note that this copy is not made of DNA, but rather of ribonucleic acid, or RNA. Moreover, it is referred
to as messenger RNA, or mRNA. It is precisely this mRNA, manufactured in the cell nucleus that
travels into the cytoplasm of a cell to the organelles responsible for protein synthesis. There, it directs
the production of a protein based on the DNA sequence carried by the mRNA. When this process works
uninterrupted and the protein is produced, the gene is said to be expressed.
The duration of gene silencing lasted for ~1 week in rapidly dividing cells but longer than 3 weeks
in nondividing cells both in vitro and in vivo, supporting the hypothesis that dilution due to cell division is the major factor controlling the duration of luciferase knockdown in rapidly dividing cells. The
gene silencing duration by siRNA can be longer than that achieved with other nucleic acid-based gene
inhibition strategies, whose knockdown as a rule lasts only in the order of 1-2 days.
Modeling in this area (gene expression and RNA silencing) is occurring at a rapid pace (Munroe &
Zhu, 2006). In the last decades the different mathematical models of a RNA silencing have been proposed with the goal to understand biological processes that are regulated by the dynamical properties
of mRNA (Bergstrom, McKittrick, & Antia, 2003; Bartlett & Davis, 2006; Gafney & Monk, 2006).
Generally speaking, the advantages of mathematical modeling are as follows: (i) the development of

52

Mathematical Description of Time Delays in Pathways Cross Talk

Figure 11. Schematic diagram of the basic RNA interference (taken from www.calandopharma.com)

mathematical models leads to a decrease in the number of experiments using expensive biological material, and makes it possible to predict the results of many experiments with a great precision (Edissonov
& Nikolov, 2001); (ii) the use of the theory of dynamical systems leads to the creation of base models
of the complex biological processes, such as: gene regulation (transcription), gene expression, protein
synthesis, interaction between various molecules, etc., and allows investigation of new mechanisms of
the RNA interference in qualitative ways; (iii) the results of the modeling can be presented visually and
used by biologists, biochemists and immunologists in clinical practice and therapy.
To our knowledge, there are only a few published mathematical models of studies regarding the
kinetics of the intracellular RNAi process (Bergstrom, McKittrick, & Antia, 2003; Arciero, Jackson,
& Kirschner, 2004; Raab & Stephanopoulos, 2004; Groenenboom, Maree, Hogeweg, 2005; Bartlett
& Davis, 2006). Of these models, none has combined the delivery process and the interaction with
the RNAi machinery in mammalian cells. For example, in (Groenenboom, Maree, Hogeweg, 2005),
Groenenboom and co-authors proposed a mathematical model that contained several extensions to the
core RNAi pathway, providing for siRNA degradation by Dicer (RNase III family of endonucleases)
as well as primed amplification.

Models
A Basic Model of RNA Silencing
In (Bergstrom, McKittrick, & Antia, 2003), a mathematical description of a conceptual model of RNA
silencing process is presented. In Figure 12 a schematic outline of the basic elements comprising this
model is provided.
We can model the steps denoted in Figure 12 by using the following autonomous system of ordinary
differential equations solved with respect to the derivatives:

53

Mathematical Description of Time Delays in Pathways Cross Talk

Figure 12. Schematic diagram of the basic RNA-silencing model according to (Bergstrom, McKittrick,
& Antia, 2003)

Degradation

Synthesis

Complex

dsRNA
Cleavage
and
Association

mRNA
RISC

dD
= a.D + g .C ,
dt
dR
= an.D d R .R b.RM ,
dt
dC
= (g + dC ).C + b.RM ,
dt
dM
= h d M M b.RM
dt

(2.2.1)

In this dynamical system, the state variables D,R,C,M represent the concentrations of the dsRNA,
RISC, RISC-mRNA complex, and mRNA at time t.
Here, we present an initial analysis of the system (2.2.1), which differs from the results presented in
Bergstrom, McKittrick, and Amtia (2003). The nonzero steady state values of D,R,C,M in analytical
form are:
D=

_
( g + dC )d R
d d
g
h
C ,R =
C ,C =
M R ,M =
a
dR
g+dC
b
b

where

(2.2.2)

= [g (n 1) d c ]. For parameter values:

a = 10; b = 0.1; h = 2; g = 0.1; d M = 0.5; d R = 0.1; d C = 0.05; n = 5

(2.2.3)

The steady state values (2.2.3) are positive and have physical sense. Then their stability can be analysed by linearizing the system (2.2.1) around the steady state (2.2.2) and computing the corresponding
Routh-Hurwitz coefficients. In this case they all are positive and have the form:

54

Mathematical Description of Time Delays in Pathways Cross Talk

p = a + d R + bM + g + d C + bR + d M

(2.2.4)

q = a (d R + bM + g + dC + d M + bR ) + (d R + bM )( g + d C ) +
+ (d R + bM )(d M + bR ) b 2 MR + ( g + d C )(d M + bR )

(2.2.5)

r = a (d R + bM )( g + d C ) abgnM + a (d R + bM )(d M + bR )
ab 2 MR + a (d M + bR )( g + dC ) + bRM ( g + dC ) + bRM ( g + dC )

(2.2.6)

s = a ( g + dC ) d R (d M + bR ) abgnd M M

(2.2.7)

l = pqr sp 2 r 2

(2.2.8)

It can be shown that when the steady state values C , D, M , R are positive, then the Routh-Hurwitz
coefficients are positive, too. Thus, all physically meaningful steady state values are stable (in accordance with the corresponding Routh-Hurwitz criterion of stability) (Bautin, 1984), and the model (2.2.1)
can be considered as realistic at least in qualitative sense.

Model with Time Delay


In this section, we investigate the following system:

dD
= a.D + g .C (t ) ,
dt
dR
= an.D d R .R b.RM ,
dt
dC
= b.RM (g + d C ).C (t
dt
dM
= h d M M b.RM .
dt

),

(2.2.9)

where the delay function C(t ) expresses the assumption that the net rate of degradation of dsRNA
by Dicer and background process, and the net rate of loss of dsRNA are proportional to the triggers
process of binding of mRNA to form the RISC-mRNA complex in the moment (t ). This is in view
of the consideration that the regeneration (or degradation) of the RISC-mRNA complex needs a finite
time . Of course, the finite time of regeneration can be different from this of degeneration. Because,
here, in order to make the analytical investigation of system (2.2.9) easier, we shall
assume
that_ the two
_
_
_
h
)
times are equal. Hence, the system (2.2.9) has also two steady states- the trivial ( D = C = R = 0, M =
dM
and see (2.2.2).
Further, we investigate the bifurcation structure, particularly the Andronov-Hopf bifurcation for the
system (2.2.9), using time delay as the bifurcation parameter. Firstly, we obtain the characteristic equa_
_
_
_
_

tion for the linearization of the system (2.2.9) near the equilibrium E D > 0, R > 0, C > 0, M > 0 (i.e.

all are positive and the silencing reaction controls the level of mRNA
below
its
normal
level).
Next,
we
_
_
_
_
consider a small perturbation about the equilibrium level, i.e., D = D + x, R = R + y, C = C + z , M = M + w.

55

Mathematical Description of Time Delays in Pathways Cross Talk

Substituting these into the differential equations (2.2.9) we have:

dx
= ax + g z ,
dt
dy
= anx a1 y a2 w byw,
dt
dz
= a3 y a4 z + a2 w + byw,
dt
dw
= a3 y a5 w byw

dt

(2.2.10)

where
_

a2 = b R, a3 = b M , a4 = g + d C , a5 = d M + b R

a1 = d R + b M ,

(2.2.11)

Hence, we obtain the stability matrix in the form:


a
0

an a1
0
a3

0
a3

0
a2

0
a4

a2
a5

(2.2.12)

The stability matrix (2.2.12) leads to the following characteristic equation:


4

+ K1

+ K2

+ K3 =

(T

+ T2

+ T3 + T4 )

(2.2.13)

where
K1 = a + a1 + a5 , K 2 = a (a1 + a5 ) + a1a5 a2 a3 , K 3 = a (a1a5 a2 a3 ),
T1 = a4 , T2 = K1a4 , T3 = a (a1a4 nga3 + a4 a5 ) + a4 (a1a5 a2 a3 ) ,
T4 = a a4 (a1a5 a2 a3 ) + gna3 (a2 a5 )

(2.2.14)

Generally speaking, the transcendental equation (2.2.13) cannot be solved analytically and has an
indefinite number of roots. In essence, we have two main tools besides direct numerical investigation;
firstly, the linear stability analysis, especially in the case of small time delay (i.e. t < 1 ), and secondly,
the Hopf bifurcation theorem for larger time delay.

Andronov-Hopf Bifurcation of the Time Delay Model: Linear Stability Analysis


For a small delay ( < 1), the method of linear stability analysis is much convenient to find the bifurcation point. Thus, let 1 , then the eigenvalue equation becomes:

56

Mathematical Description of Time Delays in Pathways Cross Talk

+p

+q

+ r + s = 0

(2.2.15)

By the Hopf bifurcation theorem and the Routh-Hurwitz criteria, an Andronov-Hopf bifurcation
occurs at a value = b where:
p=

a4 + (1 a4
1 a4

s=

T4
1 a4

)K1 > 0,

)=

K 2 + K1a4 K 4
1 a4 b

> 0, r =

K 3 T3 + T4
1 a4 b

> 0, l = pqr sp 2 r 2 = 0

+p

+q

> 0,

(2.2.16)

+ r + s

(2.2.17)

where the condition a4


h( ,

q=

1 is valid. Let:

b
2

Evaluating h at = b yields:

h( ,

)=

where k 2 =
1,2

+p

+q

+ k 2 p + k 2 (q k 2 )

(2.2.18)

r
. The eigenvalues of (2.2.15) at b are:
p

r
p

= ik =

(2.2.19)

sp p
and the type of the other pair roots depend on the sign of the equality 1 = . Here we note that i
r 4
is imaginary unit. If (i) 1 > 0 then:
3,4

p
2i
2

2
where 2 =

3,4

(2.2.20)

sp p 2

( 2 > 0 ); (ii) 1 < 0 then:


r
4

p
2

2

where now 2 = 1 . Differentiating implicitly h (

(2.2.21)

( ), ) yields:

d
p' 3 + q' 2 + r ' + s'
h h
=
= 3
/

d
4 + 3 p 2 + 2q + k 2 p

(2.2.22)

where:
a (K T ) + T
a4
a42 K1 + a4 K 2 + T3
a4T4
'
p =
, r ' = 4 3 3 2 4 , s' =
, q =
2
(1 a4 )
(1 a4
(1 a4 )
1 a4
2

'

)
2

(2.2.23)

57

Mathematical Description of Time Delays in Pathways Cross Talk

Evaluating the required derivatives of h at b, we obtain:

( b ) = s

'

q ' k 2 + ( p ' k 2 + r ' )ki 2 pk 2 2k (q 2 k2 ) i


2

L +I

(2.2.24)

(2.2.25)

or:

( b ) = 2k

N + 2k pk 2 (r ' p ' k 2 )+ (q 2k 2 )(s ' q ' k 2 ) i


L2 + I 2

where L = 2 pk 2 , I = 2k (q 2k 2 )i and N = p (s ' q ' k 2 )+ (q 2k 2 )(p ' k 2 r ' ). The real part of
(2.2.25) has the form:

d (
Re 1
d

) =

2k 2 N
L2 + I 2

(2.2.26)

The real part (see (2.2.26)) is always positive if N > 0, i.e., if the following conditions are valid:

s' > q'k 2 ,


q > 2k 2 , p ' k 2 > r ' or

q < 2k 2 , p ' k 2 < r '

(2.2.27)

So inequalities (2.2.27) are sufficient to have positive slope of the real part of the eigenvalue 1().
This fact (according to the Hopf bifurcation theorem (Marsden & McCracken, 1976)) guarantees the
bifurcation to a limit cycle for = b.

Hopf Bifurcation Analysis


It is well known that for a larger time delay , the linear stability analysis of the previous section is no
longer effective and we need to use another approach (Marsden & McCracken, 1976; Galach, 2003;
Cai, 2005; Kavasseri, 2005). The stability of (2.2.2) depends on the sign of the real parts of the roots
of Eq. (2.2.15). We let = m + in (m,n R) and rewrite Eq. (2.2.15) in terms of its real and imaginary
parts as:

m 4 + n 4 6m 2 n 2 + K1 (m3 3mn 2 )+ K 2 (m 2 n 2 )+ K 3 m =

= m T1 (m3 3mn 2 )cos n + (3m 2 n n3 )sin n +


+ T2 (m 2 n 2 )cos n + 2mn sin n + T3 [m cos n + n sin n ]+T4 cos n },

4mn (m 2 n 2 )+ K1 (3m 2 n n3 )+ K 2 (2mn ) + K 3 n =

= m T1 (3mn 2 m3 )sin n + (3m 2 n n3 )cos n +


+ T2 (n 2 m 2 )sin n + 2mn cos n + T3 [n cos n m sin n ]T4 sin n

58

(2.2.28)

Mathematical Description of Time Delays in Pathways Cross Talk

To find the first bifurcation point, we set m = 0. Then the above two equations reduce to:
n 4 K 2 n 2 = (T2 n 2 + T4 )cos n + (T1n3 + T3 n )sin n ,

K1n3 + K 3 n = (T1n3 + T3 n )cos n + (T2 n 2 T4 )sin n

(2.2.29)

These two equations into (2.2.29) can be solved numerically. If the first bifurcation point is (nb ,
then the other bifurcation points (nb , b ) satisfy:
0

nb

= nb0

0
b

v = 1, 2, ...

+ 2v ,

0
b

),

(2.2.30)

By squaring the two equations into (2.2.29) and then adding them, it follows that:

n8 + (K12 2 K 2 T12 )n 6 + (K 22 2 K1 K 3 + 2T1T3 T22 )n 4 + (K 32 T32 + 2T2T4 )n 2 T42 = 0

(2.2.31)

Here we note that this is a quartic equation about n2 and the left side is positive for large values of
2
n2 and negative for n = 0 because T4 is always negative, i.e., (2.2.31) has at least one positive real root.
Moreover, to apply the Hopf bifurcation theorem, according (Khan & Greenhagh, 1999), the following
theorem in this situation applies.
Theorem 1. Suppose that nb is the last positive simple root of Eq. (2.2.31). Then in ( b ) = inb is a simple
root of Eq.(2.2.13) and m ( ) + in ( ) is differentiable with respect to in a neighbourhood of = b.
To establish Andronov-Hopf bifurcation at = b, we need to show that:

dm
d

Hence, if we denote H ( ,

)=

d
H H
/
=
=
d

+ 3K1

+ 2 K 2 + K3 +

+ K1

(T
(T

1
1

+ K2

+ T2

+ T2

+ K3

+ T3 + T4 )

d
= Re

+ T3 + T4 )

Evaluating the real part of this equation at = b and setting

dm
dt

(T

+ T2

(3T

+ T3 + T4 ), then:

+ 2T2 + T3 ) (2.2.33)

= inb yield:

n 4n + 3n (K 2 K 2 T12 )+ 2nb2 (K 22 2 K1 K 3 + 2T1T3 T22 )+ K 32 T32 + 2T2T4


2
b

6
b

4
b

2
1

2
1

(2.2.32)

2
1

L +I

59

Mathematical Description of Time Delays in Pathways Cross Talk

where:

L1 = K 3 3K1nb2 +

(n

2
b

K 2 )nb2 + (3T1nb2 T3 )cos nb

2T2 nb sin nb

and

I1 = 2 K 2 nb + 4nb3
Let

g(

=4

3 b

K1nb3 )+ 2T2 nb cos nb

+ (K12 2 K 2 T12 )

Then for g

(K n

+ (3T1nb2 T3 )sin nb

= nb2, then (2.2.31) reduces to:

)=

g' (

'

dg
d

+ (K 22 2 K1 K 3 + 2T1T3 T22 )

+ (K 32 T32 + 2T2T4 ) T42 = 0

( ) we have:
=
=

+ 3 (K12 2 K 2 T12 )

+ 2 (K 22 2 K1 K 3 + 2T1T3 T22 ) + K 32 T32 + 2T2T4

If nb is the least positive simple root of the Eq. (2.2.31) then:

dg
d

= nb2

>0

(2.2.34)

Hence:
dm
d

d
= Re

=
=

nb2 g ' (nb2 )


L12 + I12

>0

(2.2.35)

According to the Hopf bifurcation theorem (Marsden & McCracken, 1976) we define the following
main result in this section.
Theorem 2. If nb is the least positive root of Eq.(2.2.31), then an Andronov-Hopf bifurcation occurs
as passes through b.
Here we note that a similar phenomenon appeared in the epidemic model with a time delay in vaccination for nonsexually transmitted diseases (Marsden & McCracken, 1976).

N umerical A nalysis of the T ime D elay Model


In this section, we analyze numerically the time delay model constituted by Eq. (2.2.9) for the concentrations of the double-stranded RNA (dsRNA)-D(t), the RNA-induced silencing complex (RISC)-R(t),
RISC-mRNA complex-C(t), and mRNA at time t, respectively. The corresponding numerical values
of the model parameters are those in (2.2.3). However, we do not know the exact time point at which
the RISC-mRNA complex begins to regenerate (or degrade) and assume that it takes a little bit longer;

60

Mathematical Description of Time Delays in Pathways Cross Talk

hence we set = 10 or 13 hours. The governing equations of the model, represented by Eq.(2.2.9), were
solved numerically using MATLAB (Mathworks, 2007). The following Figures 13 and 14 demonstrate
the dependence of the model behavior on the parameter . We fix the model parameters (see (2.2.3)) and
vary the time delay , which from the point of view of qualitative theory plays the role of a bifurcation
parameter. In Figure 13, the stable solutions for dsRNA, RISC, RISC-mRNA complex and mRNA are
shown for = 10. It is evident that after several physiologically acceptable fluctuations, the concentrations of the dsRNA, RISC, RISC-mRNA complex and the mRNA approach constant values (equilibrium
states). In other words, in this case the system (2.2.9) lies in a stable zone of its parametric space.
In Figure 14 the case when = 13 is shown. It is seen that for the same values of the rate constants,
after the Andronov-Hopf bifurcation, the stable limit cycle with period one occurs and the system (2.2.9)
has periodic solutions. Thus, it means that time delay amplifies the instability of the steady state. In
other words, as a result of the evidence obtained in Figures 13 and 14, we may conclude that the time
delay has destabilization role in the RNA silencing process.

Figure 13. Stable solutions of the system (2.2.9) at = 10. The time is in hours

61

Mathematical Description of Time Delays in Pathways Cross Talk

Figure 14. Unstable periodic solutions of the system (9) at = 13 after introduction of large (high) dose
of dsRNA

Figure 15 depicts the case when the initial dose of dsRNA is small. Many experimental studies as
(Parrish, Fleenor, Xu, Mello, & Fire, 2000; Lipardi, Wei, & Paterson, 2001; Giordano, Rendina, Peluso,
& Furia, 2002) suggest that larger initial doses of dsRNA engender larger silencing reactions or induce
silencing more effectively. For example, in Ref. 81, the authors suggest that effective gene silencing
by RNAi depends on a number of important parameters, including the dynamics of gene expression
and the RNA dose. Their experiments demonstrate that different levels of silencing can be attained
by modulating the dose level of RNA and the time of transfection and illustrate the importance of a
dynamic analysis in designing robust silencing protocols. Thus, we expect dosage dependence to be
useful in avoiding self-directed reactions. Comparing Figure 14 and Figure 15, we conclude that the
time for formation of stable limit cycle in case with a small dose is larger than this with a high dose.
But, on the other hand the amplitude of oscillations of dsRNA (for small initial dsRNA concentration)
is longer. Note that frequency is also higher. Because greater quantities of dsRNA typically are a more
reliable indicator of the presence of non-self, dosage reduces the impact of mistaken reactions to selfderived genetic material.

62

Mathematical Description of Time Delays in Pathways Cross Talk

Figure 15. Unstable periodic solutions of the system (9) at = 13 after introduction of a small dose of
dsRNA. The time is in hours

Time Delay in ERK and STAT Interaction


From the results described in the work (Pircher, Petersen, Gustafsson, & Haldosen, 1999), a model
for interaction between ERK and STAT5a in CHOA cells can be derived. As it is proved in (Pircher,
Petersen, Gustafsson, & Haldosen, 1999), in unstimulated cells STAT5a is complexed with inactive
ERK that binds to STAT5a via its C-terminal substrate recognition domain to an unknown region on
STAT5a. Then via its active site it binds to the C-terminal ERK recognition sequence in STAT5a. On
the other hand, upon GH stimulation, MEK activates ERK through phosphorilation of specific threonine and tyrosine residues in ER K. The active ERK phosphorilates serine 780 in STAT5a, resulting
in decreased affinity between the two proteins and dissociation of the complex. From the biochemical diagram in (Pircher, Petersen, Gustafsson, & Haldosen, 1999), we can write the following system
of ordinary differential equations for the kinetics of STAT5a/S phosphorylation and ERK activation,

63

Mathematical Description of Time Delays in Pathways Cross Talk

described by concentration variables e1,e2,s1,s2 denoting concentrations of ERK-inactive, ERK-active,


STAT- and STAT-phosphorylated respectively. It has the form:
de 1
= k0 e1s1 + k2 e2 I ,
dt
de 2
= k0 e1 (t ) s1 (t ) k2 e2 + I ,
dt
ds 1
= k1e1 s1 + k3 s2 + A,
dt
ds 2
= k1e1 (t ) s1 (t ) k3 s2 A
dt

(2.3.1)

where k1 is proportional to the frequency of collisions of ERK and STAT protein molecules and present
rate constant of reactions of associations; k2 and k3 are constants of exponential growths and disintegrations; I > 0 and A > 0 are inhibitor and activator sources respectively, is time delay of ERK and STAT
interaction (i.e. ERK activation and STAT phosphorylation). The source I inhibits the inactivation of
active ERK, and A activates the dephosphorylation of phosphorylated STAT5a. The terms I and A can
be also considered as some effective (apparent) inhibitor and activator, under condition that they present
really some in-flux and out-flux of the active ERK and phosphorylated STAT5a respectively. Generally
said, by introducing I and A we take into account the natural circumstance that ERK and STAT interaction is not isolated but open process in the intracellular space. We can consider I and A as algebraic
values having different signs (plus or minus) and determining one or another type of behavior of the
system (2.3.1). In this way (2.3.1) can be used as dynamical model of ERK and STAT interaction. A
more concrete interpretation of the inhibitor I and activator A can be given in connection with the role
of the SOCS proteins in linking JAK/STAT and MEK/ERK pathways. Biological responses elicited by
the JAK/STAT pathway are modulated by inhibition of JAK (and respective attenuation of STAT) by a
member of the Suppressors of Cytokine Signalling (SOCS) proteins. On the other hand over-expression
of some SOCS family members induces activation of ERK, which can phosphorylate STATs.
Further we apply the linear time delay approximation explained in one from the previous sections.
As a result the relations:

e1 + e2 = e0 k0 e1s1

(2.3.2)

s1 + s2 = s0 k1 e1s1 (here it follows s1 s1 > 0)

(2.3.3)

can be obtained. By replacing them in the second and fourth equation of (2.3.1-2), the last is reduced to
the following two-dimensional system of ordinary differential equations (without time delays):

de1
= k0 (1 + k2 )e1s1 k2 e1 + k2 e0 I
dt

(2.3.4)

ds1
= k1 (1 + k3 )e1s1 k3 s1 + k3 s0 + A
dt

(2.3.5)

64

Mathematical Description of Time Delays in Pathways Cross Talk

The system (2.3.4-5) (together with the formulas (2.3.2-3)) is considered to be equivalent to the time
delay system (2.3.1) when is sufficiently small. Certainly, the behavior of the variables e1 and s1 (inactive ERK and non-phosphorylated STAT respectively) is leading one with respect to that of e2 (active
ERK) and s2 (phosphorylated STAT). Thus we can qualify the system (2.3.4-5) as a driver of ERK and
STAT cross talk and the variables e1 and s1 can be called driving variables. This is the first conclusion
from the qualitative analysis of the time delay system (2.3.1) in linear approximation.
The steady state value of e1 is
e10 = e0

k (1 + k2 )
I k0 k3 (1 + k2 )
A

( s0 s10 ) 0
k2 k1k2 (1 + k3 )
k1k2 (1 + k3 )

(2.3.6)

where s1 is determined as a positive root of quadratic equation we obtain after substituting (2.3.6) in
the steady state equations of (2.3.4-5), i.e. the right hand sides of (2.3.4-5) equated to zero. To study the
nature (stable or not) of the steady state we introduce in (2.3.4-5) the substitutions:
e1 = e10 + x , s1 = s10 + y

(2.3.7)

where x and y are variations (small disturbances, perturbations). After corresponding linearization the
variation system for x and y takes the form:
dx
= k2 + k0 s10 (1 + k2 ) x k0 (1 + k2 )e10 y k0 (1 + k2 ) xy,
dt
dy
= k1 (1 + k3 ) s10 x k3 + k1e10 (1 + k3 ) y k1 (1 + k3 ) xy

dt

(2.3.8)

The Routh-Hurwitz conditions for stability of the steady state (2.3.6-7) are:
k2 + k3 + k0 s10 (1 + k2 ) + k1e10 (1 + k3 ) > 0

(2.3.9)

k2 k3 + k1k2 e10 (1 + k3 ) + k0 k3 s10 (1 + k2 ) > 0

(2.3.10)

It is evident from (2.3.9-10) that in the presence of time delay the inequalities are stronger than in
the absence of . That means the time delay amplifies the stability of the steady state of the driver system
(2.3.4-5) as well as of the whole time delay model (2.3.1) of ERK and STAT cross talk, at the condition
of sufficiently small , i.e. in linear approximation.

Co nc lusi on
In this chapter we present some results obtained from us in (Nikolov, Kotev, Georgiev, & Petrov, 2006;
Nikolov, Kotev, & Petrov, 2006a; Nikolov, Kotev, & Petrov, 2006b; Nikolov, Vera, Wolkenhauer, Yankulova, & Petrov, 2007; Nikolov & Petrov, 2007; Nikolov, Vera, Kotev, Wolkenhauer, & Petrov, 2008). The
main conclusions from section 2.1, where we investigate whether the inclusion of time delays alters the

65

Mathematical Description of Time Delays in Pathways Cross Talk

dynamical properties of the Jacob-Monod model (which describes the control of the b-galactosidase
synthesis by the lac repressor protein in E.coli), are:

The basic view that the time delay is a key factor in the dynamical behaviour of the system has
been confirmed by the analytical calculations and numerical simulations. From the qualitative
theory of DDE and ODE viewpoint, time delay appears as a bifurcation parameter on whose values
depend the altered (stable or unstable) behaviour of the model. When no delay is considered in
the synthesis of b-galactosidase, only a soft loss of stability take place, and changes of time delay
through the critic value b has reversible behaviour. This means that at the transition of R through
the boundary R = 0 from positive values to negative ones a stable limit cycle emerges, i.e., selfoscillations of the system appear. Inversely, at the transition of R from negative values to positive
ones the stable limit cycle disappears, i.e., the self-oscillations cease. However, when a time delay
is considered the properties of the system changes drastically, and hard loss of stability emerges.
For time delays longer than b, the lactose operon would present sustained oscillations with coupled periodic variations on the concentration of both proteins and the mRNA. In contrast, a time
delay smaller than b will provoke only transient oscillations of the species integrating the operon
around a stable steady-state. We can say that in this situation time delay has a destabilizing role.
When simple (non-cooperative) inhibition of mRNA production by the lac repressor protein is
considered (p1=1.0), oscillations in values of concentrations for proteins and mRNA appear but
the amplitude of such oscillations is much reduced (around 1% of the average concentration in
the cases shown in Figures 7 and 9). If the system is not able to distinguish these fine-tuning oscillations, the lac operon would act as if an effective quasi steady-state exists. From a biological
perspective, it could be reasonable to think that the system presents at least local robustness with
respect the concentration of lactose (which is a primary carbon source of the system), and then this
reduced amplitude oscillations would not provoke a differentiated response of the system during
the different phases of the oscillation. In contrast, when cooperativity in the repression of mRNA
synthesis is considered (p1=2.0), the oscillations also occurs but the magnitude of this oscillations
in the concentration of proteins and mRNA is actually significant. In some cases (Figures 8 and
10) the cooperativity provokes oscillations with changes all-nothing for the concentration of the
proteins during the period of the oscillations. Therefore, the oscillations induced for time delays
higher than the critic value b in a system with cooperavity could provoke clearly differentiated
responses of the system during the period of the oscillations. The conclusion of our analysis is
that in the case of the lac operon analysed, not only time delay in galactosidase synthesis but also
cooperativity in the end product repression is necessary to induce a regime of effective sustained
oscillations.
From a physiological viewpoint, the hard (irreversible) loss of stability might be related to the
emergence of new configurations in the regulatory gene circuit that could lead the system into a
pathologic state. Similar behaviour has been already detected and discussed in other biological
systems with different nature such as cardiac pulsations and ocular system (Petrov & Nikolov,
1998; Petrov & Nikolov, 1999), but this conclusion that we point here for hard stability loss is suggested here by the first time for protein synthesis systems.

The main conclusions from investigations in section 2.2 of time delay model of RNA silencing are
as follows.

66

Mathematical Description of Time Delays in Pathways Cross Talk

Besides the stability, the original model (2.2.1) captures the following features of RNA-silencing
dynamics from empirical studies (As it is argued in paper (Bergstrom, McKittrick, Antia, 2003)):


The model detects and degradates the initial dose of dsRNA by essential initial drop in dsRNA
concentration.
It demonstrates a rapid generation of sequence-specific siRNA by rapid rise of RISC in the early
stages of reaction.
It shows an amplification of the response, producing secondary dsRNA and siRNA molecules
by the increase in dsRNA concentration.

In other words, if the system possesses a stable equilibrium state, then this corresponds to a normal
silencing process. On the other hand, the existence of unstable equilibrium states, stable limit cycles
(self-oscillations) or chaotic attractors in this case corresponds to a pathology, i.e. an abnormal RNA
silencing process. To our knowledge, in the literature up until now, time delay models describing gene
silencing have not been suggested or investigated. Therefore, this investigation of ours is a new outlook
on the problems and the factors influencing silencing. From the accomplished analytical and numerical calculations it becomes clear that time delay is a key factor in the behavior of model (2.2.9). In
this case it has a destabilizing role on the silencing process. In terms of dynamical systems plays
the role of a bifurcation parameter. If (i.e. the time necessary for the regeneration (or degradation)
of the RISC-mRNA complex)) is greater than a certain (bifurcation) value in model (2.2.9), through
Andronrov-Hopf bifurcation appears a self-oscillation related with an abnormal silencing mechanism.
A possible cause is that the longer time brings about a production of a longer dsRNA, different from
the one necessary for silencing. The appearance of periodical solutions in the time delay model (2.2.9)
makes its behavior much richer compared to the original model proposed by Bergstrom and co-workers
in (Bergstrom, McKittrick, Antia, 2003). This allows for investigating more properties and regimes of
the silencing mechanism.
From the simulations made in Figure 15 it is seen that at smaller initial doses of dsRNA the time
for establishing of the stable limit cycle is longer, i.e. we have a longer transition period. On the other
hand, however, the appearing self-oscillations are with a greater amplitude and frequency than those at
a bigger initial dose of dsRNA. In particular, only the amplitude of D (dsRNA) is altered. It is held in
the literature (Bergstrom, McKittrick, & Antia, 2003; Raab & Stephanopoulos, 2004; Groenenboom,
Maree, Hogeweg, 2005; Bartlett & Davis, 2006), that bigger initial doses of dsRNA cause stronger silencing reactions or make silencing more effective. Moreover, it may be expected that the dependence
on dosing is useful for avoiding self-directed reactions.
In conclusion it may be noted that in practice time delay for generation can be different from this for
degeneration. This in its turn makes the analytical investigation harder. For that reason it is assumed
here that the two times are equal. From the numerical simulations (which are not shown) it became clear
that the two times are very similar. For some of our conclusions an empirical verification is needed.
The last section 2.3 of this chapter presents original elaboration of the ERK and STAT interaction
model in the form of time delay of ERK activation and STAT phosphorylation has essential dynamical
role. The basic view is that time delay in the RNA silencing is a key factor in the dynamical behaviour
of model (2.3.1) confirmed by appropriate analytical calculations. From the qualitative theory of DDE
viewpoint, time delay in this case has stabilization role.

67

Mathematical Description of Time Delays in Pathways Cross Talk

R eferences
Agnati, L., Tarakanov, A., & Guidolin, D. (2005). A simple mathematical model of cooperativeness in
receptor mosaics based on the symmetry rule. Biosystems, 80(2), 165-177.
Andronov, A., Witt, A., & Chaikin, S. (1966). Theory of oscillations. Reading, MA: Addison-Wesley.
Arciero, J., Jackson, T., & Kirschner, D. (2004). A mathematical model of tumor-immune evasion and
siRNA treatment. Discrete and Continuous Dynamical Systems, 4, 39-58.
Arziman, Z., Horn, T., & Bourtos, M. (2004). E-RNAi: A Web application to design optimized RNAi
constructs. Nucleic Acids Res., 33, W582-W588.
Bartlett, D., & Davis, M. (2006). Insights into the kinetics of siRNA-mediated gene silencing from livecell and live-animal bioluminescent imaging. Nucleic Acids Research, 34, 322-333.
Bautin, N. (1984). Behavior of dynamical systems near the boundary of stability. Moscow: Nauka
Belair, J., & Dufour, S. (1996). Stability in a three-dimensional system of delay-differential equations.
Canadian Applied Mathematics Quarterly, 4, 135-156.
Berezhna, S, Supekova, L, Supek, F, Schultz, P, & Deniz, A. (2006). siRNA in human selectively localizes to target RNA sites. Proc. Natl. Acad. Sci. USA, 103(20), 7682-7687.
Bergstrom, C., McKittrick, E., & Antia, R. (2003). Mathematical models of RNA silencing: Unidirectional
amplification limits accidental self-directed reactions. Proc. Natl. Acad. Sci. USA, 100, 11511-11516.
Bliss, R., Painter, P., & Marr, A. (1982). Role of feedback inhibition in stabilizing the classical operon.
J. Theor. Biol., 97, 177-193.
Boese, Q, Scaringe, S., & Marshall, W. (2003). siRNA as a tool for streamlining functional genomic
studies. Targets, 2(3), 93-99.
Bratsun, D., Volfson, D., Tsimring, L., & Hasty, J. (2005). Delay-induced stochastic oscillations in gene
regulation. Proc. Natl. Acad. Sci. USA, 102, 14593-14598.
Cai, H. (2005). Hopf bifurcation in the IS-LM business cycle model with time delay. Electronic Journal
of Differential Equations, 15, 1-6.
Chen, L., Wang, R., Kobayashi, T., & Aihara, R. (2004). Dynamics of gene regulatory networks with
cell division cycle. Phys Rev E, 70, 011909.
de Jong, H. (2002). Modelling and simulation of genetic regulatory systems: A literature review. J.
Comput Biol., 9(1), 69-105.
Driver, R. (1977). Ordinary and delay differential equations. New York: Springer-Verlag.
Edissonov, I., & Nikolov, S. (2001). Mathematical modelling and phase analysis of HIV infection. Systems Analysis Modelling Simulation, 40, 87-98.
Elledge, S.J. (1996). Mathematical models of protein kinase signal transduction. Science, 274, 16641672.

68

Mathematical Description of Time Delays in Pathways Cross Talk

Elsgoltz, L. E. (1957). Differential equations. Moscow: Gosizdat (in Russian)


Elsgolz, L., & Norkin, S. (1974). Introduction in time delay equations. Moscow: Nauka (in Russian).
Fall, C., Marland, E., Wagner, J., & Tyson, J. (2002). Computational cell biology. New York: Springer.
Freeman, M. (2000). Feedback control of intercellular signalling in development. Nature, 408, 313319.
Galach, M. (2003).Dynamics of the tumor-immune system competition-the effectof time delay. Int J
Appl Math Comput Sci, 13(3), 395-406.
Gafney, E., & Monk, N. (2006). Gene expression time delay and Turing pattern formation systems.
Bulletin of Math Biology, 68, 99-130.
Glass, L., & Mackey, M. (1988). From clocks to chaos. The rhythms of life. Princeton University
Press.
Gierer, A., & Meinhardt, H. (1972). A theory of biological pattern formation. Kybernetik, 12, 30-39.
Giordano, E., Rendina, R., Peluso, I., & Furia, M. (2002). RNAi triggered by symmetrically transcribed
transgenes in Dros. melanogaster. Genetics, 160, 637-648.
Gopalsamy, K., & Leung, I. (1997). Convergence under dynamical thresholds with delays. IEEE Transactions on Neural Networks, 8, 341-348.
Groenenboom, M., Maree, A., & Hogeweg, P. (2005). The RNA silencing pathway: The bits and pieces
that matter. PLoS Comput. Biol., 1, 155-165.
Hannon, G. (2002). RNA interference. Nature, 418, 244-251.
Heinrich, R., & Schuster, S. (1996). The regulation of cellular systems. New York: Chapman and
Hall.
Heinrich, R, Neel, B, & Rapoport, T. (2002). Mathematical models of protein kinase signal transduction.
Molecullar Cell, 9, 957-970.
Hobert, O. (2004). Common logic of transcription factor and microRNA action. Trends in Bioch Sci,
29(9), 462-468.
Hoffman, A., Levchenko, A., Scott, M., & Baltimore, D. (2002). The IkB-NFkB signalling module:
Temporal control and selective gene activation. Science, 298, 1241-1245.
Jacob, F., & Monod, F. (1961). Genetic regulatory mechanisms in the synthesis of proteins. J. Molecular
Biology, 3, 318-356.
Kavasseri, R. (2005). Delay induced oscillations in a fundamental power model Nonlinear Phenomena
and Complex Systems, 8(1), 62-67.
King, R.W., Deshaies, R.J., Peters, J., & Kirschner, M. (1996). How proteolysis drives the cell cycle.
Science, 274, 1652-1659.

69

Mathematical Description of Time Delays in Pathways Cross Talk

Khan, Q., & Greenhagh, D. (1999). Hopf bifurcation in epidemic models with a time delay in vaccination. IMA Journal of Mathematics Applied in Medicine and Biology, 16, 113-142.
Kolnanovskii, V., & Nosov, V. (1986). Stability of functional differential equations. Mathmatics in Science and Engineering, 180. London: Academic Press, INC.
Lema, M., Golombek, D., & Echave, J. (2000). Delay model of the circadian pacemaker. J. Theor. Biol.,
204, 565-573.
Lipardi, C., Wei, Q., & Paterson, B. (2001). RNAi as random degradative PCR: siRNA primers convert
mRNA into dsRNA that are degraded to generate new siRNAs. Cell, 107, 297-307.
Marsden, J., & McCracken, M. (1976). The Hopf bifurcation and its applications. New York: SpringerVerlag.
Matlab, (2007). The MathWorks, Inc. Natick, MA, USA www.mathworks.com.
Meinhardt, H. (1994). Biological pattern formation: New observations provide support for theoretical
predictions. Bioessays, 16, 627-632.
Miguez, D., Izus, G., & Minuzuri, A. (2006). Robustness and stability of flow and diffusion structures.
Phys. Rev. E, 73, 016207-13.
Munroe, S., & Zhu J. (2006). Overlapping transcripts, double stranded RNA and antisense regulation:
A genomic perspective. Cellular and Molecular Life Science, 63, 2102-2118.
Murray, A.W., & Kirschner, M.W. (1989). Dominoes and clocks: The union of two views of the cell
cycle. Science, 246, 614-621.
Murray, J. (2002). Mathematical biology, I. An introduction. Third Edition, New York: Springer-Verlag.
Nikolov, S., & Petrov, V. (2004). New results about route to chaos in Rossler system. Int. J. of Bifurcation and Chaos, 14(1), 293-308.
Nikolov, S. (2004). First Lyapunov value and bifurcation behaviour of specific class three-dimensional
systems. Int. J. of Bifurcation and Chaos, 14(8), 2811-2823.
Nikolov, S. (2005). An alternative bifurcation analysis of the Rose-Hindmarsh model. Chaos, Solitons
& Fractals, 23(5), 1643-1649.
Nikolov, S., Kotev, V., Georgiev, G., & Petrov, V. (2006). The dynamical roles of time delays in protein
cross talk models, Comptes rendus de lAcademie bulgare des Sciences, 59(3), 261-268.
Nikolov, S., Kotev, V., & Petrov V. (2006a, October 24-25). Influence of time delay on bifurcation behavior in the protein synthesis model: BioPS06, Sofia, III.37-III.46.
Nikolov, S., Kotev, V., & Petrov, V. (2006b, October 24-25). Bifurcation behavior of a time delay model
of enzyme and repressor cross talk: BioPS06, Sofia, III.47-III.56.
Nikolov, S., Yankulova, E., Nikolova, A., & Petrov, V. (2006). Stability and structural stability (robustness) in computational systems biology. Journal of the Bulgarian Academy of Sciences, 69(6), 21-29.

70

Mathematical Description of Time Delays in Pathways Cross Talk

Nikolov, S., Yankulova, E., Wolkenhauer, O., & Petrov, V. (2007). Principal difference between stability and structural stability (robustness) as used in Systems Biology. Nonlinear Dynamics, Psychology,
and Life Sciences, 11(4), 413-433.
Nikolov, S., Vera, J., Wolkenhauer, O., Yankulova, E., & Petrov, V. (2007). Chaos in a delayed protein
cross talk model with periodic forcing. Comptes rendus de lAcademie bulgare des Sciences, 60(2),
127-132.
Nikolov, S., & Petrov, V. (2007). Time delay model of RNA silencing. Journal of Mechanics in Medicine
and Biology, 7(3), 297-314.
Nikolov, S., Vera, J., Kotev, V., Wolkenhauer, O., & Petrov, V. (2008). Dynamic properties of a delayed
protein cross talk model. BioSystems, 91, 51-68.

Key Terms
Andronov-Hopf Bifurcation: From a mathematical point of view, the onset of sustained oscillations generally corresponds to the passage through an Andronov-Hopf bifurcation point. Below a critical value of a control parameter (the bifurcation value), the system displays damped oscillations and eventually reaches the steady state, a stable focus. Beyond the bifurcation point, a stable solution arises in the form of a small-amplitude limit cycle surrounding the now unstable steady state [Goldbeter, A., Nature, 420:238-245, 2002]. This bifurcation is very typical for biological systems.
Chaotic Dynamics: Chaotic motions are based on homoclinic (heteroclinic) structures whose instability is accompanied by local divergence and global contraction. The transition from stability to instability requires the vanishing of stable equilibrium states and of stable periodic motions, or a sufficiently large increase in the periodic ones.
DDEs: Delay differential equations. These equations contain derivatives that depend on the solution at previous times. They are infinite-dimensional systems and find application in control systems, biology, chemical kinetics, and other areas.
ERK: Extracellular Signal-Regulated Kinase. A cell signalling protein that activates STAT through serine phosphorylation.
NFB: Negative feedback. In this case a signal induces the expression of its own inhibitor.
PFB: Positive feedback (or autocatalysis). In this case additional molecules amplify the initial signal and lead to the stabilization of the amplitude or an increase in the signal's duration.
RNA Silencing: Also known as RNA interference. Similar to the immune system, it guards against exploitive parasitic elements by (i) identifying non-self elements; (ii) generating target-specific responses against these foreign elements; and (iii) rapidly amplifying these responses to clear or otherwise inactivate the threat.
STAT: Signal Transducer and Activator of Transcription is a family of latent cytoplasmic proteins that
are activated to participate in gene control when cells encounter various extracellular polypeptides.
Time Delay: Past memory (history).


Chapter IV

Deterministic Modeling in Medicine
Elisabeth Maschke-Dutz
Max Planck Institute for Molecular Genetics, Germany

Abstract
In this chapter basic mathematical methods for the deterministic kinetic modeling of biochemical systems
are described. Mathematical analysis methods, the respective algorithms, and appropriate tools and
resources, as well as established standards for data exchange, model representations and definitions
are presented. The methods comprise time-course simulations, steady state search, parameter scanning,
and metabolic control analysis among others. An application is demonstrated using a test case model
that describes parts of the extrinsic apoptosis pathway and a small example network demonstrates an
implementation of metabolic control analysis.

INTRODUCTION
We can observe the molecular background of complex disease processes with systems biology. Today,
drug development plays a central role in the interdisciplinary area of systems biology. Scientists from
various disciplines, such as biology, bioinformatics, medical research, chemistry, physics, computer
science and mathematics, work together to analyze complex disease processes by studying molecular
interaction networks. The aim is to understand and analyze the complex behavior of human diseases.
Through this collaborative research we can obtain specific methods and results that can lead to new
predictions and assumptions about the observed diseases and processes.
Metabolic pathways like the citric acid cycle and glycolysis, for example, are important processes that can be used to analyze the complex mechanisms in living biological systems. The modeling of these pathways provides appropriate structures for observing the behavior of metabolic diseases such as diabetes. Gene regulatory networks describe protein-DNA interactions or, indirectly, interactions between DNA and DNA. Here, we can also examine the regulation of the activity of single genes and
the behavior of single molecules, as well as observe the interaction and regulation between different
genes and proteins in complex structures. The normal function of single genes is affected in several
diseases. A lot of information is available about the functionality of single genes in different diseases
and also the roles of mutations in these genes are comprehensively described (Weinberg 1994). Many
project studies about this topic are currently underway, and we expect more interesting results in the
future (Futreal et al. 2004). Multiprotein complexes are the result of numerous protein-protein interactions. These interactions are essential for physiological processes. In biological networks, diffusion and
the molecular transport across cell membranes are also important physiological processes. In different
compartments of the cell the function of a protein can change, because the functionality of a protein
depends also on the existing targets in the respective compartments. For example, the p53 protein acts in the nucleus as a transcription factor that can trigger apoptosis. The protein Mdm2 in turn can bind to p53 and initiate its ubiquitination and subsequent degradation in the cytoplasm, where p53 cannot act as a transcription factor. Signaling
pathways play an important role in many diseases. For example the cell signaling mechanism regulates
cell proliferation and cell differentiation. The main drivers of cancer progression are interferences in signaling pathways (Cui et al. 2007), and these interferences in turn originate in mutated proteins.
Using mathematical models we can describe complex cellular processes. This chapter gives an
overview of the possibilities available to analyze these molecular, cellular and physiological processes
with mathematical modeling, and outlines how we can integrate experimental data. We demonstrate the
utilization of the mathematical model analysis in two examples. Time-course simulation and parameter
scanning are applied to a model for the extrinsic apoptosis pathways and metabolic control analysis is
applied to a small sample network.

MATHEMATICAL MODELING


We can describe biological systems with Boolean models, stochastic models and deterministic models.
Boolean networks are based on the Boolean logic (Kaufmann 1993) and can be used to describe gene
regulatory interactions. There are two Boolean states: expressed and not expressed. The expression
level of a gene is represented in one of the two binary states, 0 and 1. In 1977 Gillespie introduced
the exact stochastic simulation algorithms called direct method and first reaction method (Gillespie
1977). The stochastic modeling process deals with discrete variables that are numbers of molecules.
Since then, stochastic modeling methods have been further improved and extended (Gibson & Bruck 2000). This chapter focuses on the deterministic, kinetic modeling of biochemical reaction networks. This kind of modeling uses continuous variables and allows the structure and behavior of the modeled system to be represented in a very detailed and arbitrarily complex way. Knowledge about the molecular background of complex diseases, about the effects of drugs, and predictions resulting from the interpretation of experimental data all inform the constitution of the mathematical model. We can include this information in the model structure as analogous mathematical terms and equations. The deterministic approach introduced here is based on a system of ordinary differential equations (ODE system).


Many useful analysis methods exist. The presented methods analyze model behavior in different
states and under different conditions.

Topology and Stoichiometry


In this paragraph we describe the techniques used to build reactions between the specific components
of the model. A simple stoichiometric equation has the form:

n_1 A + n_2 B \rightarrow n_3 C \qquad (1)

In this equation A and B are called the reactants and C is called the product of the reaction. The n_i with i=1,2,3 are the stoichiometric coefficients. A reaction can either be reversible (A \rightleftharpoons B) or irreversible (A \rightarrow B). In the reversible form, equation (1) is denoted by:

n_1 A + n_2 B \rightleftharpoons n_3 C \qquad (2)

All stoichiometric coefficients in the mathematical model can be represented in a so-called stoichiometric matrix. We can represent the reactions v_j, j=1,..,m, and the concentration functions S_i, i=1,..,n, of the participating substances in vector form. The stoichiometric matrix, whose rows correspond to the substances S_1,...,S_n and whose columns correspond to the reactions v_1,...,v_m, has the form:

N = \begin{pmatrix} n_{11} & \cdots & n_{1m} \\ \vdots & & \vdots \\ n_{n1} & \cdots & n_{nm} \end{pmatrix} \qquad (3)

Note: there is no unique mapping from a stoichiometric matrix to a reaction system. More than one system can be consistent with a single stoichiometry.
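As a small illustration (not part of the original text; the species names, the two reactions and the numpy representation are assumptions chosen only for this sketch), such a matrix can be written down directly in Python:

import numpy as np

# Rows correspond to the species S = (A, B, AB),
# columns to the reactions v = (v1: A + B -> AB, v2: AB -> degradation).
# Entry n_ij is the stoichiometric coefficient of species i in reaction j
# (negative for consumption, positive for production).
N = np.array([
    [-1,  0],   # A
    [-1,  0],   # B
    [ 1, -1],   # AB
])

print(N.shape)   # (number of species, number of reactions) = (3, 2)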

Balance Equation
The rate equation of a reaction, also called the kinetic law, is the mathematical formula for the velocity of the reaction. Depending on a local point p, the time t, and the functions S_i, i=1,..,n, that describe the concentrations of the substances, the rate equation v has the form:

v(p, t) = v(S(p, t), t) \qquad (4)

S is the vector of the functions S_i, i=1,..,n, and v is the vector of the reaction rates. The velocity v_j, j=1,..,m, is also called the flux of the reaction. In the following paragraphs, we introduce different well-known mathematical formulations of kinetic laws.
The change in concentration of a single substance can be described by the balance equation. For the
function Si i=1,..,n the balance equation reads:
\frac{\partial S_i}{\partial t} = \sum_{j=1}^{m} n_{ij} v_j \qquad (5)

where vj j=1,..,m are the reaction rates and nij i=1,..,n and j=1,..,m are the stoichiometric coefficients.
The corresponding matrix notation is:

\frac{\partial S}{\partial t} = N v \qquad (6)

If the rates do not depend explicitly on the spatial position or on time, but only on the concentrations, we call the system autonomous. In this case equation (4) reduces to:

v(t) = v(S(t)) \qquad (7)

Equation (5) can then be formulated as:

\frac{dS(t)}{dt} = N v(S(t)) = f(S(t)) \qquad (8)

Here f(S(t)) is a vector of functions f_i, i=1,..,n, that depend on the time-dependent concentrations S_i, i=1,..,n. In this way, a system of ordinary differential equations that represents the deterministic mathematical model of the corresponding network is built. It is called an ODE system.
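To make the construction of equation (8) concrete, a minimal sketch (assuming mass-action kinetics and arbitrary rate constants; it reuses the small example matrix N introduced above) evaluates dS/dt = N v(S) with numpy:

import numpy as np

N = np.array([[-1, 0],
              [-1, 0],
              [ 1, -1]])          # stoichiometric matrix from the example above

k1, k2 = 0.5, 0.1                 # arbitrary kinetic constants

def rates(S):
    """Rate vector v(S) for v1 = k1*[A]*[B] and v2 = k2*[AB] (mass action)."""
    A, B, AB = S
    return np.array([k1 * A * B, k2 * AB])

def f(S):
    """Right-hand side of the ODE system, dS/dt = N v(S), as in equation (8)."""
    return N @ rates(S)

print(f(np.array([1.0, 2.0, 0.0])))   # time derivative at an initial state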

Reaction Kinetics
The rate equation of a reaction is described in detail with mathematical terms using kinetic laws. The
calculation rule of the reaction velocities depends on real-value kinetic parameters and the concentration
of several components. These components can be the reactants themselves and also the products of the
reactions. The concentration of other components in the model can also be part of the calculation rule.
These components are called the modifiers of the reactions; their concentration can also control the fluxes, but there is no mass flow of these components in the reaction. An example of a modifier in a reaction is an enzyme that catalyzes the reaction but is neither a reactant nor a product of it. In a general formulation, the kinetic law can be written as:
v = v(k,R,P,M)

(9)

In this equation k = (k_1,...,k_l) are the real-valued kinetic parameters, and R = (R_1,...,R_r) and P = (P_1,...,P_s) are the concentrations of the reactants and products, respectively. The complexity and variety of reaction inhibitions and activations enter the calculation rule through the modifiers M = (M_1,...,M_t), so the calculation rule of the kinetic reaction can be a very sizable formula.
Well-known kinetic laws are, for example, the Law of Mass Action and the Michaelis-Menten kinetics.
In the 19th century Cato Maximilian Guldberg and Peter Waage introduced the Law of Mass Action
(Guldberg and Waage 1879). This law defines a rate equation for elementary reactions and also presents
an equilibrium constant that is derived by using kinetic data and the proposed rate equation.


For the simple reaction:


A + B \rightleftharpoons AB \qquad (10)

According to the Law of Mass Action, the two kinetic constants k_1 and k_{-1} are established and equation (10) can be written as:

A + B \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} AB \qquad (11)

Now the reaction rate is defined by:


v = v_1 - v_{-1} = k_1 [A][B] - k_{-1} [AB] \qquad (12)

The equilibrium constant of this reaction is defined by:

k_{eq} = \frac{k_1}{k_{-1}} = \frac{[AB]}{[A][B]} \qquad (13)

Note: the directionality of equation (11) defines the association equilibrium constant in equation
(13). We can define the dissociation equilibrium constant from the reciprocal value of the association
equilibrium constant.
The changes in the concentrations under the reaction rate defined in (12) are characterized by the following ODEs:

\frac{d[A]}{dt} = \frac{d[B]}{dt} = -v_1 + v_{-1} = -v \quad \text{and} \quad \frac{d[AB]}{dt} = v_1 - v_{-1} = v \qquad (14)

Using the information given by the Mass Action Law, we can describe the decay of a reactant:
A \rightarrow \varnothing \qquad (15)

with the reaction rate:

v = k [A] \qquad (16)
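As a hedged numerical illustration of equations (12)-(14) (the values of k_1, k_{-1} and the initial concentrations are arbitrary assumptions, not taken from the text), the reversible binding reaction can be integrated with SciPy and the simulated steady state compared with the equilibrium constant of equation (13):

import numpy as np
from scipy.integrate import solve_ivp

k1, k1m = 2.0, 0.5          # arbitrary forward and backward rate constants

def rhs(t, y):
    """ODEs (14) for A + B <-> AB under mass-action kinetics."""
    A, B, AB = y
    v = k1 * A * B - k1m * AB        # net rate, equation (12)
    return [-v, -v, v]

sol = solve_ivp(rhs, (0, 50), [1.0, 1.5, 0.0], rtol=1e-8, atol=1e-10)
A, B, AB = sol.y[:, -1]

# At equilibrium [AB]/([A][B]) should approach k1/k-1, equation (13).
print(AB / (A * B), k1 / k1m)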

Brown (1902) was the first to propose a scheme for the control an enzyme exerts on a reaction. The considered reaction is irreversible and has one reactant and one product. No other effectors take part in this reaction. In the corresponding reaction scheme the enzyme E is included, S denotes the reactant and P denotes the product:
E + S \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} ES \overset{k_2}{\rightarrow} E + P \qquad (17)

Deterministic Modeling in Medicine

The formation of the enzyme-reactant complex is reversible but the release of the product from this
complex is an irreversible reaction. Now the changes in the concentrations of the components over time
are:
\frac{d[ES]}{dt} = k_1 [E][S] - (k_{-1} + k_2)[ES], \qquad \frac{d[P]}{dt} = k_2 [ES],

\frac{d[S]}{dt} = -k_1 [E][S] + k_{-1} [ES], \qquad \frac{d[E]}{dt} = -k_1 [E][S] + (k_{-1} + k_2)[ES] \qquad (18)

A rate law for enzymatic reactions, based on this principle but also applicable to more complex reaction structures, is given by the Michaelis-Menten kinetics (Michaelis and Menten 1913). This and various other enzyme kinetic rate laws are described in Klipp et al. (2005). In particular, various forms of interaction, for example different types of inhibition, can be included in these calculation rules.
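For reference, a minimal sketch (placeholder numbers, not from the original chapter) of the Michaelis-Menten rate law that results from scheme (17) under the quasi-steady-state assumption for the complex ES, with V_max and K_m expressed through the elementary constants of (17):

def michaelis_menten(S, E_total, k1, k1m, k2):
    """Michaelis-Menten rate v = Vmax*S/(Km + S) derived from scheme (17)
    under the quasi-steady-state assumption for the complex ES."""
    v_max = k2 * E_total            # maximal velocity
    K_m = (k1m + k2) / k1           # Michaelis constant
    return v_max * S / (K_m + S)

# Placeholder numbers, only to show the call:
print(michaelis_menten(S=5.0, E_total=0.1, k1=10.0, k1m=1.0, k2=0.5))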

MODELING OF APOPTOSIS


Apoptosis, the programmed death of a cell, has different causes; it occurs during embryonic development and can be initiated by immune system reactions or DNA damage, for example. Apoptosis plays an important role in cancer. Abnormal cells divide and grow without control, and these cells lose the ability to regulate apoptosis. The (signaling) pathway of apoptosis is in general well known (Hanahan and Weinberg, 2000; MacFarlane and Williams, 2004) and allows us to analyze the characteristics of the apoptotic mechanism in mathematical models. There are two principal pathways of apoptosis. The intrinsic (Bcl-2 inhibitable or mitochondrial) pathway of apoptosis is induced by various forms of intracellular stress, developmental cues or external stimuli, and the extrinsic or Caspase 8/10 dependent
pathway of apoptosis is initiated by the engagement of death receptors. A very detailed description of
cancer-relevant pathways, including the extrinsic and intrinsic apoptosis pathways with all their important processes, is given in Weinberg (2006).
To illustrate the possibilities and structures of this kind of modeling in a medical context, two previously published apoptosis models are briefly described in the following sections.

An Extrinsic Apoptosis Model


The first model is published and characterized in Eissing et al. (2004). This model is also used in this
chapter to explain and demonstrate graphically the results of mathematical calculations and analysis
methods. A part of the extrinsic apoptosis pathway in a caspase activation model for receptor-induced
apoptosis is shown. In this model the functionalities of Caspase 8 and Caspase 3 are integrated. Caspases
belong to the enzyme family of the proteases. They are very important enzymes in apoptosis and they
cleave other proteins. Two different types of caspases exist. Caspase 8 is an initiator (apical) caspase.
These caspases are activated to trigger apoptosis and they activate (cleave) inactive pro-caspases of effector caspases. The effector caspases in turn cleave other proteins. This process finally results in cell
death. Caspase 8 activates the inactive pro-form of the effector Caspase 3. BAR, a bifunctional apopto-


sis regulator, binds to activated caspase proteins and plays an important role in the regulation of the apoptotic mechanism. When an activated caspase protein binds to the regulator, the caspase protein cannot perform its function and does not induce or support the apoptotic process. The Caspase 8-BAR
complex (C8a-BAR) represents this compound. The family of the IAP (inhibitor of apoptosis proteins)
has the functionality to inhibit and to avoid cell death. In the considered model the IAP protein binds
activated Caspase 3 to an IAP-Caspase 3 complex (C3a-IAP). Activated Caspase 3 creates a positive
feedback loop in the activation of Caspase 8, and Caspase 8 in turn, provides a positive feedback in the
activation of Caspase 3. Another positive feedback loop of activated Caspase 3 acts upon the degradation of IAP.
The activation of Caspase 3 finally results in the controlled cell death. Figure 1 illustrates and displays
the components of the model and the reactions.

An Intrinsic Apoptosis Model


The rapid kinetics of effector caspase activation in the intrinsic apoptosis pathway and the role of the
formation of the apoptosome, the activation of effector Caspase 3 and initiator Caspase 9, and the behaviour of XIAP are the main subjects of the model published and presented in Rehm et al. (2006). The release of cytochrome c triggers the formation of a multi-protein complex, the apoptosome, that also comprises
the apoptotic protease-activation factor 1 (Apaf-1) and Procaspase 9. Thus, Procaspase 9 passes into
the activated Caspase 9 and activates effector Caspase 3. Positive feedback loops of both the activated
Caspase 3 and the activated Caspase 9 in the activation of Caspase 3 and the activation of Caspase 9, are
part of the modelled signalling network. The X-linked-inhibitor-of-apoptosis-protein (XIAP), a member

Figure 1. Caspase activation model for receptor-induced apoptosis. The reactions and components of
extrinsic apoptosis are graphically shown. Red, dashed arrows indicate a modifier influence and not a
mass flow. This model is published and presented in Eissing et al. (2004)


of IAP family proteins, acts as an inhibitor of Caspase 3 and Caspase 9. Smac counteracts XIAP and
displaces the caspases from their XIAP interaction sites. Bir1-2 and Bir3-RING fragments result from
the Caspase 3 dependent cleavage of XIAP. The bindings of Caspase 9 to Bir3-RING and Caspase 3 to
Bir1-2 are also part of the model. A special mathematical feature of this model is the Smac release and
the cytochrome c initiated apoptosome formation through time and half-time dependent kinetics, which
are reflected by the accordingly defined analytic functions. The computational model of this process
is based on 53 reactions and the main aim of this model is to verify effector caspase activation and the
control of this activation by XIAP.
In addition to the results and predictions that are made in the analysis of these models, there is one
aspect that we observe in the results of both models. If enough activated Caspase 3 is present in the system, apoptosis is initiated and the cell starts to die. Both models show this switch or breakpoint, the so-called point of no return.

ALGORITHMS AND MODEL ANALYSIS METHODS

Time-Course Simulation


Quantitative time-course simulation describes the changes over time in the concentrations of the model
components. In addition to the concentration time courses, the flux time courses can also be calculated. Both give a good general overview: the concentration time course indicates the quantitative behavior of the substances in the network, and the reaction rates clarify the flux through the corresponding network system. The calculations can be performed by using appropriate numerical solvers for ODE systems. LSODA (Hindmarch 1983, Petzold 1983), a variant of the LSODE package (Hindmarch 1980), can be used. The algorithm switches automatically between stiff and non-stiff methods and offers a
good opportunity for solving systems with different problem definitions. The algorithm is available in
different programming languages, in easy to use software packages. For example, SciPy the Scientific
Tools for Python (Jones et al. 2001 -) offers an implementation of this algorithm.
An alternative choice is Limex (Deuflhard and Nowak 1987), a solver that uses an extrapolation method
to solve linearly-implicit differential-algebraic systems (DAEs). An implicit one-step method is combined with stepped extrapolation, offering adaptive control of step size and order. This algorithm is written in the programming language Fortran and the source code is available at the CodeLib library of the Zuse Institute Berlin via the URL http://www.zib.de/Numerik/numsoft/CodeLib/ivpode.en.html. The
possibility to transform the source code to another programming language or to embed the source code
in another programming language is offered. Several graphical tools provide interfaces for the currently
used programming languages and they can easily provide a graphical representation of the simulated
time-course output. For example, the powerful graphical tool Gnuplot is freely available at http://www.
gnuplot.info/ and runs on commonly used computer platforms. A Gnuplot application interface for the
programming language C++ is available at http://www.suiri.tsukuba.ac.jp/~asanuma/gnuplot++/ and a
Python package that interfaces with Gnuplot (Gnuplot.py) also exists and is available at http://gnuplotpy.sourceforge.net/. Time-course simulations can also be performed with modeling tools, outlined in
this chapter under the sub-title Modeling Tools and Model Repositories.
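As a minimal, hedged sketch of such a time-course simulation in Python (the two-species toy chain and the rate constants are placeholders, not the apoptosis model), scipy.integrate.odeint can be used; it wraps the LSODA solver from ODEPACK and therefore switches automatically between stiff and non-stiff methods:

import numpy as np
from scipy.integrate import odeint   # odeint wraps the LSODA solver from ODEPACK

def rhs(y, t, k1, k2):
    """Toy two-step chain A -> B -> degradation with mass-action kinetics."""
    A, B = y
    return [-k1 * A, k1 * A - k2 * B]

t = np.linspace(0, 20, 200)
traj = odeint(rhs, y0=[1.0, 0.0], t=t, args=(0.8, 0.3))

# traj has one column per species; it can be written to a file and plotted,
# for example with Gnuplot or another plotting tool.
print(traj[-1])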


Example of Time-Course Simulations


The apoptosis model published in Eissing et al. (2004) exemplifies this method. Figure 2 shows a graphical representation of the calculated time-course results. The time-course calculations and the graphical
output results are generated by the tool PyBioS (Wierling et al. 2007, Klipp et al. 2005), a Web-based
modeling and simulation system. The curves represent the changes in the concentrations of the model components over time. The abbreviation C3 refers to Caspase 3, C3a refers to the activated Caspase 3, C8 to Caspase 8 and C8a to Caspase 8 in the activated form. It is quite obvious that there is a switch from the life state to the death state of the cell. Figure 2 also illustrates the behavior of the Caspase 3-IAP complex. Initially, the inhibitor protein IAP binds the activated Caspase 3 and prevents the initialization of the apoptosis process. After some time there is a peak in the curve that indicates the point of no return, and subsequently the concentration of this complex does not change with time. When the concentration
of the activated Caspase 3 has reached a significant level, the apoptosis inhibitor protein IAP has no
chance to stop the apoptotic process.

Parameter Scanning
Parameter scanning gives us information about the influence of the kinetic parameters defined in the kinetic rules on the system's behavior. Under the assumption that the system reaches a stable steady state we can perform this sensitivity analysis. The system reaches a steady state, or stationary state, when the quantitative values such as the concentrations and the fluxes no longer change with time. This characteristic is equivalent to reaching a root of the ODE system, so the balance equation (5) becomes:

\frac{dS_i}{dt} = 0, \qquad i=1,\ldots,n \qquad (19)

In a nonlinear system this equation can have several solutions. An important precondition to make
predictions about a systems behavior in the steady state is the stability of the steady state. In this case,
stability means that the system stays in this state, and every trajectory that starts in a neighborhood arbitrarily close to this state also converges to this steady state. A detailed theory about the stability of a steady state
is described in Walter (1998) and also in Klipp et al. (2005). Through linearization of the autonomous
system, we can perform a linear stability analysis to gain information about the stability of the system.
Thus the steady state S^0 = (S_1^0, \ldots, S_n^0) can be called stable if all eigenvalues of the Jacobian matrix of the system of functions in equation (8) have negative real parts at this point. The Jacobian matrix of this function in the state S^0 is defined by:

\frac{\partial f}{\partial S}(S^0) = \begin{pmatrix} \frac{\partial f_1(S^0)}{\partial S_1} & \frac{\partial f_1(S^0)}{\partial S_2} & \cdots & \frac{\partial f_1(S^0)}{\partial S_n} \\ \frac{\partial f_2(S^0)}{\partial S_1} & \frac{\partial f_2(S^0)}{\partial S_2} & \cdots & \frac{\partial f_2(S^0)}{\partial S_n} \\ \vdots & \vdots & & \vdots \\ \frac{\partial f_n(S^0)}{\partial S_1} & \frac{\partial f_n(S^0)}{\partial S_2} & \cdots & \frac{\partial f_n(S^0)}{\partial S_n} \end{pmatrix} \qquad (20)


Figure 2. Time-course simulation of different components of the apoptosis model in (Eissing et al. 2004).
The switch between the life state and the death state of the cell is explicitly visible

The roots of the ODE system (19) can be calculated using the MINPACK subroutine HYBRID1,
available at Netlib Library, a collection of mathematical software, papers, and databases. For more details see http://www.netlib.org/. This numerical algorithm is a modification of the method described in
Powell (1970). The Jacobian matrix can be determined by the forward difference formulas:


\frac{\partial f_j(S^0)}{\partial S_i} \approx \frac{f_j(S^0(i)) - f_j(S^0)}{\varepsilon}, \qquad i=1,..,n,\; j=1,..,n \qquad (21)

with a sufficiently small real value \varepsilon > 0. The vector S^0(i) in this equation is defined through:

S^0(i) := (S_1^0, \ldots, S_i^0 + \varepsilon, \ldots, S_n^0)^T \qquad (22)

We can compute the eigenvalues of the system by using an appropriate routine from the LAPACK
package included in the Netlib Library. If the Jacobian matrix (20) of the system is singular, the numerical root finding method cannot find a unique solution and fails. In this case we can determine the steady
state of the system with a time-course simulation. We perform time-course simulation until the changes
in the values of the concentrations in S(t) = (S_1(t), \ldots, S_n(t)) are below a given threshold and stay below this threshold. For two consecutive vectors of concentration values S^1 and S^2 this condition can be checked by:

\| S^1 - S^2 \|_2 < \varepsilon \qquad (23)

In this equation \| \cdot \|_2 denotes the Euclidean vector norm and \varepsilon is a sufficiently small positive real value.
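A hedged sketch of this procedure in Python (the two-variable toy system and its constants are placeholders): scipy.optimize.fsolve, which builds on MINPACK's hybrid Powell routines, locates a root of f(S), and the eigenvalues of a forward-difference Jacobian, in the spirit of equations (20) and (21), indicate stability:

import numpy as np
from scipy.optimize import fsolve   # builds on MINPACK's hybrid Powell method

k_in, k_out = 1.0, 0.4

def f(S):
    """Toy ODE right-hand side with a single stable steady state."""
    S1, S2 = S
    return np.array([k_in - k_out * S1, k_out * S1 - k_out * S2])

S0 = fsolve(f, x0=[0.1, 0.1])         # root of f, i.e. a steady state, equation (19)

def jacobian(f, S, eps=1e-6):
    """Forward-difference Jacobian, equation (21)."""
    n = len(S)
    J = np.zeros((n, n))
    fS = f(S)
    for i in range(n):
        Si = S.copy()
        Si[i] += eps
        J[:, i] = (f(Si) - fS) / eps
    return J

eigs = np.linalg.eigvals(jacobian(f, S0))
print(S0, eigs)                        # negative real parts -> stable steady state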
In the case of a stable steady state, a selected kinetic parameter is scanned over a predefined interval and the corresponding steady-state values of the considered model components are calculated. This enables us to analyze the influence of the scanned kinetic parameter on the behavior of the model. Controlled quantities are in general the concentrations of the substances and the values of the reaction rates.
Furthermore, we can analyze the sensitivity of the concentration of a model component with respect to a single reaction, and analyze the influence of an enzyme on a reaction rate. In this manner we can apply several analysis approaches.
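A parameter scan can be sketched under the same assumptions as the steady-state example above (the parameter name k_out, the interval and the toy system are placeholders): the parameter is varied over a predefined interval, the steady state is recomputed for each value, and the steady-state concentrations are recorded:

import numpy as np
from scipy.optimize import fsolve

k_in = 1.0

def make_rhs(k_out):
    def f(S):
        S1, S2 = S
        return np.array([k_in - k_out * S1, k_out * S1 - k_out * S2])
    return f

scan = []
for k_out in np.logspace(-2, 1, 20):          # predefined parameter interval
    S_ss = fsolve(make_rhs(k_out), x0=[1.0, 1.0])
    scan.append((k_out, *S_ss))               # record steady-state concentrations

for k_out, S1, S2 in scan[:3]:
    print(f"k_out={k_out:.3g}  S1={S1:.3g}  S2={S2:.3g}")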

Example of a Parameter Scan


In this paragraph we apply a parameter scan to analyze the behavior of the apoptosis model described
in Eissing et al. (2004). The activation of the effector Caspase 3 plays an important role in this extrinsic apoptosis pathway. The parameter k_1 controls this activation in the reaction model. The corresponding kinetic rate law is:

v_1 = k_1 [Caspase8a][Caspase3]

(24)

The concentration of activated Caspase 8 serves as a modifier and Caspase 3 serves as a reactant
in this reaction.

Figure 3 illustrates the surprisingly strong influence of this reaction on the components of the model. A small value of the parameter k_1, in the range of 10^{-5}, is enough to start apoptosis. We implemented the parameter scanning and the corresponding graphical results with PyBioS. The green horizontal bar at the top of the figures denotes that the found steady states are stable. In the abbreviations of the implemented model, C3 represents Caspase 3, C3a represents the activated form of Caspase 3, C8 stands for Caspase 8 and C8a for the activated form of Caspase 8.

Metabolic Control Analysis


Metabolic Control Analysis (MCA) defines a quantitative framework for characterizing the interactions
between particular reactions, kinetic constants and values of fluxes and concentrations. The local and
global behavior of the network system under the influence of small (infinitesimal) changes is described.
Controlled variables could be the rate laws or the concentrations of substances. Controlling parameters
can be kinetic constants or the concentration of an enzyme, for example. More detailed comprehensive
theories of stationary control states are presented in numerous instances e. g. (Fell 1992; Schuster &
Heinrich 1992; Cornish-Bowden 1995).
There are two local control coefficients that can always be calculated, because there is no need for
a steady state. The local control coefficients are the elasticity coefficients. These coefficients can be
calculated at a certain time point or at a steady state. The \varepsilon-elasticity describes the sensitivity of the rate v_k of a reaction to a change in the concentration S_i. The \pi-elasticity describes the sensitivity of the rate v_k to a small change in a kinetic parameter p_l:

\varepsilon_i^k := \frac{S_i}{v_k}\,\frac{\partial v_k}{\partial S_i}, \quad i=1,..,n;\; k=1,..,m; \qquad \pi_l^k := \frac{p_l}{v_k}\,\frac{\partial v_k}{\partial p_l}, \quad l=1,..,r;\; k=1,..,m \qquad (25)

Under the assumption that the system reaches a stable steady state we can also calculate global control
and response coefficients. The control coefficients show the influence of a small disturbance in a reaction rate on the steady-state fluxes or the steady-state concentrations. These control coefficients are the flux-control coefficient C_k^j for the control of the rate v_k over the steady-state flux J_j, and the concentration-control coefficient C_k^i, which describes the impact of the rate v_k on the steady-state concentration S_i:

C_k^j := \frac{v_k}{J_j}\,\frac{\partial J_j}{\partial v_k}, \quad j=1,..,m;\; k=1,..,m; \qquad C_k^i := \frac{v_k}{S_i}\,\frac{\partial S_i}{\partial v_k}, \quad i=1,..,n;\; k=1,..,m \qquad (26)

The response coefficients show the response of the steady-state fluxes and the steady-state concentrations to small changes in a certain parameter. These coefficients do not only clarify the direct influence of controlled quantities but also show indirect influences. This is useful for understanding system relationships which are not obviously visible. R_l^j shows the response of the steady-state flux J_j to a small change in the parameter p_l, and R_l^i expresses the response of the steady-state concentration S_i to a small modification of the parameter p_l:

R_l^j := \frac{p_l}{J_j}\,\frac{\partial J_j}{\partial p_l}, \quad j=1,..,m;\; l=1,..,r; \qquad R_l^i := \frac{p_l}{S_i}\,\frac{\partial S_i}{\partial p_l}, \quad i=1,..,n;\; l=1,..,r \qquad (27)


Figure 3. The influence of Caspase 3 activation is represented in this model by the kinetic constant k1.
The graphics illustrate the strong influence of this activation on all components of the model. Above:
The described influence on the concentration of Caspase 3, cleaved Caspase 3, IAP and the activated
Caspase3-IAP complex. Below: The steady state behavior of the concentration of Caspase 8, activated
Caspase 8, BAR and the activated Caspase8-BAR complex, depending on changes in the parameter k1.


In the definitions (25), (26) and (27) all coefficients are normalized.
For an elaborate derivation and description of the calculation rules for the coefficients see (Hofmeyr
2001).
The calculation of the non-normalized local coefficients \partial v_k / \partial S_i and \partial v_k / \partial p_m can be performed by the forward difference formulas:

\frac{\partial v_k}{\partial S_i} \approx \frac{v_k(S(i)) - v_k(S)}{\varepsilon}, \qquad \frac{\partial v_k}{\partial p_m} \approx \frac{v_k(p(m)) - v_k(p)}{\varepsilon} \qquad (28)

In the equations we use a sufficiently small \varepsilon > 0. S(i) and p(m) are defined through:

S(i) = (S_1, \ldots, S_i + \varepsilon, \ldots, S_n)^T \quad \text{and} \quad p(m) = (p_1, \ldots, p_m + \varepsilon, \ldots, p_r)^T \qquad (29)

The decomposition of the stoichiometric matrix N defined in (3) into reduced row echelon form (Strang 1980) by Gauss elimination,

N = L N_R \qquad (30)

provides the reduced stoichiometric matrix N_R with r linearly independent rows and the link matrix L. If rank(N) = n (r = n), then L is the identity matrix I and N = N_R. The Jacobian matrix M of the reduced system, and thus the Jacobian matrix of the independent components of the system, can be determined by M = N_R \frac{\partial v}{\partial S} L. We can calculate the non-normalized control coefficients, arranged in matrices, with:

C^S = -L\, M^{-1} N_R, \qquad C^J = \frac{\partial v}{\partial S}\, C^S + I \qquad (31)

The matrix representations of the response coefficients are determined through:

R^S = C^S \frac{\partial v}{\partial p}, \qquad R^J = C^J \frac{\partial v}{\partial p} \qquad (32)

We can normalize the coefficients through multiplication with the corresponding normalization factors; these are included in the equations (25), (26) and (27). In complex model structures it can be helpful to analyze the dependencies and influences of the components of the system with these methods. A necessary precondition is a stable steady state. Calculating these coefficients for a given model is straightforward. First we determine the elasticity matrices with the forward difference formulas. Then the other coefficients can be calculated using common matrix operations. The standard linear algebra algorithms are implemented in several software packages and are, for example, available at the Netlib library as components of the LAPACK package.
Some modeling tools also provide this analysis method, as described in the section Modeling Tools
and Model Repositories.
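The following sketch illustrates this procedure in Python under explicit simplifying assumptions: the placeholder network (a linear chain with constant input) has a full-rank stoichiometric matrix, so L = I and N_R = N; the elasticity matrix is obtained by the forward differences of equation (28), and the control matrices follow equation (31):

import numpy as np

# Placeholder linear chain:  input -> S1 -> S2 -> degradation (three reactions, two species).
N = np.array([[ 1, -1,  0],
              [ 0,  1, -1]])            # stoichiometric matrix (full rank, so L = I)

k = np.array([1.0, 2.0, 3.0])           # placeholder kinetic constants

def v(S):
    """Rate vector: constant input, then two mass-action steps."""
    return np.array([k[0], k[1] * S[0], k[2] * S[1]])

def elasticities(S, eps=1e-6):
    """Non-normalized elasticity matrix dv/dS by forward differences, equation (28)."""
    E = np.zeros((len(k), len(S)))
    vS = v(S)
    for i in range(len(S)):
        Si = S.copy()
        Si[i] += eps
        E[:, i] = (v(Si) - vS) / eps
    return E

S_ss = np.array([k[0] / k[1], k[0] / k[2]])   # steady state of this toy chain

E = elasticities(S_ss)
M = N @ E                                     # Jacobian of the reduced system
C_S = -np.linalg.inv(M) @ N                   # concentration-control coefficients, equation (31)
C_J = E @ C_S + np.eye(len(k))                # flux-control coefficients, equation (31)

print(C_S)
print(C_J)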


Example of Metabolic Control Analysis


The following reactions and kinetics demonstrate a simple network example:

v_0: ES \rightarrow E + S_1, \qquad v_0 = k_0 \cdot ES
v_1: S_1 \rightarrow S_2, \qquad v_1 = \frac{k_1 \cdot S_1}{k_3 + S_1}
v_2: S_2 + E \rightarrow ES, \qquad v_2 = k_2 \cdot S_2 \qquad (33)

Figure 4 shows colored box plots of the normalized concentration-control coefficients and the normalized \varepsilon-elasticity coefficients, in matrix form. Even though the example model is very small and consists of only three reactions, the corresponding matrix plots show more than ten values per coefficient matrix. The example demonstrates that metabolic control analysis condenses complex relations into a few characteristic values. The control coefficient box plot reveals an indirect negative influence of v_0 on the steady-state concentration of S_2, and the influence of v_1 upon the concentration of S_1 is significantly less than the influence of v_2 on S_2.

Experimental Data and Drug Development


Experimental data is already available for various diseases in different states. Data that includes medical drug treatment also provides information that could be used for modeling. In Lamb et al. (2006) the
authors describe the connection between diseases, genetic perturbations and the influence of drugs. In
Karaman et al. (2008) we find a quantitative analysis of kinase inhibitor selectivity in the form of an interaction map. There are several options for including the knowledge provided by the literature and by experimental data in corresponding model structures. Because the information provided by such studies is often very specific, the applied methods depend on the particular conditions and are individual to each case. Here, we describe general possibilities and guidelines.

Figure 4. A simple example network illustrates the implementation of metabolic control analysis. This
figure illustrates the reaction network, the normalized concentration control coefficients and normalized
-elasticity coefficients in colored box plots


One possibility to include experimental data in a model is to initialize the start values of the corresponding ODE system with measured values. These values can be obtained directly from the experimental data or they can be determined by mathematical equations; the corresponding equations result from an existing functional relationship in the model.
Another possibility for including information from experimental data is to integrate this knowledge into the kinetic laws. The given information can be used to determine the kinetic parameters directly, or to fit the kinetic parameters using specifically adapted algorithms. There are also several methods for integrating the influence of drugs in kinetic law formulae. For instance, we can determine the value of a kinetic constant using knowledge about the dose rate and the concrete functional form of the impact of a drug. A lot of comprehensive, active research already exists in this area.
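As one concrete but hypothetical way to fit a kinetic parameter to time-course measurements (the data array, the decay model and the parameter are placeholders, not taken from any of the studies cited above), scipy.optimize.least_squares can minimize the residuals between simulated and measured concentrations:

import numpy as np
from scipy.integrate import odeint
from scipy.optimize import least_squares

# Hypothetical measured time course of a single decaying species.
t_data = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
y_data = np.array([1.0, 0.62, 0.37, 0.14, 0.02])

def simulate(k, t):
    """Simple decay model A -> degradation with rate constant k, cf. equation (16)."""
    return odeint(lambda y, t: -k * y, y0=[1.0], t=t)[:, 0]

def residuals(params):
    (k,) = params
    return simulate(k, t_data) - y_data

fit = least_squares(residuals, x0=[0.1], bounds=(0.0, np.inf))
print(fit.x)          # estimated kinetic constant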

AVAILABLE TOOLS AND DATA


Several user-friendly software tools are available. These tools offer the possibility to perform time-course
simulations and different analysis methods. An established data exchange format allows for flexible usage, and model repositories provide particular models for different problems. These possibilities allow us
to look at the given problem from different perspectives. Many databases are available in this research
area. They provide both essential and specialist resources which can be used for modeling.

Modeling Tools and Model Repositories


This section introduces useful tools available for developing and analyzing biological and biomedical models. Systems biology standards and two model repositories are presented. A short overview and the corresponding URLs are listed in Table 1.
Mathematica, a mathematical software system, provides a large collection of computational methods and necessary tools for this kind of kinetic modeling, simulation and analysis. The software
is developed by Wolfram Research.
Another mathematical commercial software package that is capable of the computations and representations is Matlab, developed by MathWorks. This software package is often used in scientific research,
provides good documentation and is suitable to use for the calculations and graphical representations.
For more details about using Mathematica and Matlab in this context see Klipp et al. (2005).
COPASI (Hoops et al. 2006), a complex pathway simulator tool, is also available as freeware. This tool is user-friendly and offers the possibility to build models, run simulations and apply various analysis methods. MCA, steady-state calculations, linear stability analysis, parameter scanning and parameter estimation using experimental data are the main features of this modeling tool. The graphical representation of the results is easy to follow and can be configured in various ways. The import and export of models in SBML, the Systems Biology Markup Language, is also available.
The E-Cell System (Tomita et al. 1999, Takahashi et al. 2003) is a software platform that provides modeling, simulation and analysis methods for cell simulations at the molecular level. A graphical user interface clearly arranges the various functions of the object-oriented tool. Further information about
this tool, its functions and underlying theory is available on the E-Cell System Web site.
Cell-Designer (Funahashi et al. 2003) is a tool for modeling biochemical networks. The tool provides
a diagram editor and also supports time-course simulations and parameter scanning.


Table 1. This table lists the Web sites of modeling tools, systems biology standards and model repositories
Modeling Tools
PyBioS: http://pybios.molgen.mpg.de/Pybios
Mathematica: http://www.wolfram.com/
Matlab: http://www.mathworks.com/
COPASI: http://www.copasi.org
E-Cell: http://www.e-cell.org
Cell-Designer: http://www.systems-biology.org/cd/
JDesigner: http://jdesigner.org/
Jarnac: http://www.sys-bio.org/software/jarnac.htm

Systems Biology Standards and Workbench
SBML, Systems Biology Markup Language: http://sbml.org/
SBGN, Systems Biology Graphical Notation: http://sbgn.org/
SBW, Systems Biology Workbench: http://sbw.sourceforge.net/

Model Repositories
BioModels database: http://www.ebi.ac.uk/biomodels/
JWS Online repository: http://jjj.biochem.sun.ac.za/index.html

The Systems Biology Markup Language (SBML), the standard for representing models of biochemical and gene-regulatory networks, is also supported. Parts of the new notation SBGN, the Systems Biology Graphical Notation, are implemented in the new release CellDesigner 4.0 beta. SBGN is being developed as a standard for the graphical representation of computational models in systems biology.
The Systems Biology Workbench SBW (Hucka et al. 2002) includes several software packages for
modeling, analysis and visualization in systems biology. The applications communicate via a simple
network protocol. A detailed description of the software package can be found in Sauro et al. (2003).
JDesigner is a graphical network designer for biochemical networks and part of the SBW. Simulations can be performed in JDesigner by Jarnac, a tool that offers a simulation service within the SBW.
JDesigner and Jarnac are described in more detail in Hucka et al. (2002).
PyBioS supports modeling, time-course simulations, various analysis methods and the dynamic population of models by using information provided by interfaces to pathway databases. PyBioS is a Web-based environment and provides a model repository. The export and import of models is enabled
via an SBML interface. Figure 5 gives an overview of the appearance and functions of PyBioS.

Data Resources and Databases


A large number of data resources and various databases are accessible via the Internet. Table 2 summarizes some databases which provide information and data that we can use for modeling. Most of these databases are described in Wierling et al. (2007).


Figure 5. The Web-based PyBioS simulation environment. The model repository (A) offers models to be
selected. The search interface (B) enables the automatic population, with access to several databases.
The graphic shows some hits that are actually found. Graphically represented time-course simulations
can be found in (C). The result of metabolic control analysis is represented in colored box plots (D) and
a network graph (E) visualizes the network structures of the corresponding reactions.

CONCLUSION
Within biomedical science, mathematical modeling is becoming more and more important. With the help of several mathematical methods, modeling can be extended to encompass biomedical systems and to analyze system behavior and structures in an arbitrarily detailed manner.
Single genes give rise to proteins and protein complexes that can act, for instance, as transcription factors; the expression of other genes can subsequently be inhibited or activated, and with this, positive or negative feedback loops can be established. Based on such information about intra- and extracellular biological network structures, we can create models of varying complexity.
The deterministic, kinetic modeling introduced here supports the essential and useful possibilities
available for understanding and analyzing the functionality of diseases in living systems. The integra-


Table 2. Useful databases freely available on the WWW


Pathway databases
KEGG, the Kyoto Encyclopedia of Genes and Genomes: http://www.genome.jp/kegg/
Reactome, a compendium of biological pathways: http://www.reactome.org
HumanCyc, Human Genes and Metabolism: http://humancyc.org/
Pathway Interaction Database: http://pid.nci.nih.gov/
BioCarta, pathway database: http://www.biocarta.com/
NetPath, signal transduction pathways: http://www.netpath.org/
The Cancer Cell Map: http://cancer.cellmap.org/cellmap/home.do
IntAct, protein interaction data: http://www.ebi.ac.uk/intact/site/index.jsf
MINT, protein interactions: http://mint.bio.uniroma2.it/mint/Welcome.do
HPRD, human protein reference database: http://www.hprd.org
BioGRID, set of physical and genetic interactions: http://www.thebiogrid.org/
SPIKE, biological signaling networks: http://www.cs.tau.ac.il/~spike/

Genetic Inheritance
OMIM, human genes and genetic disorders: http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim

Expression data
GEO, Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/projects/geo/
ArrayExpress, microarray data: http://www.ebi.ac.uk/microarray-as/aer/

Kinetic data
BRENDA, enzyme information system: http://www.brenda-enzymes.info/
SABIO-RK, reaction kinetics database: http://sabio.villa-bosch.de/SABIORK/

tion of drug interactions can help us to identify important influences and adverse effects on the corresponding biomedical processes. The presented analysis methods are useful instruments for studying the relationships and interactions between the components of the observed processes. We can also ascertain the influence of individual model components on the global model behavior. In this way, we can identify important, previously unrecognized relationships. Furthermore, the cooperation of exact and applied sciences in the biomedical research area, enabled by mathematical modeling, will continue to produce new methods for understanding the complex interaction mechanisms of various diseases.


Acknowledgment
The author thanks Christoph Wierling and Hendrik Hache for proof-reading this chapter and for their constructive feedback, and Andriani Daskalaki for her good ideas and assistance. The author addresses special thanks to Jane Conway for her excellent proof-reading. This work was supported by the EU grants AnEUploidy and EMI-CD within the Framework 6 funding, and by the Max Planck Society.

References
Brown, A. J. (1902). Enzyme action. Chem. Soc., 81, 373-386.
Cornish-Bowden, A. (1995). Metabolic control analysis in theory and practice. Adv. Mol. Cell Biol.,
11, 21-64.
Cui, Q., Ma, Y., Jaramillo, M., Bari, H., Awan, A., Yang, S., et al. (2007). A map of human cancer signalling. Molecular Systems Biology, 3, 152.
Deuflhard, P., & Nowak, U. (1987). Extrapolation integrators for quasilinear implicit ODEs. In P. Deuflhard & B. Engquist (Eds.), Large scale scientific computing. Series Progress in Scientific Computing, Birkhäuser, 37-50.
Eissing, T., Conzelmann, H., Gilles, E. D., Allgöwer, F., Bullinger, E., & Scheurich, P. (2004). Bistability analyses of a caspase activation model for receptor-induced apoptosis. J. Biol. Chem., 279, 36892-36897.
Fell, D. A. (1992). Metabolic control analysis: A survey of its theoretical and experimental development.
Biochem. J., 286, 313-330.
Funahashi, A., Morohashi, M., Kitano, H., & Tanimura, N. (2003). CellDesigner: A process diagram editor for gene-regulatory and biochemical networks. BIOSILICO, 1, 159-162.
Futreal, P. A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R. et al. (2004). A census of
human cancer genes. Nature Reviews Cancer, 4, 177-183.
Gibson, M. A., & Bruck, J. (2000). Efficient exact stochastic simulation of chemical systems with many
species and many channels. J. Phys. Chem., 104, 1876-1889.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem, 81,
2340-2361.
Guldberg, C. M., & Waage, P. (1879). Concerning chemical affinity. Erdmann's Journal für Practische Chemie, 127, 69-114.
Hanahan, D., & Weinberg, R. A. (2000). The hallmarks of cancer. Cell, 100, 57-70.
Hindmarch, A. C. (1980). LSODE and LSODI, two new initial value ordinary differential equation
solvers. ACM-SIGNUM Newsletter, 15, 10-11.


Hindmarch, A. C. (1983). ODEPACK, a systematized collection of ODE solvers. In R. Stepleman et al. (Eds.), IMACS Transactions on Scientific Computation, 1, 55-64.
Hofmeyr, J. H. S. (2001). Metabolic control analysis in a nutshell. In T. M. Yi, M. Hucka, M. Morohashi, & H. Kitano (Eds.), Proceedings of the 2nd International Conference on Systems Biology (pp. 291-300). Madison, WI: Omnipress.
Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., et al. (2006). COPASI a COmplex
PAthway SImulator. Bioinformatics, 22, 3067-3074.
Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J., & Kitano, H. (2002). The ERATO systems biology workbench: Enabling interaction and exchange between software tools for computational biology. Proceedings of the Pacific Symposium on Biocomputing, 7, 450-461.
Jones, E., Oliphant, T., Peterson, P., & others. (2001). SciPy: Open source scientific tools for Python.
http://www.scipy.org
Karaman, M. W., Herrgard, S., Treiber, D. K., Gallant, P., Atteridge, C. E., Campbell, B. T., et al. (2008).
A quantitative analysis of kinase inhibitor selectivity. Nature Biotechnology, 26, 127-132.
Kaufmann, S. A. (1993). The origins of order: Self-organization and selection in evolution. New York:
Oxford University Press.
Klipp, E., Herwig, R., Kowald, A., Wierling, C., & Lehrach, H. (2005). Systems biology in practice:
Concepts, implementation and application. WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Lamb, J., Crawford, E. D., Peck, D., Modell, J. W., Blat, I. C., Wrobel, M. J., et al. (2006). The connectivity map: Using gene-expression signatures to connect small molecules, genes and disease. Science
313, 1929-1935.
Le Novère, N., Bornstein, B., Broicher, A., Courtot, M., Donizelli, M., Dharuri, H., et al. (2006). BioModels
database: A free, centralized database of curated, published, quantitative kinetic models of biochemical
and cellular systems. Nucleic Acids Research, 34, 689-691.
MacFarlane, M., & Williams, A. C. (2004). Apoptosis and disease: A life or death decision. EMBO
Rep, 5, 674-678.
Michaelis, L., & Menten, M. L. (1913). Die Kinetik der Invertinwirkung. Biochem. Z., 49, 333-369.
Petzold, L. (1983). Automatic selection of methods for solving stiff and nonstiff systems of ordinary
differential equations. Siam J. Sci. Stat. Comput., 4, 136-148.
Powell, M. J. D. (1970). A hybrid method for nonlinear equations. In P. Rabinowitz (Ed.), Numerical methods for nonlinear algebraic equations (pp. 84-114). Gordon & Breach.
Rehm, M., Huber, H. J., Dussmann, H., & Prehn, J. H. M. (2006). Systems analysis of effector caspase activation and its control by X-linked inhibitor of apoptosis protein. The EMBO Journal, 25, 4338-4349.
Sauro, H. M., Hucka, M., Finney, A., Wellock, C., Bolouri, H., Doyle, J., et al. (2003). Next generation simulation tools: The systems biology workbench and BioSPICE integration. Omics, 7, 355-372.


Schuster, S., & Heinrich, R. (1992). The definitions of metabolic control analysis revisited. BioSystems,
27, 1-15.
Strang, G. (1980). Linear algebra and its applications, 2nd Edition. New York: Academic Press.
Takahashi, K., Ishikawa, N., Sadamoto, Y., Sasamoto, H., Otha, S., Shiozawa, A., et al. (2003). E-Cell
2: Multi-platform E-Cell simulation system. Bioinformatics, 19, 1727-1729.
Tomita, M., Hashimoto, K., Takahashi, K., Shimizu, T. S., Matsuzaki, Y., Miyoshi F., et al. (1999).
E-CELL: Software environment for whole-cell simulations. Bioinformatics, 15, 72-84.
Walter, W. (1998). Ordinary Differential Equations. New York: Springer Verlag.
Weinberg, R. A. (1994). Oncogenes and tumor suppressor genes. CA: A Cancer Journal for Clinicians,
44, 160-170.
Weinberg, R. A. (2006). The biology of cancer. Garland Science, Taylor & Francis Group, LLC.
Wierling, C., Herwig, R., & Lehrach, H. (2007). Resources, standards and tools for systems biology. Briefings in Functional Genomics and Proteomics, 6, 240-251.

Key Terms
Bcl2: The prototype of a family of mammalian genes and the proteins coded by these genes. The name is derived from B-cell lymphoma 2.
C++: The programming language C++ was developed in 1979 by Bjarne Stroustrup. It supports object-oriented programming and is an enhancement of the programming language C.
DAE: A differential-algebraic equation (DAE) is a special kind of differential equation expressed by means of differential algebra. Such an equation does not necessarily include all dependent variables, and their derivatives need not be expressed explicitly.
Deterministic: An algorithm is called deterministic when the same results are always obtained under
identical conditions. In this respect, deterministic mathematical modeling describes systems without
any random possibility function. Identical starting conditions always provide the same output.
Eigenvalues: In linear algebra the equation Av = \lambda v defines the eigenvalues \lambda and the corresponding eigenvectors v of a linear transformation represented by a square matrix A. \lambda is a complex value and v is a complex-valued vector. The eigenvalues describe essential characteristics of the linear map.
Enzyme: Protein molecules that catalyze chemical reactions. They play an important role in most
metabolic processes and are responsible for activation and control of biochemical reactions in living
systems.
Euclidean Norm: The Euclidean space denotes an n-dimensional space that can be characterized
through Euclidean geometry. An n-dimensional vector describes a point in this space. The Euclidean
norm defines the length of a vector and the distance function between two vectors is called an Euclidean
metric.

95

Deterministic Modeling in Medicine

Experimental Data: Gene expression profiles derived from microarray-based experiments with different study foci, e.g. prostate cancer, compound testing or diabetes.
Fortran: The name of a programming language derived from FORmula TRANSlation. Fortran
was developed in the 1950s and was especially used for numerical programming. The programming
language directly supports numerical operations and because of this, optimized compiler calculations
can be performed. Up to now, various numerical programming libraries are available and used in mathematical, physical and chemical science.
Kinetics: In this chapter the term kinetic is used in a chemical sense and describes the principles
of reaction velocities.
Linearization: In mathematics, linearization describes the linear approximation of a function at a
given point.
Object-Oriented: Object-oriented programming uses abstract objects and their interactions to describe the contents and the functionalities of the program, according to its design.
ODE: An ordinary differential equation is an equation that contains a function and the derivatives of this function. It differs from partial differential equations in that, in an ordinary differential equation, the included function depends on only one variable.
Protein Complex: A group of two or more chemically bound proteins formed by stable protein-protein interactions.
Python: The high-level programming script language Python supports functional, object-oriented
and imperative programming paradigms.
Stoichiometry: The quantitative relationship between the reactants and the products in a chemical reaction is described by the stoichiometry. It can be used to calculate the quantitative amount of the product or the educt of a reaction when one of the two measurable quantities is known.
Topology: Topology defines the properties of spaces and maps. Depending on the described space, arbitrarily complex structures can be used.


Chapter V

Synthetic Biology as a Proof of Systems Biology
Andrew Kuznetsov
Freiburg University, Germany

Abstract
Biologists have used a reductionist approach to investigate the essence of life. In the last years, scientific
disciplines have merged with the aim of studying life on a global scale in terms of molecules and their
interactions. Based on high-throughput measurements, Systems Biology adopts mathematical modeling and computational simulation to reconstruct natural biological systems. Synthetic Biology seeks to
engineer artificial biological systems starting from standard molecular compounds coded in DNA. Can Systems and Synthetic Biology be combined with the idea of creating a new science, SYS Biology, that
will not demarcate natural and artificial realities? What will this approach bring to medicine?

We live in a society exquisitely dependent on science and technology,


in which hardly anyone knows anything about science and technology.
- Carl Sagan

Introduction
Sometimes, we are like the three blind Indian philosophers who tried to guess what kind of animal the
elephant was by touching various parts of it. One blind man, while touching the side of the elephant,
announced that the animal was like a wall. The second philosopher hugged its leg and declared that the
animal was like a tree, and the third blind man, while holding on to its tail, said the animal was a snake.
All three were
correct, but all three had a distorted perspective of an elephant. This allegory captures a weakness of the
analytical reductionist approach to biological science and illustrates a paradigm that the whole is greater
than the sum of its parts. A systemic, holistic approach to Biology means the synthesis of knowledge from
various sources and by different methods of data extraction. This approach starts with data collection
and modeling to understand how components of the system interact, continues with experimentation
and then returns to modeling to refine our understanding of interactions and to identify new questions
to be addressed. This system of thinking emphasizes relationships rather than isolated entities.
The idea of a system-level understanding of Biology is not new. In 1944, Erwin Schrödinger published
the book What is Life?, a seminal work on scientific thought that examined the relationship between
the laws of Physics and the mechanisms of life. In particular, it provoked the development of Molecular
Biology and led to the research we know as Systems Biology. Norbert Wiener (1948) and Ludwig von
Bertalanffy (1969) described a systems approach to living organisms, i.e. the holistic view that the
mysterious properties of life arise at the system level from the dynamical interactions and diversity of system
components. Breakthroughs in Molecular Biology during the last decades have enabled an analysis of
dynamical interactions inside living cells and between them. Systems Biology appeared as a result of
the Human Genome Project as well as from a growing understanding of how genes and their proteins
give rise to biological forms and functions. Recent studies have involved high-throughput experiments
in Genomics, Transcriptomics, Proteomics and Metabolomics. These -omics should be fused together
to reach an understanding of Biology at a top system-level (Kitano, 2002a). The new field has attracted
biologists, engineers, mathematicians, physicists and chemists who are tackling complex biological
problems. The Internet allows researchers to distribute massive amounts of data. In particular, the theory
of dynamical systems, agent-based approach and systems engineering methods provide the opportunity
to study the collective behavior of biological entities. The challenge is to connect genetic circuits with
physiological behavior.
Following Systems Biology, the goal of Synthetic Biology is both to improve our quantitative understanding
of natural phenomena and to establish an engineering discipline to design artificial biological
systems. It will strongly depend on what possibilities there will be in the multi-scale modeling of whole
organisms. Biological models often have numerous unknown parameters such as kinetic constants,
decay rates and drift terms. A big problem for Systems/Synthetic Biology (SYS Biology for short) is
that these parameters are often very difficult to measure. However, Systems Biology researchers believe
that methods of dynamic analysis, modeling and simulation can provide a deeper understanding of life
(Kitano, 2002b). Synthetic Biology, with the goal of synthesizing life from scratch, gives us another modern
hype-and-hope, namely understanding by building. Regarding complex dynamical systems,
Richard Feynman wrote: "What I cannot create, I do not understand." By creating artificial life, we are
beginning to answer Schrödinger's question "What is Life?" This will give us new opportunities to
distinguish between health and pathology for treating, for example, schizophrenia, cancer and diabetes.

Connection Between Genotype and Phenotype

A current front of -omics research has moved from metabolic pathway analysis to the reconstruction
of regulatory networks, identification of protein/DNA, protein/RNA and protein/protein interactions,
simulations of signal transduction reactions, validation of experimental data available from high-throughput
measurements, and to studies on the correlation between gene expression and phenotype.
The relation between genotype and phenotype is a central question. Do selective forces which act on the
phenotype affect individual genes? Or, is there an epigenetic influence arising from the complex interaction
between many gene products (Wagner, 1996)? The latest findings demonstrated that metabolic
pathway assignment is useful in verifying genomic annotations. But there are many gaps in genetic
and epigenetic events, e.g. the effects of posttranscriptional and posttranslational regulation. What is
missing is a relation between high-level functional states and genetic networks. The global organization
and its effects are mostly ignored. Life has remained a great secret so far! How can adaptive and
developmental dynamics emerge from information hidden in the genome? Even in the case of very simple
circuits that are carefully designed, the phenotype can often not be predicted.
Goldstein (1999) defined emergence as the arising of novel and coherent structures, patterns and
properties during the process of self-organization in complex systems. An emergent structure is more
than a sum of its parts because the interactions between these parts play a significant role. Life is an
example of emergent properties by combinations of individual atoms that form molecules such as DNA,
RNA, proteins, carbohydrates and lipids, which in turn create organelles, cells, tissues, organs, organisms and communities. An understanding of such emergent events requires system-level perspectives.
In fact, a transition from molecules to physiology occurred in Biology and it is generally believed that
multi-scale and nonlinear feedback mechanisms can be understood through mathematical and computational models. A problem is that the theory of dynamical systems used in Physics and Chemistry might
be not enough for biological entities. Living cells are super complex agents containing a vast amount of
information that cannot be reduced to simple rules or sophisticated mathematical functions.

Data Acquisition and System-Level Understanding


Because biological systems are not just an assembly of genes and proteins, their properties cannot be
fully understood by drawing a plain connection graph. Modeling results contain many false outputs that
need to be verified by wet lab experiments. Historically, Biology has dealt with vast experimental data
as well as spatial and temporal phenomena that are more complex than in other natural sciences. These
large data sets often bring more confusion than understanding: our ability to generate data now
outstrips our ability to analyze it (Patterson, 2003). On the other hand, Proteomics is substrate limited,
and currently available experimental data on gene expression are usually not sufficient to reconstruct the
structure of pathways and regulatory networks. Furthermore, the system approach is based on information
in public databases that are often incomplete, not standardized or properly annotated, and the quality
of the data is often uncertain. A Systems Biology community needs information standards. Systems
Biology Markup Language (SBML) is an XML-based language for representing models of biochemical reaction networks which has evolved since 2000. Today, SBML is supported by over 100 software
packages, such as Cell Designer, Systems Biology Workbench (SBW), including JDesigner/Jarnac and
SBML toolbox for MATLAB users (Keating et al, 2006). The SBML website provides information and the
tools needed to understand Systems Biology (http://sbml.org/index.psp).
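
As a small illustration of how such a standard can be consumed programmatically, the sketch below uses
the python-libsbml bindings to open an SBML file and list its reactions. The package choice and the file
name example_model.xml are assumptions of this example, not part of the SBML specification or of the
text above.

```python
# A minimal sketch, assuming the python-libsbml package is installed and that a
# hypothetical SBML file called "example_model.xml" exists in the working directory.
import libsbml

reader = libsbml.SBMLReader()
document = reader.readSBML("example_model.xml")

if document.getNumErrors() > 0:
    document.printErrors()                      # report parsing/validation problems
else:
    model = document.getModel()
    print("species:  ", model.getNumSpecies())
    print("reactions:", model.getNumReactions())
    for i in range(model.getNumReactions()):
        reaction = model.getReaction(i)
        reactants = [reaction.getReactant(j).getSpecies()
                     for j in range(reaction.getNumReactants())]
        products = [reaction.getProduct(j).getSpecies()
                    for j in range(reaction.getNumProducts())]
        print(f"{reaction.getId()}: {' + '.join(reactants)} -> {' + '.join(products)}")
```

Because SBML is tool-neutral, the same document can then be handed to any of the SBML-aware packages
mentioned above.
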
One of the difficult problems that scientists deal with is how to transform a natural phenomenon into
a set of equations. It is usually impossible to describe a phenomenon totally, so one looks for a set of
equations which describes the system approximately. In general, after we have built a set of equations,
we compare the data generated by the equations with real data collected by measurement. If the two sets
of data are close, then the equations give a good description of the real-world system. If not, then
the underlying equations have to be modified. The set of equations is called a model for the system. The
problem of stating clear assumptions and describing parameters and variables on which the model will
be based is not a trivial task. One could build a so-called white box model based on first principles, but in
many cases such models would be complex and it would be impossible to capture the nature of biological
systems. A more common approach is therefore to start from measurements of the behavior of the system
and then to determine a mathematical model of the data without going into details. This approach is called
system identification. There are two types of models: the grey box model and the black box model. In the
grey box model, the structure is typically chosen by the user. The model contains unknown parameters
which can be estimated using algorithms of system identification to agree with experimental data. In the
black box model, no prior model is available. Most system identification algorithms are of this type. The
idea is to fit linear and nonlinear models to data. This data-driven approach helps to describe systems
with a complex dynamics and structural uncertainty that are not easily modeled from first principles.
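
To make the grey box idea concrete, the following sketch estimates the unknown rate constants of a small,
assumed model structure (a reversible conversion between two species) from noisy time-course data using
SciPy. The model, the parameter values and the synthetic data are all illustrative assumptions, not taken
from this chapter.

```python
# A minimal grey-box identification sketch: the model structure is fixed (A <-> B),
# only the two rate constants are unknown and are fitted to noisy measurements.
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import least_squares

def model(y, t, k1, k2):
    a, b = y
    return [-k1 * a + k2 * b, k1 * a - k2 * b]

t_obs = np.linspace(0.0, 10.0, 25)
true_k = (0.8, 0.3)                     # "unknown" parameters used only to fake data
rng = np.random.default_rng(0)
y_obs = odeint(model, [1.0, 0.0], t_obs, args=true_k) + rng.normal(scale=0.02, size=(25, 2))

def residuals(k):
    y_sim = odeint(model, [1.0, 0.0], t_obs, args=(k[0], k[1]))
    return (y_sim - y_obs).ravel()

fit = least_squares(residuals, x0=[0.1, 0.1], bounds=(0.0, np.inf))
print("estimated rate constants:", fit.x)   # should approach (0.8, 0.3)
```
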
The complexity of a model is associated with the number of its parameters and the degree of its polynomials.
Modern computers often appear to have an infinite capacity for calculation; however, even for certain
seemingly simple problems computing devices show their limits. Natural biological systems are a
product of more than 3500 million years of evolution and molecular computation; they are not optimized
to be modeled easily. Furthermore, several genome projects have shown that most genes have not yet
been characterized, especially in multicellular organisms. Fortunately, the network structures of a large set
of gene interactions and biochemical pathways are well known, and the behavior of those systems can be
understood through metabolic, sensitivity and dynamic analysis methods, including phase portraits
and bifurcation analysis.
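
As a toy illustration of what bifurcation analysis delivers, the sketch below tracks the steady states of the
generic one-variable system dx/dt = r*x - x^3 while the parameter r is varied. The model is deliberately
abstract and is not tied to any pathway discussed here.

```python
# A tiny bifurcation sketch for dx/dt = r*x - x**3 (an assumed, generic example):
# the number and stability of steady states change as the parameter r crosses zero.
import numpy as np

def steady_states(r):
    """Real roots of r*x - x**3 = 0."""
    return [0.0] if r <= 0 else [0.0, np.sqrt(r), -np.sqrt(r)]

def is_stable(x, r):
    """Linearization: a steady state is stable when d/dx (r*x - x**3) = r - 3*x**2 < 0."""
    return (r - 3.0 * x**2) < 0

for r in [-1.0, -0.1, 0.1, 1.0]:
    states = [(round(x, 3), "stable" if is_stable(x, r) else "unstable")
              for x in steady_states(r)]
    print(f"r = {r:5.2f}: {states}")
```

Below r = 0 the single steady state at x = 0 is stable; above r = 0 it becomes unstable and two new stable
branches appear, a simple pitchfork bifurcation.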

Iterative Refinement: Management of Known and Unknown Biological Knowledge

Because SYS Biology (the reverse and direct engineering of genetic networks) is extremely complex,
the most reasonable way to study and manage this complexity is iterative refinement, which combines
experimental and computational methods. A source of knowledge for Systems Biology is high-throughput
experimental methods such as microarrays and mass spectrometry. Extracting patterns and correlations
from the vast measurements and large databases is difficult; techniques to test multiple hypotheses include
cluster analysis, data mining and machine learning (Kell, 2005; Bosl, 2007). There are several approaches
to the reverse engineering of genetic regulatory networks from gene expression data. The traditional one
is a mathematical description of the biochemical processes in terms of differential equations; however, it
is restricted to small systems. The other one is cluster analysis from the theory of experiment. Although
clustering provides a simple way to extract qualitative information about gene co-expression from large
data sets, it does not lead to a distinction between individual genes. A compromise between these two
extremes is Bayesian networks. The Bayesian learning paradigm is suitable for representing conditional
relations between multiple interacting quantities, and its probabilistic nature is capable of handling the
noise inherent in biological processes and microarray experiments. However, the inference problem is hard
if the interactions between hundreds of genes are learned from small data sets. Quantitative studies of
complex biological networks are not easy to perceive (Friedman et al, 2000; Husmeier, 2003; Bansal et
al, 2007).
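
As a deliberately simplified stand-in for the network-inference methods cited above, the sketch below builds
a co-expression graph from pairwise Pearson correlations of a small synthetic expression matrix; the gene
names, data and threshold are invented for the example, and the approach is far cruder than Bayesian
network learning.

```python
# A toy co-expression "network" from synthetic microarray-like data (4 genes x 20
# samples). Only pairwise Pearson correlation is used; this illustrates the flavor
# of data-driven reverse engineering, not a real inference algorithm.
import numpy as np

rng = np.random.default_rng(1)
genes = ["geneA", "geneB", "geneC", "geneD"]        # hypothetical gene names
data = rng.normal(size=(4, 20))
data[1] = data[0] + rng.normal(scale=0.2, size=20)  # make geneA and geneB co-expressed

corr = np.corrcoef(data)
threshold = 0.8
edges = [(genes[i], genes[j], round(float(corr[i, j]), 2))
         for i in range(len(genes)) for j in range(i + 1, len(genes))
         if abs(corr[i, j]) >= threshold]
print(edges)   # expected: a single strong edge between geneA and geneB
```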


Modeling via Differential Equations


In some cases, interactions among system components can be described by classical ordinary differential equations (ODE). When spatial effects (e.g. diffusion) play a significant role in a system, partial
differential equations (PDE) can be applied. This approach has been used to explain the existence of
morphogens by Alan Turing (1952). Both ODE and PDE models are deterministic, i.e. given the same
initial conditions for each model, repeated simulations will produce the same results. Both models can
be analyzed by parametric sensitivity analysis and bifurcation analysis. Parametric sensitivity analysis
can be carried out to characterize quantitative changes of model dynamics in response to perturbations
of model parameters (Varma et al, 1999). Bifurcation analysis is used to determine how qualitative properties of a system depend on its parameters, in particular, to find the steady-state solutions of a system
and their stability (Guckenheimer, Holmes, 2002). The differential equations may not be adequate if the
number of molecules is low, because there will be fluctuations in the cellular processes. When molecular
noise affects a cellular function, it has to be captured by mathematical analysis. This can be carried out
with chemical master equations (CME). The CME model can be solved by using Gillespie's first reaction
method (Gillespie, 1976). This algorithm draws a putative firing time for each reaction, executes the fastest
individual reaction and then transitions to the next state. Gibson modified Gillespie's method so that each
simulation step scales only with the logarithm of the number of reactions (Gibson, Bruck, 2000). One can
now use a Gibson-modified Gillespie algorithm to execute 10¹⁰ reaction events per day on an 800-MHz
Pentium III processor. A hypothetical simulation of E. coli will need 10¹⁴-10¹⁶ reaction events per cell
and seems quite tractable (Endy, Brent, 2001). Similar
to the spatial extension of ODE to PDE, the same spatial effects can be included into stochastic models.
Examples of deterministic/stochastic and non-spatial/spatial simulators given by Lok and Brent (2005)
are summarized in Table 1.
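
For concreteness, the following sketch implements the first reaction idea for a toy birth-death process; the
species, rate constants and time horizon are arbitrary choices made for the example, not a model from
this chapter.

```python
# Gillespie's first reaction method, sketched for the toy system
#   reaction 0:  0 -> X   with propensity k1
#   reaction 1:  X -> 0   with propensity k2 * X
import numpy as np

def first_reaction(x0=0, k1=10.0, k2=0.5, t_end=20.0, seed=0):
    rng = np.random.default_rng(seed)
    t, x = 0.0, x0
    trajectory = [(t, x)]
    while t < t_end:
        propensities = [k1, k2 * x]
        # draw a putative firing time for every reaction ...
        taus = [rng.exponential(1.0 / a) if a > 0 else np.inf for a in propensities]
        j = int(np.argmin(taus))            # ... and execute the earliest one
        t += taus[j]
        x += 1 if j == 0 else -1            # apply the stoichiometry of reaction j
        trajectory.append((t, x))
    return trajectory

print(first_reaction()[-3:])   # X fluctuates around the steady-state value k1/k2 = 20
```

The Gibson-Bruck variant reuses the unfired putative times and keeps them in an indexed priority queue,
which is what brings the cost per event down to the logarithm of the number of reactions.
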

Data-Driven Modeling Approaches


Data-driven modeling approaches such as clustering (data organization), principal components (data
compression) and partial least squares (data prediction) help users to analyze large amounts of data by
specific data representation. Because cellular events are dynamic, time courses are the most typical.
However, time-course plots are limiting for applications that track many variables together. An alternative
view of the same data is to consider each variable on its own axis as a dimension. In this case, the state
of a system will be described as a vector. In addition, it is possible to identify a small number of more
important dimensions by reorganizing the data space. The goal of data-driven modeling is to extract
the more informative dimensions from a whole data space. The aim of clustering is to define a distance
between either the vectors themselves (Euclidean distance) or the parameters that are derived from the
vectors (Pearson distance) in a way that reveals potential biological sense. Principal components analysis
reduces the dimensionality by finding new axes, called principal components, which are linear combinations
of the original axes along which the data vary most. This method allows the user
to view the data in 3D or 2D space and retain the most important biological information. Partial least
squares analysis allows the user to pose hypotheses and make predictions. Rather than performing the regression
in the original data space, this approach reduces the dimension to a principal component space and finds
the regression between independent and dependent principal components (Janes, Yaffe, 2006).
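
The numpy sketch below shows the core of principal components analysis on a synthetic "samples x
measured variables" matrix; the data and dimensions are invented, and scikit-learn's PCA class would be
a common higher-level alternative.

```python
# Principal components by singular value decomposition of mean-centered data
# (synthetic example: 30 samples, 8 measured variables, two of them correlated).
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(30, 8))
data[:, 1] = 2.0 * data[:, 0] + rng.normal(scale=0.1, size=30)

centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)          # fraction of variance per component
scores = centered @ Vt[:2].T             # each sample projected onto PC1 and PC2

print("variance explained by PC1, PC2:", np.round(explained[:2], 2))
print("first sample in the reduced 2D space:", np.round(scores[0], 2))
```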


Table 1. Molecular dynamics and simulation of biochemical networks

                 Deterministic              Stochastic
Non-spatial      ODE: Gepasi (1)            Gillespie's first reaction algorithm (exact,
                                            approximate and hybrid approaches; special-purpose
                                            hardware such as Field Programmable Gate Arrays,
                                            FPGAs): Stochastirator (3)
Spatial          PDE: Virtual Cell (2)      Reaction events occur during collision:
                                            MCell (4), ray-tracing algorithms; ChemCell (5);
                                            Moleculizer (6), for protein complexes, the program
                                            automatically generates the network of reactions;
                                            BioNetGen (7), signaling pathways

(1) http://www.gepasi.org/
(2) http://www.nrcam.uchc.edu/login/login.html
(3) http://opnsrcbio.molsci.org/stochastirator/stoch-main.html
(4) http://www.mcell.cnl.salk.edu/
(5) http://www.cs.sandia.gov/~sjplimp/cell.html
(6) http://www.molsci.org/~lok/moleculizer/
(7) http://cellsignaling.lanl.gov/bionetgen

Agent-Based Approach: Data Mining and Simulation in a Single Platform


According to science, the world is a stratified structure with many levels. Entities on each level interact
with other entities at the same level and generate new selective qualities for a next higher level. Because
these adaptive interactions are too complex to be captured by analytical methods, computer simulation
is used. The basic idea of such simulation is to specify the rules of behavior of individual entities as well
as the rules of their interactions. Simulated entities are called agents, the simulation of their behavior is
known as agent-based simulation. Properties of individual agents describing their behavior and interactions are specified as elementary properties, and properties emerging on the higher, collective level
are defined as emergent properties. In this respect, an agent is like a computing system with a well
defined interface capable of adaptive problem-solving actions without user intervention. Agent-oriented
engineering has much in common with object-oriented programming and was used to develop software
assistants for bioinformaticians and problem solvers for biologists. Agents have been exploited
in Systems Biology for the management of primary databases, genome analysis and annotation, for
identification of spots on a microarray and 2D gels, in mass spectrometry and in biological systems
simulation such as the molecular self-organization (Troisi et al, 2005), bacterial chemotaxis (Emonet
et al, 2005) and T-cell recognition (Casal et al, 2005). An agent paradigm of distributed computing
can be considered as a generalization of cellular automata; each agent acts like a finite state machine.
Agent-to-agent interactions and message passing between agents are essential for collective behavior.
Agent-based simulation increases biological intuition and is more suitable than other modeling
methods, because the cell as an agent is a natural metaphor for biologists. A biological process can be
described by a graphical semi-formal model, validated by a formal model, simulated by a multi-agent
system and finally tested with experimental results (Merelli et al, 2007).
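
The sketch below gives a minimal flavor of such an agent-based simulation: each "cell" agent is a tiny
finite state machine that senses an attractant gradient at its own position and either runs or tumbles,
loosely imitating bacterial chemotaxis. The gradient, the rules and all numbers are illustrative assumptions,
not the AgentCell model cited above.

```python
# A minimal agent-based chemotaxis caricature on a 1D lattice: agents keep moving in
# their current direction while the attractant signal improves and pick a random new
# direction (tumble) when it gets worse. The population drifts toward the peak at 50.
import random

def attractant(x):
    return -abs(x - 50)          # highest value at position 50

class CellAgent:
    def __init__(self, position):
        self.position = position
        self.direction = random.choice([-1, 1])
        self.last_signal = attractant(position)

    def step(self):
        signal = attractant(self.position)
        if signal < self.last_signal:              # worse than before: tumble
            self.direction = random.choice([-1, 1])
        self.last_signal = signal
        self.position += self.direction            # run one step

random.seed(0)
population = [CellAgent(random.randint(0, 100)) for _ in range(50)]
for _ in range(200):
    for agent in population:
        agent.step()

print("mean position:", sum(a.position for a in population) / len(population))
```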


Synthetic Biology is an Engineering Approach to Systems Biology

The term Synthetic Biology was originally used to describe efforts merging different disciplines in order
to reach a holistic understanding of life. This term also has been applied to the combination of science
and technology in their attempt to design novel biological functions. In 1978, Waclaw Szybalski wrote
that "the work on restriction nucleases has led us into the new era of synthetic biology." Along this
line, a straightforward tactic to test our understanding of a natural system is to make a version of the
system; the similarity between observation and expectation is the highlight of a constructive paradigm.
The potential of gene synthesis was demonstrated in 1964 by the team of Gobind Khorana as a part of
their work on the elucidation of the genetic code.
Synthetic Biology means an expansion of biotechnology to an engineering formalism, with the
ultimate goal of designing and building engineered biological systems that process information, manipulate chemicals, fabricate materials and structures, produce energy, provide food, maintain and enhance
human health and our environment. An important aspect of Synthetic Biology, which distinguishes it
from conventional genetic engineering, is a strong emphasis on foundational technologies that make the
engineering of Biology easier and more reliable (Endy, 2005). Such examples are the common systems
engineering doctrines, which have been accepted by biologists, including step-by-step modeling and
experimentation trials, sharing models, but not raw data, and microarray-based DNA synthesis (Tian
et al, 2004; Sthler et al, 2006). Synthetic Biology has two branches. The first one, reconstruction of
life, is a dissection of biological systems into minimal modules. This approach includes experiments to
obtain information on isolated parts, the simulation of systems from these parts and future experimental
verification. The second one is the construction of life. In this case, the goal is to build systems inspired
by biological knowledge. The general idea here is to combine modular, robust and reusable artificial
components to reach even more interesting properties about life that are yet unknown.
Because natural biological systems are extremely complex, it makes sense to rewrite them according to
our view and deal with the artificial systems. This notion draws its inspiration from refactoring, a
process used to improve computer software. DNA synthesis, as expected, will replace traditional
recombinant technology and will allow DNA programming, for instance, large-scale genome
synthesis, the change of codons to improve gene expression, and the incorporation of novel amino acids
into polypeptides. The construction of synthetic metabolic pathways would circumvent the complexity of life
created by evolution. Through synthesis, we could escape cross-talk and reduce the entanglement
of signaling pathways.
More remarkable is that Synthetic Biology has naturally accepted the agent and amorphous computing paradigms (Abelson et al, 2000). It is a way of thinking about existing biological machines and of
constructing new ones (Baker et al, 2006). The latest success in the modification of cellular processes
and improvements in DNA synthesis technology led to the recognition of living cells as a programmable
matter. Through genetic engineering and principles of abstraction, composition and specification, it
is now possible to program bacterial cells with sensors, logics and actuators like computers or robots
(Voigt, 2006). Information propagates in cells through synthetic transcriptional cascades. The artificial
cell-to-cell communication was programmed to reach a coordinated bacterial behavior and pattern
formation that allowed observing how global behavior emerges from local interactions between cells.
There are attempts to exploit synthetic gene networks representing analog and digital logic circuitry
that regulate gene expression, differentiation and cell communication in higher organisms (Andrianantoandro et al, 2006).

Synthetic Biologists' Toolkit: The Plug-and-Play Design of Genetic Circuits

A design of de novo genetic circuits starts by the identification of suitable genetic elements (building
blocks) including promoter regions, ribosome binding sites, structural genes, attenuators of transcription
and so on. There are active efforts to collect genetic elements as Biobricks: pieces of DNA flanked
by idempotent restriction sites. A registry of Standard Biological Parts (http://parts.mit.edu/) includes
the catalog of standard DNA components that can be distributed and shared among students. One idea
behind the registry is that these genetic parts can be recombined to produce many types of devices in
an abstract hierarchical manner. At the next level, devices with a matching interface can be assembled
into more complex systems. This aim could be achieved by the idempotent vector that recreates exactly
the same restriction site and contains DNA parts with the desired functions. The strategy includes genetic transformation by the designed genetic construct and observation of the actual result compared
with the desired output. Well characterized Biobricks will not only serve as building blocks but will
provide kinetic parameters for simulation. In the future, as DNA synthesis becomes less expensive,
the registry will be replaced by specifications of in silico building blocks, a LEGO set of life for
computer-aided design (CAD). Many kinds of circuits like transcriptional cascades with feedback and
feedforward motifs may be constructed by combinations of existing genetic parts. These networks,
for example, can serve as memory elements, perform digital computation, send and receive chemical
signals, attenuate gene expression noise, or generate oscillations and single pulses (Guido et al, 2006;
Heinemann, Panke, 2006). However, many questions remain (Drubin et al, 2007). How would higher
levels of complexity for synthetic biological systems be designed? What kind of standards do we need?
Who will contribute subsystems?
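
The abstract plug-and-play property can be mimicked in a few lines: composing two standardized parts
yields another object of the same standardized kind, so composition can be repeated hierarchically. The
part names, sequences and the junction string in the sketch are placeholders and do not reproduce the
actual Biobrick prefix/suffix chemistry.

```python
# An abstract sketch of idempotent, hierarchical part composition. SCAR stands in
# for the short junction sequence left between joined parts; all sequences are fake.
from dataclasses import dataclass

SCAR = "nnnnnn"

@dataclass
class Part:
    name: str
    sequence: str

def compose(upstream: Part, downstream: Part) -> Part:
    """Joining two standardized parts produces a new standardized part."""
    return Part(name=f"{upstream.name}+{downstream.name}",
                sequence=upstream.sequence + SCAR + downstream.sequence)

promoter = Part("pExample", "ttgaca" + "n" * 17 + "tataat")   # hypothetical promoter
rbs = Part("rbsExample", "aggagg")                            # hypothetical RBS
reporter = Part("gfpExample", "atg" + "n" * 30 + "taa")       # hypothetical reporter

device = compose(compose(promoter, rbs), reporter)   # devices are composed like parts
print(device.name, len(device.sequence))
```
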
Modern cloning techniques are not restricted to the MIT Biobricks assembly method (Knight, 2003).
Typically, PCR with extended primers keeps any flanking restriction site for the cloning DNA which can
be inserted into a desirable vector. There are many other cloning strategies each with advantages and
disadvantages e.g. the NOMAD technology (Rebatchouk et al, 1996), Gateway recombination cloning
system (Brasch et al, 2004), and Topo cloning (Invitrogen). Many commercial firms like Invitrogen
and Stratagene have advanced their own schemes for cloning, including equipment and competent cells
for easy transformation. A real problem for cloning techniques is the small capacity of cloning vectors
which is about 10-12 kb for the E. coli system. Using an artificial chromosome may be necessary in order
to exceed this limit. In addition to Biobricks registry, there is significant interest in filling the databases
for aptamers and protein domains. Perhaps Synthetic Biology will not be restricted by one catalogue
of known biological parts but will include the tools needed to supplement custom-made parts and to
utilize natural sequences from NCBI (http://www.ncbi.nlm.nih.gov/).
When a circuit is built, it can be evaluated by quantitative analyses including assays for the reporter
gene expression, cell growth, light microscopy, flow cytometry and microfluidics. Single cell measurement
has been used to study binding constants, degradation rates, noise in gene expression and variations
in chemotaxis activity. Such observations help to explain how a biological system acts and provide
the basis for modeling. Differences between predicted and measured behavior can give details why a
synthetic system does not work as anticipated. In some cases, engineered cells have to be characterized
under specific conditions over a long time scale. Such tasks will benefit from microfluidics focusing on
the miniaturization of the cell environment in order to make single-cell assessment possible. Technologies
that allow the study of cells in microfabricated devices and perform parallel postgenomic data collection will be especially useful in Synthetic Biology. Standardization of measurements will ultimately
enable CAD applications (Marguet et al, 2007).

Evolution as a Technology for Synthetic Biology: An Ersatz Life

The manufacture of biological parts and devices can be achieved through either rational design or
laboratory evolution. Usually, preliminary design includes the modeling, testing, and modification of basic
circuit components such as promoters and ribosomal binding sites. Rational design requires prior
information about the parts, which is sometimes unavailable. The opposite approach, laboratory evolution,
is a method of debugging synthetic biological systems (Yokobayashi et al, 2002). The process starts with
the generation of a huge library of diverse DNA molecules by error-prone PCR and DNA shuffling or
combinatorial synthesis. The library is then subjected to a selection that provides a connection between
genotype and phenotype in order to reach the preferred function. Laboratory evolution requires
high-throughput measurements covering a large search space. It should be mentioned that the fitness function
is not evident in many applications, and the selection pressure can be applied only in simple cases such as
antibiotic resistance. An evolution of complex behavior may not be straightforward. It is not clear at the
moment how such laboratory evolution can easily be tailored to screen for a context-dependent design
of complex genetic networks. There are questions that should be addressed about the evolutionary stability
of artificial networks. Both methods could be considered as different approaches to optimization and
should be used in combination; an example of an intermediate method is modular recombination
or compositional evolution (Hiraga, Arnold, 2003).
The next step in Synthetic Biology research would be a computational evolution where a set of parts
can evolve in silico into networks that demonstrate desired behaviors. Whether such molecular evolution
is tractable for digital computers is an open question. As an alternative, the cellular computation could
be designed to take advantage of the vast parallelism of cell agents. In the context of cellular computing, a heterogeneous population of cells is able to investigate a very large search space (Kuznetsov et
al, 2006; Tan et al, 2007).
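
A toy version of such an in silico evolution is easy to write down: candidate sequences are mutated, scored
by a made-up fitness function and the best ones are kept. The target sequence, mutation rate and population
size below are arbitrary assumptions used only to show the loop.

```python
# A minimal mutate-and-select loop (a "computational evolution" caricature).
import random

ALPHABET = "ACGT"
TARGET = "ATGACGTACGTT"                     # hypothetical desired sequence

def fitness(seq):
    return sum(a == b for a, b in zip(seq, TARGET))

def mutate(seq, rate=0.1):
    return "".join(random.choice(ALPHABET) if random.random() < rate else base
                   for base in seq)

random.seed(0)
population = ["".join(random.choice(ALPHABET) for _ in TARGET) for _ in range(50)]
for generation in range(40):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                                         # selection
    population = [mutate(random.choice(parents)) for _ in range(50)]  # variation

best = max(population, key=fitness)
print(best, f"{fitness(best)}/{len(TARGET)} positions match the target")
```
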
Basically, Systems Biology might contribute to producing artificial cells demonstrating behaviors
which we associate with life (Barrett et al, 2006). In modern terms, DNA can be considered as software
and the rest of the cell as hardware. Technically, it will soon be possible to synthesize a whole bacterial
genome de novo and introduce it into a lipid sac to generate a protocell. This life-like biological
agent will encapsulate macromolecules, capture energy, maintain ion gradients and divide (Forster,
Church, 2007). However, the modern approach to artificial life is more pragmatic with the intention of
cleaning genomes. Recent achievements in Comparative Genomics have revealed the essential genes
that maintain replication, transcription, translation, DNA repair, protein processing and metabolism.
It is possible to cut out mobile elements and unknown genes with unclear functions to make the
genome easier to customize. Progress has been made towards the identification of non-essential genes
in the genome of Escherichia coli. Kang et al (2004) removed redundant genes by knockout to come close
to a core genome. It was observed that E. coli strains with a reduced genome were genetically more
stable, showed increased protein synthesis and even improved electroporation efficiency. A resulting bacterium
with the minimal genome may serve as a vessel for synthetic genetic networks. Mesoplasma florum L1
with very attractive cultivation properties has the genome size of 793 kb comprising only 517 genes.
The genomic sequence has become available and molecular biology methods are being developed. A similar
approach is being followed with Mycoplasma genitalium G-37 (580 kb, 482 genes), for which comprehensive data are available (Glass et al, 2006). These findings show that a redesign of an entire synthetic
microorganism is not far off in the future.

Minimal Synthetic Organism as Therapeutic Nanobot

Artificial bacterial robots are expected to appear in the coming years. Applied to medicine, nanorobots
would be programmed for a specific biological task and injected into the blood to work at the cellular
level, attacking cancer cells (Patel et al, 2006). To effectively fight cancer, therapeutic nanobots
should recognize malignant cells and destroy them. Anticancer agents should avoid immunological
surveillance and demonstrate decentralized, distributed swarm intelligence and emergent cooperative
behavior. This means that they will have to possess sensors, computational resources to make decisions
and actuators to act on target tumor cells.
Nearly 150 years ago, live bacteria were used in the treatment of cancer. Today there is revived interest
in using bacteria as anti-tumor agents. Bifidobacterium, Clostridium and Salmonella have been shown
to preferentially reproduce within solid tumors and have been used to transport DNA encoded toxins,
prodrug-activating enzymes, cytokines and angiogenesis inhibitors (Pawelek et al, 2003). Bacteria can
feel their environment and distinguish between cell types. The sensors, logics and actuators could be
embodied in genetic circuits of bacteria to perform coordinated tasks for therapy. Anderson et al. (2006)
engineered the non-pathogenic bacterium E. coli to invade mammalian cells selectively at high cell
density or under anaerobic conditions (Figure 1).

Figure 1. Condition-dependent invasion. At a low cell density or under normal aerobic growth conditions,
the engineered bacteria are noninvasive. Above a critical cell density or in a hypoxic environment, sensors
are activated, resulting in the synthesis of invasin from Y. pseudotuberculosis that leads to the invasion
of HeLa cells (Anderson et al, 2006).

Previous microarray analysis identified several genes in E. coli
whose expression is strongly induced after the shift from aerobic to anaerobic growth. The formate
dehydrogenase gene (fdhF) is one of the most strongly induced. Therefore, the fdhF promoter was chosen
to activate the invasion after the transition to an anaerobic environment. The inv gene encoding invasin
from Yersinia pseudotuberculosis was cloned in E. coli under the fdhF promoter (Figure 2a). Typically, the
invasin binds to β1-integrins presented on the surface of eukaryotic cells and induces bacterial uptake
by mammalian cells. If the fdhF promoter is active, then invasin is expressed in E. coli, initiating adhesion
and invasion (Figure 2b). The inv gene provoked invasion of cervical carcinoma, hepatocarcinoma
and osteosarcoma. The engineered bacteria invaded tumor cells either under anaerobic conditions or
when their density was high (mechanism not shown). Modification of the gene circuits by coupling
hypoxia-sensing and quorum-sensing into an AND gate may improve the specificity of bacterial invasion
of tumor cells. The authors wrote that "genetic logic circuits or response regulatory networks could
integrate multiple inputs to achieve more accurate environmental sensing."
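
The proposed improvement can be pictured as simple Boolean logic: invasin should only be produced when
both the hypoxia sensor and the quorum (cell-density) sensor are active. The thresholds and function names
below are illustrative assumptions, not the actual construct of Anderson et al. (2006).

```python
# A schematic AND gate combining a hypoxia sensor and a quorum sensor.
def hypoxia_sensor(oxygen_level, threshold=0.05):
    """Active in a hypoxic environment (threshold is an arbitrary example value)."""
    return oxygen_level < threshold

def quorum_sensor(ahl_concentration, threshold=1.0):
    """Active above a critical cell density, proxied here by AHL concentration."""
    return ahl_concentration > threshold

def invasin_expressed(oxygen_level, ahl_concentration):
    # AND gate: both inputs must be active before the inv gene is switched on
    return hypoxia_sensor(oxygen_level) and quorum_sensor(ahl_concentration)

for o2, ahl in [(0.20, 0.1), (0.02, 0.1), (0.20, 5.0), (0.02, 5.0)]:
    print(f"O2={o2:4.2f}  AHL={ahl:3.1f}  invasin={invasin_expressed(o2, ahl)}")
```
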
In general, cancer research has focused on the identification of molecular differences between
malignant and healthy cells. Although there have been attempts to summarize experimental results
through mathematical models, the gaps between in vitro cell lines, animal tumor models, and human
in vivo tumors are still very large. The picture is very complex and will continue to be so; about 300
genes have been identified in tumorigenesis. Molecules from many parallel signal transduction pathways
are involved. Their activities appear to be controlled by multiple factors (Khalil, Hill, 2005; Hornberg
et al, 2006). However, there is great hope that the measured in vivo data can be used for personalized
medicine. Vectors on the basis of retroviruses, adenoviruses and lentiviruses have been attempted to
deliver genes into human cells. Still numerous problems exist such as ensuring those viruses will infect
the correct target cells. In the new light of collective artificial intelligence and bacterial programming,
it seems that viruses are too primitive to be applied as smart therapeutic agents to perform
a swarm algorithm. They simply do not have enough computing power to make intelligent decisions
about cell type invasion. However, the experiences of viruses in Molecular Biology can be extremely
useful in the construction of a bug with diagnostic and therapeutic modules in one device. A genome
synthesis will make it possible to fabricate this minimal cell.

Figure 2. Design and results of anaerobic invasion. (a) The fdhF promoter was fused with the inv gene.
(b) Bacteria recovered from cells after invasion, with no plasmid (wt) and with the FdhInv construct, grown
with no induction (white) and with anaerobic induction (black). Assays in which no bacteria were recovered
are indicated by an asterisk (Anderson et al, 2006).

This way of managing the cancer system phenomenon has one advantage: synthetic genomes will be
supplied with manuals and documentation.
Cell chassis will create a context for installing new genetic modules and become an ideal therapeutic
nanobot. These autonomous agents will perform logic operations to make a collective verdict. The
complex biological system engineered to perform a task should be the cell, and in fact not just one cell,
but a population of cells (Weiss, 2007).

Construction of Multicellular Systems, Genetically Modified Organisms (GMO)

A human body is a complex multi-agent system of 10¹⁴ highly specialized cells organized in tissues
and organs. Individual cells take their origin from a single fertilized egg. This process of differentiation
and morphogenesis is organized by an accompaniment of chemical gradients, cellular receptors,
differential gene expression and cell migration (Gilbert, 2000).

Figure 3. An E. coli receiver cell can shine in green only at an intermediate distance from the sender cell.
Circuit operation for a sender and three receivers, exposed to high, medium, and low AHL concentrations,
showing the relevant protein activities in cells at different distances from the sender as mediated through
transcriptional regulation (orange, constitutively expressed response proteins; blue/green, expression of
regulated proteins; green and red arrows, transcriptional induction and repression, respectively). High
levels of LacI or LacIM1 are required to repress GFP (Basu et al, 2005).

Typically, coordinated cell behavior
involves cell-to-cell communication and intracellular signal processing. Synthetic Biology has been
used to exploit and improve by design the pattern formation mechanisms found in Nature. To address
this question, Basu et al (2005) created artificial patterns from bacterial cells. Authors demonstrated a
synthetic multicellular system in which genetically engineered receiver cells are programmed to form
ring-like patterns of differentiation based on chemical gradients of an acyl-homoserine lactone (AHL)
signal that is synthesized by sender cells (Figures 3, 4).
An AHL communication signal from sender cells is initiated by expression of the LuxI enzyme.
LuxI catalyses the synthesis of AHL which diffuses through the cell membrane and forms a chemical
gradient around the senders. AHL diffuses into nearby receiver cells and binds to LuxR, an AHL-dependent transcriptional regulator which activates the expression of lambda repressor (CI) and Lac repressor
(LacIM1, a product of a codon-modified lacI). Receiver cells in close proximity to the senders receive
high concentrations of AHL resulting in high cytoplasmic levels of CI and LacIM1 and repression of the
green fluorescent protein (GFP). Receivers that are far from the senders have low AHL concentrations,
and accordingly LacIM1 and CI are expressed only at basal levels. This enables the expression of
wild-type LacI, again resulting in GFP repression. At intermediate distances from the senders, intermediate
AHL concentrations result in moderate levels of CI and LacIM1. However, because the repression efficiency of CI is significantly higher than that of LacIM1, CI effectively shuts off LacI expression while
the LacIM1 concentration is below the threshold required to repress GFP production. This difference
between the CI and LacIM1 repression efficiencies, in combination with a feed-forward loop that begins
with LuxR and culminates in GFP, affords the circuit the desired non-monotonic response to AHL
dosages (Basu et al, 2005).
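
The non-monotonic behavior can be caricatured as a band detector: GFP is on only for intermediate AHL
levels, because high AHL switches on LacIM1 repression and low AHL leaves wild-type LacI repression in
place. The thresholds below are arbitrary illustrative values, not the measured parameters of Basu et al.
(2005).

```python
# A qualitative band-detector sketch: GFP is expressed only at intermediate AHL.
def gfp_on(ahl, low_threshold=0.5, high_threshold=10.0):
    laci_wt_represses = ahl < low_threshold    # too little CI to shut off wild-type LacI
    lacim1_represses = ahl > high_threshold    # enough LacIM1 accumulates to repress GFP
    return not (laci_wt_represses or lacim1_represses)

for ahl in [0.1, 0.5, 2.0, 5.0, 20.0]:
    print(f"AHL = {ahl:5.1f}  ->  GFP {'on' if gfp_on(ahl) else 'off'}")
```
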
The authors have shown how a population of bacterial cells can sense a chemical gradient and form
three distinct regions. The space design parameters were determined by modeling and experimentation.
Ron Weiss expects it will be possible to extend the artificial rules of pattern formation within bacteria to
coordinate a collective behavior of eukaryotic cells.

Figure 4. Formation of various patterns. (a) Simulation with two senders that results in the formation
of an ellipse. (b-d) Experiments showing various GFP patterns formed as a result of the placement and
initial concentrations of sender cells: (b) ellipse, two sender disks; (c) heart, three sender disks; and (d)
clover, four sender disks (Basu et al, 2005).

The integration of such systems into higher-level
organisms will have practical applications in three-dimensional tissue engineering, biosensing and
biomaterial fabrication. The cell behavior during development of multicellular organisms correlates
with the spatiotemporal pattern of gene expression. Synthetic gene networks, artificial cell-to-cell communication, and spatiotemporal dynamics of yeast and mouse embryonic stem cells are under active
investigation with the ultimate goal of organ engineering (Weiss, 2007).
The most interesting objectives are not just genetic circuits but circuits embedded into host cells.
Natural genetic circuits are not optimized for the operation within a foreign cellular context. In fact,
the same genetic circuit can have various behaviors in different strains. Theoretically, genetic circuits
should be disconnected from the cellular context. This is exactly the same old problem that Genetic
Engineering had with GMOs. Transgenesis has been used for decades to study molecular mechanisms
of basic cellular processes as well as to create organisms that produce therapeutic proteins. But these
efforts were mostly unsuccessful because of the extreme complexity of biological systems. The uncertainty and context dependence of installed genetic programs as well as mutations and long-term instability are well known obstacles to Genetic Engineering (Kuznetsov, 1995; Cases, de Lorenzo, 2005). It
is now apparent: genetic circuits function best if the number of interactions between the circuits and
the host cell are minimized. For Synthetic Biology, this minimization means engineering a simple
organism with a minimal set of genes by synthesizing its genome. But the critical question of whether
a synthetic minimal genome would be a comfortable niche for selfish genetic elements present in the
natural environment is open. To avoid this problem, the development of a completely artificial genetic
code, as well as the search for alternative energy sources and unusual substrates totally separating the
therapeutic agents (an orthogonal life), would be a wiser decision.

Conclusion
At the same time that high-throughput technologies have accelerated Genome Analysis and Proteomics,
Functional Genomics has remained a bottleneck, since it has been based on the slow cloning of
recombinant DNA. Paradoxically, the complexity of Biology was the basis for the development of Systems
and Synthetic Biology. One important goal of SYS Biology is to understand life processes in detail in
order to predict their behavior. If we wish to create a biological system that behaves in a predictable
way, how could that system be organized? If we want to get a bacterium producing artemisinic acid to
combat malaria, how should the metabolic network of the bacterium be rewritten? Or, if our dream is to
produce a smart drug against HIV, what is the best swarm algorithm in vivo? SYS Biology is becoming
a top field with a great potential to revolutionize Biotechnology. Nevertheless, Systems and Synthetic
Biology are in their initial stages and attempting to define themselves, and we are at a singular point
in our understanding of what these new technologies will contribute to Biology and Medicine.
In addition to numerous scientific challenges, there are also questions regarding strict safety and
health regulations. Attention has only been given to dual-use technologies. Recent efforts have focused
on the relations of scientists and the media. Ethical issues have just started to be explored. Many questions
have arisen (Rai, Boyle, 2007), e.g. what is the best financial mechanism to support this research?
What is more suitable, open source or copyright? Is academic freedom important? Similarly, the basic
philosophical ideas of constructive biology have been formulated (Zoloth, 2006): 1) Genes are selfish.
2) Dignity is the intactness of being. 3) Nature is fixed. 4) Nature is normative. 5) Suffering is what
defines the human condition. 6) Slopes are slippery. 7) Dual-use is inevitable. 8) Mistakes are inevitable.
9) We will be like gods. 10) The marketplace will distort science. 11) An unfair world (inequity). 12) A
synthetic world, i.e. could we call it a "carefree" Biology?
The paradigm shift to a systems/synthetic science and technology can only be done through the
collaboration of different disciplines: Mathematics, Information Science, Physics, Chemistry, Biology,
Medicine, Micro- and Nano-Engineering. Genetics, Nanotechnology, Robotics and their fusion
(GNR) are considered the key technologies for the 21st century. Radical discoveries and novel applications, such as the alteration of biomolecules and programming molecular interactions leading to an
assembly into sophisticated artificial systems are expected from this new union.

Acknowledgment
The author is very grateful to the Systems Biology and Synthetic Biology scientific communities for
materials that have been taken from open sources such as PubMed, the SBML portal, OpenWetWare and
Wikipedia. Thanks also to Bert Schnell for his help with the manuscript, for which he declares no
financial interest.

References
Abelson, H., Allen, D., Coore, D., Hanson, C., Homsy, G., Knight, T.F., Jr., Nagpal, R., Rauch, E., Sussman, G.J., & Weiss, R. (2000). Amorphous Computing. Communications of the ACM, 43(5), 74-82.
Anderson, J.C., Clarke, E.J., Arkin, A.P., & Voigt, C.A. (2006). Environmentally controlled invasion of
cancer cells by engineered bacteria. Journal of Molecular Biology, 355(4), 619-627.
Andrianantoandro, E., Basu, S., Karig, D.K., & Weiss, R. (2006). Synthetic biology: New engineering
rules for an emerging discipline. Molecular Systems Biology, 2(28), 1-14.
Baker, D., Church, G., Collins, J., Endy, D., Jacobson, J., Keasling, J., Modrich, P., Smolke, C., & Weiss,
R. (2006). Engineering life: Building a fab for biology. Scientific American, 294(6), 44-51.
Bansal, M., Belcastro, V., Ambesi-Impiombato, A., di Bernardo, D. (2007). How to infer gene networks
from expression profiles. Molecular Systems Biology, 3(78), 1-10.
Barrett, C.L., Kim, T.Y., Kim, H.U., Palsson, B.O., Lee, S.Y. (2006). Systems biology as a foundation
for genome-scale synthetic biology. Current Opinion in Biotechnology, 17(5) 488-492.
Basu, S., Gerchman, Y., Collins, C.H., Arnold, F.H., & Weiss, R. (2005). A synthetic multicellular system
for programmed pattern formation. Nature, 434, 1130-1134.
Bertalanffy, L. (1969). General system theory: Foundations, development, applications. New York:
George Braziller Inc.
Bosl, W.J. (2007). Systems biology by the rules: Hybrid intelligent systems for pathway modeling and
discovery. BMC Systems Biology, 1(13), 1-25.

Brasch, M.A., Hartley, J.L., Vidal, M. (2004). ORFeome cloning and systems biology: Standardized
mass production of the parts from the parts-list. Genome Research, 14(10B), 2001-2009.
Casal, A., Sumen, C., Reddy, T.E., Alber, M.S., & Lee, P.P. (2005). Agent-based modeling of the context
dependency in T cell recognition. Journal of Theoretical Biology, 236(4), 376-391.
Cases, I., & de Lorenzo, V. (2005). Genetically modified organisms for the environment: Stories of success and failure and what we have learned from them. International Microbiology, 8, 213-222.
Drubin, D.A., Way, J.C., & Silver, P.A. (2007). Designing biological systems. Genes & Development,
21(3), 242-254.
Emonet, T., Macal, C.M., North, M.J., Wickersham, C.E., & Cluzel, P. (2005). AgentCell: A digital
single-cell assay for bacterial chemotaxis. Bioinformatics, 21(11), 2714-2721.
Endy, D. (2005). Foundations for engineering biology. Nature, 438, 449-453.
Endy, D., & Brent, R. (2001). Modelling cellular behaviour. Nature, 409, 391-395.
Forster, A.C., & Church, G.M. (2007). Synthetic biology projects in vitro. Genome Research, 17(1),
1-6.
Friedman, N., Linial, M., Nachman, I., & Pe'er, D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4), 601-620.
Gibson, M.A., Bruck, J. (2000). Efficient exact stochastic simulation of chemical systems with many
species and many channels. Journal of Physical Chemistry, 104(9), 1876-1889.
Gilbert, S.F. (2003). Developmental biology. Sunderland, MA: Sinauer Associates, Inc.
Gillespie, D.T. (1976). A general method for numerically simulating the stochastic time evolution of
coupled chemical reactions. Journal of Computational Physics, 22, 403-434.
Glass, J.I., Assad-Garcia, N., Alperovich, N., Yooseph, S., Lewis, M.R., Maruf, M., Hutchison, C.A. 3rd,
Smith, H.O., Venter, J.C. (2006). Essential genes of a minimal bacterium. Proceedings of the National
Academy of Sciences of the United States of America, 103(2), 425-430.
Goldstein, J. (1999). Emergence as a construct: History and issues. Emergence: A Journal of Complexity
Issues in Organization and Management, 1, 49-72.
Guckenheimer, J., & Holmes, P. (2002). Nonlinear oscillations, dynamical systems, and bifurcations of
vector fields, 7th Edition. New York: Springer.
Guido, N.J., Wang, X., Adalsteinsson, D., McMillen, D., Hasty, J., Cantor, C.R., Elston, T.C., & Collins,
J.J. (2006). A bottom-up approach to gene regulation. Nature, 439, 856-860.
Heinemann, M., & Panke, S. (2006). Synthetic biology: Putting engineering into biology. Bioinformatics, 22(22), 2790-2799.
Hiraga, K., & Arnold, F.H. (2003). General method for sequence-independent site-directed chimeragenesis. Journal of Molecular Biology, 330(2), 287-296.

Hornberg, J.J., Bruggeman, F.J., Westerhoff, H.V., & Lankelma, J. (2006). Cancer: A Systems Biology
disease. Biosystems, 83(2-3), 81-90.
Husmeier, D. (2003). Reverse engineering of genetic networks with Bayesian networks. Biochemical
Society Transactions, 31(6), 1516-1518.
Janes, K.A., & Yaffe, M.B. (2006). Data-driven modelling of signal-transduction networks. Nature
Reviews Molecular Cell Biology, 11, 820-828.
Kang, Y., Durfee, T., Glasner, J.D., Qiu, Y., Frisch, D., Winterberg, K.M., & Blattner, F.R. (2004). Systematic mutagenesis of the Escherichia coli genome. Journal of Bacteriology, 186(15), 4921-4930.
Keating, S.M., Bornstein, B.J., Finney, A., & Hucka, M. (2006). SBML Toolbox: An SBML toolbox for
MATLAB users. Bioinformatics, 22(10), 1275-1277.
Kell, D. (2005). Metabolomics, machine learning and modelling: towards an understanding of the language of cells. Biochemical Society Transactions, 33(3), 520-524.
Khalil, I.G., & Hill, C. (2005). Systems biology for cancer. Current Opinion in Oncology, 17(1), 44-48.
Kitano, H. (2002a). Systems Biology: A brief overview. Science, 295, 1662-1664.
Kitano, H. (2002b). Computational Systems Biology. Nature, 420, 206-210.
Knight, T. (2003). Idempotent vector design for standard assembly of biobricks. MIT Artificial Intelligence Laboratory, MIT Synthetic Biology Working Group. Retrieved from http://dspace.mit.edu/
bitstream/1721.1/21168/1/biobricks.pdf
Kuznetsov, A., Schmitz, M., & Mueller, K. (2006, July 26-28). On Bio-Design of Argo-Machine. Presented at the 7th German Workshop on Artificial Life (GWAL-7) (pp. 125-133), Jena, Germany.
Kuznetsov, A.V. (1995). Transgenic animals: The methods of construction. Russian Biotechnology, 3-4,
3-6.
Lok, L., & Brent, R. (2005). Automatic generation of cellular reaction networks with Moleculizer 1.0.
Nature Biotechnology, 23(1), 131-136.
Marguet, P., Balagadde, F., Tan, C., & You, L. (2007). Biology by design: Reduction and synthesis of
cellular components and behaviour. Journal of Royal Society Interface, 4(15), 607-623.
Merelli, E., Armano, G., Cannata, N., Corradini, F., d'Inverno, M., Doms, A., Lord, P., Martin, A.,
Milanesi, L., Moller, S., Schroeder, M., & Luck, M. (2007). Agents in bioinformatics, computational
and systems biology. Briefings in Bioinformatics, 8(1), 45-59.
Patel, G.M., Patel, G.C., Patel, R.B., Patel, J.K., & Patel, M. (2006). Nanorobot: a versatile tool in nanomedicine. Journal of Drug Targeting, 14(2), 63-67.
Patterson, S.D. (2003). Data analysis: The Achilles heel of proteomics. Nature Biotechnology, 21(3),
221-222.

Pawelek, J.M., Low, K.B., & Bermudes, D. (2003). Bacteria as tumour-targeting vectors. The Lancet
Oncology, 4, 548-556.
Rai, A., & Boyle, J. (2007). Synthetic Biology: Caught between Property Rights, the Public Domain,
and the Commons. PLoS Biology, 5(3) e58, 389-393.
Rebatchouk, D., Daraselia, N., & Narita, J.O. (1996). NOMAD: a versatile strategy for in vitro DNA
manipulation applied to promoter analysis and vector design. Proceedings of the National Academy of
Sciences of the United States of America, 93(20), 10891-10896.
Schrödinger, E. (1944). What is Life? Cambridge: Cambridge University Press (available as a Canto edition, 1992).
Stähler, P., Beier, M., Gao, X., & Hoheisel, J.D. (2006). Another side of genomics: Synthetic biology as
a means for the exploitation of whole-genome sequence information. Journal of Biotechnology, 124(1),
206-212.
Tan, C., Song, H., Niemi, J., & You, L. (2007). A synthetic biology challenge: making cells compute.
Molecular Biosystems, 3(5), 343-353.
Tian, J., Gong, H., Sheng, N., Zhou, X., Gulari, E., Gao, X., & Church, G. (2004). Accurate multiplex
gene synthesis from programmable DNA microchips. Nature, 432, 1050-1054.
Troisi, A., Wong, V., & Ratner, M.A. (2005). An agent-based approach for modeling molecular self-organization. Proceedings of the National Academy of Sciences of the United States of America, 102(2),
255-260.
Turing, A. (1952). The chemical basis of morphogenesis. Philosophical Transactions of the Royal Society
of London, 237(641), 37-72.
Varma, A., Morbidelli, M., & Wu, H. (1999). Parametric sensitivity in chemical systems. Cambridge,
New York: Cambridge University Press.
Voigt, C.A. (2006). Genetic parts to program bacteria. Current Opinion in Biotechnology, 17(5), 548-557.
Wagner, A. (1996). Can nonlinear epigenetic interactions obscure causal relations between genotype
and phenotype? Nonlinearity, 9, 607-629.
Weiss, R. (2007, January 11-13). Developments in synthetic biology. In the lecture presented at Systems
Biology, Bioinformatics and Synthetic Biology Workshop (BioSysBio 2007), Manchester, UK.
Wiener, N. (1948). Cybernetics or control and communication in the animal and the machine. Cambridge, MA: MIT Press.
Yokobayashi, Y., Weiss, R., & Arnold, F.H. (2002). Directed evolution of a genetic circuit. Proceedings
of the National Academy of Sciences of the United States of America, 99(26), 16587-16591.
Zoloth, L. (2006, May 20-22). Ethical challenges in Synthetic Biology. The lecture presented at Second International Conference on Synthetic Biology (Synthetic Biology 2.0). University of California,
Berkeley, CA.

Key Terms
Agent: A computing system with a well defined interface capable of adaptive problem-solving actions without user intervention.
Amorphous Computing: The action of generating a coherent behavior from a group of unreliable
agents such as living cells.
Competent Cell: A cell that can accept foreign DNA.
Emergence: Refers to new unexpected behaviors and patterns that arise out of a multiplicity of relatively simple interactions. An emergent behavior can appear when a number of simple entities (agents)
operate in an environment while forming more complex behaviors as a community.
Minimal (or core) Genome: The minimum set of genes necessary for a cell to propagate under
specific environmental conditions.
Nanorobot (or nanobot): An artificially fabricated object able to move freely in
the human body and interact with specific cells at the molecular level, e.g., seeking out cancer cells
and destroying them.
Orthogonal Life: A term stressing the complete isolation of artificial life-like creatures from natural
processes by using an alternative genetic code and a reliable interface. Chen's rules should be taken into
account: 1) nanomachines should only be specialized, not general purpose, 2) nanomachines should
not be self-replicating, 3) nanomachines should not be made to use an abundant natural compound as
fuel, 4) nanomachines should be tagged so they can be tracked.
Refactoring: Any modification of a computer program which improves its readability or simplifies
its structure without changing its results.
Synthetic Biology: An engineering discipline concerning the synthesis of novel biological systems that
are not found in nature. It involves a paradigm in which scientists create life from scratch in order to
better understand the principles of biology. Synthetic Biology holds promise for programming bacteria
to seek and destroy tumors. However, the complexity of biology and emergent effects have provoked
many technical and ethical challenges.
SYS Biology: A speculative merger of Systems Biology and Synthetic Biology that can be defined
as an approach to biological reality in which natural and artificial processes are described in
terms of their components and their interactions within a framework of mathematical models, working toward a
reconstruction of biological systems.
Systems Biology: The study of an organism, viewed as an interacting network of genes, proteins, and
biochemical reactions which give rise to life. Instead of analyzing individual aspects of the organism,
systems biologists focus on all the components and the interactions among them. Systems Biology is
the discipline that specifically addresses the fundamental properties of biological complexity.
Transgenesis: The introduction of foreign genes into a living organism that confers upon the organism a new property that will be transmitted to its progeny.


Section II

Advanced Computational Methods for Systems Biology


Chapter VI
Computational Models for the Analysis of Modern Biological Data
Tuan D. Pham
James Cook University, Australia

Abstract
Computational models have been playing a significant role for the computer-based analysis of biological
and biomedical data. Given the recent availability of genomic sequences and microarray gene expression,
and proteomic data, there is an increasing demand for developing and applying advanced computational techniques for exploring these types of data such as: functional interpretation of gene expression
data, deciphering how genes and proteins work together in pathways and networks, and extracting and
analysing phenotypic features of mitotic cells for high throughput screening of novel anti-mitotic drugs.
Successful applications of advanced computational algorithms to solving modern life-science problems
will make significant impacts on several important and promising issues related to genomic medicine,
molecular imaging, and the scientific knowledge of the genetic basis of diseases. This chapter reviews
the fusion of engineering, computer science, and information sciences with biology and medicine to
address some of the latest technical developments in the computational analyses of modern biological data:
microarray gene expression data, mass spectrometry data, and bioimaging.

Microarray Gene Expression Data


Microarrays are a relatively new biotechnology that provides novel insights into gene expression and
gene regulation (Brazma and Vilo, 2000; Whitchurch, 2002; Zhang et al, 2002; Pham et al, 2006a).
Microarray technology has been applied in diverse areas ranging from genetics and drug discovery to
disciplines such as virology, microbiology, immunology, endocrinology, and neurobiology. Microarray-based methods are the most widely used technology for large-scale analysis of gene expression because
they allow simultaneous study of mRNA abundance for thousands of genes in a single experiment
(Kellam and Liu, 2003). The generation of DNA microarray image spots involves the hybridization of
two probes labelled with a fluorescent red dye or a fluorescent green dye. The relative image intensity
values of the red dye and the green dye on a particular spot of the arrays indicate the expression ratio for
the corresponding gene of the two samples from which the mRNAs have been extracted. Thus, robust
image processing of microarray spots plays an important role in microarray technology (Nagarajan,
2003; Liew et al, 2003; Lukac et al, 2004).
DNA microarray data consists of a large number of genes and a relatively small number of experimental samples. The number of genes on an array is in the order of thousands, and because this far
exceeds the number of samples, dimension reduction is needed to allow efficient analysis of the data by classification techniques. Many statistical and machine-learning techniques based on different computational
methodologies have been applied for cancer classification in microarray experiments. These techniques
include linear discriminant analysis, k-nearest neighbor algorithms, Bayes classifiers, decision trees,
neural networks, and support vector machines (Dudoit and Fridlyand, 2003; Golub et al, 1999; Guyon
et al, 2002). Nevertheless, common tasks of most classifiers are to perform feature selection and decision logic.
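As a minimal, hedged illustration of this feature-selection-plus-classifier pattern (not the method of any study cited here), the following Python sketch ranks genes with a univariate score and feeds the top-ranked subset to a linear support vector machine; the data, the number of retained genes, and the SVM parameter are invented placeholders.

# Minimal sketch: univariate gene filtering followed by a linear SVM.
# Data shapes and parameter values are illustrative placeholders only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 5000                 # few samples, thousands of genes
X = rng.normal(size=(n_samples, n_genes))     # stand-in for expression values
y = rng.integers(0, 2, size=n_samples)        # stand-in for tumour/normal labels

clf = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=50),             # keep the 50 highest-ranked genes
    LinearSVC(C=1.0, dual=False),
)
scores = cross_val_score(clf, X, y, cv=5)     # cross-validation guards against over-optimism
print("mean cross-validated accuracy: %.2f" % scores.mean())

Placing the selection step inside the cross-validated pipeline is deliberate: selecting genes on the full dataset before splitting would leak information into the test folds.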
Based on the motivation that conventional statistical methods for pattern classification break down
when there are more variables (genes) than there are samples, Nguyen and Rocke (2002) proposed a
partial least-squares method for classifying human tumor samples using microarray gene expression
data. Zhou et al. (2004) proposed a Bayesian approach for selecting the strongest genes based on microarray gene expression data and the logistic regression model for classifying and predicting cancer genes.
Yeung et al. (2005) reported that conventional methods for gene selection and classification do not take
into account model uncertainty and use a single set of selected genes for prediction, and introduced a
Bayesian model averaging method, which considers the uncertainty by averaging over multiple sets of
overlapping relevant genes. Furey et al. (2000) applied support vector machines for the classification
of cancer tissue samples or cell types using microarrays. Lee et al. (2003) proposed a Bayesian model
for gene selection for cancer classification using microarray data. Statnikov et al. (2005) carried out
a comprehensive evaluation of classification methods for cancer diagnosis based on microarray gene
expression data.
Recently Pham et al. (2006b) carried out cancer classification by transforming microarray data into
spectral vectors. The same authors used the spectral difference or spectral distortion between the pair
of spectra for pattern comparison, which appears to be a potential approach for the cancer classification
using microarray gene expression data.
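As a loose illustration of comparing expression profiles through their spectra (this is not the transform or distortion measure actually used by Pham et al., 2006b), the sketch below computes a simple log-spectral distance between two profiles treated as one-dimensional signals; the profiles themselves are synthetic.

# Generic log-spectral distortion between two expression profiles treated as
# 1-D signals. NOT the exact method of Pham et al. (2006b); illustration only.
import numpy as np

def log_spectral_distortion(x, y, eps=1e-12):
    """Root-mean-square difference of the log power spectra of x and y."""
    px = np.abs(np.fft.rfft(x - x.mean())) ** 2
    py = np.abs(np.fft.rfft(y - y.mean())) ** 2
    return np.sqrt(np.mean((np.log(px + eps) - np.log(py + eps)) ** 2))

rng = np.random.default_rng(1)
profile_a = rng.normal(size=1000)                       # stand-in expression vectors
profile_b = profile_a + 0.1 * rng.normal(size=1000)     # similar to profile_a
profile_c = rng.normal(size=1000)                       # unrelated profile

# A smaller distortion suggests that two profiles are more alike.
print(log_spectral_distortion(profile_a, profile_b))
print(log_spectral_distortion(profile_a, profile_c))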

Mass Spectrometry Data

Current best practice for reducing human mortality rates caused by complex diseases is to detect their
symptoms at early stages. By early recognition of symptoms, one can get the most effective clinical
treatment for the best outcome. Recent advances in biotechnology open doors to fascinating opportunities for the better understanding of the biology of many complex human diseases at molecular levels.
These advances will hopefully lead to the early detection and treatment of such diseases (Petricoin and
Liotta, 2003; Wulfkuhle et al, 2003).

Besides the availability of genomic data, life-science researchers study proteomics in order to gain
insight into the functions of cells by learning how proteins are expressed, processed, recycled, and
localized in cells. Proteomics is defined as the study of the proteome, which refers to the entire set
of expressed proteins in a cell. Major research areas of proteomics include structural, functional, and
interaction studies (Holmes et al, 2005). Structural proteomics uses X-ray crystallography, nuclear
magnetic resonance, or even both to study the final three dimensional shape of proteins. Functional
proteomics involves the use of mass spectrometry (MS) to study the regulation, timing, and location
of protein expression. Interaction studies seek to understand how proteins interact with one another and
with other cellular components to constitute more complex molecular machines.
In particular, protein expression profiles or expression proteomics can be used for large-scale protein
characterization or differential expression analysis, which has many applications such as disease classification and prediction, new drug treatment and development, virulence factors, polymorphisms for
genetic mapping, and species determinants (Adam et al, 2001; Griffin et al, 2002; Aebersold and Mann,
2003; Weir et al, 2003). In comparison with transcriptional profiling in functional genomics, proteomics
has some obvious advantages in that it provides a more direct approach to studying cellular functions
because most gene functions are characterized by proteins (Xiong, 2006).
The identities of expressed proteins in a proteome can be determined by protein separation, identification, and quantification. One of many approaches for separating proteins involves two-dimensional
gel electrophoresis followed by gel image processing. Once proteins are separated, protein differential
expression can be characterized using mass spectrometry, which is a high-resolution technique for
determining molecular masses and provides rapid and accurate profiling of proteins in
complex biological and chemical mixtures.
Protein profiling of plasma and serum can be prepared with a matrix-assisted laser desorption ionization (MALDI) ion source or the surface-enhanced laser desorption ionization (SELDI) ion source
coupled to a time-of-flight (TOF) mass analyzer with a chevron microchannel plate detector. Detailed
descriptions on mass spectrometry and its advanced developments can be found in the review by Shin
and Markey (2006).
In regards to recent applications of proteomic technology, proteomic patterns have recently been
utilized for early detection of cancer progressions (Sauter et al, 2002; Petricoin et al, 2002; Conrads
et al, 2003). Furthermore, methods for classification of normal and cancerous states using mass spectrometry data have been recently developed. Petricoin et al. (2002) applied cluster analysis and genetic
algorithms to detect early-stage ovarian cancer using proteomic spectra. Ball et al. (2002) applied an
integrated approach based on neural networks to study SELDI-MS data for classification of human
tumors and identification of biomarkers. Lilien et al. (2003) applied principal component analysis and
a linear discriminant function to classify ovarian and prostate cancers. Sorace and Zhan (2003) used
mass spectrometry serum profiles to detect early ovarian cancer. Wu et al. (2003) compared the performance of several methods for the classification of mass spectrometry data. Tibshirani et al. (2004)
proposed a probabilistic approach for sample classification from protein mass spectrometry data. Morris
et al. (2005) applied wavelet transforms and peak detection for feature extraction of MS data. Yu et al.
(2005) developed a method for dimensionality reduction for high-throughput MS data. Levner (2005)
used feature selection methods and then applied the nearest centroid technique to classify MS-based
ovarian and prostate cancer datasets.
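To make the flavour of such classifiers concrete, the following sketch applies a nearest-centroid classifier to synthetic spectra, in the spirit of (but not reproducing) the nearest-centroid approach cited above; all data, class labels, and the choice of m/z bins are invented for illustration.

# Nearest-centroid classification of synthetic mass spectra (illustration only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(2)
n_spectra, n_mz = 80, 2000
X = rng.gamma(shape=2.0, scale=1.0, size=(n_spectra, n_mz))  # intensity-like values
y = rng.integers(0, 2, size=n_spectra)                       # 0 = control, 1 = cancer
X[y == 1, 100:110] += 3.0            # pretend a few m/z bins are disease-associated

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = NearestCentroid().fit(X_tr, y_tr)   # a class centroid is the mean training spectrum
print("test accuracy:", clf.score(X_te, y_te))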
Zhou et al (2006) applied genetic algorithms for protein biomarker discovery for risk stratification
of cardiovascular events. Pham et al. (2006c) have applied the linear predictive coding method for the
prediction of the risk of major adverse cardiac events using mass spectral data. The motivation of this
research was initiated from the original work presented in Brennan et al. (2003), who studied 604 patients
who presented to the emergency room with chest pain. Blood samples were collected at presentation to the
emergency room, and the levels of MPO (myeloperoxidase) and other known cardiovascular biomarkers were measured. The patients' outcomes (any cardiovascular event) were monitored for 6
months. The study showed MPO to be a new biomarker for the prediction of MACE (major adverse
cardiac events) risk within 30 days after the presentation of chest pain in the emergency room, with an accuracy of
about 60%. Recently, the FDA (U.S. Food and Drug Administration) approved the CardioMPO kit for
measurement of MPO level (http://www.fda.gov/cdrh/reviews/K050029.pdf). In particular, Pham et al.
(2006b) computed the prediction coefficients as a spectral feature of the MS data and used the minimum
distortion rule to classify control and disease samples. The experimental results appeared to be promising and feasible for a protein peak-detection strategy.
Given the promising integration of several machine-learning methods and mass spectrometry data
in high-throughput proteomics (Shin and Markey, 2006), this new biotechnology still encounters several
challenges in order to become a mature platform for clinical diagnostics and protein-based biomarker
profiling. Some of the major challenges include noise filtering of MS data, selection of computational
methods for MS-based classification, feature extraction and feature reduction of MS datasets (Anderle
et al, 2004; Salmi et al, 2006).
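Two of these challenges, noise filtering and feature (peak) extraction, can be sketched with standard signal-processing tools; the spectrum below is synthetic, and the smoothing window, baseline estimate, and peak thresholds are assumptions that would need tuning on real data.

# Sketch of noise filtering and peak extraction for a single mass spectrum.
# The spectrum is synthetic; all parameter values are illustrative assumptions.
import numpy as np
from scipy.signal import find_peaks, savgol_filter

rng = np.random.default_rng(3)
mz = np.linspace(1000, 12000, 5000)
spectrum = np.exp(-((mz - 3000) / 40) ** 2) + 0.6 * np.exp(-((mz - 7500) / 60) ** 2)
spectrum += 0.05 * rng.normal(size=mz.size)              # additive noise

smoothed = savgol_filter(spectrum, window_length=31, polyorder=3)   # denoise
baseline = np.percentile(smoothed, 10)                              # crude baseline
peaks, props = find_peaks(smoothed - baseline, height=0.2, distance=20)

print("detected peak m/z values:", mz[peaks])
print("detected peak heights:", props["peak_heights"])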

Bio-Imaging Data
By the use of fluorescence-based reagents, high content screening (HCS) studies cell functions by extracting the temporal and spatial information about target activities within cells (Giuliano et al., 1997).
Particularly due to the huge volumes of acquired images, the automation of HCS systems has become
necessary to help life-science researchers understand the complex process of cell division or mitosis
at a rapid speed (Debeir et al., 2005; Fox, 2003). Its power comes from the sensitivity and resolution of
automated light microscopy with multi-well plates, combined with the availability of fluorescent probes
that are attached to specific subcellular components, such as chromosomes and microtubules, for visualization of cell division or mitosis using standard epi-fluorescence microscopy techniques (Yarrow et
al., 2003). By employing carefully selected reporter probes and filters, fluorescence microscopy allows
specific imaging of phenotypes of essentially any cell component (Murphy, 2001). With these probes we
can determine both the amount of a cell component, and most critically, its distribution within the cell
relative to other components. Typically, 3-4 different components are localized in the same cell using
probes that excite at different wavelengths. Any change in cell physiology would cause a redistribution
of one or more cellular components, and this redistribution provides a certain cytological marker that
allows for scoring of the physiological change.
In time-lapse microscopy, images are usually captured at time intervals of more than 10 minutes.
During this period, dividing nuclei may move far away from each other, and daughter cell nuclei may
not overlap with those of their parents. An essential task for high content screening is to measure cell cycle progression (interphase, prophase, metaphase, and telophase) in individual cells as a function of time. Cell
cycle progress can be identified by measuring nuclear changes. Thus, automated time-lapse fluorescence
microscopy imaging provides an important method for the observation and study of cellular nuclei in
a dynamic fashion (Hiraoka and Haraguchi, 1996; Kanda et al., 1998). Stages of an automated cellular
imaging analysis consist of segmentation, feature extraction, classification, and tracking of individual
cells in a dynamic cellular population; and the classification of cell phases is considered the most difficult task of such analysis (Chen et al., 2006).
Given the advanced fluorescent imaging technology, there still remain technical challenges in processing and analyzing large volumes of images generated by time-lapse microscopy. The increasing
quantity and complexity of image data from dynamic microscopy renders manual analysis unreasonably
time-consuming. Therefore, automatic techniques for analyzing cell-cycle progress are of considerable
interest in the drug discovery process. Steps for automatic analysis of cell images include image segmentation, feature extraction and classification, and the solution for image segmentation plays an
important role in providing accurate information for feature extraction and object classification. Wählby
et al. (2004) combined intensity, edge and shape information for 2D and 3D segmentation of cell nuclei in tissue sections. Pham et al. (2006c) have developed several classification models for identifying
individual cell phase changes over a period of time. The same authors extracted image spatial-continuity
information as a novel feature using the concept of geostatistics (Isaaks and Srivastava,
1989) and the theory of linear predictive coding (LPC) (Makhoul, 1975), and also presented another
scheme for extracting two-dimensional image features, known as the spatial linear predictive coding
(SLPC) coefficients. Classification of cell phases was then carried out with vector-quantization-based
templates for pattern matching.
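A generic version of the segmentation and feature-extraction steps described in this section can be sketched with scikit-image; the image is synthetic and the thresholding and labelling choices are illustrative assumptions, not the algorithms of the cited studies.

# Sketch: segment nuclei-like blobs in a synthetic fluorescence image and
# extract simple per-object features. Illustration only, not a cited method.
import numpy as np
from skimage.filters import gaussian, threshold_otsu
from skimage.measure import label, regionprops

rng = np.random.default_rng(4)
image = np.zeros((256, 256))
yy, xx = np.ogrid[:256, :256]
for cy, cx in [(60, 60), (150, 180), (200, 80)]:          # three synthetic "nuclei"
    image += np.exp(-(((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 15 ** 2)))
image += 0.05 * rng.normal(size=image.shape)

smoothed = gaussian(image, sigma=2)                        # suppress noise
mask = smoothed > threshold_otsu(smoothed)                 # segmentation
labels = label(mask)                                       # one label per nucleus

for region in regionprops(labels, intensity_image=image):  # per-nucleus features
    print(region.label, region.area, round(region.eccentricity, 2),
          round(region.mean_intensity, 3))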

References
Aebersold, R., & Mann, M. (2003). Mass spectrometry-based proteomics. Nature, 422, 198-207.
Anderle, M., Roy, S., Lin, H., Becker, C., Joho, K. (2004). Quantifying reproducibility for differential
proteomics: noise analysis for protein liquid chromatography-mass spectrometry of human serum.
Bioinformatics, 20, 3575-3582.
Adam, B.L., Vlahou, A., Semmes, O.J., Wright Jr., G.L. (2001). Proteomic approaches to biomarker
discovery in prostate and bladder cancers. Proteomics, 1, 1264-1270.
Ball, G., Mian, S., Holding, F., Allibone, R.O., Lowe, J., Ali, S., Li, G., McCardle, S., Ellis, I.O., Creaser,
C., & Rees, R.C. (2002). An integrated approach utilizing artificial neural networks and SELDI mass
spectrometry for the classification of human tumours and rapid identification of potential biomarkers.
Bioinformatics, 18, 395-404.
Brazma, A., & Vilo J. (2000). Gene expression data analysis. FEBS Letters, 480, 17-24.
Brennan, M.-L., Penn, M.S., Van Lente, F., Nambi, V., Shishehbor, M.H., Aviles, R.J., Goormastic, M.,
Pepoy, M.L., McErlean, E.S., Topol, E.J., Nissen, S.E., & Hazen, S.L. (2003). Prognostic value of myeloperoxidase
in patients with chest pain. The New England Journal of Medicine, 349, 1595-1604.
Chen, X., Zhou, X., & Wong, S.T.C. (2006). Automated segmentation, classification, and tracking cancer
cell nuclei in time-lapse microscopy. IEEE Trans. Biomedical Engineering, 53, 762-766.
Conrads, T.P., Zhou, M., Petricoin III, E.F., Liotta L., Veenstra, T.D. (2003). Cancer diagnosis using
proteomic patterns. Expert Rev. Mol. Diagn., 3, 411-420.

Debeir, O., Ham, P.V., Kiss, R., & Decaestecker, C. (2005). Tracking of migrating cells under phase-contrast
video microscopy with combined mean-shift processes. IEEE Trans. Medical Imaging, 24, 697-711.
Dudoit, S., & Fridlyand, J. (2003). Classification in microarray experiments. In T. Speed (Ed.), Statistical
analysis of gene expression microarray data (pp. 93-158). Boca Raton, FL: Chapman & Hall.
Fox, S. (2003). Accommodating cells in HTS. Drug Discovery World, 5, 21-30.
Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M., & Haussler, D. (2000). Support
vector machine classification and validation of cancer tissue samples using microarray expression data.
Bioinformatics, 16, 906-914.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L.,
Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., & Lander, E.S. (1999). Molecular classification of
cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.
Griffin, T., Goodlett, T., & Aebersold, R. (2002). Advances in proteomic analysis by mass spectrometry.
Curr. Opin. Biotechnol., 12, 607-612.
Giuliano et al. (1997). High-content screening: A new approach to easing key bottlenecks in the drug
discovery process. J. Biomolecular Screening, 2, 249-259.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using
support vector machines. Machine Learning, 46, 389-422.
Hiraoka, Y., & Haraguchi, T. (1996). Fluorescence imaging of mammalian living cells. Chromosome
Res, 4, 173-176.
Holmes, M.R., Ramkissoon, K.R., Giddings, M.C. (2005). Proteomics and protein identification. In
Baxevanis AD, Ouellette BFF, (eds.), Bioinformatics: A practical guide to the analysis of genes and
proteins (pp. 445-472). New Jersey: John Wiley & Sons.
Isaaks, E.H., & Srivastava, R.M. (1989). An introduction to applied geostatistics. New York: Oxford
University Press.
Kanda, T., Sullivan, K.F., & Wahl, G.M. (1998). Histone-GFP fusion protein enables sensitive analysis
of chromosome dynamics in living mammalian cells. Current Biology, 8, 377-385.
Kellam, P., & Liu, X. (2003). Experimental use of DNA arrays. In C.A. Orengo, D.T. Jones, & J.M.
Thornton (Eds.), Bioinformatics: Genes, proteins & structures. Oxford: BIOS Scientific Publishers.
Lee, K.E., Sha, N., Dougherty, E.R., Vannucci, M., & Mallick, B.K. (2003). Gene selection: A Bayesian
variable selection approach. Bioinformatics, 19, 90-97.
Levner, I. (2005). Feature selection and nearest centroid classification for protein mass spectrometry.
BMC Bioinformatics, 6(68).
Liew, AW-C, Yan, H., & Yang, M. (2003). Robust adaptive spot segmentation of DNA microarray images. Pattern Recognition, 36, 1251-1254.
Lilien, R.H., Farid, H., & Donald, B.R. (2003). Probabilistic disease classification of expression-dependent
proteomic data from mass spectrometry of human serum. J. Computational Biology, 10, 925-946.

Lukac, R., Plataniotis, K.N., Smolka B., & Venetsanopoulos, A.N. (2004). A multichannel order-statistic
technique for cDNA microarray image processing. IEEE Trans. Nanobioscience, 3, 272-285.
Makhoul, J. (1975). Linear prediction: a tutorial review. Proceedings of the IEEE, 63, 561-580.
Morris, J.S., Coombes, K.R., Koomen, J., Baggerly, K.A., Kobayashi, R. (2005). Feature extraction and
quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics, 21, 1764-1775.
Murphy, D.B. (2001). Fundamentals of light microscopy and electronic imaging. Wiley-Liss.
Nagarajan, R. (2003). Intensity-based segmentation of microarray images. IEEE Trans. Medical Imaging, 22, 882-889.
Nguyen, D.V., & Rocke, D.M. (2002). Tumor classification by partial least squares using microarray
gene expression data. Bioinformatics, 18, 39-50.
Petricoin, E.F., et al. (2002). Use of proteomic patterns in serum to identify ovarian cancer. Lancet,
359, 572-577.
Petricoin, E.F., & Liotta, L.A. (2003). Mass spectrometry-based diagnostics: The upcoming revolution
in disease detection. Clinical Chemistry, 49, 533-534.
Pham, T.D., Wells, C., & Crane, D.I. (2006a). Analysis of microarray gene expression data. Current
Bioinformatics, 1(1), 37-53.
Pham, T.D., Beck, D., & Yan, H. (2006b). Spectral pattern comparison methods for cancer classification
based on microarray gene expression data. IEEE Trans. Circuits and Systems I: Fundamental Theory
and Applications, special issue on Advances on Life Science Systems, 53(11), 2425-2430.
Pham, T.D., Tran, D.T., Zhou, X., & Wong, S.T.C. (2006c). Integrated algorithms for image analysis and
identification of nuclear division for high-content cell-cycle screening. Int. J. Computational Intelligence
and Applications, 6, 21-43.
Salmi, J., Moulder, R., Filen, J-J., Nevalainen, O.S., Nyman, T.A., Lahesmaa, R., & Aittokallio, T. (2006).
Quality classification of tandem mass spectrometry data. Bioinformatics, 22, 400-406.
Shin, H., Markey, M.K. (2006). A machine learning perspective on the development of clinical decision
support systems utilizing mass spectra of blood samples. J. Biomedical Informatics, 39, 227-248.
Sorace, J.M., & Zhan M. (2003). A data review and re-assessment of ovarian cancer serum proteomic
profiling. BMC Bioinformatics, 4(24).
Statnikov, A., Aliferis, C.F., Tsamardinos, I., Hardin, D., & Levy S. (2005). A comprehensive evaluation
of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics,
21, 631-643.
Tibshirani, R., Hastie, T., Narasimhan, B., Soltys, S., Shi, G., Koong, A., Le, Q-T. (2004). Sample classification from protein mass spectrometry, by peak probability contrasts. Bioinformatics, 20, 3034-3044.
Verma, D., & Meila, M. (2003). A comparison of spectral clustering algorithms. Technical Report, UWCSE-03-05-01. University of Washington.

Wählby, C., Sintorn, I.-M., Erlandsson, F., Borgefors, G., & Bengtsson, E. (2004). Combining intensity,
edge and shape information for 2D and 3D segmentation of cell nuclei in tissue sections. Journal of
Microscopy, 215, 67-76.
Weir, M.P., Blackstock, W.P., & Twyman, M. (2003). Proteomics. In C.A. Orengo, D.T. Jones, & J.M.
Thornton (Eds.), Bioinformatics: Genes, proteins & computers (pp. 245-257). Oxford: BIOS Scientific Publishers.
Whitchurch, A.K. (2002). Gene expression microarrays. IEEE Potentials, 21, 30-34.
Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K., & Zhao, H.
(2003). Comparison of statistical methods for classification of ovarian cancer using mass spectrometry
data. Bioinformatics, 19, 1636-1643.
Wulfkuhle, J.D., Liotta, L.A., & Petricoin, E.F. (2003). Proteomic applications for the early detection
of cancer. Nature Reviews Cancer, 3, 267-275.
Xiong, J. (2006). Essential Bioinformatics. New York: Cambridge University Press.
Yarrow, J.C., Feng, Y., Perlman, Z.E., Kirchhausen, T., & Mitchison, T.J. (2003). Phenotypic screening
of small molecule libraries by high throughput cell imaging. Comb Chem High Throughput Screen, 6,
279-286.
Yeung, K.Y., Bumgarner, R.E., & Raftery, A.E. (2005). Bayesian model averaging: Development of
an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 21,
2394-2402.
Yu, J.S., Ongarello, S., Fiedler, R., Chen, X.W., Toffolo, G., Cobelli, C., & Trajanoski, Z. (2005). Ovarian
cancer identification based on dimensionality reduction for high-throughput mass spectrometry data.
Bioinformatics, 21, 2200-2209.
Zhang, X.Y., Chen, F., Zhang, Y.T., Agner, S.G., Akay, M., Lu, Z.H., Waye, M.M.Y., & Tsui, S.K.W. (2002).
Signal processing techniques in genomic engineering. Proceedings of the IEEE, 90, 1822-1833.
Zhou, X., Liu, K-Y., & Wong, S.T.C. (2004). Cancer classification and prediction using logistic regression with Bayesian gene selection. J. Biomedical Informatics, 37, 249-259.
Zhou, X., Wang, H., Wang, J., Hoehn, G., Azok, J., Brennan, M.L., Hazen, S.L., Li, K., & Wong, S.T.C.
(2006). Biomarker discovery for risk stratification of cardiovascular events using an improved genetic
algorithm. Proc. IEEE/NLM Int. Symposium on Life Science and Multimodality, 42-44.

Key Terms
Artificial Neural Networks: Machine learning methods consisting of interconnecting artificial
neurons that simulate the properties of biological neural networks.
Biomarker Discovery: Discovery of molecular parameters associated with the presence and severity of specific disease states.

Cluster Analysis: Methods for grouping objects of similar kind into respective categories.
Decision Trees: Predictive models that map observations about an event to conclusions about its target
value.
Discriminant Analysis: Statistical analysis to discriminate between two or more groups of samples.
Feature Extraction: Extraction of representative properties of an object for the purpose of classification.
Feature Reduction: Compression of the feature space of an object.
Genetic Algorithms: Biologically inspired optimization methods.
Geostatistics: Applied statistics of spatially correlated data.
High Content Screening: A high throughput platform for understanding the functions of genes,
RNA, proteins, and other cellular constituents at the level of the living cell.
k-Nearest Neighbor Algorithms: Methods for classifying objects based on closest training samples
in the feature space.
Linear Predictive Coding: An encoding method that allows the prediction of the value of the signal
at each sample as a linear combination of the past values of the signal (a worked form of the predictor is given after this list).
Mass Spectrometry Data: A dataset that consists of relative intensities at a chromatographic retention time and the ratios of molecular mass over charge (m/z). The mass spectrum for a sample is a function
of the molecules present and is used to test for the presence or absence of one or more molecules, which may relate to
a diseased state or a cell type.
Microarray Gene Expression Data: Modern biotechnological data generated for studying the interaction of large numbers of genes and how a cell's regulatory networks control genes simultaneously.
Naive Bayes Classifier: A classification technique that is based on Bayes' theorem.
Proteomics: The study of the structure and function of proteins.
Spectral Distortion: A measure of mismatch between two signals based on their spectral properties.
Support Vector Machines: Machine learning algorithms that map input vectors to a higher-dimensional space where a maximal separating hyperplane is constructed.
Time-Lapse Microscopy Imaging: Microscopy imaging that captures images of dynamic events at
predetermined time intervals.
Vector Quantization: A technique that compresses k dimensional vectors to a finite set of n dimensional vectors, where n is smaller than k.
Wavelet Transform: The representation of a signal in terms of scaled and translated copies of a
finite length or fast decaying oscillating waveform.
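For reference, the linear predictive coding entry above can be written compactly in standard textbook notation (this is the generic definition, not a formula taken from the cited works): the order-p predictor estimates each sample as

\hat{s}(n) = \sum_{k=1}^{p} a_k \, s(n-k), \qquad e(n) = s(n) - \hat{s}(n),

where s(n) is the signal, a_k are the prediction coefficients, and the coefficients are chosen to minimize the total squared residual \sum_n e(n)^2.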


Chapter VII
Computer Aided Knowledge Discovery in Biomedicine
Vanathi Gopalakrishnan
University of Pittsburgh, USA

Abstract
This chapter provides a perspective on three important collaborative areas in systems biology research.
These areas represent biological problems of clinical significance. The first area deals with macromolecular crystallization, which is a crucial step in protein structure determination. The second area deals
with proteomic biomarker discovery from high-throughput mass spectral technologies; while the third
area is protein structure prediction and complex fold recognition from sequence and prior knowledge
of structure properties. For each area, successful case studies are revisited from the perspective of computer-aided knowledge discovery using machine learning and statistical methods. Information about
protein sequence, structure, and function is slowly accumulating in standardized forms within databases.
Methods are needed to maximize the use of this prior information for prediction and analysis purposes.
This chapter provides insights into such methods by which available information in existing databases
can be processed and combined with systems biology expertise to expedite biomedical discoveries.

INTRODUCTION
The mission of this chapter is to introduce concepts and terms that form the core of methods devised
for important problems in bioinformatics and systems biology applications. Successful case studies are
presented that utilize prior knowledge to aid in novel biomedical discoveries. Machine learning techniques are applied to various important biological problems, namely macromolecular crystallization,
biomarker discovery from proteomic mass spectra and protein structure prediction via fold recognition.
A common theme is the utilization of protein sequence properties and known task-specific information
that serve as prior knowledge for guiding knowledge discovery. Much of the task-specific information
is obtained through direct interactions between the bioinformatician and the domain expert in biomedical science.
Systematization of the processes by which biomedical discoveries are made can uncover useful
information that can help the bench scientist prioritize and focus efforts. A major goal of this chapter
is to describe efforts made toward such systematization in some critical research areas. Recent novel
machine learning algorithms have demonstrated some success in identifying and characterizing interesting relationships among domain concepts resulting in discovery of explanations for well-known
observed scientific phenomena. The domain expert plays a very important role in studying output
generated by computer programs and providing input to bioinformaticians on how to focus their subsequent efforts. Thus, communication between multi-disciplines is crucial to successful computer-aided
biomedical discovery. For example, modeling protein sequence-structure relationships is a challenging
bioinformatics task. Prior knowledge about protein fold can be used to better model protein families
containing remote homologs that have very few sequence characters in common between members
of the same family. Such knowledge is obtained typically from study of the literature combined with
communication with a domain expert.

BACKGROUND

Knowledge discovery in biomedicine in the current world is very often the result of computational analyses combined with interpretation by domain experts. Langley (1998) states that artificial intelligence
researchers have tried to develop intelligent artifacts that replicate the act of discovery. There are distinct
steps in the scientific discovery process discussed therein (Langley, 1998) during which developers or
users can influence the behavior of a computational discovery system. Furthermore, Langley (1998)
suggests that such intervention is the preferred approach for using discovery software. In this chapter,
we present an approach to data modeling and discovery that is consistent with this viewpoint.
Jurisica and Wigle (2006) define knowledge discovery (KD) as the process of extracting novel, useful,
understandable and usable information from large data sets. The authors review knowledge discovery
in proteomics and present examples of such algorithms in the literature that aid protein crystallization.
The case studies presented in this chapter reflect state-of-the-art challenges in proteomics along with
computer-aided solutions. Quantitative and qualitative discoveries are described along with the methods
by which they are arrived at. The KD process in complex real-world domains requires multi-disciplinary methods involving both artificial intelligence and statistics applied to databases (Jurisica & Wigle,
2006).
Proteomics can be defined simply as the study of protein composition in a protein complex, organelle, cell or entire organism (Russell, Old, Resing, & Hunter, 2004). Current high-throughput proteomic
technologies require robotics and computational techniques to decipher signals within multitudes of
data. It is becoming clear that the high dimensionality poses a serious challenge to existing artificial
intelligence tools for knowledge discovery and reasoning (Jurisica & Wigle, 2006). The unavailability of
large numbers of samples combined with the high dimensionality of the feature space limits the usefulness of models obtained from such data. Moreover, uncertain and missing values in the data combined
with evolving knowledge of the underlying mechanisms requires an intelligent information system to
be flexible and scalable (Jurisica & Wigle, 2006).

KNOWLEDGE DISCOVERY IN BIOMEDICAL APPLICATIONS


We believe that systematization of belief systems is useful for knowledge discovery in biomedical applications. This involves collaborative input to create useful representations of prior knowledge of the
domain and task, in order to facilitate knowledge discovery. To support this, we present case studies for
three problem areas in computational systems biology: (1) Protein Crystallization Screen Design; (2)
Proteomic Biomarker Discovery from Mass Spectra and (3) Protein sequence-to-structure prediction.

Protein Crystallization Screen Design


Proteins, along with nucleic acids and protein-DNA complexes, are macromolecules. Protein crystallization is the major bottleneck in structure determination of proteins because of the large number of
variables that affect the formation of a protein crystal suitable for X-ray diffraction. The relationships
between these variables and the propensity of a macromolecule to crystallize are still not fully understood. Hence, macromolecule crystallization is largely considered an art rather than a science (Ducruix
& Giege, 1992). Past experience has been utilized to guide experimenters toward initial conditions
that favor crystal growth (for example, agents proposed by Jancarik and Kim (Jancarik & Kim, 1991);
and the Hampton Research Crystal Screen). The idiosyncratic nature of individual proteins requires
optimization of conditions obtained from the first screening step. Conceptually, protein crystal growth
can be divided into two phases: (a) screening a subset of possible experimental conditions to determine
promising areas to search further and (b) optimization of the initial search conditions to yield crystals
of good quality (resolution limit of diffraction < 3.5 Å).
The Biological Macromolecular Crystallization Database (BMCD) is a repository for experimental
data from successful crystallization studies (Gilliland, Tung, & Ladner, 2002). These positive examples
of successful crystallization plans have been subjected to various statistical and machine learning analyses, starting with inductive learning to extract useful associations between crystallization conditions and
obtaining good quality crystals (Hennessy, Gopalakrishnan, Buchanan, Rosenberg, & Subramanian,
1994). Even though the BMCD is a useful resource, the lack of information about failed experimental
setups for each protein that had been successfully crystallized hinders the predictive ability of models
gleaned from this data.
Even though protein crystallization is generally regarded as an art, there exists prior knowledge about
the process by which protein crystals nucleate, grow and stop growing. Figure 1 depicts a hanging drop
vapor diffusion experiment as part of a typical 6 x 4 setup tray. This method is commonly utilized for
protein crystallization, with a droplet of purified protein in solution hanging from the cover slip over a
well containing mother liquor, which is a cocktail of buffer, precipitant and possibly salts at various concentrations. The drop solution contains the protein along with the same cocktail reagents as the mother
liquor, except at lower concentrations. A protein crystal is formed when water evaporates from the drop
at a slow rate in an attempt to reach system equilibrium. Initially, the precipitant concentration in the
drop is too low for crystallization. As the water evaporates from the drop, the
precipitant concentration increases, driving the protein from a liquid to a solid state.
A phase diagram plot of protein concentration versus precipitant concentration in the hanging drop
reveals three regions - undersaturation, saturation and supersaturation. It has been observed that proteins nucleate at a supersaturation close to the boundary between saturation and supersaturation regions;
whereas growth of the crystal takes place only within the saturated region. Hence, a model that takes
this prior knowledge into account can help novice crystallographers better understand how to get the
protein in the liquid phase into an appropriate solid phase to obtain good crystals of diffractible quality. If the precipitant were too concentrated in the drop solution, it is likely that only a solid amorphous
precipitate would result, since the solid phase would occur in a region of supersaturation far from the
boundary between saturation and supersaturation regions. This empirical observation, however, is still
a useful partial result for honing in on appropriate initial crystallization conditions.
Gopalakrishnan et al. (2002) present a simple simulator of protein crystallization. Novice crystallographers can utilize commonly used laboratory reagents virtually to set up experiments for different
hypothetical proteins. The simulator allows the user to observe experiments from time to time by requesting this as another input. Subjective testing showed that the underlying simple model was a reasonable
approximation of the overall behavior of the system by which protein crystals nucleate, grow and stop
growing (Gopalakrishnan, Buchanan, & Rosenberg, 2000; Gopalakrishnan, Buchanan, & Rosenberg,
2002). However, the simulation, though numerical in nature, is meant only to convey qualitative behavior
patterns. The quantitative outputs are not to be taken as what one would observe in the laboratory. Further refinement of the input functions and their parameters based on actual experimental information
would be necessary to utilize the quantitative outputs (such as the resolution limit of diffraction of a
crystal obtained over a particular time interval of simulation).
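The qualitative behaviour being simulated can be conveyed with a deliberately crude toy model: water leaves the hanging drop until the drop's precipitant concentration approaches that of the reservoir, concentrating the protein as it does so. The sketch below is not the simulator of Gopalakrishnan et al.; every rate, volume, and threshold in it is an invented placeholder.

# Toy sketch of hanging-drop vapour diffusion (NOT the cited simulator).
# Water evaporates from the drop until its precipitant concentration nears the
# reservoir's; solutes stay behind, so protein and precipitant both concentrate.
drop_volume = 4.0          # microlitres (invented)
protein_mass = 40.0        # micrograms of protein in the drop (invented)
precipitant_mass = 0.6     # arbitrary units (invented)
reservoir_conc = 0.3       # precipitant concentration in the mother liquor (invented)
evaporation_rate = 0.02    # fraction of the concentration gap closed per step (invented)

for step in range(200):
    drop_conc = precipitant_mass / drop_volume
    if drop_conc >= reservoir_conc:        # (near-)equilibrium reached
        break
    # Evaporation slows as the drop approaches equilibrium with the reservoir.
    drop_volume *= 1.0 - evaporation_rate * (1.0 - drop_conc / reservoir_conc)

print("final drop volume (uL): %.2f" % drop_volume)
print("final protein concentration (ug/uL): %.2f" % (protein_mass / drop_volume))
print("final precipitant concentration: %.3f" % (precipitant_mass / drop_volume))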

Figure 1. Illustration of a typical crystallization tray. Each well has a mother liquor solution containing
buffer and precipitant at a higher concentration than in the hanging drop. The drop contains protein
at a particular concentration initially in the liquid phase. Vapor diffusion drives the protein in the drop
into a solid phase (such as amorphous precipitate or crystal). The rate of vapor diffusion determines
the outcome of an experiment (such as whether or not a solid phase appears, and the quality of crystal
if applicable).

[Figure 1 labels: a tray; a well (an experiment); several experiments in parallel. Example well contents:
Buffer = 0.1 M Tris-HCl pH 8.5; Precipitant = 30% w/v PEG 4000; Salt = 0.2 M MgCl2.]

A systematization of macromolecular classes is presented in Hennessy et al. (2000) and utilized toward
objective screen design using statistical techniques (Hennessy, Buchanan, Subramanian, Wilcosz, &
Rosenberg, 2000). Further systematization of belief systems in crystallization is attempted through the
development and characterization of various chemical hierarchies and relationships within the domain
(Gopalakrishnan, Livingston, Hennessy, Buchanan, & Rosenberg, 2004). Such representations are
utilized within a novel autonomous discovery system for uncovering relationships among parameters
influencing macromolecular crystallization. This system, called the Heuristic Autonomous Model Builder
(HAMB) (Livingston, Rosenberg, & Buchanan, 2003), is a task-based agenda system built on top of
a rule-based machine learning algorithm, Rule Learner (RL) (Clearwater & Provost, 1990). HAMB
utilizes prior knowledge to annotate rule sets that are generated by RL - these represent models of the
data from the BMCD. The annotations are obtained through interactions with the domain expert to
characterize interestingness of features and relationships within the domain. Thus, HAMB is able to
autonomously prioritize learning tasks for RL, and provide appropriate data and target classes based on
interestingness categories. Though interestingness again is captured numerically, the categorization
overall is qualitative. HAMB's analysis of augmented BMCD data results in some novel suggestions
such as: (a) different crystallization methods should be used for specific types of macromolecules; and
(b) different ionic strengths may be required when crystallizing enzymes, heme-containing proteins
and small proteins (Gopalakrishnan et al., 2004).
An important variable that influences protein crystallization is the pH of the experimental setup, and
hence values of this variable are more abundantly reported in the protein data bank (PDB) (Berman et al.,
2002). The entries in the PDB were analyzed by Dougall (2007) in an effort to understand the relationships
between protein sequence properties and the reported pH of crystallization (Dougall, 2007). Since the
pH is more abundantly reported than other variables of protein crystallization, Dougall (2007) was able
to parse the PDB to create substantial training and test datasets (3000 to 5360 examples from different
versions of the database) for statistical and machine learning analyses. This analysis verified that there is
indeed no correlation between the pH at which a protein crystallizes and its isoelectric point (pI), where
the net charge is zero, agreeing with previous work (Kantardjieff & Rupp, 2004). Furthermore, upon
examination of net charge per kDa (kilodalton), which is the specific charge of the protein and can be
estimated from the sequence, it was found that most of the proteins in the training set crystallized at a pH
that resulted in low specific charge. This trend was also observed in the test datasets. This discovery that
most globular proteins in a nonredundant set of the PDB crystallize at a pH that results in a low specific
charge seems to be consistent with well-known facts relating to structural stability of proteins.
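The specific-charge quantity used in this analysis is straightforward to estimate from a sequence; the sketch below uses generic textbook pKa values and a rough average residue mass, so it is only a hedged approximation of the kind of calculation described (not Dougall's actual procedure), and the input sequence is a toy example.

# Estimate net charge per kDa ("specific charge") of a protein at a given pH.
# Generic textbook pKa values and a rough mass estimate; illustration only.
PKA_POS = {"nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}           # basic groups
PKA_NEG = {"cterm": 3.1, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}  # acidic groups
AVG_RESIDUE_MASS = 110.0   # Da per residue, a common rough average

def net_charge(seq, ph):
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS["nterm"]))      # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG["cterm"] - ph))     # free C-terminus
    for aa in seq:
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def specific_charge(seq, ph):
    mass_kda = len(seq) * AVG_RESIDUE_MASS / 1000.0           # crude mass estimate
    return net_charge(seq, ph) / mass_kda

seq = "ACDEFGHIKLMNPQRSTVWYKKRDDEEHH"   # toy sequence, not a real protein
for ph in (4.0, 6.0, 7.5, 9.0):
    print("pH %.1f  net charge %+6.2f  charge/kDa %+6.3f"
          % (ph, net_charge(seq, ph), specific_charge(seq, ph)))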

Proteomic Biomarker Discovery from Mass Spectra


Biomarkers are biological molecules that are indicators of physiologic state and also of change during a
disease process (Srinivas, Srivastava, Hanash, & Wright, 2001). Biomarkers can be utilized to provide
early detection of disease and monitor its progression. Proteomic technologies allow us to detect abundances of various proteins present in clinical samples, thereby providing us with a map of protein-related
changes caused by disease. Proteomics is the study of the entire set of proteins in a cell, and proteomic
biomarker panels that represent states of protein abundances or representations within the cell are useful indicators of patient health within the clinical context. Screening using a panel of biomarkers can
enhance the positive predictive value of a test while minimizing false positives or false negatives. Protein
markers can be used in detection, diagnosis, monitoring of therapy, and ultimately, prevention and risk
assessment (Srinivas, Verma, Zhao, & Srivastava, 2002). Commonly used technologies for proteomic
screening are based on mass spectral (MS) analyses of clinical samples and complex mixtures of proteins
obtained through chromatographic separation.

Figure 2. Generic mass spectrometry (MS)-based proteomics experiment. Figure redrawn from review
in Nature by Aebersold and Mann, 2003. [Figure panels: (1) sample fractionation; (2) tryptic digestion
(SDS-PAGE, excised proteins, peptide mixture); (3) peptide chromatography and ESI; (4) MS; (5) MS/MS;
spectra plotted as relative abundance versus m/z.]
MS measurements are carried out in the gas phase on ionized analytes. By definition, a mass spectrometer consists of an ion source, a mass analyzer that measures the mass-to-charge ratio (m/z) of
the ionized analytes, and a detector that registers the number of ions at each m/z value (Aebersold &
Mann, 2003). The two most common techniques used to volatize and ionize proteins or peptides for
MS analysis are electrospay ionization (ESI) (Fenn, Mann, Meng, Wong, & Whitehouse, 1989) and
matrix-assisted laser desorption/ionization (MALDI) (Karas & Hillenkamp, 1988). ESI desolvates and
ionizes the analytes out of a solution and can be coupled with liquid-based separation tools such as
liquid -chromatography (LC). MALDI sublimates and ionizes samples out of a dry, crystalline matrix
via laser pulses and is normally used to analyze relatively simple peptide mixtures. A recent related
technology is called surface-enhanced laser desorption/ionization (SELDI), where protein chip arrays are
used to provide a variety of surface chemistries for researchers to optimize protein/peptide segregation,
capture and analysis. The chemistries on these chips include classical chromatographic surfaces such as
hydrophobic for reversed-phase capture, cation- and anion-exchange surfaces, and immobilized metal
affinity capture (IMAC) for capturing metal-binding (e.g., phosphorylated) proteins. Bound proteins
are liberated by ionization (Jr et al., 1999).
Figure 2 is adapted from Aebersold and Manns review in Nature (Aebersold & Mann, 2003), and
shows a generic MS-based proteomics experiment. The five stages of a typical proteomics experiment
are depicted, wherein proteins to be analyzed are first extracted from cells (stage 1) and then digested
into peptides (stage 2) in order to enhance sensitivity of detection by mass alone. In stage 3, peptides
elute from high-pressure liquid chromatography (HPLC) column ordered by hydrophobicity, making the
mixture simpler for analysis. In stage 4, a mass spectrometer samples this mixture every few seconds,

131

Computer Aided Knowledge Discovery in Biomedicine

producing a mass spectrum of peptides eluting at each time point. In stage 5, peptides of selected masses
are fragmented into sub-peptides; and a series of tandem mass spectrometric or MS/MS experiments
ensues, producing MS/MS spectra. Peptides are identified based on their MS and MS/MS spectra.
Four basic types of mass analyzers are used in current proteome research (Aebersold & Mann, 2003)
namely - the ion trap, time-of-flight (TOF), quadrupole and Fourier transform ion cyclotron (FT-MS)
analyzers. Each can be used alone or in tandem in order to exploit the strengths of the approaches.
MALDI is usually coupled to TOF analyzers that measure the mass of intact peptides, whereas ESI has
mostly been coupled to ion traps and triple quadrupole instruments and used to generate fragmentation
or collision-induced dissociation (CID) spectra of selected precursor ions (Aebersold & Goodlett, 2001).
Due to its simplicity, excellent mass accuracy, high resolution and sensitivity, MALDI-TOF is a generally used technique to identify proteins by peptide-mass mapping or fingerprinting. Recently introduced
MALDI-TOF/TOF-MS/MS instruments can also be used to generate CID spectra. Identification and
quantification of the components of a complex protein sample is a multi-step operation. While instruments and methods are needed for separating, identifying and quantifying the polypeptides, tools are
also required to integrate and analyze all of the produced data. Proteomics studies necessarily result
in large amounts of data. Analysis of complete proteomes for biomarkers is therefore time-consuming
and poses computational challenges, as the problem can be likened to search of a needle in a haystack.
Quality of the high-throughput proteomic data, like that of high-throughput genomics data, can also
be highly suspect.
Biomarker discovery from clinical proteomic mass spectra requires computational techniques to
select features that are relevant with respect to disease classification. The problem is particularly difficult
due to the small number of available clinical samples, and the large number of features or analytes (m/z
values) whose abundances are traced via mass spectral technologies such as MALDI and SELDI-TOF.
In the machine learning and data-mining field, this problem is basically one of feature selection. Feature selection methods are divided into three typical groups: Filter, Wrapper and Embedded methods
(Hauskrecht, Pelikan, Valko, & Lyons-Weiler, 2007). Filter methods rank each feature according to some
univariate metric, and utilize only the highest ranked features. Wrapper algorithms (Kohavi & John,
1997) search for the best subset of features using all available features and a classification algorithm.
Embedded methods incorporate feature selection as part of the model building process, for example the
CART method (Breiman, Friedman, Olshen, & Stone, 1984).
Gopalakrishnan et al. (2006) describe the process by which biomarkers are discovered and validated for Amyotrophic Lateral Sclerosis (ALS), a rare but rapidly progressing neurodegenerative disease for which
there is currently no cure (Gopalakrishnan, Ganchev, Ranganathan, & Bowser, 2006). Proteomic mass
spectral analyses of cerebrospinal fluid (CSF) samples from control and patient cohorts are performed
using SELDI-TOF (Ranganathan et al., 2005). Statistical techniques and a wrapper-based rule learning (RL-Wrap) approach are applied to select putative biomarker panels that enable early detection of
ALS. Three biomarkers are identified using peptide mass fingerprinting and tandem MS/MS, that are
either decreased (transthyretin, cystatin C) or increased (carboxy-terminal fragment of neuroendocrine
protein 7B2) in ALS CSF (Ranganathan et al., 2005). The latter was discovered only by the use of the
rule learner. More details on the biomarker discovery and validation processes follow.
Ranganathan et al. (2005) utilize Ciphergen ProteinChips (Ciphergen Biosystems, Inc. Palo Alto, CA,
USA) to collect mass spectra from CSF samples. The SAX2 and IMAC binding surfaces yielded quality
spectra with reasonable reproducibility. Total number of subjects used for mass spectrometry analyses
was 23 ALS and 31 controls. Of these, 15 ALS and 21 controls formed the initial training group for
preliminary data analyses, keeping 8 ALS and 10 controls blinded as a test set. For subsequent analyses, all
the data was pooled and test sets were generated at random keeping a fairly even distribution of disease
to control cases. Total ion current of all profiles was used to normalize each of the spectrograms. The
Ciphergen 3.1 Biomarker Wizard application auto detected mass peaks by clustering. This output was
analyzed using non-parametric Mann-Whitney analysis to yield 30 mass ion peaks with statistically
significant (p < 0.01) differences between control and ALS subjects.
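The univariate filtering step just described can be sketched generically as total-ion-current normalisation followed by a per-peak Mann-Whitney test; the peak intensities below are synthetic stand-ins (only the group sizes mirror the study), not the actual ALS/control data.

# Sketch: TIC-style normalisation and per-peak Mann-Whitney filtering.
# Synthetic data only; group sizes mirror the study but intensities are random.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(5)
n_controls, n_cases, n_peaks = 21, 15, 300
X = rng.gamma(2.0, 1.0, size=(n_controls + n_cases, n_peaks))   # peak intensities
y = np.array([0] * n_controls + [1] * n_cases)                  # 0 = control, 1 = ALS
X[y == 1, :5] *= 1.8                   # pretend five peaks differ between groups

X = X / X.sum(axis=1, keepdims=True)   # total-ion-current style normalisation

selected = []
for j in range(n_peaks):
    stat, p = mannwhitneyu(X[y == 0, j], X[y == 1, j], alternative="two-sided")
    if p < 0.01:
        selected.append((j, round(p, 5)))
print("peaks passing p < 0.01:", selected)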
Apart from the univariate analysis, non-linear modeling was performed using an inductive rule-learning algorithm, RL (Clearwater & Provost, 1990). This evolved from Meta-DENDRAL (Feigenbaum &
Buchanan, 1993) which was used for predicting mass spectra of complex organic molecules. RL primarily
searches a space of possible rules (IF-THEN statements) by successive specialization, guided by data in
the training set and by prior knowledge. It learns a disjunctive set of weighted conjunctive rules. One of
the distinguishing features of RL is its knowledge-based approach to learning. Prior knowledge can be
used to guide the process by which its rule-based models are constructed. This may include the legal
semantics and syntax for the rules plus additional bias about plausible and implausible relationships that
are well agreed upon by the domain experts. RL models are easily understood by domain experts. This
has been crucial to the successful application of RL to knowledge discovery in the various biomedical
domains discussed herein.
For this study, both raw and normalized spectral data were analysed. Due to the large space of features, a wrapper algorithm was developed which selected features and ran RL for each subset, until all
features had been seen at least once by RL. Then, the rule-based models obtained from each subset are
parsed to pool together only those features that appeared in the models. Then, RL was again applied to
this final subset of features using the training data as input. The final model is then used to predict the
class labels for the test data. A three-fold or five-fold cross-validation is utilized to tune the RL input
parameters during the training. These parameters are then utilized for obtaining the final set of rules.
To apply the rules to predict test data, the inference engine applies evidence gathering using weighted
voting. Each rule in the model has a certain weight associated with it that reflects its degree of certainty
as gathered from the training data. The degree of certainty is based on the proportion of correct class
labels assigned by that rule to the total number of matches to the rule's antecedents.
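The evidence-gathering step can be illustrated with a small, hypothetical sketch. The rule representation, the second rule, and the certainty-factor choice below (one simple option is the fraction of correct labels among training matches) are assumptions made for illustration rather than RL's actual implementation; the sample rule quoted next in the text follows the same pattern.

from dataclasses import dataclass

@dataclass
class Rule:
    feature: str   # e.g. a normalized m/z intensity such as "zn3010.6341"
    low: float
    high: float
    label: str     # predicted class, e.g. "ALS" or "Control"
    cf: float      # certainty factor estimated from the training data

def matches(rule, sample):
    value = sample.get(rule.feature)
    return value is not None and rule.low <= value <= rule.high

def classify(sample, rules):
    """Weighted voting: sum the certainty factors of all matching rules per label."""
    votes = {}
    for rule in rules:
        if matches(rule, sample):
            votes[rule.label] = votes.get(rule.label, 0.0) + rule.cf
    return max(votes, key=votes.get) if votes else "unclassified"

rules = [Rule("zn3010.6341", 0.302, 0.481, "ALS", 0.873),       # cf. the sample rule below
         Rule("zn4823.1100", 0.100, 0.250, "Control", 0.640)]   # invented second rule
print(classify({"zn3010.6341": 0.41, "zn4823.1100": 0.30}, rules))   # -> ALS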
A sample rule produced from RL applied to proteomic mass spectra (Gopalakrishnan et al., 2006)
is shown below:
((zn3010.6341 in 0.302..0.481)) ==> (GROUP = ALS)
CF=0.873, TP=10, FP=0,Pos=18,Neg=22
This rule states that if the normalized abundance of m/z value (or mass value) 3010.6341 produced
from a Zn-IMAC SELDI chip is in the range of 0.302 to 0.481, the sample is classified as ALS with a
certainty factor of 0.873. There are several functions available within RL for calculating the certainty
factor for each rule. The number of true positives covered by that rule is 10 out of a total of 18 positive
examples. The total number of controls and ALS cases analyzed in this particular RL run is 40. Several
models were obtained from the RL analyses. One best model was chosen that performed very well on
a randomly chosen test set containing 7 ALS and 5 control cases. The twelve features in this model were chosen for further experimental analyses, namely peptide mass fingerprinting and tandem MS/MS analyses. One novel peptide biomarker was identified directly via tandem MS/MS sequencing that yielded a 100% sequence match with the carboxy-terminal fragment of neuroendocrine protein 7B2.
This finding presents a clue to the metabolic pathways of ALS and is both interesting and significant
from a biomedical standpoint.
Understanding and systematization of this process by which biomarkers are discovered from proteomic mass spectra will facilitate biomedical knowledge discovery as illustrated in the above case
study. This would include obtaining and encoding prior knowledge of task-specific and domain-specific
information. For example, knowledge about the type of clinical sample (e.g., blood, urine, CSF, tears,
saliva) and known abundances of proteins found in that source would enable better feature selection for
disease-specific biomarker discovery. Other types of prior knowledge would include: (a) relationships
between proteins found in pathways and (b) the encoding of specific domain knowledge such as bond
fragmentation patterns in MS data. Machine learning algorithms that can utilize such prior knowledge
to help constrain the hypothesis space will be immensely useful in such tasks.

Protein Sequence to Structure Prediction and Complex Fold Recognition


Protein sequence to structure prediction is a fundamental problem in bioinformatics. Protein structure is described hierarchically starting with primary sequence represented by the polypeptide chain
of amino acids, followed by secondary structure classified mainly into helix, sheet and coil. Tertiary
structure represents the entire three-dimensional structure of a protein described by the location and
spacing between atoms of one polypeptide chain. Quaternary structure refers to the three dimensional
structure of more than one polypeptide chain, for example the hemoglobin macromolecule contains
four polypeptide chains. Protein sequence databases such as Swiss-Prot are much more populated than the structure database (the PDB) due to the various difficulties associated with protein structure determination, as discussed earlier. Hence, prediction of protein structure from sequence information is
commonly performed as an attempt to model aspects of actual structure. It is believed that such models
might provide clues to protein function.
Protein secondary structure prediction has been extensively studied (Cuff and Barton, 1999; Rost,
2001). Since protein secondary structure refers to repetitive structural components based only on the
backbone of a polypeptide chain without consideration of the individual amino acid side-chains and their
interactions, there is an inherent information loss. Secondary structure prediction algorithms typically
assign a prediction per residue along the sequence. Invariably, this results in loss of predictive accuracy
per residue at the global sequence level of the polypeptide chain. The most difficult predictions are at the
boundaries of helices, sheets and coils. Recent improvements have been made to structure prediction
algorithms by incorporating evolutionary information, and also by combining the results of multiple
independent prediction methods into a consensus prediction (Rost, 2001).
Liu et al. (2004) compare probabilistic score combination methods for secondary structure prediction.
Herein, the combination problem for protein sequences is formulated to take into account both short
and long-range interactions so as to be able to consider the protein sequence as a whole as opposed to
window-based methods which may not be able to capture constraints and dependencies of long-range
interactions (Liu, Carbonell, Klein-Seetharaman, & Gopalakrishnan, 2004). Traditional window-based
combination methods are compared with graphical models such as Hidden Markov Models (HMMs),
Maximum Entropy Markov Models (MEMMs) (McCallum, Freitag, & Pereira, 2000) and chain-structured Conditional Random Fields (CRFs) (Lafferty, McCallum, & Pereira, 2001). Using the CB513
dataset by Cuff and Barton (1999), which contains 513 non-homologous proteins with an SD score of less than five, a gold-standard protein secondary structure assignment is made using the DSSP definition (Kabsch & Sander, 1983). This definition is based on hydrogen bonding patterns and geometrical constraints. DSSP assigns one of eight labels to each amino acid; these labels are reduced to a standard set of three based on the discussion by Cuff and Barton (1999): H and G to Helix (H), E and B to Sheets (E), and all others to Coil (C). The state-of-the-art performance achieved by window-based methods relies on PSI-BLAST profiles (Jones, 1999). Liu et al. (2004) tried four graphical models and discovered that CRFs perform the best. Overall, the graphical models for score combination yielded better accuracies for secondary structure prediction on this dataset compared to window-based methods. Standard prediction accuracy measures were utilized, such as the Q3 score for overall per-residue accuracy, Matthews correlation coefficients per structure type (CH, CC, CE), and segment of overlap (SOV) (Rost, Sander, & Schneider, 1994; Zemla, Venclovas, Fidelis, & Rost, 1999).

Figure 3. (A) Representative structure for the parallel beta-helix fold from the PDB; (B) prior knowledge example: one rung of the repetitive beta-helix fold containing the conserved B2-T2-B3 segment (figure adapted from Cowen, Bradley, Menke, King, & Berger, 2002); and (C) representative triple beta-spiral structure from the PDB.
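Returning to the label reduction and accuracy measure described above, the following minimal Python sketch applies the eight-to-three mapping (H, G to Helix; E, B to Sheet; all others to Coil) and computes the Q3 score as the fraction of correctly predicted residues; both sequences are hypothetical.

# DSSP eight-state to three-state reduction and per-residue Q3 accuracy.
EIGHT_TO_THREE = {"H": "H", "G": "H",   # helices
                  "E": "E", "B": "E"}   # sheets; everything else becomes coil (C)

def reduce_labels(dssp_states):
    return "".join(EIGHT_TO_THREE.get(s, "C") for s in dssp_states)

def q3(predicted, observed):
    assert len(predicted) == len(observed)
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

observed = reduce_labels("HHHHGGTTEEEEBSCC")   # hypothetical DSSP assignment
predicted = "HHHHHCCCEEEEECCC"                 # hypothetical three-state prediction
print(observed, round(q3(predicted, observed), 3))   # HHHHHHCCEEEEECCC 0.938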
Having discovered that CRFs performed very well for modeling protein secondary structure prediction, particularly beta sheets that involve long-range interactions, Liu et al. (2006) proposed a variant named Segmentation Conditional Random Fields (SCRFs) for modeling complex protein folds. A protein fold refers to a frequently occurring arrangement of several secondary structure elements, such as a characteristic structural motif. In contrast to traditional graphical models such as the Hidden Markov Model (HMM), SCRFs
follow a discriminative approach allowing any set of features to be included. Globally optimal solutions are obtained for the model parameters using a convex optimization function. Potential long-range
interactions are modeled within a special kind of graph, called the protein structural graph (Liu, Carbonell, Weigele, & Gopalakrishnan, 2006). This graph has an additional set of edges between nodes that
represent segments of a protein sequence, classified into secondary structure elements that are within
and outside the fold of interest.
Given a protein sequence and a structural graph representing the repetitive elements of that fold, the
SCRF models the conditional probability of a segmentation of the protein sequence given that graph.
The parameters of the model are estimated from the training data, based on features extracted from
sequences belonging to the fold family. The right-handed parallel beta-helix fold (Figure 3 A&B) is modeled successfully using SCRFs. Only 14 structures with this fold have been resolved. These belong to
nine beta-helix families of closely related proteins in the SCOP database (Murzin, Brenner, Hubbard, &
Chothia, 1995). Cross-family validation shows that the SCRF model performs comparably to the heuristic BetaWrap algorithm (Bradley, Cowen, Menke, King, & Berger, 2001) developed for modeling
the parallel beta helix fold. The generality of the SCRF model makes it useful for protein fold recognition. Potential members of the beta-helix fold family are identified from the UniProt reference databases
(Leinonen et al., 2004). A few of these predictions were subsequently confirmed as beta-helices from
recently crystallized proteins whose structures were resolved by X-ray crystallography (Liu, 2006).
Quaternary structural folds consist of multiple protein chains that form chemical bonds among side
chains to reach a structurally stable domain. The complexity associated with modeling the quaternary
fold is handled in Liu et al. (2007) using a segmentation conditional graphical model (SCGM) along
with efficient inference and learning algorithms for training and testing. This complexity arises because
the labels of all chains have to be considered jointly since every chain contributes to the stability of the
overall structure (Liu, Carbonell, Gopalakrishnan, & Weigele, 2007). While the SCGM model provides a
flexible and scalable framework for utilizing structural properties of a protein fold, the choice of feature
functions that represent segment-based features is key to accurate prediction. Two types of features
are discussed in Liu et al. (2007). The node feature covers properties of an individual segment, such as average hydrophobicity or the length of the segment. The pairwise feature captures dependencies
between segments whose corresponding subsequences form chemical bonds in the three-dimensional
structures (Liu et al., 2007).
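As a concrete, hypothetical illustration of such node features, the sketch below computes the length and the average Kyte-Doolittle hydrophobicity of one sequence segment; the actual SCGM feature functions of Liu et al. (2007) are considerably richer than this.

# Simplified node-feature computation for one protein segment: segment length
# and average hydrophobicity on the Kyte-Doolittle scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def node_features(segment):
    """Return (length, average hydrophobicity) for one protein segment."""
    length = len(segment)
    avg_hydrophobicity = sum(KYTE_DOOLITTLE[aa] for aa in segment) / length
    return length, avg_hydrophobicity

length, hydro = node_features("VLIMFWAG")   # hypothetical beta-strand segment
print(length, round(hydro, 2))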
Liu et al. (2007) model two complex protein folds, the triple beta-spiral (van Raaij, Mitraki, Lavigne, & Cusack, 1999) and the double-barrel trimer (Benson, Bamford, Bamford, & Burnett, 2004). The protein structure graph for the triple beta-spiral (TBS) is constructed based on observed characteristics
of members of that fold family whose structures had been resolved and deposited in the PDB. Weigele
et al. (2003) study members of the TBS fold family (Figure 3C) and make several observations about
the structural characteristics (Weigele, Scanlon, & King, 2003). Such constraints constitute an example
of prior biological knowledge constraining the model space (Liu et al., 2007).
In this subsection, the problem of complex fold recognition is studied and novel methods that utilize probabilistic graphical modeling are discussed. The methods are variants of conditional graphical models and can use prior information available about structural folds in order to model complex folds such as the triple beta-spiral. The prior information encoded is based on known partial sequence-structure information and sequence alignments of the protein families. The methods are significant in
that they comprise general models for the particular complex fold being recognized and perform well
even though the members of the fold family have very little sequence homology, and there exists only
a sparse representation of resolved structures for the fold in the PDB.

FUTURE TRENDS
Systems biology applications are of great importance to the future of biomedicine. The case studies
described in this chapter represent significant research areas that challenge computational biology.
Systematization of prior knowledge will enable efficient solutions to these challenges. Ontologies are
beginning to gain popularity in sub-domains of bioinformatics. Enriching the information space provided by such ontologies will enable rapid retrieval for analysis purposes. Integration of omic data with clinical and health records is an essential aspect of biomedical knowledge discovery. Furthermore, the
discovery of more protein sequence-structure-function relationships that are core to systems biology
will enhance the quality of bioinformatic databases.

CONCLUSION
This chapter presents a perspective on successful use of intelligent computational aids for knowledge
discovery in three collaborative areas of biomedicine. The case studies represent different applications
of machine learning techniques to various important biological problems, namely macromolecular
crystallization, biomarker discovery from proteomic mass spectra and protein structure prediction via
fold recognition. A common theme is the utilization of protein sequence information to derive protein-specific properties that can serve as prior knowledge for guiding knowledge discovery. The learning
algorithms are unique in that they can represent prior knowledge and use this to constrain the search
space of possible models that can fit the data. Rule-based machine learning systems can facilitate
biomedical knowledge discovery since these models are easily understood. Graphical models enable
effective representation of prior knowledge that captures structure within the data.

REFERENCES
Aebersold, R., & Goodlett, D. R. (2001). Mass spectrometry in proteomics. Chem Rev, 101(2), 269-295.
Aebersold, R., & Mann, M. (2003). Mass spectrometry-based proteomics. Nature, 422(6928), 198-207.
Benson, S. D., Bamford, J. K., Bamford, D. H., & Burnett, R. M. (2004). Does common architecture
reveal a viral lineage spanning all three domains of life? Mol Cell, 16(5), 673-685.
Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W. F., Bourne, P. E., Burkhardt, K., et al. (2002). The
protein data bank. Acta Crystallogr D Biol Crystallogr, 58(6, 1), 899-907.
Bradley, P., Cowen, L., Menke, M., King, J., & Berger, B. (2001). BETAWRAP: Successful prediction
of parallel beta -helices from primary sequence reveals an association with many microbial pathogens.
Proc Natl Acad Sci U S A, 98(26), 14819-14824.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees.
Belmont, CA: Wadsworth International Group.
Clearwater, S., & Provost, F. (1990). RL4: A tool for knowledge-based induction. Paper presented at the
Second International IEEE Conference on Tools for Artificial Intelligence (TAI-90).
Cowen, L., Bradley, P., Menke, M., King, J., & Berger, B. (2002). Predicting the beta-helix fold from
protein sequence data. J Comput Biol, 9(2), 261-276.
Dougall, D. (2007). A protein sequence-properties evaluation framework for crystallization screen
design. Unpublished Doctoral Dissertation, University of Pittsburgh.


Ducruix, A., & Giege, R. (1992). Crystallization of nucleic acids and proteins: A practical approach.
New York: Oxford University Press.
Feigenbaum, E. A., & Buchanan, B. G. (1993). Dendral and meta-dendral - Roots of knowledge systems
and expert system applications. Artificial Intelligence, 59(1-2), 233-240.
Fenn, J. B., Mann, M., Meng, C. K., Wong, S. F., & Whitehouse, C. M. (1989). Electrospray ionization
for mass spectrometry of large biomolecules. Science, 246(4926), 64-71.
Gilliland, G. L., Tung, M., & Ladner, J. E. (2002). The biological macromolecule crystallization database:
Crystallization procedures and strategies. Acta Crystallogr D Biol Crystallogr, 58(6, 1), 916-920.
Gopalakrishnan, V., Buchanan, B. G., & Rosenberg, J. M. (2000). Intelligent aids for parallel experiment
planning and macromolecular crystallization. Proc Int Conf Intell Syst Mol Biol, 8, 171-182, California.
Gopalakrishnan, V., Buchanan, B. G., & Rosenberg, J. M. (2002). A simple simulator of protein crystallization. Journal of Applied Crystallography, 35(6), 727-733.
Gopalakrishnan, V., Ganchev, P., Ranganathan, S., & Bowser, R. (2006). Rule learning for disease-specific biomarker discovery from clinical proteomic mass spectra. Springer Lecture Notes in Computer
Science, 3916, 93-105.
Gopalakrishnan, V., Livingston, G., Hennessy, D., Buchanan, B., & Rosenberg, J. M. (2004). Machine-learning techniques for macromolecular crystallization data. Acta Crystallogr D Biol Crystallogr, 60(Pt
10), 1705-1716.
Hauskrecht, M., Pelikan, R., Valko, M., & Lyons-Weiler, J. (2007). Feature selection and dimensionality
reduction in genomics and proteomics. In W. Dubitzky, M. Granzow & D. P. Berrar (eds.), Fundamentals
of data mining in genomics and proteomics. Springer.
Hennessy, D., Buchanan, B., Subramanian, D., Wilcosz, P. A., & Rosenberg, J. M. (2000). Statistical
methods for the objective design of screening procedures for macromolecular crystallization. Acta
Crystallographica D, 56, 817-827.
Hennessy, D., Gopalakrishnan, V., Buchanan, B. G., Rosenberg, J. M., & Subramanian, D. (1994).
Induction of rules for biological macromolecule crystallization. Proc Int Conf Intell Syst Mol Biol, 2,
179-187.
Jancarik, J., & Kim, S. H. (1991). Sparse matrix sampling: A screening method for crystallization of
proteins. Journal of Applied Crystallography, 24(409).
Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices.
J Mol Biol, 292(2), 195-202.
Wright, G. L., Jr., Cazares, L. H., Leung, S. M., Nasim, S., Adam, B. L., Yip, T. T., et al. (1999). ProteinChip(R)
surface enhanced laser desorption/ionization (SELDI) mass spectrometry: A novel protein biochip
technology for detection of prostate cancer biomarkers in complex protein mixtures. Prostate Cancer
Prostatic Dis, 2(5/6), 264-276.


Jurisica, I., & Wigle, D. (2006). Knowledge discovery in proteomics. Chapman & Hall/CRC.
Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: Pattern recognition of
hydrogen-bonded and geometrical features. Biopolymers, 22(12), 2577-2637.
Kantardjieff, K. A., & Rupp, B. (2004). Protein isoelectric point as a predictor for increased crystallization screening efficiency. Bioinformatics, 20(14), 2162-2168.
Karas, M., & Hillenkamp, F. (1988). Laser desorption ionization of proteins with molecular masses
exceeding 10,000 daltons. Anal Chem, 60(20), 2299-2301.
Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2),
273-324.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probablistic models for
segmenting and labeling sequence data. Paper presented at the Eighteenth International Conference on
Machine Learning (ICML01), Williamstown, MA.
Langley, P. (1998). The computer-aided discovery of scientific knowledge. Paper presented at the First
International Conference on Discovery Science. Fukuoka, Japan.
Leinonen, R., Diez, F. G., Binns, D., Fleischmann, W., Lopez, R., & Apweiler, R. (2004). UniProt archive. Bioinformatics, 20(17), 3236-3237.
Liu, Y. (2006). Conditional graphical models for protein structure prediction. Unpublished Doctoral
Dissertation, Carnegie Mellon University, Pittsburgh.
Liu, Y., Carbonell, J., Gopalakrishnan, V., & Weigele, P. (2007). Protein quaternary fold recognition
using conditional graphical models. Paper presented at the International Joint Conference on Artificial
Intelligence - IJCAI07.
Liu, Y., Carbonell, J., Klein-Seetharaman, J., & Gopalakrishnan, V. (2004). Comparison of probabilistic
combination methods for protein secondary structure prediction. Bioinformatics, 20(17), 3099-3107.
Liu, Y., Carbonell, J., Weigele, P., & Gopalakrishnan, V. (2006). Protein fold recognition using segmentation conditional random fields (SCRFs). Journal of Computational Biology, 13(2), 394-406.
Livingston, G., Rosenberg, J. M., & Buchanan, B. G. (2003). An agenda- and justification-based framework for discovery systems. Journal of Knowledge and Information Systems, 5(2), 133-161.
McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information
extraction and segmentation. Paper presented at the Seventeenth International Conference on Machine
Learning (ICML00), Stanford, CA, USA.
Murzin, A. G., Brenner, S. E., Hubbard, T., & Chothia, C. (1995). SCOP: A structural classification of
proteins database for the investigation of sequences and structures. J Mol Biol, 247(4), 536-540.
Ranganathan, S., Williams, E., Ganchev, P., Gopalakrishnan, V., Lacomis, D., Urbinelli, L., et al. (2005).
Proteomic profiling of cerebrospinal fluid identifies biomarkers for amyotrophic lateral sclerosis. J
Neurochem, 95(5), 1461-1471.


Rost, B., Sander, C., & Schneider, R. (1994). Redefining the goals of protein secondary structure prediction. J Mol Biol, 235(1), 13-26.
Russell, S. A., Old, W., Resing, K. A., & Hunter, L. (2004). Proteomic informatics. Int Rev Neurobiol,
61, 127-157.
Srinivas, P. R., Srivastava, S., Hanash, S., & Wright, G. L., Jr. (2001). Proteomics in early detection of
cancer. Clin Chem, 47(10), 1901-1911.
Srinivas, P. R., Verma, M., Zhao, Y., & Srivastava, S. (2002). Proteomics for cancer biomarker discovery.
Clin Chem, 48(8), 1160-1169.
van Raaij, M. J., Mitraki, A., Lavigne, G., & Cusack, S. (1999). A triple beta-spiral in the adenovirus
fibre shaft reveals a new structural motif for a fibrous protein. Nature, 401(6756), 935-938.
Weigele, P. R., Scanlon, E., & King, J. (2003). Homotrimeric, beta-stranded viral adhesins and tail
proteins. J Bacteriol, 185(14), 4022-4030.
Zemla, A., Venclovas, C., Fidelis, K., & Rost, B. (1999). A modified definition of Sov, a segment-based
measure for protein secondary structure prediction assessment. Proteins, 34(2), 220-223.

KEY TERMS
Clustering: The unsupervised grouping of data items in the absence of class labels.
Conditional Random Fields (CRFs): These are undirected discriminative graphical models that
directly compute the conditional likelihood of a hidden state sequence (y) given the observation sequence
(x). This P(y|x) is proportional to the product of the potential functions over all the cliques in the graph.
CRFs define the clique potential as an exponential function and guarantee finding of the global optimum
since the optimization function is convex (Lafferty et al., 2001). Forward and backward probability calculations are derived similarly to HMMs. Unlike HMMs, no assumptions are made about independence
of the observed features. The feature definition can also be arbitrary, including overlapping features
and long-range interactions (Liu et al., 2006).
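Stated compactly in standard notation (following Lafferty et al., 2001, rather than a formula specific to this chapter), a CRF models

P(y \mid x) = \frac{1}{Z(x)} \prod_{c} \exp\big( \sum_{k} \lambda_k f_k(y_c, x) \big), \qquad Z(x) = \sum_{y'} \prod_{c} \exp\big( \sum_{k} \lambda_k f_k(y'_c, x) \big),

where the product runs over the cliques c of the graph, the f_k are arbitrary (possibly overlapping, long-range) feature functions, the \lambda_k are learned weights, and the normalizer Z(x) sums over all candidate label sequences; the log-likelihood is convex in the \lambda_k, which underlies the global-optimum guarantee mentioned above.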
Feature Extraction: The process of extracting and building features from raw data such as the
amino acid sequence of a protein. Feature functions are utilized to extract and process informative
features that are useful for prediction.
Hidden Markov Models (HMMs): These are directed chain-structured probabilistic graphical
models that are generative in nature. They assume that the data are generated by a particular model and
compute the joint distribution, P(x, y) of the observation sequence x, and the hidden state sequence y.
Inductive Rule Learning: The development of IF-proposition-THEN-concept rule-based models
from feature vectors, which are (attribute, value) pairs that describe the training examples. The rule-based models are expected to generalize to classify test examples accurately.


Metabolomics: The study of small molecule metabolites and their expression within a system or
organism.
Supervised Machine Learning: The use of class labels as prior knowledge to learn discriminative
models from training examples consisting of feature vectors descriptive of the target class.
X-Ray Crystallography: The most general method for experimental determination of the three-dimensional structure of proteins and other macromolecules. A good-quality crystal is obtained first from a purified sample and then subjected to X-ray diffraction.

Section III

Genomics and Bioinformatics for Systems Biology

Chapter VIII

Function and Homology of Proteins Similar in Sequence: Phylogenetic Profiling

Thomas Meinel
Max Planck Institute for Molecular Genetics, Germany

ABSTRACT
The function of proteins is a main subject of research in systems biology. Inference of function is required now, more than ever, by the influx of novel protein sequences resulting from the discovery of new proteomes. The calculation of sequence similarity is a readily feasible way to compare proteins. The comparison of complete proteomes touches one of the earliest topics in bioinformatics: the biologically meaningful organization of proteins into protein families. Several approaches that interpret function or evolutionary aspects of proteins from sequence similarity are reviewed, reflecting the arsenal of techniques introduced to date. Phylogenetic profiling, a method
that compares a set of genes or proteins by their presence or absence across a given set of organisms,
is also presented in this chapter. Proteins in a functional context, for example, a pathway or a protein
complex, are represented by identical or similar phylogenetic profiles. The detection of functional contexts by phylogenetic profiling also holds promise as an analytic tool in systems biology.
Already established tools for phylogenetic profiling as well as particular biological examples based on
the SYSTERS protein family data set are presented.

INTRODUCTION
Protein sequence similarity is a feature that plays a central role in comparative proteomics for the inference of protein function and the analysis of protein evolution. To study the complexity of functional cellular units like proteins, basic research can often only be conducted on animal models, for which a complete experimental design is available. These studies are expected to play a significant role in medical research.
Results are considered to be comparable with results from clinical diagnostics. Where experiments in humans are not possible, it must be confirmed that results are transferable to human proteins. Therefore, the determination of proteins with identical function in animal models and in humans is essential.
Evolution of life can be characterized by the development of species as well as by the divergence of protein sequences, and it is notable that the development of protein function is also an evolutionary process. Several more or less independent biological research fields have been established to elucidate the background of particular evolutionary events, often constrained by the methods available at the time. The development of techniques for rapid, high-volume DNA sequencing continues to yield complete gene sets, genomes, for an increasing list of organisms. Computational techniques calculate the translation of genomic information to proteins, including processes like alternative splicing or translation start site variation. The generation of all electronically inferred proteomes is based on such specific algorithms. In parallel to the development of those tools, new evolutionary events were detected and investigated. Examples include gene duplication, gene fusion or fission, protein domain rearrangements, horizontal gene transfer, and multiple gene copy numbers.
In its first part, this book chapter emphasizes the reasons for distinguishing between sequence similarity, homology, and function of proteins. Sequence similarity is a parameter that can be computed
from a simple biophysical trait of a protein, the sequence, i.e., the primary protein structure. However,
it is more complex to determine protein homology, even if it is plausible that proteins with common
evolutionary history are similar in sequence. The other way around, in the context of bioinformatics,
inference of homology is the interpretation of an observation - namely sequence similarity. The goal
here is to determine similar proteins in recent organisms as descendants from a gene with common
ancestry and thereby as homologs. The matter of inference of protein function can be discussed in the
same way: similar proteins possess, with high probability, a common ancestry and therefore a similar function. But proteins can adopt new functions or become specialized during their evolutionary history. Therefore, proteins of similar function do not necessarily possess similar sequences.
Consequently, it is necessary to know existing protein sequence comparison methods and underlying methods for the partitioning of proteins into protein families. In fact, a scientist works with more
or less closely related members of protein families when using an expression like "two homologous genes." The methodological backgrounds of established data sets are therefore briefly reviewed in this book
chapter.
It is observed that proteins in a common functional context are evolutionarily conserved in most of the organisms that possess such a functional context. A phylogenetic profile is a pattern of presence or absence of a gene or protein across a given set of organisms. Phylogenetic profiling is a method that compares proteins by their phylogenetic profiles. Because proteins differ between organisms, a grouping of proteins is essential for the generation of a phylogenetic profile. Phylogenetic profiling therefore depends on the method used to partition the protein space into protein families. In its second part, this book chapter reviews the backgrounds of established phylogenetic profiling tools, restrictions to subsets of organisms at the super-kingdom level, and general limitations for the detection of functional contexts, and provides particular biological examples of phylogenetic profiles. Phylogenetic profiling as
a method to infer unknown protein contexts or to elucidate contexts of proteins unknown in function
becomes prospectively relevant in systems biology, and more so with the increasing number of complete
eukaryotic proteomes.
The function of proteins is a central issue of this review, both as the subject of detection for unknown proteins using sequence similarity and as the subject of inference of functional contexts in phylogenetic profiling.

FUNCTION OF PROTEINS
Predicting and analyzing the function of novel proteins presents one of the main challenges in computational biology, more than ever since the number of proteins continuously grows with the increasing number of completely sequenced organisms. Functionality models emerged from the analysis of particular chemical parameters, e.g., enzyme kinetics, and led to sketches of selective pathway maps.
Nowadays, this knowledge provides the basis for comparative methods of protein function detection,
as one of the main research topics in the field of systems biology.

Annotating Function of Proteins


Only three percent of proteins in databases that are annotated with substantive information, i.e., annotations other than "unknown" or "hypothetical", are supported by experiments (Brown & Sjölander, 2006). For most protein sequences, the functional annotation is acquired electronically from other, similar sequences. This crucial point is well known and leads to an overstated reliability of sequence annotation for proteins (Valencia, 2005). Systematic errors in databases can thereby be propagated and, because they are seldom detected, are not necessarily corrected. Some approaches take this into account and restrict categorization to physical parameters such as structural properties or sequence similarity. They do not take responsibility for functional classification. The SYSTERS protein family approach (Krause, Stoye, & Vingron, 2005), for example, relies only on sequence similarity; the respective data set references annotations of protein sequence entries by extracting the most common keywords.
It is more advantageous to deduce protein function from structural or phylogenomic properties (Krishnamurthy, Brown, Kirshner, & Sjölander, 2006; Brown & Sjölander, 2006). Relying on phylogenomic trees includes the evolutionary aspect more than using purely sequence parameters. For instance, genome duplications or domain shuffling are better accounted for, and spatial specialization or neofunctionalization of paralogous genes can be better detected. Excluding the so-called in-paralogs, groups of orthologs can be more precisely classified by phylogenomics, but this is always difficult because evolutionary distances can vary extremely for particular orthologous groups (Luz & Vingron, 2006). Even if the three-dimensional protein structure stays conserved, a single exchange of amino acid residues at substrate-specific reaction centers can critically lead to a change in function. But it is very difficult to define general rules for the inference of a genome-scale orthology data set. A performance assessment of orthology detection (Chen, Mackey, Vermunt, & Roos, 2007) used statistical techniques for a pairwise comparison of several methods, some of which are explained in the data sets section.
Shift of function in protein subfamilies presents a special aspect of function prediction (Brown, Krishnamurthy, & Sjölander, 2007). Such conversions of protein function are analyzed and collected
in the FunShift database (Abhiman & Sonnhammer, 2005a; 2005b). Subfamilies that are derived from
Pfam families are analyzed (Mistry, Bateman, & Finn, 2007). They can possess conservation shifting
sites, i.e., different amino acid residues at conserved sites with probable functional evidence, as well as
rate shifting sites, i.e., sites of different evolutionary rates in two subfamilies.

Categorization According to Predefined Functional Classes


Enzymes or pathways include per se information on identified function. Enzyme data collections comprising experimental results, such as IntEnz (Fleischmann et al., 2004) or BRENDA (Barthelmes, Ebeling, Chang, Schomburg, & Schomburg, 2007), have been developed. Functional enzymes in a pathway
context are manually curated and analyzed in the KEGG database (Kanehisa et al., 2008) or Reactome
(Vastrik et al., 2007).
The GO annotation is a collection of structural and functional terms introduced by the Gene Ontology Consortium (Ashburner et al., 2000). It attempts to annotate each gene or protein with GO terms, i.e., characterizing descriptions, and GO identifiers for standardization. Three categories (biological process, molecular function, cellular component) structure the ontology; GO terms are organized in sublevels. Dedicated applications, both GO Consortium tools and non-consortium tools, are available for searching and browsing, annotation, and microarray analysis support, and they verify annotations through statistical support.

Ubiquitous and Organism Group-Specific Protein Function


Housekeeping and information transduction genes are essential for a living organism and therefore exist as ubiquitous proteins in almost all organisms. Using those functional genes enables advantageous phylogenetic or orthology analyses. Approaches that focus on the inference of the tree of life are frequently compared against the phylogeny of the ubiquitous 16S rRNAs, which serves as a reference tree. In contrast, loss of essential genes that participate in pathways of the host organism is observed in parasitic or symbiotic organisms. There also exist many highly specialized proteins found only in small groups of organisms. They contribute, beyond gene regulation phenomena, to the individuality of those species.

SEQUENCE SIMILARITY AND HOMOLOGY

A sequence comparison procedure analyzes the primary protein structure of two proteins, the sequence.
This is an observation and can be interpreted in an evolutionary sense as homology. In this section, the
basics for sequence similarity and homology are described.

Similarity of Sequences
The basis of most protein categorization approaches is a sequence comparison; well-known routines include BLAST (Altschul et al., 1997) and the Smith-Waterman algorithm (Smith & Waterman, 1981). These widespread tools are known as applications that support fast access to biological data resources and prove their strengths in, for instance, the general task of confirming whether a newly detected protein is similar to an already known protein. They differ in technical background, algorithmic assumptions, and time efficiency, but they generally consider local sequence similarity. Therefore, conserved protein substructures, mostly found as protein domains, govern a sequence comparison. The most frequently used comparison parameter is the expectation value (e-value), which reflects the probability that a match between two protein sequences would be found by chance in a sequence database. The e-value depends on the sequence length as well as the database size. Thus, when two sequences obtained from two organisms are each queried against the proteome of the respective other organism, the e-values differ if the proteomes differ in size, even for identical sequences. This leads to the insight that an e-value is a parameter suited only for relative considerations and also has to be determined in both reciprocal directions.
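This size dependence can be made explicit with the standard Karlin-Altschul statistics underlying BLAST (a general formula, not one introduced in this chapter): the expected number of chance local alignments scoring at least S is approximately

E \approx K \, m \, n \, e^{-\lambda S},

where m is the (effective) query length, n the (effective) size of the database searched, and K and \lambda are constants fitted to the scoring system. The same alignment score S therefore yields a larger e-value against a larger proteome, which is why e-values should be read relative to the database searched and checked in both reciprocal directions.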


Orthology, Paralogy
Orthology and paralogy are key concepts of evolutionary genomics and are therefore discussed in depth (Koonin, 2005; Fitch, 2000; Sonnhammer & Koonin, 2002). Homology includes the terms orthology and paralogy. Two genes that diverged after the last speciation event have evolved through two independent evolutionary processes into physically distinguishable protein sequences in the two organisms; such genes are homologs. Underlying processes are mutations, insertions, deletions, or events in higher dimensions like domain shuffling or changes in the gene order. These events, which always affect the information level (the DNA), result in substantial changes to the product of translation, the protein (with the exception of synonymous mutations, owing to the degeneracy of the genetic code). Out of all proteins in the two proteomes under consideration, it is therefore most probable and plausible that a close evolutionary relationship between proteins is reflected by a higher sequence similarity than that to all other proteins in the proteome.
Orthologs are, by definition, genes in two different species that have vertically descended from a single gene in the last common ancestor. Orthologs are considered to be detectable by sequence similarity. The best pairwise hit, as the result of the similarity search of each of two proteomes against the other, is the
central criterion in the InParanoid method (Remm, Storm, & Sonnhammer, 2001): Two so-called main
orthologs, a term originally defined in the field of evolution, form the seed for an orthologous group,
and other closely similar sequences of both organisms are successively added to the seed. Hence, the
central step for this fully automatic approach is the all-against-all sequence comparison that can only
indicate the underlying evolutionary process and can be used for such an interpretation.
The analysis of homology is more complex because the number of paralogs, which resulted from
a split and remained as similar proteins in a single organism, is often very large: Paralogs occur as
results of genome duplications. Alternative splicing isoforms cannot technically be easily separated from duplication products by similarity measurements. All paralogous genes together are orthologs from
the point of view of a closely similar gene in a second organism. The time points of both incidences,
of speciation and of divergence start, are important and have to be distinguished: out-paralogs predate
a species split, in-paralogs arose after a split into two species. The latter are, by definition and in the
prevailing terminology, properly paralogs. Out-paralogs should be more precisely distinguished in the
group of all orthologs. The nature of those definitions rules some basic computational approaches and
the underlying methods.
All-against-all comparisons by sequence similarity searches are the initial step in most of the methods described in this section. For analyses of hundreds of proteomes, any methodology is limited by
the complexity of homology.

Best Hits, Sum of Hits


One widely used method for determining pairwise orthologs is the calculation of the reciprocal best
hit. This one-to-one combination of sequences of two organisms can be expanded to a proteome-wide
comparison. Much more frequent is the occurrence of multiple best hits. One-to-many or many-to-many
sets are built by sum-of-hits approaches. In this case, high-scoring sequence pairs with distance measures smaller than an arbitrary distance threshold are added to the core set of orthologs.
A difficulty occurs if the reciprocal best hit concept is applied and more than two organisms are
considered. Then, for three organisms {A; B; C}, for instance, the pairwise comparisons {A-B; A-C; B-C} are to be calculated. Frequently, pairwise best reciprocal hits appear for sequence pairs as {a1-b1;
a1-c1; b1-c2}. In this case, the individuals of A and B point at different sequences of proteome C (c1
and c2). It is obvious that this problem increases with the number of proteomes. The categorization
of proteins can be successful in the sum-of-hits manner, regarding the implications of orthology and
paralogy. Methods that facilitate this goal are the focus of the next section.
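A minimal sketch of the reciprocal best hit criterion is given below; it assumes the best hit of every protein in the respective other proteome has already been extracted from an all-against-all comparison, and the toy hit tables are hypothetical.

# Reciprocal best hits (RBH) between two proteomes A and B, given each
# protein's best hit in the other proteome (toy data; in practice these
# tables come from BLAST or Smith-Waterman searches).
best_hit_A_to_B = {"a1": "b1", "a2": "b3", "a3": "b2"}
best_hit_B_to_A = {"b1": "a1", "b2": "a3", "b3": "a1"}

def reciprocal_best_hits(a_to_b, b_to_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    return [(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a]

print(reciprocal_best_hits(best_hit_A_to_B, best_hit_B_to_A))
# [('a1', 'b1'), ('a3', 'b2')]; a2 and b3 do not form a reciprocal pair.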

SEQUENCE SIMILARITY IS A CAPABLE PARAMETER FOR THE INFERENCE OF PROTEIN FAMILIES

The common evolutionary background of proteins can, under the constraints discussed, be successfully
determined by computational calculations that utilize sequence similarity. The presentation of methods
that perform all-against-all comparisons of full-length proteins (mainly measuring local similarities)
with the strategy of finding reciprocal best hits is the focus of this section. The publication by Krause
(2006) reviews the history of protein family formation as well as the backgrounds and descriptions of established approaches and data sets.

Data Origin
Types of data sets are either focused on the complete protein space, i.e., the set of known and available proteins, or a data set consists of a closed protein space, for instance by restriction to taxonomic
groups like bacteria, eukaryotes, plants, animals. In each case, however, all available data reflect only
a snapshot of the available knowledge of proteins. In public protein sequence databases, several tens of thousands of organisms are represented by at least one protein sequence. Currently, roughly 600 complete proteomes, with sizes between 500 and 25,000 proteins, are known, and the number is steadily increasing. In addition, refined methods that determine proteins from genomic information via new alternative splicing isoforms or alternative start site transcripts enhance the space of (putative) homologous proteins in sequence databases. Categorizing sequences from a global organism set respects the evolutionary aspect better than a sequence set reduced to only a subgroup of organisms. However, a
reduction of the sequence space is often necessary.

Preprocessing of Data
Taking into account the high variability and the large amount of proteins is one of the challenging tasks
in optimizing algorithms, the available technical background, software, or computing time. An answer
for limited capacity can be the reduction of the sequence space. This is frequently the reason for a
pre-selection by a similarity criterion, e.g., 97, 80, or 50 percent. Universal protein sequence databases like UniProt (The UniProt Consortium, 2008) provide sequence subsets and can be utilized for such restrictions; UniRef50, for example, is a set with a similarity level of 50 percent. To recover the complete sequence space, a replenishing step is required that follows the protein family inference processing.

Measures of Sequence Similarity and Distance


Most frequently used distance measures are calculated from similarity scores. These distance measures
and subsequently calculated e-values, which serve as essential decision parameters, are required in the
categorization phase. They are calculated by prominent alignment-dependent approaches like BLAST
or the Smith-Waterman algorithm.
A similarity score summarizes the result of a comparison at each single letter position within two character strings of protein sequences. Hereby, pre-calculated average ratios are used that describe the mutation of each single amino acid into another. Such a mutation probability matrix, known as a scoring matrix, incorporates models of evolution (Dayhoff, Schwartz, & Orcutt, 1978; Henikoff & Henikoff, 1992). The basis for computing similarity scores are pairwise alignments, including gaps that indicate deletions or insertions and gap penalties that influence the scoring. The character string comparison is performed on a pairwise protein sequence alignment. Such a significance test on often not clearly similar sequences utilizes local alignments, and e-values are calculated from local similarity scores. The BLAST algorithm is a heuristic search method and is used for approximate local sequence similarity. Newer concepts also consider alignment-free distance measures (Kelil, Wang, Brzezinski, & Fleury, 2007).
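To make the notion of a local similarity score concrete, the following simplified Smith-Waterman sketch uses a flat match/mismatch scheme and a linear gap penalty instead of a full scoring matrix such as BLOSUM62; it only illustrates how local alignment scores arise and is not a production implementation.

def smith_waterman_score(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Optimal local alignment score with linear gap costs (no traceback)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # H[i][j] = best score of a local alignment ending at seq1[i-1] / seq2[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if seq1[i - 1] == seq2[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("HEAGAWGHEE", "PAWHEAE"))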

Categorization
Methods aimed at partitioning data into subsets whose members share common traits are clusterings. Variations of clusterings are adapted for special tasks. A categorization concept is smart if it relies solely on information within the data; clusterings satisfy this requirement. Single linkage clustering is a common, computationally cheap and well-studied categorization method in sequence comparison analyses that use distances. Hierarchical clusterings, as one application option, organize sub-clusters and further sub-levels in a tree structure. Applying a hierarchical clustering to protein sequences, the result is a protein sequence tree wherein edge lengths correspond to the distance measure. The choice of a suitable threshold to separate biologically meaningful sub-clusters is the crucial point of hierarchical clusterings. Sophisticated approaches with dynamic distance-measure cutoffs, rather than fixed cutoffs, have been introduced to separate clusters of coherent proteins.
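A minimal sketch of single linkage clustering with a fixed cutoff is shown below; it forms connected components over all protein pairs whose distance falls below the threshold (a union-find structure), whereas the dynamic, subtree-dependent cutoffs mentioned above are not modeled.

def single_linkage(distances, threshold):
    """distances: dict mapping (id1, id2) -> distance, e.g. an e-value."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for (a, b), d in distances.items():
        find(a), find(b)                    # register both proteins
        if d <= threshold:
            union(a, b)

    clusters = {}
    for protein in parent:
        clusters.setdefault(find(protein), set()).add(protein)
    return list(clusters.values())

toy = {("p1", "p2"): 1e-50, ("p2", "p3"): 1e-8, ("p3", "p4"): 0.5}
print(single_linkage(toy, threshold=1e-5))   # two clusters: {p1, p2, p3} and {p4}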
Other methods of protein sequence partitioning utilize techniques like Markov clustering (see the next section) or spectral clustering (Paccanaro, Casbon, & Saqi, 2006). Automated phylogenetic trees based on a particular similarity measure and followed by subsequent partitioning (Kelil, Wang, et al., 2007) are reported to provide a successful categorization, as are layout-based heuristics for weighted cluster editing (Wittkop, Baumbach, Lobo, & Rahmann, 2007).

SEQUENCE-BASED PROTEIN FAMILY DATA SETS

Pre-selection of data and the applied methods that aim to infer protein families from sequence information are closely connected to the resulting data set. In this book chapter, some approaches that aim at the inference or detection of orthology are included under the umbrella term "protein family." Established protein family inferring approaches and underlying methods that utilize whole sequence
information are explained in this section.

SYSTERS
SYSTERS (systematic re-search) was initially introduced as a set-theoretical approach. In the current
version, a hierarchical clustering (Krause, Stoye, & Vingron, 2005) utilizes the whole protein space
from publicly accessible protein sequence resources. Starting with an all-against-all Smith-Waterman
comparison on the basis of a pre-calculated non-redundant protein sequence set (80 percent criterion), a
single linkage tree of the non-redundant set is constructed from pairwise sequence distances. The internal
structure of the tree, i.e., the data itself, separate SYSTERS superfamilies using a specific characteristic,
the subtree size, to determine dynamical e-value cutoffs. In a second step a graph-based approach, the
minimal cut algorithm, subdivides superfamily graphs into clusters. These so called SYSTERS protein
families are formed fully automatically and are the working entities of the SYSTERS data set. Sequences
closely related to sequences of the non-redundant set are added to respective families. Hence, they are
completing the redundant protein sequence set as it was achieved from the data resources. SYSTERS
restrictively stands for the clustering of similar sequences and references terms of homology or function
from annotations of the single sequence entries. However, additional information - posterior calculated
phylogenetic trees and multiple alignments - is achievable in the web server (Meinel, Krause, Luz, Vingron, & Staub, 2005) and supports the clustering of sequences towards SYSTERS protein families.

COG: Clusters of Orthologs
Completely sequenced archaeal and bacterial proteomes were compared by an all-against-all BLAST search (Tatusov, Koonin, & Lipman, 1997). The formation of clusters of orthologous groups, COGs, initially succeeds for closely similar genes of three organisms; a triangle represents the similarity measures of the pairwise best hits as edges. If this triangle is found to share a common edge with further triangles, these mutually consistent genome-specific best hits are merged into that COG. The core domain architecture of a protein drives the construction of a COG using position-specific scoring matrices for individual domains. Multi-domain proteins that artificially bridge COGs were manually split into individual domains and support the assignment of COGs in accordance with their distinct evolutionary affinities. Examinations and visual inspections follow the intermediate classification. In a later update of the COG database, a few eukaryotic genomes were added to the data set, now called KOG (Tatusov et al., 2003), by special software, the COGNITOR program.
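The triangle-merging idea can be sketched as follows; the best-hit edges are hypothetical, and the manual curation and domain-level treatment described above are omitted.

from itertools import combinations

# Hypothetical symmetric best-hit edges; each gene is a (genome, gene_id) pair.
edges = {frozenset(e) for e in [
    (("A", "a1"), ("B", "b1")), (("A", "a1"), ("C", "c1")), (("B", "b1"), ("C", "c1")),
    (("B", "b1"), ("D", "d1")), (("A", "a1"), ("D", "d1")),
]}

def triangles(edge_set):
    """All gene triples from three different genomes whose best hits are mutually consistent."""
    nodes = {n for e in edge_set for n in e}
    tris = []
    for trio in combinations(sorted(nodes), 3):
        genomes = {genome for genome, _ in trio}
        pairs = [frozenset(p) for p in combinations(trio, 2)]
        if len(genomes) == 3 and all(p in edge_set for p in pairs):
            tris.append(set(trio))
    return tris

def merge_triangles(tris):
    """Merge triangles that share at least two genes, i.e. a common edge."""
    clusters = [set(t) for t in tris]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if len(clusters[i] & clusters[j]) >= 2:
                clusters[i] |= clusters.pop(j)
                merged = True
                break
    return clusters

print(merge_triangles(triangles(edges)))   # one cluster spanning genomes A, B, C, D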

TribeMCL: Tribes, Ensembl Families


The TribeMCL algorithm compares protein sequences in an all-against-all manner using BLAST, and pairwise asymmetric results are cutoff-restricted. Expectation values enter into a so-called Markov matrix that can be represented as a connection graph (Enright, Van Dongen, & Ouzounis, 2002). The Markov Clustering (MCL) procedure simulates a random walk through this graph in iterative rounds with alternating operations until the result is robust. This bootstrapping-like procedure allows protein families hidden in the graph to become visible by gradually stripping the graph down to its basic components. For the Tribes database (Enright, Kunin, & Ouzounis, 2003), protein sequences of 83 complete genomes from all three super-kingdoms of life entered into the procedure; for the Ensembl database (Flicek et al., 2008), TribeMCL is used to infer Ensembl families for currently more than 25 eukaryotic organisms, some of which are still incomplete as draft genomes. Ensembl provides an alternative orthology and paralogy prediction method in contrast to Ensembl Families: advanced sequence-based and phylogeny-based approaches are combined in Ensembl Orthologies. The method is described at http://www.ensembl.org/info/about/docs/compara/homology_method.html.
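The expansion/inflation alternation at the heart of MCL can be sketched in a few lines; the following simplified version illustrates the general idea (after Enright, Van Dongen, & Ouzounis, 2002) rather than the TribeMCL implementation, and its parameters and toy graph are invented.

import numpy as np

def normalize_columns(M):
    return M / M.sum(axis=0, keepdims=True)

def mcl(adjacency, expansion=2, inflation=2.0, iterations=50, self_loops=1.0):
    """Alternate expansion (matrix power) and inflation (element-wise power) steps."""
    M = adjacency.astype(float) + self_loops * np.eye(len(adjacency))
    M = normalize_columns(M)
    for _ in range(iterations):
        M = np.linalg.matrix_power(M, expansion)   # expansion: spread the random walk
        M = normalize_columns(M ** inflation)      # inflation: strengthen strong flows
    clusters = []
    for row in M:                                  # attractor rows list their cluster members
        members = set(np.nonzero(row > 1e-6)[0])
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Two triangles of proteins {0, 1, 2} and {3, 4, 5} joined by a single weak bridge 2-3.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(mcl(A))   # expected to separate the two triangles for this toy graph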


InParanoid
Proteins of currently 35 eukaryotic organisms are taken into the calculation of similarity scores, for the detection of two so-called main orthologs as well as of adequate in-paralogs. The main orthologs are the seed sequences of an orthologous group and are determined by a best-hit similarity score. The most closely related sequences of each of the two organisms, the in-paralogs, are successively added to the seed if the similarity to their organism-specific correspondent is within the score. Out-paralogs are more distant from the main orthologs and form a separate group. Each group is restricted to a pair of organisms, and orthologs from other organisms are not considered in the group (Remm, Storm, & Sonnhammer, 2001; O'Brien, Remm, & Sonnhammer, 2005). Results are available in the InParanoid database (Berglund, Sjölund, Ostlund, & Sonnhammer, 2008). To extend this approach to multiple proteomes and circumvent the restriction to pairwise considerations, a subsequent clustering using the InParanoid output was introduced. The algorithm called MultiParanoid (Alexeyenko, Tamas, Liu, & Sonnhammer, 2006) merges
multiple pairwise ortholog groups into ortholog groups of multiple proteomes.

OrthoMCL
OrthoMCL is a graph clustering algorithm that identifies homologous proteins based on sequence similarity (Li, Stoeckert, & Roos, 2003; Chen, Mackey, Stoeckert, & Roos, 2006). Species are compared in
order to detect putative orthologs by the reciprocal best hit. In-paralogs, sequences within the same genome that are more similar to each other than to any sequence from other genomes, are also identified. OrthoMCL, as a fully automated method, combines the homology definitions of the InParanoid approach with the technique of Markov clustering (also used for TribeMCL) to resolve the many-to-many orthologous relationships inherent in comparisons across multiple genomes.

P-POD
The Princeton Protein Orthology Database, P-POD (Heinicke et al., 2007), results from the idea to
compare automatically derived orthology with knowledge from literature. OrthoMCL is the underlying
automatic algorithm. P-POD claims to be a comparative genomics analysis tool that combines phylogenetic relationships and information from curated literature or other resources oriented toward issues like diseases. It appeals particularly to experimentally working biologists who are seeking information from or about genes in well-studied model organisms.

OTHER ATTEMPTS THAT DETECT PROTEIN FAMILIES

Protein domains, as well as phylogenies based on multiple sequence alignments, are used to categorize proteins into protein families beyond purely whole-sequence-oriented methods.

Pfam
Protein domains are conserved functional substructures and therefore possess an evolutionary background. The Pfam database (Finn et al., 2006) comprises roughly 8,000 manually curated protein families. They are generated using Hidden Markov Models (HMMs; a statistically supported method) for protein domains. Therefore, Pfam families are domain driven. To respect evolutionary relationships, Pfam families are organized in so-called clans, respecting the fact that artificially high thresholds frequently separate two related families or divergent members of families that cannot be captured in a single HMM.

PhIGs
The Phylogenetically Inferred Groups (PhIGs) database introduces known evolutionary relationships in
addition to protein sequence distances from a BLAST all-against-all comparison to an iterative hierarchical clustering (Dehal & Boore, 2006). In contrast to the previously presented approaches, multiple sequence
alignments of proteins are used to derive orthology from sequence information. A complete phylogenetic
gene tree is created using widely accepted analytic methods of molecular evolution.

TreeFam
For the TreeFam database (Li et al., 2006; Ruan et al., 2008), orthology of genes is derived from phylogenetic trees of all animal gene families, considering that the analysis of phylogenetic trees is an established and very accurate way of determining orthology. Twenty-five animal genomes are considered in the phylogenetic trees. The TreeFam database provides in its major part, TreeFam-B, phylogenetic trees that are automatically generated using clusters of PhIGs as seeds. A second part consists of manually curated trees (TreeFam-A), which comprises less than 10% of all trees.

TAXONOMY AND THE NUMBER OF COMPLETELY SEQUENCED ORGANISMS
The overwhelming number of bioinformatics approaches relies on the taxonomy that is provided by the NCBI (Wheeler et al., 2004). NCBI collects expert knowledge and provides references for the included organisms. The data set is maintained daily. The NCBI taxonomy provides universally used digital taxonomic identifiers, the TaxIDs. Lineages are provided for each organism, as well as the universal taxonomic tree with internal nodes such as super-kingdom, class, order and family, and with species or sublevels such as subspecies as leaves. In phylogenetic analyses, species, subspecies, and strains are often considered as operational taxonomic units, here used synonymously with the term organisms.
In addition to TaxIDs, a second taxonomic nomenclature is provided by the taxonomic division of the ExPASy server (ExPASy Proteomics Server, 2008) and is used in the UniProt and Swiss-Prot databases. It is characterized by the usage of mnemonics that abbreviate organism names, most frequently with five letters. The respective organisms are a subset of the NCBI taxonomy; corresponding organisms are linked via standard NCBI TaxIDs. In practice, Swiss-Prot combines these mnemonics with gene names in protein sequence identifiers (e.g., SYFA_HUMAN, human phenylalanyl tRNA-synthetase alpha chain).
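As a small illustration of this naming convention, the sketch below splits a Swiss-Prot style identifier into its gene and organism parts and maps the mnemonic to a TaxID; the mnemonic table is a tiny hand-made excerpt, not the full ExPASy controlled vocabulary.

```python
# Illustrative parsing of Swiss-Prot style protein identifiers of the form
# GENE_MNEMONIC (e.g., SYFA_HUMAN). The mnemonic-to-TaxID table below is a
# tiny hand-made excerpt used only for demonstration; the authoritative list
# is the ExPASy controlled vocabulary of species referenced in the text.

MNEMONIC_TO_TAXID = {
    "HUMAN": 9606,   # Homo sapiens
    "MOUSE": 10090,  # Mus musculus
    "ECOLI": 83333,  # Escherichia coli K-12
}

def parse_swissprot_id(identifier):
    """Split an identifier such as 'SYFA_HUMAN' into (gene part, mnemonic, TaxID or None)."""
    gene, _, mnemonic = identifier.partition("_")
    return gene, mnemonic, MNEMONIC_TO_TAXID.get(mnemonic)

print(parse_swissprot_id("SYFA_HUMAN"))  # ('SYFA', 'HUMAN', 9606)
```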
Currently, roughly 25 eukaryotic genomes have been completely analyzed. Owing to the easier sequencing of their smaller genomes and their importance under biotechnological or disease aspects, the numbers of bacterial and archaeal genomes are much larger: above 550 and roughly 50, respectively.
Completed and ongoing genome projects are recorded in the GOLD database (Liolios, Mavromatis,
Tavernarakis, & Kyrpides, 2008).

PHYLOGENETIC PROFILING

Phylogenetic Profiles Combine Gene Presence Information with Taxonomic Information; Phylogenetic Profiling
A phylogenetic profile is a presence or absence pattern of a discrete gene or protein across a given set of organisms. The array of organisms can be ordered with respect to their taxonomy. Phylogenetic profiles (or, synonymously, phyletic patterns) then visualize the presence of a gene or protein in a systematic and easily readable way.
The proteins of the organisms represented in a phylogenetic profile are usually not identical in sequence. Therefore, a phylogenetic profile generally has to be associated with an instance that categorizes proteins, e.g., with protein families as presented in a prior section. Hence, the population of a phylogenetic profile depends on the method used to infer protein families. Note that a phylogenetic profile generally represents a protein family that contains at least one member protein from each of the organisms indicated as present.
A phylogenetic profile provides a taxonomic overview of a protein family; vice versa, a protein family can be identified by a phylogenetic profile. Phylogenetic profiling is the comparison of several phylogenetic profiles. Similar or identical profiles are of special importance because of their biological background.
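The construction of such presence/absence patterns can be summarized in a few lines. The sketch below assumes that family membership has already been obtained from some protein family resource; the organism and family names are invented for illustration, and the 'o'/'-' notation follows Figures 1 and 2.

```python
# Minimal sketch: derive a phylogenetic profile (presence/absence pattern)
# for each protein family over a fixed, taxonomically ordered organism list.
# Family memberships and organism names are made-up illustrations; in practice
# they would come from a protein family resource such as those described above.

ORGANISMS = ["E. coli", "B. subtilis", "S. cerevisiae", "H. sapiens", "M. jannaschii"]

# family id -> set of organisms contributing at least one member protein
FAMILIES = {
    "FAM_alpha": {"E. coli", "B. subtilis"},
    "FAM_beta":  {"S. cerevisiae", "H. sapiens", "M. jannaschii"},
    "FAM_ubiq":  set(ORGANISMS),
}

def profile(member_organisms, organisms=ORGANISMS):
    """Return the presence/absence string ('o'/'-') over the ordered organism list."""
    return "".join("o" if org in member_organisms else "-" for org in organisms)

for family, members in FAMILIES.items():
    print(f"{profile(members)}  {family}")
```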

Conservation of Proteins in a Functional Context Leads to a Conservation of Phylogenetic Profiles
If a functional context of proteins is evolutionarily conserved and occurs in a set of organisms, the affiliated proteins also share a common evolutionary history. Each of the essential proteins must survive any evolutionary pressure. This group of organisms then contributes co-inherited members to the respective protein families and therefore to identical presence indications in each single phylogenetic profile. Hence, the idea emerged that similar phylogenetic profiles could be used to predict functional relationships between proteins of unknown function (Pellegrini, Marcotte, Thompson, Eisenberg, & Yeates, 1999). A functional context concerns proteins that are in physical contact (e.g., interactions among proteins or domains of proteins) or enzymes in a shared pathway.
However, similar or identical profiles do not necessarily indicate a common functional context. Two ubiquitous families of essential proteins are not necessarily functionally related or co-evolved merely because both are present in the same organisms, yet both are represented by phylogenetic profiles with presence indications for all organisms. Analogously, there might also exist organism-specific protein families that are not involved in a common functional background but are characterized by similar phylogenetic profiles. It could be shown (Jothi, Przytycka, & Aravind, 2007) that the correlation between profile similarity, measured by mutual information, and membership in a KEGG pathway, measured by scoring similarities, is weak, even though the analysis was carefully restricted to informative organisms. A hierarchical clustering of phylogenetic patterns of COGs (Glazko & Mushegian, 2004) analyzed the pattern distance, graph-producing methods, the partitioning into subgraphs, and the estimation of error rates for the prediction of functional linkage.
Several measures for comparing phylogenetic patterns were introduced (Wu, Kasif, & DeLisi, 2003); the simplest is the Hamming distance, which counts the positions at which two profiles differ. Other measures applied successfully are the Pearson correlation coefficient and mutual information. This assessment revealed that similar as well as complementary profiles show associations with common KEGG pathways. The overlap of gene pairs with Gene Ontology terms was used to benchmark methods and metrics that compare phylogenetic profiles (Cokus, Mizutani, & Pellegrini, 2007).
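The three measures mentioned above can be written down compactly. The sketch below assumes that profiles are equal-length strings of 'o' and '-' as in Figures 1 and 2, and estimates mutual information from simple empirical frequencies; it is an illustration of the measures, not the exact implementations used in the cited studies.

```python
# Sketch of three profile comparison measures discussed in the text
# (Hamming distance, Pearson correlation, mutual information), assuming
# profiles are equal-length 'o'/'-' strings as in Figures 1 and 2.
from math import log2, sqrt

def to_bits(profile):
    return [1 if c == "o" else 0 for c in profile]

def hamming(p, q):
    """Number of positions at which two profiles differ."""
    return sum(a != b for a, b in zip(to_bits(p), to_bits(q)))

def pearson(p, q):
    """Pearson correlation coefficient of two binary profiles."""
    x, y = to_bits(p), to_bits(q)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy) if sx and sy else 0.0

def mutual_information(p, q):
    """Empirical mutual information (in bits) between two binary profiles."""
    x, y = to_bits(p), to_bits(q)
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for u, v in zip(x, y) if u == a and v == b) / n
            p_a = sum(1 for u in x if u == a) / n
            p_b = sum(1 for v in y if v == b) / n
            if p_ab > 0:
                mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi

p1 = "oo--oo-o"
p2 = "oo--oo--"
print(hamming(p1, p2), round(pearson(p1, p2), 2), round(mutual_information(p1, p2), 2))
```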
The organism set of most phylogenetic profiling approaches is dominated by the bacterial and archaeal super-kingdoms. More interesting in the context of clinical studies is the inclusion of eukaryotic organisms. The Ensembl database provides sequence and protein family information for several eukaryotes. PhyloPat (Hulsen, de Vlieg, & Groenen, 2006) was initiated to summarize all Ensembl orthologs into phylogenetic patterns. However, assessments of phylogenetic profiles excluding prokaryotic proteins revealed poor results (Snitkin, Gustafson, Mellor, Wu, & DeLisi, 2006; Jothi, Przytycka, & Aravind, 2007).
Convergence events, i.e., non-orthologous gene displacements, can lead to a complementary profile combination. This means that all presence states in the first profile are absence states in the second and vice versa. However, two complementary profiles with proteins of identical functional background do not necessarily indicate convergence. If the family-inferring approach separates two divergent groups of proteins with a common history into two protein families, as in the SYSTERS data set, both respective phylogenetic profiles are complementary, too.
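A direct way to flag candidate complementary profile pairs is sketched below; the mismatch tolerance is an arbitrary illustrative parameter, not a published threshold.

```python
# Sketch: flag (near-)complementary profile pairs, i.e., pairs where presence
# in one profile coincides with absence in the other. The mismatch tolerance
# is an arbitrary illustrative parameter.

def complementarity_mismatches(p, q):
    """Count positions where the two 'o'/'-' profiles are NOT complementary."""
    return sum((a == "o") == (b == "o") for a, b in zip(p, q))

def is_complementary(p, q, tolerance=0):
    return len(p) == len(q) and complementarity_mismatches(p, q) <= tolerance

print(is_complementary("oo--o", "--oo-"))               # True  (perfect complement)
print(is_complementary("oo--o", "-ooo-", tolerance=1))  # True  (one mismatching position)
```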

Phylogenetic Profiles on the Basis of the SYSTERS Data Set


Phylogenetic profiling based on the SYSTERS data set is introduced subject to several constraints. For example, at least three completely sequenced organisms must contribute proteins to a protein family for it to enter the set of phylogenetic profiles. The SYSTERS subset containing protein families of that quality is called PhyloMatrix (Meinel, Krause, Luz, Vingron, & Staub, 2005). 106 organisms from all three super-kingdoms are included in the profile set; the current release, from 2003, consists of 78 bacterial, 12 eukaryotic, and 16 archaeal proteomes (organisms in taxonomic order are available upon web request). 7,563 unique phylogenetic profiles represent 19,374 SYSTERS protein families; 1,933 families are characterized by a single profile that comprises the three organisms human, mouse, and pufferfish. This may help explain the unspecific profiles observed in eukaryotic phylogenetic profiling, as assessed by several authors.
A specificity of the PhyloMatrix profile set, owing to the original SYSTERS clustering procedure, is the partitioning of obviously orthologous sequences into separate protein families that are not merged by subsequent manual curation. Therefore, phylogenetic profiling in SYSTERS can additionally be used for evolutionary studies of divergence phenomena. A special group of ubiquitous proteins with a central role in the information transduction chain will demonstrate this SYSTERS specificity and serve as an example of phylogenetic profiling based on PhyloMatrix. It is also traceable in the SYSTERS web interface.
Figure 1(a) presents an example that illustrates the classical profiling approach - two proteins in a
functional context comprise a similar phylogenetic profile - and the specificity of the SYSTERS clustering


Figure 1. (a) Phylogenetic profiles (o/-; presence/absence) of three SYSTERS protein families in superfamily 115499 (profiles A to C) and four families in SYSTERS superfamily 113736 (profiles D to G). All seven families comprise phenylalanyl tRNA-synthetases. A functional context, here given by two protein subunits, is observed for protein families in profile combinations A-D and B-E. Confer the PhyloMatrix web service for details. (b) Protein domain composition (colored; Pfam code) in α-subunits, β-subunits and compartment-specific isoforms of phenylalanyl tRNA-synthetases, clustered in two SYSTERS superfamilies (SF) and seven SYSTERS protein families (PF), confer Figure 1(a). Eukaryotic isoforms are distinguished according to the respective cellular compartmentalization.
SYSTERS superfamily 115499
SYSTERS protein family 141095 - Phenylalanyl-tRNA synthetase alpha subunit
A ------------------------------------------------------------------oo---------- oooooooooo-o oooooooooooooooo
SYSTERS protein family 141096 - Phenylalanyl-tRNA synthetase alpha subunit
B oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo--oooooooooo ------------ ----------------
SYSTERS protein family 141094 - mitochondrial Phenylalanyl-tRNA synthetase
C ------------------------------------------------------------------------------ ooooooooo--o ----------------

|<-Bacteria------------------------------------------------------------------->|<-Eukaryota>|<-Archaea----->|

SYSTERS superfamily 113736
SYSTERS protein family 139353 - Phenylalanyl-tRNA synthetase beta subunit
D ------------------------------------------------------------------oo---------- oooooooooo-o oooooooooooooooo
SYSTERS protein family 139351 - Phenylalanyl-tRNA synthetase beta subunit
E oooooooo-ooooooooooooooooooooooooooooooooooooooooooooooooooooooooo--oooooooooo ----o------- ----------------
SYSTERS protein family 139354 - hypothetical protein
F ------------------------------------------------------------------------------ ooo-ooo----- ----------------
SYSTERS protein family 139352 - Phenylalanyl-tRNA synthetase beta subunit
G ---------------------------------------ooooooooooooooooo---------------------- ------------ ----------------

(a)

(b) [graphical panel: protein domain compositions; see figure caption]

to separate more divergent groups of proteins from each other. Aminoacyl tRNA-synthetases (aaRSs) attach amino acids with high specificity to their cognate tRNAs and are key components of the protein translation machinery. Sequence similarities and distances for protein sequences of aaRSs are well studied in prokaryotes (Woese, Olsen, Ibba, & Söll, 2000). Phenylalanyl tRNA-synthetases (PheRSs) comprise, in bacteria as well as in archaea, both alpha and beta subunits. These two groups of interacting proteins are clustered separately from each other in two SYSTERS superfamilies; the respective profile pairs are similar: A-D and B-E. However, each superfamily is split into at least two main protein families. Figure 1(b) illustrates the slightly different protein domain compositions observable in the two bacterial subunits. The generalized sketch for the respective SYSTERS protein families 141096 (alpha subunits) and 139351 (beta subunits) reveals one or two additional domains in comparison to the archaeal and eukaryotic proteins (Wolf, Aravind, Grishin, & Koonin, 1999). This difference forces the separation of the sequences from the SYSTERS protein families PF 141095 and PF 139353, indicated by the complementary phylogenetic profile pairs A-B and D-E in Figure 1(a). Here, the evolutionary history can be verified as divergence. In addition, the endosymbiotic history appears to be reflected by a third family, albeit the annotation in the respective entries is poor (confer the SYSTERS web server for more information). While cytosolic proteins in eukaryotes are combined with archaeal proteins, mitochondrial proteins are additionally clustered in their own family, PF 141094. Figure 1(b) discloses that two functions, one of a single domain of the bacterial alpha subunit and one of a single domain of the beta subunit, are combined into a single mitochondrial protein consisting of two concatenated domains with an obviously preceding history. The example of the phenylalanyl tRNA-synthetases demonstrates very clearly the influence of several evolutionary events that can be observed in phylogenetic profiles: the early separation of archaea and bacteria is reflected by complementary pattern pairs, the endosymbiosis event by the co-occurrence of two alpha-like subunits in eukaryotes, and the similarity (or identity) of profiles is based on the functional context of the alpha and beta subunits.
Aminoacyl tRNA-synthetases, aaRSs, are likewise key components of the information transduction chain and generally ubiquitous proteins. In Figure 2, phylogenetic profiles demonstrate this for at least fourteen aaRSs by presence in all organisms; some of them are comprised in a single protein family. At least two protein families exist for each of the six other aminoacyl tRNA-synthetases, separating archaeal from bacterial proteins. Eukaryotic proteins are either clustered exclusively together with archaeal proteins (GlyRSs and, as explained previously, both PheRS subunits) or exclusively together with bacterial proteins (LysRSs). This is supported by (nearly perfect) complementary phylogenetic profiles.
A separate mitochondrial protein family is known at least for PheRSs. According to the endosymbiont theory, whereby mitochondrial proteins are recruited from an invasive bacterium, mitochondrial aaRSs in eukaryotes are more similar to bacterial than to archaeal proteins. This is visible in the profiles for LysRSs, but also for ProRSs, TrpRSs and TyrRSs. Eukaryotes additionally possess archaea-like, cytosolic proteins for the latter three aminoacyl tRNA-synthetases. In the protein family pairs of aaRSs, complementary patterns are based on divergence, even if an affiliation to common SYSTERS superfamilies is not always given (data available in the web server). For both GlyRS families, for instance, the common structural superclass is evident (data not shown). Locally complementary profiles are found for GlyRSs, LysRSs, and ProRSs within the bacterial super-kingdom, and for LysRSs and SerRSs within the archaeal super-kingdom.
Proteins can be specific for a group of organisms. Horizontal gene transfer can frequently be observed in the bacterial super-kingdom and is detectable by phylogenetic profiles. Spirochaetes, a separate bacterial taxonomic class, are known for individual behavior in profiles. The two single bacterial presence indicators in the given example, profiles A and D in Figure 1(a), also belong to this effect. Horizontal gene transfer leads to single presence states outside of an organism group in the profile of an organism-specific protein family.


Figure 2. Significant phylogenetic profiles for all twenty aminoacyl tRNA-synthetases (o/-, presence/absence): ubiquitous SYSTERS protein families for Ala, Cys, His, {Ile, Leu, Val}, Met, Arg, Ser, Thr, {Asp, Asn, Glu, Gln} specificity, and organism-specific SYSTERS protein families for Phe, Gly, Lys, Pro, Trp, Tyr specificity. IUPAC-IUB three-letter codes for amino acids stand for the aminoacyl specificity of the respective tRNA-synthetases. An optional lower-case letter indicates subunits of a protein complex: a, alpha; b, beta; e, epsilon; m, mitochondrial. Curly brackets indicate that groups of aminoacyl tRNA-synthetases are unified in a single SYSTERS protein family (e.g., PF 139203 for Ile, Leu, Val) due to high sequence similarities.
|<-Bacteria------------------------------------------------------------------>|<-Eukaryota>|<-Archaea----->|
oooooooo-ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooo-o oooooooooooooooo PF 143119 Ala
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooo-o ooooooo-ooo-oo-o PF 143743 Cys
------------------------------------------------------------------oo---------- oooooooooo-o oooooooooooooooo PF 141095 Phe-a
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo--oooooooooo ------------ ---------------- PF 141096 Phe-a
------------------------------------------------------------------------------ ooooooooo--o ---------------- PF 141094 Phe-m
------------------------------------------------------------------oo---------- oooooooooo-o oooooooooooooooo PF 139353 Phe-b
oooooooo-ooooooooooooooooooooooooooooooooooooooooooooooooooooooooo--oooooooooo ----o------- ---------------- PF 139351 Phe-b
oooooo--------------------------------------------oooo--oooooooo-oooo---o----- oooooooooo-o oooooooooooooooo PF 141159 Gly
------oooooooooooooooooooooooooooooooooooooooooooo----oo--------o----ooo-ooooo ----o------o ---------------- PF 152193 Gly
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooo-o oooooooooooooooo PF 138672 His
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooo-o oooooooooooooooo PF 139203 {Ile,Leu,Val}
----o-------------------------------------------------------------oo---------- ------------ o---oooooooooooo PF 149724 Lys
ooooooooooooooooooooooooo--ooo--oooooooooooooooooooooooooooooooooo--oooooooooo oooooooooo-o -ooo-oo--------- PF 138269 Lys
------------------------ooo--ooo---------------------------------------------- ------------ ---------------- PF 149725 Lys
ooooo-oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooo-o oooooooooooooooo PF 137021 Met
o-----oooooooooooooooooo-----------ooo-----------o-------o-ooooo--o-o---o----- oooooooooo-o oooooooooooooooo PF 138344 Pro
-oooooooooooooooooooooooooooooooooooooooooooooooo-ooooooooo-----oo-o-ooo-ooooo ooooooooo--- ---------------- PF 127052 Pro
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooo-o oooooooooooooooo PF 150167 Arg
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooooo ooooooo-ooo-oo-o PF 146233 Ser
----------------------------o------------------------------------------------- ------------ -------o---o--o- PF 129876 Ser
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooo-o oooooooooooooooo PF 143907 Thr
------------------------------------------------------------------------------ oooooooooo-o oooooooooooooooo PF 143128 Trp
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oo-oooooo--o ---------------- PF 143124 Trp
------------------------------------------------------------------------------ oooooooooo-o oooooooooooooooo PF 137023 Tyr
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo ooooooooo--o ---------------- PF 139117 Tyr

oooooooooooooooooooooooooo-ooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooo-o oooooooooooooooo PF 138346 {Asp,Asn}


------------------------------------------------------------------------------ ------------ oooooooooooooooo PF 142105 Gln-e
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo ooooooooo--o ---------------- PF 138345 {Glu,Gln}
ooooooooooooo--ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo ooooooooo--o oooooooooooo--oo PF 140465 {Asp-a,Asn-a,
Gln-a,Glu-a}
oooooo---------o--------oo-oooooooooooooooooooooooooooooo-oooooooooooooooooooo oo-oooooo--o ooo-oooo---o--oo PF 142104 {Asp-b,Asn-b,
Gln-b,Glu-b}
|<-Bacteria------------------------------------------------------------------>|<-Eukaryota>|<-Archaea----->|

Analyses Using Phylogenetic Profiles


Co-inheritance, assessed using pairs of phylogenetic profiles, was used to predict functional annotations as well as genome-wide protein network linkages in baker's yeast and the bacterium Escherichia coli (Date & Marcotte, 2003; Wu, Hu, & DeLisi, 2006; Zheng, Roberts, & Kasif, 2002). The modularity in the evolution of functional modules, i.e., sets of orthologous groups expressed in blocks of phylogenetic patterns, was analyzed and quantified (Snel & Huynen, 2004). It was shown that the interspecies flexibility depends on functional differentiation within orthologous groups. In other words, it cannot be expected that the functional basis of a phylogenetic pattern is perfectly respected by all members of a protein family, nor that the organism composition of the pattern is perfect.
In the study of Snel and Huynen (2004), several data types of different origin were used: metabolic pathways, protein complexes, and transcriptional modules. This gives an impression of the universality of phylogenetic profiling. Other reference sets use homologous proteins with reference proteins (Pellegrini, Marcotte, Thompson, Eisenberg, & Yeates, 1999; Marcotte, Pellegrini, Thompson, Yeates, & Eisenberg, 1999), in-detail studies within a protein group of interest such as tRNA-synthetases (Dohm, Vingron, & Staub, 2006), clusters of orthologs (Tatusov, Koonin, & Lipman, 1997), Pfam protein domains (Ye & Godzik, 2004; Pagel, Wong, & Frishman, 2004; Yang, Doolittle, & Bourne, 2005), or proteins associated with distinct pathways (Dandekar, Schuster, Snel, Huynen, & Bork, 1999), or they analyze keywords (Liberles, Thoren, von Heijne, & Elofsson, 2002). Phylogenetic profiles have been helpful for the subcellular localization of protein groups (Marcotte, Xenarios, van Der Bliek, & Eisenberg, 2000). Very effective is the simultaneous visualization of phylogenetic profiles and networks restricted to the respective compounds (Doerks, von Mering, & Bork, 2004).

Phylogenetic Profiling Approaches


Most of the prominent phylogenetic profiling approaches are available on the World Wide Web. URLs of exemplary internet services are provided below:
The COG data set (http://www.ncbi.nlm.nih.gov/COG/) can be directly accessed by browsing phylogenetic profiles. COGs are the basis for many studies on phylogenetic profiles referenced in this review.
PhyloMatrix (http://systers.molgen.mpg.de/PhyloMatrix/): SYSTERS is a fundamental data resource for protein families. The underlying method is an automated approach that is strictly based on sequence similarity. In the PhyloMatrix web service, various query options provide access to phylogenetic profiles of SYSTERS protein families, and an outlink from each protein family provides access to the PhyloMatrix tool.
PhyloPat (http://www.cmbi.ru.nl/phylopat/) is based on Ensembl families and comprises more than 25 genomes. This service possesses high-quality web functionality and allows access via profiles to the respective gene families.
OrthoMCL (http://www.orthomcl.org) provides phyletic patterns for orthologous groups of protein sequences from multiple eukaryotic genomes.
The STRING database (http://string.embl.de) simultaneously provides information about the taxonomic distribution of proteins and the respective networks of gene neighborhood (von Mering et al., 2007).
ProLinks (http://dip.doe-mbi.ucla.edu/pronav/) is integrated in the Database of Interacting Proteins and contains a phylogenetic pattern search option with text display output (Bowers et al., 2004).
PLEX (http://bioinformatics.icmb.utexas.edu/plex/), the Protein Link EXplorer (Date & Marcotte, 2005), provides phylogenetic profiles accompanied by quantitative estimates of linkage confidence.
PhylProM (http://www.sbc.su.se/~anna/PhylProM/index.html) is one of the first services that introduced keyword-based phylogenetic profiling (Liberles, Thoren, von Heijne, & Elofsson, 2002).

DISCUSSION
This book chapter reviews the sequence-based origin of information for methods that determine functional properties and the evolution of proteins. Evolutionary events can be observed in three dimensions: divergence of sequences, speciation, and divergence of protein function. The arrows in Figure 3 illustrate evolution along these three dimensions. Techniques for determining sequence similarity are considered, as well as subsequent clustering procedures, including several attempts at categorization towards orthology or similarity. Here, sequence-based or appropriate phylogeny-inferring methods are utilized for the respective approaches. As illustrated in Figure 3, the sequence-information-based term protein family or, alternatively, the evolutionary term ortholog can be regarded as the link between sequence grouping information and taxonomic information. Many computational attempts combine the

Figure 3. There are three dimensions of evolution covered by computational molecular biology: divergence of sequences, speciation towards organisms, and invention or divergence of function. A protein sequence distance, which is a measure for the divergence of two sequences, is suggested by the interspace between two dots; protein sequences are shown as grayish dots; two different methodological approaches, dotted or solid circles, are assumed. Speciation is visualized as a symbolic taxonomic tree with organisms as leaves, dark squares. Family proteins (or orthologs, depending on the inferring method) combine information about clustering and species but not about function. Comprising proteins in enzymes, respectively pathways, is a categorization of protein function across organisms without sequence information. In paralogs, function can be slightly shifted for sequences within a single organism and with common evolutionary history. Phylogenetic profiling (PP) can be applied to both family proteins and enzymes.

[Figure 3 diagram: the three dimensions Protein Sequences, Organisms, and Protein Function, connected by the labels family proteins or orthologs, enzymes or pathways, paralogs, and PP (phylogenetic profiling).]


term function with the result of the inference of orthology. However, function often possesses its own evolutionary aspects: there can be functional shifts caused by mutations, or differences in function between isoforms. Therefore, protein function cannot be directly determined from orthology; it can, at most, be deduced from it.
Each protein family definition depends on the methodology of protein comparison and categorization. One aspect that governs the partitioning of proteins is the extremely different speed of evolution of different proteins. The resulting problems can often only be addressed by manual curation, and in approaches that aim to detect functional homology, the motivation for manual curation arises preferentially. The SYSTERS method clusters similar proteins without any limitation imposed by manual curation. In contrast to this, it is remarkable that some approaches often rely only on sequence similarity while claiming to infer orthology. Therefore, as already established in phylogeny-based approaches, it is an advantage to accumulate information and to include phylogenetic methods in a purely family-forming approach rather than merely to claim orthology.
At this point, it is suggested to generally add assured information on function to the knowledge about sequence similarity. A concept of functional orthology would reduce the number of respective ortholog candidates, as they exist, for instance, for alternative splicing isoforms, and give a more exact picture of comparable proteins in different organisms. Orthologous connections between organisms would thereby also be raised from the current information level, the gene level, to the transcript/protein level, i.e., towards the operational biological unit, which is currently not the case. This would also support the portability of information on function from animal models to other organisms such as human, and could be achieved by including additional experimental evidence in the cluster information, for instance data from microarray-based experiments.
Phylogenetic profiles in general provide visualized information about the taxonomic distribution of a single protein family. As illustrated in Figure 3, phylogenetic profiling works in parallel to a clustering of sequences originating from different organisms and combines several protein families. It can be used to indicate functional contexts by pattern similarity because it works independently of knowledge about the function or functional contexts of the respective protein families.
A refinement of phylogenetic profiling can be achieved either by extending the number of organisms in the profile or by refining the definition of the biological entity. In comparison to the current number of prokaryotic organisms, currently a few hundred, the number of eukaryotic proteomes (twenty to thirty) is very sparse for phylogenetic profiling. Specific bacterial protein contexts, for instance flagellar proteins, are well observable within the broad bacterial sub-pattern. Many prokaryotes of the bacterial super-kingdom are pathogens and therefore of interest for research on pathogen-induced diseases. Here, phylogenetic profiling can help to understand protein functionality within the bacterial super-kingdom. Parasites, a special group of prokaryotes, are observable by special presence or absence indications in phylogenetic profiles, and related protein families are directly detectable. Eukaryotic proteomes, in combination with the respective complete prokaryotic sub-patterns, significantly support the endosymbiont hypothesis for mitochondria. Many eukaryotic protein families show unresolved all-presence indications in phylogenetic profiles. The currently sparse resolution within the eukaryotic sub-pattern would be increased by an exhaustive increase in research on eukaryotic proteomes. Currently, more than 25 Ensembl organisms are available in the PhyloPat approach. Here, however, phylogenetic profiling of eukaryotes is not supported with prokaryotic sequence information of evolutionary relevance.
The information content of the biological entity, i.e., the protein family or cluster of orthologous genes, varies with the underlying methodological approach. By modulating or extending the approach, the definition of a biological entity can be refined towards increased information content in terms of function or evolution. Any biological entity can be the instance that is assigned a phylogenetic profile, whereas a change of the underlying definition does not necessarily induce changes in the profile. Phylogenetic profiling using SYSTERS protein families, for instance, extends the original functional context approach, wherein a particular pattern represented a single gene (family): the evolutionary complexity within a SYSTERS superfamily can be resolved by differentiation into separate SYSTERS protein families. Here, additional divergence information can be detected. In general, there is a competition between divergence and convergence within complementary profiles. A general observation in SYSTERS is that the detection of divergence is more frequent than the detection of convergence. However, in cases where many similar sequences of a particular organism occur in a single protein family, additional discrimination between orthology and paralogy, or the usage of the functional orthology concept, would increase the accuracy of this method.
Phylogenetic profiles are generated for many established data sets or are initiated for separate assessment purposes. Most of them are BLAST-based attempts, and many validations use the COG data set. Phylogenetic profiling on other prominent protein classification systems, in particular phylogeny-based attempts, would consequently refine insights into what the phylogenetic profiling method can deliver.
Phylogenetic profiling currently combines only two scientific fields: taxonomy with function, or taxonomy with sequence information. In the latter case, experimentally verified function, as the third evolutionary field, should be introduced to improve the tagging or even solidify the inference of functional contexts. A restriction to experimentally confirmed function information, for instance through the functional orthologs concept, would give a higher information quality, not only for the protein family but also for phylogenetic profiling. Phylogenetic profiling offers the opportunity to gain insight into large-scale data through an intuitive, easily readable and fast visualization.

ACKNOWLEDGMENT
TM wishes to thank Antje Krause for a fundamental introduction to the field of computational molecular biology, with its particular view on protein families. Valuable hints for the concept of this review were adopted from her thesis.

REFERENCES
Abhiman, S., & Sonnhammer, E. L. (2005a). Large-scale prediction of function shift in protein families
with a focus on enzymatic function. Proteins, 60(4), 758-768.
Abhiman, S., & Sonnhammer, E. L. (2005b). FunShift: A database of function shift analysis on protein
subfamilies. Nucleic Acids Research, 33(Database issue), 197-200.
Alexeyenko, A., Tamas, I., Liu, G., & Sonnhammer, E. L. (2006). Automatic clustering of orthologs
and inparalogs shared by multiple proteomes. Bioinformatics, 22(14), 9-15.


Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped
BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17), 3389-3402.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., et al. (2000). Gene ontology:
Tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25(1), 25-29.
Barthelmes, J., Ebeling, C., Chang, A., Schomburg, I., & Schomburg, D. (2007). BRENDA, AMENDA
and FRENDA: The enzyme information system in 2007. Nucleic Acids Research, 35(Database issue),
511-514.
Berglund, A. C., Sjölund, E., Östlund, G., & Sonnhammer, E. L. (2008). InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucleic Acids Research, 36(Database issue), 263-266.
Bowers, P. M., Pellegrini, M., Thompson, M. J., Fierro, J., Yeates, T. O., & Eisenberg, D. (2004). Prolinks:
A database of protein functional linkages derived from coevolution. Genome Biology, 5(5), R35.
Brown, D. P., Krishnamurthy, N., & Sjölander, K. (2007). Automated protein subfamily identification
and classification. PLoS Computational Biology, 3(8), e160.
Brown, D., & Sjölander, K. (2006). Functional classification using phylogenomic inference. PLoS Computational Biology, 2(6), e77.
Chen, F., Mackey, A. J., Stoeckert, C. J., & Roos, D. S. (2006). OrthoMCL-DB: querying a comprehensive
multi-species collection of ortholog groups. Nucleic Acids Research, 34(Database issue), 363-368.
Chen, F., Mackey, A. J., Vermunt, J. K., & Roos, D. S. (2007). Assessing performance of orthology
detection strategies applied to eukaryotic genomes. PLoS ONE, 2(4), e383.
Cokus, S., Mizutani, S., & Pellegrini, M. (2007). An improved method for identifying functionally
linked proteins using phylogenetic profiles. BMC Bioinformatics, 8 Suppl 4, S7.
Dandekar, T., Schuster, S., Snel, B., Huynen, M., & Bork, P. (1999). Pathway alignment: application to
the comparative analysis of glycolytic enzymes. The Biochemical Journal, 343, 115-124.
Date, S. V., & Marcotte, E. M. (2003). Discovery of uncharacterized cellular systems by genome-wide
analysis of functional linkages. Nature Biotechnology, 21(9), 1055-1062.
Date, S. V., & Marcotte, E. M. (2005). Protein function prediction using the Protein Link EXplorer
(PLEX). Bioinformatics, 21(10), 2558-2559.
Dayhoff, M., Schwartz, R., & Orcutt, B. (1978). Atlas of protein sequence and structure (Vol. 5). Silver
Spring: National Biomedical Research Foundation.
Dehal, P. S., & Boore, J. L. (2006). A phylogenomic gene cluster resource: The Phylogenetically Inferred
Groups (PhIGs) database. BMC Bioinformatics, 7, 201.
Doerks, T., von Mering, C., & Bork, P. (2004). Functional clues for hypothetical proteins based on genomic context analysis in prokaryotes. Nucleic Acids Research, 32(21), 6321-6326.
Dohm, J. C., Vingron, M., & Staub, E. (2006). Horizontal gene transfer in aminoacyl-tRNA synthetases
including leucine-specific subtypes. Journal of Molecular Evolution, 63(4), 437-447.


Enright, A. J., Kunin, V., & Ouzounis, C. A. (2003). Protein families and TRIBES in genome sequence
space. Nucleic Acids Research, 31(15), 4632-4638.
Enright, A. J., Van Dongen, S., & Ouzounis, C. A. (2002). An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7), 1575-1584.
ExPASy Proteomics Server. (2008). UniProt Knowledgebase: Controlled vocabulary of species. Retrieved
March 13, 2003, from ftp://ftp.expasy.org/databases/uniprot/knowledgebase/docs/speclist.txt
Finn, R. D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., et al. (2006).
Pfam: Clans, Web tools and services. Nucleic Acids Research, 34(Database issue), 247-251.
Fitch, W. M. (2000). Homology: a personal view on some of the problems. Trends in Genetics, 16(5),
227-231.
Fleischmann, A., Darsow, M., Degtyarenko, K., Fleischmann, W., Boyce, S., Axelsen, K. B., et al.
(2004). IntEnz, the integrated relational enzyme database. Nucleic Acids Research, 32(Database issue),
434-437.
Flicek, P., Aken, B. L., Beal, K., Ballester, B., Caccamo, M., Chen, Y., et al. (2008). Ensembl 2008.
Nucleic Acids Research, 36(Database issue), 707-714.
Glazko, G. V., & Mushegian, A. R. (2004). Detection of evolutionarily stable fragments of cellular
pathways by hierarchical clustering of phyletic patterns. Genome Biology, 5(5), R32.
Heinicke, S., Livstone, M. S., Lu, C., Oughtred, R., Kang, F., Angiuoli, S. V., et al. (2007). The Princeton Protein Orthology Database (P-POD): a comparative genomics analysis tool for biologists. PLoS
ONE, 2(1), e766.
Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22), 10915-10919.
Hulsen, T., de Vlieg, J., & Groenen, P. M. (2006). PhyloPat: Phylogenetic pattern analysis of eukaryotic
genes. BMC Bioinformatics, 7, 398.
Jothi, R., Przytycka, T. M., & Aravind, L. (2007). Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment. BMC
Bioinformatics, 8, 173.
Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., et al. (2008). KEGG for linking
genomes to life and the environment. Nucleic Acids Research, 36(Database issue), 480-484.
Kelil, A., Wang, S., Brzezinski, R., & Fleury, A. (2007). CLUSS: Clustering of protein sequences based
on a new similarity measure. BMC Bioinformatics, 8, 286.
Koonin, E. V. (2005). Orthologs, paralogs, and evolutionary genomics. Annual Review of Genetics, 39,
309-338.
Krause, A. (2006). Large scale protein sequence clustering - not solved but solvable. Current Bioinformatics, 1(2), 247-254.


Krause, A., Stoye, J., & Vingron, M. (2005). Large scale hierarchical clustering of protein sequences.
BMC Bioinformatics, 6, 15.
Krishnamurthy, N., Brown, D. P., Kirshner, D., & Sjölander, K. (2006). PhyloFacts: An online structural
phylogenomic encyclopedia for protein functional and structural classification. Genome Biology, 7(9),
R83.
Li, H., Coghlan, A., Ruan, J., Coin, L. J., Hériché, J. K., Osmotherly, L., et al. (2006). TreeFam: A
curated database of phylogenetic trees of animal gene families. Nucleic Acids Research, 34(Database
issue), 572-580.
Li, L., Stoeckert, C. J., & Roos, D. S. (2003). OrthoMCL: identification of ortholog groups for eukaryotic
genomes. Genome Research, 13(9), 2178-2189.
Liberles, D. A., Thoren, A., von Heijne, G., & Elofsson, A. (2002). The use of phylogenetic profiles for
gene predictions. Current Genomics, 3, 131-137.
Liolios, K., Mavromatis, K., Tavernarakis, N., & Kyrpides, N. C. (2008). The genomes on line database
(GOLD) in 2007: Status of genomic and metagenomic projects and their associated metadata. Nucleic
Acids Research, 36(Database issue), 475-479.
Luz, H., & Vingron, M. (2006). Family specific rates of protein evolution. Bioinformatics, 22(10), 1166-1171.
Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., & Eisenberg, D. (1999). A combined
algorithm for genome-wide prediction of protein function. Nature, 402(6757), 83-86.
Marcotte, E. M., Xenarios, I., van Der Bliek, A. M., & Eisenberg, D. (2000). Localizing proteins in the
cell from their phylogenetic profiles. Proceedings of the National Academy of Sciences of the United
States of America, 97(22), 12115-12120.
Meinel, T., Krause, A., Luz, H., Vingron, M., & Staub, E. (2005). The SYSTERS Protein Family Database in 2005. Nucleic Acids Research, 33(Database issue), 226-229.
Mistry, J., Bateman, A., & Finn, R. D. (2007). Predicting active site residue annotations in the Pfam
database. BMC Bioinformatics, 8, 298.
O'Brien, K. P., Remm, M., & Sonnhammer, E. L. (2005). Inparanoid: a comprehensive database of
eukaryotic orthologs. Nucleic Acids Research, 33(Database issue), 476-480.
Paccanaro, A., Casbon, J. A., & Saqi, M. A. (2006). Spectral clustering of protein sequences. Nucleic
Acids Research, 34(5), 1571-1580.
Pagel, P., Wong, P., & Frishman, D. (2004). A domain interaction map based on phylogenetic profiling.
Journal of Molecular Biology, 344(5), 1331-1346.
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., & Yeates, T. O. (1999). Assigning
protein functions by comparative genome analysis: Protein phylogenetic profiles. Proceedings of the
National Academy of Sciences of the United States of America, 96(8), 4285-4288.
Remm, M., Storm, C. E., & Sonnhammer, E. L. (2001). Automatic clustering of orthologs and in-paralogs
from pairwise species comparisons. Journal of Molecular Biology, 314(5), 1041-1052.

Ruan, J., Li, H., Chen, Z., Coghlan, A., Coin, L. J., Guo, Y., et al. (2008). TreeFam: 2008 Update. Nucleic
Acids Research, 36(Database issue), 735-740.
Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of
Molecular Biology, 147(1), 195-197.
Snel, B., & Huynen, M. A. (2004). Quantifying modularity in the evolution of biomolecular systems.
Genome Research, 14(3), 391-397.
Snitkin, E. S., Gustafson, A. M., Mellor, J., Wu, J., & DeLisi, C. (2006). Comparative assessment of
performance and genome dependence among phylogenetic profiling methods. BMC Bioinformatics, 7,
420.
Sonnhammer, E. L., & Koonin, E. V. (2002). Orthology, paralogy and proposed classification for paralog
subtypes. Trends in Genetics, 18(12), 619-620.
Tatusov, R. L., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin, E. V., et al. (2003).
The COG database: An updated version includes eukaryotes. BMC Bioinformatics, 4, 41.
Tatusov, R. L., Koonin, E. V., & Lipman, D. J. (1997). A genomic perspective on protein families. Science, 278(5338), 631-637.
The UniProt Consortium. (2008). The universal protein resource (UniProt). Nucleic Acids Research,
36(Database issue), 190-195.
Valencia, A. (2005). Automatic annotation of protein function. Current Opinion in Structural Biology,
15(3), 267-274.
Vastrik, I., D'Eustachio, P., Schmidt, E., Joshi-Tope, G., Gopinath, G., Croft, D., et al. (2007). Reactome:
A knowledge base of biologic pathways and processes. Genome Biology, 8(3).
von Mering, C., Jensen, L. J., Kuhn, M., Chaffron, S., Doerks, T., Kruger, B., et al. (2007). STRING 7-recent developments in the integration and prediction of protein interactions. Nucleic Acids Research,
35(Database issue), 358-362.
Wheeler, D. L., Church, D. M., Edgar, R., Federhen, S., Helmberg, W., Madden, T. L., et al. (2004).
Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Research, 32(Database issue), 35-40.
Wittkop, T., Baumbach, J., Lobo, F. P., & Rahmann, S. (2007). Large scale clustering of protein sequences
with FORCE -A layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8, 396.
Woese, C. R., Olsen, G. J., Ibba, M., & Söll, D. (2000). Aminoacyl-tRNA synthetases, the genetic code,
and the evolutionary process. Microbiology and Molecular Biology Reviews, 64(1), 202-236.
Wolf, Y. I., Aravind, L., Grishin, N. V., & Koonin, E. V. (1999). Evolution of aminoacyl-tRNA synthetases--analysis of unique domain architectures and phylogenetic trees reveals a complex history of
horizontal gene transfer events. Genome Research, 9(8), 689-710.
Wu, J., Hu, Z., & DeLisi, C. (2006). Gene annotation and network inference by phylogenetic profiling.
BMC Bioinformatics, 7, 80.


Wu, J., Kasif, S., & DeLisi, C. (2003). Identification of functional links between genes using phylogenetic
profiles. Bioinformatics, 19(12), 1524-1530.
Yang, S., Doolittle, R. F., & Bourne, P. E. (2005). Phylogeny determined by protein domain content.
Proceedings of the National Academy of Sciences of the United States of America, 102(2), 373-378.
Ye, Y., & Godzik, A. (2004). Comparative analysis of protein domain organization. Genome Research,
14(3), 343-353.
Zheng, Y., Roberts, R. J., & Kasif, S. (2002). Genomic functional annotation using co-evolution profiles
of gene clusters. Genome Biology, 3(11).

KEY TERMS
BLAST: Basic Local Alignment Search Tool. A heuristic algorithm for searching for similar words or sequences in databases.
Distance Measure: Measure to compare protein sequences by their amino acid composition. Summing the evaluations of pairs of character states (amino acids at the same position) with a similarity measure leads to the similarity score. The distance is the difference between the relative similarity and 1.
E-Value: Parameter that describes the expected number of database matches of at least the observed score that would occur by chance; a low E-value indicates that a hit is unlikely to be random.
Local Sequence Similarity: Similarity of two sequences is often found only on a local sequence
level by a sequence comparison algorithm (e.g., BLAST). Identical partial subsequences are found in
protein domains, for instance, and induce local sequence similarity.
Multiple Sequence Alignment: Three or more sequences are displayed with comparable characters (for proteins: amino acid residues) arranged in columns.
Pairwise Sequence Alignment: Two sequences are displayed in two rows with comparable characters, amino acid residues for protein sequences, in columns.
Phylogenetic Profile: Presence/absence indication for a family of genes or proteins across a given set of organisms. A phylogenetic profile represents a gene or protein family by providing a taxonomic overview.
Phylogenetic Profiling: Comparison of two or more phylogenetic profiles. Protein families in a common functional context possess similar phylogenetic profiles.
Similarity Of Sequences: Two protein sequences can be compared at each amino acid position. Identical residues or similar biophysical behavior of the compared amino acids determines sequence similarity. An alignment of at least two protein sequences is required.
Similarity Score: Measure of the exchangeability of each of the twenty amino acids with each of the remaining nineteen, organized in a scoring matrix.
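As a worked illustration of the Similarity Score and Distance Measure terms above, the following sketch scores a pre-aligned (ungapped) sequence pair with a toy substitution matrix and converts the relative similarity into a distance; the matrix entries are invented for demonstration, and normalizing by the self-score of the first sequence is one possible choice of relative similarity.

```python
# Worked illustration of the "Similarity Score" and "Distance Measure" terms:
# score a pre-aligned (ungapped) sequence pair with a toy substitution matrix
# and convert the relative similarity into a distance. The matrix entries are
# invented for demonstration and are not BLOSUM/PAM values.

TOY_MATRIX = {("A", "A"): 4, ("A", "S"): 1, ("S", "A"): 1,
              ("S", "S"): 4, ("A", "W"): -3, ("W", "A"): -3,
              ("S", "W"): -3, ("W", "S"): -3, ("W", "W"): 11}

def similarity_score(seq1, seq2, matrix=TOY_MATRIX):
    """Sum of per-column substitution scores for two equal-length aligned sequences."""
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))

def distance(seq1, seq2, matrix=TOY_MATRIX):
    """1 minus the similarity relative to the self-score of the first sequence."""
    self_score = similarity_score(seq1, seq1, matrix)
    return 1.0 - similarity_score(seq1, seq2, matrix) / self_score

print(similarity_score("AASW", "ASSW"))      # 4 + 1 + 4 + 11 = 20
print(round(distance("AASW", "ASSW"), 2))    # self-score 23 -> 1 - 20/23 = 0.13
```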


Chapter IX

Computational Methods for the Prediction of GPCRs Coupling Selectivity
Nikolaos G. Sgourakis
Rensselaer Polytechnic Institute, USA
Pantelis G. Bagos
University of Central Greece, and University of Athens, Greece
Stavros J. Hamodrakas
University of Athens, Greece

ABSTRACT
GPCRs comprise a wide and diverse class of eukaryotic transmembrane proteins with well-established
pharmacological significance. As a consequence of recent genome projects, there is a wealth of information at the sequence level that lacks any functional annotation. These receptors, often referred to as orphan GPCRs, could potentially lead to novel drug targets. However, typical experiments that aim at elucidating their function are hampered by the lack of knowledge of their selective coupling partners at the interior of the cell, the G-proteins. To date, computational efforts to predict properties of
GPCRs have been focused mainly on the ligand-binding specificity, while the aspect of coupling has
been less studied. Here, we present the main motivations, drawbacks, and results from the application
of bioinformatics techniques to predict the coupling specificity of GPCRs to G-proteins, and discuss the
application of the most successful methods in both experimental works that focus on a single receptor
and large-scale genome annotation studies.


INTRODUCTION / BACKGROUND

G-protein coupled receptors (GPCRs) comprise a very important family of eukaryotic cell-surface membrane proteins. They are characterized by the structural hallmark of seven transmembrane helices, as
exemplified by the crystal structure of rhodopsin (Palczewski et al. 2000), which has been extensively used
as a homology modeling template for many receptor sequences (Nikiforovich et al. 2001; Becker et al.
2003). GPCRs play a pivotal role in signal transduction of eukaryotic cells, acting as the major sensors
at the boundary between a cell and the outside world. Depending on their ligand-binding specificity,
GPCRs can be activated by a broad range of external stimuli, from ions and small molecules to larger
peptides and proteins, including light (Gether 2000). To perform these functions, GPCRs have evolved
into a diversity of sequences that are traditionally classified into six major families, based mainly on shared
homology (Horn et al. 2003). GPCRs have known representatives in most eukaryotic organisms, including yeast and plants, such as the recently discovered Arabidopsis thaliana seven-transmembrane (7TM)
domain receptor GCR1 (Jones and Assmann 2004).
As signified by their name, upon binding to a ligand, GPCRs exert their role through the specific
interaction with a more limited repertoire of intracellular proteins that hydrolyze GTP, namely the
G-proteins (Neer and Clapham 1988). G-proteins are heterotrimeric complexes composed of three
subunits Gα, Gβ and Gγ. They are classified into four main families, according to the type of their α subunit, which also possesses Ras-like GTPase activity (Benjamin et al. 1995). These include Gs and Gi/o, which stimulate and inhibit adenylate cyclase, respectively (Johnston and Watts 2003), Gq/11, which activates phospholipase C (Exton 1993), and the less characterized G12/13 family, which activates the Na+/H+ exchange pathway (Kurose 2003). At least 16 different subtypes of Gα subunits have been identified and classified into these four families (Downes and Gautam 1999; Kristiansen 2004). Interaction of the G-protein trimer with the activated receptor triggers the exchange of the bound GDP with GTP and subsequently the dissociation of the complex into Gα and Gβγ moieties, which activate downstream effector molecules. Hydrolysis of GTP to GDP by the α subunit returns the complex to its original, inactive state
(Neer 1995). As a result, depending on the selectivity of the GPCR - G-protein interaction, a specific
downstream pathway may be activated. Despite extensive experimental and computational studies,
the structural basis of this specificity is not well characterized, while the mechanisms that determine
the function of the activated GPCR/G-protein complex are yet to be uncovered (Muramatsu and Suwa
2006). Furthermore, the diversity of GPCR-G-protein interactions is enriched by several receptors that
may alternatively interact with more than one family of G-proteins, known as promiscuous GPCRs.
For instance, the human thyrotropin receptor can couple to all four G-protein families (Laugwitz et
al. 1996). In general, promiscuity seems to be a rule rather than an exception for interactions between
GPCRs and G-proteins (Wess 1998; Oliveira et al. 1999; Horn et al. 2000). Several lines of evidence
indicate the importance of the GPCR intracellular regions, as well as the intracellular boundaries of
the transmembrane helices (Gether 2000). It is also established that the regions of interaction on the G-protein are mainly the N- and C-termini of the Gα subunit, together with regions of the Gβγ dimer. However, to date, these findings have not been incorporated into a high-resolution, systematic model of GPCR/G-protein interactions, while the nature of the underlying mechanism is believed to be specific to the
interacting partners (Wess 1998).
Due to their function as input nodes in the signaling pathways of eukaryotic cells, GPCRs play a
very important role in health and disease (Muller 2000). GPCRs are involved in a variety of pathological conditions including cystic fibrosis, cancer and HIV-mediated infection of host cells. The ability to rationally modulate the cell's signaling pathways through these target molecules has introduced a
new era in pharmacology. In fact, the significance of G-Protein coupled receptors (GPCRs) as major
pharmacological targets can be compared with no other single family of proteins. This is further illustrated by the fact that more than 50% of all known drugs act on GPCRs, a family that is represented by
less than 1% of open reading frames (ORFs) identified in the human genome. Taking into account the
growing number of novel GPCRs that could be used as potential drug candidates (Chalmers and Behan
2002; Dechering 2005), the variety of signaling pathways that start or can be regulated by GPCRs and
the rate of expansion of biological sequence databases, the wealth of pharmacological targets available
within this versatile family of proteins is evident.
Furthermore, recent genomics initiatives have resulted in a plethora of newly discovered GPCR
sequences. These receptors are often characterized as orphan GPCRs (oGPCRs), in the sense that they
lack any annotation regarding their function. Most importantly, the endogenous ligands of oGPCRs
are yet to be identified. The lack of knowledge on the ligand repertoire that can activate these receptors
limits their utility as drug candidates for traditional ligand-based assays. One approach to overcome this
limitation involves the use of constitutively active GPCRs (reviewed in (Chalmers and Behan 2002)),
that stimulate cellular signaling in the absence of a bound ligand. Constitutively Active Receptor Technology (CART) is based on the ability to genetically engineer receptors by modifying the conformation of a latch region of the core seven-transmembrane domain. However, high-throughput screening
involves detection of a cellular response, such as cAMP production in the case of receptors that couple
to Gi/o or Gs and calcium flux or inositol phosphate production for Gq/11-coupled receptors. Therefore,
the knowledge of the interacting G-protein is essential in designing these assays. Prediction methods,
based on the available genomic information, offer an attractive alternative to deciphering the coupling
specificity experimentally, a procedure that consumes time and laboratory resources.
Although several computational methods have been developed to provide information on the ligand-binding properties of GPCRs (reviewed in (Lu et al. 2006)), as well as on their phylogeny (Papasaikas et
al. 2003; Shigeta et al. 2003; Papasaikas et al. 2004), the aspect of selective coupling to the G-proteins
has been less studied. In particular, the application of standard bioinformatics tools to develop prediction methods of the immediate interacting partners of GPCRs was hampered for many years by the lack
of a systematic database of interactions, and more importantly by the lack of experimental data on the
mechanism of selective binding, activation and signal transduction through their G-protein partners.
Recently, the range of G-proteins and their interaction with GPCRs has been extensively annotated in
the G-protein Database (Elefsinioti et al. 2004). Such knowledge is essential in designing experiments
for screening orphan receptors against libraries of potential ligands by providing the missing link between GPCRs and cellular responses. In this review we briefly describe the main motivations,
drawbacks and results from the application of computational methods for the prediction of GPCRs
coupling selectivity.

COMPUTATIONAL METHODS FOR COUPLING PREDICTION

Several methods have been developed for the prediction of GPCRs coupling selectivity to G-proteins
and their properties are summarized in Table 1. Their training philosophy, performance and caveats are
discussed herein in further detail. An attractive feature of some of these methods is their availability
online through web-based servers, for non-commercial users. Such methods can be widely used as online prediction tools both by experimentalists, for the characterization and study of specific receptors, and by bioinformaticians, for large-scale genome annotation projects and the development of new,
more efficient computational tools.

Pattern and Naïve Bayes Model-Based Methods


The increasing biochemical evidence that GPCR-G-protein recognition takes place within the intracellular loops of the receptor (Wess 1998; Wong 2003) motivated the application of pattern discovery techniques aimed at identifying the functionally important segments of residues for the interaction with G-proteins, and using these patterns to classify GPCRs according to their coupling specificity (Möller
et al. 2001). This was the first extensive pattern discovery study applied for the prediction of GPCRs
coupling selectivity. The authors used a non-redundant dataset of 103 human GPCRs with experimentally determined coupling selectivity (Alexander 2000). However, this approach was limited by the low
degree of accuracy of regular expressions as classifiers of biological sequences, as well as the need for
prediction of the transmembrane segments of GPCRs in a preceding step, as discussed later. Specifically,
the use of regular expressions limits the extent of sequence variability that can be incorporated to the
prediction model. Furthermore, this choice of model leads to a redundancy of patterns that describe the
same coupling group, as exemplified in figure 1 of reference (Sgourakis et al. 2005). Thus, the common
experimental observation of different coupling to G-proteins for receptors from the same subfamily
cannot be reproduced by such a prediction scheme. This is also reflected in the low sensitivity of the method
(30-40% as reported by the authors). In general, it appears that the coupling of GPCRs to G-proteins is a
late event in molecular evolution that has been achieved by the mutation of key residues of the receptors'
intracellular domains to fine-tune interaction with certain families of G-proteins, while excluding others
(Wess 1998; Sgourakis et al. 2005). This is the basis of the coupling prediction problem, and the reason
that traditional sequence alignment methods produce predictions with little accuracy that cannot be
generalized to the entire repertoire of receptors (Horn et al. 2000).
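
To make the pattern-based idea concrete, the short sketch below shows how a set of regular expressions could, in principle, be matched against the intracellular loops of a receptor to assign coupling groups. The patterns and loop sequences are invented placeholders for illustration only; they are not the motifs reported by Möller et al. (2001).

import re

# Minimal sketch of regular-expression-based coupling classification.
# The patterns below are hypothetical placeholders, NOT the published motifs;
# they only illustrate matching against intracellular loop sequences.
HYPOTHETICAL_PATTERNS = {
    "Gi/o":  [r"[KR].{2}[LIV]R", r"KK.{3}L"],
    "Gs":    [r"R[ST].{2}K", r"[DE]R[YF]"],
    "Gq/11": [r"[KR][KR].{4}P"],
}

def classify_by_patterns(intracellular_loops):
    """Return every coupling group whose patterns match any intracellular loop."""
    hits = set()
    for group, patterns in HYPOTHETICAL_PATTERNS.items():
        for loop in intracellular_loops:
            if any(re.search(p, loop) for p in patterns):
                hits.add(group)
    return sorted(hits) or ["unclassified"]

# Example: three made-up intracellular loops of a receptor.
print(classify_by_patterns(["DRYLAIV", "KKAARTL", "NPIIYKRF"]))
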
More advanced statistical methods, such as Naïve Bayes models (Cao et al. 2003), were
also applied for the same task. The authors used a training dataset of 91 receptors (Alexander 2001).
This approach introduced a new method in GPCR coupling prediction that however did not lead to a
significant increase in accuracy (72%, validated in an independent dataset of 55 GPCRs). Furthermore,
their method seemed to over-predict interactions for most GPCRs in their dataset, by assigning multiple interacting partners (i.e., the number of predicted promiscuous receptors was higher than the experimentally
observed). The authors justified this observation by arguing that promiscuous coupling is a common
attribute of GPCRs (Oliveira et al. 1999), although not always experimentally determined. However,
this raises issues regarding the specificity of their model.
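
As a rough illustration of the general idea (not of Cao et al.'s actual feature set), the following sketch scores an intracellular-loop sequence under a multinomial Naïve Bayes model, treating each residue as an independent draw from a class-specific composition profile; the profiles and priors shown are made-up placeholders that would normally be estimated from a training set.

from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def naive_bayes_log_posterior(loop_seq, class_profiles, priors):
    """Multinomial Naive Bayes over the residue counts of an intracellular loop.

    class_profiles maps each coupling class to its expected residue frequencies
    (one value per amino acid); priors maps each class to its prior probability.
    """
    counts = Counter(loop_seq)
    scores = {}
    for cls, profile in class_profiles.items():
        logp = log(priors[cls])
        for aa, p in zip(AMINO_ACIDS, profile):
            logp += counts.get(aa, 0) * log(p + 1e-6)  # small pseudocount for safety
        scores[cls] = logp
    return scores

# Example with made-up uniform profiles and priors for two classes:
uniform = [1.0 / 20] * 20
print(naive_bayes_log_posterior("DRYLAIVKKAAR",
                                {"Gi/o": uniform, "Gs": uniform},
                                {"Gi/o": 0.5, "Gs": 0.5}))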

Profile Hidden Markov Model-Based Methods


Profile Hidden Markov Models (pHMMs) (reviewed in (Eddy 1998)) have been widely used as a statistical tool for the classification of biological sequences (Krogh et al. 1994). The first application of Hidden Markov Models as predictors of GPCRs coupling specificity was carried out by Goldstein's
group (Qian et al. 2003). This was also the first information-intensive computational approach to classify
GPCRs according to their coupling specificity. The authors used the intracellular domains of GPCRs with
known coupling specificity to train tree-based Hidden Markov Models (Mitchison and Durbin 1995) that

Table 1. Prediction methods of GPCRs coupling selectivity. The training dataset, applied technique and performance of each method are presented, in addition to its online availability. Out of 9 published methods, 4 can predict efficiently promiscuous coupling, 3 are available through an online server for non-commercial users and only 1 can predict coupling to G-proteins of the G12/13 family. N/R: Not reported.

Tool | Training set | Accuracy | Method | Additional inputs | G-protein classes | Promiscuous | Availability | URL / email
Qian et al. | 95 GPCRs | 83%* | tree-based HMMs | Transmembrane topology | Gi/o, Gs, Gq/11 | no | personal communication | richard.goldstein@nimr.mrc.ac.uk
Vilo et al. | 103 GPCRs | N/R | Regular expressions | Transmembrane topology | Gi/o, Gs, Gq/11 | no | web server | http://ep.ebi.ac.uk/GPCR/
Cao et al. | 91 GPCRs | 72%* | Naïve Bayes models | Transmembrane topology | Gi/o, Gs, Gq/11 | yes | personal communication | jack.cao@astrazeneca.com
Sreekumar et al. | 102 GPCRs | >95%* | knowledge-restricted pHMMs | Transmembrane topology | Gi/o, Gs, Gq/11 | no | personal communication | sreekuk@wyeth.com
GRIFFIN | 132 GPCRs | 85% | SVMs and pHMMs | Ligand properties | Gi/o, Gs, Gq/11 | no | web server | http://griffin.cbrc.jp/
PRED-COUPLE | 282 GPCRs** | 90% | Refined pHMMs / QFAST algorithm | None required | Gi/o, Gs, Gq/11 | yes | web server | http://bioinformatics.biol.uoa.gr
PRED-COUPLE2 | 188 GPCRs | 92% | Refined pHMMs and ANNs | None required | Gi/o, Gs, Gq/11, G12/13 | yes | web server | http://bioinformatics.biol.uoa.gr
Guo et al. | 282 GPCRs** | 93% | ACC and SVMs | None required | Gi/o, Gs, Gq/11 | yes | program download | -
Ono et al. | 153 GPCRs | 94% | pHMMs, NLP and decision trees | Biological functions | Gi/o, Gs, Gq/11 | no | not available | -

* A significant drop in accuracy is expected for this method for full-length GPCR sequences.
** Several non-human receptors were included.

could act as classifiers when aligned against a GPCR sequence of unknown selectivity. Their method
was found to perform quite well, given the small size of the training dataset (95 receptor sequences). In
a following study, Sreekumar and coworkers used all the intracellular domains of GPCRs with known
coupling selectivity concatenated in a single sequence to train their models (Sreekumar et al. 2004).
For this reason, these profiles were described as knowledge-restricted HMMs. This strategy led to a high correct classification rate (~95%) when tested against receptors with known location of the transmembrane segments. However, their approach, like the previously mentioned methods, was limited by the requirement of known transmembrane topology for the receptor under query. Despite the high efficiency in predicting the coupling specificity of GPCRs with known transmembrane topology, this limitation rendered these methods practically inapplicable in the case of oGPCRs, where no information other than the sequence is available. At this point we should note the relatively low accuracy
of transmembrane segment prediction algorithms, despite the development of methods targeted specifically at GPCRs (Rayan et al. 2000). Even the most accurate methods predict the correct topology with
a rate that does not exceed 75% on a residue level (Viklund and Elofsson 2004).
In a previous bioinformatics work (Sgourakis et al. 2005), we addressed the problem of GPCR coupling specificity with a high degree of efficiency, despite the lack of transmembrane topology information. This was made possible through the implementation of a selection process to generate refined
Hidden Markov Models of high discriminative power that model the intracellular domains of receptors
that couple to the three main families of G-proteins. Results from individual profiles that corresponded
to different intracellular domains were combined by the QFAST algorithm (Bailey and Gribskov 1998),
to produce the final score for each coupling group, while ROC-curve analysis was used to optimize the
cutoff that is applied to produce a final prediction. This strategy allowed for the first time the prediction
of interaction for promiscuous receptors, although with a rather small accuracy that did not exceed 30%.
Another novelty of our approach was the inclusion of the membrane-proximal segments of the transmembrane domains extending from the intracellular face of the membrane to train the highly refined,
discriminative models. This was motivated by the fact that the location of the membrane boundaries
is not precisely determined, even in high-resolution crystal structures. Furthermore, according to the
general model of receptor activation, the intracellular face of a receptor opens up during the binding of
the ligand, rendering residues at the interior of the receptor towards the membrane accessible to interaction with the G-protein (Gether 2000). In addition, this method offers the advantage of high specificity
against non-GPCR sequences through the implementation of GPCR-specific profiles from the PFAM
database (Sonnhammer et al. 1998; Bateman et al. 2004). Therefore, for the first time we provided a
high-throughput genome annotation pipeline that could be used independently of transmembrane prediction schemes to guide experiments that aim to decipher the role of oGPCRs.
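
For readers unfamiliar with the p-value combination step mentioned above, the following minimal sketch implements the product-of-p-values formula in the spirit of QFAST (Bailey and Gribskov 1998); the example p-values are arbitrary and the function is illustrative rather than a reimplementation of the PRED-COUPLE pipeline.

from math import log, factorial

def qfast_combine(pvalues):
    """Combine independent p-values via their product: for n independent
    p-values with product P, the probability of observing a product at
    least this small is  P * sum_{k=0}^{n-1} (-ln P)^k / k!.
    """
    n = len(pvalues)
    product = 1.0
    for p in pvalues:
        product *= p
    if product <= 0.0:
        return 0.0
    log_term = -log(product)
    return product * sum(log_term ** k / factorial(k) for k in range(n))

# Example: per-domain p-values from four intracellular segments of a receptor.
print(qfast_combine([0.01, 0.2, 0.05, 0.5]))  # combined evidence for one coupling group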

Methods that Combine a Variety of Techniques


In a following study (Sgourakis et al. 2005), we expanded the range of GPCRs recognized by our
models, by including the less characterized group of G12/13-coupled receptors in addition to the Gi/o,
Gq/11 and Gs families. A very important aspect of GPCR signaling (as reviewed in (Riobo and Manning 2005)), G12/13 coupling had, to the best of our knowledge, been ignored by all methods published to date, due to the lack of experimentally determined interaction data (Möller et al. 2001; Cao et al. 2003; Yabuki et al. 2005; Guo et al. 2006). To overcome this limitation, we performed
an extensive literature search and were able to construct an extensive, non-redundant dataset of 188
GPCR sequences annotated according to their coupling specificity, which includes 35 sequences of receptors that couple to G12/13. A very important feature of this method is the incorporation of sequences
belonging to promiscuous receptors (65 in total) that were also retrieved from the literature. By using
an intuitive method to train our models, and the aforementioned optimization procedure, we were able
to construct a refined library of pHMMs that could efficiently model promiscuous coupling (with an
accuracy of 85%, exceeding all published methods), as well as interactions with G12/13 proteins (with
an accuracy of 95%), as evaluated in a five-fold cross-validation procedure. In addition, we introduced
an Artificial Neural Network algorithm to combine the results of independent models in producing the
final prediction of the method. This methodology was later adapted for the identification of promoters
in large-scale genomic data from prokaryotic genomes (Mann et al. 2007). Furthermore, the specificity of this improved method against non-GPCR sequences was enhanced by the use of the results from
querying an unknown sequence against all refined models in our database, combined with the QFAST
algorithm. This addition to the already implemented highly specific GPCR profiles from the PFAM database (Bateman et al. 2004) proved to be a very efficient filter in screening non-GPCR sequences,
by effectively identifying all non-GPCRs in two independent datasets consisting of 1113 globular and
1356 transmembrane non-GPCR sequences (adopted from (Papasaikas et al. 2003)).
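
The way an artificial neural network can merge per-family pHMM scores into a single prediction is illustrated by the hedged sketch below: a plain feed-forward pass with placeholder (randomly initialised) weights and an assumed 12-input / 4-class layout, not the actual trained network behind PRED-COUPLE2.

import numpy as np

def ann_combine(domain_scores, W1, b1, W2, b2):
    """Forward pass of a small feed-forward network that merges per-family
    pHMM scores (one score per G-protein family, per intracellular domain)
    into final coupling probabilities.  The weights here are placeholders;
    in practice they would be fitted on receptors with known coupling.
    """
    x = np.asarray(domain_scores, dtype=float)
    hidden = np.tanh(W1 @ x + b1)              # hidden layer
    logits = W2 @ hidden + b2                  # one logit per coupling class
    expl = np.exp(logits - logits.max())
    return expl / expl.sum()                   # softmax -> class probabilities

# Example: 12 inputs (4 G-protein families x 3 intracellular domains),
# a 6-unit hidden layer and 4 output classes, with random placeholder weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(6, 12)), np.zeros(6)
W2, b2 = rng.normal(size=(4, 6)), np.zeros(4)
print(ann_combine(rng.uniform(size=12), W1, b1, W2, b2))
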
Support Vector Machines have also been applied to the same task, in numerous studies (Yabuki
et al. 2005; Guo et al. 2006). In the first published method, Yabuki and coworkers used a hierarchical
approach to train a method that, besides the receptor's sequence, also uses properties of the ligand as inputs to produce a prediction on its coupling selectivity. They rationalize the use of ligand information by considering the activated receptor as an entity that is composed of the ligand, the GPCR and the G-protein. Indeed, several lines of evidence indicate that the same receptor can signal through a variety
of pathways, depending on its bound agonist, a phenomenon known as agonist trafficking (Hermans
2003). Their method, that is available for online use through a web-based server, uses both profile Hidden Markov Models and Support Vector Machines (SVMs) (reviewed in (Yang 2004)). The pHMMs
are used in a first step to separate class A, which includes the majority of known receptors, from all the others. Thus, pHMMs generated for the opsin and olfactory receptor subfamilies and for the class B, class C, frizzled and smoothened families are used to classify GPCRs not belonging to class A. The authors claim that the coupling specificity of these receptors is determined solely by their family type, although there are several reported examples in the literature where this is not the case (Wess
1998). In a next step, a series of SVM classifiers is used to predict coupling of the remaining GPCRs
into the three main coupling groups. Several parameters are used as inputs to the SVM at this step,
including scores from pHMMs that were trained from GPCRs that bind peptide or amide ligands. The
final, integrated system produces predictions with an accuracy of >85%, as evaluated in a 4-fold cross-validation procedure. However, the observables used as inputs to the method are derived from the sequence of the intracellular loops, and also from the amino acid composition of very specific positions with respect to the sequence of bovine rhodopsin, which is used as an alignment template. Therefore, the
entire prediction is expected to be very sensitive to the performance of the alignment algorithm, which
will produce the position of the transmembrane segments. Given the variability in the output of such
algorithms, which is not guaranteed to contain seven transmembrane segments, we can assess the amount of error inherent in this strategy: Qian and coworkers report that, of a total of 470 GPCR sequences submitted to the TMHMM server (Krogh et al. 2001) for prediction of transmembrane topology, only
417 were predicted as having seven transmembrane segments (Qian et al. 2003). However, the method
of Yabuki and coworkers predicts seven transmembrane segments even for sequences not belonging to
GPCRs. We can thus conclude that this method is not suitable for application in large-scale genomic
data for high-throughput genome annotation purposes.
Recently, a second method that uses SVMs was published (Guo et al. 2006). The main motivation
of this method is to avoid the use of membrane topology information both at the prediction and training steps. It uses the Autocross Covariance Transform method (Wold et al. 1993) to generate the input
vectors for the SVM classifier, based on the physicochemical properties of the entire GPCR sequence
as described by the projections on the first three eigenvectors of a 29-dimensional space that includes
features such as bulk, hydrophobicity and electrostatics (Hellberg et al. 1987). The authors adopted a
dataset from a previously published work from our group (Sgourakis et al. 2005) to train their method.
In addition, to enhance the specificity of their method against non-GPCR sequences, they used a dataset
of 1090 non-GPCR transmembrane proteins, adopted from an earlier work on GPCR classification (Guo
et al. 2006). Given the low degree of sequence similarity between the intracellular domains of different
receptors, and the heuristic nature of most alignment algorithms, the advantage of this strategy is that
it does not rely on the accuracy of sequence profiles. Furthermore, the construction of GPCR-specific
models results in a high degree of accuracy (>95%) in discriminating GPCRs from non-GPCR sequences.
Thus, this method can also be used as a stand-alone tool. The authors report an accuracy of 91.3% for
non-promiscuous GPCRs, as evaluated in a jackknife test with a non-redundant training dataset, and
an accuracy of 80% for promiscuous receptors. In general, this approach yields reliable predictions
from GPCR sequence alone, for all GPCR coupling types, including promiscuous receptors, without
the requirement of any transmembrane topology information or filtering of non-GPCR sequences in a
preceding step. However, it does not provide predictions for G12/13-coupled receptors, since, as the authors claim, a dataset could not be established due to insufficient data. On the contrary, as shown in our preceding study (Sgourakis et al. 2005), this problem can be solved through careful and extensive retrieval of data from the literature. In fact, we freely provide the required dataset in our web-based
server: (http://bioinformatics.biol.uoa.gr/PRED-COUPLE2/training).
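
A minimal sketch of one common formulation of the autocross covariance (ACC) transform is given below; the descriptor values are random placeholders standing in for physicochemical scales such as the z-scales of Hellberg et al. (1987), and details such as mean-centring and normalisation may differ from the implementation of Guo et al. (2006).

import numpy as np

def acc_transform(descriptors, max_lag):
    """Auto-cross covariance (ACC) transform (one common formulation).

    descriptors: (n_residues, n_properties) array, e.g. the residues of a
    GPCR encoded with a few physicochemical scales.  Returns a fixed-length
    vector with one auto/cross-covariance term per property pair and lag,
    which can be fed to an SVM regardless of the original sequence length.
    """
    Z = np.asarray(descriptors, dtype=float)
    n, d = Z.shape
    Zc = Z - Z.mean(axis=0)                      # centre each property
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                cov = np.dot(Zc[:-lag, j], Zc[lag:, k]) / (n - lag)
                features.append(cov)
    return np.array(features)

# Example: a toy 10-residue sequence described by 3 made-up property values each.
rng = np.random.default_rng(1)
print(acc_transform(rng.normal(size=(10, 3)), max_lag=4).shape)  # -> (36,)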

Text Mining Methods


In another study (Ono and Hishigaki 2006), the information included as input for the prediction is mined
from the literature using Natural Language Processing (NLP) techniques. In this approach, text mining
was used to extract features of biological functions as keywords from various databases. The authors
claim a correct classification rate of 92.2%, when tested against a dataset of 152 GPCRs with known
coupling specificity. However, this method also depends on the availability of known biological functions used as input to the predictor. In the case of a newly sequenced orphan GPCR, such knowledge is missing and thus the method cannot benefit from the inclusion of biological functions. Following a similar rationale, the method cannot be applied for searching whole genomes in order to characterize orphan GPCRs. When the biological functions are not included in the prediction, the method's accuracy drops significantly, below 90%, and thus becomes comparable to earlier methods.
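
The sketch below illustrates, in the simplest possible way, how text-mined functional annotations can be turned into binary features for a classifier; the keyword list is invented for illustration and is not the vocabulary used by Ono and Hishigaki (2006).

# Minimal sketch: turning text-mined functional annotations into binary
# features for a coupling-specificity classifier.  The keyword list is a
# made-up illustration only.
HYPOTHETICAL_KEYWORDS = ["adenylate cyclase", "phospholipase c", "calcium",
                         "camp", "neurotransmitter", "chemokine"]

def keyword_features(annotation_text):
    """Return a 0/1 vector marking which keywords occur in the annotation."""
    text = annotation_text.lower()
    return [int(kw in text) for kw in HYPOTHETICAL_KEYWORDS]

print(keyword_features("Receptor stimulates adenylate cyclase and raises cAMP levels"))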


INSIGHTS FROM THE APPLICATION OF COMPUTATIONAL TOOLS

The utility of these methods was confirmed by their use in combination with experimental data to decipher the coupling specificity of important receptors. In a recent study that focused on the differential
activation of the adipokinetic hormone receptor (AKHR) of the cockroach Periplaneta americana,
the effects of two different types of hormones were studied (Wicher et al. 2006). In order to have a starting point for characterizing the pathways activated by this receptor, the authors used both GRIFFIN
(Yabuki et al. 2005) and PRED-COUPLE (Sgourakis et al. 2005) to obtain a prediction of the coupling
specificity of AKHRs. Both tools predicted coupling with Gs; however, our method also predicted a
further coupling specificity for Gq. The experiments carried out by Wicher and coworkers confirmed
coupling with Gq in addition to Gs, through the depletion of ion currents, as a result of Gq-mediated
inhibition of K+ channels. Furthermore, they were able to detect differences in the coupling specificity
of the receptor, based on the type of bound hormone: both hormones induce coupling to Gq with the
same efficiency, while AKH I has a higher potential to induce coupling to Gs. This example indicates the
high accuracy level of current prediction tools, since AKHR was not included in the training datasets
of either GRIFFIN or PRED-COUPLE.
Computational prediction tools have also been very useful in characterizing the function of light
receptors in photoreceptive retinal ganglion cells (pRGCs). These receptors belong to the Melanopsin
gene family Opn4 and comprise alternate light sensors that are responsible for the perception of environmental brightness (Peirson and Foster 2006). Their function is independent of the typical vitamin A-based photopigments, such as rhodopsin. Melanopsins are believed to resemble invertebrate photopigments; however, the signaling pathways in vertebrate cells are not known. Peirson and Foster applied the prediction methods developed by our group to obtain an estimate of the variability in
coupling specificity within the Melanopsin gene family (figure 2 in (Peirson and Foster 2006)). First,
they confirm the validity of our method through a variety of correct predictions. Furthermore, for the
melanopsins, this analysis predicts an extensive range of interactions with the G-protein families. Although not proven to be true in vivo, this is a striking difference of the Melanopsin receptors from the
visual pigment opsins that is supported by several lines of experimental evidence (Newman et al. 2003;
Melyan et al. 2005). The sequence variability in the third intracellular loop of the Melanopsin GPCRs
was proposed to be the basis of this variability.
In a recently published study (Muramatsu and Suwa 2006), Muramatsu and coworkers provided
insight into the structural basis for GPCR-G-protein coupling selectivity, focusing on class A receptors
from the GPCRDB 7.0 database (Horn et al. 2003), through the identification of key residues located
mainly in the intracellular loops of the receptors. To perform this task, they used the solved crystal
structure of rhodopsin (Palczewski et al. 2000) as a reference frame for the mapping of multiple sequence
alignments of selected loop sequences. A Hidden Markov Model-based alignment procedure was used,
and states in the model were assigned according to the position of transmembrane helices of rhodopsin.
Based on the statistics of the occurrence of different residues at various positions in the alignment, the
authors were able to infer a set of rules that correlate clusters of residues with the coupling specificity of GPCRs. Furthermore, they were able to identify mutation data from the literature in support of
their observations. This study confirms the importance of the membrane-proximal boundaries of the
transmembrane helices in establishing the coupling selectivity of GPCRs. Another interesting observation was that residues responsible for coupling to Gi/o and Gq/11 can also be found at the transmembrane and extracellular domains of GPCRs. Mutation studies on selected receptors corroborate this finding. Allosteric changes transmitted across the plasma membrane to the G-protein binding interface upon ligand binding could account for these observations. Also, this analysis suggests that a few residues
along the sequences of receptors that couple to Gq/11 are responsible for their selectivity, a fact that
perhaps explains the lower accuracy of some prediction methods for this class of receptors (Cao et al.
2003; Guo et al. 2006). However, these results are very sensitive to the choice of the alignment template
of rhodopsin, a structure that may not cover the entire spectrum of class-A receptors. In the absence
of a solved structure of the activated complex of GPCR with bound G-protein, more mutation data are
needed to confirm these findings.
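
The kind of position-specific residue statistics underlying such rule extraction can be sketched as follows, assuming toy alignments of an intracellular-loop region for two coupling groups; the threshold and sequences are illustrative only.

from collections import Counter

def column_frequencies(aligned_seqs, column):
    """Residue frequencies at one column of a (gap-containing) alignment."""
    residues = [s[column] for s in aligned_seqs if s[column] != "-"]
    counts = Counter(residues)
    total = sum(counts.values()) or 1
    return {aa: c / total for aa, c in counts.items()}

def enriched_residues(group_a, group_b, column, min_diff=0.3):
    """Residues over-represented in group_a relative to group_b at a column."""
    fa, fb = column_frequencies(group_a, column), column_frequencies(group_b, column)
    return {aa: fa[aa] - fb.get(aa, 0.0)
            for aa in fa if fa[aa] - fb.get(aa, 0.0) >= min_diff}

# Toy alignments of an intracellular-loop region for two coupling groups:
gq_like = ["KRAL-", "KRSL-", "KKALP"]
gi_like = ["DRYL-", "ERYLP", "DRFL-"]
print(enriched_residues(gq_like, gi_like, column=0))   # e.g. {'K': 1.0}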

FUTURE TRENDS / CONCLUSION

The field of GPCR computational biology has shown many advances in the post-genomic era; however,
the mechanism of selective binding to G-proteins is yet to be uncovered, mainly due to the lack of
high-resolution structural data. The structure of complexes of GPCRs with different G-proteins and in
different activation states would undoubtedly provide insight into the series of molecular events required
for G-protein activation, and explain the basis of the coupling selectivity of GPCRs, at the atomic level.
Prediction methods could provide important information in designing experiments, and facilitate the
structure generation process by suggesting the active interface between GPCRs and G-proteins. Immediate advances in the field of computational prediction of the coupling selectivity of GPCRs could
result from the application of consensus methods that utilize a variety of machine learning techniques,
a strategy that has been proven successful for a variety of problems in bioinformatics (Cuff et al. 1998;
Arai et al. 2004; Bagos et al. 2005; Tjalsma and van Dijl 2005; Hamodrakas et al. 2007). Finally, the
incorporation of high-throughput coupling data from biochemical experiments would iteratively improve
the efficiency of prediction algorithms and provide additional clues to the structural basis of GPCRs
coupling specificity to G-proteins.

REFERENCES
Alexander (2000). Receptor & ion channel nomenclature supplement. TiPS.
Alexander (2001). Nomenclature supplement. TiPS.
Arai, M., Mitsuke, H., Ikeda, M., Xia, J. X., Kikuchi, T., Satake, M., & Shimizu, T. (2004). ConPred
II: A consensus prediction method for obtaining transmembrane topology models with high reliability.
Nucleic Acids Res, 32(Web Server issue), W390-3.
Bagos, P. G., Liakopoulos, T. D., & Hamodrakas, S. J. (2005). Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC Bioinformatics, 6, 7.
Bailey, T. L. & Gribskov, M. (1998). Combining evidence using p-values: Application to sequence homology searches. Bioinformatics, 14(1), 48-54.


Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall,
M., Moxon, S., Sonnhammer, E. L., Studholme, D. J., Yeats, C., & Eddy, S. R. (2004). The Pfam protein
families database. Nucleic Acids Res, 32(Database issue), D138-41.
Becker, O. M., Shacham, S., Marantz, Y., & Noiman, S. (2003). Modeling the 3-D structure of GPCRs:
Advances and application to drug discovery. Curr Opin Drug Discov Devel, 6(3), 353-61.
Benjamin, D. R., Markby, D. W., Bourne, H. R., & Kuntz, I. D. (1995). Solution structure of the GTPase
activating domain of alpha s. J Mol Biol, 254(4), 681-91.
Cao, J., Panetta, R., Yue, S., Steyaert, A., Young-Bellido, M., & Ahmad, S. (2003). A naive Bayes model
to predict coupling between seven transmembrane domain receptors and G-proteins. Bioinformatics,
19(2), 234-40.
Chalmers, D. T. & Behan, D. P. (2002). The use of constitutively active GPCRs in drug discovery and
functional genomics. Nat Rev Drug Discov, 1(8), 599-608.
Cuff, J. A., Clamp, M. E., Siddiqui, A. S., Finlay, M., & Barton, G. J. (1998). JPred: A consensus secondary structure prediction server. Bioinformatics, 14(10), 892-3.
Dechering, K. J. (2005). The transcriptome's drugable frequenters. Drug Discov Today, 10(12), 857-64.
Downes, G. B., & Gautam, N. (1999). The G protein subunit gene families. Genomics, 62(3), 544-52.
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755-63.
Elefsinioti, A. L., Bagos, P. G., Spyropoulos, I. C., & Hamodrakas, S. J. (2004). A database for G proteins
and their interaction with GPCRs. BMC Bioinformatics, 5, 208.
Exton, J. H. (1993). Role of G proteins in activation of phosphoinositide phospholipase C. Adv Second
Messenger Phosphoprotein Res, 28, 65-72.
Gether, U. (2000). Uncovering molecular mechanisms involved in activation of G protein-coupled receptors. Endocr Rev, 21(1), 90-113.
Guo, Y., Li, M., Lu, M., Wen, Z., & Huang, Z. (2006). Predicting G-protein coupled receptors-G-protein
coupling specificity based on autocross-covariance transform. Proteins, 65(1), 55-60.
Guo, Y. Z., Li, M., Lu, M., Wen, Z., Wang, K., Li, G., & Wu, J. (2006). Classifying G protein-coupled
receptors and nuclear receptors on the basis of protein power spectrum from fast Fourier transform.
Amino Acids, 30(4), 397-402.
Hamodrakas, S. J., Liappa, C., & Iconomidou, V. A. (2007). Consensus prediction of amyloidogenic
determinants in amyloid fibril-forming proteins. Int J Biol Macromol, 41(3), 295-300.
Hellberg, S., Sjostrom, M., Skagerberg, B., & Wold, S. (1987). Peptide quantitative structure-activity
relationships, a multivariate approach. J Med Chem, 30(7), 1126-35.
Hermans, E. (2003). Biochemical and pharmacological control of the multiplicity of coupling at G-protein-coupled receptors. Pharmacol Ther, 99(1), 25-44.


Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen, F. E., & Vriend, G. (2003). GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res, 31(1), 294-7.
Horn, F., van der Wenden, E. M., Oliveira, L., AP, I. J. & Vriend, G. (2000). Receptors coupling to G
proteins: Is there a signal behind the sequence? Proteins, 41(4), 448-59.
Johnston, C. A. & Watts, V. J. (2003). Sensitization of adenylate cyclase: A general mechanism of neuroadaptation to persistent activation of Galpha(i/o)-coupled receptors? Life Sci, 73(23), 2913-25.
Jones, A. M. & Assmann, S. M. (2004). Plants: The latest model system for G-protein research. EMBO
Rep, 5(6), 572-8.
Kristiansen, K. (2004). Molecular mechanisms of ligand binding, signaling, and regulation within the
superfamily of G-protein-coupled receptors: Molecular modeling and mutagenesis approaches to receptor structure and function. Pharmacol Ther, 103(1), 21-80.
Krogh, A., Brown, M., Mian, I. S., Sjolander, K. & Haussler, D. (1994). Hidden Markov models in
computational biology. Applications to protein modeling. J Mol Biol, 235(5), 1501-31.
Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. (2001). Predicting transmembrane protein
topology with a hidden Markov model: Application to complete genomes. J Mol Biol, 305(3), 567-80.
Kurose, H. (2003). Galpha12 and Galpha13 as key regulatory mediator in signal transduction. Life Sci,
74(2-3), 155-61.
Laugwitz, K. L., Allgeier, A., Offermanns, S., Spicher, K., Van Sande, J., Dumont, J. E. & Schultz, G.
(1996). The human thyrotropin receptor: A heptahelical receptor capable of stimulating members of all
four G protein families. Proc Natl Acad Sci USA, 93(1), 116-20.
Lu, F., Li, J. & Jiang, Z. (2006). Computational identification and analysis of G protein-coupled receptor
targets. Drug Development Research, 67, 771-780.
Mann, S., Li, J., & Chen, Y. P. (2007). A pHMM-ANN based discriminative approach to promoter
identification in prokaryote genomic contexts. Nucleic Acids Res, 35(2), e12.
Melyan, Z., Tarttelin, E. E., Bellingham, J., Lucas, R. J., & Hankins, M. W. (2005). Addition of human
melanopsin renders mammalian cells photoresponsive. Nature, 433(7027), 741-5.
Mitchison, G. & Durbin, R. (1995). Tree-based maximal likelihood substitution matrices and hidden
Markov models. Journal of Molecular Evolution, 41, 1139-1151.
Möller, S., Vilo, J., & Croning, M. D. (2001). Prediction of the coupling specificity of G protein coupled
receptors to their G proteins. Bioinformatics, 17(Suppl 1), S174-81.
Muller, G. (2000). Towards 3D structures of G protein-coupled receptors: A multidisciplinary approach.
Curr Med Chem, 7(9), 861-88.
Muramatsu, T. & Suwa, M. (2006). Statistical analysis and prediction of functional residues effective
for GPCR-G-protein coupling selectivity. Protein Eng Des Sel, 19(6), 277-83.
Neer, E. J. (1995). Heterotrimeric G proteins: Organizers of transmembrane signals. Cell, 80(2), 249-57.


Neer, E. J. & Clapham, D. E. (1988). Roles of G protein subunits in transmembrane signalling. Nature,
333(6169), 129-34.
Newman, L. A., Walker, M. T., Brown, R. L., Cronin, T. W., & Robinson, P. R. (2003). Melanopsin
forms a functional short-wavelength photopigment. Biochemistry, 42(44), 12734-8.
Nikiforovich, G. V., Galaktionov, S., Balodis, J., & Marshall, G. R. (2001). Novel approach to computer
modeling of seven-helical transmembrane proteins: Current progress in the test case of bacteriorhodopsin. Acta Biochim Pol, 48(1), 53-64.
Oliveira, L., Paiva A. C., & Vriend, G. (1999). A low resolution model for the interaction of G proteins
with G protein-coupled receptors. Protein Eng, 12(12), 1087-95.
Ono, T., & Hishigaki, H. (2006). Prediction of GPCR-G protein coupling specificity using features of
sequences and biological functions. Genomics Proteomics Bioinformatics, 4(4), 238-244.
Palczewski, K., Kumasaka, T., Hori, T., Behnke, C. A., Motoshima, H., Fox, B. A., Le Trong, I., Teller, D. C., Okada, T., Stenkamp, R. E., Yamamoto, M., & Miyano, M. (2000). Crystal structure of rhodopsin: A G protein-coupled receptor. Science, 289(5480), 739-45.
Papasaikas, P. K., Bagos, P. G., Litou, Z. I., & Hamodrakas, S. J. (2003). A novel method for GPCR
recognition and family classification from sequence alone using signatures derived from profile hidden
Markov models. SAR QSAR Environ Res, 14(5-6), 413-20.
Papasaikas, P. K., Bagos, P. G., Litou, Z. I., Promponas, V. J., & Hamodrakas, S. J. (2004). PRED-GPCR:
GPCR recognition and family classification server. Nucleic Acids Res, 32(Web Server issue), W380-2.
Peirson, S. & Foster, R. G. (2006). Melanopsin: Another way of signaling light. Neuron, 49(3), 331-9.
Qian, B., Soyer, O. S., Neubig, R. R., & Goldstein, R. A. (2003). Depicting a proteins two faces: GPCR
classification by phylogenetic tree-based HMMs. FEBS Lett, 554(1-2), 95-9.
Rayan, A., Siew, N., Cherno-Schwartz, S., Matzner, Y., Bautsch, W. & Goldblum, A. (2000). A novel
computational method for predicting the transmembrane structure of G-protein coupled receptors: application to human C5aR and C3aR. Receptors Channels 7(2), 121-37.
Riobo, N. A., & Manning, D. R., (2005). Receptors coupled to heterotrimeric G proteins of the G12
family. Trends Pharmacol Sci, 26(3), 146-54.
Sgourakis, N. G., Bagos, P. G., & Hamodrakas, S. J. (2005). Prediction of the coupling specificity of
GPCRs to four families of G-proteins using hidden Markov models and artificial neural networks.
Bioinformatics, 21(22), 4101-6.
Sgourakis, N. G., Bagos, P. G., Papasaikas, P. K., & Hamodrakas, S. J. (2005). A method for the prediction of GPCRs coupling specificity to G-proteins using refined profile Hidden Markov Models. BMC
Bioinformatics, 6,104.
Shigeta, R., Cline, M., Liu, G., & Siani-Rose, M. A. (2003). GPCR-GRAPA-LIB--A refined library of
hidden Markov Models for annotating GPCRs. Bioinformatics, 19(5), 667-8.


Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A., & Durbin, R. (1998). Pfam, multiple sequence
alignments and HMM-profiles of protein domains. Nucleic Acids Res, 26(1), 320-2.
Sreekumar, K. R., Huang, Y., Pausch, M. H., & Gulukota, K. (2004). Predicting GPCR-G-protein coupling using hidden Markov models. Bioinformatics, 20(18), 3490-9.
Tjalsma, H. & van Dijl, J. M. (2005). Proteomics-based consensus prediction of protein retention in a
bacterial membrane. Proteomics, 5(17), 4472-82.
Viklund, H. & Elofsson, A. (2004). Best alpha-helical transmembrane protein topology predictions are
achieved using hidden Markov models and evolutionary information. Protein Sci, 13(7), 1908-17.
Wess, J. (1998). Molecular basis of receptor/G-protein-coupling selectivity. Pharmacol Ther, 80(3),
231-64.
Wicher, D., Agricola, H. J., Sohler, S., Gundel, M., Heinemann, S. H., Wollweber, L., Stengl, M., & Derst,
C. (2006). Differential receptor activation by cockroach adipokinetic hormones produces differential
effects on ion currents, neuronal activity, and locomotion. J Neurophysiol, 95(4), 2314-25.
Wold, S., Jonsson, J., Sjostrom, M., Sandberg, M., & Rannar, S. (1993). DNA and peptide sequences and
chemical processes multivariately modeled by principal component analysis and partial least-squares
projections to latent structures. Analytica Chimica Acta, 277(2), 239-253.
Wong, S. K. (2003). G protein selectivity is regulated by multiple intracellular regions of GPCRs. Neurosignals, 12(1), 1-12.
Yabuki, Y., Muramatsu, T., Hirokawa, T., Mukai, H., & Suwa, M. (2005). GRIFFIN, A system for
predicting GPCR-G-protein coupling selectivity using a support vector machine and a hidden Markov
model. Nucleic Acids Res, 33(Web Server issue), W148-53.
Yang, Z. R. (2004). Biological applications of support vector machines. Brief Bioinform, 5(4), 328-38.

Key Terms
Coupling Selectivity: G protein trimers are named after their α-subunits, which, on the basis of their amino acid similarity and, most importantly, their cellular function, are grouped into four families. These include Gs and Gi/o, which respectively stimulate and inhibit adenylate cyclase, Gq/11, which stimulates phospholipase C, and the less characterized G12/13 family, which activates the Na+/H+ exchanger pathway. The specificity of the interaction of a given GPCR with the pool of available intracellular G-proteins is termed coupling selectivity or specificity. The ability of certain GPCRs to interact with more than one type of G-protein (i.e., Gs and Gi/o) is known as promiscuous coupling selectivity. GPCRs coupled to members of the G12/13 family all exhibit promiscuous coupling preferences.
Genome Annotation: The functional characterization (by means of biochemical experiments or
computational prediction algorithms) of novel genes in newly sequenced and assembled genomes.


G-Protein Coupled Receptors (GPCRs): Also known as seven transmembrane (heptahelical) receptors, due to their characteristic membrane topology (seven transmembrane helices, extracellular N-terminus and intracellular C-terminus). They are transmembrane proteins acting as the sensory component of cellular signalling pathways. GPCRs are a key class of eukaryotic membrane receptors, and roughly 50% of all small molecule therapeutics target GPCRs. Vision, smell and part of taste perception rely on GPCRs. Ligands for GPCRs cover a wide range of organic chemical space, including proteins, peptides, sugars, amines and amino-acids, nucleotides, lipids and more. They transduce signals from the extracellular space into the cell through their interaction with G proteins, which act as switches forming hetero-trimers composed of different subunits (α, β, γ). Two GPCR crystal structures are currently available, the structure of rhodopsin and the recently solved three-dimensional structure of the beta-2 adrenergic receptor.
G-Proteins: The term is used to describe GTP-binding proteins. There are two classes of G-proteins, the small cytoplasmic G-proteins (Gh) and the hetero-trimeric G-proteins composed of different subunits (α, β, γ) that mediate the signal of heptahelical receptors (GPCRs). Agonist binding to GPCRs leads to association of the hetero-trimeric G protein with the receptor and GDP-GTP exchange in the G protein α subunit, followed by dissociation of the G protein into α-GTP and βγ complexes. The dissociated subunits can activate or inhibit several effectors such as adenylyl cyclase, PLC, tyrosine kinases, phosphodiesterases, phosphoinositide 3-kinase, GPCR kinases, ion channels, and molecules of the mitogen-activated protein kinase pathway, resulting in a variety of cellular functions. However, there is evidence that some GPCRs transduce their signal in a way that is not G protein-dependent, and also that hetero-trimeric G proteins are involved in mediating the action of single-spanning membrane receptors.
Hidden Markov Models (used herein): Probabilistic models widely used for describing features of a protein sequence. Hidden Markov Models introduce a regular grammar that characterizes a set of biological sequences. These are generative models, which renders them highly applicable to biological sequence analysis. In general, an HMM is composed of a set of states that form a first-order Markovian process, connected by means of transition probabilities. Each state has a unique probability distribution for generating (emitting) the symbols of a finite alphabet (nucleotides or amino acids). The most widely used variant of the Hidden Markov Model (HMM) is the profile HMM, which models in a probabilistic manner the matches, insertions and deletions occurring in every column of a multiple sequence alignment. However, other variations are also common (e.g., the circular HMM).
Orphan Receptors: GPCRs for which no information on their ligand or coupling specificity is
available. These are usually identified as a result of genome sequencing projects and large efforts are
undertaken to functionally characterize them.


Chapter X

Bacterial β-Barrel Outer Membrane Proteins:
A Common Structural Theme Implicated in a Wide Variety of Functional Roles
Pantelis G. Bagos
University of Central Greece, and University of Athens, Greece
Stavros J. Hamodrakas
University of Athens, Greece

abstract
β-barrel outer membrane proteins constitute the second and less well-studied class of transmembrane proteins. They are present exclusively in the outer membrane of Gram-negative bacteria and presumably in the outer membrane of mitochondria and chloroplasts. During the last few years, remarkable advances have been made towards an understanding of their functional and structural features. It is now well known that β-barrels perform a large variety of biologically important functions for the bacterial cell. Such functions include acting as specific or non-specific channels, receptors for various compounds, enzymes, translocation channels, structural proteins, and adhesion proteins. All these functional roles are of great importance for the survival of the bacterial cell under various environmental conditions or for the pathogenic properties expressed by these organisms. This chapter reviews the currently available literature regarding the structure and function of bacterial outer membrane proteins. We emphasize the functional diversity expressed by a common structural motif such as the β-barrel, and we provide evidence from the current literature for dozens of newly discovered families of transmembrane β-barrels.


INTRODUCTION
Integral membrane proteins are divided into two distinct structural classes, the α-helical membrane proteins and the β-barrel membrane proteins. The α-helical membrane protein class is the more abundant and well studied, since such proteins are located mostly in the cell membranes of both prokaryotic and eukaryotic organisms, performing a variety of biologically important functions. Proteins of this class have their membrane-spanning regions forming α-helices consisting mainly of hydrophobic residues (von Heijne 1999). These proteins have been studied extensively in a computational manner during the last few years and a variety of prediction algorithms have been proposed (Möller, Croning et al. 2001). Members of the latter class (β-barrel membrane proteins) are located in the outer membrane of Gram-negative bacteria, and presumably in the outer membrane of chloroplasts and mitochondria, a fact explained by the theory of endosymbiosis. The members of this class have their membrane-spanning segments formed by antiparallel amphipathic β-strands, creating a channel in the form of a barrel that spans the outer membrane (Schulz 2002).
A continuously increasing number of β-barrel proteins located in the bacterial outer membrane have been characterized, and a number of structures have been solved at atomic resolution (Schulz 2002). These proteins have been shown to perform a wide variety of functions such as active ion transport, passive nutrient uptake, membrane anchoring, adhesion, and catalytic activity. Considering the fact that a large number of pathogens are Gram-negative bacteria and the important biological functions in which outer membrane proteins are involved, it is not surprising that these proteins attract increased medical interest.
In the following sections we will first try to describe briefly the structural features observed so far in the β-barrel outer membrane proteins with known three-dimensional structure. Then, we will discuss the available computational methods used for the prediction of the transmembrane strands of β-barrel outer membrane proteins, as well as for the discrimination of such proteins from water-soluble and alpha-helical membrane proteins. Afterwards, we will discuss in detail the functional roles in which β-barrel outer membrane proteins are implicated. Emphasis will be given to newly characterized families of β-barrel outer membrane proteins that are involved in a series of functions crucial for the survival of the bacterial cell and in the implications for the pathogenicity of these organisms.

STRUCTURAL FEATURES OF β-BARRELS

The β-barrel is a protein fold occurring in soluble proteins as well as in transmembrane ones. A β-barrel may be considered as a β-sheet that twists and coils to form a closed barrel-shaped structure, which is stabilized by the hydrogen bonds formed by the sheet edges (first and last strands). The transmembrane β-barrels observed so far preferentially lay their axis along the membrane normal and are exclusively composed of meandering all-next-neighbor antiparallel β-strands, suggesting a repeating β-hairpin structural motif. It has been shown that any type of β-barrel can accurately be described solely by two parameters, namely the number of β-strands n and the shear number S. S is a measure of the stagger of the strands in the sheet. Theoretical analysis combined with available three-dimensional structures proved that these two parameters determine all other features of the β-barrel (Murzin, Lesk et al. 1994; Murzin, Lesk et al. 1994). Currently available high-resolution structures of transmembrane β-barrel proteins include β-barrels of varying features, with 8 ≤ n ≤ 22 and 8 ≤ S ≤ 24 (Table 1). Furthermore, it is worth mentioning that all transmembrane β-barrels observed so far consist of an even number of
strands.
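
As an illustration of how n and S determine the barrel geometry, the sketch below evaluates the commonly quoted relations for strand tilt and barrel radius of an idealised β-barrel, assuming the usual spacings of roughly 3.3 Å per residue along a strand and 4.4 Å between strands; the numbers should be read as rough estimates rather than measurements.

from math import atan, degrees, pi, sin, sqrt

# Commonly quoted geometric relations for an idealised beta-barrel
# (following Murzin, Lesk et al. 1994), under the usual assumptions of a
# Calpha spacing of a ~ 3.3 A along a strand and an inter-strand spacing
# of b ~ 4.4 A.
A_RISE, B_SPACING = 3.3, 4.4  # Angstrom

def barrel_geometry(n_strands, shear):
    """Return (strand tilt in degrees, barrel radius in Angstrom) for a
    barrel with n_strands strands and shear number `shear`."""
    sa, nb = shear * A_RISE, n_strands * B_SPACING
    tilt = degrees(atan(sa / nb))                       # tilt vs. barrel axis
    radius = sqrt(sa ** 2 + nb ** 2) / (2 * n_strands * sin(pi / n_strands))
    return tilt, radius

# Example: an OmpF-like porin with n = 16, S = 20.
print(barrel_geometry(16, 20))   # roughly (43 degrees, 15-16 A)
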
Remarkable advances have recently been made towards the understanding of bacterial β-barrel-forming transmembrane protein structure and function. Their functional roles and the biological processes they are involved in are diverse and may differ between organisms. X-ray crystallography has revealed a number of unique 3-dimensional protein structures, representative of large functionally related families. Long mobile loops resistant to proteolysis (OmpA) or rigid extensions of the barrel-forming β-strands (OmpX) in the extracellular space are known to provide molecular recognition sites. Porins of various families are known to mediate the passive transport of small molecules under different environmental conditions (OmpF, PhoE, OmpW, OmpG, OprP) or the active translocation of larger molecules (TonB-dependent receptors, FhuA, FepA, FecA, FptA, FpvA). Known examples of proteins promoting virulence through adhesion to host cells are the Neisserial OpcA and NspA. In the type V secretion pathway (autotransporters, NalP), a C-terminal β-barrel domain is necessary to form the pore in the outer membrane, in order to allow the translocation of the secreted mature protein, which is called the passenger domain. Other outer membrane proteins with known 3-dimensional structures constitute examples of active transporters of long-chain fatty acids (FadL) and specific receptors for nucleosides (Tsx), whereas specific porins mediate the uptake of various carbohydrates (maltose-Maltoporin, sucrose-Sucroporin). Furthermore, β-barrel transmembrane proteins have been reported to exhibit key enzymatic activities, either as extracellular proteases (OmpT), phospholipases (OmpLA) or enzymes implicated in the modification of lipid A (PagP, PagL). Several of these proteins have been shown to function as monomers, but there are known cases where oligomerisation is required for their proper function. In Table 1, we list some representative members of the available β-barrel outer membrane proteins whose structure has been determined at atomic resolution, along with a brief description of their structural and functional properties.
Among the available three-dimensional structures, some could be considered as unusual cases of transmembrane β-barrel-forming proteins. In these proteins, the transmembrane β-barrels are formed by more than one amino acid chain. In the type I secretion pathway (Sec-independent), outer membrane proteins (TolC, VceC) function in conjunction with cytoplasmic membrane transporters to promote the export of various solutes such as metals, drugs and secreted proteins across the two membranes of the Gram-negative bacterial cell envelope. A protein belonging to this class is Escherichia coli TolC (Koronakis, Sharff et al. 2000). TolC is a mixed β-barrel and α-helical protein, which spans both the outer membrane and the periplasmic space of Gram-negative bacteria. Three TolC protomers assemble to form a continuous, solvent-accessible conduit, a channel-tunnel over 140 Å long. Each monomer of the trimer contributes 4 β-strands to the 12-strand β-barrel. Other proteins belonging to this class are α-haemolysin from Staphylococcus aureus and other microbial toxins such as aerolysin and the anthrax protective antigen. Regarding the case of α-haemolysin, it has been shown (Song, Hobaugh et al. 1996) that it is active as a transmembrane heptamer, where the transmembrane domain is a 14-strand antiparallel β-barrel, in which two strands are contributed by each monomer. This toxin causes disease by forming pores in the membrane of infected cells, leading to cell lysis or to the destruction of small-molecule concentration gradients. Recently, the structure of a Mycobacterial (Gram-positive) outer membrane channel (MspA) has been determined at atomic resolution (Faller, Niederweis et al. 2004). The Mycobacterial outer membranes are the thickest biological membranes known to date, and present a decreased fluidity toward the periplasmic side of the membrane as opposed to the outer membrane of Gram-negative bacteria (Liu, Rosenberg et al. 1995). More recently, the structure of the Hia


Table 1. A list of representative outer membrane proteins with structures determined at atomic resolution. Note that, although some proteins seem to be related based on their structural features (e.g., OmpX and NspA), they exhibit no significant sequence similarity and thus are listed separately. Other families (general porins, TonB-dependent receptors) are over-represented in PDB, with members showing weak or modest sequence similarity (e.g., OmpF with Omp32, OmpC and OmpK36), and thus we list only one representative. Each protein in the table is a representative of a single PFAM family. Note, though, that PFAM-B codes (marked with an asterisk) are subject to change from version to version.

Protein name | Function | Number of β-strands | PDB code (Berman, Battistuz et al. 2002) | PFAM code (Finn, Mistry et al. 2006) | Organism
OmpA | Structural protein | 8 | 1QJP | PF01389 | Escherichia coli
OmpX | Adhesion | 8 | 1QJ8 | PF06316 | Escherichia coli
NspA | Adhesion | 8 | 1P4T | PF02462 | Neisseria meningitidis
PagP | Enzyme | 8 | 1MM4 | PF07017 | Escherichia coli
PagL | Enzyme | 8 | 2ERV | PB038312* | Pseudomonas aeruginosa
OmpW | General Porin | 8 | 2F1T | PF03922 | Escherichia coli
OmpT | Enzyme | 10 | 1I78 | PF01278 | Escherichia coli
OpcA | Adhesion | 10 | 1K24 | PF07239 | Neisseria meningitidis
OmpLA | Enzyme | 12 | 1QD5 | PF02253 | Escherichia coli
NalP | Autotransporter | 12 | 1UYN | PF03797 | Neisseria meningitidis
Tsx | Transporter | 12 | 1TLY | PF03502 | Escherichia coli
OmpG | General Porin | 14 | 2F1C | PB051875* | Escherichia coli
FadL | Transporter | 14 | 1T1L | PF03349 | Escherichia coli
OprP | General Porin | 16 | 2O4V | PF07396 | Pseudomonas aeruginosa
OmpF | General Porin | 16 | 2OMF | PF00267 | Escherichia coli
Porin | General Porin | 16 | 2POR | PB028487* | Rhodobacter capsulatus
Maltoporin | Specific Porin | 18 | 2MPR | PF02264 | Salmonella typhimurium
FepA | TonB-dependent Receptor | 22 | 1FEP | PF00593 | Escherichia coli

autotransporter of Haemophilus influenzae has been solved, revealing a β-barrel with 12 transmembrane β-strands, comprising four strands from each subunit (Meng, Surana et al. 2006). The central channel has a pore of 1.8 nm in diameter that is traversed by three N-terminal alpha-helices, one from each
subunit. This structure is considered to be representative of the Autotransporter-2 family.
The analysis of observed three-dimensional structures of β-barrel outer membrane proteins has provided us with a set of rules describing the structural features of this class of proteins. These are:

1. The transmembrane β-strands are mainly amphipathic, showing an alternation of hydrophobic and (mostly) polar residues. The hydrophobic residues interact with the hydrophobic lipid chains, whereas the polar residues face toward the barrel interior, and hence interact with the aqueous environment of the pore.
2. The aromatic residues have a greater tendency to be located in the interfaces with the polar heads of the lipids, forming the so-called aromatic belts around the perimeter of the barrel.
3. Both the N-terminus and the C-terminus of the proteins are located in the periplasmic space (inside with respect to the outer membrane). In some cases, the N- and C-terminal tails of the protein may be formed by stretches more than 100 residues long.
4. The segments connecting the transmembrane β-strands that are located in the periplasmic space (inside loops) are generally shorter than those of the extracellular space (outside loops). The periplasmic loops have a length no longer than twelve residues, whereas the extracellular loops may be significantly longer, occasionally with lengths exceeding thirty residues. This observation is possible due to the meander arrangement observed in currently available structures.
5. The length of the transmembrane strands varies according to the inclination of the strand with respect to the lipid bilayer, and ranges between six and twenty-two residues. However, in some cases only a small portion of the strand is embedded in the lipid bilayer, and the rest of it protrudes far away from the membrane, into the extracellular space, forming flexible hairpins.
6. β-barrel outer membrane proteins show great sequence variability in their amino acid sequences. This, in general, is larger than that of globular proteins, and it is even larger when referring to the extracellular loops, which are often used as antigenic epitopes.
7. Adjacent strands are connected by a network of hydrogen bonds, stabilizing the barrel.

Computational prediction and discrimination of transmembrane β-barrel proteins is in principle harder than the prediction of α-helical transmembrane segments. Despite the fact that transmembrane β-strands in available high-resolution structures are placed at relatively large angles with respect to the normal to the lipid bilayer, they are significantly shorter than transmembrane α-helices due to their extended conformation, their lengths being typically between six and twenty-two residues. A β-strand of between seven and nine residues might be sufficiently long to span the hydrophobic core of the membrane. Additionally, transmembrane β-strands face different environments (the hydrophobic exterior of the β-barrel opposed to the aqueous pore interior), often resulting in alternating hydrophobic-hydrophilic residues. This alternation is not always exact, since residues on the outer surface of the barrel (facing the apolar lipidic environment) tend to be hydrophobic, whereas residues pointing to the barrel interior are not always polar. Even though hydrophobicity peaks in a classical hydropathy plot are well correlated with the location of transmembrane β-strands (Zhai and Saier 2002), their average hydrophobicity is significantly lower than that of transmembrane α-helical segments. This fact is probably related to the underlying translocation mechanism, since otherwise outer membrane proteins might be trapped in the inner membrane during the translocation process. Additionally, oligomerisation of β-barrel domains inside the lipid bilayer weakens the necessity for a hydrophobic barrel exterior, since polar side-chains may provide favourable interactions in the interaction interface.
Summarising the above factors, the sequence signal to be detected is rather weak. Furthermore, common structural features with globular water-soluble proteins that contain a β-barrel in their three-dimensional structures might result in a large number of undesired false positives. Nevertheless, if the sequence of such a protein is carefully examined, several structural characteristics, for example the predominance of aromatic residues at the interfacial positions, might accurately reveal the location of transmembrane β-strands (for excellent reviews see Schulz 2002; Schulz 2003; Wimley 2003).
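To make the amphipathicity rule concrete, the following Python sketch scores a sliding window of a sequence by its average Kyte-Doolittle hydrophobicity and by the contrast between its two alternating faces; windows with strong alternation but only moderate overall hydrophobicity illustrate the kind of weak signal described above. This is a toy illustration only, not a reimplementation of any of the published methods discussed below; the scale, window length, threshold and example sequence are arbitrary choices.

# A minimal sketch (not any of the published predictors) of turning the amphipathicity
# rule into a per-residue signal: in each sliding window we measure the average
# hydrophobicity and the dyad repeat (the contrast between alternating positions).
# The Kyte-Doolittle scale, window length and threshold are illustrative choices only.

KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5, 'E': -3.5,
    'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8,
    'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def amphipathicity_profile(sequence, window=9):
    """Return (mean hydrophobicity, alternation score) for every window start position."""
    values = [KYTE_DOOLITTLE.get(res, 0.0) for res in sequence.upper()]
    profile = []
    for i in range(len(values) - window + 1):
        segment = values[i:i + window]
        mean_h = sum(segment) / window
        # Dyad repeat: one face of a transmembrane beta-strand (even positions) should
        # be clearly more hydrophobic than the other face (odd positions).
        even = segment[0::2]
        odd = segment[1::2]
        alternation = abs(sum(even) / len(even) - sum(odd) / len(odd))
        profile.append((mean_h, alternation))
    return profile

if __name__ == "__main__":
    # Hypothetical example sequence, not taken from a real outer membrane protein.
    seq = "MKVLGAYLSLALVTQAFAQETNGFYVGGKAGLAWLDNDYKTRLNPGLGFEAGYQF"
    for pos, (mean_h, alt) in enumerate(amphipathicity_profile(seq), start=1):
        if alt > 2.0:  # arbitrary illustrative threshold
            print(f"window at residue {pos}: mean hydrophobicity {mean_h:.2f}, alternation {alt:.2f}")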

COMPUTATIONAL PREDICTION METHODS

During the last few years, several prediction algorithms have been developed aiming specifically at outer membrane proteins. These algorithms all exploit the rules mentioned in the previous paragraphs; however, there is large variation in the algorithmic techniques used for this purpose. Furthermore, there are two major classes of prediction methods: methods aiming at predicting the location of the transmembrane β-strands, and methods aiming at discriminating β-barrel outer membrane proteins from other classes of proteins, such as globular and α-helical membrane proteins.
From a historical perspective, the prediction algorithms can be divided into three categories. The first consists of methods that used hydrophobicity plots and the alternation of hydrophobic and polar residues. Such methods, with various modifications, were proposed by Vogel and Jahnig (Vogel and Jahnig 1986), Schirmer and Cowan (Schirmer and Cowan 1993), Gromiha and Ponnuswamy (Gromiha and Ponnuswamy 1993) and Zhai and Saier (Zhai and Saier 2002). Another important class of methods consists of predictors that use statistical properties of the amino acids occurring in β-barrel outer membrane proteins. Such algorithms are the rule-based algorithm of Gromiha and coworkers (Gromiha, Majumdar et al. 1997), methods using the Gibbs sampler (Neuwald, Liu et al. 1995; Mannella, Neuwald et al. 1996), the window-based method of Wimley (Wimley 2002) and various methods using the amino-acid and dipeptide composition of the proteins (Liu, Zhu et al. 2003; Bagos, Liakopoulos et al. 2004; Gromiha, Ahmad et al. 2005; Gromiha and Suwa 2005). More advanced methods of this kind are the BOMP program, which uses a combination of regular expression patterns, the β-barrel score of Wimley, and Principal Component Analysis (Berven, Flikka et al. 2004), and the TMB-Hunt program (Garrow, Agnew et al. 2005), which uses evolutionary information and a K-NN classifier.
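The core idea behind the composition-based discriminators of this second category can be reduced to a few lines: represent each sequence by its amino-acid composition vector and classify it by comparison with compositions of sequences of known class. The Python sketch below shows such a nearest-neighbour comparison in the spirit of these methods; the training examples, distance measure and value of K are placeholders for illustration, and this is not the actual TMB-Hunt or any other published classifier.

# Toy composition-based discriminator: a K-nearest-neighbour vote on amino-acid
# composition vectors. The training examples and the value of K are placeholders;
# this illustrates the idea only and is not any published method.
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return the 20-dimensional amino-acid composition (fractions) of a sequence."""
    counts = Counter(sequence.upper())
    total = max(1, sum(counts[a] for a in AMINO_ACIDS))
    return [counts[a] / total for a in AMINO_ACIDS]

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def knn_predict(query_seq, training_set, k=3):
    """training_set: list of (sequence, label) pairs, e.g. ('...', 'barrel') or ('...', 'globular')."""
    q = composition(query_seq)
    neighbours = sorted(training_set, key=lambda item: euclidean(q, composition(item[0])))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Hypothetical usage with made-up training sequences:
# training = [("GYQFNNSRLAW", "barrel"), ("MKKLLLAAAVA", "globular")]
# print(knn_predict("QDWKAYGNDLR", training, k=1))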
However, various machine learning methods, such as Hidden Markov Models (HMMs) and Neural Networks (NNs), have been shown to achieve higher accuracy, especially for locating the transmembrane strands. The first application of a NN for predicting the location of transmembrane strands was performed by Diederichs and coworkers (Diederichs, Freigang et al. 1998), followed by the development of the B2TMPRED (Jacoboni, Martelli et al. 2001) and TMBETA-NET methods (Gromiha, Ahmad et al. 2004), whereas the TBBPred method (Natt, Kaur et al. 2004) uses a combination of NNs and Support Vector Machines. The highest scoring algorithms, however, have been shown to be the Hidden Markov Models. The first such method was HMM-B2TMR (Martelli, Fariselli et al. 2002), followed by the method of Liu and coworkers (Liu, Zhu et al. 2003), the PRED-TMBB method (Bagos, Liakopoulos et al. 2004; Bagos, Liakopoulos et al. 2004) and the ProfTMB method (Bigelow, Petrey et al. 2004; Bigelow and Rost 2006).
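To illustrate the mechanics behind the HMM-based predictors, the sketch below decodes a sequence with a toy two-state model (transmembrane strand versus loop) using the Viterbi algorithm, with emissions reduced to a crude hydrophobic/polar alphabet. All probabilities are invented for illustration; published models such as HMM-B2TMR or PRED-TMBB are far richer (many states describing strands, loops and turns, and sequence-profile-based emissions), so this only demonstrates how a most probable state path yields a per-residue labelling.

# Toy two-state HMM (S = strand, L = loop) decoded with the Viterbi algorithm.
# All parameters below are invented for illustration; this is not a published predictor.
import math

STATES = ("S", "L")
START = {"S": 0.1, "L": 0.9}
TRANS = {"S": {"S": 0.85, "L": 0.15}, "L": {"S": 0.1, "L": 0.9}}
# Emissions over a reduced alphabet: H = hydrophobic residue, P = polar residue.
EMIT = {"S": {"H": 0.6, "P": 0.4}, "L": {"H": 0.3, "P": 0.7}}

HYDROPHOBIC = set("AVILMFWYC")

def to_hp(sequence):
    """Reduce a protein sequence to the H/P alphabet used by the toy model."""
    return ["H" if res in HYDROPHOBIC else "P" for res in sequence.upper()]

def viterbi(observations):
    """Return the most probable state path (a string over 'S'/'L') for H/P observations."""
    v = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES}]
    back = [{}]
    for t in range(1, len(observations)):
        v.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda p: v[t - 1][p] + math.log(TRANS[p][s]))
            v[t][s] = v[t - 1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][observations[t]])
            back[t][s] = prev
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return "".join(reversed(path))

# Hypothetical usage:
# print(viterbi(to_hp("MKVLGAYLSLALVTQAFAQET")))  # e.g. a labelling such as 'LLSSSSS...LL'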
Finally, we have to mention the consensus algorithms that combine the results of various individual predictors. Such an algorithm is ConBBPRED (Bagos, Liakopoulos et al. 2005), which specifically aims to locate the transmembrane strands, whereas the TMB-Hunt2 (Garrow and Westhead 2007) and PSORT-B (Gardy, Spencer et al. 2003) algorithms are oriented towards better accuracy in the discrimination of β-barrels. However, not all algorithms cited here are available to the scientific community. In Table 2, we list the available prediction servers along with a short description.
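The consensus idea itself can be sketched very simply: collect per-residue labels from several individual predictors and keep, at each position, the label supported by the majority. The toy Python sketch below does exactly that; real consensus methods such as ConBBPRED additionally resolve disagreements more carefully and enforce a consistent barrel topology, which is not attempted here.

# Minimal majority-vote consensus over per-residue predictions from several methods.
# Each prediction is a string over {'S', 'L'} of equal length; the weighting and
# topology filtering used by real consensus methods are deliberately omitted.
from collections import Counter

def consensus(predictions):
    """predictions: list of equally long label strings, e.g. ['LLSSSSLLL', 'LSSSSSLLL']."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)

# Hypothetical usage with three made-up predictions for the same sequence:
# print(consensus(["LLSSSSLLL", "LSSSSSLLL", "LLSSSLLLL"]))  # -> 'LLSSSSLLL'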

Table 2. The currently available methods for predicting the transmembrane strands of β-barrel outer membrane proteins, as well as for discriminating them from globular ones. For an explanation of the methods see the text and a recent evaluation (Bagos, Liakopoulos et al. 2005).

B2TMPRED (Jacoboni, Martelli et al. 2001): http://gpcr.biocomp.unibo.it/cgi/predictors/outer/pred_outercgi.cgi
BOMP (Berven, Flikka et al. 2004): http://www.bioinfo.no/tools/bomp
ConBBPRED (Bagos, Liakopoulos et al. 2005): http://bioinformatics.biol.uoa.gr/ConBBPRED/
HMM-B2TMR (Martelli, Fariselli et al. 2002): http://gpcr.biocomp.unibo.it/biodec/
MCMBB (Bagos, Liakopoulos et al. 2004): http://bioinformatics.biol.uoa.gr/mcmbb
OM_Topo_predict (Diederichs, Freigang et al. 1998): http://strucbio.biologie.uni-konstanz.de/~kay/om_topo_predict2.html
PRED-TMBB (Bagos, Liakopoulos et al. 2004): http://bioinformatics.biol.uoa.gr/PRED-TMBB/
ProfTMB (Bigelow, Petrey et al. 2004): http://cubic.bioc.columbia.edu/services/proftmb/
PSORT-B (Gardy, Spencer et al. 2003): http://www.psort.org
TMB-Hunt (Garrow, Agnew et al. 2005): http://www.bioinformatics.leeds.ac.uk/betaBarrel/
TMB-Hunt2 (Garrow and Westhead 2007): http://www.bioinformatics.leeds.ac.uk/TMBWeb/TMB-Hunt2
TBBpred (Natt, Kaur et al. 2004): http://www.imtech.res.in/raghava/tbbpred/
TMBETA-NET (Gromiha, Ahmad et al. 2004): http://psfs.cbrc.jp/tmbeta-net/

FUNCTIONAL DIVERSITY OF β-BARRELS

Besides the relatively few observed three-dimensional structures of β-barrel membrane proteins, there are additionally plentiful examples of proteins, representatives of large families, whose structure has not been determined yet but for which there is enough experimental evidence to suggest their localization to the outer membrane, forming β-barrels. The experimental techniques used range from subcellular fractionation to determine the localization to the outer membrane, liposome swelling assays to determine channel properties, circular dichroism to determine the secondary structure content, and low-resolution electron microscopy and antibody experiments, to deletion constructs and protease protection assays to locate surface-exposed loops. During the last few years dozens of such families have been characterized experimentally, even though the protein databases have not yet incorporated the available information concerning their functional annotation.
In the following sections we describe the currently available information reported in the literature concerning the functional diversity of β-barrel transmembrane proteins. We classify the functional roles of such proteins as general porins, specific porins, structural proteins, translocation channels, and various uncharacterized or unclassified proteins. Although the classification is based on functional properties, the sequence and structure characteristics of such proteins are also diverse even within each class.


General Porins
A large portion of such proteins constitutes various non-specific porins that have, however, no detectable sequence similarity to the already well-known families. In Chlamydia, MomP constitutes the largest portion of the OM mass, and it is speculated that this protein possesses 16 transmembrane strands (Hughes, Shaw et al. 2001; Rodriguez-Maranon, Bush et al. 2002; Findlay, McClafferty et al. 2005; Yen, Pal et al. 2005), whereas PorB (Kubo and Stephens 2000; Kawa and Stephens 2002; Kawa, Schachter et al. 2004) also shows pore-forming activity, although with different properties compared to MomP. In Campylobacter, MomP, besides acting as a porin, is involved in the structural organization of the outer membrane and acts as an adhesin (Moser, Schroeder et al. 1997; De, Jullien et al. 2000; Zhang, Meitzler et al. 2000; Bolla, Saint et al. 2004), whereas Omp50 shows a pore-forming activity with cation-selective channel properties (Bolla, De et al. 2000; Dedieu, Pages et al. 2004). In Borrelia, Omp66 and its homologues are large (66-kDa) outer membrane proteins that exhibit porin activity. The average single-channel conductance predicts a rather large pore diameter of 2.6 nm, and prediction methods suggest a large number (perhaps more than 22) of transmembrane strands (Skare, Mirzabekov et al. 1997; Bunikis, Luke et al. 1998; Exner, Wu et al. 2000).
In Fusobacterium, FomA is a trimeric porin that exhibits channel properties similar to those of the other general porins. Results obtained from limited proteolysis of purified FomA protein indicate that the N-terminal part of the FomA protein is not an integral part of the β-barrel, but instead forms a periplasmic domain (Kleivdal, Puntervoll et al. 2001; Puntervoll, Ruud et al. 2002). In Leptospira, OmpL1 and its homologues are heat-modifiable porins that form trimeric structures. OmpL1 has been reconstituted in planar lipid bilayers, showing an average single-channel conductance similar to that of the major porin activity of native leptospiral membranes. OmpL1 is expressed during infection and thus has a role in the induction and persistence of leptospiral interstitial nephritis. Sequence analysis suggests that the protein has 10-12 transmembrane strands (Shang, Exner et al. 1995; Barnett, Barnett et al. 1999; Haake, Mazel et al. 1999). The 37-kDa outer membrane porin OmpH of the deep-sea marine bacterium Photobacterium profundum strain SS9 is synthesized in response to elevated hydrostatic pressure and seems to be responsive to changes in the pressure regime of the deep sea. The results from reconstitution of OmpH in liposome bilayers, as well as mutational studies, are consistent with the hypothesis that OmpH functions as a relatively large, nonspecific diffusion channel. Prediction methods suggest that OmpH possesses a 16-stranded transmembrane β-barrel, similar to other general porins (Bartlett, Chi et al. 1993; Bartlett and Chi 1994; Macdonald, Martinac et al. 2003).

Specific Porins
Another important class of outer membrane proteins includes the specific channels: either those that are responsible for resistance to various antibiotics, or those needed for the intake of metabolites. In the first case we encounter the CarO OMP, which is responsible for carbapenem and imipenem resistance in the Moraxellaceae family of gamma-proteobacteria (Mussi, Limansky et al. 2005), and the distantly related family of 33-36 kDa OMPs, members of which are implicated in the resistance against imipenem (Clark 1996; Limansky, Mussi et al. 2002; Siroy, Molle et al. 2005; Siroy, Cosette et al. 2006). Other examples are members of the Pseudomonas OprH family, small outer membrane proteins (21 kDa) that, when overexpressed under Mg2+ starvation conditions, cause resistance to polymyxin B, gentamicin, and EDTA. There is experimental evidence (circular dichroism, PCR-based site-directed deletion and epitope insertion mutagenesis) suggesting a β-barrel structure consisting of 8 transmembrane strands (Bell, Bains et al. 1991; Rehm and Hancock 1996). We also have to mention the bacterial copper resistance proteins. Copper is essential, as it serves as a cofactor for a variety of enzymes; however, excess copper is toxic and leads to radical formation and oxidation of biomolecules. CopB serves to extrude copper when it approaches toxic levels. The CopB protein is located in the outer membrane and seems to form a β-barrel with 10-12 transmembrane strands. The N-terminal domain (~100 residues) is predicted to be periplasmic, suggesting a structural resemblance to TonB-dependent receptors, although no sequence homology is apparent (Cha and Cooksey 1991; Lim and Cooksey 1993; Bissig, Voegelin et al. 2001).
In Pseudomonas, OprD facilitates the diffusion of basic amino acids as well as of structurally analogous β-lactam antibiotics such as imipenem. Mutational inactivation of the OprD gene is associated with carbapenem resistance in Pseudomonas aeruginosa, whereas the C-terminal portion of OprD, and in particular the hypothetical loop L7, is responsible for an unusual meropenem hyper-susceptibility. Various members of the family have shown different specificities in the uptake of metabolites, including glycine-glutamate, histidine, proline, tyrosine, cis-aconitate and pyroglutamate (Yoshihara, Yoneyama et al. 1998; Ochs, Lu et al. 1999; Epp, Kohler et al. 2001; Hancock and Brinkman 2002; Pirnay, De Vos et al. 2002; Tamber, Ochs et al. 2006). The Pseudomonas aeruginosa porin B (OprB) is another substrate-selective channel for a variety of different sugars (Wylie and Worobec 1994). This protein may facilitate diffusion of a variety of diverse compounds, but is probably restricted to carbohydrates and facilitates glucose diffusion across the outer membrane (Wylie and Worobec 1995). The high-affinity glucose transport system is primarily specific for glucose and well conserved, although the outer membrane components may differ in channel architecture and specificity for other carbohydrates (Adewoye, Tschetter et al. 1998; Adewoye and Worobec 1999). Similarly to other porins, these proteins are predicted to have 16 transmembrane strands. Homologues are also found in Pseudomonas chlororaphis, Burkholderia cepacia, Pseudomonas fluorescens, Acinetobacter calcoaceticus and Xanthomonas campestris.
KdgM is an oligogalacturonate-specific porin found in Erwinia chrysanthemi (Blot, Berrier et al. 2002). This phytopathogenic Gram-negative bacterium secretes pectinases, which are able to degrade the pectic polymers of plant cell walls, and uses the degradation products as a carbon source for growth. KdgM is a major outer membrane protein whose synthesis is induced in the presence of pectic derivatives. KdgM behaves like a voltage-dependent porin that is slightly selective for anions and that exhibits fast block in the presence of trigalacturonate (Condemine, Berrier et al. 2005). KdgM seems to be monomeric, and topological models suggest that it possesses a 14-stranded β-barrel, with six rather short extracellular loops and a larger one that restricts the size of the pore (Pellinen, Ahlfors et al. 2003).
CymA of Klebsiella oxytoca is another specific porin, essential for growth on cyclodextrins. However, it can also complement the deficiency of a LamB mutant of Escherichia coli for growth on linear maltodextrins, indicating that both cyclic and linear oligosaccharides are accepted as substrates. CymA-induced membrane conductance decreased considerably upon addition of α-cyclodextrin, whereas the affinity was lower for β-cyclodextrin and even lower for γ-cyclodextrin. Unlike most bacterial porins, CymA does not form trimeric complexes in lipid membranes and shows no tendency to trimerize in solution. However, it seems to form homotetramers with a central pore, and therefore lacks the typical trimeric structure of most porins (Pajatsch, Andersen et al. 1999; Orlik, Andersen et al. 2003).
In E. coli, RafY is part of an operon that also includes a permease and the enzymes required for raffinose utilization. RafY forms an ion-permeable channel with a single-channel conductance that is approximately twice that of the general diffusion pores OmpF and OmpC. Since RafY is able to accommodate the diffusion of other oligosaccharides, it seems to be a general diffusion pore with a diameter larger than that of the general diffusion porins, allowing the diffusion of high-molecular-mass carbohydrates through the outer membrane (Ulmke, Lengeler et al. 1997; Andersen, Krones et al. 1998). Lastly, we have to mention FmdC, which is a porin involved in the transport of short-chain amides and urea through the outer membrane of Methylophilus methylotrophus under conditions where these nitrogen sources are present at very low concentration. Its synthesis is inducible by short-chain amides and urea, and the protein is thought to transport these molecules across the outer membrane (Mills, Wyborn et al. 1997).

Structural Proteins
There are also several structural proteins contributing to the stability of the outer membrane through the interaction with peptidoglycan, though having no sequence similarity to the OmpA family of proteins in their transmembrane domain (Porphyromonas and Acinetobacter MomPs, Pseudomonas OprF). MomP of Porphyromonas gingivalis resembles the members of the OmpA family in its C-terminal domain, which protrudes into the periplasmic space. The N-terminal membrane-anchoring domain, though, does not show any significant sequence similarity. Two-dimensional diagonal electrophoresis and chemical cross-linking experiments, with or without a reducing agent, clearly showed that this protein mainly forms stable heterotrimers via intermolecular disulfide bonds. It has been suggested that these proteins play an important role in outer membrane integrity and, similarly to members of the OmpA family, are likely to function as stabilizers of the cell wall rather than as major porins in this organism (Ross, Czajkowski et al. 2004; Nagano, Read et al. 2005).
In Acinetobacter radioresistens KA53, the OmpA (MomP) protein was shown to account for most of the emulsifying activity. The transmembrane β-barrel is likely to possess 8 membrane-spanning β-strands, whereas the C-terminal domain probably acts in interactions with the peptidoglycan layer (Ofori-Darko, Zavros et al. 2000; Toren, Orr et al. 2002; Gribun, Nitzan et al. 2003; Pessione, Giuffrida et al. 2003; Vashist and Rajeswari 2006; Akimana and Lafontaine 2007).
A third, structurally similar, family of proteins includes homologues of the OprF protein of P. aeruginosa. It has been shown that members of this family (similarly to members of the OmpA family) show a modular architecture composed of two distinct structural domains: an N-terminal β-barrel domain formed by 8 β-strands, with short turns at the periplasmic ends and long flexible loops at the external ends, that anchors the protein to the outer membrane, and a C-terminal domain that protrudes into the periplasmic space interacting with peptidoglycan. There is also evidence for a pore-forming activity of the β-barrel domain, possibly larger than that of the OmpA of E. coli (Brinkman, Bains et al. 2000; El Hamel, Freulet et al. 2000; Saint, El Hamel et al. 2000).
In Geobacter sulfurreducens, OmpJ is the most abundant protein isolated from the outer membrane. Deletion of the ompJ gene results in a strain that grew with fumarate but could not grow with metals, such as soluble or insoluble Fe(III) and insoluble Mn(IV) oxide. The presumed effect of OmpJ on extracellular electron transfer is indirect, as OmpJ is required to keep the integrity of the periplasmic space necessary for proper folding and functioning of periplasmic and outer membrane electron transport components (Afkar, Reguera et al. 2005). Prediction methods indicate that OmpJ possesses 20 transmembrane β-strands, and homologues are found only in members of the Geobacteraceae family.
In cyanobacteria, the tandemly lying genes somA and somB, initially identified in Synechococcus PCC 6301, encode two functionally characterized outer membrane porins that are predicted to form a probable 16-stranded β-barrel. Homologues that have also been identified in other cyanobacteria share an N-terminal motif with similarity to S-layer homology (SLH) domains, which probably forms periplasmic extensions connecting the outer membrane to the peptidoglycan layer. The C-terminal part forms a β-barrel and has been shown to form pores in lipid bilayers in single-channel conductance measurements (Umeda, Aiba et al. 1996; Hansel, Pattus et al. 1998; Hansel and Tadros 1998).

Secretion and Protein Translocation Channels


Secretins form large pores in the outer membrane, participating in protein secretion during the type II and type III terminal branches of the General Secretion Pathway (GSP) of Gram-negative bacteria (Koster, Bitter et al. 2000; Thanassi 2002). The type II secretion pathway is dependent on the Sec system, since the secreted proteins must carry a signal peptide sufficient for translocation through the inner membrane, and is responsible for the secretion of toxins and exoenzymes. The type III secretion pathway is Sec-independent and allows the translocation of effector proteins from bacteria to the eukaryotic target cells. Members of the family include PilQ of N. meningitidis, PulD of K. oxytoca, GspD of E. coli, the pIV protein playing a role in the assembly of the filamentous bacteriophage, and other proteins of Gram-negative bacteria. Electron microscopy suggests that secretins form large channels in the outer membrane with an internal diameter of approximately 7 nm. The structure of PilQ has been determined to 12 Å resolution, suggesting a 12-meric structure (Collins, Frye et al. 2004). It is believed that the C-terminal part of secretins forms the transmembrane β-barrel domain; however, in the absence of a high-resolution three-dimensional model the true folding state of the channel remains ambiguous. There is also evidence that secretins play an active role in the assembly of the pili, besides acting solely as channels.
Ushers are another family of integral outer membrane proteins, participating in the chaperone/usher secretion pathway, a terminal branch of the General Secretion Pathway (GSP) dedicated to the biogenesis of adhesive surface structures associated with pathogenesis (Thanassi 2002). As the name of the pathway implies, the ushers work in conjunction with a periplasmic chaperone in order to assemble and secrete more than 30 different surface molecules in a broad range of Gram-negative bacteria. The most studied example of the chaperone/usher pathway is the assembly and biogenesis of the type 1 and P pili expressed by uropathogenic E. coli. The members of the usher family (PapC, FasD, FaeD, FimD, etc.) display properties of β-barrel proteins and are predicted to form channels consisting of 24-32 β-strands. Recently, it has been proposed that the ushers form twin dimeric pores in the outer membrane, with the C-terminal parts of the sequences being responsible for the dimerisation (Henderson, So et al. 2004). In addition, the N-terminal 100-120 amino acids are believed to be involved in the recognition of the periplasmic chaperone (Ng, Akman et al. 2004).
The Neisserial Omp85 and its homologues in other species constitute a family of proteins involved in protein translocation through the outer membrane. Transmembrane β-barrels are synthesized in the cytoplasm and transferred through the inner membrane using the Sec system. In the periplasmic space, chaperones such as SurA and Skp bind the OMP and target it to the outer membrane (Bos and Tommassen 2004). It is believed that the highly conserved Omp85 acts in vivo in a way that facilitates the correct folding and the insertion of other β-barrels into the outer membrane (Gentle, Gabriel et al. 2004; Voulhoux and Tommassen 2004). Homologues of Omp85 (which is sometimes referred to as Bacterial Surface Antigen D15) are also found in mitochondria (Sam50/Tob55) and chloroplasts (OEP75) (Paschen, Waizenegger et al. 2003). The transmembrane domain (probably comprising 14 β-strands) is located in the C-terminus, whereas in the N-terminus Omp85 possesses 5 conserved POTRA (Polypeptide Transport) domains. These domains are believed to interact with the transferred β-barrel, and the final insertion into the OM is performed by lateral diffusion (Moslavac, Mirus et al. 2005).
Another distantly related family of proteins comprises the transporters of the two-partner secretion system. The two-partner secretion (TPS) system (which, along with the autotransporters, comprises the type V secretion pathway in bacteria) is composed of two separate proteins, with TpsA being the secreted protein and TpsB its specific transporter (Newman and Stathopoulos 2004). The secreted proteins are exported in a Sec-dependent manner across the inner membrane, after which they cross the outer membrane with the help of their cognate transporters. Translocation appears to be folding-sensitive, indicating that TpsA proteins cross the periplasm and the outer membrane in non-native conformations and fold progressively at the cell surface. A major difference of the TPS pathway compared to the autotransporter (AT) pathway arises from the manner by which specificity is established between the secreted protein and its transporter. The TPS pathway has solved the question of specific recognition between the TpsA proteins and their transporters by the addition to the TpsA proteins of an N-proximal module, the conserved TPS domain, which represents a hallmark of the TPS pathway. The exoproteins of the TPS system have been reported to be adhesins, haem-binding proteins, antigenic factors and haemolysins/cytolysins (Jacob-Dubuisson, Fernandez et al. 2004). The structure of the TPS domain has been solved and is seen to fold as a β-helix, but currently there is no available structure for the transmembrane domain of the TpsB proteins (Clantin, Hodak et al. 2004). The latter is predicted to have approximately 16-18 transmembrane strands.
A third, also distantly related, family is the Haemophilus influenzae (hmw1B) outer membrane translocator family. The members of this family are distantly related to the proteins of the Omp85 family as well as to the TpsB family. There is evidence suggesting that these proteins are involved in the translocation of proteins through the outer membrane. The C-terminal part of the sequence is speculated, based on experimental evidence, to possess a β-barrel structure that forms a pore, whereas the N-terminal part forms a periplasmic chaperone which in most cases consists of a single POTRA domain (Surana, Grass et al. 2004; Surana, Buscher et al. 2006). The first characterized member of the family was found in Haemophilus influenzae, but homologues are found in other Proteobacteria such as Escherichia, Yersinia, Pseudomonas, Caulobacter and Ralstonia. In Porphyromonas gingivalis, the PorT outer membrane protein is implicated in the leaderless secretion of various virulence factors (gingipains/adhesins). PorT seems to be membrane-associated and exposed to the periplasmic space, as revealed by subcellular fractionation and immunoblot analysis using anti-PorT antiserum (Sato, Sakai et al. 2005). Prediction methods suggest that these proteins possess 8-10 transmembrane β-strands.
In E. coli, Wzi was found to be involved in the surface assembly of the Escherichia coli K30 group 1 capsule, participating in the translocation of group 1 capsular polysaccharide in co-operation with the Wza complex (Nesper, Hill et al. 2003). Wzi is an outer membrane monomeric β-barrel protein (Rahn, Beis et al. 2003), predicted to possess 16 transmembrane strands. Homologues are also found in Klebsiella pneumoniae, Acinetobacter, Psychrobacter, Shewanella oneidensis, Rhodobacter sphaeroides, Idiomarina loihiensis and Microbulbifer degradans.
In Gram-negative bacteria, the components of the outer membrane are synthesized in the cytoplasm or the inner membrane and must thus traverse the inner membrane and the periplasm on the way to their final destination (Bos and Tommassen 2004). LPS (lipopolysaccharide) is an essential component of the bacterial outer membrane and consists of a hydrophobic membrane anchor, lipid A, substituted with an oligosaccharide core region that can be extended in some bacteria by a repeating oligosaccharide, the O-antigen. An OMP is required for the appearance of LPS at the bacterial cell surface (Braun and Silhavy 2002). This protein is known as Imp (increased membrane permeability) or OstA (organic solvent tolerance) because E. coli strains expressing mutant versions of this protein showed altered membrane permeability (Bos, Tefsen et al. 2004). It probably forms a β-barrel and is predicted to have 22 transmembrane strands. Lastly, we have to mention AlgE from P. aeruginosa and AlgJ from Azotobacter vinelandii, which are believed to export alginate (an acidic polysaccharide) across the outer membrane of these Gram-negative bacteria (Rehm, Boheim et al. 1994; Rehm 1996). These proteins are predicted to have 18 transmembrane β-strands spanning the outer membrane, thus forming a β-barrel; however, their channel properties are different from those of the general porins.

Adhesion Outer Membrane Proteins


There is also a large and diverse collection of newly discovered examples of various types of adhesion proteins implicated in pathogenesis. These outer membrane adhesins, however, do not show any significant similarity to the already known examples of OmpX, NspA and OpcA, whose three-dimensional structures are known. One such example is the Moraxella OmpJ (Hays, van Selm et al. 2005), a protein that has been shown to be implicated in bacterial clearance from the lungs. The members of this family possess a distant similarity to members of the OmpA family, and predictions indicate that their membrane-spanning β-barrel is formed by 8 transmembrane strands. Intimins are adhesins found mainly in enterohemorrhagic E. coli (EHEC) and enteropathogenic E. coli (EPEC), whereas invasins are adhesins found in Yersinia pseudotuberculosis (Niemann, Schubert et al. 2004). Intimins and invasins share homology in their N-terminal (about 500 amino acids) segment, which probably forms a β-barrel anchoring the protein to the outer membrane (Touze, Hayward et al. 2004). At the C-terminus, these proteins have several Ig-like domains and a C-type lectin domain. Invasin binds to integrin, whereas intimin binds to the translocated intimin receptor (Tir) (Luo, Frey et al. 2000).
The intracellular bacterial surface antigen family includes a number of antigens expressed on the surface of intracellular endosymbiotic pathogens belonging to various species of the Rickettsiales, such as Anaplasma, Wolbachia and Ehrlichia. It is a diverse family, believed to consist of adhesins having no more than 8 transmembrane strands. The WSP protein of Wolbachia (Braig, Zhou et al. 1998; Baldo, Lo et al. 2005), the P28 antigen of Ehrlichia (Zhang, Guo et al. 2004) and the P44 antigen of Anaplasma (Oberle and Barbet 1993; Huang, Wang et al. 2007) are well-studied examples of such proteins.
The Treponema major surface protein is a rather large (474 amino acids) protein acting as a surface antigen in the outer sheath of Treponema denticola. It is mainly an adhesin, but it has additionally been shown to exhibit channel activity. The channel is estimated to have a pore diameter of 3.4 nm, and prediction methods suggest that the protein possesses a transmembrane domain composed of 22 β-strands. Homologues are also found in other pathogenic species of Treponema. However, the β-barrel domain may be considerably shorter, since a large portion of the N-terminal region is shown to be surface-exposed and associated with antigenicity (Egli, Leung et al. 1993; Park, Heuner et al. 2002; Batista da Silva, Lee et al. 2004; Edwards, Jenkinson et al. 2005).
The Legionella MomP and its homologues also serve as adhesive molecules for host cells, suggesting that these proteins play a major role in the virulence of this particular bacterium (Hoffman, Seyer et al. 1992; High, Torosian et al. 1993; Krinos, High et al. 1999). Finally, in Helicobacter pylori, the causative agent of gastritis and peptic ulceration in humans, several outer membrane proteins have been identified that are not present in any other Gram-negative bacteria. These OMPs of H. pylori are characterized mainly as adhesins and porins, and extensive C-terminal sequence similarity between these proteins has been used to define two different families (Doig, Exner et al. 1995; Exner, Doig et al. 1995; Tomb, White et al. 1997; Peck, Ortkamp et al. 2001; Hofreuter, Karnholz et al. 2003).

Uncharacterized Outer Membrane Proteins


In this last section we briefly list some examples of probable β-barrel outer membrane proteins that are not yet functionally characterized, but for which there is convincing experimental evidence suggesting that they are truly outer membrane proteins.
In Acetobacter xylinus, BcsC is a member of a cellulose synthase operon that also comprises the bcsA, bcsB and bcsD genes (Wong, Fear et al. 1990). Mutants in the bcsC and bcsD genes were impaired in cellulose production in vivo, even though they had the capacity to make all the necessary metabolic precursors and cyclic diguanylic acid, the activator of cellulose synthase, and exhibited cellulose synthase activity in vitro. S. Typhimurium and E. coli also have cellulose as a component of the extracellular matrix (Zogaj, Nimtz et al. 2001), and BcsC is also present in these organisms. Recently, a proteomic analysis of E. chrysanthemi (Babujee, Venkatesh et al. 2007) revealed that BcsC is located in the outer membrane. The postulated β-barrel domain is located in the C-terminal part of this large protein and probably serves as a membrane-anchoring region or as a channel for cellulose export. Other domains were also found in the protein, including several tetratricopeptide repeat (TPR) domains, responsible for protein-protein interactions, and various domains of unknown function. Prediction methods suggest that the C-terminal β-barrel is formed by 16 transmembrane strands.
NfrA of E. coli was found to be one of the two gene products required for bacteriophage N4 adsorption (Kiino, Singer et al. 1993). The nfrA gene encodes a 990-residue-long outer membrane protein which presumably serves as the phage receptor, whereas the nfrB gene encodes an inner membrane protein that may be a component of the receptor. Besides acting as a receptor for the N4 bacteriophage, the physiological role of NfrA is unknown (Molloy, Herbert et al. 2000). The transmembrane domain of this large protein is located in the C-terminus, and prediction methods suggest a barrel of 12 strands. A large N-terminal periplasmic domain of unknown function is also present, as well as a tetratricopeptide repeat (TPR) domain, which is known to be implicated in protein-protein interactions.
Lastly, we have to mention some examples of newly discovered outer membrane β-barrels with as yet unknown function. Such proteins are the Serpulina variable surface protein (McCaman, Auer et al. 1999; McCaman, Auer et al. 2003), the E. coli YfaZ and YaiO outer membrane proteins (Marani, Wagner et al. 2006), and the salt-stress-induced outer membrane protein of Rhodobacter sphaeroides (Xu, Abo et al. 2001; Xu, Kadokura et al. 2001; Tsuzuki, Xu et al. 2005).

CONCLUSION
It is now evident that β-barrel outer membrane proteins, even though they share some remarkable structural features, constitute a large and, in terms of functional roles, highly diverse superfamily of membrane proteins. By reviewing the recent literature we have presented evidence for dozens of previously uncharacterized families of outer membrane proteins implicated in almost any functional role of the bacterial outer membrane. Thus, there are general and specific porins of various families, specific transporters and receptors, translocation channels, enzymes, and adhesion proteins. The importance of identifying such proteins in the completely sequenced genomes is clear, since such proteins could be responsible for the pathogenicity of some medically important Gram-negative bacteria, whereas other proteins could serve as potential targets for drugs or vaccines.
One issue that has to be pursued in the near future is the incorporation of such a detailed annotation of the various families into the publicly available databases. Given that outer membrane proteins are continuously being characterized and the available knowledge accumulates, we anticipate that such a level of functional annotation will be incorporated soon. Genome-wide computational studies are also needed in order to fully address the issues regarding the distribution of the various families of β-barrel outer membrane proteins in the bacteria with completely sequenced genomes.
Concerning prediction methods, the emergence of newly characterized β-barrel outer membrane proteins should be exploited alongside the progress of the algorithmic techniques used for prediction. Here, two things are of particular interest. The first is the coupling of experimentally derived topological information with the development of prediction algorithms that directly incorporate such information in the prediction, similar to the methods implemented for the prediction of α-helical membrane proteins. The second is the development of methods that can accurately locate the transmembrane domain in multidomain proteins. It is worth noting that most of the currently available prediction algorithms fail to correctly predict the transmembrane topology of a protein with large periplasmic N- and C-terminal parts. Such prediction methods could in turn be used to identify novel outer membrane proteins that share no sequence similarity with the known families, and these proteins could be further submitted as potential targets to experimentalists.

REFERENCES
Adewoye, L. O., Tschetter, L., et al. (1998). Channel specificity and secondary structure of the glucose-inducible porins of Pseudomonas spp. J Bioenerg Biomembr, 30(3), 257-67.
Adewoye, L. O., & Worobec, E. A. (1999). Multiple environmental factors regulate the expression of the
carbohydrate-selective OprB porin of Pseudomonas aeruginosa. Can J Microbiol, 45(12), 1033-42.
Afkar, E., Reguera, G., et al. (2005). A novel Geobacteraceae-specific outer membrane protein J (OmpJ)
is essential for electron transport to Fe(III) and Mn(IV) oxides in Geobacter sulfurreducens. BMC
Microbiol, 5, 41.
Akimana, C., & Lafontaine, E. R. (2007). The Moraxella catarrhalis outer membrane protein CD contains
two distinct domains specifying adherence to human lung cells. FEMS Microbiol Lett, 271(1), 12-9.
Andersen, C., Krones, D., et al. (1998). The porin RafY encoded by the raffinose plasmid pRSD2 of
Escherichia coli forms a general diffusion pore and not a carbohydrate-specific porin. Eur J Biochem,
254(3), 679-84.
Babujee, L., Venkatesh, B., et al. (2007). Proteomic analysis of the carbonate insoluble outer membrane
fraction of the soft-rot pathogen Dickeya dadantii (syn. Erwinia chrysanthemi) strain 3937. J Proteome
Res, 6(1), 62-9.
Bagos, P. G., Liakopoulos, T. D., et al. (2004). Finding beta-barrel outer membrane proteins with a
Markov Chain model. WSEAS Transactions on Biology and Biomedicine, 2(1), 186-189.


Bagos, P. G., Liakopoulos, T. D., et al. (2005). Evaluation of methods for predicting the topology of
beta-barrel outer membrane proteins and a consensus prediction method. BMC Bioinformatics, 6, 7.
Bagos, P. G., Liakopoulos, T. D., et al. (2004). A Hidden Markov Model method, capable of predicting
and discriminating beta-barrel outer membrane proteins. BMC Bioinformatics, 5(29).
Bagos, P. G., Liakopoulos, T. D., et al. (2004). PRED-TMBB, a web server for predicting the topology
of beta-barrel outer membrane proteins. Nucleic Acids Res, 32(Web Server Issue), W400-W404.
Baldo, L., Lo, N., et al. (2005). Mosaic nature of the wolbachia surface protein. J Bacteriol, 187(15),
5406-18.
Barnett, J. K., Barnett, D., et al. (1999). Expression and distribution of leptospiral outer membrane
components during renal infection of hamsters. Infect Immun, 67(2), 853-61.
Bartlett, D., & Chi, E. (1994). Genetic characterization of ompH mutants in the deep-sea bacterium
Photobacterium sp. strain SS9. Arch Microbiol, 162(5), 323-8.
Bartlett, D. H., Chi, E., et al. (1993). Sequence of the ompH gene from the deep-sea bacterium Photobacterium SS9. Gene, 131(1), 125-8.
Batista da Silva, A. P., Lee, W., et al. (2004). The major outer sheath protein of Treponema denticola
inhibits the binding step of collagen phagocytosis in fibroblasts. Cell Microbiol, 6(5), 485-98.
Bell, A., Bains, M., et al. (1991). Pseudomonas aeruginosa outer membrane protein OprH, expression
from the cloned gene and function in EDTA and gentamicin resistance. J Bacteriol, 173(21), 6657-64.
Berman, H. M., Battistuz, T., et al. (2002). The Protein Data Bank. Acta Crystallogr D Biol Crystallogr,
58(Pt 6 No 1), 899-907.
Berven, F. S., Flikka, K., et al. (2004). BOMP, a program to predict integral β-barrel outer membrane
proteins encoded within genomes of Gram-negative bacteria. Nucleic Acids Res, 32(Web Server Issue),
W394-W399.
Bigelow, H., & Rost, B. (2006). PROFtmb, a Web server for predicting bacterial transmembrane beta
barrel proteins. Nucleic Acids Res, 34(Web Server issue), W186-8.
Bigelow, H. R., Petrey, D. S., et al. (2004). Predicting transmembrane beta-barrels in proteomes. Nucleic
Acids Res, 32(8), 2566-77.
Bissig, K. D., Voegelin, T. C., et al. (2001). Tetrathiomolybdate inhibition of the Enterococcus hirae
CopB copper ATPase. FEBS Lett, 507(3), 367-70.
Blot, N., Berrier, C., et al. (2002). The oligogalacturonate-specific porin KdgM of Erwinia chrysanthemi
belongs to a new porin family. J Biol Chem, 277(10), 7936-44.
Bolla, J. M., De, E., et al. (2000). Purification, characterization and sequence analysis of Omp50, a new
porin isolated from Campylobacter jejuni. Biochem J, 352 Pt 3, 637-43.
Bolla, J. M., Saint, N., et al. (2004). Crystallization and preliminary crystallographic studies of MOMP
(major outer membrane protein) from Campylobacter jejuni. Acta Crystallogr D Biol Crystallogr, 60(Pt
12 Pt 2), 2349-51.


Bos, M. P., Tefsen, B., et al. (2004). Identification of an outer membrane protein required for the transport
of lipopolysaccharide to the bacterial cell surface. Proc Natl Acad Sci U S A, 101(25), 9417-22.
Bos, M. P., & Tommassen, J. (2004). Biogenesis of the Gram-negative bacterial outer membrane. Curr
Opin Microbiol, 7(6), 610-6.
Braig, H. R., Zhou, W., et al. (1998). Cloning and characterization of a gene encoding the major surface
protein of the bacterial endosymbiont Wolbachia pipientis. J Bacteriol, 180(9), 2373-8.
Braun, M., & Silhavy, T. J. (2002). Imp/OstA is required for cell envelope biogenesis in Escherichia
coli. Mol Microbiol, 45(5), 1289-302.
Brinkman, F. S., Bains, M., et al. (2000). The amino terminus of Pseudomonas aeruginosa outer membrane protein OprF forms channels in lipid bilayer membranes, correlation with a three-dimensional
model. J Bacteriol, 182(18), 5251-5.
Bunikis, J., Luke, C. J., et al. (1998). A surface-exposed region of a novel outer membrane protein (P66)
of Borrelia spp. is variable in size and sequence. J Bacteriol, 180(7), 1618-23.
Cha, J. S., & Cooksey, D. A. (1991). Copper resistance in Pseudomonas syringae mediated by periplasmic
and outer membrane proteins. Proc Natl Acad Sci USA, 88(20), 8915-9.
Clantin, B., Hodak, H., et al. (2004). The crystal structure of filamentous hemagglutinin secretion domain
and its implications for the two-partner secretion pathway. Proc Natl Acad Sci USA, 101(16), 6194-9.
Clark, R. B. (1996). Imipenem resistance among Acinetobacter baumannii, association with reduced
expression of a 33-36 kDa outer membrane protein. J Antimicrob Chemother, 38(2), 245-51.
Collins, R. F., Frye, S. A., et al. (2004). Structure of the Neisseria meningitidis outer membrane PilQ
secretin complex at 12 Å resolution. J Biol Chem, 279(38), 39750-6.
Condemine, G., Berrier, C., et al. (2005). Function and expression of an N-acetylneuraminic acid-inducible outer membrane channel in Escherichia coli. J Bacteriol, 187(6), 1959-65.
De, E., Jullien, M., et al. (2000). MOMP (major outer membrane protein) of Campylobacter jejuni; a
versatile pore-forming protein. FEBS Lett, 469(1), 93-7.
Dedieu, L., Pages, J. M., et al. (2004). Use of the omp50 gene for identification of Campylobacter species
by PCR. J Clin Microbiol, 42(5), 2301-5.
Diederichs, K., Freigang, J., et al. (1998). Prediction by a neural network of outer membrane beta-strand
protein topology. Protein Sci, 7(11), 2413-20.
Doig, P., Exner, M. M., et al. (1995). Isolation and characterization of a conserved porin protein from
Helicobacter pylori. J Bacteriol, 177(19), 5447-52.
Edwards, A. M., Jenkinson, H. F., et al. (2005). Binding properties and adhesion-mediating regions of
the major sheath protein of Treponema denticola ATCC 35405. Infect Immun, 73(5), 2891-8.
Egli, C., Leung, W. K., et al. (1993). Pore-forming properties of the major 53-kilodalton surface antigen
from the outer sheath of Treponema denticola. Infect Immun, 61(5), 1694-9.


El Hamel, C., Freulet, M. A., et al. (2000). Involvement of the C-terminal part of Pseudomonas fluorescens
OprF in the modulation of its pore-forming properties. Biochim Biophys Acta, 1509(1-2), 237-44.
Epp, S. F., Kohler, T., et al. (2001). C-terminal region of Pseudomonas aeruginosa outer membrane porin
OprD modulates susceptibility to meropenem. Antimicrob Agents Chemother, 45(6), 1780-7.
Exner, M. M., Doig, P., et al. (1995). Isolation and characterization of a family of porin proteins from
Helicobacter pylori. Infect Immun, 63(4), 1567-72.
Exner, M. M., Wu, X., et al. (2000). Protection elicited by native outer membrane protein Oms66 (p66)
against host-adapted Borrelia burgdorferi, conformational nature of bactericidal epitopes. Infect Immun, 68(5), 2647-54.
Faller, M., Niederweis, M., et al. (2004). The structure of a mycobacterial outer-membrane channel.
Science, 303(5661), 1189-92.
Findlay, H. E., McClafferty, H., et al. (2005). Surface expression, single-channel analysis and membrane
topology of recombinant Chlamydia trachomatis Major Outer Membrane Protein. BMC Microbiol, 5(1),
5.
Finn, R. D., Mistry, J., et al. (2006). Pfam, clans, web tools and services. Nucleic Acids Res, 34(Database
issue), D247-51.
Gardy, J. L., Spencer, C., et al. (2003). PSORT-B, Improving protein subcellular localization prediction
for Gram-negative bacteria. Nucleic Acids Res, 31(13), 3613-7.
Garrow, A. G., Agnew, A., et al. (2005). TMB-Hunt, an amino acid composition based method to screen
proteomes for beta-barrel transmembrane proteins. BMC Bioinformatics, 6, 56.
Garrow, A. G., & Westhead, D. R., (2007). A consensus algorithm to screen genomes for novel families
of transmembrane beta barrel proteins. Proteins.
Gentle, I., Gabriel, K., et al. (2004). The Omp85 family of proteins is essential for outer membrane
biogenesis in mitochondria and bacteria. J Cell Biol, 164(1), 19-24.
Gribun, A., Nitzan, Y., et al. (2003). Molecular and structural characterization of the HMP-AB gene
encoding a pore-forming protein from a clinical isolate of Acinetobacter baumannii. Curr Microbiol,
47(5), 434-43.
Gromiha, M. M., Ahmad, S., et al. (2004). Neural network-based prediction of transmembrane beta-strand segments in outer membrane proteins. J Comput Chem, 25(5), 762-7.
Gromiha, M. M., Ahmad, S., et al. (2005). Application of residue distribution along the sequence for
discriminating outer membrane proteins. Comput Biol Chem, 29(2), 135-42.
Gromiha, M. M., Majumdar, R., et al. (1997). Identification of membrane spanning beta strands in
bacterial porins. Protein Eng, 10(5), 497-500.
Gromiha, M. M., & Ponnuswamy, P. K. (1993). Prediction of transmembrane beta-strands from hydrophobic characteristics of proteins. Int J Pept Protein Res, 42(5), 420-31.


Gromiha, M. M., & Suwa, M. (2005). A simple statistical method for discriminating outer membrane
proteins with better accuracy. Bioinformatics, 21(7), 961-8.
Haake, D. A., Mazel, M. K., et al. (1999). Leptospiral outer membrane proteins OmpL1 and LipL41
exhibit synergistic immunoprotection. Infect Immun, 67(12), 6572-82.
Hancock, R. E., & Brinkman, F. S. (2002). Function of pseudomonas porins in uptake and efflux. Annu
Rev Microbiol, 56, 17-38.
Hansel, A., Pattus, F., et al. (1998). Cloning and characterization of the genes coding for two porins in
the unicellular cyanobacterium Synechococcus PCC 6301. Biochim Biophys Acta, 1399(1), 31-9.
Hansel, A., & Tadros, M. H. (1998). Characterization of two pore-forming proteins isolated from the
outer membrane of Synechococcus PCC 6301. Curr Microbiol, 36(6), 321-6.
Hays, J. P., van Selm, S., et al. (2005). Identification and characterization of a novel outer membrane protein (OMP J) of Moraxella catarrhalis that exists in two major forms. J Bacteriol, 187(23), 7977-84.
Henderson, N. S., So, S. S., et al. (2004). Topology of the outer membrane usher PapC determined by
site-directed fluorescence labeling. J Biol Chem, 279(51), 53747-54.
High, A. S., Torosian, S. D., et al. (1993). Cloning, nucleotide sequence and expression in Escherichia
coli of a gene (ompM) encoding a 25 kDa major outer-membrane protein (MOMP) of legionella pneumophila. J Gen Microbiol, 139(8), 1715-21.
Hoffman, P. S., Seyer, J. H., et al. (1992). Molecular characterization of the 28- and 31-kilodalton subunits
of the Legionella pneumophila major outer membrane protein. J Bacteriol, 174(3), 908-13.
Hofreuter, D., Karnholz, A., et al. (2003). Topology and membrane interaction of Helicobacter pylori ComB
proteins involved in natural transformation competence. Int J Med Microbiol, 293(2-3), 153-65.
Huang, H., Wang, X., et al. (2007). Porin activity of Anaplasma phagocytophilum outer membrane
fraction and purified P44. J Bacteriol, 189(5), 1998-2006.
Hughes, E. S., Shaw, K. M., et al. (2001). Mutagenesis and functional reconstitution of chlamydial major
outer membrane proteins, VS4 domains are not required for pore formation but modify channel function. Infect Immun, 69(3), 1671-8.
Jacob-Dubuisson, F., Fernandez, R., et al. (2004). Protein secretion through autotransporter and two-partner pathways. Biochim Biophys Acta, 1694(1-3), 235-57.
Jacoboni, I., Martelli, P. L., et al. (2001). Prediction of the transmembrane regions of beta-barrel membrane proteins with a neural network-based predictor. Protein Sci, 10(4), 779-87.
Kawa, D. E., Schachter, J., et al. (2004). Immune response to the Chlamydia trachomatis outer membrane
protein PorB. Vaccine, 22(31-32), 4282-6.
Kawa, D. E., & Stephens, R. S. (2002). Antigenic topology of chlamydial PorB protein and identification
of targets for immune neutralization of infectivity. J Immunol, 168(10), 5184-91.
Kiino, D. R., Singer, M. S., et al. (1993). Two overlapping genes encoding membrane proteins required
for bacteriophage N4 adsorption. J Bacteriol, 175(21), 7081-5.


Kleivdal, H., Puntervoll, P., et al. (2001). Topological investigations of the FomA porin from Fusobacterium nucleatum and identification of the constriction loop L6. Microbiology, 147(Pt 4), 1059-67.
Koronakis, V., Sharff, A., et al. (2000). Crystal structure of the bacterial membrane protein TolC central
to multidrug efflux and protein export. Nature, 405(6789), 914-9.
Koster, M., Bitter, W., et al. (2000). Protein secretion mechanisms in Gram-negative bacteria. Int J Med
Microbiol, 290(4-5), 325-31.
Krinos, C., High, A. S., et al. (1999). Role of the 25 kDa major outer membrane protein of Legionella
pneumophila in attachment to U-937 cells and its potential as a virulence factor for chick embryos. J
Appl Microbiol, 86(2), 237-44.
Kubo, A., & Stephens, R. S. (2000). Characterization and functional analysis of PorB, a Chlamydia
porin and neutralizing target. Mol Microbiol, 38(4), 772-80.
Lim, C. K., & Cooksey, D. A. (1993). Characterization of chromosomal homologs of the plasmid-borne
copper resistance operon of Pseudomonas syringae. J Bacteriol, 175(14), 4492-8.
Limansky, A. S., Mussi, M. A., et al. (2002). Loss of a 29-kilodalton outer membrane protein in Acinetobacter baumannii is associated with imipenem resistance. J Clin Microbiol, 40(12), 4776-8.
Liu, J., Rosenberg, E., et al. (1995). Fluidity of the Lipid Domain of Cell Wall From Mycobacterium
chelonae. PNAS, 92(24), 11254-11258.
Liu, Q., Zhu, Y., et al. (2003). Identification of beta-barrel membrane proteins based on amino acid
composition properties and predicted secondary structure. Comput Biol Chem, 27(3), 355-61.
Liu, Q., Zhu, Y. S., et al. (2003). A HMM-based method to predict the transmembrane regions of beta-barrel membrane proteins. Comput Biol Chem, 27(1), 69-76.
Luo, Y., Frey, E. A., et al. (2000). Crystal structure of enteropathogenic Escherichia coli intimin-receptor
complex. Nature, 405(6790), 1073-7.
Macdonald, A. G., Martinac, B., et al. (2003). Patch-clamp experiments with porins extracted from a
marine bacterium (Photobacterium profundum strain SS9) and reconstituted in liposomes. Cell Biochem
Biophys, 37(3), 157-67.
Mannella, C. A., Neuwald, A. F., et al. (1996). Detection of likely transmembrane beta strand regions
in sequences of mitochondrial pore proteins using the Gibbs sampler. J Bioenerg Biomembr, 28(2),
163-9.
Marani, P., Wagner, S., et al. (2006). New Escherichia coli outer membrane proteins identified through
prediction and experimental verification. Protein Sci, 15(4), 884-9.
Martelli, P. L., Fariselli, P., et al. (2002). A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics, 18 Suppl 1, S46-53.
McCaman, M. T., Auer, K., et al. (1999). Sequence characterization of two new members of a multi-gene
family in Serpulina hyodysenteriae (B204) with homology to a 39 kDa surface exposed protein, vspC
and D. Vet Microbiol, 68(3-4), 273-83.


McCaman, M. T., Auer, K., et al. (2003). Brachyspira hyodysenteriae contains eight linked gene copies
related to an expressed 39-kDa surface protein. Microbes Infect, 5(1), 1-6.
Meng, G., Surana, N. K., et al. (2006). Structure of the outer membrane translocator domain of the
Haemophilus influenzae Hia trimeric autotransporter. EMBO J, 25(11), 2297-304.
Mills, J., Wyborn, N. R., et al. (1997). An outer-membrane porin inducible by short-chain amides and
urea in the methylotrophic bacterium Methylophilus methylotrophus. Microbiology, 143(7), 2373-9.
Möller, S., Croning, M. D., et al. (2001). Evaluation of methods for the prediction of membrane spanning
regions. Bioinformatics, 17(7), 646-53.
Molloy, M. P., Herbert, B. R., et al. (2000). Proteomic analysis of the Escherichia coli outer membrane.
Eur J Biochem, 267(10), 2871-81.
Moser, I., Schroeder, W., et al. (1997). Campylobacter jejuni major outer membrane protein and a 59-kDa protein are involved in binding to fibronectin and INT 407 cell membranes. FEMS Microbiol Lett,
157(2), 233-8.
Moslavac, S., Mirus, O., et al. (2005). Conserved pore-forming regions in polypeptide-transporting
proteins. FEBS J, 272(6), 1367-78.
Murzin, A. G., Lesk, A. M., et al. (1994). Principles determining the structure of beta-sheet barrels in
proteins. I. A theoretical analysis. J Mol Biol, 236(5), 1369-81.
Murzin, A. G., Lesk, A. M., et al. (1994). Principles determining the structure of beta-sheet barrels in
proteins. II. The observed structures. J Mol Biol, 236(5), 1382-400.
Mussi, M. A., Limansky, A. S., et al. (2005). Acquisition of resistance to carbapenems in multidrug-resistant clinical strains of Acinetobacter baumannii, natural insertional inactivation of a gene encoding
a member of a novel family of beta-barrel outer membrane proteins. Antimicrob Agents Chemother,
49(4), 1432-40.
Nagano, K., Read, E. K., et al. (2005). Trimeric structure of major outer membrane proteins homologous
to OmpA in Porphyromonas gingivalis. J Bacteriol, 187(3), 902-11.
Natt, N. K., Kaur, H., et al. (2004). Prediction of transmembrane regions of beta-barrel proteins using
ANN- and SVM-based methods. Proteins, 56(1), 11-8.
Nesper, J., Hill, C. M., et al. (2003). Translocation of group 1 capsular polysaccharide in Escherichia
coli serotype K30. Structural and functional analysis of the outer membrane lipoprotein Wza. J Biol
Chem, 278(50), 49763-72.
Neuwald, A. F., Liu, J. S., et al. (1995). Gibbs motif sampling, detection of bacterial outer membrane
protein repeats. Protein Sci, 4(8), 1618-32.
Newman, C. L., & Stathopoulos, C. (2004). Autotransporter and two-partner secretion, delivery of
large-size virulence factors by gram-negative bacterial pathogens. Crit Rev Microbiol, 30(4), 275-86.
Ng, T. W., Akman, L., et al. (2004). The usher N terminus is the initial targeting site for chaperone-subunit complexes and participates in subsequent pilus biogenesis events. J Bacteriol, 186(16), 5321-31.

202

Bacterial -Barrel Outer Membrane Proteins

Niemann, H. H., Schubert, W. D., et al. (2004). Adhesins and invasins of pathogenic bacteria, a structural
view. Microbes Infect, 6(1), 101-12.
Oberle, S. M., & Barbet, A. F. (1993). Derivation of the complete msp4 gene sequence of Anaplasma
marginale without cloning. Gene, 136(1-2), 291-4.
Ochs, M. M., Lu, C. D., et al. (1999). Amino acid-mediated induction of the basic amino acid-specific
outer membrane porin OprD from Pseudomonas aeruginosa. J Bacteriol, 181(17), 5426-32.
Ofori-Darko, E., Zavros, Y., et al. (2000). An OmpA-like protein from Acinetobacter spp. stimulates
gastrin and interleukin-8 promoters. Infect Immun, 68(6), 3657-66.
Orlik, F., Andersen, C., et al. (2003). CymA of Klebsiella oxytoca outer membrane, binding of cyclodextrins and study of the current noise of the open channel. Biophys J, 85(2), 876-85.
Pajatsch, M., Andersen, C., et al. (1999). Properties of a cyclodextrin-specific, unusual porin from
Klebsiella oxytoca. J Biol Chem, 274(35), 25159-66.
Park, K. K., Heuner, K., et al. (2002). Cloning and characterization of a major surface protein (MspTL)
of Treponema lecithinolyticum associated with rapidly progressive periodontitis. FEMS Microbiol Lett,
207(2), 185-92.
Paschen, S. A., Waizenegger, T., et al. (2003). Evolutionary conservation of biogenesis of beta-barrel
membrane proteins. Nature, 426(6968), 862-6.
Peck, B., Ortkamp, M., et al. (2001). Characterization of four members of a multigene family encoding
outer membrane proteins of Helicobacter pylori and their potential for vaccination. Microbes Infect,
3(3), 171-9.
Pellinen, T., Ahlfors, H., et al. (2003). Topology of the Erwinia chrysanthemi oligogalacturonate porin
KdgM. Biochem J, 372(Pt 2), 329-34.
Pessione, E., Giuffrida, M. G., et al. (2003). Membrane proteome of Acinetobacter radioresistens S13
during aromatic exposure. Proteomics, 3(6), 1070-6.
Pirnay, J. P., De Vos, D., et al. (2002). Analysis of the Pseudomonas aeruginosa oprD gene from clinical
and environmental isolates. Environ Microbiol, 4(12), 872-82.
Puntervoll, P., Ruud, M., et al. (2002). Structural characterization of the fusobacterial non-specific
porin FomA suggests a 14-stranded topology, unlike the classical porins. Microbiology, 148(Pt 11),
3395-403.
Rahn, A., Beis, K., et al. (2003). A novel outer membrane protein, Wzi, is involved in surface assembly
of the Escherichia coli K30 group 1 capsule. J Bacteriol, 185(19), 5882-90.
Rehm, B. H. (1996). The Azotobacter vinelandii gene algJ encodes an outer-membrane protein presumably involved in export of alginate. Microbiology, 142(4), 873-80.
Rehm, B. H., Boheim, G., et al. (1994). Overexpression of algE in Escherichia coli, subcellular localization, purification, and ion channel properties. J Bacteriol, 176(18), 5639-47.

203

Bacterial -Barrel Outer Membrane Proteins

Rehm, B. H., & Hancock, R. E. (1996). Membrane topology of the outer membrane protein OprH from
Pseudomonas aeruginosa, PCR-mediated site-directed insertion and deletion mutagenesis. J Bacteriol,
178(11), 3346-9.
Rodriguez-Maranon, M. J., Bush, R. M., et al. (2002). Prediction of the membrane-spanning beta-strands
of the major outer membrane protein of Chlamydia. Protein Sci, 11(7), 1854-61.
Ross, B. C., Czajkowski, L., et al. (2004). Characterization of two outer membrane protein antigens
of Porphyromonas gingivalis that are protective in a murine lesion model. Oral Microbiol Immunol,
19(1), 6-15.
Saint, N., El Hamel, C., et al. (2000). Ion channel formation by N-terminal domain, a common feature
of OprFs of Pseudomonas and OmpA of Escherichia coli. FEMS Microbiol Lett, 190(2), 261-5.
Sato, K., Sakai, E., et al. (2005). Identification of a new membrane-associated protein that influences
transport/maturation of gingipains and adhesins of Porphyromonas gingivalis. J Biol Chem, 280(10),
8668-77.
Schirmer, T., & Cowan, S. W. (1993). Prediction of membrane-spanning beta-strands and its application
to maltoporin. Protein Sci, 2(8), 1361-3.
Schulz, G. E. (2002). The structure of bacterial outer membrane proteins. Biochim Biophys Acta, 1565(2),
308-17.
Schulz, G. E. (2003). Transmembrane beta-barrel proteins. Adv Protein Chem, 63, 47-70.
Shang, E. S., Exner, M. M., et al. (1995). The rare outer membrane protein, OmpL1, of pathogenic Leptospira species is a heat-modifiable porin. Infect Immun, 63(8), 3174-81.
Siroy, A., Cosette, P., et al. (2006). Global comparison of the membrane subproteomes between a multidrug-resistant Acinetobacter baumannii strain and a reference strain. J Proteome Res, 5(12), 3385-98.
Siroy, A., Molle, V., et al. (2005). Channel formation by CarO, the carbapenem resistance-associated
outer membrane protein of Acinetobacter baumannii. Antimicrob Agents Chemother, 49(12), 4876-83.
Skare, J. T., Mirzabekov, T. A., et al. (1997). The Oms66 (p66) protein is a Borrelia burgdorferi porin.
Infect Immun, 65(9), 3654-61.
Song, L., Hobaugh, M. R., et al. (1996). Structure of Staphylococcal alpha -Hemolysin, a Heptameric
Transmembrane Pore. Science, 274(5294), 1859-1865.
Surana, N. K., Buscher, A. Z., et al. (2006). Translocator proteins in the two-partner secretion family
have multiple domains. J Biol Chem, 281(26), 18051-8.
Surana, N. K., Grass, S., et al. (2004). Evidence for conservation of architecture and physical properties
of Omp85-like proteins throughout evolution. Proc Natl Acad Sci U S A, 101(40), 14497-502.
Tamber, S., Ochs, M. M., et al. (2006). Role of the novel OprD family of porins in nutrient uptake in
Pseudomonas aeruginosa. J Bacteriol, 188(1), 45-54.
Thanassi, D. G. (2002). Ushers and secretins, channels for the secretion of folded proteins across the
bacterial outer membrane. J Mol Microbiol Biotechnol, 4(1), 11-20.

204

Bacterial -Barrel Outer Membrane Proteins

Tomb, J. F., White, O., et al. (1997). The complete genome sequence of the gastric pathogen Helicobacter
pylori. Nature, 388(6642), 539-47.
Toren, A., Orr, E., et al. (2002). The active component of the bioemulsifier alasan from Acinetobacter
radioresistens KA53 is an OmpA-like protein. J Bacteriol, 184(1), 165-70.
Touze, T., Hayward, R. D., et al. (2004). Self-association of EPEC intimin mediated by the beta-barrelcontaining anchor domain, a role in clustering of the Tir receptor. Mol Microbiol, 51(1), 73-87.
Tsuzuki, M., Xu, X. Y., et al. (2005). SspA, an outer membrane protein, is highly induced under saltstressed conditions and is essential for growth under salt-stressed aerobic conditions in Rhodobacter
sphaeroides f. sp. denitrificans. Appl Microbiol Biotechnol, 68(2), 242-50.
Ulmke, C., Lengeler, J. W., et al. (1997). Identification of a new porin, RafY, encoded by raffinose
plasmid pRSD2 of Escherichia coli. J Bacteriol, 179(18), 5783-8.
Umeda, H., Aiba, H., et al. (1996). SomA, a novel gene that encodes a major outer-membrane protein
of Synechococcus sp. PCC 7942. Microbiology, 142(8), 2121-8.
Vashist, J., & Rajeswari, M. R. (2006). Structural investigations on novel porin, OmpAb from Acinetobacter baumannii. J Biomol Struct Dyn, 24(3), 243-53.
Vogel, H., & Jahnig, F. (1986). Models for the structure of outer-membrane proteins of Escherichia coli
derived from raman spectroscopy and prediction methods. J Mol Biol, 190(2), 191-9.
von Heijne, G. (1999). Recent advances in the understanding of membrane protein assembly and structure. Q Rev Biophys, 32(4), 285-307.
Voulhoux, R., & Tommassen, J. (2004). Omp85, an evolutionarily conserved bacterial protein involved
in outer-membrane-protein assembly. Res Microbiol, 155(3), 129-35.
Wimley, W. C. (2002). Toward genomic identification of beta-barrel membrane proteins, composition
and architecture of known structures. Protein Sci, 11(2), 301-12.
Wimley, W. C. (2003). The versatile beta-barrel membrane protein. Curr Opin Struct Biol, 13(4), 40411.
Wong, H. C., Fear, A. L., et al. (1990). Genetic organization of the cellulose synthase operon in Acetobacter xylinum.Proc Natl Acad Sci USA, 87(20), 8130-4.
Wylie, J. L., & Worobec, E. A. (1994). Cloning and nucleotide sequence of the Pseudomonas aeruginosa
glucose-selective OprB porin gene and distribution of OprB within the family Pseudomonadaceae. Eur
J Biochem, 220(2), 505-12.
Wylie, J. L., & Worobec, E. A. (1995). The OprB porin plays a central role in carbohydrate uptake in
Pseudomonas aeruginosa. J Bacteriol, 177(11), 3021-6.
Xu, X., Abo, M., et al. (2001). Salt-stress-responsive membrane proteins in Rhodobacter sphaeroides f.
sp. denitrificans IL 106. J Biosci Bioeng, 91(2), 228-30.

205

Bacterial -Barrel Outer Membrane Proteins

Xu, X. Y., Kadokura, H., et al. (2001). Cloning and sequencing of a gene encoding a novel salt stressinduced membrane protein from Rhodobacter sphaeroides f. sp. dentrificans. Appl Microbiol Biotechnol,
56(3-4), 442-7.
Yen, T. Y., Pal, S., et al. (2005). Characterization of the disulfide bonds and free cysteine residues of
the Chlamydia trachomatis mouse pneumonitis major outer membrane protein. Biochemistry, 44(16),
6250-6.
Yoshihara, E., Yoneyama, H., et al. (1998). Identification of the catalytic triad of the protein D2 protease
in Pseudomonas aeruginosa. Biochem Biophys Res Commun, 247(1), 142-5.
Zhai, Y., & Saier Jr., M. H. (2002). The beta-barrel finder (BBF) program, allowing identification of outer
membrane beta-barrel proteins encoded within prokaryotic genomes. Protein Sci, 11(9), 2196-207.
Zhang, J. Z., Guo, H., et al. (2004). Expression of members of the 28-kilodalton major outer membrane
protein family of Ehrlichia chaffeensis during persistent infection. Infect Immun, 72(8), 4336-43.
Zhang, Q., Meitzler, J. C., et al. (2000). Sequence polymorphism, predicted secondary structures, and
surface-exposed conformational epitopes of Campylobacter major outer membrane protein. Infect Immun, 68(10), 5679-89.
Zogaj, X., Nimtz, M., et al. (2001). The multicellular morphotypes of Salmonella typhimurium and
Escherichia coli produce cellulose as the second component of the extracellular matrix. Mol Microbiol,
39(6), 1452-63.

Key Terms
Beta-Barrel Outer Membrane Proteins: They constitute one of the two major structural classes of transmembrane proteins (the other being the alpha-helical membrane proteins). Their membrane-spanning segments are entirely composed of short amphipathic beta-strands that twist and coil to form a barrel (beta-barrel). To date they have been found only in the outer membranes of Gram-negative Bacteria and, presumably (based on low-resolution experimental data), in the outer membranes of mitochondria and chloroplasts. They perform a range of very important biological functions, such as membrane transport, receptor activity, and enzymatic activity, or play a structural role. The smallest known transmembrane beta-barrels are composed of 8 beta-strands, whereas the largest are composed of 24 beta-strands, although it is possible that barrels with a larger number of strands exist.
Gram-Negative Bacteria: Traditionally, Bacteria are divided according to their response to Gram staining. Bacteria that are stained by Gram's method are commonly referred to as Gram-positive, whereas others (that are not stained) are referred to as Gram-negative. Gram-negative Bacteria include a number of important classes, such as the proteobacteria, cyanobacteria, spirochaetes, and green sulfur and green non-sulfur bacteria. They all share a common surface structure composed of a cytoplasmic membrane, a periplasmic space and an outer membrane.
Outer Membrane: The term refers to the external (outermost) membranes of Gram-negative bacteria, chloroplasts and mitochondria. The outer membrane of Gram-negative bacteria has a unique and unusual structure. The outer leaflet of the membrane is composed of a complex lipopolysaccharide (LPS) whose lipid portion acts as an endotoxin. Another notable peculiarity of the outer membrane (at least in Bacteria) is the fact that the proteins embedded in it (integral membrane proteins) have their membrane-spanning segments entirely composed of beta-strands (beta-barrel outer membrane proteins), as opposed to integral membrane proteins in any other membrane, which have their membrane-spanning segments formed by hydrophobic alpha-helices (alpha-helical membrane proteins).
Porins: This is the oldest known and best-studied super-family of transmembrane beta-barrels. Porins are large enough to allow passive diffusion, i.e., they act as channels that are specific to different types of molecules. They are found in the outer membranes of Gram-negative bacteria, mitochondria and chloroplasts. The amino acid composition of the transmembrane beta-strands is unique, since polar and non-polar residues alternate along them. In this way, the non-polar residues face outwards to interact with the non-polar lipid bilayer, while the polar residues face the barrel interior to interact with the aqueous channel. Porins typically control the diffusion of small metabolites such as sugars, ions, and amino acids. The archetypical general diffusion porins of Gram-negative Bacteria are composed of 16-stranded beta-barrels and are active only as homo-trimers (trimeric porins). However, other families have been found that deviate from this motif, such as the sugar porins, which are trimeric porins composed of 18-stranded beta-barrels, or the monomeric porins composed of 14-stranded beta-barrels.

Section IV

Experimental Techniques for Systems Biology

Chapter XI

Clustering Methods for Gene-Expression Data


L.K. Flack
University of Queensland, Australia
G.J. McLachlan
University of Queensland, Australia

Abstract
Clustering methods are used to place items into natural patterns or convenient groups. They can be used to place genes into clusters that have similar expression patterns across the tissue samples of interest. They can also be used to cluster tissues into groups on the basis of their gene profiles. Examples of the methods used are hierarchical agglomerative clustering, k-means clustering, self-organizing maps, and model-based methods. The focus of this chapter is on using mixtures of multivariate normal distributions to provide model-based clusterings of tissue samples and of genes.

INTRODUCTION
DNA microarrays are collections of microscopic DNA spots arrayed on a solid surface. Each of these DNA spots will hybridize with a particular target RNA or DNA sequence. Optical measurements are made of fluorophores attached to the target RNA or DNA. DNA microarrays allow us to simultaneously read the expression levels of thousands of genes. They and other high-throughput measurement methods bring many new opportunities in data analysis, but they also create difficulties in taking advantage of this amount of data.
A variety of multivariate methods have been used to look for relationships among the genes and
tissue samples. Cluster analysis has been one of the most frequently used of these methods. It has been
useful in the discovery of gene function and of groups of interconnected biological processes; see Eisen
et al. (1998) for examples.

In medical applications, we are usually interested in the supervised and unsupervised grouping of
tissue samples on the basis of the genes expressed. In the latter context, the intent is to identify what
subtypes of cancer or other diseases exist, with the aim of assigning patients to these subgroups in order
to aid their prognosis and therapy. In biological studies, we are usually interested in partitioning the
genes into clusters in which the genes display similar patterns of gene expression across the relevant
tissue samples (or cell lines). Genes in the same cluster are likely to be part of the same biological pathway or otherwise related.
It can be seen there are two distinct but related clustering problems with microarray data. One problem
concerns the clustering of the tissues on the basis of the genes; the other concerns the clustering of the
genes on the basis of the tissues. This duality in cluster analysis is quite common.
The aim of clustering is to put items into groups so that they are more similar to each other than they
are to members of other clusters. One of the difficulties of clustering is that the notion of clustering is
vague. A useful way to think about the different clustering procedures is in terms of the shape of the
clusters. The majority of the existing clustering methods assume that a similarity or distance measure
or metric is known a priori; often the Euclidean metric is used. But clearly, it would be more appropriate to use a metric that depends on the shape of the clusters. As pointed out by Coleman et al. (1999),
the difficulty is that the shape of the clusters is not known until the clusters have been found, and the
clusters cannot be effectively identified unless the shapes are known.
We will give a brief overview of clustering before we describe its application to microarray data.
More detailed accounts of clustering can be found in the many books on this topic; for example, Everitt
(1993), Hartigan (1975), and Kaufman and Rousseeuw (1990).

Some Heuristic Clustering Methods

In cluster analysis, we wish to group a number (n) of entities into a smaller number (g) of groups on the basis of measurements of some variables associated with each entity. We let $y_j = (y_{1j}, \ldots, y_{pj})^T$ be the observation or feature vector of the p measurements $y_{1j}, \ldots, y_{pj}$ made on the jth entity (j = 1, ..., n) to be clustered. In discriminant analysis the data belong to g known classes and we wish to create an allocation rule to allow us to assign an unclassified entity to one of these classes on the basis of its feature vector.
In cluster analysis, we have no prior knowledge of group membership or structure, except possibly
the number of classes. Clustering can have either or both of two aims. We might wish to split the data
into several groups with no implication that these groups are a natural division of the data. We might do
this for the sake of convenience or mathematical tractability. In this case intergroup boundaries do not
necessarily have to be in regions of the feature space with a relatively low density of points. The feature
space will be divided into contiguous and at least in some sense compact regions. This is sometimes
called dissection or segmentation. Alternatively, we might wish to find a natural subdivision of the
entities into groups. In this case the clusters will be regions of the feature space with a relatively high
density of points separated by regions with relatively low densities of points. Sometimes the distinction
between the two aims is stressed. But often it is not made, particularly as most methods for finding
natural clusters are also useful for segmenting the data.
Clustering methods can be categorized as hierarchical or nonhierarchical. With a hierarchical
clustering method every cluster obtained is a split or merger of clusters obtained at the previous stage.
Hierarchical clustering methods can be agglomerative, starting with g = n clusters, or divisive, starting with a single cluster. In practice with hierarchical clustering we usually use agglomerative methods, since hierarchical divisive methods can be computationally prohibitive unless the sample size n is very small. (There are $2^{n-1} - 1$ ways of making the first split.) There are methods of carrying out hierarchical divisive clustering that are less computationally intensive, but they will usually not give the optimal partition (Hastie et al., 2001).
To apply a hierarchical agglomerative clustering procedure, we first calculate a matrix of distances or of similarities between pairs of observations. We need to choose an appropriate distance metric. Pearson correlation coefficients are a common choice for genetic data. They are equivalent to the squared Euclidean distances on data normalized to have zero means and unit variances. We next choose a linkage metric for the between-cluster distance. Some examples are single linkage, average linkage, and complete linkage. Single linkage uses the shortest distance between any two objects (one from each cluster). Complete linkage uses the longest distance between any two objects (one from each cluster). Average linkage uses the average distance between pairs of objects (one from each cluster).

For agglomerative clustering, we join the two closest clusters, update the intercluster distances, and repeat until we have only one cluster. These clusterings can be represented as a tree. We cut the tree at a level that gives a clustering with which we are satisfied; ideally, one which gives well-separated, coherent clusters. Another approach is Ward's procedure, where we join clusters so as to minimize the within-cluster variance. With hierarchical clustering, poor choices of a join or a split early in the clustering process cannot be corrected by later splits or joins. With single linkage clustering there is a tendency for neighbouring clusters to join into chains. These are often not a natural division of the feature space and the resulting clusters are not compact. Complete linkage tends to create compact groups and seems to perform better than single or average linkage for gene expression data.
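As a rough illustration of these steps (not taken from the chapter), the following Python sketch clusters a small simulated tissue-by-gene matrix with a correlation-based distance and complete linkage using SciPy; the data, the number of clusters and the random seed are all invented for the example.

```python
# Hedged sketch of hierarchical agglomerative clustering of tissue samples with a
# Pearson-correlation distance and complete linkage; the data are simulated.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))      # 20 tissue samples x 500 gene-expression values

# Pairwise correlation distance (1 - Pearson correlation) between samples; on data
# standardized to zero mean and unit variance this is proportional to the squared
# Euclidean distance, as noted in the text.
d = pdist(X, metric="correlation")

# Complete linkage tends to give the compact clusters preferred for expression data.
Z = linkage(d, method="complete")

# "Cut the tree" at a level that yields, say, two clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```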
One of the most popular nonhierarchical clustering algorithms is k-means. It seeks to find k = g clusters that minimize the sum of the squared Euclidean distances between the observations $y_j$ and their cluster means; that is, it seeks to minimize the trace of W, tr W, where

$$W = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij}\,(y_j - \bar{y}_i)(y_j - \bar{y}_i)^T \qquad (1)$$

is the pooled within-cluster sums of squares and products matrix, and

$$\bar{y}_i = \sum_{j=1}^{n} z_{ij} y_j \Big/ \sum_{j=1}^{n} z_{ij} \qquad (2)$$
is the sample mean of the ith cluster. Here $z_{ij}$ is a zero-one indicator variable which takes the value one if $y_j$ belongs to the ith cluster and zero if it does not.

It is impractical to consider all partitions of the observations into g clusters unless n is very small, since the number of such partitions is approximately $g^n / g!$; see Kaufman and Rousseeuw (1990). We usually implement k-means by iteratively moving points between clusters so as to minimize tr W. One method is to assign each observation $y_j$ to the cluster with the nearest centre (sample mean) and then update the cluster centre before repeating the process with the next observation. The initial centre estimates are often a random subsample of k points from the data set to be clustered.
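A minimal sketch of this procedure, assuming scikit-learn is available (the data and the choice g = 3 are invented), is given below; the inertia_ attribute reported by KMeans is exactly tr W of equations (1)-(2).

```python
# Illustrative k-means run: scikit-learn's KMeans iteratively reassigns points and
# updates cluster means so as to reduce tr(W), the pooled within-cluster sum of squares.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))          # 100 entities with p = 5 features (simulated)

g = 3
km = KMeans(n_clusters=g, n_init=10, random_state=0).fit(X)

print("tr(W) =", km.inertia_)          # objective value being minimized
print("cluster sizes:", np.bincount(km.labels_))
```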
A related clustering method is k-medoids (Kaufman and Rousseeuw, 1990), which is similar to k-means but which only allows observations $y_j$ as cluster centres. Another related method is the self-organizing map, which constrains the cluster centres to lie on a smooth two-dimensional surface. Another possibility is to seek to minimize the determinant of W, |W|, rather than its trace (Friedman and Rubin, 1967). This will tend to divide the data into ellipsoidal clusters rather than the spherical clusters that k-means tends to create.
A difficulty with all the heuristic clustering methods described so far is that they have no objective
means of choosing the number of clusters.

Clustering Using Mixture Models


We will be focussing on model-based clustering via mixtures of normal densities. In this approach, each observation vector $y_j$ is assumed to have a g-component normal mixture density

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i \, \phi(y_j; \mu_i, \Sigma_i) \qquad (3)$$

where $\phi(y_j; \mu_i, \Sigma_i)$ denotes the p-variate normal density function with mean $\mu_i$ and covariance matrix $\Sigma_i$, and the $\pi_i$ denote the mixing proportions, which are nonnegative and sum to one. The vector of unknown parameters $\Psi$ consists of the mixing proportions $\pi_i$, the elements of the component means $\mu_i$, and the elements of the component-covariance matrices $\Sigma_i$. This vector can be estimated by its maximum likelihood estimate $\hat{\Psi}$, calculated via the EM algorithm; see McLachlan and Basford (1988) and McLachlan and Peel (2000). This gives a probabilistic clustering defined in terms of the estimated posterior probabilities of component membership $\tau_i(y_j; \hat{\Psi})$, where the latter is the estimated probability that the feature vector with observed value $y_j$ belongs to the ith component of the mixture. An outright clustering can be obtained by assigning $y_j$ to the component to which it has the greatest estimated posterior probability of belonging.
Since clustering methods based on mixture models have an explicit statistical model underpinning them, the clustering obtained is more easily described and interpreted than one not derived from a parametric model. We can perform tests of significance on the results in order to determine whether we have a natural clustering or just a segmentation of the data. We can compare different clusterings (Aitkin et al., 1981; McLachlan, 1987). The clusters can overlap, as we would expect some gene and tissue clusters to do. The assumption of normality for the cluster distributions is a potential limitation. However, if an appropriate normalization is done, this would appear to be a reasonable assumption for microarray data.

We can also use the likelihood ratio $\lambda$ to test for the number of components in the mixture, although regularity conditions do not hold for $-2 \log \lambda$ to have its usual null distribution of chi-squared with degrees of freedom equal to the difference in the number of parameters between the null and alternative hypotheses. For normal mixture models the most useful of the closed-form likelihood-based criteria is the BIC. This is equal to $-2 \log L + d \log n$, where $\log L$ is the maximized log likelihood and d is the number of fitted parameters; we choose the value of g which minimizes this criterion. While this depends on regularity conditions that do not hold, it often works reasonably well in practice. We can also use a resampling approach (McLachlan, 1987).
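The following hedged sketch (not the EMMIX software referred to in this chapter) uses scikit-learn's GaussianMixture, fitted by the EM algorithm, to obtain such a probabilistic clustering on simulated data and to choose g by BIC.

```python
# Model-based clustering with a g-component normal mixture, selecting g by BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 4)),
               rng.normal(3.0, 1.0, size=(40, 4))])      # two synthetic clusters

fits = {g: GaussianMixture(n_components=g, covariance_type="full",
                           random_state=0).fit(X)
        for g in range(1, 5)}
# sklearn's bic() returns -2 log L + d log n, so the preferred g minimizes it.
g_best = min(fits, key=lambda g: fits[g].bic(X))
best = fits[g_best]

tau = best.predict_proba(X)        # estimated posterior probabilities of membership
labels = tau.argmax(axis=1)        # outright clustering by largest posterior probability
print("chosen g =", g_best, "; cluster sizes:", np.bincount(labels))
```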
An advantage of mixture models with elliptically symmetrical distributions such as the normal or
more robust models using the t distribution (McLachlan and Peel, 2000) is that the clustering obtained
is not affected by changes of location, scale, or by rotation.
Proposals have been made for approaches that allow both genes and tissues to be simultaneously
clustered (Pollard and van der Laan, 2002; Getz et al., 2000).

If the number of elements per vector p is large we might not be able to fit the mixture model described above. This is because the component-covariance matrices $\Sigma_i$ are highly parameterized, with $p(p+1)/2$ distinct elements each. Hence if we wish to fit normal mixture models to high-dimensional data we need to use some form of dimension reduction or regularization.
A common method of reducing dimensionality is to perform a principal component analysis and use
only the first few principal components if they account for a large proportion of the variance. However,
the principal components are based on a global model and so they do not necessarily give the directions
in the feature space best for revealing the underlying group structure.
A global nonlinear method of dimension reduction can be obtained by using mixtures of linear submodels, such as with mixtures of factor analysers. This model provides local dimensionality reduction by imposing on the component-covariance matrices the constraint

$$\Sigma_i = B_i B_i^T + D_i \qquad (i = 1, \ldots, g) \qquad (4)$$

where $B_i$ is a $p \times q$ matrix of factor loadings and $D_i$ is a diagonal matrix (i = 1, ..., g). Within the ith component of the mixture, it effectively approximates the correlations between the variables by their linear dependence on a small number q of (unobservable) latent variables (factors).
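A tiny numerical illustration of the constraint (4), with invented dimensions, shows how few numbers are needed to build each component covariance:

```python
# Constraint (4): each component covariance is a p x q loading matrix times its
# transpose plus a diagonal matrix, built here from p*q + p numbers rather than
# the p(p+1)/2 of an unrestricted covariance matrix.
import numpy as np

rng = np.random.default_rng(5)
p, q = 10, 2
B_i = rng.normal(size=(p, q))              # factor loadings for component i
D_i = np.diag(rng.uniform(0.5, 1.5, p))    # diagonal matrix of residual variances

Sigma_i = B_i @ B_i.T + D_i
print(Sigma_i.shape, "built from", p * q + p, "numbers vs", p * (p + 1) // 2)
```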
Model-based methods allow us to perform tests to determine whether the number of clusters g allows a satisfactory division of the data set. If we are using a mixture of factor analysers model, we can also test for the number of factors per cluster required for an adequate representation of the variance structure of the clusters.

For a given number of components g we can make a test for the number of factors using the likelihood ratio $\lambda = L(0)/L(1)$, where $L(1)$ and $L(0)$ are the likelihoods under the alternative and null hypotheses, respectively. Regularity conditions do hold for this test for a fixed number of components. For the null hypothesis $H_0: q = q_0$ versus the alternative $H_1: q = q_0 + 1$, the statistic $-2 \log \lambda$ is asymptotically chi-squared with $d = g(p - q_0)$ degrees of freedom. However, where n is not large relative to the number of unknown parameters, we prefer to use the BIC (Bayesian information criterion). In this context, it means that twice the increase in the log likelihood ($-2 \log \lambda$) has to be greater than $d \log n$ for the null hypothesis to be rejected.
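As a worked example with hypothetical numbers (none of them from the chapter), the degrees of freedom and the two decision thresholds just described can be computed as follows:

```python
# Test for the number of factors: compare -2 log(lambda) with a chi-squared
# quantile on d = g(p - q0) degrees of freedom, or with d log n under the BIC rule.
import numpy as np
from scipy.stats import chi2

g, p, q0, n = 3, 50, 2, 80        # components, variables, null number of factors, sample size
d = g * (p - q0)                  # degrees of freedom for H0: q = q0 vs H1: q = q0 + 1
neg2loglam = 180.0                # hypothetical observed value of -2 log(lambda)

print("chi-squared 5% critical value:", round(chi2.ppf(0.95, d), 1))
print("BIC threshold d log n       :", round(d * np.log(n), 1))
print("reject H0 by LRT?", neg2loglam > chi2.ppf(0.95, d))
print("reject H0 by BIC?", neg2loglam > d * np.log(n))
```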

Data Type
We now proceed to consider the application of the methods described above to the clustering of microarray data. Normalized expression data from microarray or other genetic experiments can be written as a matrix with M columns and N rows. The M columns correspond to the tissue samples and the N rows correspond to the N genes in the experiment. The expression signature of a tissue sample is the vector of the expression levels of the N genes in that sample; the expression profile of a gene is the vector of its M expression levels across the different tissue samples.
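In code, with made-up numbers, this layout and the two kinds of vectors look as follows (the variable names are illustrative only):

```python
# N x M expression matrix: columns are tissue signatures, rows are gene profiles.
import numpy as np

N, M = 6, 4                                    # genes, tissue samples (toy sizes)
expr = np.arange(N * M, dtype=float).reshape(N, M)

signature_tissue_2 = expr[:, 2]                # expression of all N genes in tissue 2
profile_gene_5 = expr[5, :]                    # expression of gene 5 across the M tissues
print(signature_tissue_2, profile_gene_5)
```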

Tissue Clustering
The observation vectors $y_j$ used in tissue clustering are the expression signatures of the tissue samples, where $y_j$ contains the expression levels of the N genes. For hierarchical clustering the distance matrix is quite tractable for tissue data, as it requires only $O(M^2 N)$ calculations and storage of a matrix with $O(M^2)$ values. The number of tissue samples M is usually less than 100; N could be tens of thousands. A problem with these methods is how to choose the number of clusters. Another, especially with single linkage clustering, is that they can put most of the observations into a single large cluster with several singletons. The best performance appears to be from complete linkage (Gibbons and Roth, 2002). We see no reason why we should expect genomic data to form hierarchical groupings.
Nonhierarchical methods such as k-means and self-organizing maps typically do better than hierarchical methods, but do tend to divide the data into spherical clusters even when this is a suboptimal
representation of the data. Model-based methods have recently become a focus of attention (McLachlan
et al., 2002). They offer soundly based tests and can reflect the fuzzy nature of some groups. The chief
problem with applying them directly to clustering of tissues is that they cannot be directly fitted if the
number of genes is large. This is because we have more parameters to be estimated than we have observations to estimate them with. Some form of dimension reduction is necessary, such as with mixtures
of factor analysers as described above.
The EMMIX-GENE program of McLachlan et al. (2002) is designed for the clustering of tissue samples using mixtures of factor analysers. However, since the number of tissue samples is usually quite small relative to the number of genes, it might not be practical to fit mixtures of factor analysers to the tissue samples on the basis of all the genes, as it would involve a considerable amount of computation time.
Thus initially some of the genes may have to be removed. Indeed, the simultaneous use of too many
genes in the cluster analysis may only serve to create noise that masks the effect of a smaller number of
genes. Using different subsets of the genes can lead to different clusterings of the tissues (Pollard and
van der Laan, 2002; Friedman and Meulman, 2004). For example, we might be able to cluster samples
by tissue type (cancerous or healthy) and also create a different clustering for the same tissues based
for example on progression through the cell cycle (Belitskaya-Levy, 2006).
Thus EMMIX-GENE has two optional preliminary steps. The first is to individually screen the genes to eliminate those which are expected to be of little use in clustering. This is done by testing each gene for the hypothesis that it has a single-component t distribution over the tissue samples. After this is
done there may still be too many variables. Hence there is a second step available in which the selected
genes are clustered into groups on the basis of Euclidean distance. This is done by fitting a mixture of
normal distributions with the covariance matrices restricted to being equal to a multiple of the identity
matrix.
We now cluster the tissue samples on the basis of all or some of the genes in a chosen cluster. Alternatively, we can replace the gene clusters by their means and base our clustering on some or all of these
means. The clustering of the genes in this step is more a segmentation for computational convenience
than an attempt to seek an informative natural clustering of the genes.
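The sketch below loosely emulates these preliminary steps on simulated data; it is not the EMMIX-GENE software itself, and ordinary normal mixtures and k-means stand in here for the t mixtures and the spherical normal mixtures used by that program.

```python
# Rough emulation of the two preliminary steps: (1) screen genes by whether a
# two-component fit across the tissues beats a one-component fit, (2) group the
# retained genes on Euclidean distance, then cluster tissues on the group means.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
Y = rng.normal(size=(500, 62))                 # genes x tissue samples (simulated)
Y[:200, :40] += 2.0                            # make 200 genes differential between tissue groups

def keep_gene(y):
    """BIC-based stand-in for the single-component-versus-mixture screen."""
    y = y.reshape(-1, 1)
    bic1 = GaussianMixture(1, random_state=0).fit(y).bic(y)
    bic2 = GaussianMixture(2, random_state=0).fit(y).bic(y)
    return bic2 < bic1

selected = np.array([keep_gene(Y[i]) for i in range(Y.shape[0])])
Ysel = Y[selected]

n_groups = 5                                   # illustrative number of gene groups
gene_groups = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(Ysel)

group_means = np.vstack([Ysel[gene_groups == k].mean(axis=0) for k in range(n_groups)])
tissue_labels = GaussianMixture(2, random_state=0).fit_predict(group_means.T)
print(selected.sum(), "genes retained; tissue cluster sizes:", np.bincount(tissue_labels))
```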

Gene Clustering
The data used in gene clustering are the expression profiles of the genes. While hierarchical methods
can be used for the clustering of gene profiles, they are less computationally tractable than they are for
tissues. This is because the distance matrix is usually much larger, $O(N^2)$, and takes more calculations to create, $O(N^2 M)$, since N is so much larger than M. Difficulties for hierarchical methods with choosing the appropriate number of groups and with chain formation apply to gene clustering as well as to
tissue clustering.


Gene shaving (Hastie et al., 2000) is an example of a nonhierarchical method. This method creates
tight clusters of genes with large variation between nonreplicated samples. Replicated samples need
to be replaced by their means. It does not necessarily put all the genes into clusters, only the ones with
significant differences between samples. The same gene can occur in more than one cluster.
Mixtures of normal distributions are again a useful clustering approach. However, up until now, we have clustered the data under two assumptions, namely:

a. There are no replications on any particular entity.
b. All the observations on the entities are independent of one another.

The difficulty here is that condition (b) will not hold for the clustering of gene profiles, since not all
the genes are independently distributed, and condition (a) will often not hold as there are likely to be
replicates of tissues or multiple measurements over time on a tissue. These could be handled by appropriately specifying the component-covariance matrices, but the models created this way may be hard to
fit. For example, the M-step may not exist in closed form leading to computational difficulties.
Assumptions (a) and (b) above can be relaxed by using mixtures of linear mixed models, as proposed by Ng et al. (2006). They have developed the program EMMIX-WIRE for the fitting of these models. To describe their approach, suppose that in a gene microarray experiment we have for the jth gene a feature vector (the gene profile) $y_j = (y_{1j}^T, \ldots, y_{mj}^T)^T$, where $y_{tj} = (y_{1tj}, \ldots, y_{r_t tj})^T$ (t = 1, ..., m), and where $r_t$ is the number of replicates of the tth distinct tissue. Conditional on its membership of the ith component of the mixture, we assume that the distribution of $y_j$ can be modelled by a linear mixed-effects model:

$$y_j = X\beta_i + U b_{ij} + V c_i + \epsilon_{ij} \qquad (5)$$

In (5), the elements of $\beta_i$ are fixed effects, and $b_{ij}$ (a $q_b$-dimensional vector) and $c_i$ (a $q_c$-dimensional vector) are the unobservable gene- and tissue-specific random effects, respectively. These random effects represent the variation due to the heterogeneity of genes and samples (corresponding to $b_i = (b_{i1}^T, \ldots, b_{in}^T)^T$ and $c_i$, respectively). The random effects $b_{ij}$ and $c_i$ and the measurement error vector $\epsilon_{ij}$ are assumed to be mutually independent, and X, U, and V are the design matrices for the fixed and random effects.

The distributions of $b_{ij}$ and $c_i$ are assumed to be multivariate normal, $N_{q_b}(0, H_i)$ and $N_{q_c}(0, \theta_{ci} I_{q_c})$, respectively, where $H_i$ is a $q_b \times q_b$ covariance matrix and $I_{q_c}$ is the $q_c \times q_c$ identity matrix. The measurement error vector $\epsilon_{ij}$ is also taken to be multivariate normal, $N_M(0, A_i)$, where $A_i$ is a diagonal matrix constructed from the vector $W\xi_i$, where $\xi_i = (\sigma_{i1}^2, \ldots, \sigma_{iq_e}^2)^T$ and W is an $M \times q_e$ zero-one design matrix. That is, we allow the ith component-variance to be different among the M microarray experiments.
The vector of unknown parameters can be obtained by the maximum likelihood approach via the EM algorithm, proceeding conditionally on the tissue-specific random effects $c_i$, as formulated in Ng et al. (2006). The gene-profile vectors $y_j$ are not independently distributed, due to the presence of the cluster random-effects terms $c_i$ in the likelihood function. Hence there is difficulty in calculating the latter. It is not needed in the application of the EM algorithm, but it is needed in an assessment of the number of
components. In our application of BIC for the number of components, we therefore have to work with an approximation to the likelihood.


Examples
As an illustration of the use of the EMMIX-GENE procedure, we applied it to the colon cancer data set of Alon et al. (1999). This consists of n = 2000 genes and p = 62 tissue samples. The samples came from 40 tumours and 22 normal tissues. After select-genes was applied to this set, there were 446 genes remaining in the set. These genes were then clustered into 20 groups. These groups were ranked on the basis of $-2 \log \lambda$, where $\lambda$ is the likelihood ratio statistic for testing g = 1 versus g = 2 components in the mixture model. If the tissues are clustered on the basis of the second-ranked group G2, we have a partition of the tissues in which one cluster contains 37 tumours and 3 normal samples and the other cluster contains 3 tumours and 19 normal samples. This gives an error rate of 6 out of 62 tissues. We have displayed this as a heat map (Figure 1), where the colour of a square indicates the expression level for a particular gene in a specified tissue.
Figure 1. Heat map of gene expressions from data in Alon et al. (1999)

To illustrate the EMMIX-WIRE procedure, we use an example from Ng et al. (2006), who used it to cluster time-course data from the yeast cell-cycle study of Spellman et al. (1998). The data consist of the expression levels of 612 genes at M = 18 time points. The design matrix X used was an $18 \times 2$ matrix with the (l + 1)th row (l = 0, ..., 17) given by

$$\left(\cos\big(2\pi(7l)/\omega + \Phi\big),\ \sin\big(2\pi(7l)/\omega + \Phi\big)\right)$$


where the period $\omega$ of the cell cycle was taken to be 53 and the phase offset $\Phi$ was set to zero. The design matrices for the random-effects parts were specified as $U = 1_{18}$ and $V = I_{18}$. That is, it was assumed there were random gene effects $b_{ij}$ with $q_b = 1$ and cluster random effects $c_{i1}, \ldots, c_{iq_c}$ with $q_c = m = 18$. The cluster random effects reflect possible dependence among expression levels within the same cluster at the same time. In the specification of the error matrix, we took $W = 1_{18}$ and $\xi_i = \sigma_i^2$ ($q_e = 1$), so that the component variances are common among the m = 18 experiments. The number of components g was determined by using BIC for model selection. It indicated that there were twelve clusters. For each cluster we have plotted the expression level against time for its constituent genes (Figure 2).
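Following the description above (this is an illustrative reconstruction, not the EMMIX-WIRE code), the design matrices for this example can be written down directly:

```python
# Design matrices for the yeast cell-cycle example: X has rows
# (cos(2*pi*7l/omega + Phi), sin(2*pi*7l/omega + Phi)), U = 1_18, V = I_18, W = 1_18.
import numpy as np

m = 18                         # time points, taken every 7 minutes
omega, Phi = 53.0, 0.0         # assumed cell-cycle period and phase offset
l = np.arange(m)

X = np.column_stack([np.cos(2 * np.pi * 7 * l / omega + Phi),
                     np.sin(2 * np.pi * 7 * l / omega + Phi)])   # 18 x 2 fixed effects
U = np.ones((m, 1))            # one gene-specific random effect per gene (q_b = 1)
V = np.eye(m)                  # one cluster random effect per time point (q_c = m)
W = np.ones((m, 1))            # common component variance across experiments (q_e = 1)
print(X.shape, U.shape, V.shape, W.shape)
```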

Future Trends
With the ever increasing emergence of high-throughput data in biology, there will be increasing use of
cluster analysis to reduce the dimensionality of such data to aid in the biological inferences to be drawn.
Clustering methods will also be used to study specific problems in genomics, such as the analysis of
time-course experiments.

Figure 2. Gene-expression level against time for 12 clusters of genes


Characterizing the dynamic regulation of gene expression over time is becoming more and more
important. Standard methods of cluster analysis are not really applicable to such data as they would
give the same clusterings regardless of the order in which the expressions have been observed. Also,
they assume the data (the gene profiles) to be clustered are independently distributed. Hence increasing
attention is being given to model-based methods, such as mixtures of linear mixed models, that allow
the expression levels in the same gene profile to be correlated. They also allow for nonzero correlations
between gene profiles in the same cluster (Ng et al., 2006).
Model-based methods such as mixtures of normal distributions have already been widely adopted
in the clustering of the tissue samples on the basis of the genes. Given that the latter can be present in
very large numbers, attention has been given to variable (gene) selection methods in order for the normal
components in the mixture model to be fitted without restrictions such as those that take the component
covariances to be diagonal (McLachlan et al., 2002).

Conclusion
As more high throughput data sets become available, cluster analysis will become more important as
part of their analysis. The most commonly used clustering methods have been hierarchical algorithms,
but these have problems when they are used to cluster gene-array data. There is no reason to expect
clusters of genes or tissues to form a hierarchy and the overlapping nature of likely groupings can lead
to poor performance from hierarchical clustering algorithms. Also, these methods provide no criterion
for determining the number of clusters.
Nonhierarchical heuristic methods do not force a hierarchy on the clustering and can correct a misclassification made at an early stage in the algorithm. However, they often impose other restrictions on
the clusters found, such as a tendency towards sphericity. Again they do not provide criteria to decide
how many clusters the observations should be partitioned amongst.
Mixture model-based clustering allows tests of the number of clusters and can reflect differences in dispersion and orientation between clusters. Used in conjunction with some form of variable selection, mixtures of factor analysers provide a good method for clustering tissues. When clustering genes, correlations between genes and samples should be taken into account. This can be done by fitting mixtures of linear mixed models.
The authors wish to thank Dr. Ian Wood for his comments on this chapter.

References
Aitkin, M., Anderson, D., & Hinde, J. (1981). Statistical modelling of data on teaching styles (with
discussion). Journal of the Royal Statistical Society A, 144, 419-461.
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., & Levine, A.J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences USA, 96, 6745-6750.


Belitskaya-Levy, I. (2006). A generalized clustering problem, with application to DNA microarrays. Statistical Applications in Genetics and Molecular Biology, 5, Article 2.
Coleman, D., Dong, X.P., Hardin, J., Rocke, D.M., & Woodruff, D.L. (1999). Some computational issues in cluster analysis with no a priori metric. Computational Statistics and Data Analysis, 31, 1-11.
Eisen, M.B., Spellman, P.T., Brown, P.O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA, 95, 14863-14868.
Everitt, B.S. (1993). Cluster analysis, 3rd edition. London: Edward Arnold.
Friedman, H.P., & Rubin, J. (1967). On some invariant criteria for grouping data. Journal of the American
Statistical Association, 62, 1159-1178.
Friedman, J.H., & Meulman, T.J. (2004). Clustering objects on subsets of attributes (with discussion).
Journal of the Royal Statistical Society B, 66, 815-849.
Getz, G., Levine, E., & Domany, E. (2000). Coupled two-way clustering analysis of gene microarray
data. Cell Biology, 97, 12079-12084.
Gibbons, F.D., & Roth, F.P. (2002). Judging the quality of gene expression-based clustering methods
using gene annotation. Genome Research, 12, 1574-1581.
Hartigan, J.A. (1975). Clustering algorithms. New York: Wiley.
Hastie, T., Tibshirani, R., Eisen, M. B., Alizadeh, A., Levy, R., Staudt, L., Chan, W.C., & Botstein, D.
(2000). Gene shaving as a method for identifying distinct sets of genes with similar expression patterns.
Genome Biology, 1, research 003.1-003.21.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York:
Springer-Verlag.
Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: An introduction to cluster analysis.
New York: Wiley.
McLachlan, G.J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components
in a normal mixture. Applied Statistics, 36, 318-324.
McLachlan, G.J., & Basford, K.E. (1988). Mixture models: Inference and applications to clustering.
New York: Marcel Dekker.
McLachlan, G.J., Bean, R.W., & Peel, D. (2002). A mixture model-based approach to the clustering of
microarray expression data. Bioinformatics, 18, 413-422.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Ng, S.K., McLachlan, G.J., Wang, K., Ben-Tovim Jones, L., & Ng, S-W. (2006). A mixture model with random-effects components for clustering correlated gene-expression profiles. Bioinformatics, 22, 1745-1752.
Pollard, K.S., & van der Laan, M.J. (2002). Statistical inference for simultaneous clustering of gene
expression data. Mathematical Biosciences, 176, 99-121.


Spellman, P., Sherlock, G., Zhang, M.Q., Iyer, V.I., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., & Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9, 3273-3297.

Key Terms
EM Algorithm: A method for calculating maximum likelihood estimates of parameters in statistical
models in situations where the observed data can be usefully viewed as being incomplete. It proceeds
by consideration of the complete data log likelihood, which is formed on the basis of the complete data.
The latter comprises the observed data and the missing data. It is implemented iteratively by alternating two steps known as the expectation (E) step and the maximization (M) step. On the E-step the
Q-function is calculated by averaging the complete data log likelihood over the conditional distribution
of the complete data given the observed data, using the current value for the parameter vector. This is
followed by the M-step in which the current estimate of the parameter vector is updated to that value
which globally maximizes the Q-function.
Euclidean Distance: A measure of distance between two points. It is the square root of the sum of
the squares of the differences between the coordinates of the two points.
Factor Analysis: Factor analysis is a statistical technique in which the correlation between the variables
is approximated by the linear dependence of the latter on a set of unobservable (latent) variables.
Heuristic: An empirical method of solving a problem which does not necessarily reflect the underlying nature of the problem.
Likelihood: The likelihood function is found by evaluating the joint density of the random variables
in the model defining the random phenomenon under study at their observed values.
Maximum Likelihood: The maximum likelihood estimate of a parameter is obtained by consideration of the likelihood (equivalently the log likelihood) function. Typically in the case of a bounded
likelihood function it is given by the values of the parameters which globally maximize the likelihood
function.
Microarray: A slide which contains a grid consisting of a large number of microscopic spots of different DNAs, each of which will hybridize with a particular target RNA or DNA sequence. The target
RNA or DNA is generally attached to a fluorescent marker. When the target RNA or DNA binds to
the complementary DNA on the slide this binds the fluorescent marker to the slide. The measurements
taken are of the intensity of the fluorescence from these markers.
Multivariate: In a multivariate problem, one has more than one response variable.
Principal Components: The principal components of a data set are the projections of the data vectors onto new coordinate axes that result from a rotation of the centered data set. This rotation is done
in such a way that the first principal component (the projection onto the first coordinate axis) has the
largest possible variance, the second principal component has the next largest, and so on.


Chapter XII

Uncovering Fine Structure in Gene Expression Profile by Maximum Entropy Modeling of cDNA Microarray Images and Kernel Density Methods
George Sakellaropoulos
University of Patras, Greece
Antonis Daskalakis
University of Patras, Greece
George Nikiforidis
University of Patras, Greece
Christos Argyropoulos
University of Pittsburgh Medical Center, USA

Abstract
The presentation and interpretation of microarray-based genome-wide gene expression profiles as complex biological entities are considered to be problematic due to their featureless, dense nature. Furthermore, microarray images are characterized by significant background noise, but the effects of the latter on the holistic interpretation of gene expression profiles remain under-explored. We hypothesize that
a framework combining (a) Bayesian methodology for background adjustment in microarray images
with (b) model-free modeling tools, may serve the dual purpose of data and model reduction, exposing
hitherto hidden features of gene expression profiles. Within the proposed framework, microarray image
restoration and noise adjustment is facilitated by a class of prior Maximum Entropy distributions. The
resulting gene expression profiles are non-parametrically modeled by kernel density methods, which not
only normalize the data, but facilitate the generation of reduced mathematical descriptions of biological
variability as mixture models.

INTRODUCTION
The advent of complementary DNA (cDNA) microarray technologies enabled the simultaneous and
specific assessment of the expression levels of thousands of genes (Southern, Mir, & Shchepinov, 1999).
The conventional approach to analyze such datasets is to explore quantitative co-expression relations
across a variety of experimental conditions prior to invoking putative similarities in gene regulation
or function (DeRisi, Iyer, & Brown, 1997; Eisen, Spellman, Brown, & Botstein, 1998). The alternative
viewpoint considers gene expression profiles from specific conditions to be informative of distinct molecular signatures that characterize cellular states. Such genome-wide transcriptional signatures have
been used to distinguish normal from abnormal samples in benign developmental conditions (Barnes
et al., 2005), solid tumors and hematologic malignancies (Febbo et al., 2005; Valentini, 2002) and differentiate distinct disease states of renal allografts (Sarwal et al., 2003). It has been suggested that the
thousands of expression values in a microarray experiment are too dense and irregular to be directly
interpreted in a holistic manner and that alternative transformations of the normalized gene profiles should
be sought after (Guo, Eichler, Feng, Ingber, & Huang, 2006). Nevertheless one could justifiably argue
that the irregularity of the gene profiles is due to incomplete modeling and adjustment for the presence
of measurement noise. However this alternative hypothesis has not been adequately addressed in the
current literature. These considerations underline the impetus for the present work, which aims to:
1. Establish the role of microarray image background in the irregularity and featureless appearance of gene expression profiles (GEP) from individual experimental states.
2. Propose a data and model reduction framework for the analysis of GEP consisting of:
   a. A probabilistic Bayesian algorithm for background adjustment of microarray images based on Maximum Entropy distributions.
   b. Non-parametric kernel density estimation methods for the mathematical representation and exploration of the resultant gene expression profiles.

BACKGROUND

The basic microarray procedure involves hybridization of complementary nucleic acid molecules, one of
which (target) has been immobilized in a solid substrate (e.g. glass) using a robotically controlled device
(arrayer). Such targets form spots at the vertices of a rectangular lattice on the solid substrate surface;
each spot then serves as a highly specific and sensitive detector of the corresponding gene. Technical
factors operating at different stages of the microarray pipeline endow the final microarray image with an
uneven and non-negligible background. Background correction has been considered a necessary step
in the microarray pipeline (L. Qin, Rueda, Ali, & Ngom, 2005), which may in fact have a significantly
higher bearing on the final results than normalization. This viewpoint is supported by a number of empirical studies showing that conventional approaches to background correction may lead to nonsensical
results (e.g. negative gene expression measures) and hinder the ability to detect differential gene expression (L. X. Qin & Kerr, 2004). In spite of the nodal role of background adjustment methods upon the
quantification of single gene expression, the effects of de-noising upon the totality of the gene profile
remain unexplored. A notable exception is the study by Kooperberg, Fazzio, Delrow, and Tsukiyama (2002); the authors argued that incomplete background adjustment is expected to predominantly affect


the quantification of weakly expressed genes, leading to a reduction in the dynamic range of the GEP.
This in turn will result in a compression of the left tail of the GEP, which could very well account for
the featureless nature of the latter in that signal range. Consequently, proper noise reduction and dynamic
range expansion could uncover hidden structure in GEPs and aid the discovery of hitherto unknown
elements of transcriptional regulation. In order to facilitate such a discovery process though, it is imperative to supplement the noise reduction algorithms with mathematical model reduction methodologies
so as to generate a discrete number of experimentally testable hypotheses about the biological system
under investigation. Before addressing this mathematical modeling problem though it is important to
properly define the meaning of the phrase "Gene Expression Profile". The definition we adopt herein is an operational one, i.e., we identify the GEP as the output of any step in the microarray pipeline (Figure
1) after the image operations stage. Consequently the GEP is a set of possibly noisy measurements, one
for each target printed on the surface of the microarray. Inference about the members of the set, which
constitutes the gene expression profile, proceeds in different stages that involve the normalization of
the measurements prior to exploratory and formal statistical/machine learning methods.
One challenging aspect of the process is how to best represent the GEPs emanating from replicate
experiments in a manner that captures the global features of the system under study but also preserves the
specific expression pattern within each sample. The issue is further complicated by the fact that a large
number of experimental noise sources remain poorly defined or are very difficult to model explicitly.

Figure 1. Microarray pipeline starting from individual cell population to machine learning and inference
[Figure 1 depicts the stages of the pipeline: cell sample; mRNA isolation; labeling (P32, fluorescent dyes, chromogenic dyes); hybridization; image acquisition (CCD cameras, confocal scanners, PhosphoImager); image operations (de-noising, gridding, spot segmentation); normalisation of measurements; and exploratory data analysis, machine learning, statistical models, and clustering.]


Conventional approaches usually address the effects of residual noise by compressing the GEP from individual samples into a single numerical summary (Guo, Eichler, Feng, Ingber, & Huang, 2006), often after a normalization procedure such as lowess has reduced the variability of the dataset in a more or less ad hoc fashion (L. X. Qin & Kerr, 2004). It is important to note that both approaches amount to a considerable degree of data falsification, since they operate by discarding possibly relevant information immanent in the complex, higher-order genome-wide expression pattern (Guo, Eichler, Feng, Ingber, & Huang, 2006). On the other hand, verbatim use of the GEP is likely to lead to erroneous inferences, as sources of noise other than the background of microarray images are not taken into account. Even if it were possible to measure the GEP by a perfect noiseless experiment, intrinsic biological variability would carry over to the final results and manifest as jitter in a, possibly multidimensional, representation of the latter. It follows that unless one can faithfully control or model all sources of variability (experimental and biological), the best one can hope for is a probabilistic representation based on the gene expression measures obtained in each experiment. However, unless one specifies the space that such probabilistic descriptions occupy and the objects that they concern, one has an undefined problem with multiple possible and even conflicting solutions.
We postulate that a data-driven, bottom-up methodology abstracting the most general features of microarray technology and genome-wide transcript abundance quantification is the optimal approach to microarray data and model reduction. The main focus of the present work is the presentation of a bottom-up, bi-partite quantitative inference approach to address the issues identified in the previous paragraphs. The two pillars of the proposed methodology are a) a probabilistic methodology that utilizes Bayesian methods with the Maximum Entropy prior (Jaynes, 2003), and b) function approximation methods in general and non-parametric kernel density estimation algorithms in particular (Silverman, 1986). The two elements of this framework are used to handle microarray image noise (de-noising) and to model residual variability in the global expression profiles, respectively. The operational characteristics of this approach are illustrated by applying it to a publicly available microarray dataset (DeRisi, Iyer, & Brown, 1997). We demonstrate that when used in combination, these two different methodologies may facilitate the detection of weak, albeit reproducible, signals, decrease the variability of measures of gene expression, and facilitate the detection of global features in the corresponding profiles.

AN OVERVIEW OF BAYESIAN AND MAXIMUM ENTROPY METHODS


From an epistemological viewpoint, noise reduction in microarray experiments is a problem of quantitative inductive inference that involves the update from a prior probability distribution to posterior
probability distribution when new information becomes available. This approach is contingent upon
the interpretation of the relevant probabilities in a Bayesian context i.e. as real valued measures of the
degree of belief in the truth of a proposition. Two methods to effect the updating of probabilities in a
consistent manner have been mathematically established: Bayes theorem (Cox, 1946) and Edwin T.
Jaynes method of Maximum Entropy (Jaynes, 2003). The choice between the two methods is dictated by
the nature of the information at hand (Caticha & Preuss, 2004). When the available information is given
by conditional probability distribution between observable quantities and unobserved parameters we
should use the Bayes theorem. If however the information at hand concerns expectations about aspects
of the general case, the method of Maximum Entropy (MaxEnt) is applicable. The two methods complement each other as evident from the general process of quantitative (Bayesian) inductive inference. In

224

Uncovering Fine Structure in Gene Expression Profile

order to carry out the latter and reach plausible conclusions in the presence of uncertainty, one ought to proceed in three discrete steps:

1. Clearly state what the models are, along with all the background information and data.
2. Assign prior (pre-data) probabilities to the different models-hypotheses investigated.
3. Use probability calculus in order to arrive at numerical values for the posterior probability of the hypotheses in light of the available data (inference process).

The tool for updating one's beliefs about the plausibility of a hypothesis (H) given available evidence (E) and background information (i.e. context I) is given by Bayes theorem:

P(H \mid E, I) = \frac{P(H \mid I)\, P(E \mid H, I)}{P(E \mid I)} \qquad (1)

The left-hand term, P(H|E,I), is called the posterior probability, and it gives the probability of the hypothesis H after considering the effect of evidence E in context I. The term P(H|I) is just the prior probability of H given I alone; that is, the belief in H before the evidence E is considered. The term P(E|H,I) is called the likelihood, and it gives the probability of the evidence assuming the hypothesis H and background information I are true. The denominator is independent of H, and can be regarded as a normalizing or scaling constant. The information I is a conjunction of all of the other statements (background knowledge) relevant to determining P(H|I) and P(E|H,I). Within such a framework, Jaynes' maximum entropy method corresponds to a formal deterministic, variational algorithm that transforms pre-data constraints into prior probability assignments. The resulting distributions are least informative or objective ones in the sense that they are most compatible with the pre-data constraints, while being maximally noncommittal about the missing information. The maximum entropy algorithm proceeds by maximization of the entropy functional H[p] of the prior distribution subject to q pre-data constraints on the numerical values of the expectations of functions f_k (i.e. ⟨f_k⟩ = F_k) that depend on the hypothesis H; each such constraint contributes a Lagrange parameter λ_k to the final solution, thus fixing the functional form of the prior distribution. If the hypothesis H concerns the numerical value of a random variable x taking discrete values {x1, x2, ..., xn}, the MaxEnt algorithm generates the following Maximum Entropy prior distribution:

\max H[p] = -\sum_{i=1}^{n} p_i \log(p_i) \;\;\Rightarrow\;\; p_i = \frac{1}{Z} \exp\!\Big(-\sum_{j=1}^{q} \lambda_j f_j(x_i)\Big), \qquad Z = \sum_{i=1}^{n} \exp\!\Big(-\sum_{j=1}^{q} \lambda_j f_j(x_i)\Big) \qquad (2)

\langle f_k \rangle = -\frac{\partial \log Z}{\partial \lambda_k}, \qquad \langle f_k \rangle = \sum_{i=1}^{n} f_k(x_i)\, p(x_i) = F_k \qquad (3)

Entropy maximization is a convex optimization problem, and thus the distribution given in Equation 2 is unique. Conceptually, the algorithm identifies the distribution that has the maximum uncertainty subject to the constraints, in a manner that avoids bias in the process. This can be traced to a mathematical property of MaxEnt distributions, i.e. that they assign positive weight to every situation that is not excluded by the given information. Hence, by utilizing such distributions to encode the pre-data constraints, we are guaranteed not to miss any potential solutions that are compatible with the former.
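To make the MaxEnt recipe of Equations 2-3 concrete, the following minimal Python sketch (not part of the original chapter; all function and variable names are illustrative) computes the Maximum Entropy distribution on a finite discrete support {0, ..., M} under a single mean constraint by solving for the Lagrange multiplier numerically.

```python
import numpy as np
from scipy.optimize import brentq

def maxent_prior(M, mean_constraint):
    """MaxEnt distribution on {0, ..., M} subject to <x> = mean_constraint (Eqs. 2-3)."""
    x = np.arange(M + 1)

    def expected_value(lam):
        # p_i proportional to exp(-lam * x_i); Z normalizes (Eq. 2)
        w = np.exp(-lam * x)
        p = w / w.sum()
        return np.sum(x * p)

    # Solve <x>(lambda) = mean_constraint (Eq. 3) by bracketing the root.
    # Assumes mean_constraint < M/2 (true for a weak background), so lambda > 0.
    lam = brentq(lambda l: expected_value(l) - mean_constraint, 1e-12, 50.0)
    w = np.exp(-lam * x)
    return lam, w / w.sum()

if __name__ == "__main__":
    lam, p = maxent_prior(M=65535, mean_constraint=200.0)  # e.g. a 16-bit scale, mean 200
    print(lam, p[:5].round(6))
```

With a single mean constraint the update reduces to one-dimensional root finding; with q constraints one would instead solve the corresponding q-dimensional convex problem of Equation 3.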


A PROBABILISTIC DE-NOISING ALGORITHM FOR MICROARRAY IMAGES

As a stepping stone for the proposed algorithm, we invoke a very simple additive noise model. This model notes that in any given finite area of the microarray surface a) both specific and non-specific hybridization sources of fluorescence may exist, b) fluorescent signals combine additively, and c) there are no quenchers in the hybridization (i.e. the noise is positive). Hence, the true signal intensity of a pixel in a microarray image (S) is related to the background noise (B) and the observed pixel value (D) through the relation:

D = S + B, \qquad S, B \geq 0
The Maximum Entropy algorithm takes into account pre-data constraints about the raw microarray
images in a formal manner in order to derive the distribution of the background generation process (B)
and the signal (S). Subsequently these distributions are utilized in Bayes theorem to derive the posterior
distribution for S; this process is repeated for all pixels in the microarray image, leading to a restored
version which is then subjected to further processing (spot identification, normalization etc). From an
operational standpoint, the proposed algorithm takes into account what is known about the microarray
pipeline (general case), as well as the data observed in a given experiment (particular case) to render
a denoised version of the microarray image.
The constraints that are built into the algorithm stem from the discrete nature of the microarray measurement scale, as well as from the different scales of specific vs. non-specific and cross-hybridization phenomena on microarrays. Qualitatively, these constraints correspond to the following discrete pre-data pieces of information:
1. Each microarray image can be segmented into two general classes of pixels, i.e. signal (S) and background (B), generated by independent, distinct processes. This is nothing more than a consequence of the fact that the arrayed spots occupy only a fraction of the microarray surface; the corresponding pixels are classified as signal and all other ones are designated as background. For the S class the process is one of specific hybridization between labeled mRNA species and spotted target molecules; for the background pixels, non-specific hybridization, non-cDNA fluorescent molecules, lateral diffusion of labeled particles, dust, etc. are responsible.
2. The expected intensity value of the background process is small compared to the dynamic range of the microarray image.
3. The intensity values for both signal and background are expressed in a discrete, bounded, positive, arbitrary scale with M+1 elements, e.g. M = 65535 for a 16-bit image.

In spite of their crudeness, these pre-experimental facts are surprisingly powerful and generate useful analytic forms for the prior probability density and likelihood functions of a microarray experiment. For example, due to the independence property, one can factorize the probability of observing a particular image (Im) as a product density, that is:
P(\mathrm{Im}) = P(S, B) = P(S)\, P(B) \qquad (4)

In the absence of any specific information regarding S, the maximum entropy assignment for S
considers all possible values equally likely leading to the uniform prior, that is:
P(S) = \frac{1}{M + 1} \qquad (5)

The analytic form of the maximum entropy background distribution P_{ME}(B|m) can be established on the basis of the general MaxEnt algorithm by setting q = 1, f_1(x) = x, F_1 = m in the system of equations 2-3. Stated in other terms, the conditional distribution of the background is simply the discrete (maximum entropy) density obeying the two constraints of a given expectation and proper normalization (to unity). The resulting distribution is a function of a single Lagrange multiplier λ given by:
m = -\frac{d}{d\lambda} \log\!\Big(\sum_{b=0}^{M} e^{-\lambda b}\Big) = \frac{1}{e^{\lambda} - 1} - \frac{M + 1}{e^{\lambda (M+1)} - 1}, \qquad \lambda \approx \log(1 + 1/m) \qquad (6)

The approximate relation between m and λ in the latter equation is a consequence of the second constraint discussed previously. As long as m ≤ M/5, the error between the approximate value log(1 + 1/m) and the exact value for λ obtained by numerically solving Equation 6 is less than 4%. Furthermore, by the change of variables 1 + 1/m → θ(m), P_{ME}(B|m) is identified as a member of the family of modified power series distributions (MPSD) (Johnson, Kotz, & Kemp, 1992):

P_{ME}(B \mid m) = \mathrm{MPSD}(B \mid M, \theta(m)) = \frac{\theta(m)^{-B}}{\sum_{x=0}^{M} \theta(m)^{-x}}, \qquad B = 0, 1, \ldots, M \qquad (7)

With the prior distributions for signal and background now established (equations 5-7), straightforward application of the Bayes theorem leads to the posterior distribution for S:
P(S \mid D, m) = \frac{P(D \mid S, m)\, P(S \mid m)}{\sum_{s=0}^{D} P(D \mid s, m)\, P(s \mid m)} = \frac{\mathrm{MPSD}(D - S \mid M, \theta(m))}{\sum_{s=0}^{D} \mathrm{MPSD}(D - s \mid M, \theta(m))} = \frac{\theta^{*}(m)^{-S}}{\sum_{s=0}^{D} \theta^{*}(m)^{-s}}, \qquad \theta^{*}(m) = \frac{1}{\theta(m)} \qquad (8)

Derivation of Equation (8) makes explicit use of the positivity constraint on the values of D and S, and of the independence of S from B and thus from m. Comparison of equations 7 and 8 shows that the posterior distribution for S is also a modified power series one, albeit with a different parameter θ* and support {0, ..., D}. In order to restore a microarray image using this distribution, one has to select a numerical summary from the latter. The posterior conditional expectation is such a summary, and it can be recovered in a closed and particularly simple form:
E[S \mid D, m] = \sum_{S=0}^{D} S\, P(S \mid D, m) = D - m + \frac{D + 1}{\theta(m)^{D+1} - 1} \qquad (9)

One desirable feature of MPSD-based models is that they are amenable to numerically tractable, and even analytic, procedures for estimation and learning. Specifically, in order to estimate m from a sample of noisy pixels B = {B1, B2, ..., Bn}, one has to solve the likelihood equation:
\mu(\hat{m}) = \bar{B} \qquad (10)


The aforementioned equation is a functional relationship among the support of the MPSD (M), its first moment (μ), the value of the maximum likelihood estimator (m̂) and the sample mean (B̄). The value of m̂ is in general close to B̄, which could thus be used (as we do here) as a convenient first-order approximation. For example, the relative error in approximating B̄ with m̂ is less than 3.5% when B̄ is ~20-25% of M and drops to 0.0002% for B̄ = M/10.
Obtaining such a sample from the microarray image is facilitated by the deployment of standard clustering algorithms (e.g. Expectation Maximization) that classify image pixels into two classes on the basis of their intensity values. The class with the lower average intensity then corresponds to the background, and its sample mean is used to estimate the unobserved background intensity m. Having classified each pixel into the signal or background class, one then uses Equation 9 to restore the value of the former, while setting the value of the latter to zero. The end result is a de-noised microarray image which enters the microarray pipeline without any further background adjustment in order to generate a gene expression profile (Figure 1). It is precisely at this point that further open issues of representation and processing arise, which we now seek to formally address.
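The following Python sketch puts the pieces of this section together. It is an illustration under stated simplifications rather than the authors' implementation: a two-class k-means split stands in for the Expectation Maximization clustering, the approximation θ(m) ≈ 1 + 1/m from Equation 6 is used directly, and all names are hypothetical.

```python
import numpy as np

def restore_microarray_image(img, n_iter=20):
    """Restore a microarray image via the posterior mean of Eq. 9 (illustrative sketch).

    img: 2-D array of non-negative pixel intensities (e.g. a 16-bit scan).
    Background pixels are set to zero; signal pixels are replaced by E[S | D, m].
    """
    pixels = img.astype(float).ravel()

    # Two-class 1-D k-means as a simple stand-in for the EM clustering mentioned
    # in the text: split pixels into background (low) and signal (high) classes.
    c_bg, c_sig = np.percentile(pixels, [25, 75])
    is_signal = pixels > (c_bg + c_sig) / 2.0
    for _ in range(n_iter):
        if is_signal.any() and (~is_signal).any():
            c_bg, c_sig = pixels[~is_signal].mean(), pixels[is_signal].mean()
        is_signal = np.abs(pixels - c_sig) < np.abs(pixels - c_bg)

    background = pixels[~is_signal] if (~is_signal).any() else pixels
    m = max(background.mean(), 1e-9)   # background sample mean estimates m (Eq. 10)
    theta = 1.0 + 1.0 / m              # approximation lambda ~ log(1 + 1/m) of Eq. 6

    D = pixels[is_signal]
    restored = D - m + (D + 1.0) / (np.power(theta, D + 1.0) - 1.0)   # Eq. 9

    out = np.zeros_like(pixels)
    out[is_signal] = np.clip(restored, 0.0, None)
    return out.reshape(img.shape)
```

For bright pixels the correction term of Equation 9 vanishes and the restored value tends to D − m, i.e. the observed intensity minus the expected background, which matches the intuition behind the additive model.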

THE RATIONALE OF GENE EXPRESSION PROFILE REPRESENTATION BY FUNCTION APPROXIMATION METHODS
Gene activity modeling in systems biology ultimately depends on a quantitative description of the amount
of transcripts in particular cellular states (Gibson & Jehoshua, 2001). Since cellular identity, judged by
both form and function, is determined by biophysical processes that involve bio-molecules interacting
in specific stoichiometries, it follows that control of the latter within certain limits is paramount if cellular phenotype is to be preserved. Biological and experimental variability may result in failure to detect
the expression of individual genes, but global features such as the relative ordering of gene expression
measures should be preserved for the majority of the assayed genes. Without detailed physicochemical
modeling one is forced to treat the measurement scale of microarray experiments as a relative one; normalization and representation of replicated profiles of individual states should be limited to a sliding scale
approach that seeks to align replicated profiles against each other. To do otherwise would be equivalent
to introducing particularly strong hypotheses about the behaviour of the system under examination
which may neither be supported by biology nor warranted in light of the available data.
In mathematical terms the task is to find a description for the multivariate global expression profile G of an experiment that has quantified the abundance of n genes, i.e. G = {g_1, g_2, ..., g_n}, of a particular cellular state. Each individual gene expression measure g_j is restricted to lie in the interval [0, M] that is fixed by the microarray apparatus used in the experiment. Since the gene expression measures reported by the image segmentation software are usually moment statistics of a large number of pixels, the discrete nature of the microarray measurement scale is blunted at this stage. Consequently one could treat each g_j, or better yet log(g_j), as a real-valued entity; such treatment is facilitated by the positivity constraints enforced by the Bayesian/MaxEnt background adjustment algorithm, which ensures that there are no genes with negative expression values.
The finite, bounded measurement scale of microarray experiments then implies that replicate profiling experiments of the same cellular state C = {G_1, G_2, ..., G_k} cannot be naively characterized by their coordinates in an n-dimensional space spanned by the log-expression measures of individual genes. Firstly, note that as a result of gene silencing, random target failures and/or insufficient signal amplification,
there will be a finite number of genes whose expression is below the detection limit of the experimental apparatus. Such genes will have g_j values equal to zero and are thus censored from subsequent analyses carried out in log-space. We remark that a similar effect will also be observed if one uses microarrays with different target complements arrayed on their surface. Even if one profiles the same cellular state, the fact that certain targets are missing from one or more array platforms means that the corresponding genes will have a gene expression measure of zero. Secondly, one cannot a priori rule out the possibility that the expression of certain genes will saturate the dynamic range of the apparatus. By a simple re-ordering of the discrete increments in the digital intensity scale of microarray images, the expression values of the latter genes can be set to zero in this new scale. Therefore symmetry considerations imply that such saturated genes should also be disregarded from further analytic and inferential steps. As a result of all these factors, the set of vectors {G_1, G_2, ..., G_k} that correspond to replicated experiments of the same cellular state will be of different dimensionality, and thus are not directly comparable. Even though it is relatively easy to handle the issue of unequal dimensionalities, i.e. by introducing special symbols for the log-expression of zero and M, this solution is not without its own unique problems, since the resultant space will be a disconnected one, in need of considerably complicated analytic machinery for further processing.
A way out of the conundrum of the previous paragraph is to completely forego multi-dimensional descriptions and treat each non-zero/non-saturated observation interchangeably. Rather than interpreting each experiment as a point in a multi-dimensional space, one interprets the non-censored observations g_{1,i}, g_{2,i}, ..., g_{n_i,i} in the i-th gene expression profile as repeated measures of a univariate variable G_i. We should point out that this interpretation is strictly epistemological, expressing the modeler's perception, rather than ontological in nature. The latter would imply, or, even worse, assert, the physical existence of a global transcriptional control mechanism that assigns genes to particular expression levels, which constitutes a particularly strong and possibly unfounded statement about the biology of gene regulation. By limiting ourselves to an epistemological interpretation of the gene expression profile, though, we are free to handle G_i in any fashion that we deem admissible. For example, one could view G_i as a function to be approximated by almost any statistical learning algorithm, e.g., neural networks, support vector machines, parametric mixture models, Self-Organizing Maps, or non-parametric kernel density estimators. To illustrate the potential of function approximation techniques for such a purpose, we opted to use the latter model-free technique as an instrument for data and model reduction of gene expression profiles.

DATA AND MODEL REDUCTION OF GENE EXPRESSION PROFILES BY NON-PARAMETRIC KERNEL DENSITY ESTIMATION METHODS
Non-parametric kernel density estimation methods (also known as Parzen windows) are model-free techniques for the estimation of an empiric distribution from experimental data. Formally such estimators smooth out the contribution of each observed data point over a local neighborhood. In order to apply this estimator to gene expression profile analysis, we assume that the data are of the form discussed in the previous section, i.e. g_{1,i}, g_{2,i}, ..., g_{n_i,i}: g_{j,i} ~ G_i. Stated in other terms, each gene expression profile is modeled as an independent, identically distributed finite random sample from a random variable G_i. The kernel density approximation \hat{f}_{c_i}(g) to the unknown distribution f(g) of the latter variable at each point g is given by the following equation:


\hat{f}_{c_i}(g) = \hat{f}_{c_i}(G_i = g) = \frac{1}{n_i c_i} \sum_{j=1}^{n_i} K\!\left(\frac{g - g_{j,i}}{c_i}\right) \qquad (11)

In the latter expression, the function K is known as the kernel and is specified by the analyst, while the tunable parameter c is the bandwidth of the estimator and is learned from the actual data. The kernel is usually taken to be a Gaussian function with zero mean and finite variance, although in principle any smooth unimodal function may be utilized. In order for the estimator to be a proper density, an added constraint is imposed on K, i.e. its integral over the space of possible values for G_i should be equal to c. The magnitude of the bandwidth and the shape of the kernel control the contribution of each experimental data point g_{j,i} to the final estimate at point g. Conceptually, small values of c make the estimate rough and prone to spurious features, while large values of c lead to over-smoothed, featureless representations. There exist many criteria for selecting the optimal bandwidth from data; the criterion most commonly employed is minimization of the Asymptotic Mean Square Integrated Error (AMISE), i.e. the second-order Taylor series approximation to the Mean Square Integrated Error (MISE):
\mathrm{MISE}(c) = E\!\left[\int \big(\hat{f}_c(g) - f(g)\big)^2\, dg\right] \qquad (12)

A cross-validation strategy is commonly adopted to estimate the value of c that minimizes the AMISE, but it is computationally expensive for datasets with thousands of data points, as is typical of microarray-based gene profiles. Approximate plug-in or heuristic estimators then come into play, offering a very favorable trade-off between ease and accuracy of computation. For example, the normal reference rule (Wasserman, 2006) is an approximate estimate of the optimal bandwidth of the Gaussian kernel:
c_{\mathrm{Gauss}} = 1.06\, \hat{\sigma}\, n_i^{-1/5}, \qquad \hat{\sigma} = \min\{ s,\; Q_{75-25} / 1.34 \} \qquad (13)

where s is the sample standard deviation and Q_{75-25} is the sample interquartile range, i.e. the difference between the third and first quartiles of the sample data. Since the Gaussian kernel is not optimal when derivatives of the density estimator are sought (Hardle, Marron, & Wand, 1990), the tri-weight kernel and the corresponding bandwidth estimator were used instead:
K_{TW}(x) = \frac{35}{32}\,(1 - x^2)^3, \;\; |x| < 1, \qquad c_{TW} = 2.978\, c_{\mathrm{Gauss}} \qquad (14)

In any case, the kernel estimator of Equation 11 may then be used as a tool for normalization (data reduction), visual exploration, and the generation of discrete, possibly parametric, hypotheses that further reduce the complexity of the representation (meta-modeling, model reduction). These roles are facilitated by the closed form of the estimator, which is amenable to hybrid symbolic-numerical operations. The least restrictive way to normalize individual replicate gene expression profiles amounts to an alignment operation, as discussed in the previous section. During that operation it is assumed that the individual profiles are translated relative to each other as a result of residual experimental variation, and that estimation of a single constant suffices to bring them in line. In microarray research it is customary to effect such a translation by setting the mean or the median of the sample of intensity values equal to zero, but this is not robust with respect to outliers or departures from symmetry. Hence we propose using the mode
of the gene-profile-derived distribution for such a purpose; its value is obtained by maximizing the estimator of Equation 11, i.e. by repeated evaluation of the derivatives of the latter equation using numeric optimization libraries.
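A minimal Python sketch of this representation step is given below (illustrative only; the function names are not from the chapter). It implements the estimator of Equation 11 with the tri-weight kernel and bandwidth rules of Equations 13-14, and shifts a log-expression profile so that its estimated mode lies at zero, i.e. the proposed mode normalization.

```python
import numpy as np

def triweight_kernel(u):
    """Tri-weight kernel of Eq. 14: K(u) = 35/32 (1 - u^2)^3 for |u| < 1, else 0."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) < 1.0, 35.0 / 32.0 * (1.0 - u ** 2) ** 3, 0.0)

def bandwidth(sample):
    """Normal reference rule (Eq. 13) scaled for the tri-weight kernel (Eq. 14)."""
    s = np.std(sample, ddof=1)
    q75, q25 = np.percentile(sample, [75, 25])
    sigma = min(s, (q75 - q25) / 1.34)
    c_gauss = 1.06 * sigma * len(sample) ** (-1.0 / 5.0)
    return 2.978 * c_gauss

def kde(sample, grid):
    """Kernel density estimate of Eq. 11 evaluated on 'grid'."""
    c = bandwidth(sample)
    u = (grid[:, None] - sample[None, :]) / c
    return triweight_kernel(u).sum(axis=1) / (len(sample) * c)

def mode_normalize(log_profile, n_grid=2048):
    """Shift a log-expression profile so that its estimated mode sits at zero."""
    grid = np.linspace(log_profile.min(), log_profile.max(), n_grid)
    density = kde(log_profile, grid)
    return log_profile - grid[np.argmax(density)]
```

Here the mode is located by a dense grid search rather than by symbolic differentiation of the estimator; for the purpose of aligning replicate profiles the two approaches should agree to within the grid spacing.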

UNCOVERING STRUCTURAL FEATURES IN GENE PROFILES: A CASE STUDY

As a case study of the proposed methodology we reanalyzed the DeRisi et al. dataset concerning the diauxic shift of Saccharomyces cerevisiae (DeRisi, Iyer, & Brown, 1997). The particular dataset was selected for the degree of replication (seven replicates) of its common reference channel, i.e. a messenger RNA pool prepared from a single cell state. This degree of replication allowed us to explore the effects of the noise correction and profile representation algorithms in adjusting for the technical variability of the data and in identifying fine structural features, respectively. The impact of the maximum entropy based background correction of the images upon the seven replicate gene expression profiles is shown in Figure 2 for three alternate background correction methods: a) no correction (NONE), b) the conventional method of subtracting the mean background from the mean intensity for each spot in the image (SBC), and c)
the conditional expectation method (COND, Equation 9).

Figure 2. Effect of different background correction methods upon non-parametric density estimates of the gene expression profile. Solid lines: replicate experiments; dashed line: standard normal distribution. PDF: probability density function

In order to generate the data of Figure 2, the seven images were restored with the COND method before or after (SBC, NONE) segmentation with the software SPOT. After segmentation, the three gene profiles were subjected to global mean normalization and the non-parametric density estimators were
generated using Equation 14. For the purpose of comparison, we also graph the probability density function of the standard normal distribution. As compared to the SBC and NONE methods, the maximum entropy corrected images generated more normal-looking datasets, at least around the mode of the distribution (central part of the density estimate). Two important conclusions may be drawn from this figure: a) background correction tends to increase the dynamic range of the microarray measurement scale irrespective of the method used, and b) the inter-array variation in the dynamic range effected by background adjustment varies for the different methods. With respect to the first point, we note that for the data of Figure 2 NONE is associated with the narrowest dynamic range, i.e. 5.71 bits. In contrast, method COND increased the dynamic range (maximum minus minimum value on the x-axis) by 2^{10.83-5.71} ≈ 35 times and SBC by 2^{14.86-5.71} ≈ 568 times. However, this performance advantage for the latter method is misleading, as it leads to dynamic range increases that are considerably higher than the range of the (12-bit) analog-to-digital converter used to scan the original microarrays. For such an unrealistic increase in the range of a bounded quantity to occur, it is as if SBC injects noise into the dataset. Such an effect, which was also noted in different datasets (L. X. Qin & Kerr, 2004), argues against further use of this method.
Subsequently we examined the effects of global mode normalization (GMoN) applied to background-uncorrected and conditionally corrected images (upper row of Figure 3). When compared to global mean normalization, translation by the global mode rule leads to profiles that are effectively centered on zero. Background adjustment by the MaxEnt-based algorithm (Equation 9) has the added benefit of normalizing the profiles.
This effect is readily appreciated in the graphs of the average gene expression profile of the experimental cellular state (lower row Figure 3). These graphs result from computing the average of the
non-parametric estimate of each replicate experiment:
\bar{f}(g) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}_{c_i}(g) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{n_i c_i} \sum_{j=1}^{n_i} K\!\left(\frac{g - g_{j,i}}{c_i}\right) \qquad (15)
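As a schematic illustration of Equation 15 and of the derivative-based peak detection discussed below, the short sketch that follows averages per-replicate density estimates and reports the locations where the derivative of the average changes sign from positive to negative. For brevity it uses SciPy's Gaussian kernel estimator with its default bandwidth instead of the tri-weight kernel of Equation 14; the replicate arrays and grid size are hypothetical inputs.

```python
import numpy as np
from scipy.stats import gaussian_kde

def average_profile_peaks(replicates, n_grid=2048):
    """Average the replicate density estimates (Eq. 15) and return peak locations,
    i.e. points where the derivative of the averaged estimate changes sign (+ to -)."""
    lo = min(r.min() for r in replicates)
    hi = max(r.max() for r in replicates)
    grid = np.linspace(lo, hi, n_grid)
    avg = np.mean([gaussian_kde(r)(grid) for r in replicates], axis=0)
    derivative = np.gradient(avg, grid)
    crossings = (derivative[:-1] > 0) & (derivative[1:] <= 0)
    return grid[:-1][crossings], avg
```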

The average graph of the global-mode-normalized, MaxEnt-restored profiles closely parallels the standard normal distribution at its central part, which reflects the quantitative behaviour of the majority of the expressed genes. On the other hand, the profiles from the unrestored images exhibit a far from normal behaviour for the majority of the genes (except the ones with a normalized expression of 1 log2). At the far edge of the centre both profiles exhibit a minor peak (around +3 log2), which is a consistent feature in all replications of the experiment (upper row of Figure 3). Mode normalization of the profiles generated from the MaxEnt-restored images uncovers two additional peaks at intensities -3.54 and -2.34 log2. The latter two peaks are present in all individual profiles (upper right graph of Figure 3), and are also evident in the averaged profiles (lower right graph of Figure 3). It is possible to gather evidence for the presence of these peaks from all experiments by averaging the derivatives of the non-parametric kernel density estimators of the individual experiments, and thus obtain a symbolic, data-driven peak detector. For the GEP shown in Figure 3, the values of the derivative of Equation 15 at the points {-3.54, -2.34, 0, 3} are, within an absolute numerical tolerance of 10^-3, equal to zero (data not shown). Consequently one could model the average profile with a five-component Gaussian mixture (the additional component is used to represent residual variability); this reduced model is shown in Equation 16 and is graphed along with the average profile of Equation 15 in Figure 4:


\bar{f}(g) \approx f_{\mathrm{approx}}(g) = 0.024\, e^{-14.3 (g + 3.54)^2} + 0.046\, e^{-0.32 (g + 2.34)^2} + 0.18\, e^{-3.11 g^2} + 0.24\, e^{-0.45 (g - 0.43)^2} + 0.024\, e^{-1.74 (g - 3)^2} \qquad (16)

Figure 3. Effect of global mode normalization upon non-parametric density estimates of the gene expression profile. Solid lines: replicate experiments; dashed line: standard normal distribution. Upper row: replicate profiles; lower row: average profile.

In order to estimate the numerical parameters that appear in Equation 16, one regresses a sample
from Equation 15 against the finite Gaussian Mixture Model:
K +1

GMM ( x | p, , ) =
i =1

pi e 0.5

i ( x i )

(17)

with K = 4 and μ, i.e. the centers of the Gaussians, fixed to the values obtained from numerical evaluation of the derivative of Equation 15, while all other parameters are allowed to vary (a schematic fitting procedure is sketched after the list below). Once the (epistemological) clusters have been defined, one may invoke ontological models that correspond to experimentally verifiable hypotheses. With respect to the ontological nature of the clusters in Figure 4, the following should be noted:

• The central massive peak corresponds to (thousands of) genes with average expression, and thus it is highly unlikely that there is a single mechanism that accounts for their expression. This central peak likely reflects the modulating effect (positive or negative) of multiple transcriptional regulators upon the basal transcriptional rate.
• The right-most peak includes genes of a limited functional repertoire (e.g. ribosomal genes, translational elongation factors), and thus it is very likely that a small number of transcriptional mechanisms are responsible for its generation and maintenance (Table 1). Furthermore, probes of genes with similar function (e.g. the retrotransposon gag/pol genes) that are co-regulated are correctly clustered together, adding further weight to the hypothesis that such a peak is the product of actual biological mechanisms, i.e. it has an ontological substance.
• The left-most peaks include genes with unknown function and regulation; the nature of the corresponding clusters is purely epistemological at this point. To convert such an epistemic statement into an ontological one would require targeted experiments that search for common transcriptional mechanisms that explain the common quantitative behaviour.

Figure 4. Finite Mixture Estimation of the average gene expression profile. Solid line: approximation of Equation 16; dashed line: average profile of Equation 15.
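For completeness, the sketch announced above shows one way the regression of Equation 17 with fixed centers could be carried out, using non-linear least squares. It is a schematic reconstruction, not the authors' code; the parameterization as weights plus precisions, and the function and variable names, are assumptions consistent with Equations 16-17.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_gaussian_mixture(grid, avg_density, centers):
    """Regress a sampled average profile (Eq. 15) onto the mixture of Eq. 17.

    centers: fixed Gaussian centers, e.g. the zero-derivative points of the profile.
    Returns fitted weights p_i and precisions lambda_i (the centers are held fixed).
    """
    centers = np.asarray(centers, dtype=float)
    k = len(centers)

    def mixture(x, *params):
        weights, precisions = np.array(params[:k]), np.array(params[k:])
        # Sum of k bumps: p_i * exp(-0.5 * lambda_i * (x - mu_i)^2), as in Eq. 17.
        return np.sum(
            weights[None, :]
            * np.exp(-0.5 * precisions[None, :] * (x[:, None] - centers[None, :]) ** 2),
            axis=1,
        )

    p0 = np.concatenate([np.full(k, avg_density.max() / k), np.ones(k)])
    params, _ = curve_fit(mixture, grid, avg_density, p0=p0, maxfev=20000)
    return params[:k], params[k:]
```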

CONCLUSION
In the present work, we propose a novel theoretical framework for data and model reduction of gene expression profiles generated by microarray experiments. In order to avoid inconsistencies we adopted a bottom-up statistical framework for the problem of background adjustment. This framework operates on the raw microarray images and utilizes maximum entropy priors in order to encode testable pre-data information about the background features and the measurement scale of microarray images. Under certain fairly generic conditions on the expected intensity of the background (m) relative to the signal, one can approximate these priors with modified power series distributions, paving the way for efficient estimation algorithms for the single parameter of interest (m). Subsequently we formulate the operation of background adjustment as a traditional additive-noise inverse problem whose unique solution is obtained by means of the Bayes theorem, properly constrained by the prior distributions. An empiric evaluation of the proposed image de-noising algorithm on a real-world dataset demonstrates the uncompressing effect that the method has upon the gene expression profile, which allows one to detect the existence of fine and reproducible structure in the latter.


Table 1. Accession number and function of genes corresponding to the rightmost peak in the profile of Figure 4. Rows with multiple gene identifier entries correspond to probes with similar gene function.

Gene Identifier | Gene Function
YCR012W | 3-phosphoglycerate kinase
YLR340W | Conserved ribosomal protein P0 similar to rat P0
YNL209W | Cytoplasmic ATPase that is a ribosome-associated molecular chaperone
YDR385W, YOR133W | Elongation factor 2 (EF-2)
YHR174W | Enolase II
YLR167W | Fusion protein that is cleaved to yield a ribosomal protein of the small (40S) subunit and ubiquitin
YPL220W, YKL006W | N-terminally acetylated protein component of the large (60S) ribosomal subunit
YLR075W, YDR012W, YDR418W, YNL067W, YPL131W | Protein component of the large (60S) ribosomal subunit
YJR123W, YGL123W | Protein component of the small (40S) ribosomal subunit
YNL119W | Protein with a role in urmylation and in invasive and pseudohyphal growth
YML045W, YBR012W-B, YCL019W, YJR027W, YBL005W-A, YBR012W-A, YJR026W, YJR028W | Retrotransposon TYA Gag and TYB Pol genes
YDL130W | Ribosomal protein P1 beta
YOL039W | Ribosomal protein P2 alpha
YDR382W | Ribosomal protein P2 beta
YLR109W | Thiol-specific peroxiredoxin
YIL078W | Threonyl-tRNA synthetase
YBL005W | Transcriptional activator of the pleiotropic drug resistance network
YAL003W | Translation elongation factor 1 beta
YKL081W | Translation elongation factor EF-1 gamma
YLR249W | Translational elongation factor
YBR118W, YPR080W | Translational elongation factor EF-1 alpha

In the absence of detailed knowledge about the specific biological mechanisms that control the gene expression profile, one has to forego ontological models of such structure and settle for epistemological ones. By considering the discrete, bounded scale of microarray experimental setups and the potential for over- and under-saturation, we employed function approximation methods in general, and non-parametric kernel density estimators in particular, for the representation of gene expression profiles. Such a representation enables the use of quantitative frequency information about the relative abundance of different transcripts within the same experiment in order to a) normalize the measurements and b) suppress residual noise by averaging (inverse square root rule), and thus serves the role of data reduction. The latter underlies the detection (uncovering) of subtle structural features in the gene profile, upon which the simplification of the descriptions of the biological system under examination pivots (meta-modeling, model reduction). Whereas the non-parametric description uses n+1 parameters (the actual data points and the bandwidth of the kernel, i.e. in excess of 6000 variables in our case), the reduced model utilizes only a handful (i.e. fifteen in the case study of the DeRisi dataset).


It is tempting to justify the use of Gaussian Mixture parametric descriptions in other situations as a result of their derivation as accurate reductions in one typical dataset. Whether normality is an emergent property of our ontological bottom-up (and top-down, as in Do, Muller, & Tang, 2005) descriptions, or a property of the biological systems grounded in tangible, experimentally verifiable mechanisms of transcriptional control, remains an open question.

ACKNOWLEDGMENT
We thank the European Social Fund (ESF), Operational Program for Educational and Vocational Training II (EPEAEK II), and particularly the Program PYTHAGORAS, for funding the above work.

REFERENCES
Barnes, C. M., Huang, S., Kaipainen, A., Sanoudou, D., Chen, E. J., Eichler, G. S., et al. (2005). Evidence by molecular profiling for a placental origin of infantile hemangioma. Proc Natl Acad Sci USA,
102(52), 19097-19102.
Caticha, A., & Preuss, R. (2004). Maximum entropy and Bayesian data analysis: Entropic prior distributions. Physical Review E, 70(4), 046127.
Cox, R. T. (1946). Probability, frequency and reasonable expectation. American Journal of Physics,
14(1), 1-13.
DeRisi, J. L., Iyer, V. R., & Brown, P. O. (1997). Exploring the metabolic and genetic control of gene
expression on a genomic scale. Science, 278(5338), 680-686.
Do, K.-A., Muller, P., & Tang, F. (2005). A Bayesian mixture model for differential gene expression.
Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3), 627-644.
Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 95(25), 14863-14868.
Febbo, P. G., Richie, J. P., George, D. J., Loda, M., Manola, J., Shankar, S., et al. (2005). Neoadjuvant
docetaxel before radical prostatectomy in patients with high-risk localized prostate cancer. Clin Cancer
Res, 11(14), 5233-5240.
Gibson, M. A., & Jehoshua, B. (2001). Modeling the activity of single genes. In J. M. Bower & H.
Bolouri (eds.), Computational Modeling of Genetic and Biochemical Networks (pp. 3-48). Cambridge,
MA: MIT Press.
Guo, Y., Eichler, G. S., Feng, Y., Ingber, D. E., & Huang, S. (2006). Towards a holistic, yet gene-centered
analysis of gene expression profiles: A case study of human lung cancers. Journal of Biomedicine and
Biotechnology, 2006, 69141.
Hardle, W., Marron, J. S., & Wand, M. P. (1990). Bandwidth choice for density derivatives. Journal of
the Royal Statistical Society. Series B (Methodological), 52(1), 223-232.


Jaynes, E. T. (2003). Discrete prior probabilities: The entropy principle. In G. L. Bretthorst (ed.), Probability theory: The logic of science (pp. 343-371). Cambridge University Press.
Johnson, N. L., Kotz, S., & Kemp, A. W. (1992). Power series distributions. In Univariate Discrete
Distributions (Second ed., pp. 70-76). John Wiley & Sons.
Kooperberg, C., Fazzio, T. G., Delrow, J. J., & Tsukiyama, T. (2002). Improved background correction
for spotted DNA microarrays. J Comput Biol, 9(1), 55-66.
Qin, L., Rueda, L., Ali, A., & Ngom, A. (2005). Spot detection and image segmentation in DNA microarray data. Appl Bioinformatics, 4(1), 1-11.
Qin, L. X., & Kerr, K. F. (2004). Empirical evaluation of data transformations and ranking statistics for
microarray analysis. Nucleic Acids Res, 32(18), 5471-5479.
Sarwal, M., Chua, M. S., Kambham, N., Hsieh, S. C., Satterwhite, T., Masek, M., et al. (2003). Molecular
heterogeneity in acute renal allograft rejection identified by DNA microarray profiling. N Engl J Med,
349(2), 125-138.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. CRC Press.
Southern, E., Mir, K., & Shchepinov, M. (1999). Molecular interactions on microarrays. Nat Genet,
21(1 Suppl), 5-9.
Valentini, G. (2002). Gene expression data analysis of human lymphoma using support vector machines
and output coding ensembles. Artif Intell Med, 26(3), 281-304.
Wasserman, L. (2006). Density estimation. In All of nonparametric statistics (pp. 125-144). Springer.

KEY TERMS
Bayesian Probability: An interpretation of the colloquial term probability, which identifies the latter with the degree of belief in a proposition about the world. This interpretation is firmly grounded in the rules of Aristotelian logic and in fact extends the latter to situations of uncertainty, i.e. when the truth or falsity of propositions cannot be ascertained completely. Stated in other terms, the construct of Bayesian Probability and the supporting theory is nothing more than common sense reduced to numbers. The major instrument for updating one's prior beliefs to posterior inferences in light of new information is the computational machinery of the Bayes theorem.
Epistemological Modeling: Modeling that quantifies one's perception of the world rather than the world per se. The object of such modeling is to generate coherent descriptions of one's knowledge, usually in the face of uncertainty.
Gene Expression Profile: Defined operationally as the output of any step in the microarray pipeline
after the image operations stage. Hence the global gene expression profile is a set of possibly noisy measurements, one for each target printed on the surface of the microarray. Inference about the members of
the set, which constitutes the gene expression profile, proceeds in different stages that involve the normalization of the measurements prior to exploratory and formal statistical/machine learning methods.


Maximum Entropy Prior: The distribution that results from application of the variational maximum entropy algorithm. The latter uniquely determines the least biased epistemic (Bayesian) probability distribution that encodes certain testable information, by maximizing the entropy functional (equivalently, minimizing the convex negative-entropy functional) defined by that information. The resulting distributions are least informative or objective ones in the sense that they are most compatible with the pre-data constraints, while being maximally noncommittal about the missing information.
Non-Parametric Kernel Density Estimation: Non-parametric kernel density estimation methods are model-free techniques for the estimation of an empiric distribution from experimental data. Formally, such estimators smooth out the contribution of each observed data point over a local neighborhood.
Normalization: The process in which mathematical transformations of the microarray data are
undertaken to reduce variability in the expression levels and make data from different experiments
directly comparable.
Ontological Modeling: Modeling that implies the existence of certain objects in the physical natural
world. The distinction between ontological and epistemological modeling is a subtle one; whereas the
former is an investigation about natural objects and properties, the latter concerns the analysis of (usually) subjective statements about models of the world.
Power Series Distribution: Discrete probability distributions with probability mass function given by P(y \mid \theta) = a_y \theta^{y} / \sum_{x=0}^{\infty} a_x \theta^{x}, with a_y = 1 for y \leq M and a_y = 0 for y > M. Modified power series distributions (MPSD) are more general distributions which arise when \theta is a function of another (simple) parameter. In such a case we define the power parameter \theta(m) and the series function \eta(\theta(m)) = \sum_{x=0}^{\infty} a_x \theta(m)^{x}. Particular choices of the power parameter render power series distributions that are analytic approximations to Maximum Entropy priors over finite domains.


Chapter XIII

Gene Expression Profiling with the BeadArrayTM Platform
Wasco Wruck
Max Planck Institute for Molecular Genetics, Germany

ABSTRACT
This chapter describes the application of the BeadArrayTM technology for gene expression profiling. It
introduces the BeadArrayTM technology, shows possible approaches for data analysis, and demonstrates
to the reader how the technology performs in comparison to alternative microarray platforms. With this
technique, high-quality results can be achieved, so that many researchers consider employing it for their projects. It can be expected that the technology will gain much importance in the future. The author hopes that this résumé will introduce researchers to this novel way of performing gene expression experiments, thus giving them a sound basis for judging which technology to employ.

INTRODUCTION
Microarrays have emerged as the most popular technology for performing gene expression profiling. They provide researchers with the means to screen all genes of an organism simultaneously, thus easing the investigation of complex diseases like cancer or diabetes. Closely coupled with the development of the technology are reliability issues, culminating in the publication of contradictory results achieved on different microarray platforms. While in the early phase of microarray technology reliability issues arose from imperfect technique, e.g. distorted spotting needles, and thus required compensation by sophisticated image analysis (Steinfath, M. et al., 2001; Kamberova, G., 2002), a continuous process of improvement using proven industrial methods now gives better, but still far from perfect, results. The
BeadArrayTM technology described in this chapter proves that an industrial production process facilitates
reliable experiments of high quality.
The rest of this chapter is organized as follows. We motivate the BeadArrayTM technology. Subsequently, we describe viable approaches for analyzing data produced with this technology, introducing
the proprietary software BeadStudio and freely available solutions from the R Bioconductor environment. We introduce our pipeline design for the analysis of bead-summary data. Then we demonstrate
the position of the BeadArrayTM technology in comparison to other microarray platforms. Finally, we
summarize our experiences with the new technique.

The BeadArrayTM Technology


In the first phase of microarray technology, probes were spotted by robots at known locations onto a dedicated substrate. Probes were then hybridized with a radioactively or fluorescently labeled target. Thus, the abundance of hybridized material was transformed into a signal which could be read by a scanner. Many error sources influenced the results achieved by this technique: twisted spotting needles, needles transferring different DNA volumes, labelling differences between, e.g., the red and green channels, dust on the substrates, systematic local background changes, bad signal-to-noise ratios in the scanned image, etc. A great improvement in quality was achieved by synthesizing oligonucleotides using photolithographic processes known from the semiconductor industry. As a logical consequence the product was called a chip, as a reference to the origin of its manufacturing technique.
An alternative approach claiming to reach similar quality benchmarks as the chips is the BeadArrayTM technology. It takes advantage of the ability of beads to be randomly assembled at very high densities. In the literature it has been described that beads of 300 nm have been randomly assembled into 500 nm wells (Michael et al., 1998). For the BeadArrayTM technology, at present a size of 3 μm is used for the silica beads. The beads are generated by joining oligonucleotides to their surfaces and are pooled in libraries. They are self-assembled into etched substrates.
With the beads randomly distributed over the array, the problem of decoding each bead's information content arises. The solution of this problem is described in (Gunderson, K. et al., 2004) and is an essential precondition for the employment of the technique.
The probes consist of a gene-specific 50-mer oligonucleotide and a 23-mer oligonucleotide address (see Figure 1). They are immobilized on the beads and decoded by a minimal number of hybridization steps in the manufacturing process. The number of hybridization stages s is given by:
s = \log_b n
where n is the number of different beads and b is the number of different states that can be detected after the image analysis, e.g. b = 2 if only red and green can be detected, or b = 4 if red, green, yellow and black can be detected. Thus, every chip is unique and is delivered bundled with media containing the information of the bead locations. (Gunderson, K. et al., 2004) claim a misclassification rate of the beads of 1.2 × 10⁻⁵ in the mean and 1.4 × 10⁻⁴ in the worst case in a random sampling of 100 manufactured array matrices. The misclassification problem can be tackled by using a sufficiently high number of replicates of beads. It is guaranteed that on average there are more than thirty replicates and at minimum there are more than five. In a random sample of twenty chips the author calculated a median number of 42 replicates of beads.
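As a quick numerical illustration of the decoding formula (the figures below are hypothetical and do not describe any particular Illumina product):

```python
import math

# Hypothetical example: n distinct bead types decoded with b distinguishable states per stage.
n_bead_types = 1536
states_per_stage = 4          # e.g. red, green, yellow and black
stages = math.log(n_bead_types, states_per_stage)   # s = log_b(n)
print(round(stages, 2))       # ~5.29, i.e. at least six decoding hybridizations in practice
```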
The address oligonucleotides are designed to have low similarity to genetic sequences, minimal complementary sequences versus each other, and similar GC content. The probe oligonucleotides, on the other hand, are designed to be specific, also taking into account single nucleotide polymorphisms (SNPs) and alternative splicing.

Figure 1. Address and gene-specific part of the more than 100,000 oligonucleotides joined to a bead
Depending on the spacing between the beads, scanners with different resolutions are needed. Two formats exist: one containing 96 samples, using a spacing of 6 μm, works with a scanner resolution of 1 μm, while the other format, containing eight samples and using a spacing of 20 μm, works with a scanner resolution of 5 μm. For our investigations we used the latter resolution, allowing us to scan images containing about 1 million × 8 beads.

DATA ANALYSIS


In principle, data analysis can be performed in two different ways:

a. Data analysis using the proprietary software solution BeadStudio.
b. Bead-level and bead-summary-level data analysis using manufacturer-independent software, e.g. the R/Bioconductor environment (Ihaka et al. 1996, Gentleman et al. 2004).

Data Analysis with the BeadStudio


The BeadStudio offers a vast variety of functions, of which only the main topics will be described here. The software operates on the summarized bead-level data generated as output from the image analysis modules of the scanner software. Usually this data is in a binary format, so that it is not readable by the user. By changing the settings in a configuration file, human-readable bead-level data can also be generated. The BeadStudio enables access to the data on a per-chip-section basis and the interactive grouping of the data via drag and drop or via a predefined sample sheet. Two types of analysis can be performed:
1. The gene expression analysis delivers minimum, maximum and mean intensities, standard deviations/errors of samples and of beads, the number of samples, the number of beads and the detection p-values. The detection p-value gives the significance level at which a probe can be distinguished from the background.


2. The differential expression analysis delivers the same results as the gene expression analysis, plus a so-called DiffScore as well as the Concordance value. The DiffScore tells whether the probe in the group of interest (usually the treatment group) is significantly higher or lower expressed than in the reference group. The Concordance value compares the number of probes with upregulated signal to the number of probes with downregulated signal.

The analysis results are presented and saved in a spreadsheet that can be easily imported into external table calculation software. For further analysis and quality control, linear and logarithmic scatter plots can be generated, which provide several functionalities for managing the visualization of the data, e.g. filtering by the detection call or displaying fold-change lines. Another feature for follow-up analysis is the clustering functionality. Data can be clustered by samples or probes. Thus, it can be evaluated which genes have similar patterns of expression. The same holds for the samples: similar samples regarding the expression patterns are grouped together and displayed in a dendrogram indicating the distances between the samples. For the similarity metrics four different methods are provided: Correlation measures the similarity in terms of the Pearson correlation. Absolute correlation also uses the Pearson correlation, but takes its absolute value, thus giving good values also for reciprocal dependencies. The Manhattan distance measures the distance as the sum of the edges in a rectangular grid. The Euclidean distance uses the shortest distance between two points, which is the square root of the sum of the squares of the components of their difference.
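For readers who wish to reproduce these four similarity measures outside the BeadStudio, a minimal Python sketch is given below; it mirrors the descriptions above and is not derived from the BeadStudio implementation.

```python
import numpy as np

def similarity_measures(x, y):
    """Similarity/distance measures between two expression vectors, as described above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    pearson = np.corrcoef(x, y)[0, 1]
    return {
        "correlation": pearson,                       # Pearson correlation
        "absolute_correlation": abs(pearson),         # also rewards reciprocal dependencies
        "manhattan": np.sum(np.abs(x - y)),           # sum of edges on a rectangular grid
        "euclidean": np.sqrt(np.sum((x - y) ** 2)),   # shortest distance between the points
    }
```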
Quality control information gained from dedicated beads can be displayed in a summarized view, but it can also be refined down to the single section. In the summarized view, the mean and standard deviation of all sections are displayed as bar charts with error bars. The quality control view gives information about hybridization controls, background and noise, Biotin versus high-stringency controls, housekeeping genes versus all genes, and low-stringency controls for perfect matches and mismatches.

Data Analysis with the R/Bioconductor


Data can be analysed with the R/Bioconductor on the bead-summary level and on the bead level. While the bead-summary-level analysis can be performed on the conventional output of the BeadStudio (the gene profiles), the bead-level analysis is based upon information about single beads, which can be generated by the scanner software by changing dedicated configuration parameters in an XML configuration file.

Bead-Summary-Analysis with the R/Bioconductor


The R/Bioconductor packages beadarray (Dunning, M. et al., submitted), lumi (Lin, S.M. et al. 2007) and BeadExplorer were developed for the analysis of data from the BeadArrayTM technology. The authors suggest saving the bead-summary data without background correction and normalization via the BeadStudio, because the background correction of the BeadStudio produces a higher variance for the lower intensities. This can be examined in a logarithmic scatter plot of two background-corrected replicates (see Figure 2).
The BeadExplorer package offers a GUI (graphical user interface) dialog to specify the bead-summary files, while the beadarray and lumi packages are driven by the command line.
The lumi package provides the normalization method RSN (Robust Spline Normalization), which combines the quantile (Bolstad et al., 2003) and loess (Cleveland, 1979) normalization methods.

Figure 2. Logarithmic scatter plots of two biological replicates. In the plot to the left the background correction of the BeadStudio was performed, in the plot to the right the correction was omitted. The left plot clearly shows the higher variance in the area of the lower values.

However, besides the two packages, data can simply be imported into R/Bioconductor via basic R reading functions for tab- or comma-delimited files. Once the data is imported, the comprehensive set of normalization, statistical and plotting functions of R/Bioconductor can be applied. For gene expression analysis the limma package (Smyth, 2005), which includes several normalization methods, as well as the VSN normalization (Huber et al., 2002) are a good basis.
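A minimal base-R sketch of such an import and of the scatter plot shown in Figure 2 is given below. The file name and column names are assumptions, since BeadStudio export layouts differ between versions and have to be adapted to the actual export.

```r
## Sketch: import a tab-delimited bead-summary export and draw a logarithmic
## scatter plot of two replicates; file and column names are assumptions.
bead <- read.delim("bead_summary.txt", check.names = FALSE)
rep1 <- log2(bead[["Sample1.AVG_Signal"]])
rep2 <- log2(bead[["Sample2.AVG_Signal"]])
plot(rep1, rep2, pch = ".",
     xlab = "log2 intensity, replicate 1",
     ylab = "log2 intensity, replicate 2")
abline(0, 1, col = "red")   # identity line as a visual reference
```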

Bead-Level Analysis with the R/Bioconductor


The bead-level analysis is only implemented in the beadarray package. For this type of analysis the
files containing bead-level data are required. They are generated by the scanner software with dedicated
settings in the XML configuration file. It can be considered a disadvantage that these files cannot
be generated a posteriori, i.e. after the scanning software has finished its work. The bead-level analysis
provides further facilities like quality control and normalization on a bead-level basis. The beadarray
package can be used to find local problems on the chips by examining images of the spatial bead
distributions. Outlier beads can be removed, a feature also integrated in the standard platform software. On the other hand, the bead-level data uses large amounts of disk space, and a standard analysis
without the presence of many bead outliers will yield nearly the same results as the platform software,
which averages the bead intensities.
Thus, unless the bead-level data is used for normalization, one can stick to the bead-summary data.
Nevertheless, bead-level data should be stored for potential further investigations when disk space is
not an issue.

A Pipeline Design for Bead-Summary Analysis


The author established a pipeline for the evaluation of a large series of Illumina bead arrays which is
illustrated by Figure 3.


Figure 3. Pipeline design for the evaluation of BeadArray experiments. The first steps, from the experiment to the image analysis, are under the control of Illumina. The summarized bead data can optionally be generated by a Bioconductor package if the bead-level data is available. Results of the pipeline are, on the one hand, quality control features like correlation tables and plots and, on the other hand, tables containing significant genes and pathways as well as statistical and biological descriptive parameters associated with them.

The first steps of the pipeline from the experiments to the image analysis are carried out on
the Illumina BeadArray platform. The standard procedure includes the bead summarization in the
BeadStudio but it is also possible to execute this step via the Bioconductor package beadarray as long
as the bead-level data is available. In the next step the data is annotated. Here, one could rely on the
lookup tables provided by Illumina including RefSeq ids, mouse symbols, descriptions, gene ontologies, etc. However, further annotation is required when more complex evaluations are planned, e.g.
mapping of chromosomal locations. Furthermore a BLAST (Basic Local Alignment Search Tool) of
the oligonucleotide sequences from Illumina against the newest versions of sequence databases keeps
track of recent changes whose influences should not be underestimated.
After the annotation process the categories of the experiments are determined and integrated into a
naming scheme for the bead summary data, e.g. collecting information about tissue, sex and strain in
a project comparing differences between strains. These categories again are used to specify data sets
from the bead summary data which are incorporated in the dedicated test. The test set itself is usually
built up from a control group and a treatment group; more generally speaking, it is built up from two
conditions to be compared against each other.
Quality control is regarded as a very important step in the evaluation of the generated large data
sets. Therefore the evaluation pipeline incorporates multiple quality control methods:

• Inter-array correlation plots and tables
• Intra-array correlation plots (section correlation, replicates) and tables
• Heat maps
• Cluster dendrograms

Sections containing biological or technical replicates are assumed to yield very high correlations. Thus,
the average of a section's correlations to all other sections can be used as a quality criterion. This
criterion is used to filter out bad-quality sections by applying a dedicated threshold for the correlation coefficient.
An automated quality control is enabled this way. However, a visual inspection should complement the
automated process, at least for some random samples.
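The sketch below implements such an automated filter in base R. The expression matrix `expr` (one column per section, positive intensities) and the threshold of 0.9 are assumptions for illustration.

```r
## QC sketch: drop sections whose average correlation to all other sections
## falls below a threshold; 'expr' (probes x sections) and 0.9 are assumptions.
qc.filter <- function(expr, threshold = 0.9) {
  cc <- cor(log2(expr))                 # inter-section Pearson correlations
  diag(cc) <- NA                        # ignore each section's self-correlation
  avg <- rowMeans(cc, na.rm = TRUE)     # mean correlation per section
  keep <- avg >= threshold
  list(kept = colnames(expr)[keep], average.correlation = avg)
}
```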
The quality control phase is followed by the normalization process. The normalization operates on
the test sets containing all quality-checked replicates of the control and treatment set. Thus, only the
data of interest for the dedicated comparison is considered and external influences are minimized. Such
influences would be the drawback of a combined normalization of the total data set consisting of all single data sets.
By default, we use the quantile normalization algorithm (Bolstad et al., 2003), which ranks the values
of the data sets, calculates the median value for each rank afterwards, and finally copies the median values back to
the original data sets in the original order. This method brings the intensity distributions of the data sets into
line so that the run of the curve is the same for all data sets. The median normalization is a simpler
method, multiplying each single experiment with a constant factor calculated via the median of the data
set so that the median of all data sets is the same.
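The two normalization methods just described can be sketched in base R as follows; `expr` is an assumed intensity matrix (probes in rows, arrays in columns), and the quantile method follows the median-based description given above.

```r
## Sketch of the two normalization methods described in the text;
## 'expr' is an assumed intensity matrix (probes x arrays).
quantile.normalize <- function(expr) {
  ranks  <- apply(expr, 2, rank, ties.method = "first")
  sorted <- apply(expr, 2, sort)
  ref    <- apply(sorted, 1, median)      # reference value for every rank position
  apply(ranks, 2, function(r) ref[r])     # copy the reference values back in original order
}

median.normalize <- function(expr) {
  target <- median(apply(expr, 2, median))               # common target median
  sweep(expr, 2, target / apply(expr, 2, median), `*`)   # scale each array by a constant factor
}
```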
Several statistical tests are performed on the normalized data, including Student's t-test (Press et
al., 1992), the Welch test, the Wilcoxon test (Wilcoxon, 1945) and a permutation test. The statistical tests
result in a p-value addressing the probability that control and treatment have the same mean. Thus, a low
p-value indicates a high significance that the dedicated gene is differentially expressed. A threshold of
0.05, or more strictly 0.01, is usually applied. The permutation test arbitrarily permutes the samples
of the control and the treatment set and determines the percentage of test results less than the Wilcoxon
test's p-value. This percentage itself delivers the p-value of the permutation test. By choosing the equality
of control and treatment as null hypothesis the false-positive rate can be controlled, i.e. the amount of
genes yielding a positive test result while in reality they are not differentially expressed. For a series of
multiple simultaneous tests the significance is decreased by the multiplicity. This problem can be managed by the false discovery rate adjustment, which is calculated by the qvalue method (Storey, 2002).


In a pathway analysis the significance of the dysregulation of pathways is tested. This is achieved by
a hypergeometric test comparing the number of significant genes in the dedicated pathway to the total
number of significant genes.
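Such a test can be sketched with the base-R hypergeometric distribution; the counts in the example call are invented for illustration.

```r
## Hypergeometric pathway test sketch: given the total number of tested genes
## (N), the number of significant genes (k), the pathway size (m) and the
## number of significant genes inside the pathway (x), phyper() gives the
## probability of observing x or more such genes by chance.
pathway.test <- function(x, m, k, N) {
  phyper(x - 1, m, N - m, k, lower.tail = FALSE)
}
## Example (invented numbers): 400 significant of 10000 genes, pathway of 50 with 8 hits
pathway.test(x = 8, m = 50, k = 400, N = 10000)
```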
The pipeline can be parametrized by thresholds for the p-value, the ratio and the detection value. The detection value is delivered by the image analysis, telling how well the dedicated beads separate from
the background and thus giving a quality criterion for the reliability of the probe. In general, a good
detection value corresponds to high intensities. The threshold parameters are used for generating lists
of significant genes.

BeadArrayTM Technology in Comparison to Other Microarray Platforms


Ever since microarray technology was developed, reliability has been the crucial factor. Concerning reliability, it can be postulated that experiments should be reproducible on different technology
platforms, and within one platform a perfect reproducibility is expected.
The MicroArray Quality Control (MAQC) project was initiated to investigate reliability issues of
microarrays. The project participants tested several microarray and alternative technology platforms
including the BeadArrayTM platform each at three laboratories selected by the microarray platform
provider. They evaluated the intra-platform consistency and the inter-platform comparability across
multiple test sites.
The tested samples were universal human reference RNA and human brain reference RNA and two
predefined mixtures containing 75% universal human reference RNA : 25% human brain reference RNA
and 25% universal human reference RNA : 75% human brain reference RNA. The goal of this experimental design was not to perform a representative genetic experiment but to investigate the characteristics
and limits of the technologies. The four sample types were tested using five replicate assays.
A common subset from all platforms included 12,091 Entrez genes extracted from 15,615 RefSeq
probes contained in the union of all platforms. Genes were matched to probes via a simplifying one-probe-to-one-gene lookup list.
The data for each platform was evaluated with the proprietary software. Therefore the resulting
number of probes differed between the platforms, depending on the total number of probes provided
and on the detection calls. Each software solution handles the calculation of the detection calls and the
filtering via a threshold in its own way, following differing philosophies regarding the reliability of data.
For the study, only genes that were detected in at least three of five replicate assays were used for most
analyses. The number of genes detected lay in the range of 8,000 to 12,000, depending on the platform.
Within one platform there was only a very small variation in the number of detected genes. Different
numbers of probes certainly have an impact on the data analysis.
In a rough overview the results can be summarized as being measures for the reproducibility within
platforms and between platforms and the accuracy in comparison to a reference technique. As a
measure of the reproducibility the coefficients of variation (cv) are calculated, telling how much results
vary between different platforms and test sites. Here, the values for BeadArrayTM are in the lower range
of all platforms at about 0.1.
Another measure is the amount of overlapping genes detected as differentially expressed under
predefined conditions in terms of p-value and ratio in different platforms or test sites. Here, the BeadArrayTM platform performed quite well with some overlaps to other platforms and all overlaps within the
platform greater than 80%.


Furthermore, the reproducibility can be measured in terms of the correlation coefficient. Here, the
Spearman rank correlation of the log ratios was used. It is calculated by ranking the logarithmic ratios
of treatment samples (human brain reference) versus control samples (universal human reference) before
calculating the correlation coefficient. The correlation coefficient ranges between -1 and 1. Good values
are close to one, values near zero represent a random dependency and negative values a reciprocal dependency. In this discipline the BeadArrayTM platform performed best for the median to all other platforms.
However, the correlation coefficients were only slightly better than those from other competitors, and
in the intra-platform comparison there were better results from another platform, but both were ranging
on an excellent level. For these results from the MAQC project see Figure 4.
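The measure itself is easy to reproduce in base R; the sketch below uses simulated per-gene log ratios standing in for two test sites, since the original MAQC data are not part of this chapter.

```r
## Sketch: Spearman rank correlation of per-gene log ratios between two test
## sites; simulated values stand in for real site measurements.
set.seed(1)
true.ratio <- rnorm(1000)                       # underlying log2 ratios (brain vs. universal)
site1 <- true.ratio + rnorm(1000, sd = 0.2)     # log ratios measured at site 1
site2 <- true.ratio + rnorm(1000, sd = 0.2)     # log ratios measured at site 2
cor(site1, site2, method = "spearman")          # ranks the ratios before correlating
```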

CONCLUSION
The BeadArrayTM platform is a novel, promising technology for gene expression profiling. It provides
high-throughput data screening and high-quality experiments at a relatively low cost. For data analysis
a quite comfortable interactive software solution is bundled with the platform. Furthermore, free software packages from the R/Bioconductor environment can be applied to evaluate the data in combination with the proprietary solution or in a stand-alone fashion. However, since the platform is still young,
the software development is in its starting phase and further improvements can be expected, e.g. a
state-of-the-art normalization method taking into account GC content and avoiding negative values, like
GCRMA (Wu et al., 2004) for the Affymetrix1 platform.

Figure 4. Spearman rank correlations of log ratio values from dedicated platforms compared in the
MAQC project. Only genes expressed in both test sites and both sample types were used. ILM_1, ILM_2
and ILM_3 are three different test sites for the BeadArrayTM platform. Correlation coefficients less than
0.8 (worst) are the darkest, between 0.8 and 0.9 are the lightest, and greater than 0.9 (best) are mid-range.
The last row shows the median of the rows above, yielding best values for the BeadArrayTM test sites.


For the advancement of alternative software packages, the open accessibility of technical specifications, e.g. detailed descriptions of bead-level data
and control probes, would be profitable. Nevertheless, the existing tools allow for gene expression profiling delivering results of high quality.
The data quality as assessed by the MAQC project is located on a high level. In terms of the coefficient
of variation, the overlap of significant genes and the correlation coefficient between platforms and within the
same platform, it always ranges in the upper region of all competitors and sometimes is best. Since the
MAQC consortium wanted to investigate the characteristics and limits of the technologies, it can be expected that in dedicated biological experiments the quality will not reach the high level of this project.
The results of the experiments evaluated by the author confirm this hypothesis. Correlation coefficients
of data from technical replicates could reach 0.99. However, when some parameters were changed, like
the labeling or the batch of chips, the quality could be slightly inferior.

Acknowledgment
Part of this work was funded by the EU project METASTEM.

REFERENCES
Bolstad, B., Irizarry, R., Strand, M., & Speed, T. (2003). A comparison of normalization methods for
high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185-93.
Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the
American Statistical Association, 74(368), 829-836
Dunning, M. J., Smith, M. L., Ritchie, M. E., & Tavaré, S. (2007). Beadarray: R classes and methods
for Illumina bead-based data. Bioinformatics, June 22.
Gentleman, R., Carey, V. J., Bates, D. M., et al. (2004), Bioconductor: Open software development for
computational biology and bioinformatics. Genome Biology, 5, R80.
Gunderson, K.L., et al. (2004). Decoding randomly ordered DNA arrays. Genome Res, 14, 870-877
Huber, W., von Heydebreck, A., Sueltmann, H., Poustka, A., & Vingron, M. (2002). Variance stabilization
applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18(Suppl. 1), S96-S104.
Ihaka, R. & Gentleman, R. (1996). R: A language for data analysis and graphics. J. Comput. Graph.
Stat, 5(3), 299-314.
Kamberova, G., & Shah, S. (2002). DNA array image analysis. Nuts & Bolts. DNA Press
Lewin, B. (2002). Molekularbiologie der Gene. Spektrum Akademischer Verlag. (p. 985).
Lin, S.M., Du, P., Kibbe, W.A. (2007), Model-based variance-stabilizing transformation for Illumina
microarray data. Accepted by Nucleic Acids Research.


MAQC Consortium (2006). The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology, 24(9), 1151-1161.
Michael, K.L., Taylor, L.C., Schultz, S.L., & Walt, D.R. (1998). Randomly ordered addressable high-density optical sensor arrays. Anal. Chem., 70, 1242-1248.
Press, W. H., Teukolsky, S.A., Vetterling, W.T., & Flannery, B.P. (1992). Numerical recipes in C: The
art of scientific computing. Cambridge University Press.
Smyth, G. K. (2005). Limma: Linear models for microarray data. In R. Gentleman, V. Carey, S. Dudoit,
R. Irizarry, W. Huber (eds), Bioinformatics and computational biology solutions using R and bioconductor (pp. 397-420). New York: Springer.
Steinfath, M., Wruck, W., Seidel, H., Lehrach, H., Radelof, U., & O'Brien, J. (2001). Automated image
analysis for array hybridization experiments. Bioinformatics, 17, 634-641.
Storey, J.D. (2002). A direct approach to false discovery rates. J R Stat Soc B, 65, 479-498.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1, 80-83.
Wu, Z., Irizarry, R. A., Gentleman, R., Martinez Murillo, F., & Spencer, F. A. (2004). Model-based
background adjustment for oligonucleotide expression arrays. Johns Hopkins University, Dept. of
Biostatistics Working Papers.

Key Terms
Affymetrix: Affymetrix is a trademark of Affymetrix, Inc.
Bead Array: A bead array is an array of randomly assembled beads covered with oligonucleotides
representative of an organism's genome.
BeadArray: BeadArray is a trademark of Illumina, Inc.
BLAST: BLAST (Basic Local Alignment Search Tool) is a bioinformatics algorithm used to align sequences against each other.
Clustering: Clustering is an analysis method for grouping of probes and samples by similarity.
Similar data sets fall into the same cluster while dissimilar data sets fall into different clusters. For the
hierarchical clustering a hierarchy of clusters is determined. Thus, one large cluster comprising all
data sets is stepwise subdivided into smaller clusters down to singletons which are clusters containing
a single data set.
Coefficient of Variation (cv): The coefficient of variation is the standard deviation divided by the
mean. It is a measure for the reproducibility. For a low cv data has a good reproducibility, for a high
coefficient of variation data contains much variation and thus measured values can only be reproduced
very imprecisely.
Correlation Coefficient: The correlation coefficient tells how well two data sets correlate. The correlation coefficient ranges between -1 and 1. Good values are close to one, values near zero represent
a random dependency, and negative values a reciprocal dependency. The correlation coefficient calculated
directly from the data is called Pearson correlation coefficient. The Spearman correlation coefficient first
ranks the data before calculating the (Pearson) correlation coefficient. Thus, it is more robust against
outliers than the Pearson correlation coefficient.
Gene Chip: A gene chip contains a matrix of photolithographically assembled oligonucleotides
representative of an organism's genome on a dedicated substrate.
Gene Expression: Gene expression is the transformation of a gene's information by transcription
and translation (Lewin, 2002).
Microarray: A microarray is a collection of genetic substances - mostly DNA (deoxyribonucleic
acid) - integrated at a large scale. Microarrays can be generated in high throughput using robots transferring minimal amounts of probes onto substrates. The sample being tested is radioactively or fluorescently labelled and hybridized to the probes. A scanner scans the microarrays, delivering signals
for the labelled and hybridized substance. Gene chips (see Gene Chip) and bead arrays (see Bead Array) can be
regarded as subsets of microarrays.
VSN (Variance Stabilizing Normalization): The VSN (Variance stabilizing normalization) transforms the data in such a way that the variance remains nearly constant over the whole intensity spectrum.
Without this (or another) normalization a dependency between intensity and variance can be observed
in many cases, which deteriorates the analysis results.
XML: XML is the Extensible Markup Language. It is a more general version of the Internet description
language HTML (Hypertext Markup Language), allowing the detailed description of documents.

Endnote


1. Affymetrix is a trademark of Affymetrix Inc.


Chapter XIV

The Affymetrix GeneChip Microarray Platform
Djork-Arné Clevert
Charité Universitaetsmedizin Berlin, Germany
and Johannes Kepler University Linz, Austria
Axel Rasche
Max-Planck-Institute for Molecular Genetics, Germany

abstract
Readers will find a quick introduction, with recommendations, to the preprocessing of Affymetrix
GeneChip microarrays. In the rapidly growing field of microarrays, gene expression profiling, especially with the
Affymetrix GeneChip arrays, is an established technology that has been present on the market for over ten years. Used
in biomedical research, the mass of information demands statistics for its analysis. Here we present
the particular design of GeneChip arrays, for which much research has already been invested and some
validation resources for the comparison of methods are available. For a basic understanding of the
preprocessing, we emphasize the steps, namely background correction, normalization, perfect match
correction and summarization, and couple these with alternative probe-gene assignments. Combined with
a recommendation of successful methods, a first use of the technology becomes possible.

Introduction
Microarrays are the state-of-the-art tool for high-throughput analysis of gene expression. Microarrays
allow one to monitor the expression of several thousand genes in parallel in a single experiment, facilitating a broad view of the expression state. This genome-wide investigation is the basis of the systems
biology modeling concept.


The Affymetrix GeneChip platform was one of the first commercial techniques available on the
market. It comes with a sophisticated design, measuring the expression of a single gene by several probes
on the same chip and providing control sequences for every feature. Chips are available for many species, including popular model species, and the platform is especially well established in biomedical research.
Due to the design and the dissemination of the platform, much research has been performed
on the analysis of the generated data. It is the main leads and successful results that we wish to describe
here.
The vast amount of digital and noisy data generated by microarrays requires statistics for its evaluation. Affymetrix provides basic applications for processing the data and collaborates with companies
providing Affymetrix-recommended software. On the other hand, most of the independent research has
been carried out in the R software environment with the BioConductor package collection for statistical computing (Gentleman et al., 2004; R Development Core Team, 2005). In an attempt to be concise
we shall focus on R/BioC.

Design of the Platform


In the GeneChip approach the expression of a gene is measured by several probes. The probes are selected from the transcript sequence of the respective gene. The UniGene database is the reference for
the gene sequence. To avoid cross hybridization between several genes, the sequence of the probes has
to be chosen unique to the gene. The length of the probes is always 25 nucleotides.
A number of such probes collected in probe sets stands for independent measurements of the number
of transcripts for the gene. The number of probes in a probe set varies between chip platforms. For example in the popular Human Genome U133 Plus 2.0 array there are eleven probes in each probe set.
With the advancement of the human genome sequence and transcript libraries the choice of probe
sequences has to be updated from one chip platform to the next. The assignment of the probe sets to genes
is updated quarterly and can be retrieved from the NetAffx service on the Affymetrix homepage.
In the classic chip designs, each probe is spotted with its perfect match (PM) sequence and the so-called mismatch (MM) sequence. In the mismatch sequence the 13th nucleotide is altered. The idea is
that the mismatch sequence measures the background expression; the perfect match signal then contains
the background expression plus the gene expression. In the newer chips Affymetrix saves the space for
additional probes and replaces the mismatches with GC-bins. For a given number of G or C nucleotides
(between 0 and 25) the GC-bin contains 25mers unrelated to any gene sequence. The assumption is
that sequences with the same GC content show similar expression behaviour. To make the hybridization
results independent of the degradation of the transcripts in the cell, the probe sequences are selected
near the 3' end of the gene sequence.

Figure 1. Probe sequences are selected from the transcribed regions of the gene sequence
In the production of the chips, the probes are spotted on slides using a photolithographic method. In
the experiment, labeled RNA from the sample under study is injected onto the chip. The hybridization
result depends non-linearly on the amount of transcripts in the sample. In the analysis of the measurement results this has to be considered by calculating with the logarithm of the hybridization value. The
Affymetrix chips are single-channel chips: the RNA is labeled using the same dye, and the comparison
between different samples is done by using several chips. The chip with the hybridized solution is then
scanned at the wavelength of the dye. Analysis starts by exploiting the scanner image. An approximate
level of hybridization for every probe is inferred from this image.

Preprocessing

Steps

In order to analyze and evaluate GeneChip data from an experiment with multiple arrays, the data
preprocessing at probe-level is a crucial step.
All of the methods have to account for two major disturbing factors: the background signal and the
variance of the measurements, which result in noisy data. The composition of the gene signal is not yet
completely understood, although several groups are developing mathematical models. The use of the models
is two-fold: on the one hand the signal can be understood together with its disturbance factors, and on the other
hand the model is applied to correct the measured signal. One effect of the missing understanding is
the underestimation of the fold change between two samples.
An expression level value is calculated using a four-step procedure, as shown in Figure 2. (1) Background correction, which removes the unspecific background intensities of the scanner images; (2)
normalization, which reduces the undesired non-biological differences between chips and corrects the
signal intensity of the arrays; (3) PM correction, which removes non-specific signal contributions such
as unspecific binding or cross-hybridization from the PM probes by the use of the MM probes; and (4) summarization, which combines the multiple preprocessed probe intensities into a single gene expression value.

Figure 2. Preprocessing pipeline for Affymetrix GeneChip in BioConductor. Background correction, normalization and PM correction are optional, whereas summarization is a mandatory processing step.
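In BioConductor this configurable pipeline is exposed, for example, by the affy package. The sketch below is not prescribed by the chapter; it assumes CEL files in the working directory and shows both the one-call RMA route and the modular route where each of the four steps is chosen explicitly.

```r
## Sketch of the four-step preprocessing pipeline with the Bioconductor 'affy'
## package, assuming CEL files in the current working directory.
library(affy)
abatch <- ReadAffy()            # probe-level data from all *.CEL files

eset.rma <- rma(abatch)         # RMA: background correction, quantile
                                # normalization, PM-only, median polish

## The same pipeline with each of the four steps selected individually:
eset <- expresso(abatch,
                 bgcorrect.method = "rma",
                 normalize.method = "quantiles",
                 pmcorrect.method = "pmonly",
                 summary.method   = "medianpolish")
exprs(eset)[1:5, ]              # expression matrix (probe sets x arrays)
```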


Errors introduced in one of these steps may corrupt further processing; e.g. spurious correlations with
target conditions may appear, especially with few tissue samples (arrays) and a large number of genes.
Most of the methods focus on perfect matches and ignore the mismatch information. Various methods
have been proposed for preprocessing probe-level data and are the subject of the following sections.

Background Correction Methods

The background signal consists of an optical background and a chemical background. The optical
background introduced by the scanning of the chip derives from the technical range of the scanning
device, possibly supplemented by the overshining of neighboring spots not corrected by the image
analysis. The chemical background is explained by the gene signal consisting of gene-specific binding
and non-specific binding. The gene-specific binding is the goal of all the models and normalization
methods. The non-specific binding, also called cross-hybridization, is the RNA from other genes and
different RNA snippets. Non-specific binding has shorter binding times at the probes and is fortunately
much lower than gene-specific binding. The background signal differs from probe to probe and thus
general estimates for all of the probes have had little success. Recent developments hint at the content
of the probe sequence as the main disturbing factor. A higher GC content is associated with a higher
binding affinity due to three instead of two hydrogen bonds per base pair. The higher affinity
leads to higher hybridization values and bigger variance.
In ideal circumstances a probe detects the amount of a probe-specific labeled RNA sequence, but
actually the measured probe intensity is a mixture of two signals: a probe-specific signal that contains
the abundance of hybridized RNA and a non-specific background signal that obscures the observation.
To separate these signals one can choose from two methods:
1. The RMA convolution model, which models the observed probe intensity as the sum of an exponentially distributed signal and a Gaussian-distributed background component (Bolstad, Irizarry, Astrand, & Speed, 2003; Irizarry, Hobbs et al., 2003). It is worth noting that the RMA background correction can be improved by correcting probe values for the GC content of the probe sequence. This improvement leads to a method called GCRMA (Wu, Irizarry, Gentleman, Murillo, & Spencer, 2004).
2. Affymetrix's MAS 5.0 algorithm, where the chip is partitioned into 16 equally sized segments. For each segment a background is estimated using the lowest 2% of probe intensities of that segment. Then each probe value is adjusted based upon a weighted average of the background values, where the weights depend on the Euclidean distances from the particular probe to the centers of all segments (Affymetrix, 2002).

Normalization Methods


Normalization is the process of removing unwanted chip effects that might bias all measured raw probe-level data in a similar manner. This bias is introduced during RNA extraction, pipetting, temperature
fluctuations, hybridization efficiency and many other sources of variation; Hochreiter, Clevert, and
Obermayer (2006) discussed these possible sources in more detail. In the following sections we shall
distinguish between model-based and baseline-based approaches. Model-based methods make use
of information from across all arrays to normalize the probe-level data, whilst baseline methods select
only one array in the batch as reference and then normalize all arrays to that particular one. Affymetrix,
for instance, proposed to normalize arrays by choosing one baseline array and then scaling all the other
arrays to have the same mean intensity as this array.
For model-based methods cyclic loess and quantile normalization both serve as examples. The
cyclic loess normalization (Yang et al., 2002) is derived from the basic principle behind the MvA-plot,
where M is the difference in log expression and A is the average of the log expression values. Ideally, the points of an MvA plot for normalized data should be centered on M = 0, i.e. the loess curve
of these points should be a straight line lying on the A-axis. The rationale of the loess normalization is
to project the loess curve onto the A-axis. Given two arrays with probe intensities $x_{i1}$ and $x_{i2}$, $M_i = \log_2(x_{i1}/x_{i2})$
and $A_i = \frac{1}{2}\log_2(x_{i1} x_{i2})$ are determined, and a normalization curve is fitted to the resulting MvA plot using a local regression method called loess. These fits are $\hat{M}_i$, and $M'_i = M_i - \hat{M}_i$ thus gives the
normalization adjustment. Transforming $x'_{i1} = 2^{A_i + M'_i/2}$ and $x'_{i2} = 2^{A_i - M'_i/2}$ back to the linear scale leads to
normalized probe intensities for both arrays.
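A base-R sketch of this two-array loess step, following the formulas above, could look as follows (function and argument names are illustrative, not from a specific package):

```r
## Sketch of one cyclic-loess step for two arrays x1, x2 of positive
## intensities, following the M/A formulas above.
mva.loess <- function(x1, x2) {
  M <- log2(x1 / x2)
  A <- 0.5 * log2(x1 * x2)
  fit  <- loess(M ~ A)            # normalization curve fitted on the MvA plot
  Madj <- M - fitted(fit)         # adjusted log ratios M' = M - Mhat
  list(x1 = 2^(A + Madj / 2),     # back-transform to the linear scale
       x2 = 2^(A - Madj / 2))
}
```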
Bolstad, Irizarry, Astrand, and Speed (2003) proposed to align the distributions of the arrays with an
empirical distribution. To determine the empirical distribution, quantile normalization first sorts the
intensities of each array and then computes the mean over the arrays for each sorted intensity position.
The intensities of each array are then set to the mean value of the corresponding position, in the unsorted
order of the array. This leads to the fact that all arrays possess the same intensity values, but for different genes and different positions within a probe set.

Perfect Match Correction Methods

Affymetrix GeneChip arrays are so dense that any non-specific signal contribution, such as unspecific
binding or cross-hybridization, has to be estimated from the probe-level data. Therefore, Affymetrix
originally adjusted the signal intensity of a PM probe by subtracting the intensity value of the corresponding MM probe. But in (Naef, Hacker, Patil, & Magnasco, 2002) replicate experiments on different
arrays were made, and the PM values as well as the PM - MM values were analyzed. The authors found
that the PM values have lower noise at low intensity than PM minus MM (PM - MM), whereas for
intermediate and high intensities the noise levels for PM and PM - MM were similar. Therefore, recent
approaches make use only of PM probes. Affymetrix indirectly supports this aspect through the retirement of the MM probes in many products. Therefore we shall not address that issue in further detail.

Summarization Methods


In order to establish a single expression value for a gene that is assessed on the array, it is necessary to
combine all intensity values of the probes in the corresponding probe set into one value.
To tackle this problem several methods have been proposed. In principle it is important to distinguish
between single- and multi-array approaches:
Single-array methods only use probe-level information from an individual array to calculate the
expression value (Li & Wong, 2001). The calculations are therefore carried out individually, chip by
chip; one consequence is that no information from other chips is available to robustify the expression
value. For instance, Affymetrix suggested in their Microarray Suite 5.0 (MAS 5.0) to apply a one-step
Tukey's biweight to the log2-transformed probe-level data to give an expression value. In particular,
the algorithm computes a distance measure:

$$d_{ij} = \frac{\log_2(\mathrm{PM}_{ij}) - M}{cS + \epsilon} \qquad (1)$$

where $M$ is the calculated median of the probe-level data and $S$ is the median absolute deviation. In addition, $c$ is a constant defaulting to 5 and $\epsilon > 0$ prevents division by zero. Based on $d_{ij}$, a weight
for every data point is calculated from:

$$w(d) = \begin{cases} (1 - d^2)^2, & |d| \le 1 \\ 0, & |d| > 1 \end{cases} \qquad (1.1)$$

In contrast to the preceding approach, multi-array methods combine the multiple preprocessed probe
intensities into a single expression value by taking probe information across arrays into account. This is motivated by examining probe patterns across the arrays, which commonly show that the variability of
a single probe across multiple arrays is smaller than the variability between probes of the same probe
set. Therefore, Irizarry, Bolstad et al. (2003) proposed to fit a linear additive model through a median
polish. This summarization approach is part of the RMA preprocessing pipeline. As mentioned
above, the perfect match intensities are modelled by:

$$\log_2(\mathrm{PM}_{ij}) = \mu + \alpha_i + \beta_j + \epsilon_{ij} \qquad (2)$$

where $\mathrm{PM}_{ij}$ is the intensity matrix of a particular probe set, such that $j$ subscripts the probes and $i$ indicates the arrays. Moreover, $\beta_j$ describes the probe affinity effect and $\mu + \alpha_i$ provides an estimate of the
log2 expression level. To estimate the parameters, the median polish algorithm is applied. The lack of a
standard error estimate is one drawback to this approach.
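The median polish itself is available in base R, so the summarization of one probe set can be sketched directly; the matrix below is simulated purely for illustration.

```r
## Sketch: median-polish summarization of one probe set, using base-R
## stats::medpolish() on a matrix of log2 PM intensities (arrays in rows,
## probes in columns). The data are simulated for illustration.
set.seed(1)
pm <- matrix(2^rnorm(44, mean = 8), nrow = 4, ncol = 11)   # 4 arrays, 11 probes
mp <- medpolish(log2(pm), trace.iter = FALSE)
expression <- mp$overall + mp$row   # per-array log2 expression estimates (mu + alpha_i)
```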
The core of the FARMS algorithm is a factor analysis, a multivariate technique to detect a common
structure in the data of multiple probes that measure the same target. The assumption is that the probe
intensity measurements of the perfect matches $x$ depend on the true RNA concentration $z$ via:

$$x = \lambda z + \epsilon \qquad (3)$$

with $\lambda$ being the loadings of the factor analysis (Hochreiter et al., 2006). In equation (3), an $N(0, 1)$-distributed $z$ models the common factor in the data $x$, while the $N(0, \Psi)$-distributed $\epsilon$ models the independent
noise in each probe of each array. In essence, model (3) explains the observed covariance structure
of the data $x$ by representing the data as being $N(0, \lambda\lambda^{T} + \Psi)$-distributed, with an individual noise variance $\Psi$
and signal variance $\lambda\lambda^{T}$, and it is optimized by Bayesian maximum a posteriori estimation. In contrast
to other summarization techniques, FARMS provides an unsupervised feature selection criterion that
is based on the reliability of the extracted factor. Here, only such probe sets are considered where the
model can reliably detect a variation of the latent variable z. (Talloen et al., 2007) discussed the concept
and applicability of informative/non-informative calls (I/NI-calls) approach in more detail.


Validation Resources
The lack of objective criteria to assess competing preprocessing methods motivated Cope, Irizarry, Jaffee,
Wu, and Speed (2004) and Irizarry, Wu, and Jaffee (2006) to develop a collection of assessment criteria called
Affycomp II for the evaluation and comparison of expression measures on the Affymetrix GeneChip
platform. Furthermore, a web tool was made available at http://affycomp.biostat.jhsph.edu for developers to
benchmark their procedures and to help users identify the best method for their application. Here, the
benchmark data is crucial. Therefore Cope et al. (2004) used three well-known evaluation datasets,
which were produced by controlled experiments with known target expression values or known mutual
relations. The control of input in spike-in and dilution experiments makes it possible to identify features
of the data for which the expected outcome is known in advance. Based on this knowledge, 17 assessment criteria and several related plots were developed. Nevertheless, we think that from all benchmark
criteria the area under the curve (AUC) criterion is best suited to measure the quality of a preprocessing method. The AUC criterion is the area under the receiver operating characteristic (ROC) curve,
which plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) and serves as
a quality measure for classification methods. The AUC criterion can be applied here by defining gene
classes: for a pair of arrays, class 1 genes are the genes whose expression value differences exceed a
certain relative factor (fold change). Now the output of a summarization method can be interpreted as a
classification by computing the class membership of genes based on the predicted expression values.
We prefer the AUC criterion over other measures provided by the Affycomp II evaluation because it is
independent of the scaling of the results (log-expression values) and trades sensitivity against specificity.
Other quality measures from the Affycomp II evaluation, like slope parameter estimation, are often
not scaling independent.
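The AUC computation itself is simple enough to sketch in base R; the example below uses simulated truth and predicted fold changes, and the Mann-Whitney rank formula for the area under the ROC curve.

```r
## Sketch of the AUC criterion: rank genes by the absolute predicted log fold
## change and compute the area under the ROC curve against the known truth.
auc <- function(score, is.class1) {
  r  <- rank(score)                      # higher score = predicted as class 1
  n1 <- sum(is.class1); n0 <- sum(!is.class1)
  (sum(r[is.class1]) - n1 * (n1 + 1) / 2) / (n1 * n0)   # Mann-Whitney form
}
## Example with simulated truth (class 1 = true fold change above the cutoff)
set.seed(1)
truth <- c(rep(TRUE, 100), rep(FALSE, 900))
pred  <- abs(rnorm(1000, mean = ifelse(truth, 2, 0)))
auc(pred, truth)
```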

Alternative Probe-Gene Assignments


Affymetrix delivers an assignment of probe sequences to probe sets. The number of probes in a probe set
is fixed for all probe sets on the chip. Quarterly, the probe sets are re-annotated to genes and transcripts.
Recently it has emerged that the probe sets do not have to be fixed: the probes themselves can be
reattached to probe sets, and new assignments have been published (Dai et al., 2005). The alternative
assignments have several advantages. The advancement of the sequence databases leads to altered gene
sequences. If the probe sequence is not contained in the gene sequence anymore, the probe does not need
to be processed; its signal only introduces noise. Extensive libraries for single nucleotide polymorphisms
(SNPs) are meanwhile also available, and the probe is skipped as well if a single nucleotide polymorphism hits the
probe sequence. The assignment can be done with a specific sequence database in mind. This eases the
path of analysis, as the probe sets provide another abstraction level between the hybridization values and
the gene expression. A drawback of the reassignment is that there is no fixed number of probes in the
probe set anymore. The idea of reassignment leads us to the issue of evaluating its use.
On a set of chips from the Human Genome U133 Plus 2.0 array we compared two assignments: the
Affymetrix assignment and the assignment to Ensembl genes (Birney et al., 2006; Dai et al., 2005).
The first discovery is that one fourth of the probes on the chip is skipped for the reasons given above.
For the same set of chips the preprocessing has been executed using the two assignments. The two results
report mostly the same genes as expressed, but one fourth of the genes can be seen to be differentially
expressed between the two assignments! The different number of probes per probe set does not lead to
a dependency between variation and probe set size. The Kendall correlation between the coefficient of
variation and probe set size is 0.06. The mean shows a tendency to high expression for small probe set
size. We also processed a set of treatment and control chips with the two assignments and compared the
resulting list of differentially expressed genes. The two lists overlap to about 50%. This again shows a
major influence of the assignment on the analysis results.
From the above considerations we conclude that the use of alternative probe-gene assignments is
recommendable. The new assignments lead to reduced variation and thus reduced noise in the expression values; likewise, they improve the sensitivity and specificity of the algorithms and results.
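In practice an alternative assignment can be applied, for example, through one of the custom chip-definition (CDF) packages distributed by the Brainarray project; the sketch below is an assumption-laden illustration, and the package name depends on the chip type and annotation version.

```r
## Sketch: preprocessing with an alternative probe-to-gene assignment via a
## custom CDF package. The package name is an assumption and must be replaced
## by the installed Ensembl-gene CDF for the chip in use.
library(affy)
abatch <- ReadAffy(cdfname = "hgu133plus2hsensgcdf")  # custom CDF instead of
                                                      # the Affymetrix default
eset <- rma(abatch)   # probe sets now correspond to the alternative gene definitions
```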

Medical Use
The research or scientific use of microarrays is mostly the comparison of diseased and normal tissue
to identify differentially expressed genes. In cases with sufficient disease samples, clustering and classification methods can be used to dissect different disease traits.
The dissection facilitates the distinction of research results and identifies different causes and developments. Defining expression signatures for the subtraits of an affliction leads to the diagnosis of the
respective disease traits. Currently there is no chip for diagnostics on the market, but projects are running to push first products towards approval. With the exact disease trait diagnosed, the treatment
is optimised using the best possible medicine administered in an adequate dose.
Systems biology uses microarrays at several stages of model building (Klipp, Herwig, Kowald, Wierling, & Lehrach, 2005). Genes with altered expression are the basis to develop physiological models.
Hybridization results estimate the model parameters and verify model predictions.

Figure 3. On the left-hand side, the expression of probe sets for the same target gene is compared for the two different assignments. On the right-hand side we see the variation versus the probe set size.


Experimental Setup
Due to the noisy data, microarray results always have to be verified. In the lab this is done with complementary RT-PCR experiments for selected genes. At the desk, one can check the consistency with other published data, although this consistency is often low. The amount of verification needed is reduced by the use of
statistics in the analysis of the hybridization results. But a solid statistical analysis requires replicates.
Normally this is not an issue in industry, but it is in academic laboratories. In technical replicates the same
sample is used on several chips, and in biological replicates samples from different patients or animals
are each hybridized on a single chip. The use of biological replicates is strongly recommended. The
following numbers have been established by experience:



• Cell culture: 2-3 replicates
• Animal system: 4-5 replicates
• Human system: 5-6 replicates
• Time courses: 4-6 time points with 2 replicates per time point

These numbers imply that it is wise to consult a statistician at the early stage of experiment planning
to ensure a reasonable statistical set-up.
In the US the MicroArray Quality Control (MAQC) project was started to check the inter- and intra-platform reproducibility of gene expression measurements (Shi et al., 2006). The preprocessing problems
raised concerns about the reliability of this technology, and the MAQC was initiated to address these among
other issues. The study is an important first step pushing microarrays towards clinical and regulatory
settings. The experimental setup is described in Shi et al. (2006): microarray products from different
manufacturers are compared. Affymetrix is presented with a very high reproducibility within and across
test sites and with low variance in measurements.

Outlook on Technological Advances


In this chapter we described the established series of gene expression chips from Affymetrix. New
chip designs are ahead. A direct enhancement of gene expression chips are exon arrays, which comprise
probe sets for every exon. Thus the expression can be dissected into the single transcripts, and
alternative splicing can be identified. By collecting the probe sets in so-called meta probe sets the gene expression is
still included. SNP chips enable genotyping with microarrays, avoiding expensive sequencing approaches. On tiling arrays the probes are distributed equidistantly over the genome sequence,
identifying expression of sequences independently of the gene structure. Because of the equidistant
selection of probe sequences not all probe sequences are unique, which introduces new analysis challenges.
Most of the genome is never expressed; thus most probes on a tiling array do not return a signal.
By now no tiling array is large enough to cover the whole genome. Having presented the different chip
designs, it must be remarked that the methods and analysis approaches presented in this chapter are
not directly transferable to the new designs. Different biological questions and uses of the chips need
different processing.


References
Affymetrix. (2002). Algorithms description document.
Birney, E., Andrews, D., Caccamo, M., Chen, Y., Clarke, L., Coates, G., et al. (2006). Ensembl 2006.
Nucleic Acids Res, 34(Database issue), D556-561.
Bolstad, B. M., Irizarry, R. A., Astrand, M., & Speed, T. P. (2003). A comparison of normalization
methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2),
185-193.
Cope, L. M., Irizarry, R. A., Jaffee, H. A., Wu, Z., & Speed, T. P. (2004). A benchmark for Affymetrix
GeneChip expression measures. Bioinformatics, 20(3), 323-331.
Dai, M., Wang, P., Boyd, A. D., Kostov, G., Athey, B., Jones, E. G., et al. (2005). Evolving gene/transcript
definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res, 33(20), e175.
Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., et al. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology,
5, R80.
Hochreiter, S., Clevert, D. A., & Obermayer, K. (2006). A new summarization method for Affymetrix
probe level data. Bioinformatics, 22(8), 943-949.
Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., & Speed, T. P. (2003). Summaries of
Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4), e15.
Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., et al. (2003).
Exploration, normalization, and summaries of high density oligonucleotide array probe level data.
Biostatistics, 4(2), 249-264.
Irizarry, R. A., Wu, Z., & Jaffee, H. A. (2006). Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22(7), 789-794.
Klipp, E., Herwig, R., Kowald, A., Wierling, C., & Lehrach, H. (2005). Systems Biology in Practice.
Wiley-VCH.
Li, C., & Wong, W. H. (2001). Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA, 98(1), 31-36.
Naef, F., Hacker, C. R., Patil, N., & Magnasco, M. (2002). Empirical characterization of the expression
ratio noise structure in high-density oligonucleotide arrays. Genome Biol, 3(4), RESEARCH0018.
R Development Core Team. (2005). R: A language and environment for statistical computing. Vienna,
Austria: R Foundation for Statistical Computing.
Shi, L., Reid, L. H., Jones, W. D., Shippy, R., Warrington, J. A., Baker, S. C., et al. (2006). The MicroArray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression
measurements. Nat Biotechnol, 24(9), 1151-1161.


Talloen, W., Clevert, D. A., Hochreiter, S., Amaratunga, D., Bijnens, L., Kass, S., et al. (2007). I/NI-calls for the exclusion of non-informative genes: A highly effective filtering tool for microarray data.
Bioinformatics, 23(21), 2897-2902.
Wu, Z., Irizarry, R. A., Gentleman, R., Murillo, F. M., & Spencer, F. (2004). A model-based background
adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association,
99(468), 909-917.
Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., et al. (2002). Normalization for cDNA
microarray data: A robust composite method addressing single and multiple slide systematic variation.
Nucleic Acids Research, 30(4), e15.

Key Terms
FARMS: Factor Analysis for Robust Microarray Summarization, a probabilistic latent variable model
for summarizing high-density oligonucleotide Affymetrix GeneChip array data at probe level.
Gene Expression: Gene expression is the process by which the inheritable information in a gene,
such as the DNA sequence, is made into a functional gene product, such as protein or RNA.
Microarray: A microarray (also known as gene chip or DNA chip) is a collection of microscopic
DNA spots, commonly representing sequence extracts of single genes, arrayed on a solid surface by
covalent attachment to a chemical matrix. DNA arrays are commonly used for expression profiling,
i.e., monitoring expression levels of thousands of genes simultaneously, or for comparative genomic
hybridization.
MM: Mismatch, a probe accompanying a PM, where the 13th nucleotide is changed to its complement (A to T, T to A, G to C, C to G). In the Affymetrix GeneChip design MM are spotted along with
the PM aiming to measure the non-specific hybridization to the PM.
PM: Perfect match, a probe with the true sequence complement of the targeted sequence.
Probe: A probe is a fragment of DNA of 25 nucleotides/basepairs length, which is used to detect
in RNA samples the presence of nucleotide sequences (the DNA target) that are complementary to the
sequence in the probe. The probe thereby hybridizes to single-stranded nucleic acid (DNA or RNA) the
base sequence of which allows probe-target base pairing due to complementarity between the probe
and target.
Probe Set: A set consisting of all probes addressing the transcripts from the same gene. In the Affymetrix GeneChip design the expression level of a gene shall be measured with several probes.


Chapter XV

Alternative Isoform Detection Using Exon Arrays
Jacek Majewski
McGill University and Génome Québec Innovation Centre, Canada
David Benovoy
McGill University and Génome Québec Innovation Centre, Canada
Tony Kwan
McGill University and Génome Québec Innovation Centre, Canada

abstract
Eukaryotic genes have the ability to produce several distinct products from a single genomic locus. Recent
developments in microarray technology allow monitoring of such isoform variation at a genome-wide
scale. In our research, we have used Affymetrix Exon Arrays to detect variation in alternative splicing,
initiation of transcription, and polyadenylation among humans. We demonstrated that such variation
is common in human populations and has an underlying genetic component. Here, we use our study to
illustrate the use of Exon Arrays to detect alternative isoforms, to outline the analysis involved, and to
point out potential problems that may be encountered by researchers using this technology.

INTRODUCTION
Alternative pre-mRNA splicing is a process allowing the production of several distinct gene isoforms
from a single genomic locus. The most common type of alternative splicing events in mammals results
in cassette exons, where each such exon can be either included or excluded from the mature mRNA.
Other events include alternative use of donor or acceptor splice sites, and intron retention. In addition,
processes such as alternative promoter usage and alternative polyadenylation, resulting in differences
in initiation and termination of the transcript, respectively, further diversify eukaryotic transcriptomes
and proteomes. Such processes have been suggested to be at least partly responsible for mammalian

complexity, which is otherwise difficult to explain in view of our relatively low number of genomic loci
- less than 25,000 genes in humans, versus approximately 20,000 in the nematode worm C. elegans
(Claverie, 2001). It is estimated that a high percentage of mammalian genes is alternatively spliced, and
this frequency is highest in specialized and complex tissues, such as the brain and the liver. Differences
in splicing patterns have been shown to exist across species, and within populations of the same species.
In humans, splicing defects are known to result in numerous genetic disorders (Faustino & Cooper,
2003) and may confer susceptibility to complex genetic diseases. Thus, the process of alternative splicing attracts the interest of researchers across the entire biomedical sciences spectrum, ranging from
evolutionary biology, through development, to medicine.
In recent years, alternative transcript investigation in a genome-wide context has been carried out
using expressed sequence tag libraries (ESTs). Generally, ESTs (short cDNA sequence reads) are mapped
to the genomic sequence, and different isoforms can be inferred from incongruence of splicing patterns
(Modrek, Resch, Grasso, & Lee, 2001). However, EST library analyses are prone to sequencing errors,
biased towards highly expressed genes, and influenced by cancer-derived ESTs, which may not generally be present in healthy tissues.
More recently, microarray platforms have been proposed as a tool for studying gene expression at
the isoform level (Black & Graveley, 2006; Lee & Roy, 2004; Zhang et al., 2006). Splicing sensitive
microarrays employ a number of exon body oligonucleotide probes, or exon junction probes, or a combination of the two designs, to determine mRNA levels at the resolution of a single exon or splice site.
The Affymetrix GeneChip Human Exon 1.0 ST Array is the first commercially available microarray
product designed for genome-wide, exon level expression analysis. The array relies on targeting multiple probes to individual exons and allows simultaneous, exon-level detection of expression intensity
for 1.4 million probesets covering over 1 million known and predicted human exons. The Exon Array
is a flexible tool, which can be used to perform the function of classical expression arrays and concurrently provide supplementary information on isoform changes. However, because of the complexity
of the design, statistical analysis of the data becomes much more intensive, both at the theoretical and
computational level. The simplest illustration is the multiple testing problem; whereas in classical expression arrays the number of tests is of the order of the number of genes, in an exon array, the number
of tests is over ten-fold higher and can vary from a few hundred thousand to over 1 million (if
computationally predicted exons are included). The statistical approaches need to be able to distinguish
between whole gene expression differences and isoform differences, which introduces a new level of
complexity. The robustness of measurement is also an issue, since the exon array has on average four
probes per probeset, whereas Affymetrix expression arrays relied on more than 10 probes per probeset
to estimate expression.
In our lab, we have used the Exon Arrays to investigate differences in splicing patterns among
humans. We were able to demonstrate the existence of common variation in splicing, polyadenylation
sites, and transcription initiation. We also demonstrated genetic linkage and allelic association of the isoform
variation to common genetic single nucleotide polymorphisms (SNPs). Our results show that the effects
of genetic variants on gene expression are much more complex than previously believed, and constitute
an important step towards understanding the functional consequences of such variation.
In this chapter, we will use our experiments as a case study, in order to outline the flow of the analysis
required to process Exon Arrays. We will also outline problems which may be encountered by potential
users of the chips, and describe solutions which we have developed to overcome such problems. We will
discuss current advances in statistical analysis and propose future improvements to optimize both the
array design and statistical solutions.

MICROARRAY DESIGN


The Exon Array relies on patented Affymetrix technology to provide 5.4 million 25-mer oligonucleotide
probes on a single chip. Probes target individual exons, or portions of an exon when prior evidence of
alternative splicing exists. Each such potential splicing unit is represented by at least one probeset, and
each probeset consists, on average, of four individual probes (Figure 1). It should be noted that because
of the small size of some exons and limitations in probe placements, many probes within a probeset are
overlapping and thus not independent. The array annotation consists of 3 levels: core, extended, and full.
The core probesets represent the highest level of annotation confidence; they are supported by RefSeq
and GenBank evidence. There are 284,000 core probesets on the array. The extended and full annotations represent less confidently annotated exons, with support from ESTs and comparative genomics,
as well as de novo gene prediction algorithms. In total, there are over 1.4 million probesets interrogating
approximately 1 million known and predicted exons. While the extended and full gene annotations have
the potential to identify novel exons and transcript variants (Siepel et al., 2007), we expect that most
studies will concentrate on the high confidence core annotations. This will greatly limit false positive
discovery rates and allow labs to fine-tune their analysis techniques. We recommend that the non-core
probesets only be considered by more experienced researchers or in the case of follow-up analyses targeted
to specific transcripts.
One significant departure from earlier Affymetrix designs is the absence of mismatch probes. Mismatch probes have previously been used to estimate background hybridization noise levels. Instead, the
Exon Array uses a large number of antigenomic probes, which do not have a match anywhere in the
genome and ideally represent a null signal. Antigenomic probes are grouped by their GC content and
used to produce a Detection Above Background (DABG) p-value. A p-value below 0.05 may be used
as an indication that a given probe, probeset, or metaprobeset is expressed.
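As a rough illustration, a DABG-style call can be sketched in R as an empirical p-value against GC-matched background probes; the object names below (an 'antigenomic' table with 'intensity' and 'gc' columns) are hypothetical, and the actual Affymetrix Power Tools implementation may differ in detail.

    # Empirical DABG-style p-value for one probe: the fraction of GC-matched
    # antigenomic (background) probes that are at least as bright.
    dabg_pvalue <- function(probe_intensity, probe_gc, antigenomic) {
      background <- antigenomic$intensity[antigenomic$gc == probe_gc]
      (sum(background >= probe_intensity) + 1) / (length(background) + 1)
    }
    # e.g. dabg_pvalue(180, 12, antigenomic) for a probe with 12 G/C bases
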
It is important to note that, while the Exon Array design targets the entire genome, certain splicing
changes are virtually impossible to detect using a purely exon-targeted approach. As an example, if a
minor (rare) isoform of a transcript skips a certain exon, even large (several fold) increases in the level
of that isoform may not significantly impact the overall mRNA levels of the exon. Such splicing variants
would be much more efficiently detected using junction-arrays which target their probes to specific
exon-exon junctions. However, junction arrays can only target a limited number of junctions (because
of the high number of possible combinations) and cannot monitor every hypothetical alternative splicing
event. Thus, in order to target every known or predicted exon in the genome, the Exon Array sacrifices
sensitivity to certain types of rare isoform changes.

STUDY DESIGN
Genetically controlled variation in gene expression has recently been shown to be common in human populations, and it is believed to be responsible for phenotypic variability and susceptibility to
complex diseases (Cheung et al., 2005; Spielman et al., 2007; Stranger et al., 2005). We have a much
poorer understanding of the variability at the level of specific transcript isoforms, such as differences in
transcription initiation, splicing, and polyadenylation. Despite isolated examples of such differences, to
date, no genome-wide studies have been carried out to determine their prevalence and potential impact.
We designed a pilot experiment to investigate the effectiveness of the Exon Array in detecting splicing differences among humans by comparing RNA extracted from lymphoblastoid cell lines of two different individuals, and further observing inheritance of the splicing patterns within their families.


Figure 1. A. Schematic for coverage of probesets across the entire length of the transcript. Light regions
correspond to exons, whereas dark regions represent introns. The short dashes underneath the exon
regions indicate individual probes of 25 nucleotides in length representing the probeset. B. Flow chart
for processing and analysis of chips to validation of alternative splicing events. Total RNA is extracted
from the two cell lines (n=15 replicates per individual) and is transcribed to cDNA and labelled with
biotin. The total cDNA is then hybridized to the exon chip, followed by washing and staining with an
anti-streptavidin antibody. Chips are then scanned and hybridization data is processed and analyzed
by the Affymetrix Power Tools software package. A splicing index is calculated and candidate statistically significant differential splicing events are selected using an unpaired t-test. A subset of alternative
splicing events predicted between the two cell lines is then validated by: 1) reverse transcriptase-polymerase
chain reaction (RT-PCR) using exon body primers flanking the probeset of interest; and 2) sequencing
of the RT-PCR products (reprinted from Kwan et al., 2007).


For
the continuation of this study, we used cell lines from 60 unrelated individuals of Northern European
descent that have been earlier genotyped for approximately 4 million SNPs by the International HapMap
Project (Altshuler et al., 2005). Each cell line was grown in triplicate, in order to minimize stochastic
effects of cell growth conditions. RNA was isolated from the cells using standard protocols, and the
three biological replicates for each individual were used for hybridization. Since all genotypes were
already known, we were then able to use the resulting exon-level expression measurement as a quantitative phenotype, and carry out allelic association analysis. Briefly, at each genetic marker the genotypes
were coded as 0, 1 or 2, with the heterozygous genotype always given the value of 1. The quantitative
phenotype was then regressed on the genotype, and statistical significance and magnitude of the genetic
effect were estimated from the regression analysis (Figure 4).
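In R, this marker-by-marker test can be sketched in two lines; 'expr' and 'geno' are hypothetical vectors holding, respectively, the exon-level expression phenotype and the 0/1/2 genotype codes for one SNP, in matching sample order.

    # Regress the quantitative phenotype on the additively coded genotype;
    # the slope estimates the genetic effect and its p-value the significance.
    fit <- lm(expr ~ geno)
    summary(fit)$coefficients["geno", ]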

ANA LYSIS
All of our microarrays were hybridized and scanned at the Microarray Platform at McGill University
and Genome Quebec Innovation Centre. We used standard protocols, with an enhanced RiboMinus
treatment step to improve the reduction of ribosomal RNA. The raw hybridization intensity data was
obtained in the form of .cel files. These files can be analyzed using most available microarray software.
In our study, we used the Affymetrix Power Tools software package (Affymetrix). The general quality
control procedures, such as visual inspection and multivariate analyses aimed at outlier detection (e.g.
PCA plots), are common to all microarray analyses and will not be described in detail. Any defective
or suspected outlier arrays should be removed from further analyses.
The first step of subsequent statistical analysis consists of normalization. This step attempts to
control for the chip effect, which may be caused by slight differences in RNA hybridized to each
array or differences across production batches of arrays (if possible, these should be avoided before
setting up the experiment). We use the quantile normalization method (Irizarry et al., 2003), which
assumes that the distribution of signal intensities across all arrays should be identical. Normalized
probe intensities are then summarized into a single measurement. Again, this can be carried out using established techniques developed for general microarray use, such as RMA (Irizarry et al., 2003)
or PLIER (Affymetrix). However, it is at this stage that the analysis of Exon Arrays acquires its own
flavour. The exon array data is summarized at two levels: the probeset level (roughly corresponding to
the expression of each exon), and the metaprobeset level (corresponding to an entire transcript). The
probeset level signal is obtained by combining an average of four probes per exon. Hence, this signal
may be quite noisy. The metaprobeset signal combines the data from all exons within a gene and is thus
more robust. However, alternative splicing within a gene may affect the metaprobeset summary and
result in erroneous differences in transcription level estimates. This is particularly true for genes with
few exons. Both PLIER and RMA claim robustness with respect to rare aberrant signals, e.g. those caused by unresponsive or cross-hybridizing probes, or rare alternative exons, and currently we do not filter
out such probes. However, to further improve the analysis, algorithms have been suggested to pre-filter
inconsistent probes before the summarization step (Xing, Kapur, & Wong, 2006). As a principle, we
recommend log-transforming all summary expression levels in order to stabilize the variance.
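For readers who want to see the normalization spelled out, a toy version of quantile normalization can be written in R as below; in practice we rely on the implementations in Affymetrix Power Tools or Bioconductor (e.g. preprocessCore) rather than this simplified sketch, and 'x' is a hypothetical probes-by-arrays matrix of raw intensities.

    quantile_normalize <- function(x) {
      ranks <- apply(x, 2, rank, ties.method = "average")  # rank probes within each array
      ref <- rowMeans(apply(x, 2, sort))                   # reference (average) distribution
      out <- apply(ranks, 2, function(r) ref[round(r)])    # map each rank onto the reference
      dimnames(out) <- dimnames(x)
      log2(out)                                            # log-transform to stabilize the variance
    }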


One of the biggest challenges in detection of alternative splicing using Exon Arrays is the deconvolution of splicing and transcription. A simple comparison of probeset intensities across samples is
not sufficient; if an exon belongs to a transcript that is differentially expressed, the examination of a
single exon out of its genomic context will lead to an incorrect conclusion. A very simple and intuitive
solution to this problem is the use of the Splicing Index (SI), which is calculated by simply dividing the
probeset intensity by the metaprobeset intensity (i.e. exon expression/gene expression), after the addition of a stabilization constant to both the probeset and metaprobeset scores. This simple procedure
normalizes the expression level of each exon and accounts for any possible gene expression differences
between samples. However, we find that the splicing index has some undesirable statistical properties
(arising from large errors in the estimates in both the numerator and the denominator) as well as being
prone to methodological artefacts, and should be used with caution. Thus, we have also used a simpler, but more labour-intensive, method of carrying out the entire analysis at the probeset level, and relying
on visualization and manual curation of the results in order to distinguish splicing and expression differences between samples. While more robust statistical approaches are being developed, we strongly
advocate visualization of results in the context of genome annotation and EST evidence in order to
filter out false positive signals. We have relied on custom scripts and modifications of the UCSC and
ENSEMBL genome browsers, but commercial packages for the Exon Arrays are also available (e.g.
Partek Genomics Suite).
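For reference, the index itself is a one-line computation. In the R sketch below, 'probeset' and 'gene' are hypothetical matrices of summarized probeset-level and metaprobeset-level intensities (rows matched through the 'probeset2gene' vector), and the value of the stabilization constant is chosen purely for illustration.

    c0 <- 16                                              # stabilization constant (illustrative)
    # exon-level expression relative to gene-level expression, on the log2 scale
    si <- log2((probeset + c0) / (gene[probeset2gene, ] + c0))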

RESULTS
Two Sample Comparison
This is the simplest microarray study design. Generally, the two samples may correspond to two different tissues, healthy and affected, or treated and untreated material. In our pilot study, we used RNA
extracted from two cell lines originating from two different individuals (Kwan et al., 2007). To analyze
splicing differences between the two samples, an unpaired Student's t-test was performed using the log-transformed SI values of each probeset for 15 replicates (3 biological replicates x 5 technical replicates) of
each individual (R statistical package, version 2.3.0). Probesets showing significantly different SI scores
were ranked by p-value or fold-change. During the course of the data analysis, we discovered that many
pre-processing steps needed to be performed on the SI results, in order to limit false positive rates.
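A minimal sketch of this per-probeset test in R, assuming a hypothetical matrix 'si' of log-transformed SI values whose columns 1-15 come from the first individual and columns 16-30 from the second, is:

    pvals <- apply(si, 1, function(v)
      t.test(v[1:15], v[16:30], var.equal = TRUE)$p.value)  # unpaired Student's t-test
    head(sort(pvals))                                       # rank candidates by p-value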

Effect of SNPs
We found the primary source of false positives to be the effect of SNPs on probe hybridization levels
(Figure 2). A single mismatch in a cDNA sequence can totally prevent the hybridization of the cDNA to
its probe (Alberts et al., 2007). Since each probeset consists of only four probes, and a SNP can affect
all four overlapping probes in some of the probesets, the final effect of SNPs on genetically heterogeneous samples is substantial. We discovered that the most statistically significant candidate exons were
highly enriched in SNP-containing, overlapping probes. A vast majority of those candidates could not
be validated using RT-PCR. This effect was easy to verify in our dataset, since the HapMap samples
have been thoroughly genotyped and in most cases the mismatch substrate is known beforehand, but this may not be the case in many other studies. In our analysis, we conservatively masked (removed from the analysis) all probes containing known SNPs.


Figure 2. The effect of SNPs on probe hybridization leads to false positive results. A. The box plots
show differences in expression between 4 probes belonging to probeset 2748252. The probe sequences
and relative positions are shown under the plots. Two overlapping probes are affected by a single SNP,
rs11549015, and show highly statistically significant reduction in intensity in the individual carrying
the mismatch allele. B. The probeset 2748252 appears as differentially included in the transcripts of
the two individuals (left panel). However, this effect disappears after masking the probes overlapping
the SNP (right panel).

While this procedure led to a slight reduction of
the available data, it drastically reduced false positive rates. We recommend that this procedure should
be followed in experimental setups where samples are genetically heterogeneous (e.g. cancer patients).
However, the presence of SNPs should not be a problem in cases where different tissues from the same
patient or pooled RNA samples are being studied.
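The masking step itself is straightforward once probe coordinates and a SNP list are available; a sketch in R, with hypothetical data frames 'probes' (columns chr, start, end) and 'snps' (columns chr, pos), might look as follows.

    # Flag probes whose genomic interval contains a known SNP, then drop them
    masked <- vapply(seq_len(nrow(probes)), function(i)
      any(snps$chr == probes$chr[i] &
          snps$pos >= probes$start[i] &
          snps$pos <= probes$end[i]),
      logical(1))
    clean_probes <- probes[!masked, ]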

Dataset Reduction
In order to reduce the amount of random noise, and decrease the number of tests being carried out, it
is useful to exclude all genes which are either not expressed in all of the samples being compared, or are not expressed in more than one of them. Such genes, by definition, cannot be alternatively spliced across samples. There is currently no reliable procedure for deciding whether a gene is expressed or not; Affymetrix recommends using an ad hoc expression value of 15, together with some additional filters using DABG values
of individual exons.
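One simple variant of such a filter, requiring expression in every sample, amounts to a single row selection; for example, in R, with a hypothetical matrix 'gene_expr' of metaprobeset summaries (genes by samples) on the natural scale:

    # Keep only genes whose summary exceeds the ad hoc threshold in every sample
    expressed <- apply(gene_expr > 15, 1, all)
    gene_expr <- gene_expr[expressed, ]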

Effect of Dead Probesets


A probeset which is not expressed (e.g. an exon which is skipped in all samples under investigation) may produce a false positive signal in the splicing index in the presence of transcript-level variation. All non-responsive probesets should be removed from the analysis. A DABG-based criterion may be
used here, e.g. DABG p-value < 0.05 in at least 50% of the samples.
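In R this criterion is again a one-line filter; 'dabg' is a hypothetical matrix of DABG p-values with the same probeset rows as the SI matrix used above.

    # Keep probesets detected (DABG p < 0.05) in at least half of the samples
    responsive <- rowMeans(dabg < 0.05) >= 0.5
    si <- si[responsive, ]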

Additional Filtering Steps


Based on the experience of our group and many other researchers using the Exon Arrays, Affymetrix
now recommends a number of data filtering steps, which can be found at the company website (www.affymetrix.com).

Splicing Differences
We applied three different methods for multiple testing correction. The Bonferroni correction, obtained
by dividing the nominal p=0.05 threshold by the total number of probesets, gives the most conservative cut-off (p-value=3.159e-7) and yields 1892 candidate probesets (1.2% of expressed core probesets)
showing differential splicing between the two samples. The false discovery rate (FDR) (Storey & Tibshirani, 2003) at a 0.01 significance level provided the least conservative estimate (p-value=8.915e-4)
with 8771 (5.7%) potential splicing events. We also used an empirical null distribution of p-values from
the observed data, by shuffling the SI scores for all samples of each probeset (Churchill & Doerge,
1994). At the p=0.05 level, this method estimates 4020 (2.6%) differentially spliced probesets between
the two individuals.
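The three thresholds can be compared directly on the vector of per-probeset p-values. The R sketch below uses p.adjust with the Benjamini-Hochberg method as a stand-in for the cited FDR procedure, and shows only a single label permutation of the SI matrix; in practice the permutation is repeated many times to build the empirical null.

    n <- length(pvals)
    bonferroni_hits <- which(pvals < 0.05 / n)                # Bonferroni cut-off
    fdr_hits <- which(p.adjust(pvals, method = "BH") < 0.01)  # FDR-based cut-off

    perm <- sample(ncol(si))                                  # shuffle the sample labels
    null_p <- apply(si[, perm], 1, function(v)
      t.test(v[1:15], v[16:30], var.equal = TRUE)$p.value)
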
A small subset of 20 candidates was subjected to validation by RT-PCR using a pair of primers in
distinct exons flanking the predicted differential splicing events. We were able to confirm alternative
isoforms for 9 of the transcripts, which translates into a 45% validation rate. However, our study evaluates the ability of this microarray technology to identify alternative splicing (AS) events de novo in genetically
diverse populations. Restricting our candidates to those showing EST and cDNA evidence of alternative splicing in sequence databases reduces the number of cases from 20 to 12, thereby increasing our success rate to 58% (7 out of 12).
As noted earlier, the Exon Array can be used as a de novo discovery platform for alternative splicing,
or it can be targeted to the more confident, previously observed events, either by using only the core
exons or further limiting the analysis to EST-supported alternative exons. If used as a discovery tool, the
false discovery rates will invariably increase, but the benefit of detecting novel events may sometimes
outweigh the cost. Judging by our experience, and the experience of other groups, the false positive
rates of this platform will generally be non-negligible and may be of the order of 50%. We recommend
thorough independent (PCR, Northern blot) validation of the most interesting results.

Heritability of Splicing Differences


One of our hypotheses was that differences in splicing between individuals may be heritable, i.e. they
have a genetic component. Hence, we carried out genetic linkage analysis within the family of the two
tested individuals, using the SI of the 9 validated exons as a quantitative trait. We confirmed linkage to
cis-acting SNPs (heritability) for three of the nine exons (Figure 3), and for two of them we were able
to identify the SNPs affecting their extended splice-site consensus sequences.

Multiple Sample Analysis


Following the above pilot study, we carried out a full-scale investigation of splicing and isoform level
variation in a human population, using cell lines from 60 unrelated CEU HapMap individuals (Kwan et
al., 2008). This setup is illustrative of other designs often encountered in microarray experiments, where
data subdivided into multiple classes is analyzed simultaneously. In contrast to the two-sample design,
ANOVA or regression analysis is used instead of a t-test to detect differences among samples. In our
case, we assumed an underlying co-dominant genetic model, where each SNP allele is associated with
preferential expression of a distinct isoform. Hence, the heterozygous SNP genotype has an effect intermediate to the two homozygotes, and a linear regression approach is the logical choice (Figure 4).
In this part of the study, we decided not to use the splicing index, but rather carry out the entire
analysis at the probeset (exon) level. After the filtering steps outlined above, we identified all probesets
exhibiting significant association between expression levels and cis-acting SNP genotype. The statistical
cut-offs were established using permutation testing and a 0.05 FDR level. Cis-acting SNPs were defined
by their proximity within 50kb of the probeset tested. We used a semi-automated classification method,
to decide whether a significant association represented an alternative splicing, alternative transcription
start, alternative polyadenylation, or whole transcript expression change. This was performed using a
simple script which grouped all significant probesets together (in cases where they belonged to a single
transcript), determined their location within the transcript (internal, 5′ or 3′) and tested whether the
whole gene or just individual exons were significantly associated with the SNP genotype (expression or
isoform difference). In addition, we examined all the candidate events by eye, in relation to their genomic
context, EST, and mRNA evidence, in order to determine the final classification of isoform differences.
Although this may seem like an ad hoc and laborious process, thorough visualization of 324 candidate
genes obtained from this analysis was carried out in our lab in a matter of days. It is a relatively minor
effort compared to the cost and labour required to set up most microarray studies, and we believe that
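To make the association step concrete, the core calculation for one probeset can be sketched in R as follows; 'expr' holds the probeset expression values across the 60 individuals, 'genos' is a hypothetical SNPs-by-individuals matrix of 0/1/2 codes, and 'snp_info', 'probe_chr' and 'probe_pos' (also hypothetical) give the SNP and probeset coordinates.

    # SNPs within 50 kb of the probeset are treated as cis candidates
    cis <- which(snp_info$chr == probe_chr &
                 abs(snp_info$pos - probe_pos) <= 50000)
    # effect size and p-value for each cis SNP from a linear regression
    assoc <- t(sapply(cis, function(i) {
      fit <- summary(lm(expr ~ genos[i, ]))
      fit$coefficients[2, c("Estimate", "Pr(>|t|)")]
    }))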


Figure 3. Inheritance of alternative splicing for genes A. OAS1, B. CRTAP, and C. CAST. Left panel
shows pedigree structure of CEPH/UTAH family 1444 with the autosomal dominant inherited splice
pattern as blue symbols. Haplotypes for each of the eight founder chromosomes are labelled A, B, C,
D, E, F, G, and H, and the two inherited haplotypes of each family member are indicated within the
symbol. The regulatory haplotype is shown as bold white text. Squares represent males and circles represent females. The right panel shows the two transcript isoforms of the genes. Exon-body primers are
shown above the flanking exons of the predicted alternatively spliced exons. Shown below the transcript
isoforms are the RT-PCR results. Lanes are numbered from 1-14 according to the pedigree on the left
(reprinted from Kwan et al., 2007).

Some examples of the observed variants are shown in Figure 5. Complete results of this study may be retrieved from
our website (www.regulatorygenomics.org). In order to further limit false positives, we also sequenced
in multiple individuals 83 candidate events which were supported by a single probeset, and where all probes overlapped and could potentially be affected by a single SNP.


Figure 4. Analysis steps from identification of a significant probeset in the PARP2 gene to validation. A.
Linear regression analysis of expression scores for probeset 3527423 with genotypes of SNP rs4981998,
giving a p-value of 2.81x10^-30. B. Visualization of probeset 3527423 in the context of all other probesets
belonging to the same transcript (metaprobeset 3527418). For each probeset, the significance level
(p-value) is indicated by the grey line, along with fold change expression between the mean scores of
the two homozygous genotypes (meanTT / meanCC) shown as vertical bars. The solid horizontal grey
and black lines represent the significance and fold change expression for the regression analysis at the
metaprobeset level against SNP rs4981998. Probeset 3527423 is indicated by an arrow. C. RT-PCR
validation of probeset 3527423 using flanking exon-body primers. D. Schematic of the 5′ end of two isoforms of PARP2 with Exon Array probesets shown below the exons. The significant probeset 3527423 corresponds to alternative 5′ splice site usage resulting in a larger 2nd exon for NM_005484 (reprinted
from Kwan et al., 2008).

As a result of this re-sequencing
and discovery of previously un-annotated SNPs, we excluded 27 probesets from further analysis.
Again, we proceeded to validate a subset (32) of our candidate events. Putative alternative splicing
events were validated using a simple end-point RT-PCR, while promoter and poly-adenylation changes
were tested using real-time SYBR Green RT-PCR (Applied Biosystems, Foster City, CA, USA).
We successfully validated 25 of the 32 events. We consider the resulting validation rate (78%) to be
extremely high for this type of study. However, this reflects the large amount of processing: statistical
(filtering etc.), manual (visualization), and laboratory work (additional sequencing), which contributed
to the final dataset.


Figure 5. Examples of different types of transcript isoform events observed. Each plot corresponds to
a transcript, and each data point to an individual probeset. A. Gene expression level changes of LRAP,
including alternative splicing of a cassette exon. B. Differential 3′ UTR change of ARTS-1 resulting in long and short isoforms with alternative stop codon usage. C. Expression of two TCL6 transcript isoforms that contain different 5′ and 3′ ends. D. Increasing significance and fold change expression levels towards the 3′ end of the CCT2 gene, suggesting genetic variation associated with mRNA stability
(reprinted from Kwan et al., 2008).

CONCLUSION
Biology
Alternative splicing is rapidly being recognized as an important mechanism regulating numerous
biological processes. Identification of alternative splicing and alternative isoform expression provides
us with a new avenue for understanding the diversity and the complexity of the human genome. Several
recent studies have demonstrated the presence of variation in gene expression levels among humans.
Furthermore, such variation has been shown to have a genetic basis. It is believed that these expression
differences among individuals may be responsible for downstream phenotypic differences, including
susceptibility to genetic disorders. In our research, we show that, in addition to gene expression level
differences, a significant amount of variation affects the types of isoforms being expressed. This variation also has a strong genetic component, and hence, the effect of common genetic variation in humans
is much more complex than previously believed. In fact, we show that the effect of SNPs on isoform
expression (initiation, splicing, and termination), is at least as common as the effect on overall levels of gene expression. The downstream phenotypic effects are likely to be substantial; some associations
with genetic disorders already exist, e.g. a 3′ UTR variant of the IRF5 gene has recently been associated
with susceptibility to lupus (Reddy et al., 2007). The full extent of the effects of isoform variants will
be revealed in the near future.

Technology
Historically, our understanding of alternative splicing has been limited to individual laboratory experiments. Analysis of EST libraries provided the first glimpse into the genome-wide extent of this
phenomenon. Splicing-sensitive microarrays now place the genome-wide analysis into the hands of
individual researchers. The Exon Array is the first commercially available tool offering whole-genome
coverage, in theory targeting all possible alternative splicing events. Our research, and studies from
several other groups (Clark et al., 2007; Yeo et al., 2007) prove that the Exon Array can indeed be used
to detect alternative splicing and isoform differences in a variety of systems. We are very excited by
this technology and are currently applying it in several experimental settings. However, several words
of caution should be given. This is not yet an out-of-the-box product. A large number of pre- and post-processing steps are necessary. Data should be filtered to remove systematic artefacts, such as the effect
of SNPs, unresponsive, cross-hybridizing, or saturated probes. Final results must be carefully inspected
to determine the type of splicing or isoform event represented by the signal, and whether the signal
makes sense in the context of the gene structure or is likely to be a false positive. No automated methods currently exist to perform those steps, and it is unlikely that such a process can be fully automated.
Even given specialized visualization tools, such as custom versions of genome browsers or commercial
software, the researcher must expect to spend a large amount of time poring over the final data, rather
than simply obtaining a list of genes. Given the massive amount of data produced, false positive results
will remain a problem. This may be partially alleviated by development of more appropriate statistics
and optimizing array design, but the basic multiple testing problem cannot be avoided. If limiting false
positives is a priority, researchers may wish to narrow their analysis to a subset of the data by focusing
on events only supported by prior data (ESTs) or genes involved in certain biological processes.

FUTURE DIRECTIONS


Whole-genome analysis of alternative splicing has been made possible largely due to recent technological
advances, allowing cost-effective manufacture of oligonucleotide microarrays able to interrogate millions
of probe sequences. The Exon 1.0 ST Array is the first-generation comprehensive splicing array. The current design reveals enormous potential, but also some limitations. The next generation
splicing microarrays are likely to combine exon body probes with junction probes. This step should
improve the sensitivity and allow detection of additional types of isoform changes. We foresee that
in parallel with the development of appropriate statistical and visualization methods, splicing-sensitive
microarrays will facilitate major breakthroughs in investigations of splicing and isoform variations in
the coming years.
Looking ahead to a more distant future, we expect that high-throughput parallel sequencing will
soon become competitive with microarray technology, and may eventually surpass it as a genome-wide transcriptome profiling tool. Several groups are currently optimizing parallel sequencing-based gene expression analyses. With coverage currently reaching 2 billion bases per sequencing run, these
methods are likely to produce first results within one or two years. However, in order to simultaneously
monitor all splice junctions in a quantitative manner, sequencing coverage will have to reach levels at
least 10-fold higher than those necessary for gene expression level analysis. Sequencing-based analysis
of alternative splicing will ultimately resemble prior EST-based techniques, where short sequence reads
will be aligned to the genome, and the presence and frequency of alternative splicing will be determined
by counting the splice variants.

REFERENCES
Alberts, R., Terpstra, P., Li, Y., Breitling, R., Nap, J. P., & Jansen, R. C. (2007). Sequence polymorphisms
cause many false cis eQTLs. PLoS ONE, 2, e622.
Altshuler, D., Brooks, L. D., Chakravarti, A., Collins, F. S., Daly, M. J., Donnelly, P., et al. (2005). A
haplotype map of the human genome. Nature, 437(7063), 1299-1320.
Black, D. L., & Graveley, B. R. (2006). Splicing bioinformatics to biology. Genome Biol, 7(5), 317.
Cheung, V. G., Spielman, R. S., Ewens, K. G., Weber, T. M., Morley, M., & Burdick, J. T. (2005).
Mapping determinants of human gene expression by regional and genome-wide association. Nature,
437(7063), 1365-1369.
Churchill, G. A., & Doerge, R. W. (1994). Empirical threshold values for quantitative trait mapping.
Genetics, 138(3), 963-971.
Clark, T. A., Schweitzer, A. C., Chen, T. X., Staples, M. K., Lu, G., Wang, H., et al. (2007). Discovery
of tissue-specific exons using comprehensive human exon microarrays. Genome Biol, 8(4), R64.
Claverie, J. M. (2001). Gene number. What if there are only 30,000 human genes? Science, 291(5507),
1255-1257.
Faustino, N. A., & Cooper, T. A. (2003). Pre-mRNA splicing and human disease. Genes Dev, 17(4),
419-437.
Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., et al. (2003).
Exploration, normalization, and summaries of high density oligonucleotide array probe level data.
Biostatistics, 4(2), 249-264.
Kwan, T., Benovoy, D., Dias, C., Gurd, S., Provencher, C., Beaulieu, P., et al. (2008). Genome-wide analysis of transcript isoform variation in humans. Nature Genetics. Published online: 13 January 2008.
Kwan, T., Benovoy, D., Dias, C., Gurd, S., Serre, D., Zuzan, H., et al. (2007). Heritability of alternative
splicing in the human genome. Genome Res, 17(8), 1210-1218.
Lee, C., & Roy, M. (2004). Analysis of alternative splicing with microarrays: successes and challenges.
Genome Biol, 5(7), 231.
Modrek, B., Resch, A., Grasso, C., & Lee, C. (2001). Genome-wide detection of alternative splicing in
expressed sequences of human genes. Nucleic Acids Res, 29(13), 2850-2859.


Reddy, M. V., Velazquez-Cruz, R., Baca, V., Lima, G., Granados, J., Orozco, L., et al. (2007). Genetic
association of IRF5 with SLE in Mexicans: higher frequency of the risk haplotype and its homozygosity
than Europeans. Hum Genet.
Siepel, A., Diekhans, M., Brejova, B., Langton, L., Stevens, M., Comstock, C. L., et al. (2007). Targeted
discovery of novel human exons by comparative genomics. Genome Res, 17(12), 1763-1773.
Spielman, R. S., Bastone, L. A., Burdick, J. T., Morley, M., Ewens, W. J., & Cheung, V. G. (2007).
Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet,
39(2), 226-231.
Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc Natl Acad
Sci USA, 100(16), 9440-9445.
Stranger, B. E., Forrest, M. S., Clark, A. G., Minichiello, M. J., Deutsch, S., Lyle, R., et al. (2005). Genome-wide associations of gene expression variation in humans. PLoS Genet, 1(6), e78.
Xing, Y., Kapur, K., & Wong, W. H. (2006). Probe selection and expression index computation of Affymetrix Exon Arrays. PLoS ONE, 1, e88.
Yeo, G. W., Xu, X., Liang, T. Y., Muotri, A. R., Carson, C. T., Coufal, N. G., et al. (2007). Alternative
splicing events identified in human embryonic stem cells and neural progenitors. PLoS Comput Biol,
3(10), 1951-1967.
Zhang, C., Li, H. R., Fan, J. B., Wang-Rodriguez, J., Downs, T., Fu, X. D., et al. (2006). Profiling alternatively spliced mRNA isoforms for prostate cancer classification. BMC Bioinformatics, 7, 202.

Key Terms
Allelic Association: A statistical association of a genetic marker allele with a phenotypic trait. Here,
we use association analysis to detect SNPs statistically correlated with changes in isoform-level expression. While association does not directly imply causation, it is highly likely that a causative genetic
variant is in linkage disequilibrium with the significant SNP marker.
Alternative Splicing: A mechanism which results in the production of several mRNA variants from
a single genomic locus, by preferential inclusion or exclusion of certain splice sites or exons.
EST: Expressed sequence tag. Short sequence reads are produced on a large scale from cDNA libraries. EST sequencing allowed quantification of known transcripts, detection of novel genes, and discovery of novel
isoforms.
Exon Array: A type of microarray using probes targeted to individual exons within each gene.
Exon Arrays may be used to measure the expression of an entire transcript, but also detect higher level
changes, such as alternative splicing and other transcript isoform differences.
Pre-mRNA Splicing: A process which removes intronic sequences from the precursor messenger
RNA of eukaryotic genes, to produce mature messenger (m)RNA.


Isoform: In the context presented here, an isoform is one of the transcript variants produced by
each locus. A gene isoform can result from alternative splicing, alternative transcription initiation, or
polyadenylation.
SNP: Single nucleotide polymorphism. SNPs are single base pair mutations which have been driven
to detectable frequencies in human populations. On average, two human individuals will differ at 1
polymorphic site for each 1000 bp of DNA. The vast majority of SNPs are likely to be neutral, but some
may affect phenotypic traits.


Chapter XVI

Gene Expression in Microbial Systems for Growth and Metabolism
Prerak Desai
Utah State University, USA
Bart Weimer
Utah State University, USA

Abstract
The use of systems biology to study complex biological questions is gaining ground due to the ever-increasing number of genetic tools and genome sequences available. As such, systems biology concepts
and approaches are increasingly underpinning our concept of microbial physiology. Three tools for use
in functional genomics are gene expression, proteomics, and metabolomics. However, these tools produce
such large data sets that we sometimes become paralyzed trying to merge the data and link it to form
a consistent biological interpretation. Use of functional groupings has relieved some of the issues in
merging data for biological meaning. Statistical analysis and visualization of these multi-dimensional data
sets are needed to aid the microbiologist, which brings additional methods that are often not familiar.
Progress is being made to bring these diverse data types together to understand fundamental metabolic
processes and pathways. These efforts are paying tremendous dividends in our understanding of how
microbes live, grow, survive, and metabolize nutrients. These insights allow metabolic engineering to
progress and allow scientists to further define the mechanisms of metabolism.


Introduction
Systems biology brought a great challenge to the microbial world: produce genome-scale data that is
integrated into a complete biological picture using specific genes. In spite of this challenge, systems biology
is increasingly underpinning our concept of microbial physiology based on genome sequence. Initially,
production of a genome sequence limited implementation of this paradigm. However, the rate at which
new genome content is accumulating is staggering and is the basis of new avenues for discovery.
Publicly available genome sequence now exceeds 700 finished genomes with an additional 1,700
genomes in process of sequencing that provides access to at least 3,600 individual microbial genomes
with an additional 116 metagenome projects underway (GOLD, 2008; www.genomeonline.org). These
projects are challenging scientists' ability to collect, process, and overlay a biologically meaningful interpretation of the data. Application of that information to make biologically informed decisions is also
a daunting challenge that requires a fresh perspective and new skill sets that leverage genomic-based
tools to answer specific biological questions.
The heart of the systems biology discovery lies in the new fields of comparative and functional genomics, along with proteomic and metabolite profiles (Fields, 2000). Comparison of genomes to assess
the link between structural similarity and functional expression is fully enabled with access to genome
content. Comparative analyses of bacterial genomes provide new information about the dynamic interchange of DNA between microbes (Hughes, 2000). Comparing sequenced genomes is an excellent
approach to explore genome plasticity and how it impacts the metabolism of microbes.
Approaching microbial metabolism from a systems biology perspective is a new position that requires scientists to think of systems of activities that are composed of multiple individual components carrying out those activities. Use of gene ontologies (GO) and clusters of orthologous groups (COGs) is very helpful
for this aspect of comparison since specific and common terms are used in a hierarchical classification
system, and are essential when using functional genomic tools to drill down to individual conversations
(i.e. activities). These tools include gene expression profiling, protein expression profiling, metabolomics
profiles, and new statistical methods.
Functional genomics must incorporate data from each set of tools to examine how organisms utilize the genome potential (via gene expression initially). Since most genomes have thousands of genes that are monitored, the task is like trying to monitor all the individual conversations in a crowded stadium at one moment in time. In essence, functional genomics documents a cell's many conversations as they occur simultaneously. The conversations are defined by gene expression (gene arrays), protein interactions
(proteomics), and small molecule biochemistry (metabolomics) that create a multi-dimensional view of
the cell. Deciphering these conversations in turn outlines the web of actors in the metabolic networks,
providing an unprecedented view of how a living cell carries out the many functions of growth, survival,
pathogenicity, and metabolism.
Production of multi-dimensional data using these techniques produces very different types of data that
need a common thread that links the data types. Such large data sets lead to an interpretational paralysis that limits the linkage between the functional genomics data and biological relevance or meaning.
Therefore, the analytical phase must seek to fully integrate all of these tools using new statistical tools
that are also unfamiliar to most microbiologists, which often exacerbates interpretation difficulties when
the genotype and the phenotype are at odds. Taken together, it is clear that the use of bioinformatics
is essential for mining genome content, but also essential to bring new scientific abilities that
truly bring biologically meaningful insights via systems biology.


Systems Biology and Microbes

Systems biology brings a new perspective to the life sciences generally, but has specific impact on microbiology due to the nature of the rapid evolution and small genome that is the essence of a microbe.
Due to the small genomes in microbes, they lend themselves well to rapid, short-term sequencing efforts because they have limited complexity. Often only a single small circular chromosome with limited
extrachromosomal DNA (i.e. plasmids) is found in microbes, as compared to eukaryotic organisms that
contain multiple chromosomes that are very complex and significantly larger than microbes. Due to
the general ease of sequencing microbes, the number of microbial genome projects around the world exploded
with isolates coming from various ecologies, including unique environmental organisms and human
pathogens (see NCBI-www.ncbi.nlm.nih.gov/genomes/lproks.cgi for a complete listing). This in part is
driving the doubling of NCBI's database every 18 months (Pennisi, 2003).
The genome sizes range from 13.03 Mb for the myxobacterium Sorangium cellulosum to 0.16 Mb for the obligate endosymbiont Candidatus Carsonella ruddii PV. While the accumulation of genomes
enables more and more fields within microbiology to participate in systems biology, shifting from generating more genome sequence to the more difficult aim of comprehending the impact of specific functional
genomic patterns is an important and substantial transition that faces microbiology today.
Direct comparison of genome alignments enables assessment of organization between strains, species, and genera to give a picture of structural conservation (Figure 1) that is beyond the use of 16S
DNA sequence. However, genome structure does not predict phenotype, even among closely related
subspecies. For example, the genome structure between Lactococcus lactis ssp. lactis IL1403 compared
to Lactococcus lactis ssp. cremoris SK11 is conserved, except for a small inversion. However, the strains
behave very differently in the same growth conditions for growth and gene expression (Yie et al., 2004).
With the extensive amount of genome information available it is possible to redefine the classification of
microbes based on their genome similarity (known as phylogenomics) rather than the 16S DNA or phenotype (Anderssen and Fuxelius, 2005). Gene duplication, translocation, inversion, deletion, and horizontal
transfer facilitate genome rearrangement, which is becoming an increasingly common observation in
new genomes from environments that have high selective pressure. Presumably, such rearrangements
mediate rapid strain evolution allowing adaptation of the strain to the selective pressure (Hughes, 2000).
Use of phylogenomics to classify microbes is a complex and emerging area of classification that will
produce new phylogeny methods to represent evolution and relatedness between microbes in the near
future. Koonin and Galperin (1997) examined this approach with relatively few genomes to inform their
conclusions that structure is not conserved, but rather gene exchange in distantly related microbes is
common for the conservation of function via many different genome organizational structures.
Tools to probe conservation of function are clearly needed to bring focus to the conundrum of the
lack of structure conservation with the common phenotypes noticed among genera or species. To aid
comparative genomics a technique called comparative genomic hybridization (CGH) is becoming
increasingly popular. This is based on use of gene expression arrays that have genomic DNA hybridized to the array rather than labeled cDNA. This allows one to determine genomic similarity without
sequencing every strain of interest. Protocols for this approach are being standardized and completely
depend on the genome that was used to create the array (Guinane and Fitzgerald, 2008). If genes are
missing in the reference genome then no data will be available for the CGH comparison. However, this
method is inexpensive, fast, and provides a very rich data set to compare closely related organisms for
the conservation of functionally related genes independent of the location on the genome.
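Once hybridization ratios are in hand, the comparison reduces to calling each gene present or divergent/absent in the test strain. A hedged sketch in R, with a hypothetical vector 'log2_ratio' of test-versus-reference ratios per gene and purely illustrative thresholds:

    call <- cut(log2_ratio,
                breaks = c(-Inf, -1, -0.5, Inf),
                labels = c("absent/divergent", "uncertain", "present"))
    table(call)   # summary of gene content relative to the reference strain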


Figure 1. Synteny plots of specific microbes to demonstrate conservation of genome structure

Gene Expression Profiles


With a genome in hand, creation of a whole-genome gene expression array is the next logical step for functional analysis of genome sequence and gene regulation. Construction of gene expression arrays is a critical step to enable creation of reliable estimates of gene expression profiles (Table 1). This is
done by placing probes to various locations in a gene to assess the expression of that gene during growth
and environmental treatment. Cellular RNA is isolated from an organism and converted to cDNA (made
from RNA using reverse transcription PCR) as a tag is incorporated. Often the tag is fluorescent but
other options are available. The tagged cDNA is hybridized to the genome array to produce a signal.
Two main types of arrays are in use, differing in how the probe is deposited on the solid medium: spotted arrays and arrays produced by photolithography. Variability in spotting leads to variability in expression data, which has limited the use of spotted arrays, except in carefully controlled conditions.
Gene expression arrays produce an enormous amount of data for each experiment that requires
immense data-handling capacity. For example, an experiment with a DNA array of an entire genome,
even a small one with 2,500 genes requires approximately 27,500 individual probes (considering 11
probes per gene) for a single genome. A single experiment done using two treatments (e.g. a single variable) and three replications results in 165,000 discrete data points for analysis that need to be acquired, processed, analyzed for validity, and compared statistically before biological meaning can be assigned.


Table 1. Options to produce gene arrays for expression and CGH

Medium: Bead; Membrane (nylon); Slide; Solid glass; 3D glass; Membrane
Probe: >50mer oligonucleotide; cDNA; <25mer oligonucleotide; Specific or degenerate; Direct or indirect label; Use of a linker; Probes per gene
Spotting method: Ink jet; Pin (solid or slotted); Photolithography; Nanocantilever
Signal: Dual color; Single color; Chemiluminescent; Fluorescent; Radioactive

Subsequently, these significant genes are associated with a biological function to determine the metabolic difference between the test conditions. The data set balloons to nearly 1 million data points when
a time course is done for just six different times. As one can imagine, computer science and statistics
become an integral part of the data analysis.
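The arithmetic behind these figures is easy to reproduce, for example in R:

    genes <- 2500; probes_per_gene <- 11
    probes <- genes * probes_per_gene   # 27,500 probes on one array
    points <- probes * 2 * 3            # 2 treatments x 3 replications = 165,000 data points
    points * 6                          # a six-point time course: ~1 million data points
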
Gene expression profiles produce a long list of genes that change during various treatments. Bringing
biological meaning is a significant challenge if the variability is too high or treatment conditions are
too complex. In some cases protein production and gene expression do not agree. This is more common in
eukaryotic data sets than in prokaryotic studies where probe construction is less complex, exon variants
do not exist, and experimental conditions are easier to control. Consequently, some investigators prefer
to examine the proteome of microbes to assess the proteins that are present during the treatments. One
advantage of gene expression profiles is that transcripts for integral membrane proteins can be observed along with those for cytosolic proteins in the same sample preparation.
Proteomics involves monitoring whole-cell protein expression. This approach uses a more traditional
tool, two-dimensional gel electrophoresis (2D), to separate the proteins present in a cell at a specific
point in time during growth. Like RNA expression, protein-expression profiles from cells immersed in
different environmental conditions can be compared. The protein spots on a 2D gel that differ between
conditions are eluted or cut out and digested by specific proteases and subjected to mass spectrometry or tandem mass spectrometry. The resultant mass fingerprint is used to link the 2D gel spot to a
specific gene within the genome. Alternatively, more investigations use mass spectrometry coupled
to a chromatography interface that does a pre-separation of trypsin-digested proteins before the mass
determination. Once the mass is determined a database is searched to determine the protein identification. Using either technology, proteins expressed from the genome are cataloged and monitored and
collectively known as the proteome.
Whole-cell protein expression profiling provides a fundamentally different, yet complementary, view
of cellular systems that is linked to gene profiling. For example, proteins may persist in a cell longer (or
degrade faster) than their cognate mRNA transcripts, such that RNA analysis may not give an accurate
view of the proteins available in a cell at any moment in time. Conversely, proteome analysis often
misses proteins that are bound in the membrane or are not in the water phase of the cell mass. Hence,
these complementary methods provide a fuller picture and understanding of cellular gene expression when used together. This process provides a wealth of information about the cell and how it responds to
the environment. With these data, strategies can be formulated to modulate bacterial survival, as well
as modify metabolic processes during different environmental conditions or challenges.
An advantage of gene expression over proteomics is that there is no need for cell division to observe
changes in expression. Stuart et al. (1998) determined that lactococci lose the ability to produce colonies after carbohydrate exhaustion, which is accompanied by release of methionine and serine into the
medium. Ganesan et al. (2006) extended this study using gene expression studies to confirm Stuart's
work and further demonstrate that the cells continue to transcribe RNA even in the absence of colony
formation, which is due to repression of the bacterial cytoskeleton (fts). With the loss of colony formation and the production of amino acids without sugar, they hypothesized that global gene expression
changes were being regulated by multiple mechanisms to change the metabolism of the non-culturable
cell allowing survival. This was proven and creates a link between sugar, regulation of metabolism and
colony formation. These "dead" cells are intact, regulate gene expression, and continue to metabolize
peptides and amino acids to end products that subsequently change the local environment that again
changes the gene expression and metabolism in a complex interplay between the bacterial system and
the environment. This is all done in cells that we cannot grow or see on a plate. These observations
raise questions about gene expression regulation, growth, and how uncultured cells change the community ecology.

Metabolic Connections to the Genome and Expression Profiles
In addition to structural gene products, metabolism is of interest to understand how microbes grow,
persist, produce industrially important chemicals, process toxins, and metabolize the nutrients in their
environment broadly. To enable this view, various groups produced metabolic reconstruction software
that uses gene sequence data to draw the metabolic potential for that specific organism. Importantly, it
provides a theoretical touchstone for the possible gene expression profiles to further enable biologically meaningful conclusions.
Metabolic reconstruction maps (Figure 2) also enable one to conduct comparisons of metabolic capabilities that are predicted by the genome sequence using in silico tools, such as KEGG and Pathway Tools, that are designed to predict the metabolic potential from the genome content (Kanehisa, 2004;
Karp et al., 2005). Based on the Pathway Tools platform, a generalized and dynamic inventory of metabolic pathways called MetaCyc allows one to see all of the known biochemical reactions (Caspi et al., 2006). In the same approach, BioCyc, a web-based database, was created that covers over 160 organisms and allows user-defined customization to display or compare metabolic pathways (Karp et al., 2005).
This provides users with incredible predictive abilities to study the most important pathways and their
interconnections to microbial growth and ecosystems.
Additional tools for metabolite interconnections are being developed to predict the complex web of
metabolism. This approach provides a new level of insight into the coupling between genetic, metabolic, and expression events that lead to phenotypic traits. Weimer's group created a tool as a plug-in to Pathway
Tools that allows pathway interconnections to be displayed for any queries of genes, intermediates or
proteins. For example, a query using glutamic acid resulted in over 150 interconnections in Lactococcus
cremoris SK11 that indicates the central metabolic role of this amino acid. Complexity of metabolism for glutamic acid is expected, but it is unexpected to realize the extent of the pervasive connections for
this amino acid in lactococcal metabolism.
Predictions of metabolism can be verified with gene expression and metabolite analysis. A systems
biology view of the microbial cell is more fully enabled using gene expression profiles with metabolic
reconstruction maps. However, one still lacks the ability to prove links between gene expression profiles and the metabolic processes in a high-throughput manner. Therefore, a more direct measure of the small
molecule flow is needed on the same scale as the genome and gene expression profiles to fully enable
a system level assessment.

Metabolomic Profiling


Metabolomic profiling is a high-throughput method to measure small molecule metabolism via mass spectrometry combined with gas or liquid chromatography. The challenge with metabolomics lies in sample preparation coupled with individual compound identification based on the observed mass.
Many compounds have the same or very similar masses, making identification the bottleneck in this
technology. Creating libraries of masses coupled with retention times is in progress by a number of
manufacturers and academic research laboratories.
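The library search itself is conceptually simple, even though building reliable libraries is the hard part. A sketch in R, with hypothetical tables 'peaks' (observed mz and rt) and 'lib' (reference mz, rt and compound name) and purely illustrative tolerances:

    match_peak <- function(mz, rt, lib, mz_tol = 0.01, rt_tol = 0.2) {
      # a peak matches a library entry when both mass and retention time agree
      hit <- abs(lib$mz - mz) <= mz_tol & abs(lib$rt - rt) <= rt_tol
      lib$compound[hit]
    }
    # e.g. match_peak(peaks$mz[1], peaks$rt[1], lib)
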
Metabolomics can be applied to microbial metabolic questions by profiling the full set of compounds,
by following specific compounds, or by determining the interchange of atoms among metabolites.
Each of these areas provides a different answer to the overall question of metabolism that is suggested
by gene or protein expression profiling. Metabolite profiling is common at the beginning of projects
or when communities are being examined. The lack of unique differences between compounds severely
limits the full integration of metabolite profiling. Hence, it is common to restrict metabolite analysis
to specific types of compounds that flux through the cell in response to changes in gene and protein
expression caused by environmental changes.
Often, gene expression profiles need additional or supporting evidence to provide assurance that
the metabolism extrapolated from the gene expression pattern is reliable when overlaid on metabolic
reconstruction maps. Ganesan et al. (2006) used NMR, gene expression analysis, and metabolic network
maps to delineate the exact metabolic pathway converting branched-chain amino acids to branched-chain
fatty acids. This study demonstrates the usefulness of gene expression for understanding complex metabolic
pathways and refined the exact genes that enable the committed steps in amino acid metabolism. Further, this study found that we still lack specific understanding of the metabolism of central intermediates,
such as pyruvate, because the carbon atoms were traded between unexpected intermediates and found
in compounds previously not known to participate in fatty acid production.
A systems biology approach seems to always provide unexpected results that challenge biological dogma. In this light, it is critical to determine how to best use the multi-dimensional data sets
of gene expression, protein expression, metabolites, and the in silico prediction models based on the
genome content, and to determine what is significantly different from mere biological noise. Combining these data sets
provides a unique look at the cell that requires validation and a perspective on variability due to regulatory networks that are yet to be defined.


Figure 2. Diagram of -omics integration with a metabolic reconstruction map based on the genome
sequence of L. cremoris SK11. (Figure elements: genomics/DNA, regulation and regulatory circuits, RNA, protein, metabolites, metabolism, growth, survival.)

Bioinformatics and Statistics


Bioinformatics and statistics were first used together to predict gene similarity between organisms, as
well as to predict genes among the vast amount of base calls in a genome. They are being used again to
merge the vast amount of data produced by gene expression, proteomics, and metabolomics. Methods to merge these multidimensional data sets depend on defining overriding rules or common types of
systems. Bioinformatics and statistics are required to provide a systematic context with an estimation of the
probability that real biology was observed, and not just biological noise due to variation in subtle
adaptation to the local environment. Systems biology seeks to define the complexity of the system;
however, a base language and set of common rules to achieve this task are lacking.
To bridge that gap, the use of models is emerging as a preferred method. This approach borrows tools
and techniques from engineering, circuit design, and control theory to provide a basis for understanding
the system. However, this requires biologists, engineers, computer scientists, and mathematicians to
communicate on common ground, which is no small feat. The first example of this was software designed
to display and analyze the entire genome, but today many examples of analysis exist that predict
genes, intergenic regulatory motifs, metabolic reconstructions from the genome, and metabolic flux.
Before higher-level combinations can occur, the initial analyses of functional genomic techniques
require statistical models that define normalization methods and address the multiple-comparison issue
of high-dimensional data sets. These tools are available and are the beginnings of standardization. For
example, it is common to incorporate an array normalization technique called robust multiarray average
(Irizarry et al., 2003; Storey and Tibshirani, 2003). Normalization allows comparison between probes
and between arrays, which is commonly used to determine the fold change. However, one must assess the
statistical significance of the change and not rely on the fold change alone. This can be done using a pairwise t-test, except that the large number of probes on an array and the low number of replications cause
a bias that needs to be corrected; the correction yields a false discovery rate, reported as a q-value rather than a p-value (Tusher et al., 2001; Dudoit et al., 2002). Desktop software called
SAM is freely available for this analysis (Tusher et al., 2001; www-stat.stanford.edu/~tibs/SAM/index.
html). Bioconductor (www.bioconductor.org), built on the R package (www.r-project.org), is a large worldwide effort that produces
statistical solutions for multidimensional data sets and is becoming more accessible to the biologist. The same principles can be applied to gene or protein expression data sets.
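A minimal sketch of the statistical principle just described is given below in Python. It is not the SAM algorithm itself: it uses a plain per-gene t-test followed by Benjamini-Hochberg false discovery rate control as a stand-in for SAM's permutation-based q-values, and the simulated data, column layout and thresholds are illustrative assumptions.

```python
# Sketch: per-gene tests with FDR control instead of relying on fold change alone.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_genes, n_reps = 2000, 3                      # many probes, few replicates
control = rng.normal(8.0, 1.0, (n_genes, n_reps))
treated = rng.normal(8.0, 1.0, (n_genes, n_reps))
treated[:50] += 2.0                            # 50 genes with a true log2 change of 2

log2_fc = treated.mean(axis=1) - control.mean(axis=1)
t_stat, p_val = stats.ttest_ind(treated, control, axis=1)

# Benjamini-Hochberg adjustment; SAM's q-values follow the same false-discovery-rate idea.
reject, q_val, _, _ = multipletests(p_val, alpha=0.05, method="fdr_bh")

significant = np.where(reject & (np.abs(log2_fc) >= 1.0))[0]
print(f"{significant.size} genes pass FDR < 0.05 and |log2 FC| >= 1")
```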
Once the statistical analysis is completed, a list of significantly changing genes or proteins is produced.
However, unless additional data are mapped onto the significant hits, little biological information is derived
from the changes. This is where bioinformatics beyond genome structure is burgeoning. Use of GO
terms and COG assignments provides a systematic approach to classify the broad (network) changes
in the cell. These schemes link function with the location of specific genes and gene products, which is very
useful in assessing broad changes. For example, Ganesan et al. (2007) used this approach to determine
the changes in all the dehydrogenases in the genome due to carbon starvation. Mapping the gene name,
function (GO or COG), and change due to the treatment are the beginning steps toward defining the biological meaning. However, the limitation is again the vast data set that needs additional annotation. This
requires additional automated tools to provide these functions for the biologist. Doing this by hand is
not possible considering that the reduced data set will likely still contain 200 to 1,000 genes per
treatment or condition.
A method to assess over-representation of specific classes of genes or proteins was developed by Gentleman et al. (2004). This method determines statistically significant enrichment within a GO category
due to changes of specific genes carrying that GO term. It can be thought of as using the probability that
random chance alone would lead to the collective changes observed in that GO category. If the category is significant, it is
unlikely that chance produced this type of broad change among its genes. Use of this strategy
further allows the biologist to determine a link between system-wide changes and specific gene or
protein changes that are biologically meaningful and relevant for further study.
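The sketch below illustrates the over-representation idea with the hypergeometric test commonly used for this purpose: given a list of significantly changed genes, how likely is it that a GO category contains as many of them as observed by chance alone? The counts are invented for illustration; Bioconductor tools such as those of Gentleman et al. implement the same test against real annotation.

```python
# Sketch: GO over-representation via a hypergeometric test (invented counts).
from scipy.stats import hypergeom

genome_size = 2300          # annotated genes in a hypothetical genome
significant_genes = 250     # genes changed in the experiment
go_category_size = 40       # genes annotated with the GO term of interest
overlap = 12                # changed genes that carry that GO term

# P(X >= overlap) when drawing 'significant_genes' genes without replacement
p_enrichment = hypergeom.sf(overlap - 1, genome_size, go_category_size, significant_genes)
print(f"Enrichment p-value for the GO term: {p_enrichment:.3g}")
```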
Visualization is very difficult for data of high dimensionality and is linked to the
statistical method used to determine the importance of the results. Clustering and principal component
analysis (PCA) are commonly used for each of these data types. Cluster analysis lends itself well to heat
maps, while PCA tends to be better suited to spider plots. Alternatively, one can map the changes onto the
metabolic reconstruction map to provide a metabolism context. Strategies to visualize the data depend
on the question being asked. Individual genes or reactions are easily defined, but visualization of the
entire system is difficult and still depends on whether the question concerns regulation or direct reactions. Traditionally, metabolism is thought of in a linear fashion, represented by the classical biochemical maps
and the metabolic reconstruction maps. However, it is becoming increasingly clear that these are highly
interconnected circuits that deviate based on the proteins and metabolites present at any point in time. To
further link gene expression and proteomic data, many scientists are mining microbial genomes for hidden
gems of metabolic circuits to create custom metabolic maps that provide a touchstone for interpretation
of the genetic possibilities in a specific organism. Coupled with the development of high-throughput techniques
that allow large-scale biochemistry, these maps provide additional insight into the meaning of the merged data sets
and help define the limits and conditions needed for specific production of metabolic end products. The addition
of metabolic flux prediction software is further expanding interpretation.
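As a small, self-contained sketch of the two visualization strategies mentioned above, the Python code below applies hierarchical clustering (the basis of heat maps) and PCA to a random matrix standing in for normalized log-ratio expression data; the matrix dimensions and parameters are illustrative assumptions.

```python
# Sketch: cluster genes and project conditions with PCA on a mock expression matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
expression = rng.normal(size=(500, 6))        # 500 genes x 6 conditions (mock data)

# Cluster genes by expression pattern (average linkage on correlation distance).
gene_linkage = linkage(expression, method="average", metric="correlation")
clusters = fcluster(gene_linkage, t=4, criterion="maxclust")

# Project the conditions onto two principal components.
pca = PCA(n_components=2)
condition_coords = pca.fit_transform(expression.T)

print("genes per cluster:", np.bincount(clusters)[1:])
print("variance explained by PC1/PC2:", pca.explained_variance_ratio_)
```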

Summary and Future Trends


With all of these data analysis methods, microbiologists are forced to use unfamiliar statistical tools that
require computer science skills beyond their normal experience, producing a steep learning
curve in addition to all the other choices to be made about technology selection. Therefore, while this
approach brings many benefits in seeing the entire cellular conversation at a single time, this course of
research is for the committed scientist who really wants to use systems biology for discovery.
However, dedication opens a new world of discovery, with new perspectives that are likely more
realistic representations of what occurs in the microbial cell in response to its environment. For example, new concepts
that use these measurements define the cell as a circuit that is wired by the genome but transduced
by proteins and chemical reactions. This approach to systems biology of bacteria brings an absolute
need for collaboration between groups that are not traditionally connected: physics, computer science,
statistics, biology, and genetics. Communication between these groups is required to make substantial
progress in answering biological questions with such large, multi-dimensional data sets. However, this
challenge also brings a new view of biological systems that can be described by engineering principles
borrowed from communications and control theory, as was done by Alan Hodgkin and Andrew Huxley
more than 50 years ago in diagramming the circuit of a squid nerve cell. This approach is undeniably
possible for a microbe since the genome is limited at any point in time. However, if one allows mutation
and genome rearrangement, then the task of prediction becomes substantially more difficult.
Phylogenomics will likely become more popular as more genomes become available and new tools allow comparison from an evolutionary perspective of genes or proteins, GO or COG
categories, genomes, and ultimately strains or species. The high dimensionality of this approach is
somewhat limited by our perspective on evolution, gene transfer, and the intimate interactions between
microbes in their environment. However, as these perceptions change, additional progress will be made
in classification systems based on genome content and the expression methods used, which ultimately
translate into the cell's phenotype.
A newer technique called metagenomics will continue to impact our understanding of microbes
and their ecology. The over 116 data sets available were generated with the aim of defining an ecological
niche using a shotgun sequencing approach. During the analysis of the metagenomes produced, we immediately
learned that our understanding of the complexity of the oceans and soils on earth is sorely lacking.
The allure of metagenomics is the ability to produce genome sequence without growing
bacteria, a significant challenge in most ecologies. Moreover, one can produce enough sequence to
create entire genomes without needing to isolate or grow individuals in the community if a strategic
approach is used to increase, or begin with, staged complexity in the sample. This approach lends itself well
to extreme environments, allowing determination of the community members without the need to understand how
to grow the individuals. An example of this strategy is the metagenome of acid mine drainage, which
contained fewer than five members after the metagenome was analyzed (Tyson et al., 2004).
Additional data analysis tools and methods are needed to quickly link observation to biological meaning. While the statistical methods and databases provide a basis to build the framework of interpretation,
additional relational structures are needed to fully integrate the perspective of the genetic basis of growth,
metabolism, and persistence. Metagenomics is a reasonable method to interrogate complex communities
of uncultured organisms, but it brings another dimension and complexity to integrate, requiring
additional connectivity between biology, computer science, mathematics, and engineering.

References
Alon, U. (2007). An introduction to systems biology: Design principles of biological circuits (3rd edition). Chapman & Hall/CRC Press.
Andersson, S. G. E., & Fuxelius, H.-H. (2005). Phylogenomics for studies of microbial evolution. In Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics, 4: Bioinformatics. John Wiley & Sons, Ltd.
Caspi, R., Foerster, H., Fulcher, C. A., Hopkinson, R., Ingraham, J., Kaipa, P., Krummenacker, M., Paley, S., Pick, J., Rhee, S. Y., Tissier, C., Zhang, P. D., & Karp, P. D. (2006). MetaCyc: A multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res., 34, D511-D516.
Dudoit S., Yang, Y. H., Callow, M. J., & Speed, T. (2002). Statistical methods for identifying differentially
expressed genes in replicated cDNA microarray experiments. Statistica Sinica. 12, 111-139.
Fields, S. (2000). Proteomics in genomeland. Science, 291, 1221-1224.
Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier,
L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C.,
Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y., & Zhang, J. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol., 5, R80.
Guinane, C. M., & Fitzgerald, J. R. (2008). Microarray comparative genomic hybridization for the
analysis of bacterial population genetics and evolution. Bacterial Pathogenesis: Methods and Protocols
431, 47-53.
Hughes, D. (2000). Evaluating genome dynamics: The constraints on rearrangements within bacterial genomes. Genome Biology, 1, reviews0006.1-0006.8.
Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K. J., Scherf, U., & Speed, T. P.
(2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level
data. Biostat 4, 249-264.
Karp, P.D., Ouzounis, C.A., Moore-Kochlacs, C., Goldovsky, L., Kaipa, P., Ahren, D., Tsoka, S., Darzentas, N., Kunin, V., Lopez-Bigas, N. (2005). Expansion of the BioCyc collection of pathway/genome
databases to 160 genomes. Nucleic Acids Res. 33, 6083-9.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., & Hattori, M. (2004). The KEGG resource for
deciphering the genome. Nucleic Acids Res. 32, D277-80.
Knudsen, S. (2004). Guide to analysis of DNA microarray data. (2nd Edition). John Wiley & Sons,
Inc.
Koonin, E.V., & Galperin, M.Y. (1997). Prokaryotic genomes: The emerging paradigm of genome-based
microbiology. Curr Opin Genet Dev. 7, 757-63.
Liolios, K., Mavrommatis, K., Tavernarakis, N., Kyrpides, N.C. (2008). The genomes on line database
(GOLD) in 2007: Status of genomic and metagenomic projects and their associated metadata. NAR 36,
D475-D479.
Pennisi, E. (2003). Tracing life's circuitry. Science, 302, 1646-1649.
Riesenfeld, C. S., Schloss, P. D., & Handelsman, J. (2004). Metagenomics: Genomic analysis of microbial communities. Annual Review of Genetics, 38, 525-552.
Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. PNAS, 100, 9440-9445.
Tusher V.G., Tibshirani, R., Chu, G. (2001). Significance analysis of microarrays applied to the ionizing
radiation response. Proc Natl Acad Sci USA, 98, 5116-21.
Tyson, G. W., Chapman, J., Hugenholtz, P., Allen, E. E., Ram, R. J., Richardson, P. M., Solovyev, V. V., Rubin, E. M., Rokhsar, D. S., & Banfield, J. F. (2004). Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 428, 37-43.
Xie, Y., Chou, L-S., Cutler, A., & Weimer, B. (2004). Expression profile of Lactococcus lactis ssp.
lactis IL1403 during environmental stress with a DNA macroarray. Appl. Environ. Microbiol. 70,
6738-6747.


Key Terms
Bioconductor (www.bioconductor.org): An open source and open development software project for
the analysis and comprehension of genomic data.
R (www.r-project.org): R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible.
Significance Analysis of Microarrays (SAM; www-stat.stanford.edu/~tibs/SAM/index.html): An Excel plug-in that is used to analyze microarray data.


Chapter XVII

Alternative Splicing and Disease


Heike Stier
Charité Universitaetsmedizin Berlin, Germany
Paul Wrede
Charité Universitaetsmedizin Berlin, Germany
Jürgen Kleffe
Charité Universitaetsmedizin Berlin, Germany

Abstract
Alternative splicing is an important part of the regular process of gene expression. It controls time- and
tissue-dependent expression of specific splice forms and depends on the correct function of about 100
splicing factor proteins, of which many are themselves products of alternative splicing. It is therefore not
surprising that even minor sequence disturbances can cause mis-spliced gene products with pathological effects. We survey some common diseases which can be traced back to a malfunction of alternative
splicing, including cystic fibrosis, beta-thalassemia, spinal muscular atrophy and cancer. Cancer often
also results from mis-spliced splicing factors, leading to randomly spliced, non-functional isoforms
of several genes.

Introduction
The hypothesis of Beadle and Tatum (1941), that one gene codes for a single uniquely defined protein,
has been disproved many times during the past 30 years by the discovery of a number of ways in which a single
gene gives rise to different gene products. The differences may arise at the transcriptional level, during
RNA pre-processing, mRNA translation, or post-translational protein processing and folding. A recent
review by Boeckmann et al. (2005) describes the most important ways of gene product modification.
Alternative splicing increases the diversity of gene products by deriving different final mRNAs from
the same pre-mRNA transcript through alternative definition of the introns that are spliced out to derive the
final mRNA used for translation. But whereas the process of splicing, the interaction of different proteins forming the spliceosome and removing an intron, has been investigated and described in great detail by
Staley and Guthrie (1998), Will and Luhrmann (2001), Nilsen (2003) and Tazi et al. (2005), the process
of how splice sites are selected is not yet fully understood.
For each single splicing reaction the spliceosome is newly formed in interaction with the pre-mRNA
and acts in two basic steps, as seen in Figure 1. Cleavage at the donor site and ligation of the 5′ end of
the intron to the branch site (step 1) is followed by removal of the intron through cleavage at the acceptor site,
the 3′ end of the intron, and ligation of the neighboring exons (step 2). About one hundred interacting
molecular splicing factors, many of them snRNPs, are known to steer this complex process. Four
important signals on the pre-mRNA intron are essential to attract these splicing factors, form the
spliceosome, and perform splicing. These are the donor site at the 5′ end of the intron, the branch
point with the nucleotide A located about 17-40 nucleotides upstream of the acceptor site, the polypyrimidine region, and the acceptor site itself, which defines the 3′ end of the intron. Exonic and intronic
splicing enhancer or silencer sequence motifs additionally affect the attraction of these splicing factors. If hit
by mutations, all of these sequence signals are prone to change the pattern of splicing and to seriously
affect protein expression, eventually causing disease and death.
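As a toy illustration of the canonical sequence signals just listed (not taken from the chapter), the Python sketch below enumerates candidate GT...AG introns in a short, invented pre-mRNA sequence. Real splice-site prediction additionally scores the branch point, the polypyrimidine tract and enhancer/silencer motifs; this only locates the terminal dinucleotides.

```python
# Toy sketch: list GT...AG stretches that could act as introns in a short sequence.
import re

def candidate_introns(pre_mrna: str, min_len: int = 20):
    """Yield (start, end) positions of GT...AG stretches at least min_len apart."""
    for donor in re.finditer("GT", pre_mrna):
        for acceptor in re.finditer("AG", pre_mrna[donor.start() + min_len:]):
            end = donor.start() + min_len + acceptor.end()
            yield donor.start(), end

sequence = "ATGGCCGTAAGTTTCTCTCTTTTTTTCCCTTTCAGGCTGAA"  # invented example sequence
print(list(candidate_introns(sequence, min_len=20)))
```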
Krawczak et al. (1992) reported that about 15% of all known disease-causing point mutations directly
hit splice sites. This early publication did not consider mutations of exonic or intronic splicing enhancer
or silencer signals, which are known to be very important today. The Human Gene Mutation Database published by
Stenson et al. (2003) contains 61,106 registered mutations, of which 5,822 affect splicing events. From this
set, 3,633 are point mutations of the GT donor and the AG acceptor splice sites. Skipped exons induced by
these mutations often change the reading frame and create aberrant, non-functional proteins. Mutations
within exons which do not change the resulting proteins are considered neutral or silent. Still, they can
damage splicing enhancer or silencer motifs. Studies of the NF1 (neurofibromatosis type 1) and ATM
(ataxia telangiectasia mutated) genes by Teraoka et al. (1999) and Ars et al. (2000) found that up to 50%
of the disease-causing mutations imply splicing mistakes. Characterization of these splicing defects can
help to understand diseases and to find novel strategies for diagnostics and therapy.

Figure 1. The two major steps of splicing: cleavage at the donor site (5′ splice site) and forming the
lariat structure (step 1), followed by removing the intron and ligating the exons (step 2)

Even the normal process of alternative splicing is very complex. Mironov et al. (1999),
Johnson et al. (2003) and Gupta et al. (2004) reported alternative RNA splicing for 35% to 75% of
human genes. Although these numbers must be handled with care, they surely imply that mutations can
easily harm the natural balance of alternative splice products and lead to diseases and cancer.
After splicing, a so-called exon junction complex (EJC) remains on the processed mRNA 20-24 nucleotides upstream of the exon-exon junction, as found by Le Hir et al. (2000), Tange et al. (2004) and
Bono et al. (2006). This complex is important for post-processing of the finally spliced mRNA, such
as the early-stop-codon-triggered NMD (nonsense-mediated decay) pathway described by Charlet-B et
al. (2002). It can cause the complete lack of a protein required for health. More generally, the EJC influences
mRNA translation, surveillance, and localization. We now provide some interesting examples of well-known diseases caused by splicing defects.

Alternative Splicing in Common Disease


Cystic fibrosis (CF) is one of the most common hereditary diseases. It causes malfunctions of different organs including the lung, heart and liver. Within the lungs, abnormally thick mucus is produced, which
clogs the lung and leads to life-threatening lung infections. Responsible for this disease are mutations,
described by Kerem et al. (1989), which affect the CFTR (cystic fibrosis transmembrane conductance
regulator) gene. This gene codes for a cyclic-AMP-dependent chloride channel reviewed by Zielenski
and Tsui (1995). Of the 1,210 mutations in the Human Gene Mutation Database reported to cause CF,
168 affect splicing. In one of these mutations a C is changed to T, which generates a new, weakly scoring 5′ splice site within intron 19. It eventually leads to an exon extension by 84 bp which receives a
premature stop codon. The ratio between the correctly and aberrantly spliced transcripts determines
the degree of affliction.
Beta-thalassemia is an inherited blood disorder, mainly affecting people of the Mediterranean
regions and causing anaemia. The reason is a mutated beta-globin gene (HBB), so that the ability to
produce haemoglobin is hampered. Within the Human Gene Mutation Database 419 mutations of the
HBB gene are registered, of which 49 are reported to affect splicing. One of the point mutations is the
change of T to G within the poly-pyrimidine tract of intron 2, as shown in Figure 2. This mutation leads
to increased binding of the splicing inhibitor hnRNP C protein and down-regulates the recognition of
the downstream acceptor splice site. Patients with this mutation produce a great proportion of a non-functional transcript isoform, which lacks exon 3, and only a small number of full-length proteins. The
proportion of the full-length transcript determines the severity of the symptoms. This observation was
described by Sebillon et al. (1995). Another case is the G to A mutation in codon 26 (GAG to AAG),
which activates a cryptic 5′ splice site at codon 25 and reduces the amount of the correctly spliced protein. Suwanmanee et al. (2002b) succeeded in blocking this aberrant 5′ splice site with oligonucleotides,
re-establishing the correct way of splicing.


Figure 2. The T to G substitution in intron 2 of the β-globin gene creates a new intronic silencer sequence
responsible for hnRNP binding and inhibition of acceptor splice site recognition. The next downstream
acceptor site is used instead.

Spinal muscular atrophy (SMA) is the most frequent autosomal recessive disorder in humans.
It occurs in one of 10,000 cases and is characterized by a progressive degeneration of spinal cord
α-motor neurons, which results in skeletal muscle denervation and paralysis. Studies by Lefebvre et al.
(1995), Frugier et al. (2002) and Eggert et al. (2006) have shown that more than 96% of the SMA cases
are caused by the homozygous loss or mutation of the SMN1 gene (survival of motor neuron gene
1). In the Human Gene Mutation Database, 2 out of 22 mutations of the SMN1 gene are reported to be
associated with splicing defects. Meister et al. (2001) describe how the loss of the SMN1 protein leads to a global defect in pre-mRNA splicing. However, humans have a second, nearly
identical copy of the SMN1 gene, the SMN2 gene, which codes for the SMN2 protein and is located in
an inverted repeat on chromosome 5q13, as shown in Figure 3. A total loss of both genes leads to early
embryonic death. But the SMN2 gene can partially compensate for the loss of the SMN1 gene. It cannot
completely replace the SMN1 gene since, according to Lorson and Androphy (2000), more than 80% of
the mRNA transcribed from the SMN2 gene lacks the important exon 7 and leads to a non-functional
protein. Compared with the SMN1 gene, the skipping of exon 7 in transcripts of the SMN2 gene is caused by a
single nucleotide C to T mutation in exon 7. This substitution is silent but, as conjectured by Cartegni
et al. (2006), it supposedly destroys an exonic splicing enhancer motif. For the intact SMN1 gene
this enhancer supports the binding of the SF2/ASF splicing factor. The binding of this factor to the
considered splicing enhancer strengthens the otherwise weak acceptor splice site of exon 7. Cartegni
and Krainer (2003) discuss ways of changing the balance of SMN2 transcripts towards inclusion of
exon 7 in order to improve compensation for the loss of the SMN1 gene.
Hutchinson-Gilford progeria syndrome (HGPS) is a rare genetic disorder, observed in about one out
of four million humans, that causes premature rapid aging shortly after birth. One mutation leading to
HGPS is a change from GGC to GGT in codon 608 of the LMNA gene. The LMNA gene codes for the
A-type lamins A and C (intermediate filament proteins) which are part of the nuclear lamina. The described
mutation creates a cryptic donor splice site within exon 11, making it 150 nucleotides shorter. Hence the
resulting protein has an internal deletion of 50 amino acids close to its C-terminus and misses
an important internal proteolytic cleavage site as well as some phosphorylation sites which are necessary
for normal interaction with lamin C, which usually creates heterodimeric multiprotein filaments. A review
is given by Pollex and Hegele (2004). Figure 4 shows how the described C to T mutation generates a
much stronger but cryptic donor site, since a T at intron position 6 is found about three times more frequently
than a C. The wild-type cryptic donor site is three times weaker and hence remains unused.

Figure 3. Location (left) and splicing difference (right) of the SMN1 and SMN2 genes. A single nucleotide
C to T mutation within exon 7 presumably damages an exonic splicing enhancer that supports binding
of the SR protein SF2/ASF which is important for identification of the 5′ splice site of exon 7.

Figure 4. A point mutation from C to T within exon 11 of the LMNA gene generates a cryptic donor
site that causes skipping of the rest of exon 11. The resulting protein misses important sites for proper
functioning.
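The reasoning behind Figure 4 can be made concrete with a position-weight-matrix score: a donor site's strength is the sum of log-odds of its bases, so a single C to T change at intron position +6 raises the score when T is more common than C at that position. The frequency table in the Python sketch below is an invented toy, not the real human donor-site matrix.

```python
# Toy sketch: how a C -> T change at intron position +6 strengthens a donor site.
import math

# Base frequencies at donor positions +3..+6 of the intron (illustrative values only).
donor_freqs = [
    {"A": 0.60, "C": 0.10, "G": 0.10, "T": 0.20},   # +3
    {"A": 0.70, "C": 0.05, "G": 0.15, "T": 0.10},   # +4
    {"A": 0.10, "C": 0.05, "G": 0.75, "T": 0.10},   # +5
    {"A": 0.15, "C": 0.15, "G": 0.20, "T": 0.50},   # +6: T roughly 3x more frequent than C
]
background = 0.25

def donor_score(site: str) -> float:
    """Sum of log-odds (observed vs. background) over positions +3..+6."""
    return sum(math.log2(freqs[base] / background)
               for freqs, base in zip(donor_freqs, site))

wild_type, mutant = "AAGC", "AAGT"   # differ only at intron position +6
print(f"wild type: {donor_score(wild_type):+.2f}  mutant: {donor_score(mutant):+.2f}")
```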
Myotonic dystrophy (DM) is another exciting story about alternative splicing and disease. DM is
an autosomal dominant disorder affecting one in 8,000 humans, is the most common form of muscular
dystrophy in adults, and is characterized by multiple symptoms like muscle hyperexcitability (myotonia),
progressive muscle wasting, defects of the cardiac conduction, alterations in smooth muscle function,
ocular cataracts, testicular atrophy, neuropsychiatric disturbances and insulin resistance. Reviews were
written by Day and Ranum (2005) and Machuca-Tzili et al. (2005). For DM1, a special form of DM,
the only abnormality found at the DNA level is an unusually high number of CTG repeats, counting
from 80 to 1,000 copies, in the 3′ untranslated region of the DMPK gene (DM protein kinase) located
on chromosome 19q13.3. At the mRNA level, DM1 patients show a number of unusually spliced genes
like CLC1 (chloride channel protein 1), IR (insulin receptor) and cTNT (cardiac troponin T), which
at first glance have nothing to do with the DMPK gene. However, the mRNA of the DMPK gene
with its CTG repeats forms stable hairpin structures and therefore is neither translated nor degraded,
but accumulates within so-called ribonuclear foci. As a result, the important splicing factor MBNL1
(muscleblind-like mRNA binding protein) irreversibly binds to the double-stranded CTG repeat structures and falls short for the regulation of splicing. This leaves more space for CUG-BP, the CUG
binding protein, which is a member of the CELF family known as a trans-acting splicing regulator and
antagonist of MBNL1.
The resulting abnormal ratio of the splicing factors MBNL1 and CUG-BP is responsible for
altered splicing patterns of CLC1, IR, cTNT, the tau protein, and the myotubularin proteins, which
cause at least some of the disease-related symptoms like myotonia, insulin resistance, cardiac arrhythmia, and CNS and myopathic effects, respectively, as shown in studies by Charlet-B et al. (2002), Savkur
et al. (2004), Philips et al. (1998), Seznec et al. (2001) and Sergeant et al. (2001). Fokstuen et al. (2001)
found that the number of CTG repeats partially determines the onset and the degree of the disease. The
longer the repeat region is, the more ribonuclear foci are formed and the stronger the symptoms are. The
phenotypically similar disease DM2 is caused by CCTG repeats within the first intron of the ZNF9 gene.
Figure 5 describes three known examples of how an imbalance of MBNL1 and CUG-BP concentrations
affects splicing.
For the CLC1 gene, CUG-BP binds to the U/G-rich repeats upstream of exon 3 of the CLC1 mRNA
and prevents recognition of the 5′ splice site of exon 3. Hence, the upstream 3′ splice site of exon 2 is
not used, which leads to inclusion of intron 2 into the final transcript. A premature stop codon leads to
a wrongly spliced mRNA that is targeted by the NMD pathway described by Charlet-B et al. (2002).
In case of the IR protein, binding of MBNL1 to intron 11 is important to include exon 11 in the final
transcript. It facilitates the recognition of the 5′ splice site of exon 11 by a yet unknown mechanism. In
the absence of MBNL1, CUG-BP binds upstream of exon 11 and hinders the spliceosome from binding,
resulting in exclusion of exon 11. This process was described by Dansithong et al. (2005).
A study of the cardiac troponin T by Ladd et al. (2005) revealed that normally MBNL1 inhibits
the recognition of the 3′ splice site of exon 5 of the cTNT mRNA, whereas in DM patients with higher
CUG-BP concentration, CUG-BP binds to an enhancer structure downstream of exon 5, leading to
inclusion of the disease-causing exon 5.
Von Willebrand disease (VWD) affects about one percent of the German population and is
the most frequent inherited blood coagulation disorder. It is caused by defects of the VWF (von Willebrand factor) gene, which is localized on human chromosome 12p13.3. Of 205 cases reported
in the Human Gene Mutation Database, 11 cause splicing defects. The VWF protein supports adhesion
of platelets to sites of vascular injury by connecting the sub-endothelial collagen matrix with the
platelet-surface receptor complex GPIb-IX-V. Together with factor VIII it forms a complex which
circulates within the blood, delivering the VWF protein to the site of injury, stabilizing its heterodimeric
structure and protecting it from premature clearance from plasma. A new case reported by Gallinaro
et al. (2006) describes a C to A substitution within the acceptor splice site of intron 13 which causes
production of two alternative transcripts in addition to the normal one. One skips exon 14 by extending
intron 13, while the other uses a new cryptic acceptor splice site within intron 13 that extends
exon 14 by 62 base pairs to the 5′ side. Both transcript variants show a premature stop codon, leading
to non-functional proteins.
Figure 5. A proposed mechanism of how high numbers of CTG (CCTG) repeats can affect splicing of
multiple genes. Top: Schematic representation of the DMPK gene with untranslated regions shown as
grey boxes and exons as grey-shaded boxes (1, 14, 15). Bottom: Changes of the splicing patterns of
three genes caused by the RNA binding proteins CUG-BP and MBNL1, shown as dark grey and light
grey ellipsoids, respectively.

Retinitis pigmentosa (RP) is a group of related neurodegenerative diseases of the retina with similar
clinical symptoms, reviewed by Vervoort and Wright (2002). In an earlier stage RP is associated with
night blindness due to progressive loss of rod photoreceptor function. In later stages, day vision is also
affected. Incorrect expression of three genes, PRPF31, HPRP3, and PRPC8, is responsible for the
autosomal dominant form of RP (Liu et al. 2002). Zhou et al. (2002) identified the resulting proteins in isolated
functional spliceosomes, where they are all associated with the U4/U6-U5 splicing factors. One protein is
called RPGR (retinitis pigmentosa GTPase regulator), of which 97 variations are reported in the Human
Gene Mutation Database. Nineteen of them are due to splicing defects, partly described by Vervoort
and Wright (2002).

Alternative Splicing in Cancer


Cancer is a generic term for over 100 different diseases with different progression, prognoses and
therapies. The common feature is increased cell proliferation and the formation of tumours. In principle,
each cell type of an organism could generate tumour cells. Different studies of mRNA expression
patterns have shown that cancer tissue is often associated with unusual or unbalanced expression of
alternatively spliced isoforms. Brinkman (2004) and Garcia-Blanco et al. (2004) survey cancer-type-specific isoforms which can be used as diagnostic markers for different stages of cancer progression
and therapy. A recent paper by Ritchie et al. (2008) found a number of cancer types correlated with
significantly increased variation of alternative splicing. The authors conjecture that changes of splicing
factor expression may lead to loss of splicing specificity for tissues and developmental stages. Watermann et al. (2006), Karni et al. (2007), Maeda et al. (1999), Zerbe et al. (2004), Fisher et al. (2004) and
Stickeler et al. (1999) describe cases of cancer-associated up-regulation of splicing factors such as SF2/ASF,
U2AF-65, SFRS2, SFRS3, SRm160, hnRNP A1/A2 and TRA2-β. In Table 1 we show a collection of
mutations leading to alternative splicing associated with cancer.
One example is given by the CD44 receptor, which is known for multiple isoforms due to alternative
splicing. Its exon 5 is often involved in tumor progression and T-cell activation. It has been shown by
Matter et al. (2002) that inclusion of this exon can be mediated by the RAS pathway. This pathway
stimulates the extracellular signal-regulated kinase (ERK), which phosphorylates the SAM68 splicing
factor that binds to the AAAAUU sequence within exon 5. The phosphorylation of SAM68 is required
for correct inclusion of exon 5 into the final transcript. In non-tumor cells, the RAS pathway is not
permanently active, ERK is not stimulated, SAM68 is not phosphorylated, and the AAAAUU site in
exon 5 does not trigger splicing and its inclusion into the final transcript. In tumor cells RAS is often
found activated by several environmental factors. Also, growth factors, which are often autoregulated in
cancer cells, permanently activate the RAS pathway. Cheng and Sharp (2006) describe the regulation of
CD44 alternative splicing by the splicing factor SRm160 and its role in tumor cell invasion.
The BCL-x gene from the BCL-2 gene family is another example of a changed balance of differently
spliced isoforms under cancer conditions. The BCL-2 family is important for the regulation of apoptosis, the so-called programmed cell death, and is reviewed by Akgul et al. (2004). There are three different
transcripts of BCL-x, called BCL-xL (long), BCL-xS (short) and BCL-xβ, which are described by Boise et
al. (1993) and Ban et al. (1998). The two splice forms BCL-xL and BCL-xS result from the use of two different 5′ splice sites within exon 2 of the BCL-x gene. Use of the upstream 5′ splice site results in the
short BCL-xS transcript, and use of the downstream 5′ splice site leads to the long BCL-xL transcript.
Both proteins have opposite effects on apoptosis: BCL-xL is anti-apoptotic and BCL-xS is pro-apoptotic.
Following Korsmeyer et al. (1993), the balance between these interacting factors determines the normal
cell response to external and internal stimuli. In cancer cells the function of apoptosis is often damaged
by a high level of BCL-xL and a decreased level of BCL-xS.


Table 1. Genes and mutations associated with cancer

Gene | Mutation | Cancer type | Reference
AML1b | AML1b transcript lacking exon 6 (AML1b Del179-242) | Ovarian cancer | Nanjundan et al. (2006)
APC | Germline mutations in the tumor suppressor gene APC: exon 4 (c.423G>T), exon 14 (c.1956C>T, c.1957A>G, and c.1957A>C); complete exon skipping due to damage of an exonic splicing enhancer sequence | Familial adenomatous polyposis (FAP) | Aretz et al. (2004)
ATM | IVS10-6T>G at the 3′ splice site of intron 10; incorrect splicing of exon 11 and exon skipping, resulting in a frame-shift starting at codon 355 and a premature stop codon at 371 | Breast cancer | Broeks et al. (2003)
BRCA2 | Missense mutation in exon 13 (c.7235G>A) results in skipping of this exon, causing a frame shift and generating a premature stop codon in exon 14; the mutation site has low homology to a known exonic splicing enhancer sequence | Breast cancer | Thomassen et al. (2006)
BRCA2 | The BRCA2 variant 8204G>A is a splicing mutation; the acceptor site of exon 17 is damaged, which implies skipping of this exon | Breast cancer | Hofmann et al. (2003)
BRCA1 | IVS10-2A>C produces an aberrant RNA splicing transcript with missing exon 11 | Breast cancer | Keaton et al. (2003)
BRCA1 | G to T mutation at position 5199 of exon 18 implies exon skipping due to changing an ESE sequence; exon 18 is essential for DNA repair, transcriptional control and tumour suppression | Breast cancer | Liu et al. (2001)
CDKN2A | Two different mutations, D153spl (c.457G>T) and IVS2+1G>T, imply a 74 bp deletion in exon 2 or complete loss of exon 2, respectively | Cutaneous malignant melanoma (CMM) | Rutter et al. (2003)
HRPT2 | IVS2-1G>A of the AG dinucleotide of the 3′ splice site of intron 2 leads to deletion of exon 3 or an exon shorter by 23 bp | Parathyroid tumors | Moon et al. (2005)
KAI1 | Deletion of exon 7 causes weaker interaction with integrin 1, leading to increased motility and metastasis of the cells | Gastric cancer | Lee et al. (2003)
KLF6 | IVS1-27G>A generates a novel SRp40 DNA binding site, leading to three alternatively spliced isoforms antagonizing the wtKLF6 function | Prostate cancer | Narla et al. (2005)
KIT | Deletion of 30 or 34 nt in intron 10 and exon 11, respectively, leads to a mis-spliced, constitutively activated oncoprotein (KIT) | Gastrointestinal stromal tumors | Chen et al. (2005)
LI-cadherin | Mutation at a putative branch point (IVS6+35A>G) in intron 6 leads to loss of exon 7 | Hepatocellular carcinoma | Wang et al. (2005)
LKB1 | The IVS2+1A>G mutation alters the 5′ splice site of intron 2 (U12-type AT-AC intron) | Peutz-Jeghers syndrome with increased cancer risk | Hastings et al. (2005)
WT1 | Use of an alternative promoter located within intron 1 leads to 147 missing amino acids at the N terminus required for transcriptional repression of sWT1 | Leukemia | Hossain et al. (2006)

This was observed by Cory et al. (2003), Cory and Adams (2002) and Xerri et al. (1996). Hence,
changing the balance of these isoforms to increase production of BCL-xS could make cancer cells more
sensitive to chemotherapy.


Unfortunately, little is known so far about whether cancer causes mis-spliced isoforms or whether these isoforms
cause cancer. Therefore, Hayes et al. (2006) and Pilch et al. (2001) suggest targeting the general RNA
splicing machinery as a novel strategy for cancer treatment.

Alternative Splicing and Clinical Therapy


The growing knowledge about the molecular mechanisms of alternative splicing and how it causes
diseases leads to new concepts for clinical therapies. Here we will mention the potential of antisense
oligoribonucleotides and modified splicing factors as first strategies to intervene in splicing defects. Kole
et al. (2004) and Sazani and Kole (2003) have written excellent reviews about this emerging field.
Antisense oligonucleotides (AONs) can be used to block splice sites and other signals like splicing
enhancers and silencers in order to change the final transcript. Binding of an AON to the mRNA creates a
short double-stranded stretch and makes the considered site invisible to splicing factors. However,
RNase H, localized in the cell, is able to recognize such duplexes and destroy the bound mRNA. Therefore the
AON has to be chemically modified so that it is not recognized, and the target not degraded, by RNase H. The related small
interfering RNA (siRNA) technique has become a widely used tool for down-regulating gene expression and is
described by Mercatante et al. (2001), Dallas et al. (2006) and Kurreck (2003 and 2006).
Duchenne muscular dystrophy (DMD) is a degenerative muscle disease. Dystrophin is the largest
human gene, with 2.2 million base pairs and 79 exons; the resulting protein is 3,685 amino acids long.
Of the 696 mutations reported in the Human Gene Mutation Database, 68 lead to loss of exons, shifts
of the reading frame and non-functional proteins, often due to premature stop codons and nonsense-mediated decay, as described by Lareau et al. (2007) and Ni et al. (2007). The milder case of DMD is BMD
(Becker-Kiener disease), in which mutations cause the deletion of less important parts of the dystrophin
protein, leading to a shorter but more or less functional protein. The most abundant form of DMD is
the loss of exon 45, resulting in a frame shift. However, bioinformatic analysis revealed that an additional
skipping of exon 46 re-establishes the reading frame and produces a shortened but probably functional
protein. In vitro experiments by Aartsma-Rus and coworkers (Aartsma-Rus et al. 2002; 2003; 2004)
with muscle cells from DMD patients having the deletion of exon 45 proved that antisense oligonucleotide blocking of the splice sites of exon 46 successfully generates functional transcripts which lack
both exons 45 and 46. Using the same in vitro method, as illustrated in Figure 6, it was possible to skip
11 other exons damaged by DMD mutations.
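The antisense principle described above comes down to designing an oligonucleotide that is the reverse complement of the target region spanning a splice site, so that it hybridizes to the pre-mRNA and hides the site from the spliceosome. The Python sketch below illustrates only this design step on an invented 25-nt target, not the published exon 46 sequence or a clinically validated oligonucleotide.

```python
# Toy sketch: design a blocking AON as the reverse complement of a target RNA region.
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A"}

def antisense_oligo(target_rna: str) -> str:
    """Return the antisense (reverse-complement) RNA oligonucleotide for a target."""
    return "".join(COMPLEMENT[base] for base in reversed(target_rna))

target = "CAGGAAGGUAAGUUUCUCCAGGCUA"   # hypothetical region spanning a donor splice site
oligo = antisense_oligo(target)
print("target (5'->3'):", target)
print("AON    (5'->3'):", oligo)
```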
Many cancer cells over-express the BCL-xL (long) splice form of the BCL-x gene, which codes for
an anti-apoptotically acting protein. Small interfering RNA against the downstream 5′ splice site of its
exon 5 leads to a reduced level of BCL-xL and a higher level of its antagonist BCL-xS (short). Experiments by Xie et al. (2006) and Lei et al. (2006) showed that the implied change of balance between
these two proteins makes cancer cells more sensitive to chemotherapy and radiation. A similar effect
has been shown for another member of the BCL-2 gene family. The expression of the BCL- isoform is
increased in many cancer cells, as reported by Real et al. (2002). In vivo experiments by Dias and Stein
(2002) using antisense oligonucleotide blocking demonstrated that suppression of the BCL-
isoform inhibits the expansion and progression of tumors.
Mutations of the β-globin gene IVS2 within intron 2 create a novel aberrant 5′ splice site and activate
a cryptic 3′ splice site, finally leading to a premature stop codon. Using the AON technique, the aberrant
splicing caused by these mutations could be corrected in a culture system, as reported by Suwanmanee et al. (2002a/b).


Figure 6. Antisense oligonucleotide blocking induces skipping of exon 46 to re-establish the correct
reading frame of the damaged dystrophin gene found in DMD patients lacking exon 45.

A mutation within intron 19 of the CFTR (cystic fibrosis transmembrane conductance regulator) gene
creates a novel 5′ splice site and activates an additional, otherwise absent exon. As in the previous
case, Friedman et al. (1999) re-established correct splicing by AON.
Another example is the correction of the tau gene associated with frontotemporal dementia and parkinsonism linked to chromosome 17. One of the major mRNA changes in this disease is the inclusion
of exon 10. Within a tau minigene system and neuronal pheochromocytoma cells, Kalbfuss et al. (2001)
were able to stop inclusion of exon 10 using AONs against its splice sites.
The success of future practical therapy of the discussed diseases by AON naturally depends on the
stability of RNA blocking over a certain time. Therefore, improved, so-called locked nucleic acid (LNA)
oligonucleotides were proposed by Roberts et al. (2006). These oligonucleotides were observed to remain
bound for up to three weeks.
Oligonucleotide-induced recruitment of splicing factors is another approach to control splicing
patterns. For example, in the case of SMA the SMN1 gene is lost or mutated. The replacement gene
SMN2 is less effective since its transcript often lacks the important exon 7. To overcome this problem
Cartegni et al. (2006) invented a chimeric protein. With its N terminus it binds to the less functional
exonic splicing enhancer located in the skipped exon 7 of SMN2, and at its C terminus it carries a
peptide structure similar to other SR proteins which control interaction with the splicing machinery, as
described by Cartegni and Krainer (2003) and illustrated in Figure 7. In fact, the mimicked SR protein
was observed to work, attracting the splicing machinery and increasing the production of SMN2 mRNA
including exon 7 in a dose-dependent manner. This technique perhaps opens a general way to restore
normal splicing patterns for genes with damaged control signals.


Figure 7. Inclusion of exon 7 into the SMN2 transcript facilitated by a chimeric protein (SF**) which induces
recognition of the acceptor site of exon 7.

Conclusion
Alternative RNA splicing is the prevailing mechanism of gene expression for generating the great diversity
of proteins that is typical for complex forms of life. It is strictly controlled by a complex system of
about one hundred known splicing factors and a number of complex sequence signals like splice sites, the
branch point signal, and splicing enhancers and silencers. It is therefore no surprise that sequence mutations
and damaged splicing factors can have pathological effects. Our review presented different examples
of erroneously spliced gene products which cause severe diseases.
The complex process of normal and aberrant splicing offers a number of targets for possible therapies
which differ from gene therapy by not repairing genes but making improved use of damaged genes, as
already known for Duchenne muscular dystrophy. One technology described in this review is the application of complementary oligonucleotides such as siRNA in order to suppress unfavourable splice sites,
splicing enhancers or silencers (Aartsma-Rus et al. 2006). Oligonucleotide-induced recruitment of splicing
factors is another approach. More research in this direction, focused on a single important disease
like cystic fibrosis (CF), can perhaps soon provide a successful therapy. CF is one of the best studied
hereditary diseases, and much is known about the process of alternative splicing and other molecular
mechanisms related to this disease. The described siRNA technique could soon allow restoration of
correct splicing.
Another target is increasing the stability and the half-life of correctly or favourably spliced mRNAs
in order to produce more functional proteins from each single mRNA template. It could lead to successful treatments of spinal muscular atrophy, myotonic dystrophy and other diseases which are caused by
insufficient production of known proteins.

Acknowledgment
This work was supported by the BMBF Germany under contract number 0312705A.


References
Aartsma-Rus, A., Bremmer-Bout, M., Janson, A. A., den Dunnen, J. T., van Ommen G. J., & van
Deutekom, J. C. (2002). Targeted exon skipping as a potential gene correction therapy for Duchenne
muscular dystrophy. Neuromuscul Disord, 12(Suppl 1), S71-77.
Aartsma-Rus, A., Janson, A. A., Kaman, W. E., Bremmer-Bout, M., den Dunnen, J. T., Baas, F., van
Ommen, G. J., & van Deutekom, J. C. (2003). Therapeutic antisense-induced exon skipping in cultured
muscle cells from six different DMD patients. Hum Mol Genet, 12(8), 907-914.
Aartsma-Rus, A., Janson, A. A., Kaman, W.E., Bremmer-Bout, M., van Ommen, G.J., den Dunnen,
J. T., & van Deutekom, J. C. (2004). Antisense-induced multiexon skipping for Duchenne muscular
dystrophy makes more sense. Am J Hum Genet, 74(1), 83-92.
Aartsma-Rus, A., Janson, A. A., Heemskerk, J. A., de Winter, C. L., van Ommen, G. J., & van Deutekom, J. C. (2006). Therapeutic modulation of DMD splicing by blocking exonic splicing enhancer
sites with antisense oligonucleotides. Ann N Y Acad Sci, 1082, 74-76.
Akgul, C., Moulding, D. A., & Edwards, S. W. (2004). Alternative splicing of Bcl-2-related genes:
Functional consequences and potential therapeutic applications. Cell Mol Life Sci., 61(17), 2189-2199.
Aretz, S., Uhlhaas, S., Sun, Y., Pagenstecher, C., Mangold, E., Caspari, R., Möslein, G., Schulmann, K.,
Propping, P., & Friedl, W. (2004). Familial adenomatous polyposis: Aberrant splicing due to missense
or silent mutations in the APC gene. Hum Mutat., 24(5), 370-380.
Ars, E., Serra, E., Garcia, J., Kruyer, H., Gaona, A., Lazaro, C., & Estivill, X. (2000). Mutations affecting mRNA splicing are the most common molecular defects in patients with neurofibromatosis type
1. Hum Mol Genet. 9(2), 237-247.
Ban, J., Eckhart, L., Weninger, W., Mildner, M., & Tschachler, E. (1998). Identification of a human
cDNA encoding a novel Bclx isoform. Biochem. Biophys. Res. Commun, 248, 147-152
Beadle, G. W., Tatum, E. L. (1941). Genetic control of biochemical reactions in neurospora. Proc Natl
Acad Sci U S A., 27(11), 499-506.
Boeckmann, B., Blatter, MC., Famiglietti, L., Hinz, U., Lane, L., Roechert, B., & Bairoch, A. (2005).
Protein variety and functional diversity: Swiss-Prot annotation in its biological context. C R Biol.,
328(10-11), 882-899.
Boise, L. H., Gonzalez-Garcia, M., Postema, C. E., Ding, L., Lindsten, T., & Turka, L. A. et al. (1993).
bcl-x, a bcl-2-related gene that functions as a dominant regulator of apoptotic cell death. Cell, 74, 597-608.
Bono, F., Ebert, J., Lorentzen, E., & Conti, E.. (2006). The crystal structure of the exon junction complex
reveals how it maintains a stable grip on mRNA. Cell, 126(4), 713-725.
Brinkman, B. M. (2004). Splice variants as cancer biomarkers. Clin Biochem., 37(7), 584-594.
Broeks, A., Urbanus, J. H., de Knijff, P., Devilee, P., Nicke, M., Klöpper, K., Dörk, T., Floore, A. N.,
& van't Veer, L. J. (2003). IVS10-6T>G, an ancient ATM germline mutation linked with breast cancer.
Hum Mutat., 21(5), 521-528.

Broeks, A., van't Veer, L. J., Ottenheim, C., Hiel, J. A., Kleijer, W. J., & Weemaes, C. (2003). From
gene to disease; ataxia telangiectasia. Ned Tijdschr Geneeskd, 147(9), 386-389.
Cartegni, L., & Krainer, A. R. (2003). Correction of disease-associated exon skipping by synthetic
exon-specific activators. Nat Struct Biol., 10(2), 120-125.
Cartegni, L., Hastings, M. L., Calarco, J. A., de Stanchina, E., & Krainer, A. R. (2006). Determinants
of exon 7 splicing in the spinal muscular atrophy genes, SMN1 and SMN2. Am J Hum Genet., 78(1),
63-77.
Charlet, B. N., Savkur, R. S., Singh, G., Philips, A. V., Grice, E. A., & Cooper, T. A. (2002). Loss of the
muscle-specific chloride channel in type 1 myotonic dystrophy due to misregulated alternative splicing.
Mol Cell., 10(1), 45-53.
Chen, L. L., Sabripour, M., Wu, E. F., Prieto, V. G., Fuller, G. N., & Frazier, M. L. (2005). A mutation-created novel intra-exonic pre-mRNA splice site causes constitutive activation of KIT in human
gastrointestinal stromal tumors. Oncogene, 24(26), 4271-4280.
Cheng, C., & Sharp, P. A. (2006). Regulation of CD44 alternative splicing by SRm160 and its potential
role in tumor cell invasion. Mol Cell Biol., 26(1), 362-370
Cory, S., & Adams, J. M. (2002). The Bcl2 family: Regulators of the cellular life-or-death switch. Nat
Rev Cancer, 2(9), 647-656.
Cory, S., Huang, D. C., & Adams, J. M. (2003). The Bcl-2 family: Roles in cell survival and oncogenesis.
Oncogene, 22(53), 8590-8607.
Dallas, A., & Vlassov, A. V. (2006). RNAi: A novel antisense technology and its therapeutic potential.
Med Sci. Monit., 12, RA67-74
Dansithong, W., Paul, S., Comai, L., & Reddy, S. (2005). MBNL1 is the primary determinant of focus
formation and aberrant insulin receptor splicing in DM1. J Biol Chem., 280(7), 5773-5780.
Day, J. W., & Ranum, L. P. (2005). RNA pathogenesis of the myotonic dystrophies. Neuromuscul Disord., 15(1), 5-16.
Dias, N.,& Stein, C. A. (2002). Potential roles of antisense oligonucleotides in cancer therapy: the
example of Bcl-2 antisense oligonucleotides. Eur. J. Pharm. Biopharm., 54, 263-269
Eggert, C., Chari, A., Laggerbauer, B., & Fischer, U. (2006). Spinal muscular atrophy: The RNP connection. Trends Mol Med., 12(3), 113-121.
Fischer, D. C., Noack, K., Runnebaum, I. B., Watermann, D. O., & Kieback, D. G., et al. (2004). Expression of splicing factors in human ovarian cancer. Oncol Rep, 11, 1085-1090
Fokstuen, S., Myring, J., Evans, C., & Harper, P. S. (2001). Presymptomatic testing in myotonic dystrophy: Genetic counselling approaches. J Med Genet., 38(12), 846-850.
Friedman, K. J., Kole, J., Cohn, J. A., Knowles, M. R., Silverman, L. M., & Kole, R. (1999). Correction of aberrant splicing of the cystic fibrosis transmembrane conductance regulator (CFTR) gene by
antisense oligonucleotides. J Biol Chem., 274(51), 36193-36199.


Frugier, T., Nicole, S., Cifuentes-Diaz, C., & Melki, J. (2002). The molecular bases of spinal muscular
atrophy. Curr Opin Genet Dev., 12(3), 294-298.
Gallinaro, L., Sartorello, F., Pontara, E., Cattini, M. G., Bertomoro, A., Bartoloni, L., Pagnan, A., & Casonato, A. (2006). Combined partial exon skipping and cryptic splice site activation as a new molecular
mechanism for recessive type 1 von Willebrand disease. Thromb Haemost., 96(6), 711-716
Garcia-Blanco, M. A., Baraniak, A. P., & Lasda, E. L. (2004). Alternative splicing in disease and therapy. Nat. Biotechnol, 22, 535-546
Gupta, S., Zink, D., Korn, B., Vingron, M., & Haas, S. A. (2004). Genome wide identification and classification of alternative splicing based on EST data. Bioinformatics., 20, 2579-2585.
Hastings, M. L., Resta, N., Traum, D., Stella, A., Guanti, G., & Krainer, A. R. (2005). An LKB1 AT-AC
intron mutation causes Peutz-Jeghers syndrome via splicing at noncanonical cryptic splice sites. Nat
Struct Mol Biol., 12(1), 54-59.
Hayes, G. M., Carrigan, P. E., Beck, A. M., Miller, L. J. (2006). Targeting the RNA splicing machinery
as a novel treatment strategy for pancreatic carcinoma. Cancer Res, 66, 3819-3827
Hofmann, W., Horn, D., Hattner, C., Classen, E., & Scherneck, S. (2003). The BRCA2 variant 8204G>A
is a splicing mutation and results in an in frame deletion of the gene. J Med Genet., 40(3), e23.
Hossain, A., Nixon, M., Kuo, M. T., & Saunders, G. F. (2006). N-terminally truncated WT1 protein with
oncogenic properties overexpressed in leukemia. J Biol Chem., 281(38), 28122-28130.
Johnson, J. M., Castle, J., Garrett-Engele, P., Kan, Z., Loerch, P. M., Armour, C. D., Santos, R., Schadt,
E. E., Stoughton, R., & Shoemaker, D. D.(2003). Genome-wide survey of human alternative pre-mRNA
splicing with exon junction microarrays. Science, 302(5653), 2141-2144
Kalbfuss, B., Mabon, S. A., & Misteli, T. (2001). Correction of alternative splicing of tau in frontotemporal dementia and parkinsonism linked to chromosome 17. J Biol Chem., 16, 276(46), 42986-42993
Karni, R., de Stanchina, E., Lowe, S. W., Sinha, R., & Mu, D., et al. (2007). The gene encoding the
splicing factor SF2/ASF is a proto-oncogene. Nat Struct. Mol. Biol, 14, 185-193
Keaton, J. C., Nielsen, D. R., Hendrickson, B. C., Pyne, M. T., Scheuer, L., Ward, B. E., Brothman, A.
R., & Scholl, T. (2003). A biochemical analysis demonstrates that the BRCA1 intronic variant IVS102A--> C is a mutation. J Hum Genet., 48(8), 399-403.
Kerem, B., Rommens, J. M., Buchanan, J. A., Markiewicz, D., Cox, T. K., Chakravarti, A., Buchwald,
M., & Tsui, L. C.. (1989). Identification of the cystic fibrosis gene: genetic analysis. Science. 245(4922),
1073-1080.
Kole, R., Vacek, M., & Williams, T. (2004). Modification of alternative splicing by antisense therapeutics. Oligonucleotides,. 14(1), 65-74.
Korsmeyer, S. J., Shutter, J. R., Veis, D. J., Merry, D. E., & Oltvai, Z. N.(1993). Bcl-2/Bax: A rheostat
that regulates an anti-oxidant pathway and cell death. Semin. Cancer Biol., 4, 327-332

305

Alternative Splicing and Disease

Krawczak, M., Reiss, J., & Cooper, D. N. (1092). The mutational spectrum of single base-pair substitutions
in mRNA splice junctions of human genes: Causes and consequences. Hum Genet., 90(1-2), 41-54.
Kurreck, J. (2006). siRNA Efficiency: Structure or Sequence-That Is the Question. J Biomed Biotechnol., (4), 83757.
Kurreck, J. (2003). Antisense technologies. Improvement through novel chemical modifications. Eur J
Biochem., 270(8), 1628-1644.
Ladd, A. N., Stenberg, M. G., Swanson, M. S., & Cooper, T. A. (2005). Dynamic balance between activation and repression regulates pre-mRNA alternative splicing during heart development. Dev Dyn.,
233(3), 783-793.
Lareau, L. F., Inada, M., Green, R. E., Wengrod, J. C, & Brenner, S. E. (2007). Unproductive splicing of
SR genes assoziated with highly conserved and ultraconserved DNA elements. Nature, 446, 926-929
Lefebvre, S., Burglen, L., Reboullet, S., Clermont, O., Burlet, P., Viollet, L., Benichou, B., Cruaud, C.,
Millasseau, P., & Zeviani, M., et al. (1995). Identification and characterization of a spinal muscular
atrophy-determining gene. Cell., 80(1), 155-165
Lee, H. S., Lee, H. K., Kim, H. S., Yang, H. K., & Kim, W. H. (2003). Tumour suppressor gene expression correlates with gastric cancer prognosis. J Pathol., 200(1), 39-46.
Le Hir, H., Izaurralde, E., Maquat, L. E., & Moore, M. J. (2000). The spliceosome deposits multiple
proteins 20-24 nucleotides upstream of mRNA exon-exon junctions. EMBO J., 19(24), 6860-6869.
Lei, X. Y., Zhong, M., Feng, L. F., Zhu, B. Y., Tang, S. S., & Liao, D.F. (2006). Bcl-XL small interfering
RNA enhances sensitivity of Hepg2 hepatocellular carcinoma cells to 5-fluorouracil and hydroxycamptothecin. Acta Biochim Biophys Sin (Shanghai,. 38(10), 704-710.
Liu, H. X., Cartegni, L., Zhang, M. Q., & Krainer, A. R. (2001). A mechanism for exon skipping caused
by nonsense or missense mutations in BRCA1 and other genes. Nat Genet., 27(1), 55-58.
Liu, Q., Zhou, J., Daiger, S. P., Farber, D. B., Heckenlively, J. R., Smith, J. E., Sullivan, L. S., Zuo, J.,
Milam, A. H., & Pierce, E. A. (2002). Identification and subcellular localization of the RP1 protein in
human and mouse photoreceptors. Invest Ophthalmol Vis Sci., 43(1), 22-32.
Lorson, C. L., & Androphy, E. J. (2000). An exonic enhancer is required for inclusion of an essential
exon in the SMA-determining gene SMN. Hum Mol Genet., 9(2), 259-265.
Machuca-Tzili, L., Brook, D., & Hilton-Jones, D. (2005). Clinical and molecular aspects of the myotonic
dystrophies: a review. Muscle Nerve., 32(1), 1-18.
Maeda, T., Hiranuma, H., & Jikko, A. (1999). Differential expression of the splicing regulatory factor
genes during two-step chemical transformation in a BALB/3T3-derived cell line, MT-5 (1999). Carcionogenesis, 20, 2341-2344
Matter, N., Herrlich, P., & Kanig, H. (2002). Signal-dependent regulation of splicing via phosphorylation
of Sam68. Nature, 420(6916), 691-695.

306

Alternative Splicing and Disease

Meister G, Buhler D, Pillai R, Lottspeich F, Fischer U. (2001). A multiprotein complex mediates the
ATP-dependent assembly of spliceosomal U snRNPs. Nat Cell Biol. 3(11), 945-949.
Mercatante, D. R., Sazani, P., & Kole, R. (2001). Modification of alternative splicing by antisense oligonucleotides as a potential chemotherapy for cancer and other diseases. Curr Cancer Drug Targets,
1(3), 211-230.
Mironov, A. A., Fickett, J. W., & Gelfand, M. S. (1999). Frequent alternative splicing of human genes.
Genome Res., 9(12), 1288-93.
Moon, S. D., Park, J. H., Kim, E. M., Han, & J. H. et al. (2005). A novel IVS2-1G>A mutation causes
aberrant splicing of the HRPT2 gene in a family with hyperparathyroidism-Jaw Tumor Syndrome
Journal of Clinical Endocrinology & Metabolism, 2, 878-883
Nanjundan, M., Zhang, F., Schmandt, R., Smith-McCune, K., & Mills, G. B. (2007). Identification of a
novel splice variant of AML1b in ovarian cancer patients conferring loss of wild-type tumor suppressive
functions. Oncogene, 26(18), 2574-2584.
Narla, G., DiFeo, A., Yao, S., Banno, A,, Hod, E., Reeves, H. L., Qiao, R.F., Camacho-Vanegas, O.,
Levine, A., Kirschenbaum, A., Chan, A. M., Friedman, S. L., & Martignetti, J. A. (2005). Targeted
inhibition of the KLF6 splice variant, KLF6 SV1, suppresses prostate cancer cell growth and spread.
Cancer Res., 65(13), 5761-5768.
Ni, J. Z., Grate, L., Donohue, J. P., Preston, C., & Nobida, N., et al. (2007). Ultraconserved elements are
associated with homeostatic control of splicing regulators by alternative splicing and nonsense-mediated
decay. Genes Dev,. 21, 708-718.
Nilsen, T. W. (2003). The spliceosome: The most complex macromolecular machine in the cell? Bioessays., 25(12), 1147-9.
Philips, A. V., Timchenko, L. T., & Cooper, T. A. (1998). Disruption of splicing regulated by a CUGbinding protein in myotonic dystrophy. Science, 280(5364), 737-741.
Pilch, B., Allemand, E., Facompre, M., Bailly, C., & Riou, J. F. et al. (2001). Specific inhibition of
serine- and argenine-rich splicing factors phosphorylation, spliceosome assembly, and splicing by the
antitumor drug NB-506. Cancer Res, 61, 6876-6884.
Pollex, R. L., & Hegele, R. A. (2004). Hutchinson-Gilford progeria syndrome. Clin Genet., 66(5), 375381.
Real, P. J., Sierra, A., De Juan, A., Segovia, J. C., Lopez-Vega, J. M., & Fernandez-Luna, J. L. (2002).
Resistance to chemotherapy via Stat3-dependent overexpression of Bcl-2 in metastatic breast cancer
cells. Oncogene, 21, 7611-7618.
Ritchie, W., Granjeaud, S., Puthier, D., & Gautheret, D. (2008). Entropy Measures Quantify Global
Splicing Disorders in Cancer. PLoS Comput Biol, 4(3), e1000011. doi:10.1371
Roberts, J., Palma, E., Sazani, P., Orum, H., Cho, M., & Kole, R. (2006). Efficient and persistent splice
switching by systemically delivered LNA oligonucleotides in mice. Mol Ther, 14(4), 471-475.

307

Alternative Splicing and Disease

Rutter, J. L., Goldstein, A. M., Davila, M. R., Tucker, M. A., & Struewing, J. P. (2003). CDKN2A point
mutations D153spl(c.457G>T) and IVS2+1G>T result in aberrant splice products affecting both p16INK4a
and p14ARF. Oncogene, 22(28), 4444-4448.
Savkur, R. S., Philips, A. V., Cooper, T. A., Dalton, J. C., Moseley, M. L., Ranum, L. P., & Day, J. W.
(2004). Insulin receptor splicing alteration in myotonic dystrophy type 2. Am J Hum Genet., 74(6),
1309-1313.
Sazani, P., & Kole, R. (2003). Therapeutic potential of antisense oligonucleotides as modulators of
alternative splicing. J Clin Invest., 112(4), 481-486.
Sebillon, P., Beldjord, C., Kaplan, J. C., Brody, E., & Marie, J. (1995). A T to G mutation in the polypyrimidine tract of the second intron of the human beta-globin gene reduces in vitro splicing efficiency:
evidence for an increased hnRNP C interaction. Nucleic Acids Res., 23(17), 3419-3425.
Sergeant, N., Sablonniere, B., Schraen-Maschke, S., Ghestem, A., Maurage, C. A., Wattez, A., Vermersch, P., & Delacourte, A. (2001). Dysregulation of human brain microtubule-associated tau mRNA
maturation in myotonic dystrophy type 1. Hum Mol Genet., 10(19), 2143-2155.
Seznec, H., Agbulut, O., Sergeant, N., Savouret, C., Ghestem, A., Tabti, N., Willer, J. C., Ourth, L., Duros,
C., Brisson, E., Fouquet, C., Butler-Browne, G., Delacourte, A., Junien, C., & Gourdon, G. (2001). Mice
transgenic for the human myotonic dystrophy region with expanded CTG repeats display muscular and
brain abnormalities. Hum Mol Genet., 10(23), 2717-2726.
Stenson, P. D., Ball, E. V., Mort. M., Phillips, A. D., Shiel, J. A., Thomas, N. S., Abeysinghe, S., Krawczak, M., & Cooper, D. N. (2003). Human gene mutation Ddatabase (HGMD), 2003 update. Hum Mutat.
21(6), 577-581.
Staley, J. P., Guthrie, C. (1998). Mechanical devices of the spliceosome: Motors, clocks, springs, and
things. Cell, 92, 315-326.
Stickeler, E., Kitirell, F., Medina, D., & Berget, S. M. (1999). Stage-specific changes in SR splicing factors and alternative splicing in mammary tumorigenesis. Oncogene, 18, 3574-3582
Suwanmanee, T, Sierakowska, H., Lacerra, G., Svasti, S., Kirby, S., Walsh, C. E., Fucharoen, S., & Kole,
R. (2002a). Restoration of human beta-globin gene expression in murine and human IVS2-654 thalassemic erythroid cells by free uptake of antisense oligonucleotides. Mol Pharmacol., 62(3), 545-553.
Suwanmanee, T., Sierakowska, H., Fucharoen, S., & Kole, R. (2002b). Repair of a splicing defect in
erythroid cells from patients with beta-thalassemia/HbE disorder. Mol Ther., 6(6), 718-726.
Tange, T. O., Nott, A., & Moore, M. J. (2004). The ever-increasing complexities of the exon junction
complex. Curr Opin Cell Biol., 16(3), 279-284.
Tazi, J., Durand, S., & Jeanteur, P. (2005). The spliceosome: a novel multi-faceted target for therapy.
Trends Biochem Sci., 30(8), 469-478.
Teraoka, S. N., Telatar, M., Becker-Catania, S., Liang, T., Onengut, S., Tolun, A., Chessa, L., Sanal,
O., Bernatowska, E., Gatti, R. A., & Concannon, P. (1999). Splicing defects in the ataxia-telangiectasia
gene, ATM: underlying mutations and consequences. Am J Hum Genet., 64(6), 1617-1631

308

Alternative Splicing and Disease

Thomassen, M., Kruse, T. A., Jensen, P. K., & Gerdes, A. M. (2006). A missense mutation in exon 13
in BRCA2, c.7235G>A, results in skipping of exon 13. Genet Test., 10(2), 116-120.
Vervoort, R., & Wright, A. F. (2002). Mutations of RPGR in X-linked retinitis pigmentosa (RP3). Hum
Mutat,. 19(5), 486-500.
Wang, X. Q., Luk, J. M., Leung, P. P., Wong, B. W., Stanbridge, E. J., & Fan, S. T. (2005). Alternative
mRNA splicing of liver intestine-cadherin in hepatocellular carcinoma. Clin Cancer Res., 11(2 Pt 1),
483-489.
Watermann, D. O., Tang, Y., Zur Hausen, A., Jager, M., & Stamm, S., et al. (2006). Splicing factor
Tra2-beta1 is specifically induced in breast cancer and regulates alternative splicing of the CD44 gene.
Cancer Res., 66, 4774-4780
Will, C. L. & Luhrmann, R. (2001). Spliceosomal UsnRNP biogenesis, structure and function. Curr
Opin Cell Biol., 13(3), 290-301.
Xerri, L., Parc, P., Brousset, P., Schlaifer, D., Hassoun, J., & Reed, J. C. et al. (1996). Predominant expression of the long isoform of Bcl-x (Bcl-xL). in human lymphomas. Br. J. Haematol., 92, 900-906
Xie, Y. E., Tang, E. J., Zhang, D. R., & Ren, B. X. (2006). Down-regulation of Bcl-XL by RNA interference suppresses cell growth and induces apoptosis in human esophageal cancer cells. World J
Gastroenterol, 12(46), 7472-7477.
Zerbe, L. K., Pino, I., Pio, R., Cosper, P. F., & Dwyer-Nield, L. D., et al. (2004). Relative amounts of
antagonistic splicing factors, hnRNP A1 and ASF/SF2, change during neoplastic lung growth: implications for pre-mRNA processing. Mol Carcinog, 41, 187-196
Zhou, Z. H., Licklider, L. J., Gygi, S. P., & Reed, R. (2002). Comprehensive proteomic analysis of the
human spliceosome Nature, 218, 182-185
Zielenski, J., & Tsui, L. C. (1995). Cystic fibrosis: genotypic and phenotypic variations. Annu Rev
Genet., 29, 777-807.

Key Terms
Alternative Splicing: Alternative choice of introns which are spliced out during pre-mRNA processing.
Autosomal Recessive Disorder: A disorder that occurs only if both alleles of a gene on an autosomal chromosome are mutated.
Beta-Globin Gene: Gene that codes for the beta globin chain of the hemoglobin protein.
CFTR (Cystic Fibrosis Transmembrane conductance Regulator): CFTR is an ion channel protein belonging to the class of ABC transporters. It transports chloride ions through the cell membrane.
Dysfunction of this protein causes cystic fibrosis.


Exon: An exon is a part of the pre-mRNA that is not removed during the RNA-splicing process.
Exonic Splicing Enhancer/Silencer: A pre-mRNA sequence motif of about six bases within an exon that enhances or silences splicing at a nearby sequence position.
Hereditary Disease: A disease caused by a genetic defect that can be passed on from parents to offspring.
Human Gene Mutation Database: The Human Gene Mutation Database (HGMD) constitutes a
comprehensive core collection of data on germ-line mutations in nuclear genes underlying or associated
with human inherited disease (www.hgmd.org).
Hyperexcitability: Persistent muscle contractions caused by increased excitability of the cell membrane, for example as a result of a mutation in a cation channel.
Intron: A pre-mRNA fragment that is cleaved out during pre-mRNA splicing.
Intronic Splicing Enhancer/Silencer: A pre-mRNA sequence motif of about six bases within an intron that enhances or silences splicing at a nearby sequence position.
Ligation: The joining of two molecules facilitated by an enzyme called ligase. In our context ligation denotes the joining of exon fragments during pre-mRNA splicing.
Spliceosome: The entire assembly of small nuclear RNAs and proteins that facilitates pre-mRNA splicing.
Thalassemia: A severe hereditary disease caused by mutations in the beta-globin gene sequence.


Section V

Systems Biology and Aging


Chapter XVIII

Mathematical Modeling of the Aging Process

Axel Kowald
Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, Germany

Abstract
Aging is a complex biological phenomenon that affects practically all multicellular eukaryotes. It is manifested by an ever-increasing mortality risk, which finally leads to the death of the organism. Modern hygiene and medicine have led to an amazing increase in average life expectancy over the last 150 years,
but the underlying biochemical mechanisms of the aging process are still poorly understood. However, a
better understanding of these mechanisms is increasingly important since the growing fraction of elderly
people in the human population confronts our society with completely new and challenging problems.
The aim of this chapter is to provide an overview of the aging process, discuss how it relates to systems biology concepts, and explain how mathematical modeling can improve our understanding of biochemical processes involved in the aging process. We concentrate on the modeling of stochastic effects
that become important when the number of involved entities (i.e., molecules, organelles, cells) is very
small and the reaction rates are low. This is the case for the accumulation of defective mitochondria,
which we describe mathematically in detail. In recent years several tools have become available for stochastic
modeling and we also provide a brief description of the most important of those tools. Of course, mitochondria are not the only target of modeling efforts in aging research. Therefore, the chapter concludes
with a brief survey of other interesting computational models in this field of research.

What is Aging?
Looking at the enormous rise of average human lifespan over the last 150 years, one could get the
impression that modern research actually has identified the relevant biochemical pathways involved in
aging and has successfully reduced the pace of aging. Oeppen and Vaupel (2002) collected data on world


wide life expectancy from studies going back to 1840. Figure 1 shows the life expectancy for males
(squares) and females (circles) for the countries that had the highest life expectancy for the given year.
Two points are remarkable. Firstly, there is an amazingly linear trend in life expectancy that corresponds
to an increase of 3 months per year (!), and secondly, no leveling off is observable.
These impressive data strongly suggest that lifespan will continue to rise in the coming years, but they do not show that the actual aging rate has fallen during the last century. Aging can best be described as
a gradual functional decline, leading to a constantly increasing risk to die within the next time interval
(mortality). The Gompertz-Makeham equation (Gompertz, 1825; Makeham, 1867), $m(t) = I\,e^{Gt} + E$,
describes how the exponential increase of mortality depends on intrinsic vulnerability (I), actuarial aging
rate (G) and environmental risk (E). All living organisms have a base mortality caused by environmental
risks, but it is the aging rate, G, which causes human mortality to double approx. every 8 years. From
this equation we can derive the following expression for the survivorship function: $N(t) = e^{\frac{I}{G}(1 - e^{Gt}) - Et}$.
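For readers who want to retrace this step, a short derivation (our notation; survivorship normalized to N(0) = 1) shows how the expression for N(t) follows by integrating the Gompertz-Makeham hazard, and how the roughly 8-year mortality doubling time quoted above translates into a value for G:

\begin{align*}
N(t) &= \exp\Bigl(-\int_0^t m(\tau)\,d\tau\Bigr)
      = \exp\Bigl(-\int_0^t \bigl(I e^{G\tau} + E\bigr)\,d\tau\Bigr) \\
     &= \exp\Bigl(-\tfrac{I}{G}\bigl(e^{Gt}-1\bigr) - Et\Bigr)
      = e^{\frac{I}{G}(1 - e^{Gt}) - Et}.
\end{align*}

Neglecting the environmental term E, mortality doubles over an interval $T_2$ with $e^{G T_2} = 2$, i.e. $T_2 = \ln 2 / G$; a doubling time of about 8 years therefore corresponds to $G \approx \ln 2 / 8 \approx 0.087$ per year.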

Figure 1. Male (blue squares) and female (red circles) life expectancy in the world record holding country
between 1840 and 2000 based on the annual data of countries world wide (reproduced with permission
from Oeppen & Vaupel, 2002).


As expected we see that the number of remaining survivors depends on all three parameters and consequently a change of the average life expectancy (time until 50% of the population has died) can be
caused by a modification of any of those parameters. This point is also discussed in more detail by Kowald (2002). And indeed, analyzing the survivorship data of the last 100 years more closely, it becomes clear that the aging rate, G, has remained constant. The enormous increase in life expectancy was
achieved exclusively by changes of intrinsic vulnerability and environmental risk!
Because of the drastic social, economic and political consequences that are brought about by the
demographic changes of the age structure of the population, it is now more important than ever to
understand what constitutes the biochemical basis for a non-zero aging rate, G. Systems biology might
help to achieve this goal.

Why is Aging a Prime Candidate for Systems Biology?

Evolutionary theories of the aging process explain why aging has evolved, but unfortunately they don't
predict specific mechanisms to be involved in aging. As a consequence more than 300 mechanistic ideas
have been developed (Medvedev, 1990), each centered around different biochemical processes. This is
probably due to the fact that even the simplest multicellular organisms are such complex systems that
many components have the potential to cause deterioration of the whole system in case of a malfunction.
Figure 2 shows a small sample of the most popular mechanistic theories. The spatial arrangement of the
diagram intends to reflect the various connections between the different theories. And it is exactly the
large number of interactions that makes it so difficult to investigate aging experimentally and renders
it ideal for systems biology. To understand this we will look at a few examples.
Figure 2. Graphical representation of some mechanistic theories of aging. The topology of the diagram reflects logical and mechanistic overlaps and points of interaction between different theories.

The Telomere Shortening Theory is an important idea that has gained considerable support in the last 10-15 years. Telomeres are the physical ends of linear eukaryotic chromosomes and vital for the functioning of the cell (Lundblad & Szostak, 1989; Yu, Bradley, Attardi, & Blackburn, 1990). It has
been recognized for a long time that linear DNA has a replication problem since DNA polymerases can
only replicate in the 5'-3' direction and cannot start DNA synthesis de novo (Olovnikov, 1973; Watson,
1972). This inability leads to a gradual loss of DNA, which was confirmed experimentally for human
fibroblasts in 1990 and consequently proposed as being responsible for aging (Harley, 1991; Harley,
Vaziri, Counter, & Allsopp, 1992). Telomere shortening provides the explanation and connection to the
Hayflick Limit, the long-known phenomenon that most cultured cell types have only a limited division potential. This in turn is interpreted by some researchers as an evolutionarily selected trait that acts as a cancer prevention system by preventing unlimited cell division. While the direct cause of telomere shortening is the lack of the enzyme telomerase, there is an interaction with oxygen radicals that complicates the mechanism. Oxidative stress was shown to increase the rate of telomere shortening and thus to modulate telomere attrition (von Zglinicki, Saretzki, Docke, & Lotze, 1995). As the figure
indicates, free oxygen radicals are also the central hub to several other prominent phenomena affecting
the aging process. They damage all kinds of macromolecules, leading to cross linking of proteins and
the generation of indestructible waste products that accumulate in post mitotic cells (i.e. lipofuscin).
Degraded mitochondria are supposed to be a major fraction of these waste products and certain oxygen
radicals are known to damage the mitochondrial membrane. It seems, however, that radical induced
somatic mutations of the mitochondrial DNA (mtDNA) are the main route to defective mitochondria,
which produce less energy, but generate more radicals (Linnane et al., 1990; Linnane, Marzuki, Ozawa,
& Tanaka, 1989; Miquel, Economos, Fleming, & Johnson, 1980). These mutants are capable of taking
over the mitochondrial population of the cell, causing a chronic energy deficiency and maybe aging. It
is still unclear what the selection pressure and mechanism is that leads to the accumulation of defective mitochondria, but several suggestions have been made (de Grey, 1997; Kowald, Jendrach, Pohl,
Bereiter-Hahn, & Hammerstein, 2005; Kowald & Kirkwood, 2000).
Supporting evidence has been found for all the above-mentioned ideas, and the many interdependencies make aging a phenomenon that is very difficult to study experimentally. If a single mechanism is studied in isolation, it is hard to interpret the results, which were obtained without the contribution of the other mechanisms. And if a complex system is studied with all involved pathways, it is expensive and technically demanding, and the results are then difficult to interpret because of the large number of factors that might have influenced them.
This is exactly the situation where a systems biological approach is useful. Systems biology aims at
investigating the components of complex biochemical networks and their interactions, applying experimental high-throughput and whole genome methods, and integrating computational and mathematical
methods with experimental efforts. The growing number of high-throughput techniques that have been
developed in the last years is a major driving force behind the wish to utilize computational methods
to manage and interpret the high data output. Modelers, on the other hand, are keen to use the generated
data to develop quantitative models of systems with complex interactions. Because of the large number
of parameters such models would not be meaningful without sufficient experimental measurements.
In addition, quantitative modeling of complex systems has several benefits. First of all, it requires
that each aspect of a verbal hypothesis is made specific. Before a computational model can be
developed, the researcher has to define each component and how it interacts with all the other components. This is a very useful exercise to identify gaps in current knowledge and in the verbal model.
It helps to complete the conceptual model or, alternatively, motivates experiments to collect the missing


experimental information. To understand complex systems with components that produce opposing
effects it is essential to have a model with quantitative predictions. Purely qualitative models (such as
verbal arguments) are not sufficient to decide how a system develops over time if it contains non-linear
opposing subcomponents. Computational models are also a convenient way to explore easily and cheaply
what if scenarios that were difficult or impossible to test experimentally. What if a certain reaction
would not exist? What if a certain interaction would be ten times stronger? What if we are interested
in time spans too short or too long to observe in an experiment?
In the next section we will show in detail how a mathematical model can be constructed for a specific
problem in aging research that (i) gives insight into a poorly understood phenomenon, (ii) explores several "what if" questions that would be difficult to test experimentally, and (iii) makes suggestions
for further experiments.

Understanding the Accumulation of Defective Mitochondria

Although a large number of mechanistic theories of aging exist, there are a few that have gained widespread popularity. Probably the most intensively studied idea is currently the free radical theory in
combination with the accumulation of defective mitochondria. Large mtDNA deletions were found in
post-mitotic tissues (heart, brain, skeletal muscle) of aging individuals (Hattori et al., 1991; Linnane et
al., 1990). Early studies were performed on tissue homogenates and found that the fraction of defective
mtDNAs is well below 1% (Cortopassi, Shibata, Soong, & Arnheim, 1992; Randerath, Randerath, &
Filburn, 1996). However, it turned out that the underlying assumption that mitochondrial damage is
distributed homogeneously within a tissue is wrong. The combination of PCR amplification of extended
sequences together with single-cell studies revealed that muscle tissue displays a mosaic pattern of mitochondrial damage. While even in old individuals most cells harbor few or no damaged mitochondria,
there are a few cells that contain a large proportion of mitochondrial mutants (Cao, Wanagat, McKiernan,
& Aiken, 2001; Gokey et al., 2004; Khrapko et al., 1999). These studies demonstrated that in affected
cells the mitochondrial population was apparently taken over by a single mutant, which was different
for different cells. This suggests that the cellular accumulation of defective mitochondria proceeds via
clonal expansion of a single originating mutant. But how can a single mutant mitochondrial DNA, which
lacks essential genes, out-compete the population of wild type mitochondria?
The phenomenon that mitochondria form dynamic networks, undergoing constant fission and fusion
events has been observed for several cell types, such as yeast, plants, HeLa and human endothelial cells
(Arimura, Yamamoto, Aida, Nakazono, & Tsutsumi, 2004; Jendrach et al., 2005; Karbowski et al., 2004;
Nunnari et al., 1997; Takai, Inoue, Goto, Nonaka, & Hayashi, 1997; Takai, Isobe, & Hayashi, 1999).
This opens the possibility that the shorter replication time of deleted mtDNAs represents the selection
advantage, since all the mitochondria of a cell effectively form one large compartment with constant
mixing of mtDNA as well as matrix and membrane components. Under these conditions, mtDNA deletions no longer lead to energy and proton gradient deficiencies within a single mitochondrion since there
is a common pool of these resources. We therefore developed a model of mtDNA mutation, replication
and degradation (Kowald et al., 2005) that should explain the cellular mosaic pattern of mitochondrial
damage and the observed distribution of deletion sizes in old rats found by Cao et al. (2001).


The Model
The model was simulated stochastically to explicitly take into account random fluctuations and the discrete nature of the studied biological objects. Mutation events with their inherently small probabilities
require a non-deterministic approach to describe adequately the finding that in old animals some cells
do contain mitochondrial mutants and others do not. Gillespie developed exact stochastic simulation
algorithms that directly calculate the change of the number of molecules of the participating species
during the time course of a chemical reaction (D. T. Gillespie, 1977). Thus stochastic algorithms calculate explicitly the time until the next reaction takes place (e.g., A → B + C) and keep track of how the
number of molecules in the system is changed by this reaction (A decreased by one, B and C increased
by one). Such an approach also deals with the discrete nature of molecule numbers. This is of special
importance for self-replicating entities (like mtDNAs) if the number of objects is either zero or one.
One mutant mtDNA can multiply and take over the cell, while zero mtDNA molecules obviously cannot. Differential equations, however, assume that variables can attain continuous values, so that 0.001
mtDNAs is a valid simulation result. Because this value is different from zero it is theoretically possible
that a population of mutants can recover from such low levels and replace the wild-type: an unrealistic
situation which is avoided by stochastic simulations.
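To make the logic of Gillespie's direct method concrete, the following minimal Python sketch draws the exponentially distributed waiting time to the next reaction and then picks which reaction fires in proportion to its propensity. It is only an illustration with made-up species and rate constants, not the code used for the simulations described here (those were run in Dizzy).

import math
import random

def gillespie_direct(x, reactions, t_end):
    """Minimal Gillespie direct-method loop.

    x         -- dict of species counts, e.g. {"A": 100, "B": 0, "C": 0}
    reactions -- list of (rate_constant, propensity_fn, state_change) tuples
    t_end     -- simulated end time
    """
    t = 0.0
    while t < t_end:
        # propensity of every reaction in the current state
        a = [k * prop(x) for k, prop, _ in reactions]
        a_total = sum(a)
        if a_total == 0.0:                      # nothing can react any more
            break
        # waiting time to the next reaction is exponentially distributed
        t += random.expovariate(a_total)
        # choose the firing reaction with probability proportional to its propensity
        r = random.random() * a_total
        for a_i, (_, _, change) in zip(a, reactions):
            r -= a_i
            if r <= 0.0:
                for species, delta in change.items():
                    x[species] += delta
                break
    return t, x

# toy example: a single first-order reaction A -> B + C with rate constant 0.1 per day
state = {"A": 100, "B": 0, "C": 0}
decay = (0.1, lambda s: s["A"], {"A": -1, "B": +1, "C": +1})
print(gillespie_direct(state, [decay], t_end=50.0))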
The free software package Dizzy was used for all simulations (Ramsey & Orrell, 2005). Dizzy is
written in Java and implements several stochastic (Gillespie, Gibson-Bruck, Tau-Leap) and deterministic
(ODE solvers) simulation algorithms. Models can be described in Systems Biology Markup Language
(SBML) (Hucka et al., 2003) or using a proprietary model definition language (CMDL). Because of a
special language construct it was possible to define the 4950 mutation reactions with just five lines of
code in CMDL. To be comparable with the experimental conditions of Cao et al. (2001), Gillespie's
direct method was used to calculate the fate of a mitochondrial population within a cell for 1100 days,
corresponding to 38 months of a rats life. On a modern workstation such a calculation took approx. 1
hour. The simulation was then repeated 1000 times on a Linux cluster to have enough information for
a statistical analysis of the accumulation process of defective mitochondria.
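As a quick sanity check on the size of this reaction network, the few lines of Python below (an illustration, not Dizzy's CMDL) enumerate all allowed deletion transitions Mx → My with x < y among the 100 genotype classes and confirm that this indeed yields 4950 mutation reactions.

# Enumerate all allowed deletion mutations M_x -> M_y (x < y) for the 100
# genotype classes M0 (wild type) to M99 (9900 bp deletion).
classes = range(100)
mutation_reactions = [(x, y) for x in classes for y in classes if x < y]
assert len(mutation_reactions) == 4950      # matches the number quoted in the text
print(len(mutation_reactions), "mutation reactions, e.g.", mutation_reactions[:3])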

Mutations
In this simulation study only deletion mutations affecting the large arc (Figure 3B) of the rat genome are
considered. It is assumed that deletions which include the heavy or light strand origin of replication (OH
and OL) are no longer capable of replication and thus cannot accumulate. Deletions affecting exclusively
the minor arc are also excluded, since they have not been observed (Cao et al., 2001).
To further simplify and speed up the simulation, deletions are modeled with a granularity of 100bp,
i.e. deletions can be of 100, 200 or 300bp, but not 250bp. Consequently, the model contains 100 possible
classes of mitochondrial genotypes, wild-type (M0) and 99 different types of deletions ranging from
100bp (M1) to 9900bp (M99). Note that the name indicates the size of the deletion. Other assumptions
that have been made regarding mitochondrial mutations are:
1. Deletion mutants can suffer from further deletion events (Mx → My, x < y). Figure 3A shows the resulting network of possible mutation reactions for five different types of mitochondria. However, since 100 different types were used in the simulations, the total number of mutation reactions is 4950.


Figure 3. A) Overview of the mutations allowed by the model. Different mutants are labeled by numbers and the larger the number, the larger the deletion that this mutant carries. Arrows indicate possible mutations, showing that small (M0 → M1) or large (M0 → M4) pieces of DNA can be lost in a single mutation event. Mutants can suffer further mutations leading to even smaller mutants (M1 → M4 but not M1 → M0). B) Schematic drawing of the rat mtDNA showing OH, the origin of the heavy strand synthesis, and OL, the origin of the light strand synthesis. The locations of the origins define the minor and major arc, with the replication of the strands proceeding in the direction indicated by the arrows.

2. Since no details are known of the biochemical mechanisms underlying the deletions, we assume that the mutation rate is proportional to the length of the mtDNA molecule and has the dimension per day per kbp. Thus larger mtDNAs have a higher probability of mutation than small mtDNAs.
3. Another important point is the outcome of a deletion event. What is the size distribution of the resulting deletion mutants? Without knowing the biochemical basis of the deletion process it is not possible to predict if all deletions will be of identical size or what type of distribution to expect. Under these circumstances we assume that the deletion is the result of two random cuts within the large arc. This leads to a distribution with a linearly increasing probability for the occurrence of small deletions (see the short simulation sketch following this list). Basically, large deletions are very rare since only a few cut positions exist within the large arc that will result in a large deletion fragment.
4. Finally, we investigate the possibility that the presence of defective mtDNAs increases oxidative stress, and thus the mutation rate, within a cell. It is therefore assumed that the mutation rate is increased by a certain factor (the boost factor) if 100% of the mitochondrial population is defective. A linear relationship between the fraction of defective organelles and the increase in oxidative stress is assumed, so that a cell with X% mutants has its mutation rate increased by: boost factor * X/100.
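The claim in point 3, that two random cuts make small deletions much more frequent than large ones, can be checked with a short Monte Carlo sketch in Python (illustrative only; cut positions are restricted to the 100 bp grid used in the model):

import random
from collections import Counter

GRID = 100                        # number of 100 bp intervals in the 10 kbp major arc
counts = Counter()
for _ in range(1_000_000):
    # two random cuts on the 100 bp grid; the deletion size is their distance
    a, b = random.randrange(GRID + 1), random.randrange(GRID + 1)
    size = abs(a - b)             # deletion size in units of 100 bp
    if size > 0:
        counts[size] += 1

# Frequencies fall roughly linearly with deletion size: a 0.1 kbp deletion is
# about 100 times more likely than a full 10 kbp deletion of the major arc.
for k in (1, 25, 50, 75, 100):
    print(f"{k * 0.1:5.1f} kbp deletion: {counts[k]} occurrences")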

Taking the above points into account, the following equation can be constructed that calculates the mutation rates leading from one mutant type to another. The first term describes the effects of increasing oxidative stress caused by mitochondrial mutants (point 4). If there are only wild-type organelles (M0), the term reduces to one, leaving the mutation rate unchanged, as calculated by the second term. But for a population consisting of 100 percent mutants (M1 to M99) the first term evaluates to bF, the boost factor.
The second term incorporates the assumption that the mutation rate is per kbp and therefore the baseRate is multiplied by the length of the mtDNA of mutant Mi, which is given by (16 - 0.1*i). Furthermore, we have to take into account that small deletions are more likely to occur than large ones (point 3). There is a 1/100^2 chance of placing two random cuts (limited to 100 bp intervals) such that a 10 kbp piece results, but there is a 100/100^2 chance of placing the two cuts such that a 0.1 kbp piece results (i.e., a (100-k)/10000 chance for a deletion size of k*0.1 kbp). Since the total mutation rate is distributed among all possible resulting mutants ($\sum_{i=1}^{100} i/10000 \approx 1/2$), the resulting equation is given by:

$$\mathrm{mutRate}_{ik} = \left( \frac{(bF - 1)\sum_{j=1}^{99} M_j}{\sum_{j=0}^{99} M_j} + 1 \right) \cdot \frac{\mathrm{baseRate} \cdot (16 - 0.1\,i) \cdot (100 - k)}{5000}$$
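Transcribed directly into Python, the reconstructed rate expression could look as follows (a sketch with our own variable names; M is a length-100 list of copy numbers with index 0 denoting the wild type):

def mutation_rate(i, k, M, base_rate=6e-7, boost_factor=1.0):
    """Rate of the deletion mutation M_i -> M_k (k > i), per day.

    M            -- copy numbers; M[0] = wild type, M[1..99] = deletion mutants
    base_rate    -- mutation rate per day and kbp (baseRate in the text)
    boost_factor -- factor bF by which the rate rises if 100% of genomes are mutants
    """
    total = sum(M)
    mutant_fraction = sum(M[1:]) / total if total > 0 else 0.0
    # first term: 1 for a pure wild-type population, bF for 100% mutants
    stress_term = (boost_factor - 1.0) * mutant_fraction + 1.0
    length_kbp = 16.0 - 0.1 * i          # length of the template genome M_i
    return stress_term * base_rate * length_kbp * (100 - k) / 5000.0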

Replication
It is assumed that replication time is proportional to the length of the mtDNA molecule. However, because
of the special mode of replication of mtDNA, some regions of the mitochondrial genome count twice.
The origins of replication of the heavy and light strands of the mammalian mtDNA molecule (OH and
OL, respectively) are located at different positions on the molecule (Figure 3B). The replication process
proceeds asymmetrically via strand-displacement (Clayton, 2003; Tapper & Clayton, 1981). Leading
strand synthesis starts at OH and continues clockwise along the major arc until it reaches and exposes the
origin of the lagging strand, OL. Only now can lagging strand synthesis start counter-clockwise through
the major arc, while leading strand synthesis continues through the minor arc until it has completed a
full round of replication. However, the process is not finished until lagging strand synthesis has also
completed the full circle of DNA.
The time for the total replication process is therefore the time required for the synthesis of the full
length molecule plus the time that the start of light strand synthesis lags behind the start of heavy strand
synthesis. For the wild-type molecule this is the time required to synthesize 16kbp + 10kbp = 26kbp
so that the major arc effectively counts twice. As a consequence a mutant with a complete deletion of
the major arc replicates 26/6 = 4.33 times faster than wild-type and not 16/6 = 2.66 times, as might have
been expected.
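Written out as a small worked calculation (our notation, assuming the deletion lies entirely within the major arc), the effective length that sets the replication time of a mutant Mi is

\[
L_{\mathrm{eff}}(i) \;=\; \underbrace{(16 - 0.1\,i)}_{\text{full genome of } M_i}
\;+\; \underbrace{(10 - 0.1\,i)}_{\text{major arc, counted a second time}}
\;=\; 26 - 0.2\,i \quad [\text{kbp}],
\]

so the wild type has an effective length of 26 kbp, a hypothetical mutant lacking the entire major arc has 6 kbp, and the ratio 26/6 ≈ 4.33 gives the replication-speed advantage mentioned above; the factor 26 − 0.2 i reappears in the replication rate below.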
To construct a model that leads to a steady-state population of mtDNA molecules we have to add negative feedback such that the total synthesis of mtDNA declines with increasing copy number in the cell. The following equation shows how this feedback enters into the calculation of the replication rate for mutant Mi. The first term is a Hill-type product inhibition that depends on the total number of genomes, on a parameter controlling at which copy number the total synthesis rate has dropped to 50% of its maximum (k), and on an exponent determining how sharply the system reacts if the number of mtDNAs departs from k. The term also includes a constant specifying the maximum number of mtDNAs that can be produced per day (c).


$$\mathrm{repRate}_i = \frac{c\,k^5}{\left(\sum_{j=0}^{99} M_j\right)^{5} + k^5} \cdot \frac{M_i \cdot \frac{26}{26 - 0.2\,i}}{\sum_{j=0}^{99} M_j \cdot \frac{26}{26 - 0.2\,j}}$$

If there are no mtDNA molecules, the total synthesis rate takes its maximum value of c genomes per day; if the number of mtDNAs is equal to k the synthesis rate is reduced to c/2; and if the number of mtDNAs is very large the total synthesis rate approaches zero. We also have to specify how the total amount of synthesized mtDNAs is distributed between the different types (mutants). This is done by the second term, which basically weights the different types according to their size advantage and number. If there is only wild-type (M0), the second term simplifies to 1, indicating that all synthesized genomes will be wild-type. If there is a mutant with a 10 kbp deletion (M100) together with equal amounts of wild-type, a fraction 1/(1+26/6) = 0.1875 of the newly synthesized genomes will be wild-type and a fraction (26/6)/(1+26/6) = 0.8125 will be of type M100.
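The same expression can be transcribed into a short Python function (again only a sketch with our own names; M is the list of genotype copy numbers and the size-advantage weights use the effective lengths derived above):

def replication_rate(i, M, c=1000.0, k=5000.0, hill=5):
    """Rate at which new genomes of type M_i are synthesized, per day.

    M    -- copy numbers; M[0] = wild type, M[1..99] = deletion mutants
    c    -- maximum number of mtDNAs synthesized per day
    k    -- copy number at which total synthesis drops to half its maximum
    hill -- Hill exponent controlling how sharply synthesis reacts to copy number
    """
    total = sum(M)
    total_synthesis = c * k**hill / (total**hill + k**hill)   # Hill-type product inhibition
    # weight each genotype by its abundance and its replication-speed advantage
    weights = [M[j] * 26.0 / (26.0 - 0.2 * j) for j in range(len(M))]
    weight_sum = sum(weights)
    if weight_sum == 0.0:
        return 0.0
    return total_synthesis * weights[i] / weight_sum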

Degradation
If new mtDNAs are synthesized some genomes have to be degraded to maintain a stable population
size. Unfortunately, little is known about the process of mitochondrial degradation. It is not known if
they are selected for degradation according to the amount of membrane damage, as suggested by the
SOS hypothesis (de Grey, 1997) or if they are selected randomly. However, here we are exclusively interested in the effects of different mtDNA sizes and in this context degradation of mitochondria means
degradation of mtDNA molecules. For the purpose of this study we assume identical degradation rates,
and therefore half-lives, for all types of mtDNAs (M0 to M99). For all simulations shown, the half-life was set to 10 days (see also Table 1).

R esults
Figure 4 displays the results for simulations using a mutation rate of 6*10^-7 day^-1 kbp^-1. Part A of the figure
is effectively a contour plot showing the accumulation of mtDNA mutants over time. The solid line, for
instance, indicates the fraction of cells that contain at least one defective mitochondrial genome (>0%
mutants) at a given time. Under the mutation rate used in these simulations, all cells contained one or
more mutants after 400 days. Since each simulation run computes the fate of the mtDNA population of
a single cell over time, the fraction of cells is identical to the fraction of simulation runs performed.
The other contour lines represent higher levels of defective mtDNAs. The point marked by the arrow
shows that after approx. 200 days, 60% of the cells had a mtDNA population consisting of at least 20%
defective genomes.
The other parameters used for this simulation are summarized in Table 1. No boost factor has been used (bF = 1), i.e., defective mtDNA did not increase oxidative stress. Parameters k and c, necessary for the calculation of the replication rate, were set to 5000 and 1000, respectively. Together with the 10-day half-life of mitochondrial DNA, this resulted in a mtDNA population size of approx. 5500 (data not
shown). This is in good agreement with studies of skeletal and heart muscle that find values between
3000 and 7000 mtDNA molecules per cell (Miller, Rosenfeldt, Zhang, Linnane, & Nagley, 2003).
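The quoted population size can also be reproduced directly from the standard parameters by balancing synthesis against first-order degradation (a quick numerical check in Python, assuming for simplicity a pure wild-type population):

import math

c, k, hill = 1000.0, 5000.0, 5
deg_rate = math.log(2) / 10.0                  # per day, from the 10-day half-life

def net_change(n):
    """Synthesis minus degradation for a population of n wild-type genomes."""
    return c * k**hill / (n**hill + k**hill) - deg_rate * n

# simple bisection for the steady state between 1000 and 10000 copies
lo, hi = 1000.0, 10000.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if net_change(mid) > 0:
        lo = mid
    else:
        hi = mid
print(f"steady-state copy number: about {lo:.0f}")   # roughly 5500, matching the simulations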


Figure 4. Simulation results for a high mitochondrial mutation rate (6*10^-7 day^-1 kbp^-1). A) Summary
of the accumulation pattern of defective mitochondria obtained by following the fate of 1000 cells. The
curves show the fraction of cells that contain more than 0, 20, 40 or 60% defective mitochondria, i.e.
after 200 days roughly 60% of the cells had more than 20% of defective organelles (marked by arrow).
B) Distribution of mitochondrial deletion sizes at four equally spaced time points during the life of a rat.
Early in life (day 281) the distribution contains a large fraction of small deletions down to 0.5 kbp. As
time progresses, these small deletions disappear until finally (day 1127) the distribution is practically
free of deletions smaller than 3.5 kbp. The diagram is based on the results of 1000 simulation runs.

The diagram shown in Figure 4A summarizes the total number of defective genomes that accumulate with time, but does not differentiate among different deletion sizes. For this purpose the histogram
of deletion sizes at four different time points (day 281 to day 1127) is shown in Figure 4B. As can be
seen, early on many small deletions down to 0.5 kbp are present. But as time progresses, the shape of
the distribution changes: small deletions disappear and larger deletions emerge. This resembles very
closely the size distribution observed experimentally for old rats (Cao et al., 2001). It turns out that the
model assumption that mutants can mutate again is responsible for the disappearance of small deletions
over time.


Table 1. Parameters and standard values used for the simulations

Name      Value                    Description
baseRate  6*10^-7 day^-1 kbp^-1    Mutation rate per day and kbp. Order of magnitude is in accordance with (Shenkar et al., 1996).
bF        1                        Factor by which the mutation rate is increased if 100% of the mitochondrial population are deletion mutants.
k         5000                     If the number of mitochondria is equal to k, the synthesis rate of new organelles is half maximal. This is related to the number of mtDNAs per cell, which ranges from 700 in sperm (Diez-Sanchez et al., 2002) and 7000 in heart muscle (Miller et al., 2003) to 25000 in liver (Berdannier & Everts, 2001).
c         1000                     Maximum number of mitochondria that can be synthesized per day.
degRate   ln(2)/10 day^-1          Degradation rate for all mtDNAs. Corresponding to a half-life of 10 days (Huemer, Lee, Reeves, & Bickert, 1971; Korr, Kurz, Seidler, Sommer, & Schmitz, 1998; Menzies & Gold, 1971).

Conclusion
In the presented model description we omitted several of the actually performed simulations for brevity,
but the main benefits of the modeling approach are clear.
The stochastic approach is important to account for the probabilistic nature of mutations, which is
likely to be responsible for the mosaic pattern of healthy and impaired cells seen in cross sections of
tissues from old organisms. In some of the simulation runs, no mitochondrial mutants were present after
38 months, while in others a deletion event took place and the wild-type mtDNA was replaced by the
mutant. A model composed of differential equations could not have captured such a pattern, since that
approach is more equivalent to the experimental method of using tissue samples containing millions
of cells. The result is in both cases an averaging effect, obscuring the true pattern of accumulation of
mtDNA deletion mutants.
Furthermore, the model (i) provides the insight that the lack of small deletions might be caused by
several successive generations of mutants with increasing deletion size, (ii) it is able to explore several
scenarios, which are inaccessible for experiments (high or low mutation rate, different influence of
oxidative stress on mutation rate) and (iii) it makes the prediction that the size distribution should be
quite different in young animals, with many more small deletions being present.

Tools for Stochastic Modeling

Most current modeling in systems biology is done using deterministic ordinary differential equations,
because many tools exist for this type of modeling and the results are easier to interpret. But as discussed
above, there are good reasons to perform stochastic simulations if molecule numbers are low. This
approach, however, requires substantially more computing power, since many trajectories have to be
simulated for a statistical analysis of the system. The number of software packages capable of stochastic
simulations is slowly but constantly growing and in the following we will present a brief survey.


Dizzy
Dizzy (Figure 5) is the software tool we used for the mitochondrial model described above and it has
several nice features that make it our favourite tool. It is written in Java and thus platform independent.
This is very helpful in heterogeneous environments consisting of Linux/Unix and Windows machines.
Furthermore, Dizzy has an easy to use GUI and the results can be directed to a data file or displayed
graphically. Another useful feature is the separation of GUI and the number crunching routines. This
made it possible for us to calculate thousands of model trajectories on our Linux cluster and analyse
the results later. Dizzy not only contains various stochastic solvers (Gillespie direct method, Gibson-Bruck, tau-leap method), but also a number of deterministic ODE solvers. Without any modifications a
model can be simulated stochastically or deterministically. This is very convenient for comparing the
two different approaches and for obtaining a fast first view of the model behaviour (by using the ODE
solver). A model can be defined in the de facto standard SBML (Systems Biology Markup Language)

Figure 5. Screenshot of the stochastic simulation software package Dizzy. Three different windows
of the GUI are shown. The editor in the background displays a model description, whose simulation
results are shown in the window at the top right. In the front window at the bottom the type of simulator
(stochastic or deterministic) and the output destination (diagram, table, file) are chosen.


or in a powerful model definition language called CMDL. This was essential for our simulations, since
it enabled us to define 4950 mutation reactions with only five lines of code. Dizzy is developed by
Stephen Ramsey at the Institute for Systems Biology and is freely available (http://magnet.systemsbiology.net/software/Dizzy).

Other Tools
Copasi (http://www.copasi.org) is another popular tool that is also capable of stochastic (Gibson-Bruck)
and deterministic simulations. Copasi can read and write SBML models and its major strength is the
analysis of existing deterministic models. It performs metabolic control and steady state analysis, calculates Lyapunov exponents and can be used for model parameter estimation using time course or steady
state data. Like Dizzy it can be used as a command line version for batch processing.
Stocks2 (http://www.sysbio.pl/stocks) can only perform stochastic simulations but contains, in addition to the standard solvers, also a hybrid algorithm for reaction systems that consist of a mixture of
rare and frequent molecular species. Stocks models frequent species using an approximate algorithm
and slow reactions with an exact Gillespie algorithm. Species are assigned dynamically to either the frequent or the rare group to take care of concentration changes over time.
The Systems Biology Workbench (SBW) (Hucka et al., 2002) is a software infrastructure that enables
different tools to communicate with each other (http://www.sys-bio.org). That means all SBW aware
programs can use services provided by different modules and in turn advertise their own specialized
services. A powerful SBW module that provides stochastic simulation and analysis functions has been
developed by the Sauro group and can be downloaded from http://public.kgi.edu/~rrao. Popular model
construction tools that are SBW aware are for instance JDesigner (included in SBW) or CellDesigner
(http://www.celldesigner.org).
The software packages discussed so far are only suitable for spatially homogeneous models, which
implies a constant and immediate mixing of all participating species. For stochastic simulations of 3D
reaction diffusion systems, MesoRD (Hattne, Fange, & Elf, 2005) is suitable. It is free software, written
in C++ and implements the next subvolume method to simulate the Markov process corresponding to
the reaction-diffusion master equation. It can be obtained from http://mesord.sourceforge.net.

Survey of Mathematical Modeling in Aging Research

While the section about the accumulation of defective mitochondria gave an in-depth view of the development and use of a mathematical model for aging research, it is of course only one in a large number
of models in this area of research. To provide the reader with a broader picture of the types of models
that exist and the types of problems that are tackled, we close this chapter with a brief survey of selected
models from the literature.
The interaction between oxygen radicals and mitochondria plays a very important role in many aging theories and one of the most complete models of oxidative membrane damage to mitochondria has
been developed by Antunes et al. (1996). They developed a system of ordinary differential equations
describing more than 80 reactions of lipid metabolism in the inner mitochondrial membrane and surrounding matrix environment. Apart from being an invaluable source of kinetic parameters the main
result is that the perhydroxyl radical is the main initiator of lipid peroxidation.


Another publication dealing with mitochondria is by Albert et al. (1996). They model the dynamics
of plant mitochondrial genomes, which are composed of a set of molecules of various sizes that generate each other through recombination between repeated sequences. Their stochastic model describes
the selection process at the inter-molecular, inter-mitochondrial and inter-cellular level. They show that
the inter-mitochondrial level is important for maintaining the entire mitochondrial information in cells.
Under those conditions no master circle with maximum fitness is necessary.
Since aging is a process that comes about by the interaction of several damage pathways (reflected
by the different theories), it is necessary to model the reaction network as a whole. The model of Kowald
& Kirkwood (1996) integrates the contributions of defective mitochondria, aberrant proteins and free
radicals. It also includes antioxidant enzymes and proteolytic scavengers. The most important result is that damage accumulation in mitochondria and proteins occurs on different time scales and that the final breakdown seems to be a cooperation of mitochondrial and cytoplasmic reactions. The mitochondria undergo gradual, long-term changes, which eventually trigger a short-lived cytoplasmic error loop.
A rather unusual phenomenon has been observed in Saccharomyces cerevisiae, which is also used as a model system for aging. Yeast mother cells have a limited division potential, while young daughter
cells start with a fresh division capacity. It has been observed that extrachromosomal ribosomal DNA
circles (ERCs) are liberated from genomic DNA and accumulate in old yeast cells. A mathematical model
was developed that readily explains the observed data, if ERC formation increases with the age of the
cell (C. S. Gillespie et al., 2004).
Finally, mathematical models have also been used to investigate evolutionary questions about the
aging process. The evolution of human menopause is difficult to explain. It has been suggested that there
may be little benefit for an older mother in taking the increasing risk of a further pregnancy if existing
children depend critically on her survival. Another idea is that post-reproductive grandmothers increase
their fitness by assisting their adult daughters. Modeling studies showed that individual theories fail
to provide sufficient selection advantage to explain the evolution of menopause, but a combined model
can achieve this (Shanley & Kirkwood, 2001).
And, of course, the evolution of the aging process itself has also been modeled. It is not trivial to
explain why organisms should grow old and die. What is the selective advantage of this trait and why
have different species widely differing life spans? The disposable soma theory (T. B. L Kirkwood &
Holliday, 1986; T. B. L. Kirkwood & Rose, 1991) shows mathematically that aging can be explained
as the consequence of an optimal resource allocation between survival and reproduction. The optimal
resource allocation (and hence aging rate) differs for different species according to their environmental
mortality risk.
Mathematical models have been used for many years in aging research to understand phenomena
and test explicit predictions of theories. Because aging practically affects all pathways and all levels of
a living organism, it is a prime candidate for the emerging field of systems biology. The generation of
high throughput data will enable the development of larger models that describe the complex network
of interactions underlying the different aging theories.

References
Albert, B., Godelle, B., Atlan, A., De Paepe, R., & Gouyon, P. H. (1996). Dynamics of plant mitochondrial genome: Model of a three level selection process. Genetics, 144, 369-382.


Antunes, F., Salvador, A., Marinho, H. S., Alves, R., & Pinto, R. E. (1996). Lipid peroxidation in mitochondrial inner membranes. I. An integrative kinetic model. Free Radical Biology & Medicine, 21(7),
917-943.
Arimura, S., Yamamoto, J., Aida, G. P., Nakazono, M., & Tsutsumi, N. (2004). Frequent fusion and
fission of plant mitochondria with unequal nucleoid distribution. Proc Natl Acad Sci USA, 101(20),
7805-7808.
Berdannier, C. D., & Everts, H. B. (2001). Mitochondrial DNA in aging and degenerative disease. Mutation Research, 475, 169-183.
Cao, Z., Wanagat, J., McKiernan, S. H., & Aiken, J. M. (2001). Mitochondrial DNA deletion mutations
are concomitant with ragged red regions of individual, aged muscle fibers: Analysis by laser-capture
microdissection. Nucleic Acids Research, 29(21), 4502-4508.
Clayton, D. A. (2003). Mitochondrial DNA replication: What we know. IUBMB Life, 55(4-5), 213-217.
Cortopassi, G. A., Shibata, D., Soong, N. W., & Arnheim, N. (1992). A pattern of accumulation of a
somatic deletion of mitochondrial DNA in aging human tissues. Proc. Natl. Acad. Sci. USA, 89, 7370-7374.
de Grey, A. D. N. J. (1997). A proposed refinement of the mitochondrial free radical theory of aging.
BioEssays, 19(2), 161-166.
Diez-Sanchez, C., Ruiz-Pesini, E., Lapena, A. C., Montoya, J., Perez-Martos, A., Enriquez, A., et al.
(2002). Mitochondrial DNA Content of Human Spermatozoa. Biology of Reproduction, 68, 180-185.
Gillespie, C. S., Proctor, C. J., Boys, R. J., Shanley, D. P., Wilkinson, D. J., & Kirkwood, T. B. L. (2004).
A mathematical model of ageing in yeast. J. of Theoretical Biology.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. of Physical Chemistry, 81, 2340-2361.
Gokey, N. G., Cao, Z., Pak, J. W., Lee, D., McKiernan, S. H., McKenzie, D., et al. (2004). Molecular
analyses of mtDNA deletion mutations in microdissected skeletal muscle fibers from aged rhesus monkeys. Aging Cell, 3(5), 319-326.
Gompertz, B. (1825). On the nature of the function expressive of the law of human mortality and on a new
mode of determining life contingencies. Philosophical Transactions of the Royal Society, 2, 513-585.
Harley, C. B. (1991). Telomere loss: Mitotic clock or genetic time bomb? Mutation Research, 256, 271-282.
Harley, C. B., Vaziri, H., Counter, C. M., & Allsopp, R. C. (1992). The telomere hypothesis of cellular
aging. Experimental Gerontology, 27, 375-382.
Hattne, J., Fange, D., & Elf, J. (2005). Stochastic reaction-diffusion simulation with MesoRD. Bioinformatics, 21(12), 2923-2924.
Hattori, K., Tanaka, M., Sugiyama, S., Obayashi, T., Ito, T., Satake, T., et al. (1991). Age-dependent
increase in deleted mitochondrial DNA in the human heart: Possible contributory factor to presbycardia.
American Heart Journal, 121(6), 1735-1742.

Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J., & Kitano, H. (2002). The ERATO Systems
Biology Workbench: Enabling interaction and exchange between software tools for computational biology. Pac Symp Biocomput, 450-461.
Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H., et al. (2003). The systems
biology markup language (SBML): A medium for representation and exchange of biochemical network
models. Bioinformatics, 19(4), 524-531.
Huemer, R. P., Lee, K. D., Reeves, A. E., & Bickert, C. (1971). Mitochondrial studies in senescent mice
- II. Specific activity, bouyant density, and turnover of mitochondrial DNA. Experimental Gerontology,
6, 327-334.
Jendrach, M., Pohl, S., Vöth, M., Kowald, A., Hammerstein, P., & Bereiter-Hahn, J. (2005). Morphodynamic changes of mitochondria during ageing of human endothelial cells. Mechanisms of Ageing
and Development, 126, 813-821.
Karbowski, M., Arnoult, D., Chen, H., Chan, D. C., Smith, C. L., & Youle, R. J. (2004). Quantitation of
mitochondrial dynamics by photolabeling of individual organelles shows that mitochondrial fusion is
blocked during the Bax activation phase of apoptosis. J Cell Biol, 164(4), 493-499.
Khrapko, K., Bodyak, N., Thilly, W. G., van Orsouw, N. J., Zhang, X., Coller, H. A., et al. (1999). Cell
by cell scanning of whole mitochondrial genomes in aged human heart reveals a significant fraction of
myocytes with clonally expanded deletions. Nucleic Acids Research, 27(11), 2434-2441.
Kirkwood, T. B. L., & Holliday, R. (1986). Ageing as a consequence of natural selection. In A. H. Bittles
& K. J. Collins (Eds.), The biology of human ageing (pp. 1-15). Cambridge University Press.
Kirkwood, T. B. L., & Rose, M. R. (1991). Evolution of senescence: Late survival sacrificed for reproduction. Philosophical Transactions of the Royal Society, London B, 332, 15-24.
Korr, H., Kurz, C., Seidler, T. O., Sommer, D., & Schmitz, C. (1998). Mitochondrial DNA synthesis
studied autoradiographically in various cell types in vivo. Braz J Med Biol Res, 31(2), 289-298.
Kowald, A. (2002). Lifespan does not measure ageing. Biogerontology, 3, 187-190.
Kowald, A., Jendrach, M., Pohl, S., Bereiter-Hahn, J., & Hammerstein, P. (2005). On the relevance of
mitochondrial fusions for the accumulation of mitochondrial deletion mutants: A modelling study. Aging Cell, in press.
Kowald, A., & Kirkwood, T. B. L. (1996). A network theory of ageing: The interactions of defective
mitochondria, aberrant proteins, free radicals and scavengers in the ageing process. Mutation Research,
316, 209-236.
Kowald, A., & Kirkwood, T. B. L. (2000). Accumulation of defective mitochondria through delayed
degradation of damaged organelles and its possible role in the ageing of post-mitotic and dividing cells.
J. of Theoretical Biology, 202, 145-160.
Linnane, A. W., Baumer, A., Maxwell, R. J., Preston, H., Zhang, C., & Marzuki, S. (1990). Mitochondrial gene mutation: The ageing process and degenerative diseases. Biochemistry International, 22(6),
1067-1076.


Linnane, A. W., Marzuki, S., Ozawa, T., & Tanaka, M. (1989). Mitochondrial DNA mutations as an
important contributor to ageing and degenerative diseases. The Lancet, 333, 642-645.
Lundblad, V., & Szostak, J. W. (1989). A mutant with a defect in telomere elongation leads to senescence
in yeast. Cell, 57, 633-643.
Makeham, W. H. (1867). On the law of mortality. J. Inst. Actuaries, 13, 325-358.
Medvedev, Z. A. (1990). An attempt at a rational classification of theories of ageing. Biological Reviews,
65, 375-398.
Menzies, R. A., & Gold, P. H. (1971). The turnover of mitochondria in a variety of tissues of young adult
and aged rats. J. of Biological Chemistry, 246(8), 2425-2429.
Miller, F. J., Rosenfeldt, F. L., Zhang, C., Linnane, A. W., & Nagley, P. (2003). Precise determination of
mitochondrial DNA copy number in human skeletal and cardiac muscle by a PCR-based assay: Lack
of change of copy number with age. Nucleic Acids Res, 31(11), e61.
Miquel, J., Economos, A. C., Fleming, J., & Johnson, J. E. (1980). Mitochondrial role in cell ageing.
Experimental Gerontology, 15, 575-591.
Nunnari, J., Marshall, W. F., Straight, A., Murray, A., Sedat, J. W., & Walter, P. (1997). Mitochondrial
transmission during mating in Saccharomyces cerevisiae is determined by mitochondrial fusion and
fission and the intramitochondrial segregation of mitochondrial DNA. Molecular Biology of the Cell,
8, 1233-1242.
Oeppen, J., & Vaupel, J. W. (2002). Broken limits to life expectancy. Science, 296, 1029-1031.
Olovnikov, A. M. (1973). A theory of Marginotomy. J. of Theoretical Biology, 41, 181-190.
Ramsey, S., & Orrell, D. (2005). Dizzy: Stochastic simulations of large-scale genetic regulatory networks. J. Bioinf. Comp. Biol., 3(2), 1-21.
Randerath, K., Randerath, E., & Filburn, C. (1996). Genomic and mitochondrial DNA alterations in
aging. In L. E. Schneider & J. W. Rowe (Eds.), Handbook of The Biology of Aging (4th ed., pp. 198-214).
London: Academic Press.
Shanley, D. P., & Kirkwood, T. B. (2001). Evolution of the human menopause. Bioessays, 23(3), 282-287.
Shenkar, R., Navidi, W., Tavare, S., Dang, M. H., Chomyn, A., Attardi, G., et al. (1996). The mutation
rate of the human mtDNA deletion mtDNA4977. American J. of Human Genetics, 59(4), 772-780.
Takai, D., Inoue, K., Goto, Y., Nonaka, I., & Hayashi, J. (1997). The interorganellar interaction between
distinct human mitochondria with deletion mutant mtDNA from a patient with mitochondrial disease
and with HeLa mtDNA. J. of Biological Chemistry, 272(9), 6028-6033.
Takai, D., Isobe, K., & Hayashi, J.-I. (1999). Transcomplementation between different types of respiration deficient mitochondria with different pathogenic mutant mitochondrial DNAs. J. of Biological
Chemistry, 274(16), 11199-11202.


Tapper, D. P., & Clayton, D. A. (1981). Mechanism of replication of human mitochondrial DNA. Localization of the 5′ ends of nascent daughter strands. J Biol Chem, 256(10), 5109-5115.
von Zglinicki, T., Saretzki, G., Docke, W., & Lotze, C. (1995). Mild hyperoxia shortens telomeres and
inhibits proliferation of fibroblasts. A model for senescence. Experimental Cell Research, 220(1), 186-193.
Watson, J. D. (1972). Origin of concatameric T4 DNA. Nature, 239, 197-201.
Yu, G.-L., Bradley, J. D., Attardi, L. D., & Blackburn, E. H. (1990). In vivo alteration of telomere sequences and senescence caused by mutated Tetrahymena telomerase RNAs. Nature, 344, 126-131.

Key Terms
Aging: A biological phenomenon observed in most animals, leading to increasing functional impairment and a constantly rising mortality rate. Age-related changes can be observed at the intracellular, tissue
and organismic levels. Many theories about the mechanism of the aging process exist, but the details are
currently still unresolved.
Disposable Soma Theory: Popular theory about the evolution of the aging process. Aging is explained as the result of an optimal resource allocation between reproduction and self-maintenance. That
means that aging itself has no selection advantage, but is a side product of another selected trait. Species-specific
life-spans are readily explained by different environmental mortalities.
Dizzy: A stochastic simulation tool written in Java. Models can be defined in systems biology markup
language (SBML) or a proprietary language and simulated using various stochastic and deterministic
algorithms. The GUI and the core engine are separate modules so that Dizzy can also be used for batch
calculations on a computer cluster.
Life Expectancy: Time until 50% of a cohort of newborn individuals have died. Also known as average
life-span, although technically it is the median life-span. The life expectancy for humans in industrialized countries is currently between 75 and 80 years, and is 2-3 years higher for women than for men.
Mitochondria: Cellular organelles present in most eukaryotic cells that are important for calcium
homeostasis, apoptosis and energy production. Mitochondria are endosymbionts and probably derived
from purple bacteria. A remnant of this origin is the small circular mitochondrial DNA (mtDNA) that
is at the center of the mitochondrial theory of aging.
Stochastic Modeling: A modeling framework that accounts for microscopic random fluctuations
and the discreteness of molecules. Stochastic models explicitly calculate the change in the number of
molecules of the participating species during the time course of a chemical reaction. The first exact
stochastic simulation algorithms were developed by Gillespie (1977) and are now part of several modeling tools. Stochastic simulations are normally more time consuming than deterministic simulations
via differential equations.


Systems Biology Workbench: The Systems Biology Workbench (SBW) is a software system that
enables different modeling programs to communicate with each other and to provide or use specialized
analysis services. In this way SBW acts as a broker for services such as deterministic and stochastic simulation
engines, stability and bifurcation analysis, model optimization and graphical model building. Popular
SBW-aware tools include JDesigner, CellDesigner and Dizzy.


Chapter XIX

The Sebaceous Gland: A Model of Hormonal Aging
Evgenia Makrantonaki
Dessau Medical Center, Germany and
Charité Universitaetsmedizin Berlin, Germany
Christos C. Zouboulis
Dessau Medical Center, Germany and
Charité Universitaetsmedizin Berlin, Germany

abstract
This chapter introduces an in vitro model as a means of studying human hormonal aging. For this
purpose, human sebaceous gland cells were maintained under a hormone-substituted environment.
This environment consisted of growth factors and sex steroids in concentrations corresponding to those
circulating in young and postmenopausal women. The authors suggest that hormone decline, occurring
with age, may play a significant role not only in the maintenance of skin homeostasis but also in the
initiation of aging. Furthermore, skin, the largest organ of the body, offers an alternative approach to
understanding the molecular mechanisms underlying the aging process.

SKIN AGING AND HORMONES


Signs of aging become evident with time, and skin provides the first obvious evidence of this process. Since the collection of specimens from internal organs, such as brain, heart, vessels, bones and
endocrine glands, throughout life for experimental research purposes is associated with major practical
and ethical obstacles in humans, interspecies research, but also the use of human skin as a common
research tool, offers promising alternative approaches.
Among the multiple factors involved in the process of skin aging, the hormone environment plays a
distinct role (Makrantonaki & Zouboulis, 2007). Alterations in appearance due to declining skin quality
are common complaints in postmenopausal women. The postmenopausal skin state is associated with
a rapid worsening of skin structure and functions, which can be at least partially repaired by hormone
replacement therapy (HRT) or local estrogen treatment (Brincat, 2000). Improvement of epidermal skin
moisture, elasticity and skin thickness (Fuchs, Solis, Tapawan, & Paranjpe, 2003), enhanced production of surface lipids (Sator, Schmidt, Sator, Huber, & Honigsmann, 2001), reduction of wrinkle depth,
restoration of collagen fibers (Schmidt, Binder, Demschik, Bieglmayer, & Reiner, 1996) and increase
of the collagen III/I ratio (Affinito et al., 1999) have been reported. Further potential benefits of long-term HRT are the prevention of osteoporosis and atherosclerotic cardiovascular diseases (Hulley et al.,
1998). There has been considerable interest in the possibility that HRT may also be protective against
the risk of developing neurodegenerative diseases, e.g. Alzheimer's disease. This remains controversial
and the benefit is at present unproven (Yaffe, Sawaya et al., 1988).
Conventional contraindications to HRT include a history of breast cancer or endometrial cancer,
recent undiagnosed genital bleeding, active severe liver disease or a history of thromboembolism. In
addition, several recent studies, which have shown that the undesired effects are more serious than
the advantageous ones, have entirely changed the strategy of HRT and limited its use to certain
cases only (Rossouw et al., 2002; Solomon & Dluhy, 2003).

THE SEBACEOUS GLAND


Sebaceous glands, or holocrine glands, are skin appendages found over the entire surface of the
body except the palms, soles and dorsum of the feet. They are largest and most concentrated in the face
and scalp, where they are the sites prone to acne. The normal function of sebaceous glands is to produce
and secrete sebum, a group of complex oils including triglycerides and fatty acid breakdown products,
wax esters, squalene, cholesterol esters and cholesterol (Downing et al., 1987; Nikkari, Schreibman, &
Ahrens, 1974; Ramasastry, Downing, Pochi, & Strauss, 1970; Thody & Shuster, 1989). The most accepted function of sebum is skin lubrication, in order to protect the skin against friction and to make it more
impervious to moisture.
Furthermore, sebum lipids transport antioxidants in and on the skin and exhibit a natural light
protective activity. They exhibit an innate antibacterial activity and have a pro- and anti-inflammatory function. The sebaceous gland can regulate the activity of xenobiotics and is actively involved in
the wound healing process (Zouboulis, 2004). It possesses all enzymes required for the intracellular
androgen metabolism and confers upon the skin an independent endocrine function (Fritsch, Orfanos,
& Zouboulis, 2001).
With advancing age the size of sebaceous gland cells tends to decrease, while their number remains
approximately the same throughout life (Zouboulis & Boschnakow, 2001). Sebaceous gland cells show
an age-related reduced secretory output, which results in a decrease in the surface lipid levels and skin
xerosis (Engelke, Jensen, Ekanayake-Mudiyanselage, & Proksch, 1997; Pochi, Strauss, & Downing,
1979) - a major characteristic of aged skin. Hormone substitution with estrogens in vivo could significantly reverse skin xerosis indicating a hormone-dependent function of the sebaceous gland cells (Dunn,
Damesyn, Moore, Reuben, & Greendale, 1997).
Human SZ95 sebocytes are sebaceous gland cells derived from facial skin and transfected with the
SV-40 large T antigen and offer a unique model for investigations on the physiology of aging. They
constitute a better alternative to animal research and they functionally behave in a manner concomitant


Figure 1. Illustration of a human sebaceous gland connected to a hair follicle on the skin surface (face)
of a 70-year-old man. The sebaceous gland consists of sebaceous gland cells, which are responsible
for sebum production.

to nontransfected human sebocytes. SZ95 sebocytes show a similar epithelial morphology to normal
sebocytes and they can produce squalene and wax esters, as well as triglycerides and free fatty acids, even after 25-40 passages (Patents and patent applications: WO0046353, EP1151082, AU770518,
US2002034820, CA2360762, CN100366735C, JP2002535984, IL144683D, PL194865, HU0200048,
AT1151082, BE1151082, CH11151082, DE19903920, DK1151082, FR1151082, GB1151082, IE1151082,
IT1151082, KR689120) (Zouboulis, Seltmann, Neitzel, & Orfanos, 1999).

IN VITRO MODEL OF HUMAN HORMONAL AGING


Using models of animal aging, such as the nematode Caenorhabditis elegans, the fly Drosophila melanogaster, and the mouse Mus musculus, the importance of hormonal signals on the aging phenotype
has been already documented. Suppression of the levels of hormones, such as insulin-like peptides,
growth hormone (GH) and sterols (Tatar, Bartke, & Antebi, 2003) or of their receptor expression has
been shown in animals to increase lifespan and delay age-dependent functional decline. Conboy et
al. (2005) showed that the age-related decline of progenitor cell activity of mice could be reversed by


exposure to young serum and that the cells could retain much of their intrinsic proliferative potential
underlining the great importance of the systemic environment (Conboy et al., 2005).
Within the scope of the Explorative Project "Genetic aetiology of human longevity", supported by
the German National Genome Research Network 2 (NGFN-2), an in vitro model of human hormonal
aging has been developed. Human sebaceous gland cells (SZ95 sebocytes) were maintained under a
hormone-substituted environment consisting of growth hormone (GH), insulin-like growth factor I,
estrogens, androgens and progesterone in concentrations corresponding to those circulating in 20- and
in 60-year-old women (Makrantonaki et al., 2006). Of 15,529 tested genes, 899 showed a differential
expression between SZ95 sebocytes under the 20- and the 60-year-old hormone mixture.
This result demonstrates that hormones interact in a complex fashion, and that changes in their circulating
blood levels may significantly alter the development of cells by regulating their transcriptome. Among
the 899 genes were genes involved in stress response, chaperone activity, ubiquitin-proteasome activity,
cholesterol biosynthesis and fatty acid metabolism, eicosanoid biosynthesis, synthesis of extracellular
matrix, nucleotide and ATP metabolism, and DNA repair mechanisms - biological processes which are
hallmarks of aging.
The most significantly altered signaling pathway identified was that of transforming growth factor-β
(TGF-β). The TGF-β signaling pathway is involved in different biological processes during embryonic
development and plays a distinct role in adult organisms in tissue homeostasis (Massague, 1998). In
human skin, the TGF-β signaling pathway has been shown to regulate many cellular processes, such
as differentiation and proliferation of keratinocytes and fibroblasts and the synthesis of extracellular
matrix proteins (Massague, Freidenberg, Olefsky, & Czech, 1983). In addition, a disturbed function
of this cascade has been associated with tumorigenesis, i.e. in pancreatic, prostate, intestine, breast,
and uterine cancer (Levy & Hill, 2006). A differential expression of TGF-β isoforms, activins, basal
membrane proteins, MADHs/SMADs, and other components of the TGF-β signaling cascade was
shown in SZ95 sebocytes under the 60-year-old hormone mixture. These data suggest that age-specific
hormonal changes are likely to play a determining role not only in the healthy aging process, but also
in tumorigenesis.
Interestingly, genes expressed in signaling pathways operative in age-associated diseases, such
as Huntington's disease (Luthi-Carter et al., 2002; Sipione et al., 2002), dentatorubral-pallidoluysian
atrophy (Luthi-Carter et al., 2002), and amyotrophic lateral sclerosis (Jiang et al., 2005), were also
identified. According to these results, a disturbed hormone status may play a part in the development of
neurodegenerative diseases.

CONCLUSION
The fundamental aim of aging research is a better understanding of the mechanisms involved and
the prevention of age-associated diseases by early identification of individual molecular risk profiles.
Recent data suggest that skin represents an adequate model for aging research and that the change in hormone levels occurring with age plays a major role in the generation of aging. Thus, these results could
be a basis for an integrated and interdisciplinary approach to the analysis of aging.


REFERENCES
Affinito, P., Palomba, S., Sorrentino, C., Di Carlo, C., Bifulco, G., Arienzo, M. P., et al. (1999). Effects
of postmenopausal hypoestrogenism on skin collagen. Maturitas, 33(3), 239-247.
Brincat, M. P. (2000). Hormone replacement therapy and the skin. Maturitas, 35(2), 107-117.
Conboy, I. M., Conboy, M. J., Wagers, A. J., Girma, E. R., Weissman, I. L., & Rando, T. A. (2005).
Rejuvenation of aged progenitor cells by exposure to a young systemic environment. Nature, 433(7027),
760-764.
Downing, D. T., Stewart, M. E., Wertz, P. W., Colton, S. W., Abraham, W., & Strauss, J. S. (1987). Skin
lipids: An update. J Invest Dermatol, 88(3 Suppl), 2s-6s.
Dunn, L. B., Damesyn, M., Moore, A. A., Reuben, D. B., & Greendale, G. A. (1997). Does estrogen
prevent skin aging? Results from the First National Health and Nutrition Examination Survey (NHANES
I). Arch Dermatol, 133(3), 339-342.
Engelke, M., Jensen, J. M., Ekanayake-Mudiyanselage, S., & Proksch, E. (1997). Effects of xerosis and
ageing on epidermal proliferation and differentiation. Br J Dermatol, 137(2), 219-225.
Fritsch, M., Orfanos, C. E., & Zouboulis, C. C. (2001). Sebocytes are the key regulators of androgen
homeostasis in human skin. J Invest Dermatol, 116(5), 793-800.
Fuchs, K. O., Solis, O., Tapawan, R., & Paranjpe, J. (2003). The effects of an estrogen and glycolic
acid cream on the facial skin of postmenopausal women: A randomized histologic study. Cutis, 71(6),
481-488.
Hulley, S., Grady, D., Bush, T., Furberg, C., Herrington, D., Riggs, B., et al. (1998). Randomized trial of
estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women.
Heart and Estrogen/progestin Replacement Study (HERS) Research Group. JAMA, 280(7), 605-613.
Jiang, Y. M., Yamamoto, M., Kobayashi, Y., Yoshihara, T., Liang, Y., Terao, S., et al. (2005). Gene
expression profile of spinal motor neurons in sporadic amyotrophic lateral sclerosis. Ann Neurol, 57(2),
236-251.
Levy, L., & Hill, C. S. (2006). Alterations in components of the TGF-beta superfamily signaling pathways in human cancer. Cytokine Growth Factor Rev, 17(1-2), 41-58.
Luthi-Carter, R., Strand, A. D., Hanson, S. A., Kooperberg, C., Schilling, G., La Spada, A. R., et al.
(2002). Polyglutamine and transcription: Gene expression changes shared by DRPLA and Huntington's
disease mouse models reveal context-independent effects. Hum Mol Genet, 11(17), 1927-1937.
Makrantonaki, E., Adjaye, J., Herwig, R., Brink, T. C., Groth, D., Hultschig, C., et al. (2006). Age-specific hormonal decline is accompanied by transcriptional changes in human sebocytes in vitro. Aging
Cell, 5(4), 331-344.
Makrantonaki, E., & Zouboulis, C. C. (2007). William J. Cunliffe Scientific Awards. Characteristics
and pathomechanisms of endogenously aged skin. Dermatology, 214(4), 352-360.


Massague, J. (1998). TGF-beta signal transduction. Annu Rev Biochem, 67, 753-791.
Massague, J., Freidenberg, G. F., Olefsky, J. M., & Czech, M. P. (1983). Parallel decreases in the expression of receptors for insulin and insulin-like growth factor I in a mutant human fibroblast line. Diabetes,
32(6), 541-544.
Nikkari, T., Schreibman, P. H., & Ahrens, E. H., Jr. (1974). In vivo studies of sterol and squalene secretion by human skin. J Lipid Res, 15(6), 563-573.
Pochi, P. E., Strauss, J. S., & Downing, D. T. (1979). Age-related changes in sebaceous gland activity.
J Invest Dermatol, 73(1), 108-111.
Ramasastry, P., Downing, D. T., Pochi, P. E., & Strauss, J. S. (1970). Chemical composition of human
skin surface lipids from birth to puberty. J Invest Dermatol, 54(2), 139-144.
Rossouw, J. E., Anderson, G. L., Prentice, R. L., LaCroix, A. Z., Kooperberg, C., Stefanick, M. L., et
al. (2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal
results From the Women's Health Initiative randomized controlled trial. JAMA, 288(3), 321-333.
Sator, P. G., Schmidt, J. B., Sator, M. O., Huber, J. C., & Honigsmann, H. (2001). The influence of hormone replacement therapy on skin ageing: A pilot study. Maturitas, 39(1), 43-55.
Schmidt, J. B., Binder, M., Demschik, G., Bieglmayer, C., & Reiner, A. (1996). Treatment of skin aging
with topical estrogens. Int J Dermatol, 35(9), 669-674.
Sipione, S., Rigamonti, D., Valenza, M., Zuccato, C., Conti, L., Pritchard, J., et al. (2002). Early transcriptional profiles in huntingtin-inducible striatal cells by microarray analyses. Hum Mol Genet, 11(17),
1953-1965.
Solomon, C. G., & Dluhy, R. G. (2003). Rethinking postmenopausal hormone therapy. N Engl J Med,
348(7), 579-580.
Tatar, M., Bartke, A., & Antebi, A. (2003). The endocrine regulation of aging by insulin-like signals.
Science, 299(5611), 1346-1351.
Thody, A. J., & Shuster, S. (1989). Control and function of sebaceous glands. Physiol Rev, 69(2), 383-416.
Zouboulis, C. C. (2004). Acne and sebaceous gland function. Clin Dermatol, 22(5), 360-366.
Zouboulis, C. C., & Boschnakow, A. (2001). Chronological ageing and photoageing of the human sebaceous gland. Clin Exp Dermatol, 26(7), 600-607.
Zouboulis, C. C., Seltmann, H., Neitzel, H., & Orfanos, C. E. (1999). Establishment and characterization
of an immortalized human sebaceous gland cell line (SZ95). J Invest Dermatol, 113(6), 1011-1020.


KEY TERMS
Aging: A complex process that defines the changes observed throughout the organism's lifespan
and cannot be attributed to a single pathway or a single cause. Aging is controlled by both environmental
factors and the genetic constitution of the individual and has been described as a progressive decline of
the ability to withstand stress, damage and disease. Furthermore, it is characterized by an increase of
degenerative and neoplastic disorders.
Endogenous Skin Aging: Endogenous aging, otherwise called intrinsic or chronological aging,
is influenced by genetics, hormonal changes and metabolic processes, which appear at advanced age.
Endogenous skin aging can be observed on non-UV-exposed areas of the body and can be considered a
model of the aging process taking place in internal organs.
Exogenous Skin Aging: Exogenous aging, otherwise called extrinsic aging, takes place in exposed
areas of the body (e.g. head, neck) which are constantly influenced by various environmental factors
including ionizing and non-ionizing irradiation, air pollution, natural deleterious gases (e.g. ozone and
high concentrations of oxygen), smoking, invasion of pathogenic bacteria, viruses, xenobiotics and
mechanical stress. Among them, UV-irradiation is the most fundamental one, as it can damage skin
to such an extent that it seems prematurely aged (photoaging). This premature aging process is
cumulative with sun exposure and affects individuals of skin phototypes I and II more strongly.
Hormone: (from Greek, "to set in motion") A chemical messenger from one cell (or group of
cells) to another. The function of hormones is to serve as a signal to the target cells. Endocrine hormone
molecules are secreted (released) directly into the bloodstream, while exocrine hormones (or ectohormones) are secreted directly into a duct, and from the duct they either flow into the bloodstream or they
flow from cell to cell by diffusion in a process known as paracrine signaling.
Skin: The largest organ of the body; it exhibits multiple functions, among them serving as a protective
barrier between internal organs and the environment, and is a complex organ with multiple cell types and
structures. It is divided into three major compartments: epidermis, dermis and subcutaneous tissue.
Sebaceous Glands: Can usually be found in hair-covered areas, where they are connected to hair
follicles to deposit sebum on the hairs and bring it to the skin surface along the hair shaft. The structure
consisting of hair, hair follicle and sebaceous gland is known as the pilosebaceous unit. Sebaceous glands are largest
and most concentrated in the face and scalp, where they are the sites prone to acne.
Sebum: (Latin, meaning fat or tallow) Produced by sebaceous glands; its main function is
to protect and waterproof hair and skin and to keep them from becoming dry, brittle, and cracked. It can
also inhibit the growth of microorganisms on skin. In the sebaceous glands, sebum is produced within
specialized cells and is released as these cells burst; sebaceous glands are thus classified as holocrine
glands. In humans, the composition of sebum is as follows: 25% wax monoesters, 41% triglycerides,
16% free fatty acids and 12% squalene.
SZ95 Sebocytes: Are sebaceous gland cells derived from facial skin and transfected with the SV-40
large T antigen. They functionally behave in a manner concomitant to nontransfected human sebaceous
gland cells.


Section VI

Systems Biology Applications in Medicine


Chapter XX

Systems Biology Applied to Cancer Research
R. Seigneuric
GROW Research Institute, University of Maastricht, The Netherlands
N.A.W. van Riel
Eindhoven University of Technology, The Netherlands
M.H.W. Starmans
GROW Research Institute, University of Maastricht, The Netherlands
C.T.A. Evelo
University of Maastricht, The Netherlands
B.G. Wouters
GROW Research Institute, University of Maastricht, The Netherlands
P. Lambin
GROW Research Institute, University of Maastricht, The Netherlands
A. van Erk
University of Maastricht, The Netherlands

abstract
Complex diseases such as cancer have multiple origins and are therefore difficult to understand and
cure. Highly parallel technologies such as DNA microarrays are now available. These provide a data
deluge which needs to be mined for relevant information and integrated with existing knowledge at different scales. Systems Biology is a recent field which intends to overcome these challenges by combining
different disciplines and providing an analytical framework. Some of these challenges are discussed in
this chapter.


Introduction
Systems Biology is emerging as a promising answer to the increasing need for analytical approaches
in molecular medicine. Its goals include modeling interactions, understanding the behavior of a system
from the interplay of its components, inferring models from data, integrating data, confronting model
predictions with data, and proposing the most promising experiments. Solutions to these challenges
are often interdisciplinary, and Systems Biology intends to provide such a framework beyond the dialects of
scientific communities or differences in approaches (Lazebnik, 2004). Cancer is too complex a disease
to be solely and completely described by the existing clinical variables (e.g.: age of the patient, size
of the tumor, histological grade, etc.) which are currently used in practice. It is therefore necessary to
identify new biomarkers which will provide additional information about the cancer type, origin, or
aggressiveness, for instance. Within a decade, high-throughput assays have revolutionized biology and
are now being introduced in the clinic. Among these techniques, we focus on DNA microarrays, which
can monitor the expression of tens of thousands of genes in parallel and offer a means to individualize
treatment. This should help guide clinicians toward tailored therapies, leading to reduced overtreatment
and costs through improved prognosis, the design of targeted drugs, as well as more accurate application
of drugs. Since this is quite a recent field where each analysis requires a large number of steps, consensus
has not yet been reached. Furthermore, the researchers involved come from various backgrounds (e.g.:
statistics, engineering, biology). Applying tools from all these fields results in a wide spectrum of
approaches that may be confusing at first. Nevertheless, there are some trends in the biomedical research
community that we review in this chapter in the context of cancer. The outline of the chapter is meant
to follow a practical analysis pipeline, and URLs for accessing resources (i.e.: software and data) are
provided in Table 1.

What is a Microarray?
Well suited to monitoring many genes at once, a DNA microarray is an inert, solid, flat and transparent
surface (e.g.: a microscopic slide) onto which 20,000 to 60,000 short DNA reporters (often called probes)
of specified sequences are orderly tethered. Each reporter on the microarray corresponds to a particular
short section of a gene. Increasingly, a single gene (e.g.: VEGF) is covered by several reporters which
span different parts of the gene sequence. First available in the mid-1990s, microarrays are nowadays
being developed by companies with increased feature density (i.e.: the number of molecular detectors
per array), to scan the genome at regular intervals (tiling arrays), or to be re-usable, for instance.

A Microarray Experiment


After a careful experiment design (Kerr & Churchill, 2001; Y. H. Yang & Speed, 2002) to start with,
biological samples are collected from either an in vitro or an in vivo experiment. Then, the RNA is
extracted and labelled (e.g.: with a fluorescent dye). The central reaction occurs when the labelled RNA is
hybridized (bound) to the microarray reporters. Unbound RNA is subsequently washed out so that
the amount of bound and labelled RNA can be measured. The intensity of the signal of a reporter is
indicative of the relative expression of the corresponding gene.


Table 1. Systems Biology approach to link microarray data to individualized treatment in molecular
oncology: websites referring to possible alternatives for the analysis steps presented in this chapter.
Microarray General
Y.F.Leung: http://ihome.cuhk.edu.hk/%7Eb400559/
Microarray bibliography: http://www.nslij-genetics.org/microarray/
Pubmed Entrez: http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed
MGED: http://www.mged.org/
MIAME: http://www.mged.org/Workgroups/MIAME/miame.html

Data sets repositories
ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
GEO: http://www.ncbi.nlm.nih.gov/projects/geo/
SMD: http://genome-www5.stanford.edu/
Oncomine: http://www.oncomine.org/main/index.jsp
Rembrandt: http://tr.nci.nih.gov/rembrandt

Quality controls
Bioconductor: http://www.bioconductor.org/ (package affyQCReport)
dCHIP: http://biosun1.harvard.edu/complab/dchip/
GCOS: http://www.affymetrix.com/products/software/specific/gcos.affx

Normalization
Genomatix: http://www.genomatix.de/
RMA: http://www.bioconductor.org/ (package affycomp)
RMAExpress: http://rmaexpress.bmbolstad.com/
GCRMA: http://www.bioconductor.org/ (package gcrma)

Unsupervised & supervised Analyses
BRB Array Tools: http://linus.nci.nih.gov/BRB-ArrayTools.html
Cluster & Tree View: http://rana.lbl.gov/EisenSoftware.htm
SAM: http://www-stat.stanford.edu/~tibs/SAM/

Analysis Suite
BRB Array Tools: http://linus.nci.nih.gov/BRB-ArrayTools.html
GenePattern: http://www.broad.mit.edu/cancer/software/genepattern/
TM4: http://www.tm4.org/

Time series
STEM: http://www.cs.cmu.edu/~jernst/stem/
EDGE: http://www.biostat.washington.edu/software/jstorey/edge/
SAM: http://www-stat.stanford.edu/~tibs/SAM/
CAGED: http://genomethods.org/caged/

Annotation & Gene Ontology
NCBI: http://www.ncbi.nlm.nih.gov/
Ensembl: http://www.ensembl.org/index.html
NetAffx: https://www.affymetrix.com/site/login/login.affx
Onto Express: http://vortex.cs.wayne.edu/projects.htm
David Ease: http://david.abcc.ncifcrf.gov/ease/ease.jsp
Gene Ontology: http://www.geneontology.org/
Gene Ontology Annotation: http://www.ebi.ac.uk/GOA/
KEGG: http://www.genome.jp/kegg/

Maps & Pathways
BioPax: http://www.biopax.org/
GSEA: http://www.broad.mit.edu/gsea/
KEGG: http://www.genome.jp/kegg/
GenMAPP: www.genmapp.org/
PathVisio: http://www.pathvisio.org
WikiPathways: http://www.wikipathways.org
Ingenuity: http://www.ingenuity.com/
GeneGo: http://www.genego.com/

Transcription Factors
GeneXpress: http://genexpress.stanford.edu/
Genomatix: http://www.genomatix.de/
Jaspar: http://jaspar.genereg.net/

Network Inference, Visualization & Integration
CARRIE: http://zlab.bu.edu/CarrieServer/html/
Cytoscape: http://www.cytoscape.org/
Patika: http://www.patika.org/

Systems Biology & Mathematical Modeling
NIH Roadmap: http://nihroadmap.nih.gov/
SBML: http://sbml.org/index.psp
Physiome Project: https://www.bioeng.auckland.ac.nz/physiome/physiome_project.php
eCell: http://www.e-cell.org/software/e-cell-system
Virtual Cell: http://www.nrcam.uchc.edu/
BioModels Database: http://www.ebi.ac.uk/biomodels/

DNA microarrays measure a surrogate of RNA abundance with either 1 or 2 channels. Two-colour arrays
(e.g.: Stanford microarrays) measure by competitive hybridization the relative expression under a given
condition (labelled with a red fluorescent dye, Cy5) compared to its control (labelled with a green
fluorescent dye, Cy3). The other type of platform has only 1 channel (e.g.: Affymetrix GeneChip) and thus
measures absolute expression levels. Whatever the platform, the expression levels are usually converted
to a log scale. The distribution of the intensities is then no longer skewed toward the high intensities,
and up- and down-regulated genes are treated symmetrically. The variation is also less dependent on the
absolute magnitude (the relative difference between high and low expression levels is smaller on the log
scale). Expression levels are normalized to a control level, visualized in black. In this representation,
overexpression is shown in red and repression in green.
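The chapter's analyses are typically carried out with the R/Bioconductor tools listed in Table 1; purely for illustration, the following Python/NumPy sketch with made-up intensities shows why the log2 transformation treats up- and down-regulation symmetrically.

```python
import numpy as np

# Hypothetical Cy5 (treated) and Cy3 (control) intensities for five reporters.
cy5 = np.array([500.0, 2000.0, 1000.0, 250.0, 8000.0])
cy3 = np.array([500.0, 1000.0, 2000.0, 1000.0, 1000.0])

ratio = cy5 / cy3            # 2-fold up and 2-fold down give 2.0 and 0.5: asymmetric
log2_ratio = np.log2(ratio)  # ...but become +1 and -1 on the log2 scale: symmetric

for r, lr in zip(ratio, log2_ratio):
    print(f"ratio {r:7.3f}  log2 ratio {lr:+.2f}")
```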

Types of Studies Using Microarray Experiments


The most common applications of microarrays are for class comparison, class prediction and class
discovery, although they are often combined in research papers. In the context of class comparison,
classes are defined independently of the gene expression levels. Classes can be defined by a pathologist looking at the shape of cells under the microscope, or when the labels of the samples are known
(e.g.: control versus treated design). The goal is then to determine if expression differs between these
classes and find which genes are differentially expressed. In the case of class prediction, a classifier (or
predictor) is already available and the objective is to accurately predict the class of a new sample based
on the expression levels of the genes of the predictor. Class discovery aims at identifying expression
structures within a set of samples where classes are not predefined.
Even though microarrays are meant to monitor transcription, some authors have also quantified
changes in translation in cancer cell lines by polysome assays (Koritzinsky et al., 2005). The authors
showed dramatic differences between these 2 levels of regulation of protein expression. Indeed, under a
hypoxic environment, a major repression at the transcriptional level was shown not to necessarily imply
a major repression of protein translation. On the contrary, some genes related to angiogenesis and the
unfolded protein response, for example, were highly translated. The regulation of translation is rather
complex, and different mechanisms have been suggested, such as the role of functional internal ribosomal
entry sites or microRNAs (short RNA sequences about 20 nucleotides long).

Repositories of Microarray Studies


Due to the widespread use of microarrays, data repositories have flourished world-wide. Three of the
largest databases of gene expression are GEO, the Gene Expression Omnibus, of the National Center for
Biotechnology Information (NCBI), its European homolog ArrayExpress, hosted at the EMBL-EBI in
the UK, and the Stanford Microarray Database (SMD). Databases dedicated to cancer have also been
created, such as Oncomine and the REpository of Molecular BRAin Neoplasia DaTa (REMBRANDT).
They can be searched for data sets or even used to test hypotheses. In Rembrandt, for instance, it is possible to
compute survival curves online. Good practices such as MIAME (Minimum Information About a Microarray
Experiment) are being developed by the MGED (Microarray Gene Expression Data) workgroup. This
group of biologists, computer scientists and data analysts aims to set standards to ensure an unambiguous interpretation and the reproducibility of the results.

Initial question, hypothesis and the fishing experiment

The starting point usually stems from a peculiar (or even paradoxical) finding that needs to be further
investigated. The aim of the study can be to provide a better description of a given system (in a new
context and/or with a new combination of assays, for instance) or to test a hypothesis. The former
approach is known as the "fishing experiment". It consists of running an experiment and then analyzing
the data to uncover the inherent structure embedded in the data.

Data Generation
Design
This is one of the key steps that need to be carefully addressed (Churchill, 2002). One of the few points
of consensus is that a sound experiment requires positive and negative controls as well as biological
replicates. Any experiment is sensitive to 3 sources of variation: biological variation, technical variation
and measurement error, so that every measurement actually contains a combination of the 3. Replicates
can either be technical or biological (Churchill, 2002; Quackenbush, 2002). The former monitors a
biological sample from the same source (e.g.: two microarrays - ideally identical but only similar in
practice - are used to assess the expression level of genes from the same patient biopsy). Because the origin
of the samples is meant to be the same, the variability in the results is expected to be mostly explained
by technical differences (e.g.: microarrays from a different batch, or differences in the procedure or
handling). Biological replicates mean repeating a complete assay a certain number of times. In this case,
for the same patient, a biopsy would be taken and hybridized on a microarray. This procedure would
be repeated twice, for example. The variability is then a combination of technical and biological origins
(Churchill, 2002).
Different designs are possible depending on the goals of the experiment (control versus treated, time
series, etc.). Clinical studies are essentially static in the sense that the data is collected from patients
without the possibility to assign retrospectively a precise starting point before the first diagnosis (observational studies). On the contrary, in a controlled environment (e.g.: in vitro culture of cancer cell
lines), time series are well suited to assess the behavior of complex systems after an experimental
perturbation (Bar-Joseph, 2004).
Two essential issues in experiment design are bias and chance, the former being the hardest to address (Ransohoff, 2004). A detailed description of the procedure is required for bias to be detected. Unfortunately,
when a study suffers from bias, hardly anything can be done to compensate for it, so that it is often said
that "bias times 12 is still bias".
Before running the full experiment, one might consider a pilot study. Samples of interest can be
selected to provide the expected highest information gain (Tibshirani, 2006).

Protocol & Quality Controls of Samples


The design should result in a detailed written protocol carefully followed by the investigators. Regardless of whether the RNA comes from an in vitro or an in vivo study, it is of primary importance to obtain
samples of excellent quality in order to avoid a situation of "garbage in, garbage out". At this stage,
quality controls are essentially attained with the Nanodrop and the Bioanalyzer (Agilent Technologies,
Santa Clara, CA) assays. The former assesses material quantity, whereas the latter measures the integrity,
purity and quality of the sample.


Microarrays & Quality Controls


DNA microarrays can be divided into 2 groups depending on their number of channels. The cDNA
microarrays have 2 channels (red and green), but we will focus on the Affymetrix microarray (1 channel),
which is currently one of the most widely used commercial platforms. A set of quality controls has been
developed in the open-source software R. The corresponding packages can be downloaded from the
Bioconductor website and run from the R environment, or from Excel with BRB Array Tools for instance
(see Table 1 for details).

Data Preprocessing
Filtering for Artefacts, Outliers and Controls
After a careful examination of the microarray, this step intends to remove bad spots, subtract background, and deal with controls. Control reporters which are highly expressed or even saturated need
to be removed.

Normalization
Normalization is an attempt to correct for systematic errors in the data (e.g.: one channel of a two-channel
array is much higher than the other). Thus it allows comparing the data in an even way for
proper analysis. A common normalization assumption is that the expression level of most genes does
not change (e.g.: Affymetrix platform). Although normalization is quite valuable, it cannot compensate
for data of poor quality, such as the "garbage in, garbage out" situation already mentioned, or for an
insufficient number of samples.
How to combine the information detected by different reporters for the same gene is also part of
the normalization step (Quackenbush, 2002), and an area of intense research. Different normalization
algorithms have been developed over the years. For the Affymetrix platform, the most common ones
are GCOS, RMA, GCRMA, and dChip. These algorithms combine the reporter intensities for a given
gene to define a measure of expression that represents the amount of the corresponding mRNA species.
They have a profound effect on the detection of differentially expressed genes. This was exemplified in
the context of lung cancer where the number of common genes found with 3 different algorithms was
quite small (P. Yang et al., 2004). Since no method is clearly established as more reliable than others, the
authors suggested combining the genes selected by more than one method. In contrast to these findings,
these authors found that once the data was pre-processed, the different analytical tools used to extract
genes differentially expressed (see below) gave very similar results (concordance ~ 90%).
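As an illustration of what such algorithms do, the sketch below implements quantile normalization, one ingredient of RMA-style preprocessing (the full algorithms also include background correction and probe summarization). The intensity matrix is made up, and Python/NumPy is used here only for illustration; in practice these methods are run through the packages listed in Table 1.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile normalization of a reporters x arrays matrix of log-scale
    intensities (made up here). Each array's values are replaced by the
    mean of the values with the same rank across all arrays, so that every
    array ends up with an identical intensity distribution.
    """
    ranks = np.argsort(np.argsort(expr, axis=0), axis=0)   # rank of each value per array
    mean_by_rank = np.sort(expr, axis=0).mean(axis=1)      # reference distribution
    return mean_by_rank[ranks]

expr = np.array([[5.1, 6.0, 4.8],
                 [7.3, 8.1, 7.0],
                 [6.2, 6.9, 5.9],
                 [9.0, 9.8, 8.7]])
print(quantile_normalize(expr))
```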

Filters for Noise and/or Intensity


In order to decrease the risk of spurious findings, it is necessary to filter for noise and absence of variation.
Different procedures have been used but no consensus exists yet. Regardless of the experimental
design and microarray platform, the idea is to discard reporters with low intensities (probably noisy
measurements) or whose expression levels remain roughly constant in the conditions studied (e.g.: both
control and treated samples). Some platforms provide a measure of the reliability of detection (e.g.: the
"present" or "absent" call for Affymetrix microarrays).
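A minimal sketch of such a filter is shown below; the intensity and variation thresholds are arbitrary examples and would need to be tuned to the platform at hand.

```python
import numpy as np

def filter_reporters(expr, min_intensity=6.0, min_iqr=0.5):
    """Simple noise/variation filter on a log2 expression matrix
    (reporters x samples); thresholds are arbitrary examples.

    Keeps reporters whose median intensity is above a noise floor and whose
    inter-quartile range across samples shows some variation.
    """
    median_intensity = np.median(expr, axis=1)
    iqr = np.percentile(expr, 75, axis=1) - np.percentile(expr, 25, axis=1)
    keep = (median_intensity >= min_intensity) & (iqr >= min_iqr)
    return expr[keep], keep

rng = np.random.default_rng(0)
expr = rng.uniform(4.0, 10.0, size=(1000, 1)) + rng.normal(scale=1.0, size=(1000, 12))
filtered, mask = filter_reporters(expr)
print(f"kept {mask.sum()} of {mask.size} reporters")
```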

Curse of Dimensionality & Data Reduction Techniques


A massive amount of data arises from DNA microarrays. Typically, the number of reporters on a microarray
varies between 20,000 and approximately 60,000, while the number of samples is relatively small.
Currently, typical (static) in vitro cancer studies contain roughly 50 samples (usually less than 10 for
time series). The largest clinical studies span approximately 300 samples, but the usual size is around
100 patients. It is thus a challenge (the so-called curse of dimensionality) to select a set of genes of
interest when the number of potential candidates far exceeds the number of samples.
Data reduction techniques can be helpful to downsize the data set to a more manageable size. In
the so-called expression space, each experiment is represented by an axis (dimension). For each gene,
its expression level (coordinate) is reported in log-scale units along the corresponding axis. In the case
where just 3 experiments were run, the axes experiment1-experiment2-experiment3 would be similar
to the x-y-z basis used in three-dimensional geometry. A distance can also be defined to measure
the closeness (similarity in expression) of 2 or more points (genes). As in geometry, the Euclidean
metric is a common similarity distance, but Pearson correlation, Manhattan and other similarity measures
can also be selected depending on the purpose of the analysis. Although it is not possible to visualize
it, similarity measures between genes can be computed in the expression space extended to any number
of dimensions.
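The following short sketch (made-up expression values) contrasts two of these similarity measures and shows why the choice matters: two genes with the same pattern but different magnitudes are far apart in Euclidean terms yet perfectly similar by correlation.

```python
import numpy as np

def euclidean_distance(x, y):
    """Distance in expression space: small when two genes behave alike in
    absolute terms across experiments."""
    return np.sqrt(np.sum((x - y) ** 2))

def pearson_distance(x, y):
    """1 - Pearson correlation: small when two genes co-vary, regardless of
    their absolute expression levels."""
    return 1.0 - np.corrcoef(x, y)[0, 1]

# Made-up log2 expression of two genes across three experiments (the axes
# experiment1-experiment2-experiment3 of the text).
gene_a = np.array([1.0, 2.0, 3.0])
gene_b = np.array([2.0, 4.0, 6.0])   # same pattern, different magnitude

print("Euclidean:", round(euclidean_distance(gene_a, gene_b), 2))
print("1 - Pearson:", round(pearson_distance(gene_a, gene_b), 2))
```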
Principal Component Analysis (PCA) is based on the singular value decomposition and remains one
of the most common techniques to perform data reduction. The idea is to rotate the cloud of data points
in the expression space so as to identify the direction of greatest variation (1st principal component).
Iteratively, each next component captures most of the remaining variance. Due to data representation
issues, only the first 2 or 3 principal components are usually computed. These first "superaxes" represent the
major part of the total variance in the dataset and may allow the detection of different clusters of data
points.
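A minimal PCA-by-SVD sketch on a made-up expression matrix is given below; it returns the sample coordinates along the first "superaxes" and the fraction of variance they capture.

```python
import numpy as np

def principal_components(expr, n_components=2):
    """PCA via singular value decomposition, as described in the text.

    expr: samples x genes matrix of (made-up) log2 expression values.
    Returns the coordinates of each sample along the first principal
    components and the fraction of variance they explain.
    """
    centered = expr - expr.mean(axis=0)              # rotate around the data centroid
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # sample coordinates
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

rng = np.random.default_rng(1)
expr = rng.normal(size=(20, 500))                    # 20 samples, 500 genes
scores, explained = principal_components(expr)
print("variance explained by PC1, PC2:", np.round(explained, 3))
```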

Data Analysis
Gene Expression Profiling
This step provides the list of genes of interest (also called gene expression-based signature or gene
signature) which are supposed to be differentially expressed between 2 or more conditions (the control
and the treated experiments for instance). The two main ways to tackle this are by an unsupervised or
by a supervised approach. Both methods rely on defining a similarity measure which is used to group
the genes with similar expression. In general, the choice of the similarity measure is more important
than the algorithm to compute it. The differentially expressed genes or gene signatures can be used to
identify specific subtypes within individual human tumors and predict their influence on patient treatment. Such gene signatures, metagenes (linear combinations of gene expression values extracted to
predict phenotype) or modules (sets of genes that act in concert to carry out a specific function) are
identified to represent the molecular behaviour of the response, phenotype or function under study.
Predictive gene signatures have also been identified from tumor samples (van de Vijver et al., 2002).


Recently, gene signatures have also been extracted from in vitro experiments with cell lines cultured
under defined conditions (Chang et al., 2004; Chi et al., 2006; Seigneuric et al., 2007), with subsequent
application to publicly available clinical data sets. Starting from in vitro studies allows one not only to
record gene expression in very controlled and reproducible situations but also to reduce the chances of
overfitting. Overfitting occurs when an identified pattern is actually based on random features existing
within the data (a consequence of the curse of dimensionality and of a poor procedure to detect structures
embedded in the data). Therefore, the pattern cannot be generalized to new, similar data sets.

Unsupervised Methods
Unsupervised clustering relies on an algorithm without any a priori knowledge about the number or the
type of classes it should find. In this respect, it is unbiased as it seeks to determine inherent structures
(clusters) embedded into the data set. Cluster analysis is very commonly used to assess shared functions
and common regulation, often referred to as "guilt by association". It is often done by average-linkage with
a Euclidean distance to focus on the highest expression changes. Data can be clustered and visualized
with the popular software Cluster & TreeView (Eisen et al., 1998) for instance. Then, the user selects a
group or cluster of genes for further analysis.
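For illustration, the same average-linkage/Euclidean clustering can be sketched in a few lines of Python with SciPy (the expression matrix below is made up); Cluster & TreeView adds the interactive heat-map visualization on top of this kind of computation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up log2 expression matrix: 6 genes x 4 samples.
expr = np.array([[ 1.0,  1.2,  0.9,  1.1],   # genes 1-3: similar profile
                 [ 0.9,  1.1,  1.0,  1.2],
                 [ 1.1,  1.0,  0.8,  1.0],
                 [-1.0, -1.2, -0.9, -1.1],   # genes 4-6: opposite profile
                 [-0.8, -1.1, -1.0, -1.2],
                 [-1.1, -0.9, -1.0, -1.0]])

# Average-linkage clustering with a Euclidean distance, as in the text.
tree = linkage(expr, method="average", metric="euclidean")
clusters = fcluster(tree, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print("cluster assignment per gene:", clusters)
```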
Unsupervised approaches have been used to identify molecular portraits of tumors and improve
the classification of breast cancer (Perou et al., 2000). To improve breast cancer diagnosis for instance,
one would like to refine the pathologist's classes (e.g.: histological grade) assessed with a microscope
by discovering new subclasses on the molecular level based on gene expression. A collection of samples
belonging to the same grade is interrogated by an unsupervised algorithm to detect groups of samples that
exhibit different expression patterns between the groups while maintaining similarity within the groups.
This approach has been applied to provide a means to reduce overtreatment (Wang et al., 2005).

Supervised Methods
This approach is based on identifying genes that fit an a priori determined feature (e.g.: pattern of
expression). It requires one to first define classes based on criteria other than expression (e.g.: time
point for a time series) and to assign each sample to its class. The common pitfalls in the building of a
predictor for cancer outcome have also been recently addressed in detail (Dupuy & Simon, 2007).
A wide variety of algorithms (coming from the statistical and the machine learning communities) is
available: the t-test, ANalysis Of Variance (ANOVA), Linear Discriminant Analysis (LDA), Support
Vector Machine (SVM), etc. It is in fact necessary to restrict oneself to the simplest possible tools (e.g.: by
starting an analysis with LDA rather than with SVM) to reduce the risk of overfitting. This is known as
Occam's razor, after the Franciscan logician from the 14th century who stated that "entities should not
be multiplied beyond necessity". This law of parsimony requires keeping the simplest model which can
explain the data at hand. Since the number of samples in microarray studies is always a limiting
factor, some strategies have been devised to get the most out of a given data set. One such approach is
Leave-One-Out Cross-Validation (LOOCV). It consists of identifying genes differentially expressed
in all samples but one, and then using the predictor to classify the remaining sample. With n samples this
procedure is repeated n times. To decrease overfitting and provide more robust predictors, it is necessary to apply the LOOCV to the full procedure.


List of Differentially Expressed Genes


Independent Validation
The list of differentially expressed genes needs to be validated by an independent assay, usually by Reverse Transcription-Polymerase Chain Reaction (RT-PCR). Although the magnitude of the change may
be different, it is important that its direction (up- or down-regulation) holds. Because the gene signatures
are often large (presently ranging from 11 genes up to more than 500 in oncology), one option may be
to randomly select a representative fraction of the (up- and down-regulated) genes to be tested.

Univariate Analysis: Kaplan-Meier Survival Curves


Kaplan-Meier curves are used to show the clinical relevance of the gene expression based signature
on patient data. Data sets containing gene expression from tumor specimens together with follow-up
and clinical variables (e.g., age, tumor size, grade) can be downloaded from the databases mentioned
above and listed in Table 1. They span many cancer types, including leukemias and carcinomas. Clustering is the main method used to stratify patients into groups according to the expression of the genes in the signature, and Kaplan-Meier survival analysis is the most widely used method to evaluate a gene signature. Log-rank tests are applied to assess whether the stratification results in a statistically significant difference in survival between the groups. Meta-analyses allow merging of small (homogeneous) data sets and therefore increase statistical power.
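A minimal sketch of such an evaluation, assuming the Python lifelines package and using invented follow-up data for two expression-based groups:

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Hypothetical follow-up times (months) and event indicators (1 = event observed)
time_low,  event_low  = np.array([60, 72, 48, 90, 55, 80, 66, 75]), np.array([0, 0, 1, 0, 0, 0, 1, 0])
time_high, event_high = np.array([12, 24, 18, 30, 40, 22, 15, 28]), np.array([1, 1, 1, 0, 1, 1, 1, 1])

kmf = KaplanMeierFitter()
kmf.fit(time_low, event_low, label="low-risk signature")
print(kmf.survival_function_.tail(3))          # estimated Kaplan-Meier curve, low-risk group

# Log-rank test for a difference in survival between the two signature groups
res = logrank_test(time_low, time_high, event_observed_A=event_low, event_observed_B=event_high)
print(f"log-rank p-value: {res.p_value:.4f}")
```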

Multivariate Analysis
In the context of molecular oncology, the objective of the multivariate analysis is to quantify the prognostic power of a list of biomarkers (e.g., a gene signature) with respect to the existing clinical variables such as: age of the patient, tumor size, grade, mutation status of the gene coding for the essential protein p53 (called the "guardian of the genome"), etc. Additional methods are necessary to characterize and validate gene signatures (e.g., area under the curve, sensitivity, specificity or the concordance index).
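One common way to carry out such a multivariate analysis is a Cox proportional hazards regression. The hedged sketch below (lifelines assumed, patient data invented) adds a hypothetical gene-signature score alongside age and tumor size and reports the concordance index mentioned above:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical patient table: follow-up time, event indicator and covariates
df = pd.DataFrame({
    "time":            [12, 60, 45, 8, 72, 30, 90, 15, 55, 40],
    "event":           [1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    "age":             [55, 62, 47, 70, 58, 65, 50, 68, 61, 59],
    "tumor_size_cm":   [2.5, 1.2, 3.0, 4.1, 1.0, 2.8, 0.9, 3.5, 1.8, 2.2],
    "signature_score": [0.8, 0.4, 0.2, 0.9, 0.3, 0.7, 0.1, 0.5, 0.6, 0.2],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()                                     # hazard ratio per covariate
print("concordance index:", round(cph.concordance_index_, 2))
```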

Gene Ontology
In order to get some insight into the biological theme of the list of genes, one often runs a gene ontology
query. Ontology is a data model that represents a set of concepts within a domain and the relationships
between those concepts (http://en.wikipedia.org/wiki/Ontology_(computer_science)). The current gene
ontology contains approximately 20,000 terms to assign the biological process, the cellular component
and the molecular function of genes. A wide variety of programs exists to find which of these are overrepresented in the list at hand (links are provided in Table 1). Due to the a priori constraints imposed in a supervised analysis, it is harder to define its biological theme relative to an unsupervised approach.
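Most of these over-representation programs rely on some variant of the hypergeometric (Fisher) test; a minimal sketch with invented counts and SciPy assumed:

```python
from scipy.stats import hypergeom

# Hypothetical counts for one GO term, e.g. "apoptosis"
N = 20000   # annotated genes on the array (population size)
K = 400     # of these, genes annotated with the term of interest
n = 250     # genes in the differentially expressed list (draws)
k = 15      # list genes annotated with the term (observed hits)

# P(X >= k): probability of seeing at least k term genes in the list by chance
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"over-representation p-value: {p_value:.3g}")
```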

Pathways Involved
An additional step in the analysis is to find the pathways in which the genes of interest are actually
involved. This relies on our current understanding and representation of these pathways. The KEGG
(Kyoto Encyclopedia of Genes and Genomes) initiative was launched in 1995 in order to link genomes
to biological systems, and its implementation has recently been updated by the use of a Markup Language and an API (Application Program Interface, which allows customization). The pathway maps (for metabolism, genetic information processing, environmental information processing and disease) are drawn manually, and a z-score quantifies those maps in which the genes of interest are overrepresented. The Gene Microarray
Pathway Profiler (GenMAPP) software is designed for viewing and analyzing gene expression data in
the context of biological pathways. Microarray data can be superimposed on existing maps to visualize
up and down regulations. Pathways in GenMAPP can be edited and visualized with the PAthway VISualizatIOn (PathVisio) program which is also meant to display additional types of data (e.g.: proteomics
data). A collective effort now intends to develop pathways important to biological research communities
on the web (http://www.wikipathways.org/index.php/WikiPathways). Recently, the Biological Pathway
eXchange (BioPax) initiative was launched to create a data exchange format for biological pathways in
order to integrate knowledge from multiple sources in a coherent and reliable way. BioPax is based on
Web Ontology Language (OWL). This language is meant for ontology construction and deployment, allowing pathways to evolve from human-readable graphs to a machine-readable format for pathway analysis software.
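For completeness, the z-score mentioned above for quantifying over-represented pathway maps is, in one common form, a standardized hypergeometric statistic. The short sketch below uses invented counts and is not the exact GenMAPP implementation:

```python
from math import sqrt

# Hypothetical counts for one pathway map
N = 12000   # genes measured and linked to any map
R = 600     # of these, genes meeting the "changed" criterion
n = 80      # genes on the map of interest
r = 12      # changed genes on the map

expected = n * R / N                                     # hypergeometric mean
variance = n * (R / N) * (1 - R / N) * (N - n) / (N - 1) # hypergeometric variance
z = (r - expected) / sqrt(variance)
print(f"z-score for this map: {z:.2f}")                  # positive = over-represented
```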

Network Inference
The regulation of gene expression is achieved through genetic regulatory systems structured by networks of interactions between DNA, RNA, proteins and small molecules. Due to the massive number of such components and their interlocking positive and negative feedback and feedforward loops, formal methods and computer tools are needed to unravel the relevant interactions and their topology. Depending on
the goal (fine vs. coarse-grain model) and the collected data, different approaches have been proposed
for inferring, modeling and simulating genetic, metabolic and signaling pathways: interaction graphs
(Radulescu et al., 2006), Bayesian networks, Boolean networks (and their generalizations), ordinary and
partial differential equations, qualitative (piecewise linear) differential equations, stochastic equations,
and rule-based formalism.
Bayesian Networks (BNs) and their temporal extension, Dynamic Bayesian Networks (DBNs), have received wide interest. Indeed, unlike most other approaches, BNs and DBNs can handle missing data thanks to hidden variables. This allows circumventing the fact that, most of the time, a study focuses on one level (e.g.: gene or protein expression). If one is monitoring gene expression data only (e.g.: by microarrays), no information is available on the protein level. In our perspective, DBNs are of special interest because they can overcome a limitation of BNs, which are restricted to inferring networks without feedback loops. DBNs do so by unrolling their Directed Acyclic Graph (DAG) as a function of time, allowing the reconstruction of feedback and feedforward loops essential in biology. Networks reconstructed within this framework allow the identification of highly connected nodes (hubs) and the application of further analyses using methods from graph theory. Performing experiments on such hubs (e.g.: RNAi, overexpression) will provide optimized set-ups for the molecular biologist, who will maximize the effect of the experimental perturbation. Comparison between experiments and the reconstructed network will fine-tune the model while allowing one to select the most interesting experiments.
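As a toy illustration of the hub idea only (the network below is invented, not an actual inference result), a reconstructed network can be handed to a graph library such as NetworkX to rank nodes by connectivity:

```python
import networkx as nx

# A small invented regulatory network: edges = inferred interactions
edges = [("TP53", "MDM2"), ("TP53", "CDKN1A"), ("TP53", "BAX"), ("TP53", "GADD45A"),
         ("MYC", "CDK4"), ("MYC", "CCND1"), ("EGFR", "MYC"), ("EGFR", "AKT1"),
         ("AKT1", "MDM2")]
G = nx.Graph(edges)

# Rank genes by degree: the most connected nodes are candidate hubs for
# follow-up perturbation experiments (e.g. RNAi or overexpression).
hubs = sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:3]
print("candidate hubs:", hubs)   # e.g. [('TP53', 4), ('MYC', 3), ...]
```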

Mathematical Modeling
"Model" is a common word which is often misleading, depending on the background of the scientist using it. A model is a set of hypotheses or concepts that allows a simplified description and
understanding of the system of interest.

For a molecular biologist for instance, a model will be a (mouse) model where cancerous cells are
injected to create tumor xenografts. The behaviour of cancer cells can thus be studied in a more controlled manner. But in the same field, a model can also be a molecular device which is used to change
the expression status of a gene of interest. In gene therapy, the property of certain types of bacteria (like Salmonella) to preferentially locate in hypoxic environments is exploited to target tumors. In
this context, a model may be a flip-in construct to switch ON or OFF a specific gene. In molecular biology, a model can also be a breast cancer cell line (e.g.: MCF7) which represents an (in vitro) model for
monitoring gene expression. In the field of data mining, this could be a (machine learning) model that
is a set of rules identified from a specific data set. With an engineering sciences background, a model
could be a (mathematical) model composed of equations containing parameters and variables. This is
the meaning of model within this paragraph.
There is an increasing need for developing mechanistic systems models (van Riel, 2006) in the biological sciences, as outlined by the National Institutes of Health Roadmap (http://nihroadmap.nih.gov/). The Hyper Text Markup Language (HTML) provides tags to give meaning and (hierarchical) structure to a web document. A more flexible relative, the eXtensible Markup Language (XML), can be customized to create and define such tags. This feature is widely used in the Systems Biology Markup Language (SBML), which aims at representing pathways, biochemical reactions and gene regulations in a computer-readable format. The SBML community also provides simulation tools for quantitative modelling as well as a model repository (e.g.: curated cell cycle or MAPKKK cascade models).
Numerical attempts to mathematically model and simulate the human heart (Cardiome) and the whole
human body (Physiome) are based on this language. With ordinary differential equations, quantitative
detailed descriptions spanning large temporal and spatial scales are provided to better understand physiological conditions. Both projects are collaborative efforts of several laboratories world-wide which
exemplify some applications of Systems Biology.
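A minimal sketch of loading such a model programmatically, assuming the python-libsbml bindings are installed and that "cell_cycle_model.xml" is a hypothetical SBML file (for example one downloaded from a curated model repository):

```python
import libsbml

doc = libsbml.readSBML("cell_cycle_model.xml")   # parse the SBML file
if doc.getNumErrors() > 0:
    doc.printErrors()                            # report any parsing/validation problems

model = doc.getModel()
print(model.getNumSpecies(), "species,", model.getNumReactions(), "reactions")

# List the species (molecular players) and their initial concentrations
for i in range(model.getNumSpecies()):
    s = model.getSpecies(i)
    print(s.getId(), s.getInitialConcentration())
```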

Perspectives

Any microarray study is composed of a succession of several steps which were discussed in this chapter.
Each one of these steps (with many options) represents an active field of research in its own right. Due to such a situation (a moving target), it is the rule rather than the exception that investigators get somewhat different results. The requirement to deposit the design as well as the raw and analyzed data on the publisher's website may greatly help clarify the results and thus our understanding of the molecular biology of cancers.
Our body is composed of 100,000,000,000,000 cells of about 250 different types, yet it works in harmony as a whole. It is made of organs, tissues, cells, genes, proteins, etc., which span many temporal and physical scales, from less than nanoseconds to tens of years and from atoms to a whole body. These different scales require various assays, so that going from the wet lab to the bedside is a challenging endeavor which is by its very essence multidisciplinary.

Conclusion
Systems Biology is a promising framework to improve our current understanding of the biology of
tumors for instance. By unraveling the function of key genes, it may lead to individualized treatment instead of the "one size fits all" approach. In the future, this may help develop better drugs or identify beforehand the patients who are most likely to develop drug resistance or who will benefit most from a treatment.

References
Bar-Joseph, Z. (2004). Analyzing time series gene expression data. Bioinformatics, 20(16), 2493-2503.
Chang, H. Y., Sneddon, J. B., Alizadeh, A. A., Sood, R., West, R. B., Montgomery, K., et al. (2004).
Gene expression signature of fibroblast serum response predicts human cancer progression: Similarities
between tumors and wounds. PLoS Biol, 2(2), E7.
Chi, J. T., Wang, Z., Nuyten, D. S., Rodriguez, E. H., Schaner, M. E., Salim, A., et al. (2006). Gene
expression programs in response to hypoxia: Cell type specificity and prognostic significance in human
cancers. PLoS Med, 3(3), e47.
Churchill, G. A. (2002). Fundamentals of experimental design for cDNA microarrays. Nat Genet, 32
Suppl, 490-495.
Dupuy, A., & Simon, R. M. (2007). Critical review of published microarray studies for cancer outcome
and guidelines on statistical analysis and reporting. J Natl Cancer Inst, 99(2), 147-157.
Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 95(25), 14863-14868.
Kerr, M. K., & Churchill, G. A. (2001). Statistical design and the analysis of gene expression microarray
data. Genet Res, 77(2), 123-128.
Koritzinsky, M., Seigneuric, R., Magagnin, M. G., van den Beucken, T., Lambin, P., & Wouters, B. G.
(2005). The hypoxic proteome is influenced by gene-specific changes in mRNA translation. Radiother
Oncol, 76(2), 177-186.
Lazebnik, Y. (2004). Can a biologist fix a radio? Or, what I learned while studying apoptosis (Cancer Cell, 2002 Sept., 2(3), 179-82). Biochemistry (Mosc), 69(12), 1403-1406.
Perou, C. M., Sorlie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Rees, C. A., et al. (2000). Molecular
portraits of human breast tumours. Nature, 406(6797), 747-752.
Quackenbush, J. (2002). Microarray data normalization and transformation. Nat Genet, 32 Suppl, 496-501.
Radulescu, O., Lagarrigue, S., Siegel, A., Veber, P., & Le Borgne, M. (2006). Topology and static response of interaction networks in molecular biology. J R Soc Interface, 3(6), 185-196.
Ransohoff, D. F. (2004). Rules of evidence for cancer molecular-marker discovery and validation. Nat
Rev Cancer, 4(4), 309-314.
Seigneuric, R., Starmans, M. H., Fung, G., Krishnapuram, B., Nuyten, D. S., van Erk, A., et al. (2007).
Impact of supervised gene signatures of early hypoxia on patient survival. Radiother Oncol.
Tibshirani, R. (2006). A simple method for assessing sample sizes in microarray experiments. BMC
Bioinformatics, 7, 106.
van de Vijver, M. J., He, Y. D., van't Veer, L. J., Dai, H., Hart, A. A., Voskuil, D. W., et al. (2002). A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med, 347(25), 1999-2009.
van Riel, N. A. (2006). Dynamic modelling and analysis of biochemical networks: Mechanism-based
models and model-based experiments. Brief Bioinform, 7(4), 364-374.
Wang, Y., Klijn, J. G., Zhang, Y., Sieuwerts, A. M., Look, M. P., Yang, F., et al. (2005). Gene-expression
profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365(9460),
671-679.
Yang, P., Sun, Z., Aubry, M. C., Kosari, F., Bamlet, W., Endo, C., et al. (2004). Study design considerations
in clinical outcome research of lung cancer using microarray analysis. Lung Cancer, 46(2), 215-226.
Yang, Y. H., & Speed, T. (2002). Design issues for cdna microarray experiments. Nat Rev Genet, 3(8),
579-588.

Key Terms
Biomarker: By definition any bio(logical) marker like a gene or a protein. In molecular oncology,
due to the complexity of cancer diseases, highly parallel techniques are needed to identify a set of biomarkers rather than a unique biomarker. Sets of biomarkers are believed to be stronger predictors since
they would reflect more reliably the multidimensionality of cancer.
Cancer: A genetic disease emerging when cells have acquired at least six important factors contributing to pathogenesis. These hallmarks include evading apoptosis, self-sufficiency in growth signals,
insensitivity to anti-growth signals, limitless replicative potential, sustained angiogenesis, and tissue
invasion and metastasis.
Complex System: Such systems (e.g.: a cell, an organ, a whole human body) are complex because of
the large number of players involved and/or because of their time- and context-dependent interactions. The nature of these interactions or regulatory motifs (e.g.: positive or negative feedback loop, feed-forward loop) increases the complexity of even a simple system with only a handful of variables.
Microarray: DNA microarray is a technique to monitor the abundance of tens of thousands of RNA
transcripts at once (by extension, gene expression). Molecular reporters corresponding to complementary
sequences of genes of interest are orderly deposited on a glass surface. This is today the most mature
of the highly parallel techniques.
Model: A commonly used but very misleading term which heavily depends on the background of
the investigator. In the broad sense, a model is used in any attempt to describe and explain a system of
interest which cannot be directly observed. A set of hypotheses is required to represent a simplification of it (i.e.: a model).


Network Inference: The attempt to discover the relationships between the components (or nodes
such as genes, proteins, metabolites) of the network. It is a form of the inverse problem where one
starts from the observations (gene expression levels from DNA microarrays for instance) and intends
to identify the causes that led to such observations. Due to the very large number of potential players,
this is a non-trivial problem which requires a massive amount of data.
Normalization: The correction for known systematic biases in the data to allow a fair comparison.
Omics: Due to the recent advent of highly parallel assays, it is now possible to monitor the behavior
of not just one or a couple of variables but rather tens of thousands of variables at once. A growing
number of disciplines with the -omics suffix like genomics, transcriptomics, metabolomics and so on,
intend to describe and understand completely a given level.
Systems Biology: A field which studies complex biological systems at different levels to decipher
the interactions of its key components and provide a mathematical model integrating this heterogeneous
information.


Chapter XXI
Systems Biology Strategies in Studies of Energy Homeostasis In Vivo

Matej Orešič
VTT Technical Research Centre of Finland, Finland

Antonio Vidal-Puig
Institute of Metabolic Science, Addenbrooke's Hospital, UK

Abstract
In this chapter the authors report on their experience with the analysis and modeling of data obtained
from studies of animal models related to obesity and metabolic syndrome. The complex interactions of
genetic and environmental factors contributing to the failure of energy balance that lead to obesity, as
well as tight systemic regulation to maintain energy homeostasis, require application of the systems biology strategy at the physiological level. In vivo systems offer the possibility of investigating not only the
effects of specific genetic modifications or treatments in selected tissues and organs, but also to elucidate
compensatory allostatic mechanisms induced to maintain the homeostasis of the whole system. A key
challenge for systems biology is to characterize different systems responses in the context of activated
pathways. One possible strategy is based on reconstruction of tissue specific pathways using lipidomics,
or metabolomics in general, in combination with proteomic and transcriptomic profiles. This approach
was applied to an obese mouse model and revealed activation of multiple liver pathways that may lead to metabolic products which impair insulin sensitivity.

Introduction
The system controlling the energy balance is tightly regulated. The failure of mechanisms controlling the
energy balance may lead to obesity. The causes of such failure may be genetic defects in the mechanisms
controlling food intake, energy expenditure, partition of nutrients towards specific organs, expandability of the adipose tissue, or genetically inherited traits leading to inactivity. The environmental factors
interacting with genetically determined traits are clearly involved, as obesity is a rather recent problem of the last few decades. Biological redundancy adds another layer of complexity. Since energy homeostasis is so central to survival, the system has evolved towards a tightly regulated, redundant system characterised by: (1) similar responses induced by different pathways and (2) compensatory mechanisms aiming to restore the steady state of energy homeostasis.
Systems biology investigations aiming to address the complexity of obesity should therefore not
only consider identification of mechanisms that may lead to obesity, but also aim to identify the compensatory biological strategies that are found in vivo. Some of these compensatory mechanisms may
be targeted to increase the success rate of current strategies to lose weight. As obesity may lead to a
number of complications such as diabetes and cardiovascular disease, a systems biology approach may
be applied to identify early pathways that may lead to obesity-related complications, before they result
in clinically identifiable specific diseases.

Lipidomics
Lipids play an important role as structural components (e.g., cell membranes), energy storage components (triglycerides in adipose tissue), and as signalling molecules (Vance & Vance, 2004). For example,
changes in lipid function due to peroxidation, imbalanced fatty acid composition or their increased flux
to peripheral tissues may contribute to the development of disorders such as atherosclerosis, diabetes, metabolic syndrome or Alzheimer's disease (Watson, 2006; Wenk, 2005). Traditional clinical lipid measures quantify total amounts of triglycerides, cholesterol, or lipoproteins. At the molecular level, however, the serum lipid profile is much more complex. Modern lipidomics and metabolomics platforms enable quantitative characterization of hundreds of diverse lipid molecular species across multiple lipid classes such as sphingolipids, phospholipids, sterol esters, and acylglycerols. In most cases, the exact fatty
acid composition for each detected lipid can be determined.
Lipid metabolism is regulated both by genetic and environmental factors. For example, using a
unique monozygotic twin study design in which young adult obese monozygotic twins were compared with their non-obese co-twins, we have recently shown that obesity, already in its early stages and independent of genetic influences, is associated with deleterious alterations in the lipid metabolism known to facilitate atherogenesis, inflammation and insulin resistance (Pietiläinen et al., 2007). The
study also demonstrated the sensitivity of the metabolomics platforms since subtle pathophysiological
changes were detected well prior to changes in commonly utilized clinical measures. Of special interest
and clinical relevance was the finding that the atherogenic lipid profile of the obese co-twins was associated with whole body insulin resistance, something that could not be detected using classical lipid
measures and inflammatory markers only.

In Vivo Studies
Lipidomics is increasingly utilized in functional characterization of genetic or environmental interventions in vivo. In vivo systems offer the possibility of investigating not only the effects of specific genetic

modifications or treatments in selected tissues and organs, but also to elucidate compensatory allostatic
mechanisms (McEwen & Wingfield, 2003) induced to maintain the homeostasis of the whole system.
For example, in the context of energy balance, we have observed marked differences in adipogenesis
and fat deposition between in vivo and in vitro models. In fact, whereas lack of the proadipogenic transcription factor PPARγ2 in vitro results in markedly impaired adipogenesis and fat deposition, its genetic
ablation in mice results in normal development of adipose tissue, thus suggesting robust compensatory
mechanisms operating in vivo to facilitate fat deposition in adipose tissue. Remarkably, despite its normal
adipose tissue appearance, lipidomic analysis revealed qualitative and quantitative differences in the
repertoire of lipids stored. We found that lack of PPARγ2 resulted in accumulation of more immature
lipids characterized by decreased long chain triacylglycerols and accumulation of lipid precursors such
as specific phospholipid species (Medina-Gomez et al., 2005).
Lipotoxicity is rapidly emerging as an important concept in the pathogenesis of multiple diseases as
well as in aging (Slawik & Vidal-Puig, 2006; Summers, 2006; Summers & Nelson, 2005; Unger, 2002).
Lipotoxicity is attributed to the products, i.e. toxic lipids such as ceramides, of excessive non-β-oxidative metabolism of fatty acids in peripheral tissues such as skeletal muscle, pancreas, and myocardium (Unger, 2002). High levels of lipotoxic lipids in peripheral tissues are believed to be associated with diabetes, insulin resistance and cardiovascular disease, as they disrupt cell function and promote programmed cell death (lipoapoptosis).
In order to investigate the effect of lipid overflow in adipose tissue, and its impact on peripheral tissues, the PPARγ2 KO was also studied in the obese mouse model background (ob/ob). Indeed, this
approach allowed the identification of potentially pathogenic ceramide lipid species in adipose tissue.
We are currently using this lipidomic approach to identify and characterise lipotoxic dysregulated
metabolic networks in relevant metabolic organs for diabetes (Medina-Gomez et al., 2007). The use
of this information may identify not only organ specific lipotoxic pathways suitable for pharmacological intervention in the context of the metabolic syndrome but also specific metabolic signatures with
prognostic value.

Lipid Pathway Reconstruction from In Vivo Studies of Lipotoxicity
One question that may arise from the studies of lipotoxicity in vivo is which specific pathways leading
to accumulation of reactive lipids in peripheral tissues are being activated. This immediately leads to a
problem, as lipidomics studies today characterize lipids at the intact molecular species level, while the common pathway databases such as KEGG (Kanehisa, Goto, Kawashima, Okuno, & Hattori, 2004) contain information only at the generic lipid class level, although it is expected that with the advent of lipid bioinformatics (Fahy et al., 2005; Yetukuri, Ekroos, Vidal-Puig, & Oresic, 2008) the information on lipids available in public databases will rapidly grow in the future. For example, cardiolipin contains four fatty acids. Assuming there are approximately 40 common naturally occurring fatty acids in humans, there are theoretically approximately 40^4 = 2,560,000 possible cardiolipin molecular entities. However, only one
entry for cardiolipin exists in KEGG. It is thus not difficult to imagine that reconstruction of complete
lipid metabolism at the molecular pathway level would lead to a combinatorial explosion.
As a possible solution to the combinatorial explosion problem, we recently proposed a method for
bridging the experimental knowledge with known lipid pathways in combination with omics data

(Yetukuri et al., 2007). Each compound entry is linked to the available information on lipid pathways
(e.g., based on KEGG) as well as contains information necessary for identification from the lipidomics
experiments. The pathway instantiation is performed for selected lipids based on identification of coregulated lipid species (Figure 1).

Figure 1. From metabolomics data to instantiated pathways. Adapted from (Yetukuri et al., 2007).

Using our new methodology, we have recently reconstructed the
lipotoxic pathways in fatty livers of obese mice (Yetukuri et al., 2007). We found that two ceramide
synthesis pathways are upregulated in fatty livers: (1) de novo ceramide synthesis pathways due to
increased flux of fatty acids into the cells, (2) glucosylceramidase and galactosylceramidase pathways,
leading to the release of ceramide from membrane glycosphingolipids. Reassuringly, a recent study showed
that pharmacological inhibition of glucosylceramide synthase enhances insulin sensitivity in the obese
mouse model (Aerts et al., 2007).

Conclusion
Systems approaches to the studies of in vivo systems must address the regulation of biological pathways
in the physiological context. Characterization of pathways as either causal or compensatory requires
knowledge of changes at the systems level. The study of lipid metabolism as related to the regulation of energy balance is one such example. As shown in this chapter, many practical bioinformatics and computational challenges remain for in vivo systems biology due to the need to bridge multiple spatial and temporal scales, which will require close cooperation between theoretical and experimental work.

References
Aerts, J. M., Ottenhoff, R., Powlson, A. S., Grefhorst, A., van Eijk, M., Dubbelhuis, P. F., et al. (2007).
Pharmacological inhibition of glucosylceramide synthase enhances insulin sensitivity. Diabetes, 56,
1341-1349.
Fahy, E., Subramaniam, S., Brown, H. A., Glass, C. K., Merrill, A. H., Jr., Murphy, R. C., et al. (2005).
A comprehensive classification system for lipids. J. Lipid Res., 46(5), 839-862.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., & Hattori, M. (2004). The KEGG resource for
deciphering the genome. Nucl. Acids Res., 32(Database Issue), D277-280.
McEwen, B. S., & Wingfield, J. C. (2003). The concept of allostasis in biology and biomedicine. Horm.
Behav., 43(1), 2-15.
Medina-Gomez, G., Gray, S., Yetukuri, L., Shimomura, K., Campbell, M., Curtis, K., et al. (2007).
PPAR gamma 2 prevents lipotoxicity by controlling adipose tissue expandability and peripheral lipid
metabolism. PLoS Genet., 3(4), e64.
Medina-Gomez, G., Virtue, S., Lelliott, C., Boiani, R., Campbell, M., Christodoulides, C., et al. (2005).
The link between nutritional status and insulin sensitivity is dependent on the adipocyte-specific Peroxisome Proliferator-Activated Receptor-γ2 isoform. Diabetes, 54(6), 1706-1716.
Pietiläinen, K. H., Sysi-Aho, M., Rissanen, A., Seppänen-Laakso, T., Yki-Järvinen, H., Kaprio, J., et
al. (2007). Acquired obesity is associated with changes in the serum lipidomic profile independent of
genetic effects - A monozygotic twin study. PLoS ONE, 2(2), e218.


Slawik, M., & Vidal-Puig, A. J. (2006). Lipotoxicity, overnutrition and energy metabolism in aging.
Ageing Res. Rev., 5(2), 144-164.
Summers, S. A. (2006). Ceramides in insulin resistance and lipotoxicity. Prog. Lipid Res., 45(1), 42-72.
Summers, S. A., & Nelson, D. H. (2005). A role for sphingolipids in producing the common features of
Type 2 Diabetes, Metabolic Syndrome X, and Cushing's Syndrome. Diabetes, 54(3), 591-602.
Unger, R. (2002). Lipotoxic diseases. Annu Rev Med, 53, 319-336.
Vance, D. E., & Vance, J. E. (Eds.). (2004). Biochemistry of lipids, lipoproteins and membranes (4th
ed.). Amsterdam, The Netherlands: Elsevier B. V.
Watson, A. D. (2006). Thematic review series: systems biology approaches to metabolic and cardiovascular disorders. Lipidomics: A global approach to lipid analysis in biological systems. J. Lipid Res.,
47(10), 2101-2111.
Wenk, M. R. (2005). The emerging field of lipidomics. Nat. Rev. Drug Discov., 4, 594-610.
Yetukuri, L., Ekroos, K., Vidal-Puig, A., & Oresic, M. (2008). Informatics and computational strategies
for the study of lipids. Mol. BioSyst., 4(2), 121-127.
Yetukuri, L., Katajamaa, M., Medina-Gomez, G., Seppänen-Laakso, T., Puig, A. V., & Oresic, M. (2007).
Bioinformatics strategies for lipidomics analysis: characterization of obesity related hepatic steatosis.
BMC Syst. Biol., 1, e12.

Key Terms
Adipose Tissue: Loose connective tissue composed of fat cells or adipocytes. Its main role is to
store energy in the form of fat, although it also cushions and insulates the body.
Allostasis: The process of achieving stability, or homeostasis, through physiological or behavioral
change. Allostasis is generally adaptive in the short term, and can be carried out, e.g., by cytokines, the autonomic nervous system, or the metabolome.
Ceramide: A sphingolipid which can induce apoptosis and is a key mediator of lipotoxicity. It consists of sphingosine linked to a fatty acid via an amide bond. It is both a structural and a signaling molecule.
Lipids: A diverse class of biological molecules that play a central role as structural components of
biological membranes, energy reserves, and signaling molecules. They are broadly defined as hydrophobic
or amphipathic small molecules that may originate entirely or in part by carbanion based condensation
of thioesters, and/or by carbocation based condensation of isoprene units.
Metabolomics: Metabolomics is a discipline dedicated to the global study of small molecules (i.e.,
metabolites), their dynamics, composition, interactions, and responses to interventions or to changes
in their environment, in cells, tissues, and biofluids.


Lipidomics: Lipidomics, as a subfield of metabolomics, aims at the characterization of lipid molecular species and their biological roles with respect to the expression of proteins involved in lipid metabolism and function, including gene regulation.
Lipotoxicity: Accumulation of (lipo)toxic reactive lipids such as ceramides in non-adipose tissues
of metabolically important organs such as pancreatic β-cells, skeletal muscle, liver, and heart.
Pathway Instantiation: Reconstruction of possibly activated pathways originating from a specific molecular entity (e.g., protein or metabolite), based on molecular interactions and enzymatic reactions.

360

361

Chapter XXII
Approaching Type 2 Diabetes Mellitus by Systems Biology

Axel Rasche
Max-Planck-Institute for Molecular Genetics, Germany

Abstract
We have acquired new computational and experimental prospects to seek insight into, and a cure for, an ancient malady afflicting millions of people. Type 2 diabetes mellitus (T2DM) is a complex disease with a network of interactions among several tissues and a multifactorial pathogenesis. Research conducted in humans and multiple animal models has strongly focused on genetics so far. High-throughput experimental techniques like microarrays provide new tools to extend current knowledge. Besides experimental techniques and platforms, and the rather general concepts behind a new term in biology and medicine, this chapter joins these conceptions with a very current medical challenge. It outlines current results and envisions a possible avenue towards the comprehension of T2DM.

Introduction
Type 2 Diabetes mellitus (T2DM) is the most common metabolic disease with more than 170 million
patients worldwide. It is rapidly increasing in developed and developing countries and is a huge, growing burden for health care systems. In the USA, T2DM already accounts for over 130 billion dollars of health care costs (Stumvoll, Goldstein, & Haeften, 2005).
In the past, T2DM was rarely seen in young people and was thus called adult-onset diabetes. But its prevalence is increasing due to changes in lifestyle. For babies born in 2000, an estimated chance of 33%-50% of developing T2DM leads to an 11- to 18-year reduced life expectancy. Several risk factors account for this prevalence, including genetics, nutrition, low physical activity and low birthweight. The genetic predisposition is identified by looking at the offspring of diabetic patients: a positive family history confers a 2.4-fold increased risk for T2DM. For first-degree relatives of an afflicted person the risk is
increased by 15% to 25%. In twin studies the difference of the concordance rate between monozygotic
and dizygotic twins returns an estimate of the genetic contribution, as dizygotic twins only share 50%
of the genetic code. The concordance rate is 35% to 38% for monozygotic twins and 17% to 20% for dizygotic twins. Genetic predisposition merges with the environment. Increased availability of food combined with reduced physical activity leads to obesity, which itself becomes the major influencing factor. In the USA, 12% of the population was classified as obese in 1991, increasing to 20.9% in 2001 and about 30% today, resulting in an anticipated epidemic increase of T2DM in the next decades. Physical inactivity is a controllable factor: 20 minutes of exercise per day is enough for a noticeable improvement in treating T2DM. An unhealthy diet with a surplus of fatty acids completes the problem.
Several molecular mechanisms are proposed to link obesity to T2DM. But the connection to the
pathophysiology still remains unknown. Less than 10% of the T2DM variants are monogenic disease
forms; on the other hand, a high number of susceptibility genes has been reported for T2DM. Alterations in an entire network of genes are thought to be responsible for the disease. Most of the costs do not derive directly from diabetes but from its associated complications like macrovascular and microvascular diseases or accelerated atherogenesis. In a pathogenesis that is still unclear, T2DM is preceded by impaired glucose tolerance, where glucose needed for cellular energy can no longer enter the cell. Together with an impairment in insulin action, increased adiposity drives a progression into insulin resistance.
Since available T2DM therapies are of limited effectiveness, new insight into the disease by biomedical research must be sought. The classic genetic approaches have been more successful in monogenic
diabetes like maturity-onset diabetes of the young or mitochondrial diabetes. The unknown hereditary
mode poses a challenge, so far resulting in a number of candidate genes. Transgenic and knock-out
mice are helpful in dissecting the transcriptional regulatory network. With the dawn of high-throughput
methods a novel way to tackle these challenges arises. Microarrays allow us:
1. To dissect the diversity of the disease primarily on the transcriptomic level.
2. To identify transcription factor target sets using ChIP-on-Chip.
3. To search for single nucleotide polymorphisms (SNPs) using genotyping arrays.

The complex pathophysiological interactions between the tissues fat, muscle, liver, pancreas and
brain are captured in distinct expression profiles for different mouse strains and different diets. On the level of proteomics, the secretory proteins in adipose tissue are now known (Chen, 2005), while mass-spectrometry experiments are pursued in several labs.

Figure 1. Different influences on the metabolism cause T2DM with its subsequent complications
In this chapter, systems biology is defined as the development of mathematical models by the aggregation of experimental results. In the following we give an overview of the physiology of T2DM to
sketch the complexity of the disease. T2DM is under study in several animal models based on different
changes in the genomes. These genetic changes have been the most promising approach in the last years,
now complemented by high-throughput gene expression, proteomic or metabolic surveys. The results
are merged in the discipline of systems biology to generate and validate mathematical models.

Type 2 Diabetes Mellitus: Physiology

In this subchapter the biological topic, T2DM, is outlined. Central parts of the metabolism are affected. There is no space to address T2DM exhaustively, but only to convey the complexity of the affliction. The first paragraph distinguishes different types of diabetes; afterwards we look at the physiological interplay, subsequently describe the pathogenesis, and finally link this topic to medicine with the diagnostic criteria.

Diabetes
The main characteristic of diabetes mellitus is an abnormally high level of glucose in blood (Dean &
McEntyre, 2004; Stumvoll et al., 2005). Healthy people can regulate their blood glucose, whereas in diabetics glucose levels remain high. Insulin regulates the blood glucose level, and in diabetes insulin is not produced at all, is produced insufficiently, or does not act as effectively as needed. The most common forms are the autoimmune disorder type 1 diabetes (5% of the cases) and obesity-associated type 2 diabetes (95%). Some rare variants exist, caused for example by single gene mutations. T2DM generally occurs in obese adults. Many underlying factors contribute to high blood glucose levels. An important factor is a resistance of the body to insulin, which ignores its own insulin secretion. Therefore T2DM is a combination of deficient secretion
and deficient insulin action. The rise of obesity in the population is the driving force behind the increase
of diabetes. Today it can be difficult to maintain healthy body weight in the presence of abundant food
and a sedentary life. Being overweight or obese is defined by looking at the Body Mass Index (BMI).
A BMI of 18-25 is healthy, 25-30 overweight and above that level obese.
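For concreteness, a tiny sketch of the calculation and of the categories quoted above (hypothetical helper functions, thresholds as given in the text):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

def bmi_category(value: float) -> str:
    if value < 18:
        return "below the healthy range"   # not discussed in the text
    if value < 25:
        return "healthy"
    if value < 30:
        return "overweight"
    return "obese"

b = bmi(95.0, 1.75)
print(f"BMI = {b:.1f} -> {bmi_category(b)}")   # BMI = 31.0 -> obese
```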

Physiology
Here we follow the interaction of the key players for energy control in the human body. Three interactors are discussed and subsequently the involved tissues:


• Glucose, an essential energy source for the body
• Insulin, regulator of glucose and energy balance
• Glucagon, opponent of insulin


Blood glucose levels are variable depending on the needs of metabolism, rising for three reasons:
diet, breakdown of glycogen or hepatic synthesis of glucose. Glucose is regulated by several hormones,
in particular by insulin. When glucose is abundant, insulin is released from the pancreas and stimulates:

• Muscle and fat to remove glucose from the blood
• Cells to convert glucose into ATP
• Liver and muscle tissue to store glucose as glycogen
• Fat tissue to store glucose as fat
• Cells to use glucose in protein synthesis

Glucagon opposes the action of insulin and rises in times of scarcity. Insulin activates the formation of glycogen, and glucagon activates glycogen breakdown. Glucagon also helps the body to use alternative resources such as fat and protein. Fasting results in a fall of the blood sugar level, leading to lower insulin and higher glucagon. Glucagon raises blood glucose by mobilizing glycogen from the liver's short-term reserve and by stimulating glucose production from amino acids in the liver. Glucagon release is stimulated by several causes, such as protein-rich food or stress. After fasting for a couple of days, the liver's glycogen is exhausted, but it continues to make glucose from amino acids and from the glycerol stored in fat.
When glycogen stores are full, lipogenesis converts glucose into fat. Insulin supports lipogenesis by increasing glucose transporters and by inhibiting the release of fatty acids and the breakdown of fat, thereby lowering fatty acid levels in the blood. Insulin also stimulates the entry of amino acids into cells and protein production.
Four tissues are affected by this interplay:



• Liver: Buffers and produces glucose
• Fat: Stores energy
• Skeletal muscle: Stores energy as glycogen
• Pancreas: Insulin is produced, stored and released from the pancreatic islets.

One of the most important organs in this interplay is the liver. It produces and consumes glucose and
buffers glucose levels. From digestion the liver receives glucose-rich blood and removes large amounts of glucose to regulate the blood glucose level. A rise in blood glucose is detected by the pancreatic beta cells, which respond with the release of insulin. This rise of glucose also lowers the release of glucagon and thus the production of glucose from other sources.

Figure 2. Insulin as main regulator


In every tissue, glycolysis inside the cell uses some of the glucose. Glycolysis is a central pathway
of carbohydrate metabolism which occurs in all body cells and releases energy and carbohydrate intermediates for use in metabolism.

Pathogenesis
In the pathogenesis of T2DM, insulin is no longer able to adequately stimulate glucose usage in fat and muscle and to inhibit glucose production. Impaired insulin sensitivity and glucose intolerance are early phenomena, leading to hyperglycemia, hyperlipidemia and, eventually, to a failure of pancreatic β-cells to produce and secrete a sufficient amount of insulin. However, most genes and their associated molecular network contributing to the onset and course of the disease are yet unknown. An understanding of the interplay between obesity and insulin resistance is crucial but not completely resolved (Kahn & Flier, 2000; Stumvoll et al., 2005). Obesity and physical inactivity are strongly associated with insulin resistance. Several mechanisms mediating this relationship have been identified. Circulating hormones, cytokines and metabolic fuels from the adipocytes influence insulin action. Sometimes lipids are related to
skeletal muscle insulin resistance, highlighting excessive fat storage in non-adipose cells.

Diagnosis
Diabetes mellitus is diagnosed on the basis of the WHO recommendations from 1999, which combine both fasting and 2h after 75g glucose load criteria into a practicable diagnostic classification (Table 1). Impaired fasting glucose and impaired glucose tolerance are conditions predisposing to overt diabetes mellitus. If not treated, a substantial proportion of people with these conditions will progress to overt diabetes (Stumvoll et al., 2005).

Table 1. Diagnostic criteria of diabetes mellitus and other categories of hyperglycaemia

(mmol/L)                            Fasting            2h after 75g glucose load
Diabetes mellitus                   ≥ 7.0       or     ≥ 11.1
Impaired glucose tolerance (IGT)    < 7.0       and    7.8 – 11.0
Impaired fasting glucose            6.1 – 6.9   and    < 7.8
Normoglycemia                       < 6.1       and    < 7.8
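A minimal sketch turning Table 1 into a classification helper (a hypothetical function, for illustration only and not for clinical use):

```python
def classify_glucose(fasting_mmol_l: float, two_hour_mmol_l: float) -> str:
    """Apply the Table 1 (WHO 1999) thresholds to fasting and 2h post-load glucose."""
    if fasting_mmol_l >= 7.0 or two_hour_mmol_l >= 11.1:
        return "diabetes mellitus"
    if two_hour_mmol_l >= 7.8:      # and < 11.1, already excluded above
        return "impaired glucose tolerance (IGT)"
    if fasting_mmol_l >= 6.1:       # and below 7.0, already excluded above
        return "impaired fasting glucose"
    return "normoglycemia"

print(classify_glucose(5.2, 6.5))    # normoglycemia
print(classify_glucose(6.4, 7.2))    # impaired fasting glucose
print(classify_glucose(6.0, 9.5))    # impaired glucose tolerance (IGT)
print(classify_glucose(7.6, 12.4))   # diabetes mellitus
```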

Species
For molecular biology, different tissues of an organism are analysed. Due to ethical restrictions, human tissue is rarely available. To bypass this problem, animal models are used in research. In the following, some of the most important strains are presented.
As mentioned, it is difficult to get human tissue for analysis, which demands the cooperation of many medical doctors to obtain statistically meaningful sample sizes. In addition, nutrition and lifestyle are not under
control, in contrast with lab mice. On the other hand, the heritable predisposition for the disease is observable in family studies, twin studies and ethnic or population differences. In the subchapter about genetics I describe approaches to track the involved genes.
With animal models, more fine-grained study designs are possible through controlled conditions with regard to nutrition and lifestyle, but also genetics. Time series for different disease states and backcrossing experiments for genetic insight are further features. T2DM affects basic metabolic processes and is therefore traceable in organisms from mice and rats down to Caenorhabditis elegans, where the most relevant pathways appear as aging pathways. Although rat models for T2DM exist, like the Zucker Diabetic Fatty (ZDF) rat or the Zucker Fatty rat, most of the available high-throughput screens and genome-wide genetic studies are performed in the mouse, which is now described in more detail.
A complete overview of the mouse models used in T2DM research is provided by Clee and Attie (2007), highlighting the history of the mouse strains and their susceptibility to impaired glucose tolerance or T2DM. For example, the C57BL/6 strain, which accounts for 14% of the experiments, shows both diabetes-susceptible and diabetes-resistant aspects. With the so-called ob mutation in the Leptin gene, the same mouse strain is severely diabetic. Likewise, the BTBR strain shows strong diabetes susceptibility. The New Zealand Obese (NZO) mouse is a polygenic mouse model reflecting the whole metabolic syndrome surrounding obesity and T2DM. All of the described mouse models, including the dedicated T2DM models, are available from the Jackson Labs (Jackson Labs, 2006). Insight into the metabolism and insulin
resistance drawn from mouse models is described in (Nandi, Kitamura, Kahn, & Accili, 2004). The
authors break down the plurality of knock-out and transgenic mice by phenotypes and tissue to find
unsuspected players, e.g. transcription factors, which emerge from the underlying studies.
In summary, major findings can arise from a variety of organisms to understand human metabolism.
A caveat is that the disease in the animal models may have causes different from those in the human setting. So the results may lack comparability and have to be reproduced in different models. To understand this missing comparability, we continue with a look at the methods for analyzing differences in the genetic code.

Genetics
T2DM is clearly associated with a genetic predisposition, which is demonstrated for example in family studies. A positive family history is related to a 2.4-fold increased risk (Stumvoll et al., 2005). Twin studies have contributed to distinguishing non-genetic from genetic factors. For impaired glucose tolerance, concordance rates were 88% in monozygotic twins.
The two main strategies are genome-wide scanning and the candidate gene approach. In genome-wide scanning, different genomes from the same species are compared to each other to narrow down disease-related regions. In the candidate gene approach, gene sequences of physiologically important
proteins are compared among population samples. Both approaches seek to isolate causative genetic
changes.

Genome-Wide Scanning


For genome-wide scanning in humans, association studies are performed on patient cohorts recruited over years, using genotyping microarrays to isolate chromosomal regions or, more precisely, single nucleotide polymorphisms (Permutt, Wasson, & Cox, 2005; Sladek et al., 2007).


In a variant, called linkage approach, the entire genome of affected family members is compared
using genetic markers, to combine the alterations with the family genealogy over several generations
and affected sibling pairs to look for associations between parts of the genome and the risk of developing
diabetes. This locates genes by the rationale that affected family members not only share the phenotype but also the chromosomal regions surrounding the involved gene. Positive associations are found in one or more studies, but the subsequent positional cloning of causative genes has mostly been unsuccessful. To complicate the genetic quest further, a possible relation of T2DM to imprinted genes has arisen. Imprinted genes are genes for which expression varies depending on the sex of the transmitting parent.
The genome-wide scan is also used in mice. Through backcrossing between susceptible and unsusceptible strains, quantitative trait loci are isolated with a similar goal, but with a much smaller demand for individuals than in the human case (Permutt et al., 2005). In the animal model it is easier to
follow and direct the family history. The chromosomal regions linked to the trait, here T2DM, are called
quantitative trait loci. Recently the genetic component has been directly linked to the expression level
using expression microarrays and genotyping arrays on backcrossed mouse strains (Lan et al., 2006).
Thus, it integrates two different information levels and results in narrow genomic candidate regions.
Genetic linkage and association to a phenotype like T2DM often have poor replicability, which is regularly attributed to a number of factors, among others ethnic stratification, population-specific linkage disequilibrium between markers and causal variants, but also gene-by-gene and gene-by-environment interactions. Because of the late onset of T2DM, susceptibility gene variants may exist in the control
group and reduce the power of genetic-linkage and association analyses.

Candidate Gene Approach


The candidate gene approach examines specific genes for their role in T2DM. In unrelated individuals, the statistical association between an allele and the phenotype T2DM or impaired glucose tolerance is tested. Obvious biological candidates for insulin resistance have been examined (Dean & McEntyre, 2004; Parikh & Groop, 2004). But the candidate gene approach has had only minor success in identifying causative factors. In many candidate genes, like ABCC8 or GCGR, variants were extensively analysed, but mostly the initial association could not be replicated in subsequent studies (OMIM, 2000).
The candidate gene approach is scientifically simpler, focusing on disease status and alleles or haplotypes in insulin signalling or glucose metabolism. Dean and McEntyre (2004) and Parikh and Groop (2004) describe the work and results on the most promising candidates. In animal models, genetic manipulation is possible, often leading to the mouse models described above (Clee & Attie, 2007).
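The core statistic behind such a case-control comparison is usually a simple allele-count contingency test; a minimal sketch with invented counts and SciPy assumed:

```python
from scipy.stats import chi2_contingency

#           [risk allele, other allele]  -- hypothetical allele counts
cases    = [620, 380]      # chromosomes from T2DM patients
controls = [540, 460]      # chromosomes from healthy controls

chi2, p_value, dof, expected = chi2_contingency([cases, controls])
odds_ratio = (cases[0] * controls[1]) / (cases[1] * controls[0])
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, odds ratio = {odds_ratio:.2f}")
```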

Summary
Due to the heterogeneity, genetic studies have produced very diverse results. At this time their incorporation into a systems biology model is a challenge, as most of the models do not account for changes in the genetic code but focus on the physiology. An exhaustive collection of implicated genome regions is provided by OMIM under the identifier #125853 (OMIM, 2000).
Mutations of a single gene can result in disease. This happens in rare forms of diabetes. Such mutations can be investigated with sequencing to find the responsible alterations in the DNA, so-called single nucleotide polymorphisms (SNPs). T2DM is assumed to be polygenic. Disease genes may show subtle
but common differences in the gene sequence. It is difficult to link these common gene variations to an

increased risk of developing T2DM. Therefore it is a remarkable result that microarray studies converge on the same functional modules when metabolic pathways are derived from expression results (Toye
& Gauguier, 2003).

Gene Expression in T2DM


DNA microarrays present a powerful tool for studying the mechanisms of complex diseases and have
been used to dissect every aspect of T2DM (Sun, 2007). Under physiologic and pathologic conditions, the technology permits a more comprehensive understanding of the multiple genes involved in the mechanisms of the disease. Microarrays allow the categorization of disease states according to the changes in the mRNA expressed. This is used in parallel in several tissues and different cell types at the same time, or in the same cell type at several time points. Sun (2007) outlines the current status of DNA
microarray applications in the field of obesity and T2DM. A number of high-quality microarray data
sets can be found at (mult., 2002).
Animal models and cultured cells analysed with microarrays are used in a variety of studies. In T2DM
they have returned a tremendous amount of information about the pathophysiology. In vivo and in vitro studies profiling adipocytes from intra-abdominal and subcutaneous adipose tissue revealed coordinated depot-specific differences in the expression of multiple genes involved in embryonic development and pattern specification. In the case of the liver, nonalcoholic fatty liver disease is related to lipid metabolism, extracellular matrix remodeling, liver regeneration, apoptosis and detoxification processes. In rat skeletal muscle, the activation of the hexosamine biosynthesis pathway, a nutrient-sensing mechanism, decreased genes involved in oxidative phosphorylation. Diet effects alter the expression of hundreds of genes primarily related to lipid metabolism and to transcription factors in adipocyte differentiation. Expression profiling in the central nervous system is only possible in animals; one finding was the cytokine signalling box-containing protein 4 expressed in brain areas linked to energy homeostasis.
In humans, analogous studies are carried out, for example in adipose tissue, returning diverse results, e.g. linking genes to lipid and glucose metabolism, membrane transport and promotion of the cell cycle. In liver, gene groups related to (again) lipid metabolism and extracellular matrix remodeling are differentially expressed in patients with nonalcoholic steatohepatitis. In skeletal muscle, probably the most important finding is the downregulation of the oxidative phosphorylation pathway, in accord with the rat results (see above).
The current application of microarrays is accompanied by some problems and challenges. A set of
guidelines called Minimum Information About a Microarray Experiment (MIAME) has been introduced by the international Microarray Gene Expression Data (MGED) Society. However, there are still some issues to be considered. The total number of probes on a microarray and the selection of the probe sequences from the gene/transcript sequences differ between chip manufacturers and therefore hinder comparability. Direct comparisons may focus on the overlapping part; in all other cases the differences have to be accounted for when interpreting the results. Biological parameters are to be kept constant in the experiment, such as the age of the animals or patients, a balanced gender ratio, and sample collection at the same time point in the menstrual cycle. Data analysis is complex due to the large number of genes and possible study designs. A small number of samples is most often the biggest issue, resulting in low statistical power of the study.
Microarray experiments mostly result in a list of differentially expressed genes. A plurality of tools
has been developed to enhance the expression results, and only selected approaches can be mentioned.
(Jensen & Steinmetz, 2005) discuss the comparability of array results in terms of integrating data sets to

368

Approaching Type 2 Diabetes Mellitus by Systems Biology

raise the power and confidence in the results. The number of available experiments suggests performing
meta-analyses first, identifying core processes underlying T2DM. A simple way to ascend from the gene
expression level to biological function in terms of metabolic or signalling pathways is enrichment analysis. The
technique was originally introduced by Mootha et al. (2003) as gene set enrichment analysis. Liu et al. (2007)
use an advanced approach, gene network enrichment analysis, to identify network models for T2DM (see below).
Not only sets of apparently dependent genes are taken into account, but rather subnets of the protein-protein
interaction network. The gene networks are compared with the results of several microarray data sets from (mult., 2002).
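A minimal sketch of the enrichment idea, using a simple over-representation (hypergeometric) test on invented gene identifiers rather than the full ranked-list statistic of gene set enrichment analysis:

```python
from scipy.stats import hypergeom

background = {f"gene{i}" for i in range(20000)}      # all measured genes (hypothetical)
deg = {f"gene{i}" for i in range(0, 400, 2)}         # differentially expressed genes (hypothetical)
pathway = {f"gene{i}" for i in range(0, 600, 3)}     # a pathway gene set, e.g. oxidative phosphorylation (hypothetical)

overlap = len(deg & pathway)
# probability of observing at least this overlap by chance
p_value = hypergeom.sf(overlap - 1, len(background), len(pathway), len(deg))
print(f"overlap = {overlap}, enrichment p = {p_value:.3g}")
```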
Altogether the analysis of the transcriptome with microarrays is a valuable resource of data for
systems biology. The mass of data may be daunting and data quality issues are regularly brought up.
Nevertheless the alternative, RT-PCR, returns only selective results, and thus high-throughput methods
support the systems biology approach outlined in the next subchapter.

Systems Biology for T2DM

Systems biology comprises different aspects exemplified in this chapter. It is sustained by genetic
analyses and high-throughput methods in different species. At its centre is the aggregation of the different
analyses into a comprehensive model (Klipp, Herwig, Kowald, Wierling, & Lehrach, 2005).

Systems Biology Starts with Data Integration


On the first level, heterogeneous resources like experimental data sets or databases are integrated in order to parse
data and query information. Several types of data sets are described above and can be set in relation
to gene or protein sequence databases, interaction data or pathway data for several organisms. On a
second level, correlations across different sources lead to a more comprehensive and coherent picture.
For example, the genetic hints so far rarely correlate with results from expression analysis. On a
third level, the newly gained information is cast into networks and pathways for the understanding
of T2DM on a cellular level.

Systems Biology is Modeling


T2DM is a complex process that does not follow any elementary principles, at least none that we are aware of.
The outcome of genotype and lifestyle cannot be foreseen by experience. The properties of an
appropriate model let us distinguish between system states, for example T2DM versus healthy, obese versus lean,
or high-fat diet versus healthy diet. The variables, parameters and constants of the model clarify our possibilities to regulate and control the disease.
The substantial changes in T2DM are on the metabolic level. Perhaps a single model cannot serve all
purposes. Metabolism is a generic term for catabolic and anabolic reactions, which transform molecules
of one type into molecules of a different type. Metabolic reactions are modeled on three levels of abstraction:

•	Enzyme kinetics studies the dynamic properties of the reactions.
•	Stoichiometric analysis describes the network character (see the sketch after this list).
•	Metabolic control analysis (MCA) quantifies the effect of perturbations in the network.
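As a sketch of the second level, the stoichiometric matrix of a hypothetical linear pathway (uptake of A, conversion of A to B, export of B) already determines the admissible steady-state fluxes; the code below computes them as the null space of that matrix.

```python
import numpy as np
from scipy.linalg import null_space

# rows: metabolites A and B; columns: reactions v1 (-> A), v2 (A -> B), v3 (B ->)
# this toy network is an invented example, not taken from the chapter
N = np.array([[1, -1,  0],
              [0,  1, -1]], dtype=float)

K = null_space(N)          # basis of all flux vectors v with N v = 0
print(K / K[0])            # normalised: all three fluxes must be equal at steady state
```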
Figure 3. Systems biology facilitates attaining new insights

Gene expression can be modelled in two ways. The expression of one to a few genes can be described
in terms of transcription and translation, including the binding of transcription factors and RNA polymerases to DNA,
the effect of activators or inhibitors, and subsequently the different stages of mRNA maturation. This
is not appropriate for microarray experiments with tens of thousands of measurements. Here, reverse engineering
methods reconstruct the underlying regulatory networks. This approach neglects the complex regulatory
machinery, covers a larger fraction of the genes in a cell and focusses on activators and inhibitors.

Systems Biology Uses Computers and the Internet


Modeling tools help formulate theoretical hypotheses and extract information relevant to these hypotheses.
General purpose tools like R, Matlab or Mathematica have a steep learning curve. Focussed tools for the
special needs of biologists are easier to use, like PyBioS (Wierling, Herwig, & Lehrach, 2007), where
the handling of the differential equations is hidden behind a web interface. Parameters and kinetics
are controlled through this interface. Exchange of models is meanwhile standardized by the Systems
Biology Markup Language (SBML), which is also used for some of the models described below.
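A minimal sketch of building an SBML skeleton programmatically with the python-libsbml bindings; the model, compartment and species identifiers are placeholders and not taken from PyBioS or from any model in this chapter.

```python
import libsbml  # pip install python-libsbml

doc = libsbml.SBMLDocument(3, 1)          # SBML Level 3 Version 1
model = doc.createModel()
model.setId("glucose_toy_model")          # placeholder identifier

comp = model.createCompartment()
comp.setId("plasma")
comp.setConstant(True)
comp.setSize(1.0)

for sid in ("glucose", "insulin"):        # placeholder species
    s = model.createSpecies()
    s.setId(sid)
    s.setCompartment("plasma")
    s.setInitialConcentration(0.0)
    s.setConstant(False)
    s.setBoundaryCondition(False)
    s.setHasOnlySubstanceUnits(False)

print(libsbml.writeSBMLToString(doc))     # serialised model, exchangeable between tools
```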
With the rapid increase of biological data, the need to organise and structure the data increases, and
so does the number of databases on the internet. No comprehensive database with a focus on T2DM is
available, but different databases provide the information needed for the modeling. For example, several microarray datasets are available at (mult., 2002), and the KEGG pathway database provides a topological
model of a T2DM signalling pathway.

Systems Biology Models Serve Different Purposes


A descriptive model elucidates causes and developments or explains the expression signatures that we
observe in the experiments. Projects are already under way to use expression signatures in the diagnosis
of certain diseases. After such a refined diagnosis, an optimized drug treatment uses the best possible
medicine administered at an adequate dose. The approval process for new treatments could be facilitated
if accompanying models show that a treatment complies with the relevant restrictions.

Models of T2DM
In this last subchapter we discuss mathematical modeling in the context of T2DM. The subject is not
new, but has been known as physiological modeling for years. What is new are the computer-assisted techniques and the
support by high-throughput data, resulting in models of increased scale. The body's glucose balance and
T2DM itself have been subject to modeling for several years. But despite the availability of these models,
we still lack key aspects explaining the pathogenesis, or adequate drug targets. Here we first discuss
mathematical modeling and then present five examples of models in this research area.

Mathematical Modeling
The goal of the systems biology approach is to create a comprehensive representation of all the biochemical reactions in humans and to understand the control of signalling pathways in complex networks.
To this end, mathematical models of a descriptive, predictive or elucidative nature are used. Ultimately, a
quantitative, kinetic model reflects the outcome of the experiments. Such models are still rare, and
none exists for T2DM that comprises all tissues and aspects of the disease. Mathematical models have the
advantage of being cheap and of reducing the number of necessary animal experiments; at the same time,
they facilitate the conceptual clarification of the disease and the observed biology, independent of the kind of
model: a graph, a network model or a kinetic, differential equation model (Klipp et al., 2005). Special
challenges in T2DM are the suspected involvement of the mitochondrion, the influence of nutrition and
the genetic predisposition. No templates exist to incorporate such aspects into a model.

Selected Models
Here we present five examples of models for T2DM. Kansal (2004) and the introduction of Jiang, Cox, and
Hancock (2007) review several approaches to the topic and are recommended for further reading. The
models are intentionally chosen to present different modeling approaches and therefore do not have much
in common but the subject, T2DM. Models I and V are differential equation models and return numerically exact results. Model II is a large scale model incorporating a maximum of components. Models III
and IV are interaction maps that serve to discuss known interactions in terms of control theory as well as to
highlight previously unevaluated protein-protein interactions for a possible relationship to T2DM.

Model I
In 1979 the minimal model was presented by Bergman, Ider, Bowden, and Cobelli (1979). It models glucose
clearance in order to understand the action of insulin. It comprises two coupled differential equations:
dG(t)/dt = -(SG + X(t)) G(t)

dX(t)/dt = p2 I(t) - p3 X(t)
where G is the concentration of glucose in the plasma, I is the concentration of insulin in the plasma,
and X is the concentration of insulin in a remote compartment. SG is the glucose effectiveness, and p2 and
p3 are model parameters. For a glucose injection the two equations describe the glucose decline and the
insulin movement from the blood to the insulin-sensitive tissue. The first equation describes the decline of glucose through
an insulin-independent and an insulin-dependent process. The second equation stands for the
removal of insulin from the blood, dependent on the available blood insulin as well as on the insulin already present in
the insulin-sensitive tissue. The model has been successful as a clinical tool and has meanwhile been checked
and extended in various directions. It is shown here (Figure 4) because of its simplicity.
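A minimal sketch of simulating these two equations with SciPy instead of PyBioS; the parameter values and the decaying insulin input are illustrative assumptions, not fitted data.

```python
import numpy as np
from scipy.integrate import odeint

SG = 0.026               # glucose effectiveness (assumed value, 1/min)
p2, p3 = 1.3e-5, 0.025   # model parameters (assumed values)
Ib = 10.0                # basal plasma insulin (assumed)

def insulin(t):
    # assumed insulin excursion after a glucose injection, decaying back to basal
    return Ib + 100.0 * np.exp(-0.05 * t)

def minimal_model(y, t):
    G, X = y
    dG = -(SG + X) * G               # insulin-independent and insulin-dependent glucose decline
    dX = p2 * insulin(t) - p3 * X    # insulin action building up in, and leaving, the remote compartment
    return [dG, dX]

t = np.linspace(0, 180, 361)                      # minutes
course = odeint(minimal_model, [300.0, 0.0], t)   # start from an elevated glucose of 300 mg/dl
print("glucose and remote insulin action at t = 180 min:", course[-1])
```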

Model II
For skeletal muscle, Pollard et al. (2005) present a large scale model with 157,000 molecular components and more than 210,000 relationships between those components. Building on several
microarray data sets, the authors take on the challenge of modeling the entirety of the skeletal muscle cell. For
an approach of this size, reactions cannot be composed by hand but have to be built computationally.
On the other hand, the intricate complexity of large scale schemes is reduced by restricting expression
changes to three categories: expression increased, unchanged or decreased in T2DM. The model is thus directly
set up as a qualitative, discrete model, reducing parameter space and computation time. The authors apply
their model to the treatment of T2DM with different drugs and to post-transplant diabetes mellitus.

Model III
An interaction map in terms of control theory is presented in Kitano et al. (2004), clearly separating
the interactions between the tissues. The robustness of the body's energy balance is discussed, highlighting several feedback loops that cope with an unstable food supply. The interaction map elucidates cross talk between
tissues and pathways. This is the only model so far to divide and link the related tissues; in model IV
tissues are intermingled, and models II and V focus on one tissue type each.

Figure 4. Modelling and simulation of the minimal model with PyBioS


Model IV
In Liu et al. (2007) the authors identified an insulin signalling network and a nuclear receptor network that are consistently
and differentially expressed in several occurrences of insulin resistance.
Integrating different microarray data sets from (mult., 2002) and combining the results with the
protein-protein interaction network, the authors determine these two subnetworks of the
protein-protein network. The method does not consider tissues or species but intermingles the data sets
and thus yields a robust network. The results for the two subnetworks suggest that different members
are transcriptionally altered under different forms of insulin resistance. This may be a consequence of noisy
microarray data, but it is also consistent with the conception of T2DM as a combinatorial disease, with
different gene sets independently causing T2DM under different circumstances.
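The following sketch illustrates the idea behind this combination with the networkx package; the interaction edges and the set of differentially expressed genes are invented examples, not the data of Liu et al. (2007).

```python
import networkx as nx

# a tiny, invented protein-protein interaction graph around insulin signalling
ppi = nx.Graph()
ppi.add_edges_from([
    ("INSR", "IRS1"), ("IRS1", "PIK3R1"), ("PIK3R1", "AKT2"),
    ("AKT2", "GSK3B"), ("IRS1", "GRB2"), ("GRB2", "SOS1"),
])

# hypothetical genes called differentially expressed across several data sets
differential = {"INSR", "IRS1", "PIK3R1", "AKT2"}

# keep only the affected proteins and report connected candidate subnetworks
sub = ppi.subgraph(differential)
for component in nx.connected_components(sub):
    if len(component) > 1:
        print("candidate subnetwork:", sorted(component))
```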

Figure 5. The insulin signalling network of model IV combining protein-protein-interactions and gene
expression data


Model V
A kinetic model in Jiang et al. (2007) targets glucose and insulin in pancreatic beta cells. The model consists
of 44 enzymatic reactions, 59 metabolic state variables and 272 parameters, integrating subsystems such
as glycolysis, the TCA cycle, the respiratory chain and the pyruvate cycle. This model is focussed on one tissue and
cell type but considers the compartmentalization of the reactions into cytoplasm and mitochondrial matrix.
The analysis of the model elucidates restrictions and opportunities for the analysis of interspecies differences. Whereas human and rat have malate dehydrogenase in pancreatic islets, it is lacking in mouse pancreatic
islets. One of the key findings of the presented model is that there is no dramatic effect on ATP concentration or
oscillatory behavior with or without malate dehydrogenase, which illustrates the robustness of the
model and of ATP regulation.

Prospect
Modelling approaches in T2DM have to tackle enormous challenges like incorporating genetics, nutrition and mitochondria. For these challenges no predecessors exist in other diseases. With the advent of
high-throughput experiments and the methods of systems biology, a new chance to resolve these issues
has arisen. Models exist, but there are still several steps ahead. Previous knowledge is limited: for example,
only some metabolic pathways are available as kinetic models, whereas most signalling pathways
exist only as interaction maps.

References
Bergman, R. N., Ider, Y. Z., Bowden, C. R., & Cobelli, C. (1979). Quantitative estimation of insulin
sensitivity. Am J Physiol, 236(6), E667-677.
Chen, X., Cushman, S. W., Pannell, L. K., & Hess, S. (2005). Quantitative proteomic analysis of the secretory proteins from rat adipose cells using a 2D liquid chromatography-MS/MS approach. J. Proteome
Res., 4(2), 570-577.
Clee, S. M., & Attie, A. D. (2007). The genetic landscape of type 2 diabetes in mice. Endocr Rev, 28(1),
48-83.
Dean, L., & McEntyre, J. (2004). The genetic landscape of diabetes: NCBI.
Jackson Labs. (2006). Human disease and mouse model detail for NIDDM.
Jensen, L. J., & Steinmetz, L. M. (2005). Re-analysis of data and its integration. FEBS Letters, 579(8),
1802-1807.
Jiang, N., Cox, R. D., & Hancock, J. M. (2007). A kinetic core model of the glucose-stimulated insulin
secretion network of pancreatic beta cells. Mamm Genome, 18(6-7), 508-520.
Kahn, B. B., & Flier, J. S. (2000). Obesity and insulin resistance. J Clin Invest, 106(4), 473-481.
Kansal, A. R. (2004). Modeling approaches to type 2 diabetes. Diabetes Technol Ther, 6(1), 39-47.


Kitano, H., Oda, K., Kimura, T., Matsuoka, Y., Csete, M., Doyle, J., et al. (2004). Metabolic syndrome
and robustness tradeoffs. Diabetes, 53 Suppl 3, S6-S15.
Klipp, E., Herwig, R., Kowald, A., Wierling, C., & Lehrach, H. (2005). Systems biology in practice.
Wiley-VCH.
Lan, H., Chen, M., Flowers, J. B., Yandell, B. S., Stapleton, D. S., Mata, C. M., et al. (2006). Combined
expression trait correlations and expression quantitative trait locus mapping. PLoS Genet, 2(1), e6.
Liu, M., Liberzon, A., Kong, S. W., Lai, W. R., Park, P. J., Kohane, I. S., et al. (2007). Network-based
analysis of affected biological processes in type 2 diabetes models. PLoS Genet, 3(6), e96.
Mootha, V. K., Lindgren, C. M., Eriksson, K. F., Subramanian, A., Sihag, S., Lehar, J., et al. (2003).
PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated
in human diabetes. Nat Genet, 34(3), 267-273.
mult. (2002). Diabetes Genome Anatomy Project. from http://www.diabetesgenome.org/
Nandi, A., Kitamura, Y., Kahn, C. R., & Accili, D. (2004). Mouse models of insulin resistance. Physiol
Rev, 84(2), 623-647.
OMIM. (2000, 04.10.2005). Online Mendelian Inheritance in Man, OMIM (TM). from http://www.ncbi.
nlm.nih.gov/omim/
Parikh, H., & Groop, L. (2004). Candidate genes for type 2 diabetes. Rev Endocr Metab Disord, 5(2),
151-176.
Permutt, M. A., Wasson, J., & Cox, N. (2005). Genetic epidemiology of diabetes. J Clin Invest, 115(6),
1431-1439.
Pollard, J., Jr., Butte, A. J., Hoberman, S., Joshi, M., Levy, J., & Pappo, J. (2005). A computational model
to define the molecular causes of type 2 diabetes mellitus. Diabetes Technol Ther, 7(2), 323-336.
PyBioS (15.03.2008) from http://pybios.molgen.mpg.de/
Sladek, R., Rocheleau, G., Rung, J., Dina, C., Shen, L., Serre, D., et al. (2007). A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445(7130), 881-885.
Stumvoll, M., Goldstein, B. J., & Haeften, T. W. v. (2005). Type 2 diabetes: Principles of pathogenesis
and therapy. The Lancet, 365, 1333-1346.
Sun, G. (2007). Application of DNA microarrays in the study of human obesity and type 2 diabetes.
Omics, 11(1), 25-40.
Toye, A., & Gauguier, D. (2003). Genetics and functional genomics of type 2 diabetes mellitus. Genome
Biol, 4(12), 241.
Wierling, C., Herwig, R., & Lehrach, H. (2007). Resources, standards and tools for systems biology.
Brief Funct Genomic Proteomic, 6(3), 240-251.


Key Terms
Differential Expression: The change in the expression of a gene between two expression states of the same
tissue or the same cell type, as measured in a comparative (case versus control) experiment.
DNA Microarray (microarray, array, DNA chip): A microarray is a collection of microscopic DNA
spots, commonly representing sequence extracts of single genes, arrayed on a solid surface by covalent
attachment to a chemical matrix. DNA arrays are commonly used for gene expression experiments.
Impaired Glucose Tolerance (IGT): Impaired glucose tolerance is a pre-diabetic state of dysglycemia that is associated with insulin resistance and an increased risk of cardiovascular pathology. IGT
may precede type 2 diabetes mellitus by many years.
Kinetic Model: A mathematical model using differential equations or a similar formalism to allow quantitative,
continuous predictions of the behavior of a system.
Mathematical Model: A mathematical model is an abstract model that uses mathematical language
to describe the behavior of a system.
Protein-Protein Network: The union of undirected pairwise interactions between proteins. Interaction experiments may return large lists of protein-protein interactions, but do not provide information
about the character of an interaction, e.g. transient binding or complex formation.
Type 2 Diabetes Mellitus (T2DM): Type 2 Diabetes mellitus is a metabolic disorder that is primarily characterized by insulin resistance, relative insulin deficiency, and hyperglycemia.


Chapter XXIII

Systems Biology and Infectious Diseases
Alia Benkahla
Institut Pasteur de Tunis, Tunisia
Lamia Guizani-Tabbane
Institut Pasteur de Tunis, Tunisia
Ines Abdeljaoued-Tej
ESSAI-UR Algorithmes et Structures, Tunisia

Slimane Ben Miled


Institut Pasteur de Tunis, Tunisia and
ENIT-LAMSIN, Tunisia
Koussay Dellagi
Institut Pasteur de Tunis, Tunisia

Abstract
This chapter reports a variety of molecular biology, informatics and mathematical methods that model the
cell response to pathogens. The authors first outline the main steps of the immune response, then list the
high throughput biotechnologies generating a wealth of information on the infected cell and some of the
immune-related databases, and finally explain how to extract meaningful information from these sources.
The modelling aspect is divided into modelling molecular interaction and regulatory networks, through
dynamic Boolean and Bayesian models, and modelling biochemical networks and regulatory networks,
through differential/difference equations. The interdisciplinary approach explains how to construct a
model that mimics the cell's dynamics and can predict the evolution and the outcome of infection.

Introduction
Systems biology is an emerging interdisciplinary field in life science research whose purpose is to study
the dynamic interactions and the network structure in cells, tissues or whole organisms. It is commonly
accepted that the design of research in this field encompasses the following six steps (Ng, Bursteinas,
Gao, Mollison, & Zvelebil, 2006; Philippi & Kohler, 2006): (i) generation of suitable biological data
sampled from a population; (ii) enrichment of the collected data by publicly available data; (iii) curation

of this data; (iv) exhaustive integration of the curated data; (v) modelling of the mechanism(s) of interest;
and (vi) validation of the predicted model. Knowing the complexity of systems biology considered at
the tissue or organism level, we will focus in our review on systems biology at the cell level.
Systems biology research aims at understanding the mechanisms that shape the biological functions
of a cell. It is achieved through the integration of biological information about the cell of interest (Box
1), then the extraction of mathematical models that mimic the behaviour of this cell under different
conditions. These models should then guide biologists to design the appropriate experiments to be tested
on the bench, which hopefully will yield decisive results.
The biological function of the cell is determined by the dynamic interactions between different
components or molecules involved in a given pathway. Pathways are not isolated but highly intricate
and interconnected, and constitute a large, single, coherent but complex cellular network. The cellular
processes modulated in response to an external stimulus depend not only on the number but also on the
diversity and the dynamics of the intracellular interactions.
The response of host cells to pathogens has been extensively investigated as a model of the cellular response to external stimuli. Such a model takes systems biology a step further, as it integrates data that are
generated from two interacting/conflicting living organisms: the host and the pathogen. The specificity
of this system should be taken into account during the modelling process, and advanced strategies that
capture this cohabitation and its consequences are required for modelling (Forst, 2006). Numerous
host and pathogen proteins are usually involved in these interactions and trigger various biochemical reactions regulated in cascades. The outcome of the infection will depend on this complex network
of cellular interactions.
Another level of complexity is introduced by the interactions/exchanges between the host's cells
and the infectious agent (parasite, virus, bacterium), especially as high throughput technologies dedicated to the characterisation of these interactions/exchanges and of their effects on the biological functions
of the two organisms are still missing. However, the use of organism-oriented biotechnologies (e.g. the
transcriptome) allows a coherent and understandable view of the studied system. Even though these
experiments fail to unveil the crosstalk between host and pathogen, they give separate insights into the
host or pathogen responses. Developing tools that could integrate and process these data is a major task
in infectious systems biology research.
Box 1. The modelling of the mechanisms that shape the biological functions of an infected cell is achieved
through the integration of information concerning the complexity of the system.


Pathogens are recognised through the host cell's receptors. The diversity of pathogens and the large
array of cells they can infect add complexity to the system. One given pathogen may infect different
cell types which may express different sets of receptors, resulting in the activation of different signalling cascades and thus in the transcription of different sets of genes. In a given cell several pathways can be
activated by the pathogen, depending on the specific receptor(s) it uses to enter the cell and on the duration of contact between the infectious agent and the cell. These activated pathways may also differ
depending on the tissue origin and the activation state of the cell.
In this chapter, we will introduce the basics of the host immune response, then present the high
throughput technologies used for collecting biological data, review the data mining techniques that can
integrate these data, and explain how mathematical modelling helps to understand the underlying biological
mechanisms. The limits of systems biology approaches in the investigation of host-pathogen interactions
will be discussed in the conclusion.

The Immune Response
In mammals, immunity against pathogens can be divided into innate and adaptive immunity.
The innate response, mounted by polymorphonuclear cells, monocytes/macrophages, NK cells and dendritic cells (Figure 1), represents the body's first line of defence, which aims at rapidly destroying invading
micro-organisms or at least at containing their multiplication to a certain level. Several pathogens (e.g.
Leishmania, Mycobacteria, Listeria) specifically target cells of the innate immune system to infect
them. In such cases, these cells play a dual role in infection: on the one hand, they offer a shelter and

Figure 1. Hematopoietic stem cells are multipotent cells from which all the cells of the immune system
develop. These cells give rise to all the blood cell types, including the myeloid (monocytes and macrophages,
neutrophils, basophils, eosinophils, erythrocytes, dendritic cells) and lymphoid lineages. Lymphocytes
are T cells (T for thymus, where they mature), B cells (B for bone marrow, where they originate) and Natural
Killer (NK) cells.


a replication niche to the pathogen in which the latter can indefinitely persist; and on the other hand,
they trigger a primary immune response that could significantly impact the adaptive immune response
that will develop later (second line of defence). The outcome of infection will result from the net balance
between these counteracting mechanisms, and systems biology may give useful information on how the
micro-organism develops intracellularly and interacts with the cell machinery.
Cells of the innate immune system express on their surface receptors that specifically recognize microbial
products and that allow the sensing of invading pathogens and their crude identification as bacteria, protozoa, fungi or viruses. These receptors participate not only in the engulfment and internalisation of the
Figure 2. Among the receptors involved in the recognition of pathogens are the Toll-Like Receptors (TLR).
Different TLRs have been identified. These receptors recognize different Pathogen-Associated Molecular Patterns (PAMPs), conserved motifs unique to micro-organisms and absent from the host. Some of these
receptors are expressed on the cell surface (TLR1, 2, 4), others are found on endosomes (TLR3, 7, 8, 9).
TLR2 and TLR4 are the principal receptors involved in the recognition of various bacterial cell wall
components. TLR4 is crucial for effective responses to Gram-negative LPS. Delivery of LPS to TLR4
requires the accessory proteins LBP (LPS-binding protein; found in serum), CD14 and MD-2 (the latter two proteins can exist in soluble form, or bound to the membrane or TLR itself). TLRs 3, 7, and 8
appear to play important roles in responses to viruses. TLR3 responds to double-stranded viral RNA,
and TLRs 7 and 8 mediate responses to single-stranded RNAs (CHAUDHURI, Toll-like receptors and
chronic lung disease Clinical Science (2005) 109, (125133)).


microbe, but also activate different biochemical pathways (Figures 2 and 3) and transcription factors that
modulate gene transcription and lead to the secretion of different proteins, including cytokines, chemokines, and growth factors. This allows the establishment of an inflammatory response, the recruitment
of other cell types to the site of infection, and the activation and migration of Antigen-Presenting Cells
(APCs) to the lymph nodes to initiate the cellular or humoral adaptive immune responses (Figure 4). The
activation of the latter, as a second line of defence, will lead to various immune effector mechanisms
associated with pathological processes and also to the reduction of the pathogen load and possibly to cure
(specific T cell cytotoxicity, activation of macrophage killing activities). Finally, the differentiation of
T-cells into memory cells will lead to the development of a state of immunity against re-infection of
variable strength and duration. For some pathogens only a state of partial premunition against re-infection will develop (e.g. malaria), and persistent infection may be maintained lifelong despite immune
responses (e.g. herpes viruses, EB virus, Mycobacteria).
The outcome of infection depends to a large part on the initial steps of host-pathogen interaction that
take place after phagocytosis of the micro-organism by leukocytes. Several antimicrobial mechanisms
Figure 3. TLR signalling pathways. Toll-like receptors (TLRs) recruit adapter molecules within the
cytoplasm of cells in order to propagate a signal. Four adapter molecules are known to be involved
in signalling (MyD88, Tirap, Trif, Tram). The MyD88-dependent pathway, possessed by all TLR family
members except TLR3, is a common pathway to induce inflammatory cytokines. TLRs can also use a
MyD88-independent pathway. Stimulation of TLRs induces not only pro-inflammatory cytokine genes
but also type I interferon genes.

www.biken.osaka-u.ac.jp/act/act_akira_e.php


Figure 4. A variety of receptors (Rc) (Fc Rc, mannose-fucose Rc, complement receptors (CR), scavenger Rc, Toll-Like Receptors (TLR), etc.), through which infectious agents bind to the cells, are present
on the cell membranes of Antigen-Presenting Cells (APC). Some of these receptors are also involved in the
internalisation of pathogens. Ingested invaders are next digested, and pathogen proteins are finally presented in the context of the Major Histocompatibility Complex (MHC). In the lymph node, this presentation,
together with the synthesis of IL-12 by the APC, allows the stimulation of T helper cells. These cells in turn
produce IFN-γ, which is active on macrophages.


are then activated, including the chemical attack of pathogens upon the fusion of phagosomes with lysosomes
and the generation, via the macrophage oxidative burst, of Reactive Oxygen Intermediates (ROI) such
as hydrogen peroxide (H2O2). The generation of Reactive Nitrogen Intermediates (RNI), another effector
function of macrophages, is induced only upon the activation of phagocytes. Two cytokines orchestrate
this activation: TNF-α, produced by phagocytes, and IFN-γ, synthesized by NK cells and activated
T-cells. Activation of T-cells requires, among others, the recognition by T-cell receptors of specific antigenic peptides presented by monocytes and dendritic cells in association with products of the Major
Histocompatibility Complex (MHC). IFN-γ synthesis by NK cells and T-lymphocytes requires IL-12, a
cytokine also produced by phagocytes (Figure 4). In many cases, especially when dealing with virulent
pathogens, these defence mechanisms may be inadequate and fail to control the infection. In the latter
situation, the infected cell orchestrates a series of biochemical events that lead to its programmed cell
death, or apoptosis. This cell suicide has been recognised as an important component of the host defence
against pathogens, tightly interconnected with innate and adaptive immunity.
As pathogens co-evolve with their hosts, they have frequently developed strategies to circumvent the
defence mechanisms described above. These escape strategies tend to generate an inefficient response
and to turn the normal cellular pathways to the pathogen's advantage. Thus, pathogens may inhibit receptor-mediated immunity by either down-regulating cell surface receptor expression or blocking downstream transducing signals. The intracellular survival of several pathogens may also depend on their
ability to alter phagolysosome biogenesis. Blocking antigen processing and up- or down-regulating
various cytokines and/or chemokines are examples of strategies used by several pathogens in order to
avoid or delay the activation of immune cell responses (Table 1).


Table 1. Survival strategies of different pathogens within macrophages

Trafficking to phagolysosome:
  - Inhibition of acidification: T. gondii (Mordue, 1999)
  - Delayed formation: Leishmania (Desjardins, 1997)
  - Arrest of phagosome maturation: M. tuberculosis (Fratti, 2001)
  - Escape into the cytoplasm: T. cruzi (Burleigh, 1995)

Anti-microbial mediators:
  - Inhibition of NO production: T. gondii, Leishmania (Luder, 2003; Proudfoot, 1995)
  - Prevention of the trafficking of NADPH oxidase and iNOS: Salmonella (Vazquez-Torres, 2000)

Antigen presentation:
  - Down-regulation of MHC molecule expression: T. gondii (Luder, 1998)

Intracellular signalling:
  - Disruption of the IFN-γ pathway (decreased transcription of the IFN-γ receptor, impaired transcription of IFN-γ responsive genes, inhibition of the IFN-γ signalling pathway (JAK/STAT)): Leishmania, M. tuberculosis (Meier, 2003; Ting, 1999; Nandan, 1995)
  - Reduction of PKC activity, enhanced intracellular Ca++, down-regulation of PTK activity, activation of phosphatases: Leishmania (McNeely, 1987; Eilam, 1985)
  - MAPK pathway (inhibition of MAPK activation, proteolytic degradation of MAPKs): Yersinia pseudotuberculosis, Bacillus anthracis (Orth, 1999)

Transcription factors:
  - Inhibition of STAT1 translocation: T. gondii
  - NF-κB pathway (inhibition of NF-κB translocation, inhibition of IKK activation): Yersinia pseudotuberculosis, T. gondii (Shapira, 2002)

Cytokine production:
  - Inhibition of TNF-α synthesis, delayed IL-12 production, inhibition of IL-12 synthesis: T. gondii, Leishmania (Butcher, 2002; Carrera et al., 1996)

Furthermore, apoptosis can
be either up- or down-regulated by the pathogen (or pathogen components) to facilitate its intracellular
survival or to contribute to its dissemination within the host.
A better understanding of the dynamic interactions between host cells and pathogens and the identification of the key elements that are at the crossroads of the biology of these two living organisms
should lead to the development of improved treatments, diagnostics and vaccines.

High Throughput Biotechnologies


Today, a large set of high-throughput biotechnologies that can characterize and quantify the co-evolution of host cells and pathogens at selected time points is routinely available. They provide landscapes
of the cell's omics (Table 2): the transcriptome (by DNA microarrays), the proteome (by protein microarrays, phage display, or immunoaffinity chromatography followed by mass spectrometry), the metabolome (by mass spectrometry), the glycome (the carbohydrates in a cell), the localizome (the sub-cellular localization of proteins), the interactome (protein-protein interactions) and the fluxome (the flux of metabolites through enzymatic reactions within a cell over time).

Table 2. This table lists omics biotechniques for systems biology

Transcriptomics
  cDNA or oligonucleotide microarrays: cDNA or oligonucleotides are spotted onto the array. Each gene is represented by one or more probes; the ensemble of probes mapping to different regions of the gene is usually called a probe set.
  Serial Analysis of Gene Expression (SAGE): SAGE is a technique used by molecular biologists to produce a snapshot of the messenger RNA population in a sample of interest.
  qPCR: Quantitative polymerase chain reaction (qPCR) is a modification of the polymerase chain reaction used to rapidly measure the quantity of RNA present in a sample.

Proteomics
  Two-dimensional gel electrophoresis: Gel electrophoresis research leverages software-based image analysis tools to analyze biomarkers by quantifying individual protein spots, as well as showing the separation between one or more protein spots on a scanned image of a 2-DE product. Differential staining of gels with fluorescent dyes (difference gel electrophoresis) can also be used to highlight differences in the spot pattern.
  Mass spectrometry: An analytical technique that can accurately measure the molecular weights of individual biomolecules, such as proteins and nucleic acids, and determine their structures. Detection of compounds can be accomplished at very low concentrations and in chemically complex mixtures. Quantification can be achieved by labeling one sample with stable isotopes, which leads to a mass shift in the mass spectrum; the technique has both qualitative and quantitative uses.
  Yeast two-hybrid analysis: This system utilizes a genetically engineered strain of yeast. This mutant yeast strain can be made to incorporate foreign DNA in the form of plasmids. Separate bait and prey plasmids are simultaneously introduced into the mutant yeast strain. A change in the cell phenotype indicates a successful interaction between proteins.
  Protein microarrays: A protein microarray is a piece of glass on which different protein molecules have been affixed at separate locations in an ordered manner, thus forming a microscopic array. They are used to identify protein-protein interactions, the substrates of protein kinases, or the targets of biologically active small molecules. The most common protein microarray is the antibody microarray, where antibodies are spotted onto the protein chip and used as capture molecules to detect proteins from cell lysate solutions.
  Immunoaffinity chromatography followed by mass spectrometry: Usually the starting point is an undefined heterogeneous group of molecules in solution, such as a cell lysate, growth medium or blood serum. The molecule of interest has a well known and defined property which can be exploited during the affinity purification process. The process itself can be thought of as an entrapment, with the target molecule becoming trapped on a solid or stationary phase or medium. The other molecules in solution will not become trapped as they do not possess this property. The solid medium can then be removed from the mixture, washed, and the target molecule released from the entrapment in a process known as elution.
  ChIP-on-chip: ChIP-on-chip (also known as ChIP-chip) is a technique that combines chromatin immunoprecipitation (ChIP) with microarray technology (chip). Like regular ChIP, ChIP-on-chip is used to investigate interactions between proteins and DNA in vivo.

Systems biology research aims at correctly modelling cell behaviour over time. To this end, experiments should be conducted at critical points of the infection process and at appropriate time intervals.
As the data produced experimentally are measured as sparse time series, they cannot fully reflect the flow of
events in the infected cells. Hints about the survival strategies developed by different pathogens are
summarized in Table 1; others, about the kinetics associated with the response to mycobacteria, are summarised in the review of Hestvik and colleagues (Hestvik, Hmama, & Av-Gay, 2005).
High-throughput biotechnologies, such as transcriptomics, proteomics, metabolomics, yeast two-hybrid screens
or mass spectrometry, are used to independently profile host and pathogen gene products.
Gene expression in infected human or mouse cells (macrophages, dendritic
cells, neutrophils, endothelial cells, leukocytes, or epithelial cells) and in pathogens (Leishmania, Mycobacterium,
Plasmodium, Toxoplasma, Trypanosoma, Staphylococcus, Streptococcus, Vibrio cholerae or Yersinia
pestis) has been investigated globally (Jansen & Yu, 2006; Jenner & Young, 2005; Maynard, Myhre, & Roy, 2007;
Waddell, Butcher, & Stoker, 2007). These studies and others have allowed the identification of a common set of about 500 host genes whose expression is independent of the pathogen and is shared among
different cell types (Jenner & Young, 2005). However, pathogen-specific transcriptional responses are
also activated, sometimes via the same receptors. The identification of this signature is crucial.
Moreover, there is a need to improve the quality of the data, because of the poor reproducibility of techniques such as transcript profiling or MALDI protein profiling, and because of the difficulty of identifying cis-regulatory
elements. Such limitations are amplified by the use of non-standardised protocols and by the genetic
variability of hosts and pathogens, which hinders the comparison between different experiments run
under different conditions. Despite these limitations, high throughput biotechnologies have proved to
be very useful. Integrating data from these studies and combining them with other approaches gives a
systemic view of the cell response to infection.

Data Integration
Systems biology aims at simultaneously investigating and identifying all the cellular components that
interact. A sufficient level of detail and sufficient types of information must be integrated in a common database.
Every piece of information about the gene products (transcripts, proteins, or metabolites) has to be
tracked and processed to allow a functional presentation of the biological question at hand. This requires
access to tools that can rapidly process large scale data and transform them into more functional information such as gene catalogues, gene functional annotations, protein interactions, pathways, etc.
The data to be collected must be treated cautiously, as they are heterogeneous in quality and as tools
to interpret them correctly are still lacking. Indeed, most public databases do not distinguish between
direct and indirect, robust and less robust interactions, raising more uncertainty about the impact and the
interpretation of the output of some experiments. To improve their quality, data should be carefully evaluated and, where possible, manually curated. Thus, extensive efforts should be spent on the development
of algorithms that extract, refine and organise these data into flexible repositories. This would guarantee
that the generated models reflect the reality of immune cell dynamics, the fast evolution of the system
and the conflicting interactions that take place between the host cell and the invading pathogen.
Three data mining steps can be distinguished concerning the integration of these heterogeneous
data (Klipp, Herwig, Kowald, Wierling, & Lehrach, 2005). The first step aims to parse the data, to query
for information and to integrate it into object-oriented repositories (where an object is, e.g., a gene product). The
second step (sections 3.2-5 & 3.7) consists in the identification of associations across different datasets in order to gain a coherent view of the same object in the light of diverse data sources and to curate
the data to be integrated. The third step (sections 3.1-2 & 3.6) focuses on the mapping of the information gained about the interactions between objects into graphs that will be used as basic models to depict
the targeted cellular system. The resulting database has to reflect the topology, the robustness and the
quantity of the connections between objects.
The first step requires the use of molecular-biology-oriented mining techniques. Due to the heterogeneity of the technologies used and of the formats of the data to be integrated, several (syntactical
and semantical) tools have been developed in order to draw, in a coherent way, the different facets of a same
object. Thus, sequence similarity search tools (Altschul, Gish, Miller, Myers, & Lipman, 1990; Kent,
2002; Pearson & Lipman, 1988), classical literature mining tools (Becker et al., 2003; Couto et al.,
2006; Gladki, Siedlecki, Kaczanowski, & Zielenkiewicz, 2008; Hammamieh et al., 2007; Maier et al.,
2005; Muller, Kenny, & Sternberg, 2004; Rubinstein & Simon, 2005; Settles, 2005; Yuan et al., 2006),
controlled vocabulary systems (Bairoch et al., 2005; Cimino, 1998; Hide, Smedley, McCarthy, & Kelso,
2003; Kaplan, Vaaknin, & Linial, 2003; Kasprzyk et al., 2004; Letovsky, Cottingham, Porter, & Li,
1998; Muller et al., 2004), or tables of identifiers that allow the cross-mapping of the same object, are
routinely used to establish relationships between objects.
The second step requires sophisticated statistical techniques to establish correlative associations between heterogeneous data. Its ultimate goal is to curate the data by considering complementary data sources
(e.g. transcriptome/proteome, transcriptome/transcription factor binding sites, interactome/proteome).
To reach this goal, mathematical functions such as distances and correlations (Klipp et al., 2005) are
commonly used. Conflicts between data measures must be handled.
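A minimal sketch of such a correlative check, assuming hypothetical matched transcript and protein abundances for the same genes; a rank correlation flags genes whose two measurements disagree and may need manual curation.

```python
from scipy.stats import spearmanr

# invented matched measurements for four hypothetical genes
transcript = {"geneA": 5.1, "geneB": 2.3, "geneC": 8.7, "geneD": 1.2}
protein    = {"geneA": 4.8, "geneB": 2.9, "geneC": 1.1, "geneD": 1.0}

genes = sorted(transcript)
rho, p = spearmanr([transcript[g] for g in genes], [protein[g] for g in genes])
print(f"rank correlation rho = {rho:.2f} (p = {p:.2f})")
# geneC stands out: high transcript level but low protein level, a candidate for manual curation
```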
The third step focuses on mapping the information gained into graphs that will be used
in the modelling. This issue can be addressed by using Boolean and Bayesian techniques (Friedman,
Linial, Nachman, & Pe'er, 2000; Pe'er, 2005). Its aim is to predict a snapshot of molecular interactions
and structures by combining data from heterogeneous sources (e.g. predicting interactions from gene
profiling data).
In the following subsections we will present examples of data (genome, transcriptome and/or
proteome databases, etc.) that could be integrated and detail the protocol that can translate them into
meaningful information.

Genome
Understanding the host, pathogen and vector genome structures is crucial in systems biology.
The genomes of many mammals, including human, as well as those of a growing number of pathogenic
microorganisms and vectors, have been sequenced and their genes identified (see Table 3). The
purpose of these genome projects is to unravel the information encoded in the host, pathogen and
(when appropriate) vector genomes and to decipher the impact of their interactions (vector/pathogen,
host/pathogen, and possibly host/vector) on the outcome of the infection.
The information about host genome structure is mainly centralized in ENSEMBL (Hubbard et
al., 2007), a database that provides an accurate and automatic analysis of many chordate genomes.
Functional genomics data in ENSEMBL can be retrieved using BioMart (Kasprzyk et al., 2004), a
tool that can extract exhaustive lists of biological attributes concerning ENSEMBL objects. It contains
a flexible query builder interface that allows the user to select an object type (e.g. genes), to specify
the genomic regions, and to refine the result by using various filters. It can generate a number of different
types of output, including FASTA sequences or data in flat files.
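A minimal sketch of querying BioMart programmatically through its web service rather than the interactive query builder; it requires network access and the requests package, and the service URL, dataset name (hsapiens_gene_ensembl) and attribute/filter names are assumptions about the current Ensembl BioMart interface, not details taken from the chapter.

```python
import requests

# assumed BioMart XML query: gene identifiers and names on human chromosome 11
query = """<?xml version="1.0" encoding="UTF-8"?>
<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1">
  <Dataset name="hsapiens_gene_ensembl">
    <Filter name="chromosome_name" value="11"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="external_gene_name"/>
  </Dataset>
</Query>"""

response = requests.get("https://www.ensembl.org/biomart/martservice",
                        params={"query": query}, timeout=60)
print(response.text[:500])   # first lines of the tab-separated gene list
```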
The information about the pathogens' and vectors' genomes is not centralized and has to be retrieved
individually from the corresponding databases (Table 3).

Table 3. Infection related genome reference databases

Mammal hosts:
  Homo sapiens: ENSEMBL (www.ensembl.org)
  Mus musculus: ENSEMBL (www.ensembl.org)
  Rattus norvegicus: ENSEMBL (www.ensembl.org)
  Canis familiaris: ENSEMBL (www.ensembl.org)

Pathogens:
  HIV-1 and HIV-2: HIV database (www.hiv.lanl.gov)
  Trypanosomatidae: GeneDB (www.genedb.org)
  Plasmodium falciparum: PlasmoDB (www.plasmodb.org)
  Mycobacterium tuberculosis: Mycobacterium tuberculosis Database (www.broad.mit.edu/annotation/genome/mycobacterium_tuberculosis_spp/MultiHome.html)

Vectors:
  Aedes aegypti: VectorBase (www.vectorbase.org)

For example, GeneDB (Hertz-Fowler et al., 2004) and VectorBase (Lawson et al., 2007) provide annotations of several pathogen and vector genomes. A large proportion
of the identified genes have a hypothetical function; some might point to pathogen- or vector-specific
processes. Comparative genomics approaches complement these data and often permit the identification of pathogen/vector-specific enzymatic pathways or of genes that play a key role during the
infection (Zhang & Zhang, 2006).
The biological objects that should be characterized are gene products. Diverse attributes concerning these objects have to be collected in order to facilitate their connection with the information present
in other databases: the identifiers of the genes and their products in other databases (EMBL, SwissProt,
HGNC, Affymetrix) and the sequences of the products. These attributes are collected through the data
mining tools that are usually provided by the genome databases.
Several host, pathogen and vector gene families are known to play key roles in infection (Jenner &
Young, 2005; Trowsdale & Parham, 2004): chemokines, cytokines, cytoskeleton, inflammatory response,
protein folding and targeting, response to stress, cell communication and signalling, transcription related, apoptosis, extra-cellular and membrane associated proteins. In order to detect these key actors,
it is recommended to identify all members of these families and to describe their behaviour during the
infection. Information that could lead to the identification of these family members is often encoded in
a unified controlled vocabulary called the Gene Ontology (Diehl, Lee, Scheuermann, & Blake, 2007; Harris
et al., 2004), generally provided by complete genome databases. At the third analysis step, the integration of
information concerning the functions of the identified gene products and their implication in immune-related biological processes or molecular functions can generate graphs (Cho, Hwang, Ramanathan, &
Zhang, 2007; Daraselia, Yuryev, Egorov, Mazo, & Ispolatov, 2007).
However, genome data are insufficient to correctly address and integrate infection-related information.
Reasons include the redundancy of biochemical pathways in a given organism, or the fact that the same
key functions can also be carried out by unrelated proteins under different physiological conditions. It is
therefore recommended to integrate gene product profiling (transcriptome, proteome and metabolome)
as a complement to genomic approaches.

Gene Product Profiling


Host/pathogen/vector gene product profiling data are available in general public gene expression databases like Gene Expression Omnibus (Barrett et al., 2007) and ArrayExpress (Parkinson et al., 2007),
or in more specific gene expression databases such as the Innate Immune Database (Korb et al., 2008), the GPX Macrophage Expression Atlas (Grimes et al., 2005) and PlasmoDB (Bahl et al., 2002).
Important information to be collected and connected to the gene products concerns the physiological
status of the tested cell (resting or stimulated) and the gene products' expression levels. The tools
usually used to query for this information are the identifiers of the genes and their products (especially the NCBI gene identifier, Affymetrix identifiers for the transcriptome and SwissProt identifiers for the
proteome), and sequence similarity search tools.
Clustering gene product profiling data and confronting them with other data types, such as literature
data, help to curate the data in hand and to shape a coherent picture of the functional relationships among
large and heterogeneous sets of genes (Chaussabel & Sher, 2002; Chi, Ibrahim, Bissahoyo, & Threadgill, 2007; Koehler et al., 2005; Menten et al., 2005; Rubinstein & Simon, 2005). Significant expression
changes are also commonly used to extract comprehensive and functional information concerning the
genes expressed in the cell of interest (Chaussabel et al., 2003; El Fadili et al., 2008; Grinde, Gayorfar,
& Hoddevik, 2007; Hofman et al., 2007; Zaffuto et al., 2007). These changes can be used to quantify
the edges in graphs or to direct edges in the graphs and network generated in sections 3.3, 3.4 and 3.5
(see below) (Hart et al., 2005; Maciag et al., 2006; Takigawa & Mamitsuka, 2008; Wilczynski et al.,
2006). Diverse other techniques have been developed in order to assess the data in hand (Chopra et al.,
2008; Gana Dresen, Boes, Huesing, Neuhaeuser, & Joeckel, 2008) and to map the information gained
into graphs and pathways (de Jong, 2002; Friedman et al., 2000).
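A minimal sketch of such a clustering step, assuming a hypothetical matrix of log-ratio expression profiles (genes x time points after infection); average-linkage hierarchical clustering on a correlation distance groups genes with similar kinetics.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

genes = ["il12b", "tnf", "actb", "gapdh", "cxcl10"]   # hypothetical gene set
profiles = np.array([
    [0.1, 2.5, 3.0, 1.2],    # induced early (invented log ratios over four time points)
    [0.2, 2.8, 2.4, 0.9],
    [0.0, 0.1, -0.1, 0.0],   # housekeeping, flat
    [0.1, 0.0, 0.1, -0.1],
    [0.0, 1.0, 3.5, 3.8],    # induced late
])

tree = linkage(pdist(profiles, metric="correlation"), method="average")
labels = fcluster(tree, t=3, criterion="maxclust")
for gene, label in zip(genes, labels):
    print(gene, "-> cluster", label)
```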

Interaction Graph

The interaction graph is the collection of all interactions in a given cell at a given time. These interactions involve large molecules such as proteins and nucleic acids, as well as small molecules. In terms of
proteomics, it refers to protein-protein interactions.
APID (Prieto & De Las Rivas, 2006) and DIP (Salwinski et al., 2004) are two repositories of protein-protein interactions, whereas BIND (Bader, Betel, & Hogue, 2003) is a repository of molecular
interactions (protein-protein, protein-DNA, protein-RNA, protein-small molecule, RNA-RNA, etc.).
Interactions between gene products are the information that has to be collected during the first step. It
is recommended to collect complementary information about: (i) the technologies used to identify those
interactions, (ii) the number of times those interactions were verified independently, (iii) the cellular
localisation of the gene product, etc. The tools that are usually used to query for this information are
the identifiers of the genes and their products, and sequence similarity search tools.
Integrating these interactions leads to the generation of non-oriented graphs in which the vertices are
molecules and the edges are the interactions. Semantic rules and statistical techniques are used, in the
second step, to weight the edges (Komurov & White, 2007; Scott & Barton, 2007). Such non-oriented
graphs can be oriented by integrating what is known about the direction of the interactions, or rendered
time-dependent by integrating the gene products' expression data at different time points. The graph
obtained is either Boolean, polynomial or Bayesian.
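A minimal sketch of the Boolean case: a three-node oriented graph (receptor, kinase, transcription factor with negative feedback) turned into synchronous update rules; the network and its rules are invented for illustration.

```python
def update(state):
    # synchronous Boolean update rules for an invented signalling motif
    receptor, kinase, tf = state["receptor"], state["kinase"], state["tf"]
    return {
        "receptor": state["stimulus"] and not tf,   # negative feedback by the transcription factor
        "kinase": receptor,                          # kinase follows the receptor
        "tf": kinase,                                # transcription factor follows the kinase
        "stimulus": state["stimulus"],               # external input, held constant
    }

state = {"stimulus": True, "receptor": False, "kinase": False, "tf": False}
for step in range(8):
    print(step, {k: int(v) for k, v in state.items()})
    state = update(state)   # the trajectory oscillates because of the feedback loop
```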

Regulatory Graph
A regulatory graph is a collection of gene products which interact with DNA segments, thereby governing the rates at which genes are transcribed into mRNA. This graph is sometimes considered as part
of the interaction graph.
Distinct rounds are carried out in the first step: determining the set of genes that are expressed in a given
cell at a given time, delimiting their promoter regions, identifying the cis-regulatory elements governing
their expression, and inferring the graph. The whole process should lead to the generation of an oriented
graph in which the vertices are composed of a hybrid population of gene products: those that are regulated
and also regulate (transcription factors) and those that are only regulated. The edges are the impact (activation or silencing; induction or repression) of the first population of gene products on the expression of
the second one.
Several methods were developed to carry out the above rounds. Achieving the first and second
round is quite straightforward: the list of transcribed genes can be obtained from the transcripts profiling (see 3.2) and their promoter region is the region surrounding their Transcription Start Sites (TSS).
TSS can be extracted through the mapping of known alternative transcripts data or the integration of
the transcripts maps generated in the framework of pilot projects (Birney et al., 2007; Carninci et al.,
2005). The latter round might be more complex if we try to target genes alternative promoters (Kawaji
et al., 2006). The third round is difficult to accomplish as the consensus sequences of cis-regulatory
elements may be present everywhere in the genome and could generate a large number of false positives. Several smart tricks and biotechnologies (Nardone, Lee, Ansel, & Rao, 2004) were developed
in order to reduce this rate. Besides focusing on the cis-regulatory elements that are localised in the
promoter region or in conserved non-coding sequences (Pennacchio et al., 2006), we also cite ChIP-on-chip or ChIP technologies (Blais & Dynlacht, 2005; Ren & Dynlacht, 2004), and data generated in
the framework of pilot projects (Birney et al., 2007; Carninci et al., 2005). Coupling these approaches
and developing sophisticated algorithms (Liu, Jessen, Sivaganesan, Aronow, & Medvedovic, 2007) is
powerful and should lead to the prediction of a reliable set of regulatory elements. The fourth round
is realized by connecting the gene products (TFs to genes) using the connections obtained in the third
round and by integrating their expression levels (transcripts and proteins) (Hayete, Gardner, & Collins,
2007; Nilsson et al., 2006). Expression data at different time points can be used, as for the interaction
graph, to validate the edges of the obtained graph and to add dynamics.
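A minimal sketch of the third round, scanning a promoter fragment for an NF-κB-like consensus site with a regular expression; the promoter sequence and the consensus (GGGRNNYYCC, written with IUPAC codes) are illustrative assumptions.

```python
import re

# IUPAC codes used in the consensus: R = A/G, Y = C/T, N = any base
consensus = "GGG[AG]..[CT][CT]CC"
promoter = "TTAGGGACTTTCCGCTAAGGGAATTCCCCTA"   # hypothetical promoter fragment

for match in re.finditer(consensus, promoter):
    print(f"putative site {match.group()} at position {match.start()}")
```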

Biochemical Network


Pathways are the organised successive steps used by the cell to maintain its homeostasis. A biochemical
network interrelates a wide range of biochemical pathways occurring within a cell. Each pathway is
composed of a set of biochemical reactions occurring in a cascade. These reactions are accelerated and,
more precisely, catalyzed by enzymes, helped by co-factors such as dietary minerals and vitamins. The
kinetics of the biochemical reactions is important for understanding the outcome of a reaction
and for placing it in its cellular context.
The biochemical pathway is an oriented graph in which the vertices are molecules and the edges are the
biochemical reactions. The collected information is used to generate an oriented network in which the vertices are expressed gene products and the edges are the biochemical reactions and their timing.
The biochemical network can be viewed as a superposition of biochemical pathways.
For mammalian hosts, several databases record reliable pathways (Vastrik et al., 2007). For pathogens, a reliable functional annotation is in some cases available (Dieterich, Karst, Fischer, Wehland, & Jansch, 2006). KEGG (Kanehisa, 2002; Kanehisa et al., 2006) is an example of a database that records both host and pathogen pathways and that can be easily used because the pathways were constructed in a uniform way. Kanehisa et al. (2006) have made a tremendous effort to connect the pathways they have reconstructed to other databases.
The information that has to be queried for and connected to the gene products comprises the biochemical reactions and their kinetics. The tools that are usually used to query for this information, during the first step, are the identifiers of the genes and their products and the sequence similarity search tools.
However, we must stress here that the pathways recorded in these databases represent several possible biochemical reactions, whereas the ones activated after a given stimulus are specific and depend on the pathogen, the cell type and the activation state of a given cell. The interaction and regulatory graphs and the profiling data are commonly used to assess the network and to reduce the number of possible reactions (Bebek & Yang, 2007; Sanguinetti, Noirel, & Wright, 2008; Takigawa & Mamitsuka, 2008).
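A biochemical network of this kind can be held in a very simple data structure. The Python sketch below stores a two-reaction fragment of a pathway as an oriented graph whose vertexes are molecules and whose edges carry the catalysing enzyme and a rate constant; the rate constants are invented, and only the graph representation, not any particular database, is illustrated.

# Oriented graph of a pathway fragment: vertexes are molecules, edges are
# enzyme-catalysed reactions annotated with a (hypothetical) rate constant.
reactions = [
    # (substrate, product, enzyme, rate constant k), values invented
    ("glucose", "glucose-6-phosphate", "hexokinase", 0.9),
    ("glucose-6-phosphate", "fructose-6-phosphate", "phosphoglucose isomerase", 0.4),
]

pathway = {}                                    # adjacency-list representation
for substrate, product, enzyme, k in reactions:
    pathway.setdefault(substrate, []).append(
        {"product": product, "enzyme": enzyme, "k": k})

print(pathway["glucose"][0]["enzyme"])          # hexokinase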

Literature
The integration of information absent from public databases but present in the literature is recommended in systems biology and essential in infectious systems biology. The reasons are the absence of high-throughput biotechnologies dedicated to the study of host-pathogen interactions and the temporal and spatial location (nucleosome, phagolysosome, plasma) of the interactions. The host-pathogen interaction and the temporal issues can only be addressed through literature mining. The spatial issue can also be addressed through the GO cellular component identifiers, through the integration of the information compiled in dedicated databases (Wiwatwattana, Landau, Cope, Harp, & Kumar, 2007) or through literature mining.
Querying for information from the literature is done through classical techniques of information extraction and text mining (first step). In order to be integrated into the system, all collected information must have a direct connection to gene products (identifiers collected in 3.1). These data are used to consolidate and complement the information that was integrated about these products and the graphs and network predicted in 3.3-5 (Koehler et al., 2005; Maier et al., 2005).
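The toy sketch below illustrates the simplest form of this text-mining step: two gene-product identifiers are linked whenever they co-occur in the same abstract. The abstracts and identifiers are invented, and the dedicated tools cited in the reference list implement far more elaborate extraction procedures.

from itertools import combinations

# Link two identifiers if they co-occur in one abstract (invented examples).
abstracts = [
    "LmjF.36.1234 interacts with host TLR2 inside the phagolysosome.",
    "TLR2 signalling activates NFKB1 after infection.",
]
identifiers = ["LmjF.36.1234", "TLR2", "NFKB1"]

cooccurring_pairs = set()
for text in abstracts:
    present = [i for i in identifiers if i in text]
    cooccurring_pairs.update(combinations(sorted(present), 2))

print(cooccurring_pairs)   # {('LmjF.36.1234', 'TLR2'), ('NFKB1', 'TLR2')}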

Orthology Data
Orthology (Fitch, 2000) is the relationship between any two homologous characters from two different species (e.g., genes, regulatory regions) whose common ancestor lies in the cenancestor of the taxa from which the two sequences were obtained.
Computed orthology information is used to build bridges between cellular networks from different organisms and can be used to curate data and to validate the graphs and network predicted in 3.3-5.
The input data are proteins from the host/pathogen/vectors (downloaded in 3.1), whereas the tools that are commonly used are the sequence similarity search tools (Li, Stoeckert, & Roos, 2003; O'Brien, Remm, & Sonnhammer, 2005) (first step).
Whether or not orthology data are integrated depends on the evolutionary level at which the biological question has to be addressed.
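One simple heuristic for computing such orthology relations, sketched below in Python, is the reciprocal-best-hit criterion: a host protein and a pathogen protein are called orthologs if each is the other's best similarity-search hit. The hit tables are invented, and the cited tools (OrthoMCL, Inparanoid) implement more refined procedures.

# Reciprocal best hits between a host and a pathogen proteome (invented data).
best_hit_host_to_pathogen = {"hostP1": "pathQ7", "hostP2": "pathQ3"}
best_hit_pathogen_to_host = {"pathQ7": "hostP1", "pathQ3": "hostP9"}

orthologs = [(h, p) for h, p in best_hit_host_to_pathogen.items()
             if best_hit_pathogen_to_host.get(p) == h]
print(orthologs)   # [('hostP1', 'pathQ7')]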

Mathematical Modelling
Mathematics is used at the different stages, from the identification of correlative associations across different datasets, to the extraction of the key biological processes that are triggered by the infection, to the final drawing of an in silico model that accurately mimics the different cellular events.
Mathematical modelling leads to the construction and understanding, from the biological data integrated in section 3, of the cell's dynamics. Its objective is to overcome a number of difficulties met by molecular biologists working on infection in understanding the kinetics, evolution and outcome of the infection, and to help them orient their research.
Interaction between the host cell and the pathogen induces a series of physiological and molecular changes in both organisms. The challenge is to measure these changes and to extract from these measures hints about the cell's dynamical behaviour. Deterministic techniques that stay faithful to real data (Boolean models, Polynomial models, Ordinary or Partial Differential Equations (ODE/PDE), etc.) are relevant when we have enough data, whereas stochastic techniques based on probabilities are recommended when dealing with missing or limited data (Markov assumptions, Gillespie algorithms, Bayesian models, Chemical Master Equation (CME), etc.). Both approaches take as input graphs that integrate a quantifiable measure of the interactions and a set of state variables (node-degree distribution, mean path length, clustering coefficients, architectural features, existence of hierarchical molecules, etc.). These complementary approaches have to be applied carefully because of the high complexity of the studied system. Confronting the two approaches is recommended.
The specificity of infected cells is that the vertexes of the graphs can represent molecules from different organisms (host and pathogen). The absence of data that reflect the co-evolution of these organisms complicates the modelling. This lack of data has to be taken into account in the system and compensated by the development of additional sophisticated tools that can circumvent it.
This section gives a description of the principal approaches for building a model from a certain number of experimental observations, using reverse engineering tools based on equations and relation graphs that can be refined step by step.

Bottom-Up, Top-Down and Hybrid Modelling Approaches


Three distinct strategies for modelling the system's behaviour are recognised: the bottom-up and the top-down approaches, which are integrated into a third, hybrid strategy. The fundamental philosophical difference between the bottom-up and top-down approaches is the regulation of the system: is it driven from the bottom (molecular reactions), from the top (cellular function), or from both?
The modelling approach that departs from the results of high-throughput biotechnologies is the bottom-up approach (Bruggeman & Westerhoff, 2006). At the bottom level one has the molecular properties (integrated in section 3). At the middle level one has the generation of the graphs and network presented in 3.3, 3.4 and 3.5 and the development of: (i) dynamical Bayesian models that enable the
description of the dependences between molecules; (ii) dynamical Boolean models that define the rules of these dependences; (iii) CME or ODE/PDE models that deal with quantitative variables. At the top level, there is the physiological reaction of the cell.
The top-down approach (Forst, 2006; Ideker & Lauffenburger, 2003b) represents the reverse direction of the bottom-up approach. In this approach, the biochemical, regulatory and signalling networks, with molecules common to the host and pathogen, are reconstructed starting from the system's reaction. Such network construction is called reverse engineering: mathematical techniques are used to find the model that fits the data (molecular reactions) and predicts the future behaviour of the molecules from the present state.
Hybrid approaches combine bottom-up and top-down approaches. Gene product profiling data are commonly used to complement interaction and biochemical reaction data. Assumptions such as "molecules with similar expression are likely to interact" can be used to reinforce the model (Ideker & Lauffenburger, 2003a; Sjöberg, 2002).

Modelling of the Cellular Networks


Modelling the interaction, regulatory and biochemical networks is one of the scopes of systems biology. The graphs produced in 3.3 and 3.4 are multiple snapshots of the gene product interactions at different time points. The aim of dynamical Boolean and Bayesian models is to transform these static graphs into a dynamical graph. In the network produced in 3.5, each signalling pathway can be decomposed into different sets of elementary phosphorylation or dephosphorylation reactions. Each of these reactions is time dependent. The analysis of biochemical networks and the modelling of intracellular dynamics can be done in a deterministic or a stochastic way. The deterministic approach requires the solution of Ordinary or Partial Differential/Difference Equations (ODE/PDE, reaction rate equations) with concentrations as continuous state variables. The stochastic approach involves Differential/Difference Equations with probabilities as variables (Chemical Master Equations, CMEs). Compared to Boolean and Bayesian models, the ODE/PDE and CME approaches allow a continuous and quantitative analysis. All these approaches aim to extract the kinetics of the reactions.
The modelling technique appropriate for a given biological system does not only depend on the investigated biological phenomena but is also influenced by the assumptions one makes to simplify the analysis.

Dynamical Boolean Models


The qualitative connections between gene products (3.3 and 3.4) can be expressed through a dynamical Boolean model (Saez-Rodriguez et al., 2007). This model is attractive because it deals with the algebraic and topological aspects of the graph and allows the establishment of a global schema that brings the interactions from a static to a dynamical stage. A Boolean graph is a directed graph that consists of molecules (vertexes) sharing a causal relationship (edges). A Boolean function states the behaviour of these molecules with respect to their interactors. This function formulates the logical rules that best fit the data (presence or absence of interaction) and assigns a dynamic to the graph in the form of asymptotic limit sets.
Dynamical Boolean models are limited because each molecule is considered to be either present or absent. Since intermediate levels of expression are neglected, this approach allows only a coarse
comprehension of the dynamics of the cell network. Several other approaches overcome this limitation; among others, polynomial models (in which a gene can take more than two states) can be cited (Laubenbacher, 2005).
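The Python sketch below makes this concrete for a hypothetical three-molecule network: each molecule is present (1) or absent (0), the logical rules are invented, and iterating the global transition function reveals the asymptotic limit set (here a two-state cycle).

# Synchronous dynamical Boolean model with three hypothetical rules.
def step(state):
    x1, x2, x3 = state
    return (int(x2 and not x3),   # rule for molecule 1 (hypothetical)
            int(x1 or x3),        # rule for molecule 2 (hypothetical)
            int(x1 and x2))       # rule for molecule 3 (hypothetical)

state = (1, 0, 0)
trajectory = [state]
for _ in range(8):                # iterate the global transition function
    state = step(state)
    trajectory.append(state)

print(trajectory)                 # the tail oscillates: a cyclic attractor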

Dynamical Bayesian Models


Dynamical Bayesian models encode real relationships in oriented graphs; they can be considered as dynamical Boolean models with probabilities attached to the Boolean functions (Murphy & Mian, 1999).
The qualitative connections between gene products (3.3 and 3.4) can also be expressed through dynamical Bayesian models. These allow large graphs to be analysed and can overcome missing, noisy and inconsistent data (Friedman, Murphy, & Russell, 1998; Kim, Imoto, & Miyano, 2004). They are a special case of a more general class called graphical models, in which vertexes (gene products) represent random variables and the lack of edges represents conditional independence assumptions between gene products; they are an example of stochastic models with a well-defined probabilistic semantics.
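As a toy illustration, the sketch below samples a dynamical Bayesian model for two gene products in which the state of gene B at time t+1 depends probabilistically on the state of gene A at time t; the conditional probabilities and the trajectory of A are invented.

import random

random.seed(0)
p_b_on_given_a = {1: 0.9, 0: 0.1}    # P(B_{t+1} = 1 | A_t), hypothetical values

a = [1, 1, 0, 0, 1]                  # assumed trajectory of gene A
b = [0]                              # initial state of gene B
for t in range(len(a) - 1):
    b.append(1 if random.random() < p_b_on_given_a[a[t]] else 0)

print(list(zip(a, b)))               # one sampled joint trajectory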

Ordinary Differential Equations


Phosphorylation, dephosphorylation or gene regulation events lead to the definition of ODEs/PDEs that describe how each concentration changes over time (3.4 and 3.5).
A first approximation of the system can be obtained through the Law of Mass Action (Sekiguchi & Okamoto, 2006), where the graph vertexes (gene product concentrations) are taken as variables and the graph edges (speed of the reaction, activation or inhibition) are considered proportional to the product of the concentrations of the reacting species. Biochemical reactions or gene regulation are encoded through differential relations between the variables (de Jong, 2002; Klipp et al., 2005). Due to the high number of reactions, a high number of non-linear differential equations has to be manipulated and the dynamics can exhibit chaotic behaviour. The analysis of this chaos necessitates the development of robust tools and algorithms: classical techniques of dynamical systems theory (periodic orbits, stability, Lyapunov exponents, attractors, etc.) (Katok & Hasselblatt, 1996) and classical numerical analysis (gradient-like methods, non-linear calculus, etc.) (Voit et al., 2006).
As for dynamical Boolean models, the goal of ODE/PDE models is to determine asymptotic limit sets like attractors (stable states, cyclic attractors) and to analyse their properties (Huang & Ingber, 2000).
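The sketch below, written with NumPy and SciPy, integrates a mass-action model of a hypothetical two-step chain A -> B -> C; each reaction contributes a term proportional to the concentration of its substrate, and the rate constants are invented.

import numpy as np
from scipy.integrate import odeint

k1, k2 = 0.5, 0.2                     # hypothetical rate constants

def dydt(y, t):
    a, b, c = y
    return [-k1 * a,                  # d[A]/dt: A is consumed
            k1 * a - k2 * b,          # d[B]/dt: B is produced then consumed
            k2 * b]                   # d[C]/dt: C accumulates

t = np.linspace(0, 30, 301)
y = odeint(dydt, [1.0, 0.0, 0.0], t)  # concentrations over time
print(y[-1])                          # almost all mass has reached C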

Chemical Master Equation


Whereas the ODE/PDE approach produces concentrations by solving differential equations, the CME approach also involves the constitution of a set of equations, but with probabilities as variables (Gillespie, 1977; Ullah et al., 2006). CMEs thus produce counts of molecules as realisations of random variables drawn from the probability distribution. The number of equations grows exponentially with the number of states in the model. Novel strategies were developed in order to enable the handling of a large number of states (Sjöberg, Lötstedt, & Elf, 2007) and large sets of data.
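For a hypothetical two-step chain A -> B -> C, the sketch below applies Gillespie's stochastic simulation algorithm, which produces molecule counts as one realisation of the process described by the CME; rate constants and initial counts are invented.

import random

random.seed(1)
k1, k2 = 0.5, 0.2                     # hypothetical stochastic rate constants
a, b, c, t = 100, 0, 0, 0.0           # initial molecule counts and time

while t < 30.0:
    propensities = [k1 * a, k2 * b]
    total = sum(propensities)
    if total == 0:
        break                         # no reaction can fire any more
    t += random.expovariate(total)    # waiting time to the next reaction
    if random.random() < propensities[0] / total:
        a, b = a - 1, b + 1           # reaction A -> B fires
    else:
        b, c = b - 1, c + 1           # reaction B -> C fires

print(a, b, c)                        # one stochastic realisation at t ~ 30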

Box 2.
Dynamical Boolean models:
Let n = 4 be the number of vertexes, and k = Z/2Z the finite field of order 2. A dynamical graph N = (f1, f2, f3, f4) on n vertexes is given by the local update functions

f1(x) = x1 + x2 x4
f2(x) = x1 x2 x3
f3(x) = x1 + x3
f4(x) = x2 + 1 + x4

and a global transition function

f(x) = (f1(x), f2(x), f3(x), f4(x)),

where x = (x1, x2, x3, x4) ∈ k^4 is a collection of variables corresponding to the n proteins. Figure 5(a) illustrates an example of a connecting graph for the n proteins and Figure 5(b) the dynamical graph N over k = Z/2Z resulting from the association of the function f with the connecting graph.
Ordinary Differential Equations:
Let n = 4 be the number of vertexes. A dynamical graph N = (f1, f2, f3, f4) on n vertexes is given by the local update functions

f1(x) = α1 x1 + x2 x4
f2(x) = x1 x2 x3
f3(x) = α2 x1 + α3 x3
f4(x) = α4 x2 + 1/(x4 + A)

and a global transition function

f(x) = (f1(x), f2(x), f3(x), f4(x)),

where x = (x1, x2, x3, x4) ∈ R+^4 is a collection of variables corresponding to the n proteins. Figure 5(a) illustrates an example of a connecting graph for the n proteins. The dynamics associated with f depend on the parameters α1, α2, α3, α4 and A.

Figure 5. (a) Connecting graph associated with the Boolean and ODE dynamics. (b) Dynamical model associated with the Boolean graph.

Conclusion
The complexity of the biological systems to be modelled may lead to the construction of equations that are too complex and in many cases have no analytical solution. In addition, the number of states grows exponentially with the number of molecules. It is therefore important to use a set of simple equations that best fit the biological model. It is often quite difficult to reconcile these two imperatives. This calls for the development of sophisticated statistical tools that can calculate and test quantitatively how far our mathematical models are from the real cell dynamics.
Mathematics is used as a formalism to help generate models, which are used as tools to test hypotheses or predict behaviours in order to understand how cells work. However, as a formal construct it cannot, by itself, explain biological events.
Nowadays, systems biology has been successfully applied to normal or cancerous mammalian cells and has helped, for example, to elucidate molecular networks that cause diseases (Chen et al., 2008) or to map the human cancer signalling network (Cui et al., 2007); it promises to be equally successful in the modelling of infected cells.
The ultimate goal of systems biology in the specific field of infectious diseases is to decipher in vivo host-pathogen interactions at the cellular and molecular levels and, naturally, to address the drug discovery issue. In vitro experiments are a major obstacle to drug discovery since they are done in cells isolated from APCs, soluble mediators (cytokines, chemokines, etc.) and other components of their in vivo environment. These experiments miss the structural framework of the tissue itself, which surely influences the kinetics and the nature of the response. A compromise would be to work on animal models in vivo and then to try to extrapolate and predict the effect of the developed drugs on humans using systems approaches. However, these models definitely suffer limitations when they involve inbred animals that do not reflect the diversity of outbred populations.

Acknowledgment
Many thanks to Karyn MEGY for her critical reading of this manuscript.

References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search
tool. J Mol Biol, 215(3), 403-410.
Bader, G. D., Betel, D., & Hogue, C. W. (2003). BIND: The biomolecular interaction network database.
Nucleic Acids Res, 31(1), 248-250.
Bahl, A., Brunk, B., Coppel, R. L., Crabtree, J., Diskin, S. J., Fraunholz, M. J., et al. (2002). PlasmoDB:
The Plasmodium genome resource. An integrated database providing tools for accessing, analyzing
and mapping expression and sequence data (both finished and unfinished). Nucleic Acids Res, 30(1),
87-90.
Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., et al. (2005). The Universal Protein Resource (UniProt). Nucleic Acids Res, 33(Database issue), D154-159.
Barrett, T., Troup, D. B., Wilhite, S. E., Ledoux, P., Rudnev, D., Evangelista, C., et al. (2007). NCBI
GEO: Mining tens of millions of expression profiles--Database and tools update. Nucleic Acids Res,
35(Database issue), D760-765.
Bebek, G., & Yang, J. (2007). PathFinder: Mining signal transduction pathway segments from protein-protein interaction networks. BMC Bioinformatics, 8, 335.
Becker, K. G., Hosack, D. A., Dennis, G., Jr., Lempicki, R. A., Bright, T. J., Cheadle, C., et al. (2003).
PubMatrix: A tool for multiplex literature mining. BMC Bioinformatics, 4, 61.
Birney, E., Stamatoyannopoulos, J. A., Dutta, A., Guigo, R., Gingeras, T. R., Margulies, E. H., et al.
(2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE
pilot project. Nature, 447(7146), 799-816.
Blais, A., & Dynlacht, B. D. (2005). Constructing transcriptional regulatory networks. Genes Dev,
19(13), 1499-1511.
Bruggeman, F., & Westerhoff, H. (2006). The nature of systems biology. TRENDS in Microbiology,
15(1), 45-50.
Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M. C., Maeda, N., et al. (2005). The transcriptional landscape of the mammalian genome. Science, 309(5740), 1559-1563.
Chaussabel, D., Semnani, R. T., McDowell, M. A., Sacks, D., Sher, A., & Nutman, T. B. (2003). Unique
gene expression profiles of human macrophages and dendritic cells to phylogenetically distinct parasites.
Blood, 102(2), 672-681.
Chaussabel, D., & Sher, A. (2002). Mining microarray expression data by literature profiling. Genome
Biol, 3(10), RESEARCH0055.

Chen, Y., Zhu, J., Lum, P. Y., Yang, X., Pinto, S., Macneil, D. J., et al. (2008). Variations in DNA elucidate molecular networks that cause disease. Nature.
Chi, Y. Y., Ibrahim, J. G., Bissahoyo, A., & Threadgill, D. W. (2007). Bayesian hierarchical modeling
for time course microarray experiments. Biometrics, 63(2), 496-504.
Cho, Y. R., Hwang, W., Ramanathan, M., & Zhang, A. (2007). Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics, 8(1), 265.
Chopra, P., Kang, J., Yang, J., Cho, H., Kim, H. S., & Lee, M. G. (2008). Microarray data mining using
landmark gene-guided clustering. BMC Bioinformatics, 9, 92.
Cimino, J. J. (1998). Auditing the unified medical language system with semantic methods. J Am Med
Inform Assoc, 5(1), 41-51.
Couto, F. M., Silva, M. J., Lee, V., Dimmer, E., Camon, E., Apweiler, R., et al. (2006). GOAnnotator:
Linking protein GO annotations to evidence text. J Biomed Discov Collab, 1, 19.
Cui, Q., Ma, Y., Jaramillo, M., Bari, H., Awan, A., Yang, S., et al. (2007). A map of human cancer signaling. Mol Syst Biol, 3, 152.
Daraselia, N., Yuryev, A., Egorov, S., Mazo, I., & Ispolatov, I. (2007). Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks. BMC Bioinformatics, 8(1), 243.
de Jong, H. (2002). Modeling and simulation of genetic regulatory systems: A literature review. J Comput Biol, 9(1), 67-103.
Diehl, A. D., Lee, J. A., Scheuermann, R. H., & Blake, J. A. (2007). Ontology development for biological
systems: immunology. Bioinformatics, 23(7), 913-915.
Dieterich, G., Karst, U., Fischer, E., Wehland, J., & Jansch, L. (2006). LEGER: Knowledge database and
visualization tool for comparative genomics of pathogenic and non-pathogenic Listeria species. Nucleic
Acids Res, 34(Database issue), D402-406.
El Fadili, K., Imbeault, M., Messier, N., Roy, G., Gourbal, B., Bergeron, M., et al. (2008). Modulation
of gene expression in human macrophages treated with the anti-leishmania pentavalent antimonial drug
sodium stibogluconate. Antimicrob Agents Chemother, 52(2), 526-533.
Fitch, W. M. (2000). Homology: A personal view on some of the problems. Trends Genet, 16(5), 227-231.
Forst, C. V. (2006). Host-pathogen systems biology. Drug Discov Today, 11(5-6), 220-227.
Friedman, N., Linial, M., Nachman, I., & Pe'er, D. (2000). Using Bayesian networks to analyze expression data. J Comput Biol, 7(3-4), 601-620.
Friedman, N., Murphy, K., & Russell, S. (1998). Learning the Structure of Dynamic Probabilistic Networks. Paper presented at the Fourteenth Conf. on Uncertainty in Artificial Intelligence (UAI).
Gana Dresen, I. M., Boes, T., Huesing, J., Neuhaeuser, M., & Joeckel, K. H. (2008). New resampling
method for evaluating stability of clusters. BMC Bioinformatics, 9, 42.

Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem., 81,
2340-2361.
Gladki, A., Siedlecki, P., Kaczanowski, S., & Zielenkiewicz, P. (2008). e-LiSe - An online tool for finding needles in the (Medline) haystack. Bioinformatics.
Grimes, G. R., Moodie, S., Beattie, J. S., Craigon, M., Dickinson, P., Forster, T., et al. (2005). GPX-Macrophage Expression Atlas: A database for expression profiles of macrophages challenged with a variety
of pro-inflammatory, anti-inflammatory, benign and pathogen insults. BMC Genomics, 6, 178.
Grinde, B., Gayorfar, M., & Hoddevik, G. (2007). Modulation of gene expression in a human cell line
caused by poliovirus, vaccinia virus and interferon. Virol J, 4, 24.
Hammamieh, R., Chakraborty, N., Wang, Y., Laing, M., Liu, Z., Mulligan, J., et al. (2007). GeneCite:
A stand-alone open source tool for high-throughput literature and pathway mining. Omics, 11(2), 143-151.
Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., et al. (2004). The Gene
Ontology (GO) database and informatics resource. Nucleic Acids Res, 32(Database issue), D258-261.
Hart, C. E., Sharenbroich, L., Bornstein, B. J., Trout, D., King, B., Mjolsness, E., et al. (2005). A mathematical and computational framework for quantitative comparison and integration of large-scale gene
expression data. Nucleic Acids Res, 33(8), 2580-2594.
Hayete, B., Gardner, T. S., & Collins, J. J. (2007). Size matters: Network inference tackles the genome
scale. Mol Syst Biol, 3, 77.
Hertz-Fowler, C., Peacock, C. S., Wood, V., Aslett, M., Kerhornou, A., Mooney, P., et al. (2004). GeneDB: A
resource for prokaryotic and eukaryotic organisms. Nucleic Acids Res, 32(Database issue), D339-343.
Hestvik, A. L., Hmama, Z., & Av-Gay, Y. (2005). Mycobacterial manipulation of the host cell. FEMS
Microbiol Rev, 29(5), 1041-1050.
Hide, W., Smedley, D., McCarthy, M., & Kelso, J. (2003). Application of eVOC: controlled vocabularies
for unifying gene expression data. C R Biol, 326(10-11), 1089-1096.
Hofman, V. J., Moreilhon, C., Brest, P. D., Lassalle, S., Le Brigand, K., Sicard, D., et al. (2007). Gene
expression profiling in human gastric mucosa infected with Helicobacter pylori. Mod Pathol, 20(9),
974-989.
Huang, S., & Ingber, D. E. (2000). Shape-dependent control of cell growth, differentiation, and apoptosis:
switching between attractors in cell regulatory networks. Exp Cell Res, 261(1), 91-103.
Hubbard, T. J., Aken, B. L., Beal, K., Ballester, B., Caccamo, M., Chen, Y., et al. (2007). Ensembl 2007.
Nucleic Acids Res, 35(Database issue), D610-617.
Ideker, T., & Lauffenburger, D. (2003a). Building with a scaffold: Emerging strategies for high- to low-level cellular modeling. Trends Biotechnol, 21(6), 255-262.
Ideker, T., & Lauffenburger, D. (2003b). Building with a scaffold: Emerging strategies for high- to low-level cellular modeling. TRENDS in Biotechnology, 21(6), 255-262.

Jansen, A., & Yu, J. (2006). Differential gene expression of pathogens inside infected hosts. Curr Opin
Microbiol, 9(2), 138-142.
Jenner, R. G., & Young, R. A. (2005). Insights into host responses against pathogens from transcriptional
profiling. Nat Rev Microbiol, 3(4), 281-294.
Kanehisa, M. (2002). The KEGG database. Novartis Found Symp, 247, 91-101; discussion 101-103, 119-128, 244-252.
Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K. F., Itoh, M., Kawashima, S., et al. (2006). From
genomics to chemical genomics: New developments in KEGG. Nucleic Acids Res, 34(Database issue),
D354-357.
Kaplan, N., Vaaknin, A., & Linial, M. (2003). PANDORA: Keyword-based analysis of protein sets by
integration of annotation sources. Nucleic Acids Res, 31(19), 5617-5626.
Kasprzyk, A., Keefe, D., Smedley, D., London, D., Spooner, W., Melsopp, C., et al. (2004). EnsMart: A
generic system for fast and flexible access to biological data. Genome Res, 14(1), 160-169.
Katok, A., & Hasselblatt, B. (1996). Introduction to the modern theory of dynamical systems. Cambridge
University Press.
Kawaji, H., Frith, M. C., Katayama, S., Sandelin, A., Kai, C., Kawai, J., et al. (2006). Dynamic usage
of transcription start sites within core promoters. Genome Biol, 7(12), R118.
Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. Genome Res, 12(4), 656-664.
Kim, S., Imoto, S., & Miyano, S. (2004). Dynamic Bayesian network and nonparametric regression
for nonlinear modeling of gene networks from time series gene expression data. Biosystems, 75(1-3),
57-65.
Klipp, E., Herwig, R., Kowald, A., Wierling, C., & Lehrach, H. (2005). Systems biology in practice.
Concepts, implementation and application. Wiley-VCH.
Koehler, J., Rawlings, C., Verrier, P., Mitchell, R., Skusa, A., Ruegg, A., et al. (2005). Linking experimental results, biological networks and sequence analysis methods using Ontologies and Generalised
Data Structures. In Silico Biol, 5(1), 33-44.
Komurov, K., & White, M. (2007). Revealing static and dynamic modular architecture of the eukaryotic
protein interaction network. Mol Syst Biol, 3, 110.
Korb, M., Rust, A. G., Thorsson, V., Battail, C., Li, B., Hwang, D., et al. (2008). The Innate Immune
Database (IIDB). BMC Immunol, 9(1), 7.
Laubenbacher, R. (2005). Algebraic models in systems biology. Algebraic Biology, 33-40.
Lawson, D., Arensburger, P., Atkinson, P., Besansky, N. J., Bruggner, R. V., Butler, R., et al. (2007).
VectorBase: A home for invertebrate vectors of human pathogens. Nucleic Acids Res, 35(Database issue), D503-505.
Letovsky, S. I., Cottingham, R. W., Porter, C. J., & Li, P. W. (1998). GDB: The Human Genome Database. Nucleic Acids Res, 26(1), 94-99.

Li, L., Stoeckert, C. J., Jr., & Roos, D. S. (2003). OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res, 13(9), 2178-2189.
Liu, X., Jessen, W. J., Sivaganesan, S., Aronow, B. J., & Medvedovic, M. (2007). Bayesian hierarchical
model for transcriptional module discovery by jointly modeling gene expression and ChIP-chip data.
BMC Bioinformatics, 8(1), 283.
Maciag, K., Altschuler, S. J., Slack, M. D., Krogan, N. J., Emili, A., Greenblatt, J. F., et al. (2006).
Systems-level analyses identify extensive coupling among gene expression machines. Mol Syst Biol,
2, 2006 0003.
Maier, H., Dohr, S., Grote, K., O'Keeffe, S., Werner, T., Hrabe de Angelis, M., et al. (2005). LitMiner
and WikiGene: Identifying problem-related key players of gene regulation using publication abstracts.
Nucleic Acids Res, 33(Web Server issue), W779-782.
Maynard, J. A., Myhre, R., & Roy, B. (2007). Microarrays in infection and immunity. Curr Opin Chem
Biol, 11(3), 306-315.
Menten, B., Pattyn, F., De Preter, K., Robbrecht, P., Michels, E., Buysse, K., et al. (2005). arrayCGHbase: An analysis platform for comparative genomic hybridization microarrays. BMC Bioinformatics,
6, 124.
Muller, H. M., Kenny, E. E., & Sternberg, P. W. (2004). Textpresso: An ontology-based information
retrieval and extraction system for biological literature. PLoS Biol, 2(11), e309.
Murphy, K., & Mian, S. (1999). Modelling gene expression data using dynamic bayesian networks.
Berkeley: Tech. rep. MIT Artificial Intelligence Laboratory.
Nardone, J., Lee, D. U., Ansel, K. M., & Rao, A. (2004). Bioinformatics for the bench biologist: How
to find regulatory regions in genomic DNA. Nat Immunol, 5(8), 768-774.
Ng, A., Bursteinas, B., Gao, Q., Mollison, E., & Zvelebil, M. (2006). Resources for integrative systems
biology: from data through databases to networks and dynamic system models. Brief Bioinform, 7(4),
318-330.
Nilsson, R., Bajic, V. B., Suzuki, H., di Bernardo, D., Bjorkegren, J., Katayama, S., et al. (2006). Transcriptional network dynamics in macrophage activation. Genomics, 88(2), 133-142.
O'Brien, K. P., Remm, M., & Sonnhammer, E. L. (2005). Inparanoid: A comprehensive database of
eukaryotic orthologs. Nucleic Acids Res, 33(Database issue), D476-480.
Parkinson, H., Kapushesky, M., Shojatalab, M., Abeygunawardena, N., Coulson, R., Farne, A., et al.
(2007). ArrayExpress--A public database of microarray experiments and gene expression profiles. Nucleic Acids Res, 35(Database issue), D747-750.
Pe'er, D. (2005). Bayesian network analysis of signaling networks: A primer. Sci STKE, 2005(281), pl4.
Pearson, W. R., & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl
Acad Sci USA, 85(8), 2444-2448.
Pennacchio, L. A., Ahituv, N., Moses, A. M., Prabhakar, S., Nobrega, M. A., Shoukry, M., et al. (2006).
In vivo enhancer analysis of human conserved non-coding sequences. Nature, 444(7118), 499-502.
Philippi, S., & Kohler, J. (2006). Addressing the problems with life-science databases for traditional
uses and systems biology. Nat Rev Genet, 7(6), 482-488.
Prieto, C., & De Las Rivas, J. (2006). APID: Agile Protein Interaction DataAnalyzer. Nucleic Acids
Res, 34(Web Server issue), W298-302.
Ren, B., & Dynlacht, B. D. (2004). Use of chromatin immunoprecipitation assays in genome-wide location analysis of mammalian transcription factors. Methods Enzymol, 376, 304-315.
Rubinstein, R., & Simon, I. (2005). MILANO: Custom annotation of microarray results using automatic
literature searches. BMC Bioinformatics, 6, 12.
Saez-Rodriguez, J., Simeoni, L., Lindquist, J. A., Hemenway, R., Bommhardt, U., Arndt, B., et al. (2007).
A logical model provides insights into T cell receptor signaling. PLoS Comput Biol, 3(8), e163.
Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., Bowie, J. U., & Eisenberg, D. (2004). The Database
of Interacting Proteins: 2004 update. Nucleic Acids Res, 32(Database issue), D449-451.
Sanguinetti, G., Noirel, J., & Wright, P. C. (2008). MMG: A probabilistic tool to identify submodules
of metabolic pathways. Bioinformatics.
Scott, M. S., & Barton, G. J. (2007). Probabilistic prediction and ranking of human protein-protein
interactions. BMC Bioinformatics, 8, 239.
Sekiguchi, T., & Okamoto, M. (2006). WinBEST-KIT: Windows-based biochemical reaction simulator
for metabolic pathways. J Bioinform Comput Biol, 4(3), 621-638.
Settles, B. (2005). ABNER: An open source tool for automatically tagging genes, proteins and other
entity names in text. Bioinformatics, 21(14), 3191-3192.
Sjöberg, P. (2002). Numerical solution of the master equation in molecular biology. Uppsala University,
Uppsala.
Sjöberg, P., Lötstedt, P., & Elf, J. (2007). Fokker-Planck approximation of the master equation in molecular biology. Comput. Vis. Sci., 10.
Takigawa, I., & Mamitsuka, H. (2008). Probabilistic path ranking based on adjacent pairwise coexpression for metabolic transcripts analysis. Bioinformatics, 24(2), 250-257.
Trowsdale, J., & Parham, P. (2004). Mini-review: Defense strategies and immunity-related genes. Eur
J Immunol, 34(1), 7-17.
Ullah, M., Schmidt, H., Cho, K. H., & Wolkenhauer, O. (2006). Deterministic modelling and stochastic
simulation of biochemical pathways using MATLAB. Paper presented at the IEE Proceedings - Systems
Biology.
Vastrik, I., D'Eustachio, P., Schmidt, E., Joshi-Tope, G., Gopinath, G., Croft, D., et al. (2007). Reactome:
A knowledge base of biologic pathways and processes. Genome Biol, 8(3), R39.
Voit, E. O., Almeida, J., Marino, S., Lall, R., Goel, G., Neves, A. R., et al. (2006). Regulation of glycolysis in Lactococcus lactis: An unfinished systems biological case study. Syst Biol (Stevenage), 153(4),
286-298.

Waddell, S. J., Butcher, P. D., & Stoker, N. G. (2007). RNA profiling in host-pathogen interactions. Curr
Opin Microbiol, 10(3), 297-302.
Wilczynski, B., Hvidsten, T. R., Kryshtafovych, A., Tiuryn, J., Komorowski, J., & Fidelis, K. (2006).
Using local gene expression similarities to discover regulatory binding site modules. BMC Bioinformatics, 7, 505.
Wiwatwattana, N., Landau, C. M., Cope, G. J., Harp, G. A., & Kumar, A. (2007). Organelle DB: an
updated resource of eukaryotic protein localization and function. Nucleic Acids Res, 35(Database issue), D810-814.
Yuan, X., Hu, Z. Z., Wu, H. T., Torii, M., Narayanaswamy, M., Ravikumar, K. E., et al. (2006). An
online literature mining tool for protein phosphorylation. Bioinformatics, 22(13), 1668-1669.
Zaffuto, K. M., Piccone, M. E., Burrage, T. G., Balinsky, C. A., Risatti, G. R., Borca, M. V., et al. (2007).
Classical swine fever virus inhibits nitric oxide production in infected macrophages. J Gen Virol, 88(Pt
11), 3007-3012.
Zhang, R., & Zhang, C. T. (2006). The impact of comparative genomics on infectious disease research.
Microbes Infect, 8(6), 1613-1622.

Key Terms
Bayesian Network: Represents probabilistic relationships between a set of nodes; it is a directed graph whose structure encodes conditional independencies between the nodes.
Boolean Network: A set of nodes whose states are determined by other nodes in the network; it is a directed graph in which nodes are either present (value = 1) or absent (value = 0).
Biochemical Reaction: A chemical reaction that involves biological molecules.
Gene Products: RNA, mRNA, proteins and metabolites.


Chapter XXIV

Systems Biology of Human-Pathogenic Fungi


Daniela Albrecht
Leibniz Institute for Natural Product Research and Infection Biology Hans-Knoell-Institute (HKI), Germany
Reinhard Guthke
Leibniz Institute for Natural Product Research and Infection Biology Hans-Knoell-Institute (HKI), Germany
Olaf Kniemeyer
Leibniz Institute for Natural Product Research and Infection Biology Hans-Knoell-Institute (HKI), Germany
Axel A. Brakhage
Leibniz Institute for Natural Product Research and Infection Biology Hans-Knoell-Institute (HKI), Germany and Friedrich Schiller University (FSU), Germany

ABSTRACT
This chapter describes a holistic approach to understanding the molecular biology and infection process of human-pathogenic fungi. It comprises the whole process of analyzing transcriptomic and proteomic data. Starting with the biological background, information on Aspergillus fumigatus and Candida albicans, two of the most important fungal pathogens, is given. Afterwards, techniques to create transcriptome and proteome data are described. The chapter continues by explaining methods for data processing and analysis. It shows the need for, and problems with, data integration, as well as the role of standards, ontologies, and databases. General aspects of these three major topics are explained and connected to the research on human-pathogenic fungi. Finally, the near future of this research topic is highlighted. This chapter aims to provide an overview of analyses of data from different cellular levels of human-pathogenic fungi. It describes their integration and the application of systems biology methodologies.


INTRODUCTION
In the past, biologists mostly studied one or a few genes or gene products at a time. In recent years, it was realized that it is not enough to understand the basic elements of a system; it is also necessary to understand the system as a whole. Systems biology is a holistic, cross-disciplinary approach to studying biological systems (Ideker, Galitski, & Hood, 2001). It considers organisms and their environment as a series of hierarchical levels. Systems biology can be divided into two branches. Bottom-up approaches, which aggregate detailed knowledge of single components and their interactions into suitable modules, are the first one. The second branch is top-down approaches, which decompose global data to gain knowledge of smaller modules. Bottom-up approaches are the most important in research on human-pathogenic fungi at present. They work by systematically perturbing the biological system under study, monitoring the response, integrating the data, and finally modelling the biological process. Iteratively, experiments are made to validate the model and the model is refined to fit the experimental findings.
A completely sequenced genome of an organism is the framework of any global approach. Together with transcriptomic and proteomic data, it builds the foundation for systems biology (Aggarwal & Lee, 2003). For several human-pathogenic fungi, including Aspergillus fumigatus and Candida albicans, the genome has been sequenced. Several studies have been conducted using omics methodologies and storing their results in databases. This chapter aims to provide insight into current work on different cellular levels of human-pathogenic fungi and the state of systems biology research in this area.

BACKGROUND

Mycoses
Experts believe that 1.5 million fungal species exist. Only 100-150 of them are associated with human infections. The first human mycoses were discovered in the middle of the 19th century. In the last two decades, the number of patients suffering from invasive forms of fungal infections has grown rapidly. Today, A. fumigatus and C. albicans are the two major causes of such invasive diseases (Kullberg & Oude Lashof, 2002). The main reason for this development is the rising number of immunocompromised patients, who are mostly affected by opportunistic pathogens. Humans who undergo immunosuppressive therapy, transplantation, intensive care or major surgery, as well as humans infected with HIV, are most susceptible. In addition, the aging of the population and increased survival chances for premature newborns have increased the number of persons at risk. The investigation of the infection process at the molecular level will help to reduce the risk of disease for these groups of people.
In very recent times, molds other than A. fumigatus (e.g., A. flavus, A. terreus) and yeasts other than C. albicans (e.g., C. glabrata, C. tropicalis) have emerged as human pathogens (Kullberg & Oude Lashof, 2002; Nucci & Marr, 2005). Other fungal species, for example Coccidioides posadasii or Cryptococcus neoformans, also cause mycoses in humans. These fungi are far less common, so this chapter is mainly restricted to A. fumigatus and C. albicans.

Aspergillus Fumigatus
Aspergillus fumigatus was first described and characterized by J. B. G. W. Fresenius in 1863. It is the primary mold pathogen and the most important airborne fungal pathogen. The fungus can be found in soil
and decaying organic matter and is ubiquitously distributed all over the world. It is highly thermotolerant, surviving temperatures up to 70°C. It is characterized by very small gray-green conidia (spores) that can easily reach the lung alveoli (Brakhage, 2005). For most patients, the site of penetration and infection is the respiratory tract, especially the lung. The main defense mechanism in humans is the innate immune system, mainly phagocytic cells. The role of anatomical barriers, humoral components or acquired immunity has rarely been studied (Latgé, 1999).
Diseases caused by A. fumigatus can be divided into non-invasive and invasive forms. Non-invasive aspergillosis includes allergic bronchopulmonary aspergillosis, with effects ranging from asthma to fatal lung destruction. Aspergillomas, which appear as fungal balls in preexisting lung cavities caused by former lung disorders, are also non-invasive. Both diseases can occur in immunocompetent hosts. Invasive forms are characterized by infection of lung or sinus tissue and dissemination through the blood stream. They are mostly detected in immunocompromised patients. Aspergilloses occur less often than candidiases, but their mortality can be much higher, ranging from 30% to 90%.

Candida Albicans
Candida albicans was discovered by C. P. Robin in 1853 as Oidium albicans. Afterwards it was renamed several times and was given its final name in 1923 by C. M. Berkhout. The yeast is a commensal of mucosal surfaces (e.g., oral cavity, gastrointestinal tract, vagina) of up to 71% of the human population. It is a frequent cause of superficial infections of the skin and mucosae. In immunosuppressed persons, it can also cause invasive infections by entering the bloodstream and penetrating nearly all organs of the body. Such candidaemia has the highest mortality rate of all bloodstream infections, at around 40% (Mavor, Thewes, & Hube, 2005).
C. albicans enters the blood stream in three ways. Penetration of epithelial cells (mostly from within the gastrointestinal tract) is the most frequent kind of invasion. Medical devices like catheters provide a second possibility of entry. The third way is damage to body barriers, for example by trauma, surgery or drug treatment. The fungus has the ability to change from commensalism to parasitism (Hube, 2004). It can colonize different body sites and cause different types of infections. This makes C. albicans the most frequently isolated fungal pathogen from blood samples and the most important yeast pathogen. Innate immunity plays a major role in fighting invasive infections; mainly neutrophils, monocytes, and macrophages attack fungal cells. Superficial infections are mainly countered by T-cell immunity.

WORKING ON DIFFERENT CELLULAR LEVELS

Genome
Genomics aims to identify genes and potentially important non-coding regions of a genome. The first sequencing project started in 1986. Currently, the genomes of 26 fungal species are completely sequenced and published (GOLD, Genomes OnLine Database; retrieved August 29, 2008, http://www.genomesonline.org/). Among them are A. nidulans and Saccharomyces cerevisiae as model organisms for filamentous fungi and yeasts, respectively. Six genomes originate from human pathogens, including A. fumigatus and C. albicans. 19 more human pathogens are in the process of being sequenced.
The genome of A. fumigatus was published in 2005 (Nierman et al., 2005). 29.4 megabases arranged on 8 chromosomes with 9926 predicted genes were assembled. Many genes are annotated with unknown function; a lot of annotation work still has to be done. The 21st assembly of the C. albicans genome sequence was published in 2007 (van het Hoog et al., 2007). 6090 genes on 8 chromosomes in 15.8 megabases (haploid) were assembled. Sequencing was difficult because the fungus is diploid and many genes show distinct alleles. Annotation is a sophisticated process, controlled by the Candida Annotation Working Group.
With this wealth of genomic information at hand, it is possible to apply knowledge from model organisms to pathogens and to carry out comparative genomics. For A. fumigatus, this was done, for example, in a comparison of its genome to those of A. nidulans and A. oryzae. A similar project comparing C. albicans and C. dubliniensis can be found in the literature. Genomic information also enables the proper interpretation of functional genomic studies.

Transcriptome
Transcriptomics was the first omics technology to emerge after the completion of the first genome sequences. The publication of the A. fumigatus genome was already accompanied by a first analysis of its transcriptome in temperature shift experiments (Nierman et al., 2005). Gene expression studies are very important: genomic sequences alone do not explain the interplay between genes or how cells work. The immediate product of transcription is mRNA, which provides the most direct view on genes and their regulatory networks. Transcriptomics works with the entire mRNA complement of a biological sample, the transcriptome. It provides a broader, more complete, and less biased view than looking at only one or a few genes.
DNA microarray technology is the workhorse of gene expression studies. The concept of this technology was developed in 1985. In 1995, the first gene expression article was published, on a study with Arabidopsis thaliana. In research on human-pathogenic fungi, microarrays are mostly glass slides on which DNA molecules are attached at fixed locations (spots). Several thousand spots per slide are possible; thus, a whole genome can be arranged and investigated on one array. Many different variants of probe design, target preparation, array imaging, and other parts of the microarray protocol exist (Kawasaki, 2006). One possibility to group microarray experiments is by coloring method (Figure 1). The first gene expression article described a robotically printed cDNA array used with probes labeled with two different dyes (Cy3 and Cy5). The results of this method are relative measurements of the spot intensities of both colors (channels). Today, long (50- to 70-mer) oligonucleotides are also used, which provide a higher sensitivity. Another array format, developed by Affymetrix (Santa Clara, USA), uses short oligonucleotide probes synthesized in situ by photolithography. This format allows a very high probe density. Here, only one color is used and the resulting values are absolute measurements. These two formats are widely used in all fields of biological research today. In studies on human-pathogenic fungi, the two-color version is used nearly exclusively. Applications of microarray techniques in this field include, first, the identification of virulence-associated genes in different fungal species such as A. fumigatus and C. albicans. The detection of cis-regulatory elements and regulons (i.e., sets of co-regulated genes) is also important. Furthermore, expression profiles are used as fingerprints for classification purposes or as tests for the relatedness of different fungal species.
Apart from microarrays of any kind, there are several other techniques to investigate mRNA in
organisms (Lockhart & Winzeler, 2000). They include differential display and serial analysis of gene
expression (SAGE). Few studies using differential display in human-pathogenic fungi can be found.
Most of them have been carried out with C. albicans. SAGE was used sparsely in studies with fungal pathogens (e.g., Cryptococcus neoformans). Both techniques provide additional knowledge to microarray-derived information. Often, results from two or more approaches are similar, but they never seem to be identical.

Figure 1. Process of creating DNA microarray data

Proteome
The word proteome was introduced in 1994. It describes the ensemble of protein forms expressed in a biological sample. Proteome analysis provides valuable information about living systems. Alternative splicing, different protein isoforms or protein complexes cannot be measured by analyzing the transcriptome. Proteins, not transcripts, act in most biological processes in a cell.

Traditionally, proteomics was defined as the large-scale analysis of proteins by two-dimensional polyacrylamide gel electrophoresis (2D-PAGE), mainly including the measurement of expression levels. 2D-PAGE followed by mass spectrometry (MS) for protein identification is the workhorse of this field.
Two-dimensional gels have been used since 1975 to simultaneously resolve large numbers of proteins. They are able to show post-translational modifications of a protein as a series of spots and thus make them available for analysis. The first dimension is isoelectric focusing (IEF), which separates proteins by their isoelectric point. The second dimension is an SDS polyacrylamide gel, separating proteins by molecular weight (Figure 2). The publication of the first genome sequences enhanced the use of this technique dramatically: function could be assigned to many proteins on the basis of the preliminary annotation of these sequences. Also, the development of MS techniques such as matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) for identifying proteins in 1987 made two-dimensional gel electrophoresis technology more powerful. A more recently (1997) developed special form of 2D-PAGE is difference in-gel electrophoresis (DIGE). Here, two samples and one internal standard are colored with different dyes and run on the same gel. The resulting values are ratios (Cy3/Cy2)/(Cy5/Cy2) instead of absolute values. This reduces the gel-to-gel variation in the quantity of a protein spot and makes the matching of several gels, and thus comparisons between gels, much easier. This creates statistically significant data with fewer gels than in traditional 2D-PAGE (Marouga, David, & Hawkins, 2005). Classical 2D-PAGE and DIGE are used in various applications in research on human-pathogenic fungi. Important are investigations of the yeast-to-hyphal transition, for example in C. albicans, Penicillium marneffei, and other dimorphic fungi. Research on the composition of fungal cell walls is equally important because the cell wall is the first fungal part in contact with host cells. Drug response is investigated largely at the proteomic level. Biofilm formation, for example by C. albicans, is also a topic that is important for infections in humans and is investigated by using 2D-PAGE.
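A tiny numeric sketch of the DIGE quantification described above: for one spot, each sample is expressed relative to the internal standard and the two standardized abundances are then compared; all spot volumes are invented.

# Spot volumes for sample 1 (Cy3), sample 2 (Cy5) and the internal standard (Cy2).
cy3, cy5, cy2 = 5200.0, 2100.0, 4000.0              # invented values
standardized_ratio = (cy3 / cy2) / (cy5 / cy2)
print(round(standardized_ratio, 2))                 # ~2.48-fold stronger in sample 1

Within a single gel the internal standard cancels out of this ratio; its benefit, as noted above, lies in matching spots and comparing standardized abundances across several gels.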
Although 2D-PAGE is widely used, it has some disadvantages. It ignores the possibility of co-migration. In addition, the types of proteins that can be visualized on a gel are limited: for example, proteins with extreme values of isoelectric point and molecular weight cannot be displayed. To overcome these drawbacks, other methods have emerged. There are several mass spectrometry based approaches like (high performance) liquid chromatography coupled to tandem mass spectrometry ((HP)LC-MS/MS). HPLC-MS/MS was used, for example, to investigate vaccines against Coccidioides posadasii. Protein arrays are beginning to be used quite similarly to mRNA arrays (Pandey & Mann, 2000). They have been applied in a few analyses on C. albicans. Again, no single platform exists providing a solution to every problem. Different ones have to be merged for a systematic view.

Other Omics
Apart from genomics, transcriptomics, and proteomics, many other omics approaches exist. Prominent examples are metabolomics, lipidomics, interactomics, and phenomics. Metabolomics investigates all metabolites in an organism. Lipidomics is the field of research on all lipids of a cell. The interactome is the totality of molecular interactions in an organism. A collection of all phenotypic information observed in wild-type organisms and upon mutations of genes is called the phenome. Many more are listed with explanations at http://www.genomicglossaries.com/content/omes.asp. Few of them play a role in the investigation of human-pathogenic fungi. Metabolomics studies can be found in the literature. Phenomics and interactomics research has been conducted in S. cerevisiae as a model organism. Most of the other global approaches do not have an impact on fungal research yet.

Figure 2. Process of creating two-dimensional gel data

PROCESSING AND ANALYSIS OF TRANSCRIPTOMIC AND PROTEOMIC DATA

Two-channel DNA microarrays and 2D-PAGE are the most common techniques for analyzing transcriptome and proteome of human-pathogenic fungi. This section aims at describing processing and
analysis of such data.

Image Analysis Programs


When experiments at the transcriptomic and proteomic levels are performed, DNA microarrays or 2D gels are scanned and these scans are analyzed. The raw data of both approaches are monochrome images. Transformation into gene expression matrices is not trivial. In research on human-pathogenic fungi, this is
preferentially done by special software packages. GenePix (Molecular Devices Corporation, Sunnyvale, USA) and ArrayVision (GE Healthcare UK Ltd, Buckinghamshire, England) are widely used for two-channel microarrays. Manual assistance is often unavoidable because spots might be irregular in shape or fluorescence impurities on the chip surface may confuse the algorithms. For 2D-PAGE images, DeCyder (GE Healthcare UK Ltd, Buckinghamshire, England) and Delta 2D (Decodon GmbH, Greifswald, Germany) are examples of software able to analyze images from classical 2D-PAGE and DIGE. All these programs provide functionalities for comparing spots across arrays or gels. This is a particularly critical step in proteomics: proteins move freely in a gel and spots do not appear at fixed, known positions. None of the available software platforms provides perfect automatic spot detection and matching between gels; manual inspection is needed. As it is of interest for most experiments, all the above platforms detect differentially expressed genes or proteins and give significance measures.
In research on human-pathogenic fungi, analyses often rely on this software only and report results without any other bioinformatical inspection of the data. In other fields of research, additional methods for data processing have been developed; these are the topic of the next sections.

Replication and Experimental Design


Observed variation in experiments has two sources: it can be technically caused or result from biological variability. Therefore, technical and biological replicates need to be carried out. It is possible to spot one probe multiple times on each microarray; if some spots cannot be evaluated due to technical problems, there will be a backup. Making more than one microarray or 2D-PAGE gel out of the same biological sample is another possibility to remove technical variation; it corrects differences in labeling reactions. A third way is the so-called dye swap, which removes gene- or protein-specific dye effects. All these technical replicates are essential to produce significant data; they address measurement error or noise in experiments. Nevertheless, the major variability is caused by the different cells, tissues or organisms in one experiment. Hence, biological replicates are as relevant as technical ones. Statistical estimates consider a minimum of three replicates as sufficient. Intensity values of replicates are mostly averaged.
To minimize the number of arrays or gels needed in one experiment to produce that many replicates, a careful experimental design is necessary (Ehrenreich, 2006). The classical approach is to compare different samples to the same reference. This reference design needs a lot of sample material, since it uses half of it for measuring the control sample. A better variant is the loop design: each sample is compared to the following one and the last one is additionally compared to the first. This needs less material while creating more replicates than the reference design. Another experimental design is the all-pair design, in which all samples are compared to each other. It needs many arrays or gels but provides the most robust data.
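The sketch below contrasts the three designs by listing, for four invented samples, the two-channel hybridisations each one requires.

from itertools import combinations

samples = ["S1", "S2", "S3", "S4"]
reference = [("Ref", s) for s in samples]                      # reference design
loop = [(samples[i], samples[(i + 1) % 4]) for i in range(4)]  # loop design
all_pair = list(combinations(samples, 2))                      # all-pair design

print(len(reference), len(loop), len(all_pair))                # 4, 4 and 6 hybridisations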

Missing Value Imputation


Every microarray and 2D-PAGE experiment shows missing values. There are many reasons, such as the occurrence of spots with very low intensities or high background noise. Gels have the additional problem of
spot matching mentioned above. In research on human-pathogenic fungi, as in other research areas,
especially on the proteomic level, this problem is mostly ignored. Spots with missing values are excluded from
the analysis or are automatically substituted by zero or very small values within image analysis programs.
This ignores the correlation structure of microarray or gel data. Some more advanced analysis techniques,
like special normalization methods, clustering or principal component analysis, require complete data.
Many packages for statistical programming languages like MATLAB (The MathWorks, Inc., Natick,
USA) or R (The R Project for Statistical Computing, http://www.r-project.org/) exist and can be applied
with little effort. Two approaches for the imputation of microarray data (Troyanskaya et al., 2001), namely
the k-nearest neighbor (KNN) approach and imputation by singular value decomposition (SVD), were
introduced in 2001. KNN is used in a present study with A. fumigatus (Albrecht et al., in preparation).
Since 2001, many other methods for imputation have been developed. Among them are regression
methods like local least squares (LLS) imputation with several variants. Examples of non-regression
methods are Gaussian mixture modeling (GMC) and Bayesian principal component analysis (BPCA).
So far, few of them have been used on more than benchmark data. GMC has been used in one study with
A. fumigatus (Guthke, Kniemeyer, Albrecht, Brakhage, & Moeller, 2007).
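A minimal sketch of KNN imputation on a small log-intensity matrix is shown below. It uses scikit-learn's KNNImputer as a stand-in; the cited studies relied on their own implementations of the Troyanskaya et al. (2001) approach, and the toy matrix is invented.

```python
# KNN imputation sketch: genes in rows, arrays in columns, NaN = missing value.
import numpy as np
from sklearn.impute import KNNImputer

expr = np.array([
    [1.2, 1.1, np.nan, 1.3],
    [0.2, np.nan, 0.3, 0.1],
    [2.1, 2.0, 2.2, np.nan],
    [1.1, 1.0, 1.2, 1.4],
])

# Each missing value is replaced by the mean of the k most similar rows (genes),
# where similarity is computed on the arrays observed in both rows.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
complete = imputer.fit_transform(expr)
print(complete)
```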

Normalization and Transformation


Normalization of DNA microarray and 2D-PAGE data corrects for systematic biases that can result, for
example, from different amounts of RNA used, the array manufacturing method or hybridization conditions.
Several methods exist that have been developed for microarray data and can be applied to gel data as well
(Jung et al., 2006). There is no method that can be recommended for every study: different normalization methods may lead to different results when applied to the same dataset.
Whenever more than one microarray or gel is investigated, normalization must be done in two parts:
within and between arrays or gels. First, background correction has to be done within arrays or gels.
One possibility is to subtract the local background intensity in an area around a spot from the foreground
intensity of the spot itself. Afterwards, intensity-based filtering within an array or gel is done. This
excludes spots with very low intensities from the analysis; it has been shown that such spots disturb most
normalization and filtering methods. Next, the different channels or arrays and gels are scaled to reduce
individual channel, array or gel effects on spot intensities. The sum, mean, median or another statistical
measure of the intensities in both channels of one array or gel and in different arrays or gels is assumed
to be equal (Do & Choi, 2006). A normalization factor is calculated and all features are scaled according to this factor. Afterwards, ratios of the intensities of both channels or of two arrays or gels are usually
calculated. Ratios have the disadvantage of treating up- and downregulated features differently, leading to
values of 2 and 0.5, for example. By taking the logarithm of the ratios to base 2, such features get values of 1
and -1, respectively. Applying the logarithm also reduces variation in the variances of medium- and high-intensity spots. Additionally, it makes the data approximately normally distributed, which is important
for some filtering approaches.
In most studies on human-pathogenic fungi, normalization stops here. All methods for normalization
and transformation outlined above are implemented in image analysis programs. It has been shown several
times, for example for DIGE data, that normalization by image analysis programs does not remove all
bias in the data. More sophisticated normalization includes locally weighted linear regression (LOWESS),
variance stabilization (VSN) and quantile normalization as prominent methods. All three are freely
available within R packages. LOWESS is used within arrays or gels (Do & Choi, 2006). It corrects
intensity-dependent effects and can also remove spatial bias on arrays and gels. VSN is a method for
normalization between arrays or gels (Jung et al., 2006). It uses a multiplicative and additive model to
normalize data. The scaling factor reflects dye-specific gain and the additive offset removes background
fluorescence. Additionally, an asinh function is used instead of the logarithm of ratios to stabilize the variance of low-intensity spots, too. Quantile normalization works within and between arrays or gels. It
assumes that intensities (or means, medians or other statistical measures of intensities) have the same
empirical distribution across arrays or gels or across different channels.
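The sketch below strings the steps described above together on simulated two-channel intensities: background subtraction, log2 ratios and quantile normalization. It is a generic NumPy illustration, not the pipeline of any cited study, and all intensities are randomly generated.

```python
# Background correction, log2 ratios and quantile normalization on toy data.
import numpy as np

rng = np.random.default_rng(0)
fg_ch1 = rng.lognormal(7, 1, size=(1000, 4))   # spot foreground, channel 1, 4 arrays
fg_ch2 = rng.lognormal(7, 1, size=(1000, 4))   # spot foreground, channel 2
bg = rng.lognormal(4, 0.5, size=(1000, 4))     # local background estimate per spot

# Within-array steps: subtract local background, then take log2 ratios
corr_ch1 = np.clip(fg_ch1 - bg, 1, None)
corr_ch2 = np.clip(fg_ch2 - bg, 1, None)
log_ratio = np.log2(corr_ch1) - np.log2(corr_ch2)

def quantile_normalize(matrix):
    """Force every column (array or gel) to share the same empirical distribution."""
    ranks = np.argsort(np.argsort(matrix, axis=0), axis=0)
    mean_quantiles = np.sort(matrix, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

normalized = quantile_normalize(log_ratio)
print(normalized.shape)   # (1000, 4): features in rows, arrays in columns
```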
The result of normalization is a table of normalized intensities or ratios with all features, i.e., genes or
proteins, in rows and all conditions in columns. These features now have to be filtered to extract the
differentially expressed ones.

Filtering
Filtering of genes or proteins using a fixed fold-change threshold (i.e., a threshold on the ratio of spot intensities of two different
channels, arrays or gels) followed by t-test statistics is the most frequently used method of determining differential expression when working with data from human-pathogenic fungi. Usually a fold-change
threshold of 2 is used, because it has been shown that changes of microarray and 2D-PAGE gel data
are significant above this level. To detect more subtle regulation, Z-scores can be used (Quackenbush,
2002). Genes or proteins with Z-scores outside the range [-1.96, 1.96] are differentially expressed at the
95% confidence level. This way, the threshold is adapted and thus more specific to the particular dataset.
Z-scores have not yet played a large role in research on human-pathogenic fungi at the transcriptome and proteome
level. They have been applied in one proteomic study investigating the temperature resistance of A.
fumigatus (Kniemeyer et al., in preparation). Fold change and Z-score are well correlated. Both can be
modified to be intensity dependent and hence better reflect the data structure.
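As an illustration of the two criteria, the sketch below flags features by a fixed fold-change threshold of 2 and by the Z-score cut-off of 1.96 on simulated log2 ratios; data and thresholds only mirror the description above.

```python
# Fold-change and Z-score filtering on simulated mean log2 ratios.
import numpy as np

rng = np.random.default_rng(1)
log2_ratio = rng.normal(0, 0.5, size=2000)   # one mean log2 ratio per gene/protein
log2_ratio[:20] += 2.5                       # a few truly regulated features

# Fixed fold-change threshold of 2 corresponds to |log2 ratio| >= 1
fc_hits = np.abs(log2_ratio) >= 1

# Data-adapted threshold: Z-scores of the log2 ratios, 95% confidence cut-off
z = (log2_ratio - log2_ratio.mean()) / log2_ratio.std(ddof=1)
z_hits = np.abs(z) > 1.96

print(fc_hits.sum(), z_hits.sum())
```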
After feature selection using one of the above methods, a significance measure for the resulting list
of genes or proteins is needed. The t-test or analysis of variance (ANOVA) (Do & Choi, 2006; Jung et al.,
2006) are often used in studies with human-pathogenic fungi, as they are implemented in image analysis
programs. The t-test and ANOVA calculate one p-value for each feature. They do not account for the multiple
testing that is actually done (each feature is tested against the hypothesis of being differentially expressed).
The p-values have to be adjusted to be really meaningful. Several methods exist for such adjustment.
Bonferroni's technique multiplies all unadjusted p-values by the number of tests performed and is
therefore very strict. Other methods like Holm's technique or the Benjamini-Hochberg technique are more
practical. Until now, none of these methods has been applied to data from human-pathogenic fungi.
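The following sketch shows the adjustment step with SciPy t-tests and the multipletests function from statsmodels (Bonferroni, Holm, Benjamini-Hochberg). The data are simulated; as stated above, such corrections have not yet been applied to data from human-pathogenic fungi, so this is purely illustrative.

```python
# Per-feature t-tests followed by three p-value adjustment methods.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
group_a = rng.normal(0.0, 1.0, size=(500, 3))   # 500 features, 3 replicates each
group_b = rng.normal(0.2, 1.0, size=(500, 3))

# One t-test per feature yields one unadjusted p-value per feature
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)

for method in ("bonferroni", "holm", "fdr_bh"):  # Bonferroni, Holm, Benjamini-Hochberg
    rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, int(rejected.sum()))
```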

Interpretation and Visualization


Biological interpretation of gene and protein expression data is not always easy and straightforward.
Several methods aim at simplifying this process by dividing data into meaningful groups or reducing dimensionality. Others infer regulatory networks from expression matrices to explain biological
function. So far, all of these methods have been widely used on microarray data; application to gel data
will probably follow. They are based on the assumption that similar expression profiles indicate related
biological function.
Dividing data into different groups can be done by supervised or unsupervised machine learning
(Brazma & Vilo, 2000). Supervised learning needs prior knowledge of the classes in the data. It aims at constructing classifiers like linear discriminants, support vector machines (SVM) or decision trees that are
able to assign a given expression profile to one of the predefined classes. Decision trees have been used
for analyses with Aspergillus species, investigating antifungal drugs and the production of secondary metabolites. Unsupervised learning or clustering makes almost no assumptions about the data. The data are mined
for naturally occurring clusters, which are investigated for similarities within and dissimilarities between
them. There are many different clustering methods: hierarchical, partitioning or density based. Every
approach has strengths and weaknesses. Hence, it is recommended to validate cluster results by using
more than one method. In research on human-pathogenic fungi, clustering has been conducted several
times. It is not restricted to transcriptomics or proteomics; comparative genomics and metabolomics data of
these fungi have also been clustered. Mostly, hierarchical agglomerative methods were used. In
one study, Fuzzy C-means was applied (Guthke et al., 2007).
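A minimal sketch of hierarchical agglomerative clustering with SciPy follows, using 1 - Pearson correlation as the distance between expression profiles, a common choice; it is a generic illustration on simulated profiles, not the clustering of any cited study.

```python
# Hierarchical clustering of simulated expression profiles (genes x time points).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
profiles = np.vstack([
    rng.normal(0, 0.3, size=(30, 6)) + [2, 2, 2, 0, 0, 0],  # genes induced early
    rng.normal(0, 0.3, size=(30, 6)) + [0, 0, 0, 2, 2, 2],  # genes induced late
])

dist = pdist(profiles, metric="correlation")    # 1 - Pearson correlation
tree = linkage(dist, method="average")          # agglomerative, average linkage
labels = fcluster(tree, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])                  # sizes of the two recovered clusters
```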
Principal component analysis (PCA) and correspondence analysis (CA) (Fellenberg, Hauser, Brors,
Neutzner, Hoheisel, & Vingron, 2001) are techniques of dimension reduction. They try to visualize
data in two or three dimensions while preserving the commonalities and differences of the higher dimensions
as well as possible. They display those dimensions that account for the maximum amount of variation in a dataset. It is possible to display features in the space of experimental conditions or experimental
conditions in feature space. Both methods can also scale data such that rows and columns are treated
equivalently. In this way, they can visualize features and experimental conditions simultaneously in the
same bi-plot. PCA was used, for example, to differentiate clinical strains of A. fumigatus from environmental ones, to measure the influence of growth conditions on C. albicans and several filamentous fungi,
or to visualize the serum response to the C. albicans cell wall. CA was used, for example, to type isolates of
A. fumigatus or to separate 2D-PAGE gel profiles of patients suffering from invasive candidiasis from
those of control patients.
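For dimension reduction, the sketch below projects simulated experimental conditions onto the first two principal components with scikit-learn; correspondence analysis is not covered, and the data are invented.

```python
# PCA of conditions: each column of the expression matrix becomes one 2D point.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
expr = rng.normal(0, 1, size=(1000, 8))                 # 1000 features x 8 conditions
expr[:, 4:] += rng.normal(1.5, 0.2, size=(1000, 4))     # conditions 5-8 behave differently

pca = PCA(n_components=2)
coords = pca.fit_transform(expr.T)           # conditions as observations
print(coords.shape)                          # (8, 2): one point per condition
print(pca.explained_variance_ratio_)         # variation captured by each component
```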
Network inference is a reverse engineering method (Filkov, 2005). Nowadays, it is mostly conducted
on transcriptomic data. There are several different methods for the inference of Boolean or continuous,
deterministic or stochastic networks, and of directed or undirected graphs. One method, namely network
inference using differential equations, has been applied to data from A. fumigatus (Guthke et al.,
2007). There, the unique temperature resistance of the fungus was investigated. The same method is
currently being applied to analyze the response of C. albicans to blood and serum.
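As a generic illustration of network inference with differential equations (not the algorithm of Guthke et al., 2007), the sketch below simulates a small linear system dx/dt = A x and re-estimates the interaction matrix A from the time course by least squares.

```python
# Toy network inference: fit dx/dt = A x from a simulated 3-gene time course.
import numpy as np

rng = np.random.default_rng(5)
true_A = np.array([[-0.5,  0.0,  0.8],
                   [ 0.6, -0.7,  0.0],
                   [ 0.0,  0.4, -0.6]])

dt, steps = 0.1, 50
x = np.zeros((steps, 3))
x[0] = [1.0, 0.2, 0.1]
for t in range(steps - 1):                       # simple Euler integration
    x[t + 1] = x[t] + dt * true_A @ x[t]
data = x + rng.normal(0, 0.01, size=x.shape)     # add measurement noise

dX = np.diff(data, axis=0) / dt                  # finite-difference derivatives
X = data[:-1]
A_transposed, *_ = np.linalg.lstsq(X, dX, rcond=None)   # solves dX ~ X @ A.T
print(np.round(A_transposed.T, 2))               # recovered interaction matrix
```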
All results obtained by the methods described above are only first steps on the way to understanding the
underlying biological processes. The workflow in Figure 3 shows the path from experimental data to
mathematical models. These models should formulate testable hypotheses and be applied to design
further biological experiments to finally validate the hypotheses.

DATA INTEGRATION
Benefits and Problems
The flow of information between different cellular levels is complex. Especially for transcriptomic
and proteomic data, it has been shown that the correlation is low to moderate for most genes and their corresponding proteins.
Integration of both types of data therefore clearly provides additional information. Data from different cellular levels
may even be informative in a complementary manner. In particular, observed dissimilarities between the
monitored behaviour on the transcriptomic and proteomic level can reveal important post-transcriptional
regulatory junctures (Chan, 2006). It has also been described that data from one particular approach can
be erroneous or incomplete due to technical problems. Some results may not be reliable even though
replicates are used. In 2D-PAGE gels, for example, some proteins cannot be found because of their low
abundance or extreme values of isoelectric point and molecular weight. Results coming from another
cellular level and obtained with a different technology can correct, verify, and complete the data.

Figure 3. Iterative process of data analysis
All these advantages make the integrated analysis of more than one type of data desirable (Figure
4). Still, no integrated analyses of data from human-pathogenic fungi can be found in the literature. A few
approaches have been made to integrate heterogeneous data from S. cerevisiae. A reason for this may
be the fact that data integration is not simple. One problem lies in the different half-lives of mRNAs and
proteins in the cell: mRNAs have a half-life of minutes to hours, proteins of minutes to sometimes
days. It has been shown that in some organisms the time delay of protein formation after mRNA production
can be several hours, which complicates the situation even more. When making snapshots of
a cellular situation, as is done by analysing microarrays and gels, it is difficult to conclude whether
the level of a particular protein really reflects momentary processes on the mRNA level or whether it is an
artifact of earlier states. This problem can partially be avoided by analyzing time courses and trends
in regulation. Time courses have the additional advantage of directly reflecting the dynamic structure
of biological systems.
The above issues reflect biology and the difficulty of correlating measurements with actual processes in the
cell. There are also some technical hindrances. The technical quality of DNA microarrays and 2D-PAGE
is hard to compare. Careful thought about, and adaptation of, all steps of data preprocessing is important to
account for the specifics of each technology. Another problem is the availability and accuracy of cross-references between transcripts and corresponding proteins. In human-pathogenic fungi, for many genes
no protein is known yet. This complicates the interpretation of transcriptomic results and their validation by
proteomic data. Another difficulty is the heterogeneity of data formats. Different platforms for microarrays
or two-dimensional gels produce data in different formats. They are difficult to compare within one cellular level and even more so between two levels. To retrieve all information from the different formats and to
be able to reliably interpret this information, the implementation of standards and ontologies is needed.

Figure 4. Integration of data from different cellular levels

Standards and Ontologies


Transcriptomic and proteomic data are highly context dependent. Their interpretation and replication
require detailed knowledge of the experimental design, samples, and protocols. Standards to achieve accurate and consistent annotation of experiments are therefore necessary.
Several international consortia exist that address the issue of standardizing microarray data (Kawasaki, 2006). The Microarray Gene Expression Data (MGED) Society (http://www.mged.org) is the most
prominent one. It is an international organization of biologists, computer scientists, and data analysts
that has existed since 1999. It aims at facilitating the sharing of microarray data by establishing standards for data
annotation and exchange, as well as the creation of databases and software implementing these standards.
MGED develops a three-component standard. Minimum information about a microarray experiment
(MIAME) is a guideline for the minimal information essential for unambiguous interpretation and reproduction of microarray data. It covers the experimental design, array design, samples used, and every
experimental step including data analysis. The microarray gene expression object model (MAGE-OM) and
the markup language (MAGE-ML) were developed for data exchange. The MGED ontology (MO) comprises
common terms and annotation rules to describe experiments. Since 2002, several journals have demanded
that publications be accompanied by data in MIAME-compliant format stored in databases supporting
this standard. This includes data from human-pathogenic fungi.
Control and standardization of proteomic data is not mature yet. The Proteomics Standards Initiative
(PSI, http://www.psidev.info) of the Human Proteome Organization (HUPO) has been an open community effort defining standards for data representation since 2002. It is working on a package for
proteomic data similar to that of MGED for microarray data. It consists of MIAPE, a counterpart to MIAME, data
exchange formats, and an ontology. Standards documents for MS data have already been published. MIAPE:
Gel electrophoresis, version 1.2, was posted on the Nature Biotechnology website (http://www.nature.com/nbt/consult/index.html) in December 2006. It is open for community comments and will be
published after revision. Standards for other parts of proteomic research, like sample preparation and
proteomic informatics, are under way.
Both consortia not only develop their own standards and ontologies but also incorporate and
improve existing ones like the functional genomics experiment model specification (FuGE) or the gene ontology
(GO). A goal for future work of the MGED Society and HUPO PSI is to work together on an internationally accepted standard for transcriptomic, proteomic and possibly other omics data. All these
standards and ontologies are general for all applications in biology. They are not restricted to research
on human-pathogenic fungi.

Databases and Data Warehouses


The huge amount of biological data makes it necessary to store them in local or global databases. There
are many databases and data warehouses for molecular biology available via the internet (Galperin, 2008).
Important for nearly every research project in this area are, for example, NCBI GenBank (http://www.ncbi.nlm.
nih.gov/Entrez), with all known nucleotide and protein sequences, and UniProt (http://www.uniprot.org),
the universal protein knowledgebase. Various degrees of data integration can be found in databases
like these.
Only a few databases are specifically important for experimental transcriptomic and proteomic data
from human-pathogenic fungi (Table 1).
Data warehouses incorporating experimental data from different cellular levels of these fungi are
not yet available via the internet. One data warehouse is being developed (Albrecht, Kniemeyer, Brakhage, Berth, & Guthke, 2007). Functional genomics databases containing data from different
cellular levels of other organisms are also not very common. A reason may be the difficulty of storing such
heterogeneous, semi-structured, complex data. They have to be stored and displayed in a way that
helps scientists to understand and interpret complex observations. A combination of optimal storage
(warehousing, information management) and biological reality is desirable. Basic genomic data have to
be provided in addition to experimental data. They may be imported from sequence databases like
GenBank and have to be kept up to date. Annotation of the data has to be provided as well, along with direct
access to experimental information in a standardized format. Automatic annotation is not accurate and
comprehensive enough, while manual annotation is very time consuming. When building a data warehouse,
some basic analysis tools are also desirable. Incorporating all these aspects into one database or data warehouse is difficult and labor intensive.

Table 1. Databases and data warehouses important for research on human-pathogenic fungi; extended
extract of the supplement to Galperin (2007)
Name | Description | URL

transcriptomic databases
ArrayExpress | Public collection of microarray gene expression data | http://www.ebi.ac.uk/microarrayas/ae/
GEO | Gene expression omnibus: Gene expression profiles | http://www.ncbi.nlm.nih.gov/geo

proteomic databases
SWISS-2DPAGE | Annotated 2D gel electrophoresis database | http://www.expasy.org/ch2d
2D-PAGE | Proteome database system for microbial research | http://www.mpiib-berlin.mpg.de/2D-PAGE

fungal databases
CandidaDB | Candida albicans genome database | http://genodb.pasteur.fr/cgi-bin/WebObjects/CandidaDB
Candida Genome | Candida albicans genome database | http://www.candidagenome.org/
Génolevures | A comparison of S. cerevisiae and 14 other yeast species, including C. glabrata | http://cbi.labri.fr/Genolevures/
CADRE | Central Aspergillus data repository | http://www.cadre-genomes.org.uk/
PHI-base | Genes affecting fungal pathogen-host interactions | http://www.phi-base.org/
e-fungi | Genomes and functional genomic data of different fungal species | http://www.e-fungi.org.uk/database.html

FUTURE TRENDS
Systems biology of human-pathogenic fungi is just emerging. Techniques are still in their infancy
and have to become part of the daily routine of researchers in this field. DNA microarray technology is
well established but still far from being fully exploited (Hoheisel, 2006). 2D-PAGE, too, still suffers
from many unsolved problems and is not able to display the whole proteome of an organism. In
the future, research on human-pathogenic fungi should improve both techniques but also make use of
alternatives like SAGE and LC-MS/MS. This will provide a broader view of specific research topics
than we have today.
Another issue for future work is to analyze both time-series and spatial data. Compartmentalization
is an important issue for future data acquisition, analysis and modeling. Proteomics can reveal the localization of proteins in a cell, but this is rarely exploited at present. Additionally, the localization of the pathogen within
its host must be analyzed. A. fumigatus and C. albicans can invade tissue and spread over the whole
body. This process has to be investigated not only with regard to time but also to localization.
Furthermore, research has to include more organisms. A. fumigatus and C. albicans are important
fungal pathogens, but not the only ones. For example, C. glabrata, a pathogenic yeast closely related to S.
cerevisiae, could be an interesting model organism. Systems biology is most advanced for S. cerevisiae.
Methods should be easily applicable to C. glabrata and could reveal basic pathogenicity mechanisms
via comparison to the non-pathogenic yeast.
It is also necessary to include data from research on the host side. A. fumigatus and C. albicans are
opportunistic fungi. The status of the host's immune system plays a major role in infection and therefore
has to be included in a holistic model, too.
Even more importantly, data have to be integrated. It has been shown several times that transcriptomic
and proteomic data are complementary. One cellular level alone is not sufficient to fully understand the
infection process. At the moment, integration of data is mostly done to support findings of one type of
data with another one. This should clearly be extended to integrated models of more than one cellular
level. Gene regulatory network models as in Guthke et al. (2007) are valuable and bring new knowledge,
but they need to be extended to proteins to reflect biological reality more closely. Databases and data
warehouses will play a vital part in this.
Integration of different techniques and approaches requires interdisciplinarity between biologists
and bioinformaticians of several fields to a much greater extent than today. Integration of knowledge
from different fields of biology with tools from mathematics and computer science is essential. This also
includes the application and combination of bottom-up and top-down approaches. We have made only a
few steps in the direction of systems biology so far. Further steps are necessary to understand the infection
process of human-pathogenic fungi and to cure or even prevent mycoses.

CONCLUSION
We know a biological system when we can redesign it and predict its resulting properties. Systems
biology of human-pathogenic fungi, as in most other medical research areas, has not come that far yet.
Fully sequenced genomes of some fungal pathogens have been available as a basis for a few years. In many cases,
assembly or annotation is not fully finished yet. Data from other cellular levels are being collected and
analyzed. Most work is currently done by perturbing systems and measuring the response on the transcriptomic
or proteomic level. Techniques for this part have been developed and are widely applied. We are just
starting to reach beyond this mere data collection to understand the complex system of infection. Until
now, only little effort has been made in data integration and modeling. Some approaches exist and have
been applied to S. cerevisiae as a model organism. They can be applied to data from human-pathogenic
fungi in the future.

REFERENCES
Aggarwal, K., & Lee, K. H. (2003). Functional genomics and proteomics as a foundation for systems
biology. Briefings in Functional Genomics and Proteomics, 2(3), 175-184.
Albrecht, D., Kniemeyer, O., Brakhage, A. A., Berth, M., & Guthke, R. (2007). Integration of transcriptome and proteome data from human-pathogenic fungi by using a data warehouse. Journal of
Integrative Bioinformatics, 4(1), 52.
Brakhage, A. A. (2005). Systemic fungal infections caused by aspergillus species: Epidemiology, infection process and virulence determinants. Current Drug Targets, 6(8), 875-886.
Brazma, A., & Vilo, J. (2000). Gene expression data analysis. Microbes and Infection, 3(10), 823-829.
Chan, E. (2006). Integrating transcriptomics and proteomics. Genomics and Proteomics, 6(3), 20-26.
Do, J. H., & Choi, D. K. (2006). Normalization of microarray data: Single-labeled and dual-labeled arrays. Molecules and Cells, 22(3), 254-261.
Ehrenreich, A. (2006). DNA microarray technology for the microbiologist: An overview. Applied Microbiology and Biotechnology, 73(2), 255-273.
Fellenberg, K., Hauser, N. C., Brors, B., Neutzner, A., Hoheisel, J. D., & Vingron, M. (2001). Correspondence analysis applied to microarray data. Proceedings of the National Academy of Sciences of
the United States of America, 98(19), 10781-10786.
Filkov, V. (2005). Identifying gene regulatory networks from gene expression data. In Aluru (Ed.), Handbook of Computational Molecular Biology (pp. 27-1 - 27-30). Florida: Chapman & Hall/CRC Press.
Galperin, M. Y. (2008). The molecular biology database collection: 2008 update. Nucleic Acids Research,
36(Database issue), D2-D4.
Guthke, R., Kniemeyer, O., Albrecht, D., Brakhage, A. A., & Moeller, U. (2007). Discovery of gene
regulatory networks in aspergillus fumigatus. Lecture Notes in Bioinformatics, 4366, 22-41.
Hoheisel, J. D. (2006). Microarray technology: Beyond transcript profiling and genotype analysis. Nature
Reviews. Genetics, 7(3), 200-210.
Hube, B. (2004). From commensal to pathogen: Stage- and tissue-specific gene expression of candida
albicans. Current Opinion in Microbiology, 7(4), 336-341.
Ideker, T., Galitski, T., & Hood, L. (2001). A new approach to decoding life: Systems biology. Annual
Review of Genomics and Human Genetics, 2, 343-372.
van het Hoog, M., Rast, T. J., Martchenko, M., Grindle, S., Dignard, D., Hogues, H., et al. (2007). Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes.
Genome Biology, 8(4), R52.
Jung, K., Gannoun, A., Sitek, B., Apostolov, O., Schramm, A., Meyer, H. E., et al. (2006). Statistical
evaluation of methods for the analysis of dynamic protein expression data from a tumor study. REVSTAT Statistical Journal, 4(1), 67-80.
Kawasaki, E. S. (2006). The end of the microarray tower of babel: Will universal standards lead the
way? Journal of Biomolecular Techniques, 17(3), 200-206.
Kullberg, B. J., & Oude Lashof, A. M. L. (2002). Epidemiology of opportunistic invasive mycoses.
European Journal of Medical Research, 7(5), 183-191.
Latgé, J. P. (1999). Aspergillus fumigatus and aspergillosis. Clinical Microbiology Reviews, 12(2), 310-350.
Lockhart, D. J., & Winzeler, E. A. (2000). Genomics, gene expression and DNA arrays. Nature,
405(6788), 827-836.
Marouga, R., David, S., & Hawkins, E. (2005). The development of the DIGE system: 2D fluorescence
difference gel analysis technology. Analytical and Bioanalytical Chemistry, 382(3), 669-678.
Mavor, A. L., Thewes, S., & Hube, B. (2005). Systemic fungal infections caused by Candida species:
Epidemiology, infection process and virulence attributes. Current Drug Targets, 6(8), 863-874.
Nierman, W. C., Pain, A., Anderson, M. J., Wortman, J. R., Kim, H. S., Arroyo, J., et al. (2005). Genomic sequence of the pathogenic and allergenic filamentous fungus aspergillus fumigatus. Nature,
438(7071), 1151-1156.
Nucci, M., & Marr, K. A. (2005). Emerging fungal diseases. Clinical Infectious Diseases, 41(4), 521-526.
Pandey, A., & Mann, M. (2000). Proteomics to study genes and genomes. Nature, 405(6788), 837-846.
Quackenbush, J. (2002). Microarray data normalization and transformation. Nature Genetics, 32, 496-501.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., et al. (2001). Missing
value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520-525.

Key Terms
Data Warehouse: Special type of database for the storage of heterogeneous data from different sources.
Data warehouses are optimized for supporting and including data analyses.
Database: Structured collection of large amounts of data. Queries can be run to collect and display
subsets of the data to many users.
Filtering: Process of separating different types of data from the whole. A filtering criterion has to
be applied, the choice of which is critical for results.
Genomics: Determination and investigation of the entire genome of an organism. Sequencing is a
large part of it; however, the functional characterization of genes remains a big challenge.
Human-Pathogenic Fungi: Fungal species that cause diseases in humans. They can be environmental and act as external pathogens, like A. fumigatus, or they can be commensals of humans and
act as endogenous pathogens, like C. albicans.
Imputation: Substitution of some values for missing data. When missing values are imputed, data
can be analyzed using standard techniques for complete data.
Normalization: Process of removing bias in data. Normalization tries to minimize the influence of
measurement error on results of analyses.

Proteomics: Large-scale study of the proteins in a cell, tissue or organism under certain conditions.
It is more difficult and less mature than transcriptomics. Proteins are the actors in nearly all biological
processes.
Transcriptomics: Examination of mRNA expression levels in a given cell population. mRNA content
and distribution of a cell closely reflects the activation status of its genes.


Chapter XXV

Development of Specific
Gamma Secretase Inhibitors
Jessica Ahmed*
Charité Universitaetsmedizin Berlin, Germany
Julia Hossbach*
Charité Universitaetsmedizin Berlin, Germany
Paul Wrede
Charité Universitaetsmedizin Berlin, Germany
Robert Preissner
Charité Universitaetsmedizin Berlin, Germany

Abstract
Secretases are aspartic proteases which specifically trim important, medically relevant targets such
as the amyloid precursor protein (APP) or the Notch receptor. Therefore, changes in their activity can
lead to dramatic diseases like Alzheimer's disease, caused by the aggregation of peptidic fragments. On the other
hand, the secretases are interesting targets for molecular therapy of multiple myeloma, because the
over-expressed Notch receptor does not emerge into its native conformation until cleavage by
presenilin, the active and catalytic subunit of the gamma secretase, occurs. Here, we focus on a novel
methodology of structure-based drug development that is feasible without prior knowledge of the target structure: analogy modeling. This combination of similarity screening, fold recognition, ligand-supported
modeling, and docking is exemplarily illustrated for the structure of presenilin and specific inhibitors
thereof.

Introduction
Aspartic proteases have received considerable attention as potential targets for pharmaceutical intervention, since many play important roles in physiological and pathological processes. Despite numerous efforts, the only inhibitors of aspartic proteases currently on the market are directed against the
HIV protease, an aspartic protease of viral origin (Eder et al. 2007). All other known aspartic protease
inhibitors, including those targeting renin, BACE1 and the gamma secretase (Tsai et al. 2002), have not yet
passed clinical or preclinical development due to problems regarding their specificity.
Alzheimer disease (AD) is the most frequent cause of dementia. About five million patients in the
seven largest Western economies suffer from that disease. The common form affects humans over 60
years of age and its incidence increases as age advances. AD is characterized by a progressive loss of
short-term memory and impaired cognitive function. In later stages, additional symptoms aggravate the
situation and patients become totally unable to care for themselves. AD is associated with an accumulation
of amyloid plaques and neurofibrillary tangles in the brain. These morphological alterations are believed
to be causally related to the neurodegenerative process. Beta-amyloid is produced by proteolytic
cleavage of the amyloid precursor protein (APP), first by the beta and then by the gamma secretase (Figure 1).
Beta secretase (also known as BACE1, Asp2 or memapsin) is a transmembrane aspartyl protease that
generates the N-terminus of beta-amyloid by cleaving APP on the luminal or extracellular side into beta-APP and C99. BACE1 is a prominent target for the treatment and prevention of AD. Since it catalyses a
pivotal step in amyloid production, its inhibition should have a positive impact on the progression of the
disease (Sinha et al. 1999). In detail, the molecular cause of severe AD is the aggregation of short
peptides with a length of 42 amino acids. These amyloid-plaque-forming peptides are a fragmentation product of the C99 peptide after hydrolysis by the gamma secretase. The regular function of this
protease generates the soluble beta-peptide of 40 amino acids in length.

Figure 1. Processing scheme for the β-amyloid precursor protein. APP (amyloid precursor protein)
is processed in two steps. First, the beta secretase generates two fragments, the beta-amyloid precursor
peptide and the C-terminal fragment (CTF-β) C99. In a second step, the gamma secretase
cleaves the C99 fragment into an Aβ fragment and the AICD (APP intracellular domain). Sometimes the
gamma secretase generates an Aβ fragment with 42 amino acids instead of 40. The Aβ42 peptide aggregates
rapidly into amyloid plaques, with the fatal consequence that the nerve cell degenerates (figure adapted from
Wrede, 2005).

Role of Secretases in Cancer


Multiple myeloma (also known as plasma cell myeloma or Kahler's disease) is a type of cancer of
plasma cells, which are immune cells in the bone marrow that produce antibodies. Multiple myeloma is the
second most prevalent blood cancer (10%) after non-Hodgkin's lymphoma. It represents approximately
1% of all cancers and 2% of all cancer deaths.
Myeloma is regarded as incurable and, therefore, novel therapeutic approaches like proteasome
inhibitors or specific gamma secretase inhibitors are in great demand.
The gamma secretase complex is a multi-component intramembrane aspartyl protease that cleaves
the amide bonds of its substrates within their transmembrane regions (Wolfe 1999, Wolfe 2001a). Both
Notch and the beta-amyloid precursor protein (APP) are cleaved by presenilin, the catalytic subunit of
the gamma secretase, which, along with nicastrin, Aph-1 and Pen-2, is necessary for the protease activity
(Wolfe 2001b). Its activity towards the Notch receptor, which is involved in gene regulation mechanisms controlling multiple cell differentiation processes, renders it interesting as a cancer target. The
differential effects of kinase inhibitors on beta-amyloid precursor protein processing (without influence
on Notch cleavage) are promising for the therapy of Alzheimer's disease. For the therapy of cancer,
however, modulators are in great demand that specifically inhibit the cleavage of Notch, but not of APP
(Geling et al. 2002). Here, we present a successful strategy concatenating various in silico and in vitro
methods to develop and validate specific gamma secretase inhibitors.
The highly conserved Notch receptor is a transmembrane heterodimeric receptor, of which there are
four distinct members (Notch 1-4). The physiologic functions of Notch signaling are multifaceted,
including the maintenance and regulation of stem cells and of differentiation, as well as a role in oncogenesis. Binding of
a Notch ligand to the receptor leads to Notch signaling by release of the intracellular domain of the Notch receptor through a cascade of proteolytic cleavages by both the alpha secretase and the gamma secretase.
Signaling is initiated through contact between the receptor and its ligands. Receptor-ligand interaction
leads to successive cleavage events, of which the third is mediated by the gamma secretase. Following
this cleavage, the intracellular domain of Notch translocates to the nucleus and acts as a transcriptional
coactivator (Shih 2007).

Figure 2. Notch signaling pathway. Presenilin is associated with the gamma secretase activity involved in
signaling by the transmembrane protein Notch. The large cell surface protein Notch is activated by contact
with membrane-bound ligands on neighboring cells. Binding of Delta/Serrate/Lag-2 by Notch results in
three proteolytic cleavages of Notch. A furin-like enzyme cleaves Notch constitutively adjacent to the
amino acid sequence RQRR in the extracellular domain (Jundt et al. 2002). The second cleavage removes
the extracellular domain and is catalyzed by a metalloprotease termed TACE or alpha secretase. The third
cleavage is carried out by a gamma secretase activity that is dependent on presenilin and is responsible for
the release of the Notch intracellular domain (NICD). The NICD then moves to the nucleus, where it is
involved in transcriptional regulation (Shih 2007). (Figure adapted from the Biocarta Pathway Collections,
http://www.biocarta.com/genes/allPathways.asp)

Furthermore, Notch is involved in the Wnt pathway through beta-catenin, a cytoskeletal component
which enters the nucleus to act as a transcriptional cofactor. Upon binding of Wnt to the receptor Frizzled,
the activity of glycogen synthase kinase 3 (GSK-3) is inhibited. Phosphorylation of beta-catenin
induces ubiquitination and proteolytic degradation of beta-catenin by the proteasome, while non-phosphorylated beta-catenin is stable and therefore able to enter the nucleus to regulate transcription by
activating genes responsible for cell survival, proliferation and differentiation during development
(Hayward et al. 2008). Notch signaling plays an important part in cancer development; consequently,
targeting Notch signaling steps can have an anti-tumor effect. Targeting the gamma secretase is therapeutically important, as the gamma secretase is necessary for the activation of all four Notch receptors and in
many cancers more than one Notch homologue is expressed.
Recently, the oncogenic potential of Notch has been analyzed in multiple myeloma and, furthermore,
it has been discovered that treatment with a specific gamma secretase inhibitor induces apoptosis in
myeloma cells via specific inhibition of Notch signaling (Nefedova et al. 2004). This cytotoxic effect
can be explained by the upregulation of the proapoptotic protein Noxa. Furthermore, Nefedova et al.
could show that pharmacologic inhibition of Notch signaling may enhance the effect of chemotherapy
in multiple myeloma via upregulation of Noxa (Nefedova et al. 2008).

Drug development goals


One of the most important goals in drug development is the identification of compounds with simultaneously
high target affinity and specificity. A comprehensive compound library aims at covering the entire chemical
space. Very often, putative compounds have high affinity but also affect all structurally similar targets,
which may lead to undesired adverse effects. Therefore, a balance between affinity and specificity is
sought. Moreover, a specific modulator can be better than an inhibitor with high affinity. Today, many
computational methods are available to propose new putative drugs starting from a lead structure. The
2D-similarity search identifies new compounds with higher affinity than the lead structure, but similar
specificity, because the results share the same scaffold. To detect scaffold hoppers (structurally or chemically related compounds with deviating scaffolds), 3D-similarity screening is the method of choice.
Another new approach is fragment-based drug design, which increases the affinity and combines
different specificities of the fragments. The fragmentation of lead structures into smaller pieces has
been used to simplify the analysis of ligand binding and to define the different pharmacophoric elements
necessary for high-affinity binding (Hajduk 2007). Nowadays, the pharmaceutical industry holds large
libraries of hits from high-throughput screening assays. These medium-affinity hits could be
merged into more specific leads exhibiting affinities improved by orders of magnitude.

Analogy modeling and drug design principle


Considering the fact that the gamma secretase has an outstanding role in many diseases like Alzheimer's
disease or cancer, it is of great medical interest to develop inhibitors or modulators. For structure-based drug design, a 3D structure of the gamma secretase, or at least of its catalytic subunit,
the presenilin, has to be available. But until now, no crystal structure of the gamma secretase could
be determined, which makes it difficult to develop new inhibitors by bioinformatical methods. In
the following, it is described how different approaches can be combined into a new methodology,
called analogy modeling, which enables structure-based drug design without prior knowledge of the target
structure (Figure 4). Analogy modeling combines methods such as structure prediction and similarity
searching by 2D- and 3D-screening.

Figure 3. Drug development goals. To find new drugs, a comprehensive compound library aims at
covering the chemical space (x-axis). The goal is to find putative drugs with high specificity (z-axis)
and high target affinity (y-axis). The 2D-similarity screening locates new compounds with higher affinity
than the lead structure, but identical specificity. The second mountain illustrates a different scaffold,
which can be reached by 3D-similarity searching. Another new approach is fragment-based drug design,
which increases the affinity and combines different specificities of the fragments (highest point of the
bridge between the two scaffolds).

Structure prediction
Fold Recognition / Threading
The modeling of a structure for the active subunit of the gamma secretase (presenilin) requires a known
structure with a similar fold. Fold recognition uses a sequence-based property profile of the target,
presenilin, which is threaded through all experimentally determined structures from the Protein Data
Bank. In this case, the translocon, a transmembrane protein, was identified as compatible with the
structural requirements. The translocon is a multifunctional complex involved in regulating the interaction of ribosomes with the endoplasmic reticulum. Furthermore, the translocon is responsible for the
correct orientation of membrane proteins (Skach 2007).


Figure 4. Analogy modeling of a presenilin inhibitor. Analogy modeling is a combination of different methodologies to create a structure of a target and to determine new compounds for its inhibition.
For structure prediction, in the case that only a protein with similar structural requirements exists, fold
recognition and ligand-supported binding site modeling are combined. Different screening methods like
3D- and 2D-similarity search predict new putative compounds starting from a lead structure, which additionally helps to refine the binding pocket of the target. Furthermore, in vitro and in vivo experiments
as well as X-ray crystallography and NMR can help to refine and validate the target structure and the putative compounds.
In vitro assays can be used to quickly sort out non-binders. X-ray analysis and NMR titration are used to
identify and verify the binding mode of the inhibitors. The result is an optimized target (presenilin) model
and putative active ligands.

Ligand-Supported Binding Site Modeling


The modeling of the binding site was aided by the availability of the crystal structure of the beta secretase. Both enzymes, the gamma and the beta secretase, have two catalytic aspartate residues in their binding
sites, which are necessary for the catalytic action. The geometry of the catalytic site of the beta secretase
was transferred into the structural arrangement of the translocon.

In silico screening - similarity search


In contrast to the experimental identification of new compounds (in vivo screening), searching
for compounds by bioinformatical methods is called in silico screening. The experimental procedure
to search for new drugs usually involves the labor- and cost-intensive screening of huge libraries of
chemicals in biological high-throughput screening (HTS). The bioinformatical approach likewise involves
the screening of whole libraries, but with effective algorithms that enable the researcher to identify
compounds with high structural similarity to known effective substances. With this method, compounds
can be found that exhibit the same or even higher efficiency but better bioavailability or fewer toxic effects. Compared to biological HTS, the bioinformatical screening is fast and affordable, even for
academic use. For the determination of new drug candidates, 2D- and 3D-similarity searching are
well-established methods. Substances that show high similarity, on both the 2D- and 3D-level, promise
to have similar properties compared to the lead structure (Lyne 2002). Therefore, known inhibitors of
the gamma secretase were used to search compound databases containing millions of small molecules.
Examples for databases with large amounts of compounds are PubChem (http://pubchem.ncbi.nlm.
nih.gov/; Lazo et al. 2006), a database of chemical structures of small organic molecules, SuperDrug
(http://bioinformatics.charite.de/superdrug/; Goede et al. 2005), a conformational drug database, and SuperNatural (http://bioinformatics.charite.de/supernatural/; Dunkel et al. 2006), a searchable database
of available natural compounds.

Figure 5. Structure of the presenilin and the translocon. (a) The presenilin is a transmembrane protein
with ten helices. (b) A compatible transmembrane protein is the translocon. The figure shows the model
of the catalytic subunit of the gamma secretase, which was built according to the suitable part of the
translocon. The two aspartate residues of the catalytic site (taken from the beta secretase) are shown
in stick representation.

2D - similarity searching
Similarity screening is an established method for identifying structures with high similarity, assuming
that this leads to similar properties of the compounds. To identify compounds with the desired structures
in the databases used for screening, a twofold method was applied. Firstly, the concept of
the 'structural fingerprint', a bit vector which encodes chemical and topological characteristics of a
molecule, was used. This structural fingerprint was calculated for the lead structure and then
used to search through the databases. Secondly, this molecule was compared to the compounds
of the various databases by using the Tanimoto coefficient. This coefficient gives values in the range of
zero (no bits in common) to unity (all bits the same). It is also known as the Jaccard coefficient and, when
used to measure dissimilarity rather than similarity, as the Soergel distance. The Tanimoto coefficient
was used in some of the earliest studies of fingerprint-based similarity and is now the coefficient of
choice in different software systems for chemical information management (Willett 2006).
The Tanimoto coefficient can be calculated as follows:
T =

N ab
N a + Nb N ab

where N_a is the number of bits set to 1 in compound a, N_b is the number of bits set to 1 in compound
b, and N_ab is the number of bits set to 1 in both compound a and compound b. Only
compounds with a 2D-similarity > 85% were considered. The 2D-similarity search can be used to sample
the chemical space. Of course, the 2D-similarity does not give any information about the spatial similarity of two compounds, but it is a valuable tool to recognize chemical similarity. To consider structural
similarity, 3D-similarity searching has been applied, which is described in the following.
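A minimal sketch of the Tanimoto calculation on binary fingerprints follows; the fingerprints are random toy bit vectors rather than real structural fingerprints, and the 85% cut-off mirrors the threshold mentioned above.

```python
# Tanimoto similarity between binary fingerprints: T = N_ab / (N_a + N_b - N_ab).
import numpy as np

def tanimoto(fp_a, fp_b):
    n_a = int(fp_a.sum())
    n_b = int(fp_b.sum())
    n_ab = int(np.logical_and(fp_a, fp_b).sum())
    return n_ab / (n_a + n_b - n_ab)

rng = np.random.default_rng(6)
lead = rng.integers(0, 2, size=1024).astype(bool)           # fingerprint of the lead structure
library = rng.integers(0, 2, size=(5, 1024)).astype(bool)   # fingerprints of database compounds

scores = [tanimoto(lead, fp) for fp in library]
hits = [i for i, t in enumerate(scores) if t > 0.85]         # 85% similarity cut-off from the text
print(scores, hits)
```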

3D - similarity searching
To find new scaffold hoppers, a 3D-similarity search was performed. For the identification of
scaffold hoppers, an automated conformer-based 3D-superposition algorithm (Thimm et al. 2004) was
used, which has already identified new cancer-directed (Fullbeck et al. 2005) and TSE-directed (Lorenzen et al. 2005)
compounds. The lead structure is compared with all drug-like compounds and their conformers, which
are pre-computed by the MedChemExplorer of Accelrys (Smellie et al. 2003) and stored in a database,
which now consists of about 140 million conformers of about four million different compounds. For
the comparison, a plane representing the x-, y- and z-extensions is positioned in all small molecules,
their conformers and the lead structure. These cuboids are superimposed with their centres of
mass attached, which leads to four different possibilities of superimposition. Assignments of atoms that are
close to each other are superimposed according to W. Kabsch (Kabsch 1976). The last step is the
implementation of refinements to optimize the following score:
score = (percentage of superimposed atoms) × e^(-RMSD)

Figure 6. Principle of the 3D superposition for the similarity screening. For the comparison of the lead
structure (query) with all small molecules of a database, a plane representing the moments of inertia
is put into all structures and the centre of mass is calculated for each structure. The long and the short
sides of the cuboids are superimposed with the centres of mass attached, which results in four
variants of superimposition. Assignments of atoms that are closer than a particular cut-off are
superimposed according to W. Kabsch (Kabsch 1976). If necessary, further refinements are performed
to optimize the score.
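The sketch below illustrates the scoring with a small NumPy implementation of the Kabsch rotation and the formula above, applied to a toy pair of coordinate sets with a known one-to-one atom assignment; it is a simplified illustration, not the Thimm et al. (2004) algorithm.

```python
# Kabsch superposition of two coordinate sets plus the superposition score.
import numpy as np

def kabsch_rmsd(P, Q):
    """Optimally rotate P onto Q (Kabsch 1976) and return the RMSD."""
    P = P - P.mean(axis=0)                  # attach the centres of mass
    Q = Q - Q.mean(axis=0)
    V, S, Wt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(V @ Wt))      # guard against an improper rotation
    R = V @ np.diag([1.0, 1.0, d]) @ Wt
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

def superposition_score(P, Q, n_superimposed, n_total):
    """Score = (fraction of superimposed atoms, the 'percentage' above) * exp(-RMSD)."""
    return (n_superimposed / n_total) * np.exp(-kabsch_rmsd(P, Q))

query = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.4, 0.0], [0.0, 1.4, 0.3]])
rotation = np.array([[0, 1, 0], [-1, 0, 0], [0, 0, 1]])
hit = query @ rotation + 0.1                # rotated and translated copy of the query
print(superposition_score(query, hit, n_superimposed=4, n_total=4))   # close to 1.0
```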

Filtering
Lipinski Rule-of-Five
To make a statement about the bioavailability of a compound that might be used as a drug, the Lipinski
rule-of-five can be applied. This rule combines properties that compounds should fulfill to become
drug candidates.
This rule claims that an orally available drug has:

1. Not more than 5 hydrogen bond donors
2. Not more than 10 hydrogen bond acceptors
3. A molecular weight below 500 g/mol
4. A LogP below 5

LogP is a parameter that gives information about the lipophilicity of a molecule and is defined as
the logarithm of the 1-octanol/water partition coefficient. Compounds that do not fulfill at least four of these
rules are not promising drug candidates (Lipinski et al. 2001).
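A minimal rule-of-five filter is sketched below. The descriptor values are assumed to be precomputed by a cheminformatics toolkit and the example compounds are invented; the function simply counts how many of the four criteria are fulfilled.

```python
# Rule-of-five filter on precomputed molecular descriptors.
def fulfills_rule_of_five(h_donors, h_acceptors, mol_weight, logp, min_rules=4):
    """Return True if at least min_rules of Lipinski's four criteria are met."""
    fulfilled = sum([
        h_donors <= 5,
        h_acceptors <= 10,
        mol_weight < 500.0,   # g/mol
        logp < 5.0,
    ])
    return fulfilled >= min_rules

candidates = {
    "compound_1": dict(h_donors=2, h_acceptors=6, mol_weight=412.5, logp=3.1),
    "compound_2": dict(h_donors=7, h_acceptors=12, mol_weight=689.0, logp=5.8),
}
for name, descriptors in candidates.items():
    print(name, fulfills_rule_of_five(**descriptors))
```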

Docking
Docking describes the prediction of the positioning of a ligand in the binding pocket of a target to build
a complex with lower energy. It represents a fast and efficient way to screen molecule libraries for
potential ligands of a target molecule. Modern docking tools use different methods for the prediction of
possible binding poses of the ligand. To this end, in a first stage the active site has to be specified by the user
or by the docking program itself. Afterwards, different ligand positions are predicted, which mostly
proceeds in two phases. In the first one, the ligand is positioned in the pocket of the target by rotating
and translating the ligand into the binding pocket and considering potential interaction partners in the
binding site. Besides the rotation and translation of the whole ligand molecule, the generation of different
conformers by performing intramolecular torsions leads to a better fit of the ligand in the pocket.
In the second phase, an energy minimization of the target-ligand complex is performed and evaluated
by a scoring function. To evaluate putative gamma secretase or presenilin inhibitors that have been
detected by similarity screening, docking is a suitable approach to predict the binding of the ligands.
For docking to the gamma secretase, we applied the commercial program Gold (Jones et al. 1997),
which is based on a genetic algorithm to explore the full range of ligand conformational flexibility with
partial flexibility of the protein. It mimics the process of evolution by applying genetic operators to a
collection of putative poses for a single ligand. Gold contains two scoring functions (fitness functions in
genetic algorithm terminology), the Chemscore and the GoldScore. The Chemscore contains the following terms: hydrogen bonding, metal binding, lipophilic interactions and freezing of rotatable bonds. It
also includes additional terms: a covalent energy term, a penalty for steric overlap and ligand
torsion terms (Konstantinou-Kirtay et al. 2007). The GoldScore function consists of four components:
protein-ligand hydrogen bond energy, protein-ligand van der Waals energy, ligand internal van der
Waals energy and ligand torsional strain energy.

Results
During the screening, we were able to identify a structure that is common to all hits. This structure
is mostly symmetric and consists of three aromatic rings and, furthermore, a polar and an apolar side
(Figure 7). This characteristic structure could be found in all results of the analogy modeling. In Figure
8 the docking of the best-fitting structure is depicted. The docking picture strikingly visualizes the
deep binding of the symmetric structure in the pocket of the presenilin. These results of the analogy
modeling have been validated in experimental assays. Inhibition could be detected at a concentration
of 20 µM; furthermore, the gamma secretase inhibitor could successfully be combined with proteasome
inhibitors, such as bortezomib, to efficiently treat multiple myeloma (Nefedova et al. 2004).

Figure 7. Shared structure of the best screening hits. The screening hits, which were also validated experimentally, share the same symmetric structure. They consist of three aromatic rings and a polar (R1)
and an apolar (R2) side. A1 can be, for example, a sulfonamide, whereas A2 can be, for example, oxygen or
sulphur.


Figure 8. Docking of a screening hit into the binding site of the modeled presenilin pocket. The surface of
the binding site is shown and a representative of a putative inhibitor is illustrated in stick representation.
The Gold-docking result shows that the best screening hits fit into the binding pocket of the presenilin.

Conclusion
Finding novel drug candidates is one of the most fascinating and challenging aims of bioinformatical studies. Structure-based drug design has become an established part of the pharmaceutical development
pipeline. But often, no structural information on the target is available, and in these cases the approaches
are restricted to similarity screening or library prioritization. Here, we present a successful example
showing that the huge number of nearly 50,000 experimentally determined protein structures enables new approaches like analogy modeling: so many protein folds are known that the number of
newly detected folds has decreased dramatically. Thus, for most targets, even membrane proteins, it
will be possible to detect homologues with known structure or at least proteins with a similar fold. Fold
recognition or threading, which delivers rather mid-quality models, has to be combined with ligand-supported modeling to improve the quality towards structures suitable for docking of putative ligands.
The successful testing of the gamma secretase inhibitors proposed using this analogy modeling scheme
encourages further studies regarding targets without detailed structural information.

References
Biocarta Pathway Collections. http://www.biocarta.com/genes/allPathways.asp
Dunkel, M., Fullbeck, M., Neumann, S., & Preissner, R. (2006). SuperNatural: A searchable database
of available natural compounds. Nucleic Acids Research, 34, 678-683.
Eder, J., Hommel, U., Cumin, F., Martoglio, B., & Gerhartz, B. (2007). Aspartic proteases in drug discovery. Current Pharmaceutical Design, 13, 271-287


Fullbeck, M., Huang, M., Dumdey, R., Frömmel, C., Dubiel, W., & Preissner, R. (2005). Novel curcumin- and emodin-related compounds identified by in silico 2D/3D conformer screening induce apoptosis in
tumor cells. BMC Cancer, 5, 97.
Geling, A., Steiner, H., Willem, M., Bally-Cuif, L., & Haass, C. (2002). A gamma-secretase inhibitor
blocks notch signaling in vivo and causes a severe neurogenic phenotype in zebrafish. EMBO Reports, 3, 688-694.
Goede, A., Dunkel, M., Mester, N., Frömmel, C., & Preissner, R. (2005). SuperDrug: A conformational
drug database. Bioinformatics, 21(9), 1751-1753.
Hajduk, P. J., & Greer, J. (2007). A decade of fragment-based drug design: Strategic advances and lessons learned. Nature Reviews Drug Discovery, 6, 211-219.
Hayward, P., Kalmar, T., Martinez Arias, A. (2008). Wnt/Notch signalling and information processing
during development. Development, 135(3), 411-424.
Jones, G., Willett, P., Glen, R.C., & Leach, A.R. (1997). Development and validation of a genetic algorithm for flexible docking. Journal of Molecular Biology, 267(3), 727-748.
Jundt, F., Anagnostopoulos, I., Förster, R., Mathas, S., Stein, H., & Dörken, B. (2002). Activated Notch1
signaling promotes tumor cell proliferation and survival in Hodgkin and anaplastic large cell lymphoma.
Blood, 99(9), 3398-3403.
Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. Acta Crystallographica
Section A, 32(5), 922-923.
Konstantinou-Kirtay, C., Mitchell, J, & Lumley, J.A. (2007). Scoring functions and enrichment: A case
study on Hsp90. BMC Bioinformatics, 8(1), 27.
Lazo, J. S. (2006). Roadmap or roadkill: A pharmacologist's analysis of the NIH Molecular Libraries
Initiative. Molecular Interventions, 6, 240-243.
Lipinski, C. A., Lombardo, F., Dominy, B.W., & Feeney, P.J. (2001). Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced
Drug Delivery Reviews, 46(1-3), 3-26.
Lorenzen, S., Dunkel, M., & Preissner, R. (2005). In silico screening of drug databases for TSE inhibitors. Biosystems, 80(2), 117-122.
Lyne, P. D. (2002). Structure-based virtual screening: An overview. Drug Discovery Today, 7(20),
1047-1055.
Nefedova, Y., Cheng, P., Alsina, M., Dalton, W.S., & Gabrilovich, D.I. (2004). Involvement of Notch-1
signaling in bone marrow stroma-mediated de novo drug resistance of myeloma and other malignant
lymphoid cell lines. Blood, 103(9), 3503-3510.
Nefedova, Y., Sullivan, D.M., Bolick, S.C., Dalton, W.S., & Gabrilovich, D.I. (2008). Inhibition of Notch
signaling induces apoptosis of myeloma cells and enhances sensitivity to chemotherapy. Blood, 111(4),
2220-2229.


Shih, I.-M., & Wang, T.L. (2007). Notch Signaling, gamma-Secretase Inhibitors, and Cancer Therapy.
Cancer Res, 67(5), 1879-1882.
Sinha, S., Anderson, J., P., John, V., McConlogue, L., Basi, G., Thorsett, E., & Schenk, D. (1999). Purification and cloning of amyloid precursor protein beta-secretase from human brain. Nature, 402(6761),
537-540.
Skach, W. R. (2007). The expanding role of the ER translocon in membrane protein folding. Journal
of Cell Biology, 179(7), 1333-1335.
Smellie, A., Stanton, R., Henne, R., & Teig, S. (2003). Conformational analysis by intersection: CONAN.
Journal of Computational Chemistry, 24(1), 10-20.
Thimm, M., Goede A., Hougardy, S., & Preissner, R. (2004). Comparison of 2-D similarity and 3-D
superposition. Application to searching a conformational drug database. Journal of Chemical Information and Computer Science, 44(5), 1816-1822.
Tsai, J., Gerstein, M. (2002). Calculations of protein volumes: Sensitivity analysis and parameter database. Bioinformatics, 18, 985-995.
Willett, P. (2006). Similarity-based virtual screening using 2D fingerprints. Drug Discovery Today,
11(23-24), 1046-1053.
Wrede, P., & Filter, M. (2005). Bioinformatics: From peptides to profiled leads. In Knäblein, J. (Ed.), Modern Biopharmaceuticals, 4, 1771-1801. Wiley-VCH, Weinheim.
Wolfe, M. S., De Los Angeles, J., Miller, D.D., Xia, W., & Selkoe, D.J. (1999). Are presenilins intramembrane-cleaving proteases? Implications for the molecular mechanism of Alzheimer's disease.
Biochemistry, 38, 11223-11230.
Wolfe, M. S. (2001). Gamma-Secretase inhibitors as molecular probes of presenilin function. Journal
of Molecular Neuroscience, 17, 199-204.
Wolfe, M. S. (2001). Presenilin and gamma-secretase: Structure meets function. Journal of Neurochemistry, 76, 1615-20.

Key Terms
AD: Alzheimer's Disease is a neurodegenerative disease which primarily occurs in people over the age of 65. About 60 percent of all dementia cases are caused by AD.
APP: (Beta-) Amyloid Precursor Protein is an integral membrane protein, which might be involved in the development of synapses.
BACE1: A synonym for the beta secretase, an enzyme of the aspartic protease family that cleaves APP. It is involved in the development of Alzheimer's disease.
GSK-3: Glycogen Synthase Kinase 3 is a serine/threonine protein kinase.


HTS: High Throughput Screening is an approach involving the screening of large compound libraries, which allows the identification of active molecules in experimental (e.g. cell-based) assays.
NMR: Nuclear Magnetic Resonance is a spectrometric method for determination and analysis of
structures and dynamics of molecules.
RMSD: Root Mean Square Deviation. In this chapter, the RMSD value reflects the degree of similarity between two structures.
TACE: Tumor Necrosis Factor Alpha Converting Enzyme is a metalloprotease which is responsible
for the cleavage of Notch outside the membrane.
TSE: Transmissible Spongiform Encephalopathies, a condition affecting the brain or nervous system
of humans and animals. The main hypothesis for TSE is transmission by prions.
X-Ray: Electromagnetic radiation, which is also used for crystallography of unknown structures.

Note

* Both authors contributed equally to this work.


Chapter XXVI

In Machina Systems for the Rational De Novo Peptide Design

Paul Wrede
Charité Universitaetsmedizin Berlin, Germany

Abstract
Peptides fulfill many tasks in controlling and regulating cellular functions and are key molecules in
systems biology. There is a great demand in science and industry for a fast search of innovative peptide
structures. In this chapter we introduce a combination of a computer-based guided search of novel
peptides in sequence space with their biological experimental validation. The computer-based search
uses an evolutionary algorithm that includes artificial neural networks as fitness function and a mutation operator, called the PepHarvester. Optimization occurs during 100 iterations. This system, called
DARWINIZER, is applied in the de novo design of neutralizing peptides against autoantibodies from
DCM (dilated cardiomyopathy) patients. Another approach is the optimization of peptide sequences
by an ant colony optimization process. This biologically-oriented system identified several novel weak
binding T-cell epitopes.

WHAT DOES PEPTIDE DESIGN MEAN?


Peptides regulate and control many cellular processes. Many cell-cell interactions make use of peptide
recognition and binding. Peptides serve as hormones like ACTH and vasopressin or intercellular signaling
molecules, producing a specific response in target cells only after interaction with the cognate receptor. Most receptors bind only a single molecule or a group of closely related molecules. The humoral immune system synthesizes antibodies, and the antigen is often a peptide. A successful application of neutralizing antibody binding by de novo designed peptides is described in detail (Schneider, Wrede 1993; Schneider et al. 1998). In contrast, the cellular immune system works with peptides as mediators between antigen
presenting cells and T-cells. The binding of peptides to the MHC I receptor and the T-cell receptor depends on a variety of similar peptide sequences. In the early nineties a special binding motif with two anchor positions was described: at positions 2 and 9 a hydrophobic amino acid with a large side chain seems to be important (Rammensee et al. 1995; Lund et al. 2005). Since this pattern seems to be characteristic of MHC I binding peptides, it was introduced into a prediction tool called SYFPEITHI (Rammensee et al. 1999). But many recent studies revealed that this pattern is not sufficient for a prediction with high accuracy. Still, all available MHC I binding peptide prediction tools have a disappointing reliability (Peters et al. 2006; Filter, Wrede unpublished observations).
Often amino acid sequence patterns are not unique although they fulfill the same function, that is, they occupy the same binding sites of the target molecule. This makes the development of prediction and design tools an extraordinary endeavour. Some solutions to overcome these hindrances are described in the next section.
Several other sources describe the combinatorial chemical process of peptide design. This knowledge is also necessary and is incorporated into computer-based rational peptide predictions. This chapter, however, focuses on computer-based rational peptide design.

BIOINFORMATIC TOOLS FOR THE COMPUTER-AIDED MOLECULAR DESIGN


I highlight novel techniques for molecule design, especially peptide design and molecular feature extraction, which can be applied when three-dimensional molecular structures are not available. A necessary prerequisite for any rational attempt to identify or even design molecules with a desired property or activity is an accurate model of the underlying sequence-(structure-)activity relationship (SAR). Such SAR models serve as a guideline in the search for novel and optimized compounds in evolutionary design cycles, which have become possible due to advances in both compound generation and screening technology. It is obvious that the quality of the model determines the success rate of this multi-dimensional design process. Only if a relevant SAR model is used can a rational molecular design be successful (Wrede, Schneider 1994; Schneider, Soo 2003; Wrede, Filter 2006).
How can we develop a good SAR model? It is apparent that no cure-all recipe exists; nevertheless,
some general rules of thumb can be given. One approach is to consider the task as a pattern recognition problem, where three main aspects must be considered: first the data used for generation of a SAR
hypothesis should be representative of the particular problem; second the way molecular structures are
described for model generation and its level of abstraction must allow for a reasonable solution for the
pattern recognition task; and third the model must permit non-linear relationships to be formulated
since the interdependence between molecular activities and structural entities is generally non-linear.
The first point seems to be trivial but selection of representative data for hypothesis generation is very
difficult and often impossible due to a lack of data. The focus here is on the two latter points, namely
different levels of data representation and descriptor types, and non-linear feature extraction from a
given data set by artificial neural networks (ANN). Various types of ANN are of considerable value for
many fields of research, including chemistry, biology, medicine, and pharmaceutical research. Main
tasks performed by these systems are:

•	Feature extraction
•	Function estimation and non-linear modeling
•	Classification
•	Prediction

For many applications alternative techniques exist (Milne, 1997; Duda et al. 2001); ANN provide,
however, a more flexible and elegant approach offering unique solutions to these tasks. The paradigms
offered by ANN lie somewhere between purely empirical and ab initio approaches.
Neural networks:

•	Learn from examples and acquire their own knowledge (induction)
•	Are able to generalize
•	Provide flexible non-linear models of input/output relationships
•	Are able to cope with noisy data and are fault tolerant (Schneider, Soo, 2003)

Building Blocks of Neural Network Architecture


Artificial neural networks consist of two elements: (i) formal neurons, and (ii) connections between the neurons. Neurons are arranged in layers, where at least two layers of neurons (an input layer and an output layer) are required for construction of a neural network. In Figure 1 a network architecture is shown, which is a three-layered network with a single output neuron (Schneider, Wrede, 1998). Supervised artificial neural networks can be applied as function estimators and classifiers (Rumelhart et al. 1986; Hertz et al., 1991). They follow the principle of convoluting simple non-linear functions for approximation of complicated input-output relationships, which is known as the Kolmogorov theorem. The incoming data are transferred to the hidden neurons, which include a sigmoidal activation or transfer function. Such a sigmoidal neuron calculates an output value according to:
$$\mathrm{Sigm}(\mathrm{input}) = \frac{1}{1 + e^{-\mathrm{input}}}, \qquad \text{where } \mathrm{input} = \sum_i w_i x_i - \theta$$

Here w is the weight vector connected to the neuron, x is the neuron's input signal, and θ is the neuron's bias or threshold value.
If a single sigmoidal output neuron is used, the overall function represented by the fully connected two-layered feed-forward network is:

$$f(\mathbf{x}) = \mathrm{Sigm}\Big(\sum_i w_i x_i - \theta\Big)$$

where x is the input vector (data vector). The network shown in Figure 1 with sigmoidal hidden units and a sigmoidal output unit represents a more complicated function:

$$f(\mathbf{x}) = \mathrm{Sigm}\Big(\sum_j v_j\,\mathrm{Sigm}\Big(\sum_i w_{i,j}\,x_i - \theta_j\Big) - \theta_{\mathrm{out}}\Big)$$

where w are the input-to-hidden weights, v are the hidden-to-output weights, θ are the hidden layer bias values, and θ_out is the output neuron's bias. The more layers are present in a network, the more complicated overall functions can be represented. At most two hidden layers with non-linear neurons are required to approximate arbitrary continuous functions (Cybenko, 1989). Depending on the application and the accuracy of approximation, the required number of layers and the number of neurons in a layer can vary. There is a rule of thumb that the number of neurons should not be larger than the number of data points in the training set in order to avoid an overdetermined system. The ratio of data points available (training data) to network weights should be around 2 (i.e., the number of training data points divided by the number of weights).

Figure 1. Three-layer artificial neural network. Information flow is from left to right. For clarity only a few connections between the nodes or neurons are shown. Neurons are transfer units (sigmoidal transfer function); the output unit is a linear function; for details see text.

Figure 2. A Bongard problem. Patterns of one class must have a common feature. Here class A is characterized by: two symbols of very similar size but different shape. Now it is easy to assign the above pattern to the correct class. This problem is analogous to the classification of peptides.
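To make the layered network function above concrete, the following minimal sketch in Python with NumPy evaluates the three-layer feed-forward function with sigmoidal hidden units and a sigmoidal output unit. All weights, biases and inputs are illustrative random values, not parameters from this chapter.

```python
import numpy as np

def sigm(z):
    """Sigmoidal transfer function Sigm(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def three_layer_net(x, W, theta_hidden, v, theta_out):
    """f(x) = Sigm( sum_j v_j * Sigm( sum_i w_ij * x_i - theta_j ) - theta_out )."""
    hidden = sigm(W @ x - theta_hidden)   # one activation per hidden neuron
    return sigm(v @ hidden - theta_out)   # single sigmoidal output neuron

# Toy example: 4 inputs, 3 hidden neurons, 1 output; random values for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=4)                 # input (data) vector
W = rng.normal(size=(3, 4))            # input-to-hidden weights w_ij
theta_hidden = rng.normal(size=3)      # hidden-layer bias values
v = rng.normal(size=3)                 # hidden-to-output weights
theta_out = 0.1                        # output neuron bias
print(three_layer_net(x, W, theta_hidden, v, theta_out))
```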
Besides establishing an adaptive system, such as an artificial neural network, to approximate the sequence-function relation, data representation is often crucial for feature extraction in noisy sequence data. The coherence of feature extraction and pattern classification is presented in a symbolized model, a Bongard problem (Figure 2). Twelve patterns belong to two equally sized classes A and B. To which class does the additional pattern belong? Keep in mind that all patterns of a class must have a common feature. To find this feature, a systematic search for descriptors is necessary to solve the problem. One descriptor can be the colour. In a heuristic procedure it turns out that the colour is an irrelevant property. The reader is asked to find the correct properties for the descriptors. In any case, descriptor search is often a big hurdle when biological molecules like peptides have to be classified.

Figure 3. The PepHarvester algorithm. Simplified two-dimensional model of the PepHarvester algorithm. In the centre is the seed peptide. The shells represent the Euclidian distance of a complete peptide from the centre according to a given distance matrix (Table 1). The formula for calculating the distance of a peptide of length n is given on the left.

For peptides, several descriptors like hydrophobicity, side chain volume, and polarity are often a good
first choice. There is no general rule for the number of training cycles required but when the output error
is minimized below a given threshold, the prediction quality can be determined with an independent
test data set.

PEPHARVESTER: GENERATOR FOR FOCUSED PEPTIDE LIBRARIES


This section is adapted, with minor modifications, from a publication on the PepMaker algorithm (Schneider, G., Grunert, H.P., Schuchhardt, J., Wolf, K-U., Müller, G., Habermehl, K-O., Zeichhardt, H., & Wrede, P., 1995).
For several de novo design projects enough peptide sequence data can be obtained from literature
or current experiments to train artificial neural networks. But for many tasks only a single sequence
is available. Here the design of peptides to neutralize autoantibody binding is described. ANN or
similar pattern classification tools need a sufficient set of data for feature extraction. Sufficient means as many data as possible that share a common feature and form a representative set of all existing sequences. The PepHarvester algorithm can comply with these requirements by generating a focused peptide library starting from a single known peptide. The algorithm generates variants stemming from sequence space regions around a so-called seed peptide with a unimodal bell-shaped distribution.
It is assumed that molecules with an improved function can be identified among the peptides located
close to the seed-peptide in sequence space (Figure 3). This supposition is motivated by a number of
observations (Dayhoff and Eck, 1968; Eigen 1971; Grantham 1974, Kimura 1983, Myata et al., 1979,
Rao, 1987; Schuster, 1986):
1. In natural evolutionary processes, large alterations of a protein may occur within a generation, but
these extremely different mutants rarely survive (low fitness).

2. Most observed mutations leading to a slightly improved function are single-site substitutions keeping the vast majority of the sequence unchanged.
3. Conservative replacements tend to prefer substitutions of amino acids which are similar in their intrinsic physicochemical properties.

Therefore, we chose a localized, bell-shaped distribution of variants for construction of a useful peptide library, which is thought to approximately reflect these aspects of natural protein evolution. Even peptides spaced far apart from the seed peptide in sequence space are included. Large sequence alterations can also lead to improved function. This might be the case if, for example, several optima exist in sequence space (Eigen et al. 1988a, 1988b; Fontana et al. 1993; Kauffman, 1993). The methodology is expected to provide an additional technique, complementing equally-distributed sets of peptides for screening, if one is interested in a peptide with an optimized or analogous function.
There are two central problems to be solved: first an appropriate definition must be given for a distance measure in sequence space; second, a procedure must be at hand that allows the calculation of
mutation rates for pairwise amino acid exchanges. In the following part we focus on these two tasks in
more detail. An application of the method is provided thereafter.

Selection of an Appropriate Amino Acid Distance Matrix is Context Dependent


Simple metrics in sequence space are provided by amino acid distance maps, which are based on relations between the individual amino acid residues. The Euclidian distance between the peptides A and A′ of length n serves as a simple distance measure in the PepHarvester algorithm:

$$d_{A,A'} = \sqrt{\sum_{i=1}^{n} \delta_i^2}$$

The Euclidian distance evaluates a single large-step point mutation more severely than several small-step substitutions. The distance δi between the two amino acids at sequence position i is taken from the amino acid distance matrix employed (Table 1). Several distance maps have been suggested which are
based on very different relationships between amino acid residues. There is much evidence that peptides with a similar biological function (similar phenotype) usually have low pairwise distance values,
and fitness can be regarded as resulting from natural selection (Li and Graur, 1999). Sequences with
a high fitness are preferentially selected which is thought to be reflected by a low distance value here.
In this simple PepHarvester model peptide fitness is determined solely by the amino acid sequence,
and all sequence positions are assumed to contribute to the fitness value of a peptide. In peptide design
experiments partly based on the strategy described here we found that the matrix of Feng et al. 1985
is a good first choice if nothing is known about the structuring of the corresponding sequence space
(Schneider Wrede, 1994). The Feng Matrix (also termed GS-matrix) takes into account both genetic
and structural distances between amino acids (Table 1).
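As a small illustration of this distance measure, the following Python sketch computes d(A, A′) from a distance matrix. The few matrix entries shown are only an excerpt of Table 1; in practice the full 20 × 20 matrix is used.

```python
# Minimal sketch of the peptide distance used by PepHarvester:
# d(A, A') = sqrt( sum_i delta_i^2 ), with delta_i taken from an amino acid
# distance matrix such as the Feng et al. (1985) matrix (Table 1).
import math

# Tiny excerpt of such a matrix; the full 20x20 table would be loaded in practice.
DIST = {
    ("A", "A"): 0.00, ("A", "R"): 1.00, ("R", "A"): 0.81,
    ("C", "C"): 0.00, ("C", "Y"): 0.50, ("Y", "C"): 0.60,
}

def peptide_distance(a: str, b: str) -> float:
    """Euclidian distance between two equally long peptides in sequence space."""
    assert len(a) == len(b)
    return math.sqrt(sum(DIST[(x, y)] ** 2 for x, y in zip(a, b)))

print(peptide_distance("AC", "RY"))  # distance between the toy dipeptides AC and RY
```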
The applicability of a particular metric distance and its usefulness for design is context-dependent.
Aspects of an amino acid substitution should be considered which are relevant to the local structural
environment or particular residue function (Eigen et al. 1988a, 1988b, Taylor 1986). This context
boundedness is reflected by the fact that the same amino acid sequence can adopt different structures or
perform different functions in different environments (Minor, Kim 1994), just as different sequences can give rise to similar folds (Laurents et al. 1994). Signal peptides of secretory proteins give examples of sequence features largely encoded by physico-chemical properties like hydrophobicity or polarity (Schneider, Wrede 1993; Schneider, Wrede 1998). Many sterical constraints must be taken into account, for instance, for the design of idealized helical structures (DeGrado, Lear 1990). Thus, selection of a useful measure of sequence similarity is crucial for successful application of the PepHarvester algorithm and for the correctness of the assumption made above concerning the generation of variants. In general, short isolated peptides lack a defined tertiary context and, therefore, distance maps constructed from intrinsic amino acid properties or structural propensities might be useful as a guide for the generation of variants. Tertiary context is a major determinant for the design of large polypeptides. The use of evolutionary sequence profiles might provide alternative sequence descriptions taking some tertiary constraints into account. Jones and coworkers described methods for the rapid generation of mutation data matrices from protein sequences which might also be useful for the PepHarvester approach (Jones et al. 1992).

Table 1. Amino acid distance matrix according to Feng et al. (1985)

     A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
A  0.00  1.00  0.45  0.45  1.00  0.25  1.00  1.00  0.75  1.00  1.00  0.76  0.25  0.75  1.00  0.25  0.25  0.25  1.00  1.00
C  0.67  0.00  0.83  1.00  0.50  0.50  0.67  0.67  1.00  0.67  0.67  0.67  0.67  0.83  0.67  0.30  0.67  0.67  0.50  0.50
D  0.30  0.83  0.00  0.17  0.83  0.30  0.50  0.83  0.50  0.83  1.00  0.17  0.67  0.30  0.67  0.50  0.67  0.50  1.00  0.67
E  0.30  1.00  0.17  0.00  1.00  0.30  0.67  0.83  0.30  0.83  0.83  0.50  0.50  0.30  0.67  0.50  0.50  0.30  0.83  0.83
F  0.67  0.50  0.83  1.00  0.00  0.83  0.67  0.30  1.00  0.30  0.67  0.83  0.67  0.83  0.83  0.50  0.83  0.30  0.50  0.17
G  0.21  0.60  0.36  0.36  1.00  0.00  1.00  0.81  0.81  0.81  1.00  0.60  0.60  0.81  0.60  0.21  0.81  0.36  0.60  0.81
H  0.81  0.81  0.60  0.81  0.81  1.00  0.00  1.00  0.60  0.60  1.00  0.36  0.60  0.36  0.36  0.60  0.81  1.00  1.00  0.60
I  0.81  0.81  1.00  1.00  0.36  0.81  1.00  0.00  0.81  0.21  0.36  0.81  0.81  1.00  0.81  0.81  0.60  0.21  0.81  0.60
K  0.50  1.00  0.50  0.30  1.00  0.67  0.50  0.67  0.00  0.67  0.67  0.30  0.67  0.30  0.17  0.50  0.30  0.50  0.83  0.83
L  0.81  0.81  1.00  1.00  0.36  0.81  0.60  0.21  0.81  0.00  0.21  1.00  0.60  0.81  0.81  0.81  0.81  0.21  0.36  0.60
M  0.67  0.67  1.00  0.83  0.67  0.83  0.83  0.30  0.67  0.17  0.00  0.83  0.67  0.67  0.67  0.83  0.50  0.30  0.50  0.67
N  0.50  0.67  0.17  0.50  0.83  0.50  0.30  0.67  0.30  0.83  0.83  0.00  0.67  0.50  0.67  0.17  0.30  0.67  1.00  0.50
P  0.25  1.00  1.00  0.75  1.00  0.75  0.75  1.00  1.00  0.75  1.00  1.00  0.00  0.75  0.75  0.45  0.45  0.75  1.00  1.00
Q  0.60  1.00  0.36  0.36  1.00  0.81  0.36  1.00  0.36  0.81  0.81  0.60  0.60  0.00  0.60  0.60  0.60  0.81  1.00  0.81
R  0.81  0.81  0.81  0.81  1.00  0.60  0.36  0.81  0.21  0.81  0.81  0.81  0.60  0.60  0.00  0.60  0.60  0.81  0.81  1.00
S  0.21  0.36  0.60  0.60  0.60  0.21  0.60  0.81  0.60  0.81  1.00  0.21  0.36  0.60  0.60  0.00  0.21  0.81  0.81  0.60
T  0.21  0.81  0.81  0.60  1.00  0.81  0.81  0.60  0.36  0.81  0.60  0.36  0.36  0.60  0.60  0.21  0.00  0.60  1.00  0.81
V  0.21  0.81  0.60  0.36  0.36  0.36  1.00  0.21  0.60  0.21  0.36  0.81  0.60  0.81  0.81  0.81  0.60  0.00  0.60  0.60
W  0.61  0.50  1.00  0.83  0.50  0.50  0.83  0.61  0.83  0.30  0.50  1.00  0.67  0.83  0.67  0.67  0.83  0.50  0.00  0.50
Y  0.81  0.60  0.81  1.00  0.21  0.81  0.60  0.60  1.00  0.60  0.81  0.60  0.81  0.81  1.00  0.60  0.81  0.60  0.60  0.00

Mutation Rates are Calculated from Amino Acid Distances


Mutations leading to conservative replacements, and as a consequence to a very similar phenotype, are strongly preferred as a result of natural selection. The distance d_{A,A'} between two isofunctional variant sequences A and A′ is small, provided a sensitive distance matrix is used. The distance d_{A,A'} per se is different from the rate of mutation r_{A,A'}, and a general rule for converting distances to rates is not at hand. Based on the results of Myata (1979), who investigated variant hemoglobin sequences and the relation between sequence distances and mutation rates, we assume that the rates of observed (accepted) single-site substitutions A → A′ are approximately Gaussian-distributed with respect to sequence distance (Schneider, Wrede 1994). The rates of the A → A′ transition are based on the probabilities of amino acid substitutions. Observed substitution probabilities P(i→j) for two amino acids i and j result from both natural mutation rates and subsequent selection. Dayhoff and Eck (1968) described a procedure for calculation of observed mutation rates. For the conversion of an amino acid distance matrix to a (non-symmetric) rate matrix, we have employed a formula where the distance-dependent exchange rate r_ij is a monotonously decaying function of the distance:

$$p(i \to j) = r_{ij} = \frac{\exp\!\left(-\dfrac{d_{ij}^{2}}{2\sigma_i^{2}}\right)}{\sum_{k=1}^{20}\exp\!\left(-\dfrac{d_{ik}^{2}}{2\sigma_i^{2}}\right)}$$

The Gaussian distribution employed is a special bell-shaped distribution. Probably another monotonously decaying, localized function, for example exp(-d/σ), would not essentially change the outcome. σ is a position-specific parameter defining the shape of the Gaussian distribution, which may be subjected to time-dependent alterations, for example in simulation experiments. This was taken into consideration since:

1. Identical single-site substitutions can have very different effects on peptide function, and
2. The rate of change in amino acid sequences has been only approximately constant in the course of evolution (Benner et al. 1994).

For σ → ∞, the value of r_ij is 0.05 for each amino acid, i.e. all substitutions are equally probable. Small σ values lead to narrow distributions of the exchange rates, which is thought to reflect strong selection pressure. The amino acid distance matrix according to Feng et al. (1985) and the rate matrix with σ = 0.1 (narrow distribution) were used. Based on the rate matrix, variants of the seed peptide are generated.
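The following Python sketch illustrates this conversion and the generation of seed-peptide variants. It is not the original PepHarvester code: the distance values are random placeholders standing in for a full matrix such as Table 1, and the seed sequence is an arbitrary 10-mer.

```python
# Sketch: turn amino acid distances d_ij into Gaussian substitution probabilities
# (narrow distribution, sigma = 0.1) and sample variants around a seed peptide.
import math, random

AA = "ACDEFGHIKLMNPQRSTVWY"
random.seed(1)
# Illustrative distance rows; a real run would use the Feng matrix of Table 1.
dist = {a: {b: (0.0 if a == b else random.uniform(0.2, 1.0)) for b in AA} for a in AA}

def substitution_probs(aa: str, sigma: float = 0.1) -> dict:
    """p(i -> j) proportional to exp(-d_ij^2 / (2 sigma^2)), normalized over all 20 residues."""
    weights = {b: math.exp(-dist[aa][b] ** 2 / (2 * sigma ** 2)) for b in AA}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}

def make_variant(seed: str, sigma: float = 0.1) -> str:
    """Sample one residue per position; identity has the highest rate, so most positions stay."""
    out = []
    for aa in seed:
        probs = substitution_probs(aa, sigma)
        out.append(random.choices(AA, weights=[probs[b] for b in AA], k=1)[0])
    return "".join(out)

seed = "ACDEFGHIKL"                      # arbitrary example seed peptide
print([make_variant(seed) for _ in range(5)])
```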

PEPTIDE DESIGN CYCLE INCLUDES IN VITRO SCREENING AND IN MACHINA CALCULATIONS
Once a reliable measure of biological peptide function has been established by in vitro, in vivo or ex vivo tests, the sequences together with this information can be employed for training adaptive systems such as artificial neural networks. The next step is to model a mathematical function which is more suitable for assigning fitness values to each amino acid sequence of a given length, rather than using a simple distance measure as described above. In this manner, the sequence space might be structured in a more subtle way. Based on such heuristics it might be possible to search for new idealized peptides with a certain function. This peptide design strategy will be described in the next section for the design of peptides used in neutralizing autoantibody binding. A similar approach led to the identification of novel organic molecules as modulators for the human Kv1.5 ion channel (Schneider et al., 2000).
A possible peptide design cycle which is based on random screening as well as model-based steps is shown in Figure 4. A functional peptide, which might have been found by combinatorial screening, is varied by a guided random search, e.g. by the PepHarvester program.


Figure 4. Peptide design cycle as a combination of in vitro screening and computational techniques

The PepHarvester needs only a single functional peptide to generate a highly enriched focused library. This library is tested and all data are used in the next step to train an artificial neural network. The ANN provides a model or fitness function for a guided search through the sequence space, an algorithm called the DARWINIZER.
Then the new peptides generated are tested for their activity in a biological test. The results obtained by these biological tests provide a basis for a model-based search. The data can be fed into an adaptive system, e.g. a neural network, which will be used to find a mathematical model describing the relation between the amino acid sequence of a peptide and its biological activity. Besides conventional statistics, neural networks provide flexible systems for approximation of mathematical functions and are of growing importance for amino acid sequence analysis. Such mathematical models may serve as a fitness function for subsequent peptide design in machina. We have implemented an efficient evolutionary algorithm for systematic and fast sequence optimization employing a trained artificial neural network as fitness function. This simulated molecular evolution or DARWINIZER technique suggests an optimized peptide which may either be used directly in vitro or serve as the new seed peptide for the PepHarvester program, thereby initiating a new round of the design cycle. Combinations of rational and irrational design concepts with evolutionary optimization strategies might well provide a basis for efficient and fast design of peptides. Several in vitro tests of peptides optimized with the DARWINIZER were already successful (Wrede et al. 1998). One example of medical importance is described in the next section.

DESIGN OF SYNTHETIC PEPTIDES PREVENTING THE POSITIVE CHRONOTROPIC EFFECT OF AUTOANTIBODIES FROM DCM PATIENT SERA
The application of the rational design cycle proposed above led to a set of novel peptides which prevent the positive chronotropic effect of anti-β1-adrenoreceptor autoantibodies from the serum of patients suffering from idiopathic dilated cardiomyopathy (DCM), a severe autoimmune disease (Figure 5). Recent studies showed that autoantibodies bind to the first and second loop of the β1-adrenoreceptor, leading to the harmful chronic cardiac adrenergic drive to which DCM patients are suggested to be exposed (Cetta, Michels 1995). The design goal is to find short peptides representing the natural epitope sequences which can be used as therapeutic molecules. Searching follows the de novo design cycle described above:

Figure 5. A schematic drawing of the β-adrenergic receptor, a seven-transmembrane protein belonging to the class of GPCRs (G-protein coupled receptors). The autoantibody from DCM patient serum binds to loop 2 on the lumenal side. Epitope mapping identified the antigenic binding site as the ARRCYNDPKC sequence within the loop.
The design cycle commences with a seed peptide. To obtain a seed peptide, the loop 2 sequence region of the β-adrenergic receptor is fragmented into decamer peptides, which overlap with a step size of 2 residues. These peptide fragments were measured for binding of anti-β1-adrenoreceptor antibodies by the ELISA technique. The highest signal correlated with the amino acid sequence ARRCYNDPKC (positions 107-116). This epitope had already been identified as a natural epitope for antibody binding (Wallukat et al. 1995; Mobini et al. 2000) and was used here as the seed peptide for generating a focused peptide library with the PepHarvester algorithm. In Figure 6 the activities of 90 peptides measured by an ELISA assay are shown. Peptides with the closest Euclidian distance from the seed peptide at the origin (0.229) give a stronger signal than the seed peptide (dashed line parallel to the abscissa). With increasing Euclidian distance from the seed peptide a decaying ELISA signal is observed. Two exceptions from the general trend must be mentioned: first, there are several peptides in the close neighbourhood of the seed peptide showing a higher activity than the seed; second, peptides even at a larger distance are identified with an activity comparable to the seed. The first observation may be an effect of local hill climbing in the natural fitness landscape. Probably the seed peptide lies at a suboptimal location rather than at a global optimum. The second observation may reflect some inaccuracy of the distance measure in sequence space, or several active peptides may reside in other local optima of the fitness landscape. Only in vitro experiments can confirm the applicability of the PepHarvester algorithm for constructing focused peptide libraries.

Figure 6. Activities of peptides measured in an ELISA of the PepHarvester run. The activity of the seed peptide is indicated by the dashed line. Vertical bars indicate the maximal and minimal activity of the peptides found in the distance intervals marked on the x-axis. With increasing distance from the seed peptide the activity decreases, as expected. Striking is the slightly higher activity of many peptides compared to the seed peptide, indicating that other optima exist in sequence-activity space.

In the next step all ELISA-measured peptides are used to train an artificial neural network. The idea behind this is to construct an artificial fitness landscape, i.e. to approximate the sequence-function relation. All 91 tested peptides, including the seed peptide, were described by two physico-chemical properties per residue: hydrophobicity (Engelman et al. 1986) and side-chain volume (Harpaz et al. 1994), which yields 91 20-dimensional pattern vectors. The quality of the network is sufficient for the further design step when the peptide activity is correctly predicted.
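The following Python sketch illustrates this encoding scheme. The property values are made up for illustration; the published Engelman and Harpaz scales are not reproduced here. Each residue contributes two numbers, so a decapeptide yields a 20-dimensional pattern vector.

```python
# Sketch: encode a decapeptide as a 20-dimensional vector of two per-residue
# physico-chemical descriptors (hydrophobicity, side-chain volume).
HYDRO = {"A": 0.5, "R": -1.0, "C": 0.3, "Y": -0.2, "N": -0.7, "D": -0.9, "P": -0.3, "K": -1.0}
VOLUME = {"A": 0.2, "R": 0.9, "C": 0.4, "Y": 0.8, "N": 0.5, "D": 0.4, "P": 0.4, "K": 0.7}

def encode(peptide: str) -> list:
    """Return [h_1, v_1, h_2, v_2, ...]; a 10-mer gives 20 numbers."""
    vec = []
    for aa in peptide:
        vec.extend([HYDRO[aa], VOLUME[aa]])
    return vec

print(encode("ARRCYNDPKC"))  # the seed peptide as a 20-dimensional ANN input
```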
To get an impression of the distribution of the active peptides (absorbance > 103) and inactive peptides (absorbance < 103), a projection into the two-dimensional physicochemical space of hydrophobicity and side chain volume supports the analysis. Two projection methods were used: principal component analysis (PCA) and Sammon mapping. PCA is a linear projection of high-dimensional data onto a two-dimensional plane spanned by the two principal components, while Sammon mapping is a non-linear projection in which the relative distances between the peptides in the multi-dimensional space are maintained in the 2D projection by means of an optimization procedure. Details are described in Schneider, Wrede 1998. The active peptides are not spread over the whole map; rather, they are distributed in two adjacent clusters (labeled I and II). Since the active peptides are spread over a broad region and the clusters are inhomogeneous, modeling the sequence-activity relation is a difficult problem; it may even be unsolvable. The unexpected existence of two clusters of active peptides may be due to the use of polyclonal antibodies in the ELISA system. It cannot be excluded that the two clusters are supertype binding motifs for two antibodies. Another, also very likely, possibility is that two different peptides bind to the same antibody in different ways (Kramer et al. 1997).
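A minimal Python/NumPy sketch of the linear PCA projection is given below; it uses random stand-in data instead of the actual peptide descriptors.

```python
# Sketch: project 20-dimensional peptide descriptor vectors onto the two
# principal components via singular value decomposition.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(91, 20))          # stand-in for the 91 encoded peptides

Xc = X - X.mean(axis=0)                # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
projection = Xc @ Vt[:2].T             # coordinates along the two principal components
print(projection.shape)                # (91, 2): one 2D point per peptide
```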
Nevertheless, due to the lack of other data, we used these experimentally characterized peptides as the data set for the training of a supervised ANN. For the derivation of an SAR, here the relation between short peptides and their ability to bind autoantibodies, an ANN represents an adaptive framework for modeling arbitrary non-linear relationships with a high noise tolerance (Hertz et al. 1991; Wrede, Schneider 1994; Schneider, Wrede 1998; Schneider, Soo 2003; Schneider, Baringhaus 2008). To assess the usefulness of the SAR models, cross-validation of the data was performed ten times using random 8+2 splits. Several networks with different numbers of hidden neurons were trained on the prediction of the absorbance values. The result of this optimization procedure: the number of hidden neurons is five; the relative deviation of the prediction on training data is 15%, with a linear correlation coefficient of r = 0.87 (t = 16.4). Independent test data were predicted with a deviation of 17%, r = 0.79 (t = 4.8). Complete cross-validation (leave-one-out) resulted in a test data deviation of 27%, r = 0.59 (t = 6.8). According to the t-test values the null hypothesis of chance correlation can be rejected. Therefore it is very likely that the trained ANN represents a useful SAR model for structuring the sequence space. Of course the prediction accuracy of the ANN cannot be better than the error of the input data, in our case the 15% error of the ELISA test. More complex ANN architectures will lead to a decreased learning error but a drastically increased test error. Since the ANN used in peptide design gave a poor correlation and an elevated test error in the leave-one-out procedure, this particular SAR model can only perform semi-quantitative predictions, differentiating between high, medium and low levels of activity. When the number of peptides used for training increases, the prediction accuracy will become more precise. It cannot be excluded that the choice of another peptide encoding scheme would improve the results too.
Now a heuristic for searching peptides in sequence space is available. The evolutionary algorithm DARWINIZER with the trained ANN as a fitness function was applied for the rational de novo design of peptides. The DARWINIZER is described in detail in several publications (Schneider, Wrede, 1994; Wrede, Schneider, 1994; Schneider et al., 1996; Wrede et al. 1998; Wrede, Filter, 2005) and will be explained only briefly here. In an iterative process starting with a parent peptide sequence, several hundred peptide variants are generated by amino acid substitutions. The sequence length remains invariant. The trained ANN serves as a filter system to select the best variant in each generation. After several cycles the process is stopped when no further optimization is obtained.
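A minimal Python sketch of such a simulated molecular evolution loop is given below. It is not the original DARWINIZER implementation: the fitness function is a toy stand-in for the trained ANN (similarity to an arbitrary hypothetical motif), and the parameter values (300 variants per generation, at most 100 generations) are illustrative.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"
random.seed(0)
TARGET = "KYDPWARNCF"   # arbitrary hypothetical motif, only to make the toy fitness non-trivial

def fitness(peptide: str) -> float:
    """Toy stand-in for the trained ANN fitness function."""
    return sum(a == b for a, b in zip(peptide, TARGET)) / len(TARGET)

def mutate(parent: str) -> str:
    """Single-site substitution; the sequence length remains invariant."""
    pos = random.randrange(len(parent))
    return parent[:pos] + random.choice(AA) + parent[pos + 1:]

def darwinize(seed: str, variants_per_generation: int = 300, max_generations: int = 100) -> str:
    parent, best = seed, fitness(seed)
    for _ in range(max_generations):
        pool = [mutate(parent) for _ in range(variants_per_generation)]
        champion = max(pool, key=fitness)
        if fitness(champion) <= best:      # no further optimization: stop
            break
        parent, best = champion, fitness(champion)
    return parent

print(darwinize("ARRCYNDPKC"))             # start from the chapter's seed peptide
```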
A series of six peptides with a range of predicted activities was tested in a bioassay. The predicted binding activities were tested in an ex vivo assay. The measured parameter is the beating rate of rat myocytes after adding immunoglobulin sera of DCM patients in the presence or absence of the designed peptides. The ex vivo response of the beating rate of rat myocytes to different peptide concentrations in the presence of human anti-β1-adrenoreceptor antibodies gave the following results (Table 2): as predicted, the seed peptide ARRCYNDPKC, a natural epitope of the autoantibodies, prevented the positive chronotropic effect. The beating rate reverted to the basal level. A very similar effect occurred with the four peptides (peptides 1-4) predicted to have at least a marginal activity. Peptide 5 represents an anti-designed peptide, which was completely inactive as predicted. The ex vivo response of the designed peptides was concentration dependent. Peptides 1, 2, and 3 neutralized autoantibody activity
in the highest concentration (10 µg/ml), while at the lowest concentration peptide 1 showed a stronger effect than the seed peptide. A rather similar biological effect was observed for all four predicted peptides. For the anti-designed inactive peptide (peptide 5) no influence on the rat myocyte beating rate was found. In addition, two random peptide sequences had no effect either (data not shown).

Table 2. Activity of peptides in a bioassay. Peptides 1, 2 and 5 were designed de novo. Underlined residues are identical to respective seed peptide residues (natural epitope).

Peptide        Amino acid sequence   Predicted activity   Measured activity
Seed peptide   ARRCYNDPKC            High                 High
Peptide 1      DRFGDKDIAF            High                 High
Peptide 2      GWFGGADWHA            High                 Medium
Peptide 3      IWGCSGKLIC            Medium               Medium
Peptide 4      KLDAPTNKWG            Low                  Medium
Peptide 5      FVRRTYYPER            No                   No
The design algorithm led to very different sequences, although both peptide 1 and peptide 2 have the best biological activity. The sequences DRFGDKDIAF (peptide 1) and GWFGGADWHA (peptide 2) have only one amino acid in common with the seed peptide (ARRCYNDPKC), namely Asp-7. This invariant amino acid seems to be important for the binding motif. Indeed, 73% of the potentially strong-binding peptides possess this residue, and 18% contain an asparagine. It is somewhat surprising that the medium-binding sequence IWGCSGKLIC (peptide 3) shares Cys-4 and Cys-10 with the seed peptide, while the low-binding peptide KLDAPTNKWG (peptide 4) lacks any identity with the seed peptide. On the other hand, the negative-designed sequence of peptide 5, FVRRTYYPER, has two identical amino acids (Arg-3 and Pro-8). Such sequence analysis reveals that conserved residues in a short peptide sequence do not by themselves constitute a binding motif. We conclude that the parallel distributed data processing of the ANN is able to take the context dependency of a sequence into account.

A COMBINATORIAL OPTIMIZATION WITH AN ARTIFICIAL ANT SYSTEM


In the previous section evolutionary algorithms were described for the optimization of peptide sequences. There, two operators are the key constituents of the system, namely the mutation operator connected with the selection operator. The selection process depends on the fitness function, which can consist of a trained artificial neural network or other pattern classification systems. Such a system is based on stochastic processes that optimize the overall population of solutions. Another biologically oriented method for the de novo design of peptides is ant colony optimization (ACO). ACO was introduced by Dorigo (1992). ACO copies the foraging behaviour of ants like those of the subfamily Dolichoderinae. Ants are able to find the shortest path connecting the nest and a food source. A single ant is unable to find the shortest path. With pheromones, however, ants working as individuals can coordinate and organize themselves according to a simple rule, like following the intensity of an odour trace, to pursue a common aim (stigmergy).
Here an ACO concept of peptide design is explained with the aim of finding new peptide sequences binding to the MHC I H-2Kb molecule of the mouse. Artificial ants are modeled in such a way that they move in the search space from one building block (i.e. amino acid) to another and leave traces of pheromones. The ants collect substituents along their path and assemble virtual molecules by attaching the collected substituents to a given peptide. The product is scored by a trained ANN. Here we used three different ANNs. Each of them used a different set of descriptors for encoding peptide sequences of a given length; for MHC I binding, octamers were designed. If during the path through sequence space a strong-binding peptide is identified, the corresponding path receives a strong pheromone trace. Pheromone intensity can be interpreted as a likelihood function or probabilities for the molecular building blocks.
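The following Python sketch illustrates this general ACO idea for sequence design. It is not the system used here: the scoring function is a hypothetical stand-in for the trained ANN filter, and all parameters (number of ants, evaporation and deposit rates) are illustrative. Ants assemble octamers according to a pheromone matrix, and pheromone is evaporated and then reinforced along paths that score well.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"
LENGTH = 8          # octamers, as for MHC I binding peptides
N_ANTS = 2000       # illustrative; the chapter used several hundred thousand ants
random.seed(0)

def score(peptide: str) -> float:
    """Toy stand-in for the trained ANN jury filter (hypothetical residue preferences)."""
    return (peptide[1] == "Y") + (peptide[-1] == "L") + 0.1 * peptide.count("I")

# One pheromone intensity per sequence position and amino acid.
pheromone = [[1.0] * len(AA) for _ in range(LENGTH)]

def build_peptide() -> str:
    """An artificial ant assembles an octamer residue by residue, guided by pheromone."""
    return "".join(random.choices(AA, weights=pheromone[pos], k=1)[0] for pos in range(LENGTH))

best, best_score = "", float("-inf")
for _ in range(N_ANTS):
    pep = build_peptide()
    s = score(pep)
    if s > best_score:
        best, best_score = pep, s
    for row in pheromone:                    # evaporation on all trails
        for j in range(len(AA)):
            row[j] *= 0.999
    for pos, aa in enumerate(pep):           # deposit pheromone along this ant's path
        pheromone[pos][AA.index(aa)] += 0.05 * s

print(best, round(best_score, 2))
```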
The stability test (Brock et al. 1996) measures the binding of peptides indirectly. MHC I molecules are only stable with a bound cognate peptide. The intracellular peptide transport for loading MHC I molecules includes the essential TAP system. RMA-S cells are TAP deficient, and their MHC I molecules can only be stabilized by an external addition of MHC I binding peptides. MHC I is identified by a fluorescently labeled antibody.

Figure 7. A path of an artificial ant in decision space. The dots represent transition probabilities (pheromone concentrations) to move from one residue position to the next. In the example, the artificial ant
moved along a path representing the peptide sequence DKYKFRWR.

Rational peptide design strategies can be applied in novel vaccine concepts. A prerequisite is the prediction of peptide binding to the MHC I molecule. It is important to find, among the predicted binding peptides, those which are able to induce and stimulate an immune response. Such peptides are called epitopes. The immune response occurs via the recognition of the MHC I-peptide complex by the T-cell receptor. Therefore, in the current context, active peptides are called T-cell epitopes. Several prediction tools for T-cell epitopes exist, but the trained ANNs presented here have the best prediction accuracy compared with all public systems (Filter, Wrede, personal communication). The Matthews correlation coefficient cc was used to determine classification (test data) and re-classification (training data) accuracies. This analysis gives immediate information about the generalization ability of a given classifier or QSAR model:
$$cc = \frac{P\,N - O\,U}{\sqrt{(N + U)(N + O)(P + U)(P + O)}}$$

where P is the number of correctly predicted positive examples, N the number of correctly predicted negative examples, O the number of false positives (over-prediction), and U the number of false negatives (under-prediction). This index assumes values in the interval [-1,1], where cc = 1 indicates error-free classification. For the MHC I case study, correlation coefficients of 0.96 (training data) and 0.94 (test data) were obtained. The neural architecture used as a virtual screening filter is a combination of three different networks and a single jury network. Three different individual feed-forward networks were trained on classifying known binding and non-binding octapeptides. Each neural network was fed with a different molecular descriptor to account for


different molecular features (descriptor spaces A, B, C). Then, the output values of networks A, B, and C were combined to form the input to the jury network, yielding a single score value corresponding to the predicted probability of a peptide being MHC I-binding or non-binding. The jury approach yielded better predictions than the individual networks.
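A small Python sketch of the correlation coefficient defined above, using illustrative counts rather than the chapter's data:

```python
# Matthews correlation coefficient from the P/N/O/U counts
# (correct positives, correct negatives, over-predictions, under-predictions).
import math

def matthews_cc(P: int, N: int, O: int, U: int) -> float:
    denom = math.sqrt((N + U) * (N + O) * (P + U) * (P + O))
    return (P * N - O * U) / denom if denom else 0.0

# Illustrative counts for a binding/non-binding classifier.
print(matthews_cc(P=45, N=40, O=3, U=2))
```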
For peptide design, the artificial ant system was run using several hundred thousand artificial ants. The stages of the pheromone matrix update can be described as follows: in the beginning the ants wander through search space without a preferred path, corresponding to random peptides. After some time, preferred side-chain substituents emerge, and most paths contain these building blocks. Finally the optimal path is found. The most potent peptide found was ITYQYIPL, with low nanomolar activity. It shares residues with the known mimotope SIYRYYGL at the anchor positions (underlined), which are important for MHC I binding. Among all peptides that were designed using the artificial ant system, more than 90% exhibited the desired activity profile.

Outlook
Can the tools described here help or support the tasks of systems biology? Before answering this question I would like to describe the term systems biology as I understand it: classical research follows the reductionist approach in the sense that only subsystems are studied in great detail under controlled conditions. The results are combined with those obtained from other related subsystems. This scientific approach led to many discoveries, like the classical metabolic pathways such as the Krebs cycle, the urea cycle, amino acid metabolism and many other biochemical systems as they are described in biochemistry and cell biology textbooks. Nowadays computer systems can handle parallel processes as they occur in many instances in biology. The almost endless amount of biochemical data makes it necessary to analyze them in many different contexts. Parallel distributed processing of many metabolic data should result in multidimensional models of cellular functions. A goal could be to design a virtual biochemically driven system in order to make predictions when the effects of drugs are studied. Another important question is how a cell can maintain all its chemical reactions under variable conditions. How is the self-sustaining chemistry within the cell possible? (Wrede, 2007).

ACKNOWLEDGMENT
All the work described here was performed in a very fruitful collaboration with many colleagues, listed here alphabetically: Karl-Heinz Baringhaus, Anne Bredenbeck, Matthias Filter, Hans-Peter Grunert, Jan Hiss, Jürgen Kleffe, Rudolf Kunze, Florian Losch, Wolfgang Rönspeck, Gisbert Schneider, Johannes Schuchhardt, Wieland Schrödl, Peter Walden, Gerd Wallukat, Brigitte Wittmann-Liebold and Heinz Zeichhardt. Many of them supported me with their articles and inspiring discussions.

References
Benner, S. A., Cohen, M. A., & Gonnet, G. H. (1994). Amino acid substitution during functionally
constrained divergent evolution of protein sequences. Protein Engineering, 7, 1323-1332.


Böhm, H-J., & Schneider, G. (Eds.) (2000). Virtual screening of bioactive molecules. Wiley-VCH,
Weinheim.
Bredenbeck, A., Losch, F. O., Sharav, T., Eichler-Mertens, M., Filter, M., Givehchi, A., Sterry, W., Wrede,
P., & Walden, P. (2005). Identification of noncanonical melanoma-associated T cell epitopes for cancer
immunotherapy. Journal of Immunology, 174, 6716-6724.
Brock, R., Wiesmüller, K-H., Jung, G., & Walden, P. (1996). Molecular basis for the recognition of two
structurally different major histocompatibility complex/peptide complexes by a single T cell receptor.
Proceedings of the National Academy of Sciences (USA), 93, 13108-13113.
Cetta, F., & Michels, V. V. (1995). The autoimmune basis of dilated cardiomyopathy. Annals of Medicine, 27, 169-173.
Cybenko, G. (1989). Approximations by superpositions of a sigmoidal function. Mathematics of Control,
Signals and Systems. 2, 303-314.
Dayhoff, M. O., & Eck, R. V. (1968) A model of evolutionary change in proteins: In Atlas of protein
sequence and structure (M.O. Dayhoff, editor). National Biomedical Research Foundation. Washington
D.C. 345.
DeGrado, W. F., & Lear, D. J. (1990). Conformationally constrained alpha-helical peptide models for
protein ion channels. Biopolymers, 29, 205-213.
Dorigo, M., & Sttzle, T. (2004) Ant colony optimization. Cambridge: MIT Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001) Pattern classification 2nd ed. New York: John Wiley and
Sons.
Eigen, M. (1971). Self-organization of matter and the evolution of biological macromolecules. Naturwissenschaften, 58, 465-523.
Eigen, M., Winkler-Oswatitsch, R., & Dress, A. (1988a). Statistical geometry in sequence space: A method of quantitative comparative sequence analysis. Proc. Natl. Acad. Sci., 85, 5913-5917.
Eigen, M., McCaskill, J. S., & Schuster, P. (1988b). The molecular quasi-species. Advances in Chemical Physics, 75, 149-263.
Engelman, D. A., Steitz, T., & Goldman, A. (1986). Identifying nonpolar transbilayer helices in amino
acid sequences of membrane proteins. Annual Reviews of Biophysical Chemistry, 15, 321-353
Filter, M., Eichler-Mertens, M., Bredenbeck, A., Losch, F.O., Sharav, T., Givehchi, A., Walden, P., &
Wrede, P. (2006). A strategy for the identification of canonical and non-canonical MHC I-binding epitopes
using an ANN-based epitope prediction algorithm. QSAR and Combinatorial Sciences, 50, 350-358.
Feng, D. F., Johnson, M. S., & Doolittle, R. F. (1985). Aligning amino acid sequences: Comparison of
commonly used methods. J. Mol. Evol, 21, 112-124.
Fontana, W., Stadler, P. F., Bornberg-Bauer, E. G., Griesmacher, T., Hofacker, I. L., Tacker, M., Tarazona, P., Weinberger, E. D., & Schuster, P. (1993). RNA folding and combinatory landscapes. Physical Review E, 47, 2083.


Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science, 185, 862-872.
Harpaz, Y., Gerstein, M., & Chothia, C. (1994). Volume changes on protein folding. Structure, 2, 641-649.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Santa
Fe Institute
Hiss, J. A., Bredenbeck, A., Losch, F. O., Wrede, P., Walden, P., & Schneider, G. (2007). Design of MHC
I stabilizing peptides by agent based exploration of sequence space. Protein Engineering, Design and
Selection, 20, 99-108.
Hölldobler, B., & Wilson, E. O. (1994). Journey to the ants. A story of scientific explorations. Cambridge,
MA: Harvard University Press.
Janeway, C. A., Travers, P., Walport, M., & Shlomchik, M. (2001) Immunobiology, 5th Edition. New
York: Garland Publisher.
Jones, D. T., Taylor, W. P., & Thornton, J. (1992). The rapid generation of mutation data matrices from
protein sequences. Computer Applications in Biosciences, 8, 275-282.
Kauffman, S. (1993). The origins of order: Self-organization and selection in evolution. New York:
Oxford University Press.
Kimura, M. (1983). The neutral theory of molecular evolution. Cambridge: University Press.
Kramer, A., Keitel, T., Winkler, K., Stöcklein, W., Höhne, W., & Schneider-Mergener, J. (1997) Molecular
basis for the binding promiscuity of an anti-p24 (HIV-1) monoclonal antibody. Cell, 91, 799-809.
Laurents, D. V., Subbiah, S., & Levitt, M. (1994). Different protein sequences can give rise to highly
similar folds through different stabilizing interactions. Protein Science, 3, 1938-1944.
Li, W. H., & Graur, D. (1999). Fundamentals in molecular evolution. Sinauer Associates, Sunderland.
Lund, O., Nielsen, M., Lundegaard, C., Kesmir, C., & Brunak, S. (2005) Immunological Bioinformatics.
Cambridge: MIT Press.
Milne, G. W. A. (1997). Mathematics as a basis for chemistry. Journal of Chemical Information and
Computational Sciences, 37, 639-644.
Minor, D., & Kim, P. S. (1994). Context is a major determinant of β-sheet propensity. Nature, 371, 264-267.
Mobini, R., Fu, M., Wallukat, G., Magnusson, Y., Hjalmarson, A., & Hoebeke, J. (2000). A monoclonal
antibody directed against an autoimmune epitope in the human β1-adrenergic receptor recognized in
idiopathic dilated cardiomyopathy. Hybridoma, 19, 135-142.
Myata, T., Miyazawa, S., & Yasunaga, T. (1979). Two types of amino acid substitutions in protein evolution. Journal of Molecular Evolution, 12, 219-236.


Peters, B., Bui, H.H., Frankild, S., Nielson, M., Lundegaard, C., Kostem, E., Basch, D., Lamberth, K.,
Harndahl, M., Fleri, W., Wilson, S.S., Sidney, J., Lund, O., Buus, S., & Sette, A. (2006). A community
resource benchmarking predictions of peptide binding to MHC-I molecules. PLos Computational Biology, 2, e65.
Rammensee, H. G., Friede, T., & Stevanović, S. (1995). MHC ligands and peptide motifs: First listings. Immunogenetics, 41, 178-228.
Rammensee, H. G., Bachmann, J., Emmerich, N. P., Bachor, O. A., & Stevanović, S. (1999). SYFPEITHI: Database for MHC ligands and peptide motifs. Immunogenetics, 50, 213-219.
Rao, J. K. M. (1987). New scoring matrix for amino acid residue exchange based on residue characteristic physical parameters. International Journal of Peptide and Protein Research, 29, 276-281.
Rumelhart, D. E., McClelland, J. L., & The PDP Group (1986). Parallel distributed processing, 1(2).
Cambridge, USA: MIT Press.
Schneider, G., & Wrede, P. (1993). Development of artificial neural filters for pattern recognition in
protein sequences. Journal of Molecular Evolution, 36, 586-595.
Schneider, G., & Wrede, P. (1994). The rational design of amino acid sequences by artificial neural
networks and simulated molecular evolution: De novo design of an idealized leader peptidase cleavage
site. Biophysical Journal, 66, 335-344.
Schneider, G., Clément-Chomienne, O., Hilfiger, L., Schneider, P., Kirsch, S., Böhm, H-J., & Neidhart, W. (2000). Virtual screening for bioactive molecules by evolutionary de novo design. Angewandte Chemie International Edition, 39, 4130-4133.
Schneider, G., & So, S-S. (2003). Adaptive Systems in Drug Design. Georgetown: Landes Bioscience.
Schneider, G., & Wrede, P. (1998). Artificial neural networks for computer-based molecular design.
Progress in Biophysics and Molecular Biology, 70, 175-222.
Schneider, G., Grunert, H. P., Schuchhardt, J., Wolf, K-U., Müller, G., Habermehl, K-O., Zeichhardt, H., & Wrede, P. (1995). A peptide selection scheme for systematic evolutionary design and construction of synthetic peptide libraries. Minimal Invasive Medizin, 6, 106-115.
Schneider, G., Schrödl, W., Wallukat, G., Müller, J., Nissen, E., Rönspeck, W., Wrede, P., & Kunze, R. (1998). Peptide design by artificial neural networks and computer-based evolutionary search. Proceedings of the National Academy of Sciences, 95, 12179-12184.
Schneider, G., & Baringhaus, K-H. (2008). Molecular design. Weinheim: Wiley-VCH.
Schuster, P. (1986). The physical basis of molecular evolution. Chemica Scripta, 26B, 27.
Taylor, W.R. (1986). The classification of amino acid conservation. Journal of Theoretical Biology, 119,
205-218.
Wallukat, G., Wollenberger, A., Morwinski, R., & Pitschner, H.F. (1995). Anti-beta 1 adrenoreceptor
autoantibodies with chronotropic activity from the serum of patients with dilated cardiomyopathy:


Mapping of epitopes in the first and second extracellular loops. Journal of Molecular and Cellular
Cardiology, 27, 397-406.
Wrede, P., & Filter, M. (2005). Bioinformatics: From peptides to profiled leads. In J. Knäblein (Ed.), Modern Biopharmaceuticals (Vol. 4, pp. 1771-1801). Weinheim: Wiley-VCH.
Wrede, P., & Schneider, G. (1994). Concepts in protein engineering and design. Berlin & New York: Walter de Gruyter.
Wrede, P., Landt, O., Klages, S., Afshin, A., Hahn, U., & Schneider, G. (1998). Peptide design aided
by neural networks: Biological activity of artificial signal peptidase I cleavage sites. Biochemistry, 37,
3588-3593.
Wrede, P. (2007). Molecular biology - the self-sustaining chemistry. Chemistry Central Journal, 1, 25.

Key Terms
Amino Acid Distance Matrix: Calculation of the Euclidean distances between all 20 amino acids according to their physicochemical properties and their genetic coding.
ANN: Artificial Neural Networks, a system for generating artificial fitness landscapes. ANNs are often used as function estimators and classification systems. They follow the principle of convoluting simple non-linear functions to approximate complicated input-output relationships. ANNs are favoured classification systems because they can make good predictions even with noisy data.
Ant Colony Optimization: Stochastic optimisation procedure imitating ant foraging behaviour. The method allows the path through sequence space to be visualized.
Autoimmune Disease: Disease caused by the adaptive immune system responses to self antigens.
DARWINIZER: Computer-based simulated molecular evolution cycle for the optimisation of peptide sequences. It combines an artificial fitness function (e.g. a trained artificial neural network) for selecting the best mutated sequence offspring with a mutation operator that works like the PepHarvester algorithm.
DCM (Dilated Cardiomyopathy): Severe heart disease; here, the autoimmune form with autoantibodies directed against the β-adrenergic receptor, leading to permanent stimulation of the heart beat frequency.
De Novo Design Cycle: Building novel molecules with a given function starting from a model. A model can be the specific knowledge about a receptor-ligand interaction.
ELISA (Enzyme-Linked Immunosorbent Assay): Serological assay in which an antigen is detected by an enzyme-linked antibody that converts a colourless substrate into a coloured product.
Feature Extraction: Process of reducing data by measuring certain properties or features. These
features are used in a classifier.


Focused Libraries: An enriched peptide or small-molecule library in which the number of active molecules found is, on average, significantly larger than in a randomly picked subset.
MHC I: Major Histocompatibility Complex class I. General name for highly polymorphic, membrane-bound glycoproteins presenting peptide antigens to T-cells. They are also known as histocompatibility antigens (Janeway et al., 2001).
Pattern Recognition: Process of classifying patterns according to common features. Feature extraction is therefore an essential prerequisite for the process of pattern recognition.
PCA: Principal Component Analysis. Technique seeking a projection which represents the data in the best way. The new coordinates can be considered as linear combinations of the original descriptor axes, often treated as factors (principal components) (Schneider & Baringhaus, 2008).
PepHarvester: Algorithm to generate a focused library starting from a single seed peptide.
QSAR: Quantitative Structure-Activity Relationship, a term used in drug discovery research. The term stands for the relation between the physico-chemical properties of a compound and its biological function. Several techniques can describe this relation in a quantitative manner.
SAR: Structure (Sequence) Activity Relation, used here in the context of amino acid sequence-activity relations. This relation is approximated by a stochastic procedure such as artificial neural networks.
Sammon Mapping: Non-linear mapping procedure that approximates local geometric relationships in a low-dimensional space (Böhm & Schneider, 2000).


Chapter XXVII
Applications of Metabolic Flux Balancing in Medicine

Ferda Mavituna
The University of Manchester, UK

Raul Munoz-Hernandez
The University of Manchester, UK

Ana Katerine de Carvalho Lima Lobato
Federal University of Rio Grande do Norte, Brazil and Potiguar University, Brazil

Abstract
This chapter summarizes the fundamentals of metabolic flux balancing as a computational tool of metabolic engineering and systems biology. It also presents examples from the literature for its applications
in medicine. These examples involve mainly liver metabolism and antibiotic production. Metabolic
flux balancing is a computational method for the determination of metabolic pathway fluxes through a
stoichiometric model of the cellular pathways, using mass balances for intracellular metabolites. It is a
powerful tool for studying metabolism under normal and abnormal conditions with a view to engineering the metabolism. Its extended potential in medicine is emphasized in the section on future trends.

Introduction
Systems biology simultaneously studies the complex interactions of the many cell components, using many levels and forms of biological information and data, in order to understand how they work together or fail to do so. Beyond the single cell, the natural challenge for systems biology is to understand the integrated functioning of tissues, organs and the whole organism, such as the human body.



In systems biology, metabolism is the final manifestation of the integrated functioning, regulation and
control of genes, transcription, translation and enzyme action. The effects of some genetic alterations
cannot always be observed in the phenotype but the genetic effects can be observed in the physiology
or metabolome. Metabolism is the fundamental determinant of cell physiology and the chemical engine
that drives the living process. Metabolic engineering is the study of metabolism using scientific and
engineering tools in order to understand it better under normal and abnormal conditions such as disease, injury, stress or mutation. Metabolic engineering is therefore an important component of systems
biology and metabolic flux balancing is a powerful tool of metabolic engineering (Stephanopoulos et
al, 1998; Palsson, 2006).
The objectives of this chapter are to introduce the fundamentals of metabolic flux balancing and
show its potential as a tool of systems biology through its applications in medicine.

Fundamentals of Metabolic Flux Balancing


Metabolism converts substrates into metabolic energy, redox potential and metabolic end products that
are essential for cellular function. Several independent reactions that govern the synthesis and organisation of the macromolecules into a functioning cell can be classified as fueling reactions, biosynthetic
reactions, polymerisation reactions and assembly reactions.
Characteristics of metabolic pathways can be summarized as follows:

• Almost all metabolic reactions are reversible.
• Metabolic pathways, however, are irreversible.
• Every metabolic pathway has a first committed step.
• All metabolic pathways are regulated.
• Metabolic pathways in eukaryotic cells occur in specific cellular locations.
• Different metabolic pathways are connected by metabolites that participate in more than one pathway by pathway branching. These metabolites, therefore, connect one reaction sequence with another.
• Co-factors like ATP, NADH and NADPH also take part in pathway integration because of their central roles in biosynthetic reactions. Biosynthetic reactions continuously form and utilise these co-factors and hence connect individual reactions both within the same pathway and between different pathways.

While cell composition may vary with cell type and physiological and environmental conditions, a typical cell can be assumed to contain: protein, RNA, DNA, lipids, lipopolysaccharides, peptidoglycan, glycogen and free amino acids. The 12 precursor metabolites formed in the fueling reactions are used to synthesize about 75-100 building blocks, coenzymes and prosthetic groups needed for cellular
synthesis. The major biosynthetic pathways involved in cell growth include the biosynthesis of amino
acids, nucleotides, sugars, amino sugars and lipids. The building blocks produced in biosynthetic reactions are sequentially linked into long branched or unbranched polymeric chains during polymerisation
reactions. These long polymeric chains are called the macromolecules of cellular biomass and can be
grouped into ribonucleic acid (RNA), deoxyribonucleic acid (DNA), proteins, carbohydrates, free amino
acids and lipids.


Metabolic flux balancing is a computational method for the determination of metabolic pathway
fluxes (specific rates of reactions) through a stoichiometric model of the cellular pathways, using mass
balances for intracellular metabolites.
A very important assumption used in almost all of the large scale metabolic flux balancing applications is the metabolic pseudo-steady state concept. In this context, the steady state refers to the condition
at which all the concentrations of the metabolites, the metabolic reactions rates and the biomass composition are constant during the snapshot study of the metabolism. In practice, this is usually achieved
by constant extracellular metabolite concentrations in a continuous bioreactor or chemostat cultures.
Furthermore, one of the characteristics of living organisms is their ability to maintain a relatively constant composition whilst continually taking in nutrients from the environment and returning excretory
products, a property known as homeostasis.
The concept of metabolic steady state corresponds to a dynamic equilibrium. The dynamics of
metabolic reactions can also be expressed by their characteristic times. Various reactions of cell metabolism operate at different time scales so when considering a reaction pathway, only reactions with
comparable time scales need be considered. The assumption of pseudo-steady state in the metabolism
is therefore valid considering the critical times/relaxation times of individual reactions compared to the
organism's response to changes in the physico-chemical environment and to the frequency and sensitivity of
observation/monitoring.
Mass balances on metabolites are illustrated below using the hypothetical metabolic pathway of Figure 1. In Figure 1, ri, where i = 1, ..., 6, represents the specific reaction rate for the formation or consumption of a metabolite, and rA, rB, rD and rE are the transport reactions for the metabolites between the cell (or cell compartment) and its environment. These specific reaction rates are the metabolic fluxes. Referring to Figure 1, the steady state mass balances for the individual metabolites are:
A: rA - r1 = 0        (1)

Figure 1. A hypothetical metabolic pathway involving metabolites A-E. The boundary for the purposes of mass balances may represent the cell or a sub-cellular compartment. Metabolites A, B, D and E are transported across the boundary. r represents the fluxes (specific reaction rates).

B: r1 - r2 - rB = 0        (2)

C: r2 + r5 - r3 - r4 = 0        (3)

D: r3 + r6 - rD = 0        (4)

E: r4 - r5 - r6 - rE = 0        (5)

These metabolite balances are rearranged so that transport fluxes are presented separately from the
internal fluxes as shown in the matrices of Figure 2 and Eq. 6.
These balances are then expressed in matrix formalism as shown in Figure 2. The elements of the
stoichiometric matrix M represent the stoichiometric coefficients of the metabolites participating in a
particular reaction. If the metabolite is produced in a particular reaction, then its stoichiometric coefficient is positive; if it is consumed, the coefficient is negative. A zero stoichiometric coefficient indicates that the metabolite does not participate in that particular reaction. Note that this stoichiometric matrix M is written in the transposed form of the stoichiometric matrices introduced later in the text.
At steady state:

Mv = b        (6)

Let's assume that we can measure two transport fluxes, such as rA and rD:

• Total number of internal fluxes = 6
• Total number of transport fluxes = 4
• Total number of fluxes = 6 + 4 = 10
• Total number of known fluxes = 2
• Total number of unknown fluxes = 8

Figure 2. Mass balances for the metabolites of the pathway of Figure 1, represented in matrix formalism. With the vector of internal fluxes v = (r1, r2, r3, r4, r5, r6)T and the vector of transport fluxes b = (-rA, rB, 0, rD, rE)T, the balances read Mv = b (Eq. 6), where the stoichiometric matrix M is:

                   r1   r2   r3   r4   r5   r6
Metabolite A       -1    0    0    0    0    0
Metabolite B       +1   -1    0    0    0    0
Metabolite C        0   +1   -1   -1   +1    0
Metabolite D        0    0   +1    0    0   +1
Metabolite E        0    0    0   +1   -1   -1


• Total number of linearly independent equations = 5 (metabolite balances)
• Degrees of freedom = 8 - 5 = 3

With three degrees of freedom, this metabolic network is underdetermined and has infinitely many possible flux solutions. Alternatively, if we could measure three more fluxes, the degrees of freedom would be zero, meaning that we could then calculate the remaining five unknown fluxes from the five metabolite balance equations.
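As a minimal numerical illustration of this counting argument (not part of the original chapter), the Python/NumPy sketch below encodes the Figure 1 network using the stoichiometric matrix of Figure 2; the measured flux values and the choice of three additional measurements (r2, r4 and r5) are hypothetical and serve only to show how the remaining fluxes follow from the balances.

```python
import numpy as np

# Stoichiometric matrix from Figure 2: rows are metabolites A-E,
# columns are the internal fluxes r1-r6.
M = np.array([
    [-1,  0,  0,  0,  0,  0],   # A
    [ 1, -1,  0,  0,  0,  0],   # B
    [ 0,  1, -1, -1,  1,  0],   # C
    [ 0,  0,  1,  0,  0,  1],   # D
    [ 0,  0,  0,  1, -1, -1],   # E
])

# Two measured transport fluxes (hypothetical numbers).
rA, rD = 1.0, 0.6

# Unknowns: the six internal fluxes plus the transport fluxes rB and rE.
n_unknowns, n_equations = 6 + 2, M.shape[0]
print("degrees of freedom:", n_unknowns - n_equations)   # -> 3

# Measuring three more fluxes (hypothetical values for r2, r4 and r5)
# closes the degrees of freedom, and the balances give the rest directly.
r2, r4, r5 = 0.7, 0.5, 0.2
r1 = rA                      # balance (1): rA - r1 = 0
rB = r1 - r2                 # balance (2)
r3 = r2 + r5 - r4            # balance (3)
r6 = rD - r3                 # balance (4)
rE = r4 - r5 - r6            # balance (5)

# Consistency check against the matrix form Mv = b of Eq. (6) and Figure 2.
v = np.array([r1, r2, r3, r4, r5, r6])
b = np.array([-rA, rB, 0.0, rD, rE])
assert np.allclose(M @ v, b)
print({"r1": r1, "r3": r3, "r6": r6, "rB": rB, "rE": rE})
```

Which extra fluxes are measured matters in practice: the chosen set must leave the remaining balance equations linearly independent, otherwise the system stays underdetermined.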
Computational metabolic flux analysis via metabolic flux balancing involves the following steps.

Cell Composition
Measuring and/or compiling data on the macromolecular, metabolite and elemental composition of the cells.

Data Compilation on Constraints


Measuring and/or compiling data on kinetic, operational, technological and biological parameters
and constraints. These include experimental specific growth, substrate uptake and product formation
rates.

Construction of the Metabolism in Matrix Formalism


A matrix of stoichiometrically balanced biochemical reactions of the metabolic pathways of interest
is constructed using the metabolic pathway topology available from the literature and databases, such
as; the Boehringer Mannheim table of metabolic pathways (Michal, 1993; http://us.expasy.org/cgi-bin/
show_thumbnails.pl) and KEGG (http://www.genome.jp/kegg/pathway.html).
Stoichiometry is written such that a compound used in the forward reaction (as reactant) has a
negative stoichiometric coefficient, and a compound formed in the forward reaction has a positive stoichiometric coefficient. If a compound does not participate in a reaction, its stoichiometric coefficient
for that reaction is zero. For generalisation, following the nomenclature used in Stephanopoulos et al,
(1998) and as shown in Table 1, the stoichiometric coefficients are termed α for substrates, β for metabolic products, γ for biomass compounds and g for intracellular metabolites. In the representation
of stoichiometrically balanced J number of metabolic reactions in matrix formalism, we can consider
a system where N substrates are converted to M metabolic products and Q biomass constituents via K
intracellular metabolites which participate as pathway intermediates.
The two-numbered index on the stoichiometric coefficient indicates the reaction number and the
compound, e.g. αji is the stoichiometric coefficient for the ith substrate in the jth reaction. With these
definitions (Table 1), the stoichiometry for the jth cellular reaction can be written as (Stephanopoulos
et al, 1998):
  N            M            Q                 K
  Σ αji Si  +  Σ βji Pi  +  Σ γji Xmacro,i  +  Σ gji Xmet,i  =  0        (7)
 i=1          i=1          i=1               i=1

There is an equation like this for each of the J number of cellular reactions. All J number of reactions
can be represented by Eq. 8 using matrix notation and symbols of Table 1:


Table 1. Nomenclature used in the representation of metabolite mass balances, following Stephanopoulos et al (1998), in matrix formalism. Metabolic components and stoichiometric coefficients are the elements of the metabolic and stoichiometric matrices, respectively, in J metabolic reactions.

System Elements | Total Number | Metabolic Component | Metabolic Matrix | Stoichiometric Coefficient | Stoichiometric Matrix
Metabolic Reactions | J | reaction j | v (rates) | N.A. | N.A.
Substrates | N | Si | S | αji | A
Metabolic Products | M | Pi | P | βji | B
Macromolecular Biomass Components | Q | Xmacro,i | Xmacro | γji | Γ
Intracellular Metabolites (pathway intermediates) | K | Xmet,i | Xmet | gji | G

AS + BP + ΓXmacro + GXmet = 0        (8)

In these matrices rows represent reactions and columns represent metabolites, which is different from the matrix in Figure 2. Following the rules of multiplication of a matrix by a vector, the stoichiometric matrices A, B, Γ and G will have to be transposed in step 4, which then converts them to the form used in Figure 2; that is, in the transposed matrices AT, BT, ΓT and GT, rows represent metabolites and columns represent reactions.

From Stoichiometry to Reaction Fluxes


The stoichiometry defines the relative amounts of the compounds produced or consumed in each of
the J intracellular reactions, but does not allow us to calculate the rates (fluxes) at which metabolic
reactions proceed, substrates are taken up and metabolic products are secreted into the medium. This
can be done by introducing the rates of the individual reactions and further coupling them in order to
determine the overall rates.
For cellular reactions, the biomass is often used as a reference to define specific rates. Then, the
cellular reaction rates can be expressed as the specific rate of reaction (fluxes) with the units of (mmol
metabolite) (g dry weight)-1 h-1.
The forward rate (or velocity, or flux), v, defines the rate of a chemical reaction, so a compound is formed at a rate equal to its stoichiometric coefficient in that reaction multiplied by v. The forward reaction rates of the J reactions
are then collected in the rate vector v.
The net specific rate (metabolic flux) for the ith intermediary metabolite is written as the sum of its
consumption and production rate in all J reactions as:

          J
rmet,i =  Σ  gji vj        (9)
         j=1

By performing mass balances for substrates, products, macromolecular cell components and intermediary metabolites under the pseudo-steady state assumption, the steady state metabolite balance equations
are written as (Stephanopoulos et al, 1998):


rs = ATv        (10)

rp = BTv        (11)

rmacro = ΓTv        (12)

rmet = GTv        (13)

where v is the vector of fluxes, and r is the vector of net specific reaction rates. AT is the transpose of
A.
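To make Eqs. 10-13 concrete, the following small sketch (not taken from the chapter) applies them to a hypothetical toy network; the matrices A, B, Γ (written Gamma) and G and the flux vector v are invented for illustration only.

```python
import numpy as np

# Toy network: J = 3 reactions, N = 2 substrates, M = 2 products,
# Q = 1 biomass component, K = 2 intracellular metabolites.
# Rows are reactions and columns are compounds, as in Eq. (8).
A = np.array([[-1,  0],        # alpha: substrate coefficients (negative = consumed)
              [ 0, -1],
              [ 0,  0]])
B = np.array([[ 0,  0],        # beta: metabolic product coefficients
              [ 1,  0],
              [ 0,  1]])
Gamma = np.array([[0],         # gamma: biomass (macromolecular) coefficients
                  [0],
                  [1]])
G = np.array([[ 1,  0],        # g: intracellular metabolite coefficients
              [-1,  1],
              [ 0, -1]])

v = np.array([0.8, 0.8, 0.8])  # hypothetical reaction fluxes

r_s     = A.T @ v              # Eq. (10): net specific rates of the substrates
r_p     = B.T @ v              # Eq. (11): net specific rates of the products
r_macro = Gamma.T @ v          # Eq. (12): net specific rates of biomass components
r_met   = G.T @ v              # Eq. (13): net specific rates of intracellular metabolites

print(r_s, r_p, r_macro, r_met)
# The chosen v satisfies the pseudo-steady state condition r_met = 0.
```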

Metabolic Flux Distributions (Calculation of Unknown Fluxes)


Equation 9 for rmet forms the basis for computational metabolic flux analysis, i.e., the determination of
the unknown pathway fluxes in the intracellular rate vector v:

          J
rmet,i =  Σ  gji vj        (9)
         j=1

The metabolic reaction network matrix is solved for the unknown fluxes using all the collated experimental and literature data on cell composition and any of the specific metabolic rates (fluxes).
The system is determined, with a unique solution, if exactly F fluxes (or reaction rates) in v, where F is the number of degrees of freedom, are experimentally measured or known. In such a determined system, the solution yields the fluxes of the
individual metabolic reactions.
Since a metabolite can take part in more than one metabolic reaction, and since there are many metabolites like this, the number of reactions is usually greater than the number of metabolites. This means
that the number of unknown fluxes will be greater than the number of metabolite balance equations.
This situation results in an underdetermined system of linear algebraic equations which has infinite
solutions. By using linear programming with various stoichiometric, physiological and experimental
constraints, an objective function can be defined and optimised (maximised or minimised). The objective function can be for example, the specific growth, or substrate uptake or product formation rate. The
optimised solution not only gives the optimised value of the objective function but also the corresponding metabolic fluxes. Furthermore, the sensitivities of the metabolic fluxes can be calculated once the
solution is obtained. For large scale optimization solutions, computational software either as stand-alone
applications or as built-in programmes in well-known mathematics platforms can be used.
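The sketch below illustrates this linear programming step on a deliberately tiny, hypothetical network, using SciPy's linprog as one possible solver; real applications would use genome-scale stoichiometric matrices and richer constraint sets, as described above.

```python
import numpy as np
from scipy.optimize import linprog

# Rows: intracellular metabolites; columns: reactions
# (uptake, conversion to P via route 1, via route 2, product secretion).
S = np.array([
    [ 1, -1, -1,  0],   # metabolite X: formed by uptake, consumed by both routes
    [ 0,  1,  1, -1],   # metabolite P: formed by either route, removed by secretion
])

n_rxn = S.shape[1]
uptake_flux = 1.0                                # measured substrate uptake (hypothetical)

# Maximize product secretion (reaction index 3); linprog minimizes, so negate.
c = np.zeros(n_rxn)
c[3] = -1.0

bounds = [(uptake_flux, uptake_flux),            # fix the measured uptake flux
          (0, None), (0, None), (0, None)]       # irreversible internal/secretion fluxes

res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds, method="highs")
print("optimal fluxes:", res.x)
print("maximum product secretion:", -res.fun)
```

Because the two internal routes are stoichiometrically equivalent in this toy example, the solver returns only one of infinitely many optimal flux splits, which mirrors the underdetermined character of real metabolic networks discussed above.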
An example of a metabolic flux distribution is given in Figure 3 for Streptomyces coelicolor with
the objective function of the maximization of the antibiotic production. This figure shows a summary
of the computed results involving about 400 metabolic fluxes. The specific glucose uptake rate and the
specific growth rate were given as constraints to the programme. Such a flux distribution map shows
the extent of participation of a particular metabolic pathway in the overall metabolism. For instance,
using the flux values, the distribution of glucose uptake into glycolytic and pentose phosphate pathways
can be analysed for different experimental and genetic conditions. The completion or otherwise of the
TCA cycle, and various metabolic shunts can be viewed in order to develop a genetic engineering (Kim

et al. 2004) or operational strategy such as nutrient limitations (Naeimpoor & Mavituna, 2000) or type
of nitrogen source (Naeimpoor & Mavituna, 2001).

Applications of Metabolic Flux Balancing in Medicine


Metabolic flux balancing is a powerful tool of metabolic engineering and can be applied for the following in general:

• Quantification of various metabolic fluxes inside cells.
• Identification of possible rigid branch points (nodes) in the pathways.
• Calculation of non-measured extracellular fluxes.
• Investigating the influence of alternative pathways on the distribution of fluxes
• Calculation of maximum theoretical yields
• Optimum growth and production media design
• Selecting the physico-chemical conditions for improved bioprocesses
• Extending the substrate and product range in bioprocesses by designing new metabolic pathways to use new, novel substrates, achieve novel biotransformations
• Designing metabolic pathways to increase the formation of desired metabolites and decrease the formation of unwanted metabolites
• Identification and testing of the important bioreactions and bottlenecks in the metabolism for targeted genetic manipulations
• Comparison of the sets of admissible routes for the wild type (normal) and the mutant/defective/genetically altered cells

As a tool of metabolic engineering, metabolic flux balancing has an important potential in medicine
as shown by the literature review given in this section. Most of the general applications listed above
are still valid in the medical field. The most valuable application of metabolic flux balancing is in the
understanding of the metabolism of cells, organs and the body as an integrated system using a genome-scale metabolic reaction network based on the known genomes. This is particularly useful to find out whether a single target for drug development, such as an enzyme or a gene, is sufficient in the treatment of a particular medical condition. Often, there is no guarantee that the manipulation of a single metabolic reaction will result in the desired systemic response in the human body. The applications
of metabolic flux balancing and other metabolic engineering tools can therefore lead to savings in time
and cost by contributing to the rational selection of targets for medical treatment.
The current applications of metabolic flux balancing in medicine, in particular, can be listed as:

• Study of the metabolism of cells, tissues, organs and the whole body using genome-scale metabolic reaction networks under:
  ◦ Normal conditions
  ◦ Diseased, injured, mutated or genetically altered (such as gene therapy) conditions
• More efficient industrial production of pharmaceuticals such as:
  ◦ Antibiotics
  ◦ Antitumor, anticancer compounds
  ◦ Antiparasitic compounds
  ◦ Antiviral compounds

Figure 3. Computed metabolic fluxes in Streptomyces coelicolor optimised for the maximum production of the antibiotic actinorhodin, with a specific growth rate (mu in the figure) of 0.0253 h-1 and a specific glucose uptake rate of 0.29 (mmol) (g dw)-1 h-1. Numbers in boxes next to the arrows indicate the flux value in (mmol) (g dw)-1 h-1.

Some examples of the applications of metabolic flux balancing in the study of human cell and organ
metabolism are summarized in Table 2. Table 3 on the other hand summarizes some examples of the
application to the production of antibiotics.
Most of the applications of metabolic flux balancing involving animal cells and tissues are on liver
cells (hepatocytes) as summarized in Table 2. Understanding the metabolic and regulatory pathways of
hepatocytes is important due to the important roles the liver plays in the overall human metabolism. These
investigations can provide strategies for optimizing hepatic function and identify potential targets for
improving hepatic functions. They can also be used for biotechnological applications involving liver cells
such as their use for drug testing as an alternative to animal tests and the development of bioartificial
liver (BAL) devices. For example, Si et al. (2007) used metabolic flux balancing for white adipose tissue
(WAT) mass since it is the main determinant of obesity and associated health risks. The calculated flux
distributions predicted the sequential activation of several intracellular cross-compartmental pathways,
including lipogenesis, the pentose phosphate pathway, and the malate cycle. The flux distribution around
pyruvate was a key indicator of adipocyte lipid accumulation. Severe injury activates many stress-related
and inflammatory pathways that can lead to a systemic hypermetabolic state.
Severe burns cause dramatic alterations in liver and whole-body metabolism. Enhanced survival and
immune function have been reported using dehydroepiandrosterone (DHEA) in animal trauma models.
Banta et al. (2005) investigated the specific effects of DHEA on hepatic metabolism following burn
injury both experimentally and using metabolic flux balancing in perfused rat livers. After 4 days of
burn injury and intraperitoneal injections of DHEA and using control animals as well, the livers were
isolated and perfused in vitro, and 28 metabolite fluxes were measured. DHEA administration appears to normalize hepatocellular metabolism in burned rats but also decreases the pentose phosphate pathway (PPP) flux, which may impair the liver's ability to recycle endogenous antioxidants.
In another study, Banta et al. (2007) induced a systemic hypermetabolic response in rats by applying
a moderate burn injury followed two days later by cecum ligation and puncture (CLP) to produce sepsis.
Two days after CLP, livers were analyzed for gene expression changes using DNA microarrays and for
metabolism alterations by ex vivo perfusion coupled with Metabolic Flux Analysis. In their model, burn
injury prior to CLP increased fluxes through posttranslational mechanisms with little contribution of gene
expression, while CLP treatment up-regulated the metabolic machinery by transcriptional mechanisms.
They concluded that mRNA changes measured at a single time point by DNA microarray analysis did
not reliably predict metabolic flux changes in perfused livers. Some examples of other applications of
metabolic flux balancing to liver cells are given in Table 2.
Another important application of metabolic flux balancing in medicine is in the area of the production
of pharmaceuticals. Table 3 lists some examples of metabolic flux balancing investigations in antibiotic
production. In these applications the objective was to study the primary and secondary metabolism of
the antibiotic producing microorganism in order to understand the carbon and nitrogen fluxes so that
either genetic or bioprocessing strategies could be identified. Indeed, in the study of the effect of the type of nitrogen source, Naeimpoor & Mavituna (2001) showed that the type of nitrogen source used in the medium affected the specific growth rate, antibiotic production rate and the metabolites excreted
into the medium. According to the results of metabolic flux balancing, an industrial-scale process could
be started with ammonia which promoted growth and then switched to a nitrate salt which promoted
antibiotic formation.
There is an urgent need to identify, develop and produce new antibiotics against antibiotic resistant
pathogens such as methicillin-resistant S. aureus strains (MRSA) and vancomycin-resistant
enterococci (VRE). Daptomycin (with the trade name of Cubicin) is one such successful antibiotic.
The calcium dependent antibiotic (CDA) of Streptomyces coelicolor is structurally related to daptomycin and, furthermore, it is a non-ribosomal polypeptide. Non-ribosomal polypeptides can lead to
novel natural and non-natural therapeutic compounds in the future. Kim et al. (2004) used metabolic
flux balancing for the production of calcium dependent antibiotic (CDA) for different phases of a batch
culture. A comparison of sensitivities, and the fluxes that changed the most during the batch culture

Table 2. Some examples of application of metabolic flux balancing in the study of mammalian cell cultures.

Reference | Application | Model | Metabolism studied
Lee et al, 2000 | The effect of injury on liver | 61 reactions, 35 metabolites | Primary
Calik and Akbay, 2000 | Liver cells, focusing on fibrosis | 125 reactions, 83 metabolites, GAMS 2.25 | Primary, Collagen, Palmitate
Lee et al, 2003 | Hepatic hypermetabolism | 60 reactions | Primary
Gambhir et al, 2003 | Hybridoma cells compared in three distinct metabolic states | 30 reactions, 32 metabolites, BIONET, Mathematica 4.0 | Primary, Antibody
Chan et al, 2003a | Primary rat hepatocytes | 76 reactions | Primary, Albumin
Chan et al, 2003b | Rat hepatocytes in response to low-insulin and amino acid supplementation | 76 reactions | Primary: Gluconeogenesis (plasma cultures), Glycolysis (medium cultures), Albumin
Chan et al, 2003c | Hepatocytes in response to hormone and amino acid supplementation | 76 reactions | Primary: Gluconeogenesis (plasma cultures), Glycolysis (medium cultures)
Yokoyama et al, 2005 | Two liver models: Gluconeogenic state (I), Glycolytic state (II) | 64 reactions, MATLAB | Primary: Gluconeogenesis (I), Glycolysis (II)
Banta et al, 2005 | Effects of DHEA in liver metabolism following a burn injury | 72 reactions, 45 metabolites, Least-square | Central carbon, Nitrogen, Energy
Nolan et al, 2006 | Liver central carbon metabolism | 60 reactions, MATLAB | Primary
Banta et al, 2007 | Metabolic network model for liver | 72 reactions, 45 metabolites, Least-square |
Si et al, 2007 | Differentiating 3T3-L1 preadipocytes | MATLAB | Primary, Adipocyte production

Table 3. Application of metabolic flux balancing to the production of antibiotics.

Reference | Application | Computational Model | Metabolism Studied
Henriksen et al, 1996 | Flux distribution for different cultivation conditions | 72 reactions, 77 metabolites | Penicillin
Jin et al, 1997 | S. cerevisiae metabolic flux distributions | 24 reactions | Primary, Heterologous protein
Daae & Andrew, 1999 | Streptomyces lividans | 53 metabolites, 57 reactions, Matlab/Excel | Primary metabolism
Naeimpoor & Mavituna, 2000 | Effect of nutrient limitations on Streptomyces coelicolor metabolism | 200 reactions, GAMS | Primary metabolism & Actinorhodin
Van Gulik et al, 2000 | Penicillium chrysogenum | 195 reactions, SPAD it | Primary & Penicillin-G
Naeimpoor & Mavituna, 2001 | Effect of different nitrogen sources on Streptomyces coelicolor metabolism | 237 reactions, GAMS | Primary & Secondary, Actinorhodin
Thykaer et al, 2002 | Penicillium chrysogenum | Glucose uptake, Biomass, AA pattern, Adipoyl-7-ADCA | Primary, Adipate pathway, Adipoyl-7-ADCA
Rossa et al, 2002 | Streptomyces lividans | 38 reactions, 47 compounds, BioNet | Biomass, Actinorhodin, Undecylprodigiosin
Kim et al, 2004 | Streptomyces coelicolor metabolism | 400 reactions, GAMS | Primary & Secondary, CDA
Bushell et al, 2006 | Streptomyces clavuligerus | 57 reactions, Flux Analyzer | Clavulanic acid
Gonzalez-Lergier et al, 2006 | Escherichia coli | 933 reactions, 625 compounds | Polyketides biosynthesis, Erythromycin
Kleijn et al, 2007 | Penicillium chrysogenum | MNAv3.0 | Penicillin-G
Kiviharju et al, 2007 | Streptomyces peucetius | 515 reactions, 624 reactions, Matlab/Flux Analyzer 5.2 | Rhodomycinone production

indicated possible genetic engineering strategies to increase the production yields, which were then tested with in silico experiments.

Future Trends
Metabolic flux balancing will show its full potential in medicine with the following applications, some of which are already developing into exciting research topics:

• Drug development, delivery and drug testing in silico:
  ◦ Identifying targets for drug development by studying host/pathogen and host/parasite metabolic interactions, in order to identify weak points in the metabolism of the pathogen or the parasite without affecting the host
  ◦ Effects of drugs on the metabolism of the pathogen, parasite and host
  ◦ Drug metabolism in the host, especially side effects, drug degradation, by-products and end products of degradation
  ◦ Interpretation of experimental or clinical drug testing results, and adjustment (scale-up) of drug dosage from experimental animals to the patient
  ◦ The timing and dosing of drug delivery intended to work on the pathogen or parasite but which will be simultaneously degraded by the patient's metabolism
• Environmental effects of chemicals such as organic and inorganic wastes, hormones or hormone analogues, and xenobiotics on the human metabolism:
  ◦ Interactions
  ◦ Their degradation
  ◦ Investigation for potential treatments
• Gene therapy:
  ◦ Suggestion of metabolic pathway synthesis
  ◦ Testing the metabolism's response to genetic manipulations, especially by consideration of bypasses in the metabolic pathways
• Tissue engineering (metabolic models to study and simulate in vitro):
  ◦ Growth of cells/tissues/organs in vitro (grafts)
  ◦ Stem cell cultures
  ◦ Differentiation
  ◦ Apoptosis
• Symbiotic relationships between the host and other organisms, such as probiotic and prebiotic foods and associated microbial cultures, and their interactions with the human metabolism
• Similar applications as above in veterinary medicine

Metabolic flux balancing is a computational modelling tool which needs verification by in vivo and
in vitro experiments. Using isotope-labelled substrates and metabolic intermediates in experiments, non-invasive techniques such as NMR, genetic deletions in comparison with wild types, metabolic inhibitors, and environmental perturbations should all help in model verification, development and refinement. The results of the optimised solution reflect the theoretical capability of the metabolism and the cell, and may not always be identical to the experimental numerical values of the measured fluxes. If more experimental data are collected and more fluxes are measured, the degrees of freedom in the solution of the metabolic network are reduced and the optimised results approach the experimentally measured values.
The fundamental assumption of pseudo-steady state in the construction of mass balances for metabolites
in metabolic flux balancing can be replaced with dynamic balances for small scale pathways but it
will be more difficult for genome-scale metabolic networks. As simpler models get verified, and our
understanding of metabolism and related cell functions increases, we should incorporate the aspects
of compartmentalisation, regulation and control of the metabolism in the cells, tissues, organs and the
whole body.


Conclusion
In this chapter, the fundamentals of metabolic flux balancing as a computational tool of metabolic engineering and systems biology have been introduced. Some examples summarized from the literature show
its application in medicine either through the study of metabolism of cell/tissue and organ cultures or
through the increased efficiency of the production of pharmaceuticals. Metabolic flux balancing provides
valuable information about metabolic pathway utilization, the extent of participation of metabolites and
parts of the metabolism in the overall system and metabolic physiology under different environmental conditions. Metabolic flux balancing not only gives the individual metabolic reaction rates in the
metabolism but also the secondary calculations can be used to obtain sensitivity analysis based on the
individual metabolic reaction rates. Using metabolic flux balancing, metabolic bottlenecks, genetic deletions and amplifications can be tested in silico in order to develop a rational strategy to perform genetic
changes/modifications. It can also be used to test different operating conditions in order to improve the
production of pharmaceuticals. More exciting applications in medicine, as summarized in the future
trends, will evolve as metabolic flux balancing gets integrated with other advances in biotechnology,
medicine and other related disciplines.

Acknowledgment
We would like to acknowledge the financial support from CONACYT (Mexico) for Raul Munoz-Hernandez and from CAPES and CNPq (Brazil) for Ana Katerine de Carvalho Lima Lobato.

References
Banta, S., Yokoyama, T., Berthiaume, F., & Yarmush, M. L. (2005). Effects of dehydroepiandrosterone
administration on rat hepatic metabolism following thermal injury. Journal of Surgical Research, 127,
93-105.
Banta, S., Vemula, M., Yokoyama, T., Jayaraman, A., Berthiaume, F., & Yarmush, M. L. (2007). Contribution of gene expression to metabolic fluxes in hypermetabolic livers induced through burn injury
and cecal ligation and puncture in rats. Biotechnology and Bioengineering, 97(1), 118-137.
Bushell, M. E., Kirk, S., Zhao, H. J., & Avignone-Rossa, C. A. (2006). Manipulation of the physiology of clavulanic acid biosynthesis with the aid of metabolic flux analysis. Enzyme and Microbial Technology, 39(1), 149-157.
Çalık, P., & Akbay, A. (2000). Mass flux balance-based model and metabolic flux analysis for collagen synthesis in the fibrogenesis process of human liver. Medical Hypotheses, 55(1), 5-14.
Chan, C., Hwang, D., Stephanopoulos, G. N., Yarmush, M. L., & Stephanopoulos, G. (2003a). Application of multivariate analysis to optimize function of cultured hepatocytes. Biotechnology Progress, 19(2), 580-598.


Chan, C., Berthiaume, F., Lee, K., & Yarmush, M. L. (2003b). Metabolic flux analysis of cultured hepatocytes exposed to plasma. Biotechnology and Bioengineering, 81(1), 33-49.
Chan, C., Berthiaume, F., Lee, K., & Yarmush, M. L. (2003c). Metabolic flux analysis of hepatocyte function in hormone- and amino acid-supplemented plasma. Metabolic Engineering, 5(1), 1-15.
Daae, E. B., & Andrew, P. I. (1999). Classification and sensitivity analysis of a proposed primary metabolic reaction network for Streptomyces lividans. Metabolic Engineering, 1(2), 153-165.
Gambhir, A., Korke, R., Lee, J., Fu, P. C., Europa, A., & Hu, W. S. (2003). Analysis of cellular metabolism of hybridoma cells at distinct physiological states. Journal of Bioscience and Bioengineering,
95(4), 317-327.
González-Lergier, J., Broadbelt, L. J., & Hatzimanikatis, V. (2006). Analysis of the maximum theoretical
yield for the synthesis of erythromycin precursors in Escherichia coli. Biotechnology and Bioengineering, 95(4), 638-644.
Henriksen, C. M., Christensen, L. H., Nielsen, J., & Villadsen, J. (1996). Growth energetics and metabolic
fluxes in continuous cultures of Penicillium chrysogenum. Journal of Biotechnology, 45(2), 149-164.
Jin, S., Ye, K., & Shimizu, K. (1997). Metabolic flux distributions in recombinant Saccharomyces cerevisiae during foreign protein production. Journal of Biotechnology, 54(3), 161-174.
Kim, H. B, Smith, C. P., Micklefield, J., & Mavituna, F. (2004). Metabolic flux analysis for calcium dependent antibiotic (CDA) production in Streptomyces coelicolor. Metabolic Engineering, 6(4), 313-325.
Kiviharju, K., Moillanen, U., Leisola, M., & Eerikainen, T. (2007). A chemostat study of Streptomyces
peucetius var. caesius N47. Applied Microbiology and Biotechnology, 73(6), 1267-1274.
Kleijn, R. J., Liu, F., van Winden, W. A., van Gulik, W. M., Ras, C., & Heijnen, J. J. (2007). Cytosolic
NADPH metabolism in penicillin-G producing and non-producing chemostat cultures of Penicillium
chrysogenum. Metabolic Engineering, 9(1), 112-123.
Lee, K., Berthiaume, F., Stephanopoulos, G. N., Yarmush, D. M., & Yarmush, M. L. (2000). Metabolic
flux analysis of postburn hepatic hypermetabolism. Metabolic Engineering, 2(4), 312-327.
Lee, K., Berthiaume, F., Stephanopoulos, G. N., & Yarmush, M. L. (2003). Profiling of dynamic changes
in hypermetabolic livers. Biotechnology and Bioengineering, 83(4), 400-415.
Michal, G. (1993). Biochemical pathways (wall chart). Boehringer Mannheim GmbH.
Naeimpoor, F., & Mavituna, F. (2000). Metabolic flux analysis for Streptomyces coelicolor under various nutrient limitations. Metabolic Engineering, 2, 140-148.
Naeimpoor, F., & Mavituna, F. (2001). Metabolic flux analysis in Streptomyces coelicolor: Effect of nitrogen source. In A. Van Broekhoven, F. Shapiro, & J. Anne (Eds.), Novel Frontiers in the Production of Compounds for Biomedical Use (Vol. 1: Antibiotics, pp. 131-145). Dordrecht, The Netherlands: Kluwer Publishers.
Nolan, R. P., Fenley, A. P., & Lee, K. (2006). Identification of distributed metabolic objectives in the
hypermetabolic liver by flux and energy balance analysis. Metabolic Engineering, 8(1), 30-45.


Palsson, B. O. (2006). Systems biology: Properties of reconstructed networks. Cambridge University Press.
Rossa, C. A., White, J., Kuiper, A., Postma, P. W., Bibb, M., & Teixeira de Mattos, M. J. (2002). Carbon flux distribution in antibiotic-producing chemostat cultures of Streptomyces lividans. Metabolic Engineering, 4(2), 138-150.
Si, Y., Yoon, J., & Lee, K. (2007). Flux profile and modularity analysis of time-dependent metabolic
changes of de novo adipocyte formation. American Journal of Physiology - Endocrinology and Metabolism, 292, 1637-1646.
Stephanopoulos, G., Aristidou, A. A., & Nielsen, J. (1998). Metabolic engineering: Principles and
methodologies. Academic Press.
Thykaer, J., Christensen, B., & Nielsen, J. (2002). Metabolic network analysis of an adipoyl-7-ADCA-producing strain of Penicillium chrysogenum: Elucidation of adipate degradation. Metabolic Engineering, 4(2), 151-158.
Van Gulik, W. M., de Laat, W. T. A. M., Vinke, J. L., & Heijnen, J. J. (2000). Application of metabolic
flux analysis for the identification of metabolic bottlenecks in the biosynthesis of penicillin-G. Biotechnology and Bioengineering, 68(6), 602-618.
Yokoyama, T., Banta, S., Berthiaume, F., Nagrath, D., Tompkins, R. G., & Yarmush, M. L. (2005).
Evolution of intrahepatic carbon, nitrogen, and energy metabolism in a D-galactosamine-induced rat
liver failure model. Metabolic Engineering, 7(2), 88-103.

Key Terms
Biomass: In this context, it means cells, tissues, organs. It is often measured and expressed as the
concentration of dry biomass (dry weight).
Flux: In this context, the metabolic flux is identical to the specific metabolic reaction rates. The most
frequently used units are: (mmol metabolite) (g dry wt biomass)-1 (h)-1.
Matrix: In mathematics, a matrix is a table of elements. These elements, members or entries, may
be numbers or any abstract quantities that can be added and multiplied. Matrices are useful in describing linear equations in a short format. They are also used to keep track of the coefficients of linear
algebraic operations.
Metabolic Pathways: A metabolic pathway is any sequence of feasible and observable biochemical
reaction steps catalysed by enzymes and connecting a specified set of input and output metabolites.
Metabolic Product: A metabolic product is a compound produced by the cells and is excreted to
the extracellular medium. It could be produced in the primary metabolism, e.g. carbon dioxide, ethanol,
acetate, or lactate, or a more complex one, e.g. a secondary metabolite or a heterologous protein secreted
to the extracellular medium.


Specific Reaction Rate: The specific rate of a microbial activity is equal to the volumetric rate for that activity divided by the concentration of the cells performing that activity:

Specific rate = (Volumetric rate) / (Biomass concentration)

Substituting the definition of the volumetric rate into this equation, the specific reaction rate becomes:

Specific rate = (Amount of a compound produced or consumed) / [(Unit volume)(Unit time)(Concentration of biomass)]

Substrate: A substrate is a compound that is present in a sterile culture medium and can either be
further metabolized by, or directly incorporated into, the cell. The substrate could, therefore, range from carbon, nitrogen, and energy sources to various minerals and vitamins essential for cell function.
Transpose of a Matrix: In linear algebra, the transpose of a matrix A is another matrix AT obtained
by writing the rows of A as the columns of AT and the columns of A as the rows of AT .
Volumetric Reaction Rate: It is defined as:

Volumetric rate = (Amount of a compound produced or consumed) / [(Unit volume)(Unit time)]

Section VIII
Data Integration and Data Mining

Chapter XXVIII
Multi-Level Data Integration and Data Mining in Systems Biology

Roberta Alfieri
CNR - Institute for Biomedical Technologies, Italy and CILEA, Italy

Luciano Milanesi
CNR - Institute for Biomedical Technologies, Italy

Abstract
This chapter aims to describe data integration and data mining techniques in the context of systems
biology studies. It argues that the different methods available in the field of data integration can be very
useful in making research in the field of systems biology easier. Moreover, data mining is an important task to take into account in this context; therefore, some aspects of data mining applied to specific systems biology case studies shall be discussed in this chapter. The availability of a large number of specific resources can be overwhelming, especially for experimental researchers who try to explore gene, protein, and pathway data for the first time. This chapter finally aims to highlight the
complexity in the systems biology data and to provide an overview of the data integration and mining
approaches in the context of systems biology using a specific example for the Cell Cycle database and
the Cell Cycle models simulation.

INTRODUCTION
In the context of the application of biomedical science to systems biology, the availability of many
different databases and data resources, in which a huge amount of heterogeneous data is continuously accumulating, has become a crucial point in the last few years. In the field of the medical sciences, and more
particularly in the systems biology context, it is largely recognized that successful data integration has
become essential in order to improve the possibility to better explore the knowledge space in many different biological studies. Experimental researchers and computer scientists can discover through data
integration new and interesting relationships that enable them to make better and faster decisions for
example about disease targets and drug molecules. Moreover, the collection of related information has
been shown to be an essential component in biomedical and systems biology research, particularly in
the genomics, proteomics and pathways information area.
The necessity for data integration is widely recognized in the bioinformatics and systems biology community since bioinformatics data are currently spread across the internet and throughout organizations
in a wide variety of formats. Moreover the achievement of interesting results in most bioinformatics
and systems biology-related activities, from functional characterization of genomic and proteomic data
to the development of mathematical models of biological processes, requires an integrated view of all
relevant data useful to accomplish those tasks. The challenges of data integration may be addressed
using a wide variety of approaches. While each approach has advantages and limits, it can be difficult
to evaluate which approach suits a particular need best without fully understanding the data integration landscape. The data integration methods aim to facilitate detailed and accurate investigation of specific genes, proteins or pathways, since high information content should be useful both for data mining and for mathematical modelling of the biological process of interest. In this chapter, the different data integration approaches and some practical examples of data integration are illustrated in the specific field of the cell cycle process. The importance of the cell cycle in the shift from a healthy to a pathological state under some specific experimental conditions is illustrated in the context of the need to create an integrated system capable of collecting the most important information related to cell cycle genes and proteins, drawn from the analysis of the cell cycle information available in the literature and in the existing pathway databases.
Another important technique used for knowledge discovery is the data mining approach. Data mining has become widely used in the context of biomedical science and systems biology as it makes the prediction of the behaviours and future trends of a biological system possible, allowing knowledge-driven decisions to be taken. In its general definition, data mining can also be considered as
the process of analyzing data from different perspectives and summarizing it into useful information,
which can be used to increase the current knowledge about a specific biological process. Technically,
data mining is the process of finding correlations or patterns among many fields in large relational databases. An example of a data mining application in systems biology, in the context of the mathematical
modelling of a biological process is illustrated in this chapter.
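As a toy illustration of this definition (not the chapter's case study), the sketch below scans a small integrated table for strongly correlated fields with pandas; the gene names and numerical values are invented purely for demonstration.

```python
import pandas as pd

# Integrated table: one row per sample, one column per measured quantity.
# All values here are made up for the sake of the example.
data = pd.DataFrame({
    "CDK2_expression":  [1.0, 1.4, 2.1, 2.9, 3.6],
    "CCNE1_expression": [0.9, 1.5, 2.0, 3.1, 3.4],
    "doubling_time_h":  [30.0, 26.0, 22.0, 18.0, 16.0],
})

# Pairwise Pearson correlations between all fields.
correlations = data.corr()
print(correlations)

# Flag strongly (anti-)correlated field pairs as candidate relationships
# worth following up with mechanistic modelling; self-correlations (exactly 1.0)
# are excluded.
strong = correlations.abs().unstack().sort_values(ascending=False)
print(strong[(strong < 1.0) & (strong > 0.8)])
```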
Moreover, the use of bioinformatics tools, data mining and data integration can help researchers to better study modelling complexity, by screening the potential model components in order to find the emergent properties of a biological system, which is one of the main aims of systems biology studies. Finally, the main advantages of using data mining and data integration approaches in the context of systems biology investigations are presented.

BACKGROUND
During the past years, a very fast increase in the availability of quantitative data related to biological systems and processes has occurred as a consequence of the systematic application of automated high-
throughput molecular biology techniques, which have led to the generation of an immense quantity
of data compared with that produced by the application of earlier technologies. The advances in DNA
sequencing technology make the study of complete genomes easier, and similar progress in the study of transcripts, proteins and metabolites has led to the availability of more complex datasets. In fact, the
need for databases that store molecular biological data, and which allow analysis through computational
software, was apparent long before experimental techniques were as powerful as they are today and data
as complete as they appear nowadays. As a consequence, many collections of biological knowledge
have been developed in order to become fundamental resources that are used every day by researchers
around the world.
The process of building a new database relevant to many fields in the context of biomedicine involves transforming, integrating, and filtering multiple data sources, as well as adding new material
and annotations.
In bioinformatics and in a wider systems biology context, researchers use a very large number of
different databases to retrieve more and more information related to the problem they are investigating. In the past, biologists themselves built databases; at the beginning of the genomic era the amount of data was small and the main problem to solve consisted in making the database entries as readable as possible, so that most database entries were formatted as flat files (Wong, 2002). The immediate consequence of this kind of data integration is a growing number of databases in different formats which do not use standard query software and which are only accessible to bioinformatics experts (Baker and Brass, 1998).
These databases and systems often do not have anything that can be thought of as an explicit database schema, which is a formalized catalogue of all the tables in the database which can be queried, the
attributes of each of these tables, and the meaning of and indices on each of these attributes. Further
compounding the problem is the fact that research biologists demand flexible access and queries in adhoc combinations. Simple retrieval of data is not sufficient for modern bioinformatics. Here the challenge
is to understand how to manipulate the retrieved data derived from various databases and restructure
these in some way in order to make them as useful as possible in the process of investigation on specific
biomedical problems (Wong, 2002).

State of the Art of Data Integration in Systems Biology


The first effort to collect information about proteins under the same roof dates back to 1965 and had evolved by 1984 into the Protein Information Resource (Apweiler, Bairoch, & Wu, 2004). Alongside it, the Protein Data Bank (Berman, Henrick, & Nakamura, 2003), which stores information about protein structures, was founded in 1971, and the EMBL data library (Cochrane, et al., 2006), the first database able to store information about nucleic-acid sequences, was founded in 1981. Nowadays, these databases and their successors have been joined by numerous other resources, which store information on chemical entities, gene expression, molecular interactions and biochemical pathways. It is important to point out that the value of bioinformatics data is completely dependent on the ability to make the correct links from the sequences to the scientific literature and to extract the information that the literature contains, so a good data integration system should be linked to literature information as much as possible.
Many of the main bioinformatics databases are made available thanks to a small number of institutions
devoted to providing services to the scientific community, such as the European Bioinformatics Institute
(EBI) in the UK (Lopez, Duggan, Harte, & Kibria, 2003), and the National Centre for Biotechnology
Information (NCBI) in the USA (Jenuth, 2000). The EBI maintains a large data warehouse that contains
over 100 bioinformatics databases (Zdobnov, Lopez, Apweiler, & Etzold, 2002) using the SRS data
warehousing system (Etzold, Ulyanov, & Argos, 1996). The NCBI Entrez server (Geer & Sayers, 2003)
offers similar functionality. However, there is a difference in the focus of the two systems: the interface
to SRS is powerful but complex, allowing the construction of queries among different databases in a
generic manner. Entrez has a simpler interface, with less support for structured queries, but provides
rapid retrieval of any kind of data which is linked to a given search term. Both warehouses contain not
only gene and protein databases, but also literature databases (for example, PubMed is incorporated
into Entrez), allowing direct interlinks between these resources. The usefulness of a data warehouse
is primarily dependent on the coherence of the data it contains. Data warehousing technologies would
seem to support the automatic generation of a single integrated data resource from independent, but
cross-referenced, databases. However, there are certain limitations to this approach. Accurate cross-references are vital for good results, but the volume of data at the moment exceeds the capacity for
manual supervision of almost all bioinformatics databases (for example in the UniProtKB about 220,000
records, which are 7% of the total, are manually curated). In order to overcome this limitation, automatic
methods for generating cross-references can be used. These make it possible to establish equivalence between the
different entities through the tracking of identifiers (as data are transferred between resources), or by
comparing the properties (such as the sequence) of different entities. However, these methods may not
always produce the correct answers. Different databases may maintain different identifiers, and use
different names, for the same biological object and the same object may be assigned different properties in different resources.
There are different approaches to improving coherence among biological databases; these include the establishment of international collaborations to share data, the use of common controlled vocabularies (such as the Gene Ontology (Gene Ontology Consortium, 2006), a hierarchical controlled vocabulary for describing the function of gene products) by different resources, and the wider development of internal standards for data representation (particularly in new areas of research, such as transcriptomics (Whetzel, et al., 2006) and proteomics (Orchard, et al., 2006), where the data complexity can be higher than that of sequence data). Another approach is the production of well-annotated, non-redundant subsets of the complete data and the development of services based on these. The EBI's Integr8 project (Kersey et al., 2005) provides reference data sets and analysis tools for species with completely sequenced genomes, from which redundancy has been removed.

General Database Architecture


Biological databases can generally be defined as very large collections of biological data organized according to particular schemas. These schemas can have different architectures, and the data collections can be organized and managed through different software. The information stored in each specific resource follows a logical schema of data organization, which makes data extraction through queries feasible. From an architectural point of view, the biological databases available on the internet are usually organized as shown in Figure 1.
Figure 1. General database architecture

This architecture consists of three software levels: (1) the Database Management System (DBMS), the software that manages the data collection and sits between the user and the data themselves; (2) a middle software layer that connects the data management system to the user through the web interface; and (3), at the top of the stratification, the web browser, which allows the database to be queried and displays the query results as web pages. The middle level typically consists of PHP scripts, which send queries to the database and convert the results into web pages (typically in hypertext mark-up language, HTML). This layer is very important because it allows a direct and easy interaction between users and the database content. Thanks to this architecture, interaction between users and the data stored in different biological resources is facilitated.
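As a concrete illustration of this three-level architecture, the following minimal Python sketch plays the role of the middle software layer: an in-memory SQLite database stands in for the DBMS, and the function renders the query result as an HTML page for the browser. The gene table, its columns and the demonstration row are assumptions made only for this example; a production system would more likely use PHP against MySQL, as described above.

import sqlite3

# Level 1: the DBMS, here an in-memory SQLite database standing in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id INTEGER, name TEXT, description TEXT)")
conn.execute("INSERT INTO gene VALUES (1, 'CDC25A', 'cell division cycle 25 homolog A')")

# Level 2: the middle software layer, which queries the DBMS and converts
# the result into a web page (the role played by PHP scripts above).
def gene_page(gene_name):
    rows = conn.execute(
        "SELECT gene_id, description FROM gene WHERE name = ?", (gene_name,)
    ).fetchall()
    items = "".join("<li>%s: %s</li>" % (gid, desc) for gid, desc in rows)
    return "<html><body><h1>%s</h1><ul>%s</ul></body></html>" % (gene_name, items)

# Level 3: the web browser would display the HTML returned here.
print(gene_page("CDC25A"))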
At a deeper level, however, each biological resource stores heterogeneous data, and this can make data retrieval more difficult. In fact, each biological database is focused on a specific problem and usually contains only specific information.
Data integration is the solution to these problems, making available a number of integrated resources that cover different kinds of information from biomedical science and systems biology.

DATA INTEGRATION
Data integration is a central part of systems biology, and the problem can be described on three levels of complexity: the first is the integration of heterogeneous data resources and databases, with the aim of passing data between these databases and querying them for information; the second consists in the identification of correlative associations across different datasets, with the aim of gaining a more comprehensive and coherent view of the same objects in the light of different data sources; the third is the mapping of the information gained on the interactions of these objects onto networks and pathways that may be used as basic models of the underlying cellular systems.
A general integration engine in a bioinformatic context should satisfy the following characteristics, as suggested by Wong (Wong, 2000): first of all, the system should not require a fixed schema, since it should be able to satisfy any query on the basis of the query itself; if a schema had to be defined before a query could be formulated, such queries would be hard to use, because biomedical databases often do not have a public schema available. A specific data model needs to be designed in order to easily access and store the data from the external data sources. The external data sources used in bioinformatics are typically owned by different organizations that keep updating and evolving their databases; it is therefore important for a general data integration solution to be robust when the data sources evolve. The creation of standard metadata in XML format may be very useful for exchanging the variety of data coming from external databases. Although providing XML metadata is the general trend, different biological databases still use different data management systems and provide only web pages in HTML format as their primary mode of access. This makes it very difficult and challenging to keep the information retrieved from the many different databases publicly available on the internet up to date.

Data Integration Techniques



The most commonly used strategies adopted to integrate data are: link integration, view integration,
data warehouses, and ontology-based data integration.
The link integration approach consists of an initial query to a single data source, immediately followed by hypertext links to related information in other data sources. These data sources must cooperate to create linking rules in order to integrate heterogeneous data appropriately. This approach is far from being the most successful method of data integration, because of the difficulty of linking heterogeneous data from different sources, but it was a considerable first attempt to arrange different kinds of biological information. The main problem lies in the different data model used by each resource and in the different data management systems chosen: sometimes the data model changes without notice, or differences in the query systems mean that linked queries are not possible. Moreover, the approach is extremely susceptible to naming conflicts and ambiguities, it must deal with the update schedule of each linked resource, and the responsibility for integration and interpretation is left to the individual researcher.
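A minimal Python sketch of the idea follows: a record retrieved from one source carries cross-reference accessions, and integration amounts to generating hyperlinks into the other resources. The URL templates and the example accessions below are illustrative only and should be checked against each resource before any real use.

# Link integration: map the cross-references found in one record onto
# hyperlinks into other resources. Templates and accessions are examples.
LINK_TEMPLATES = {
    "uniprot": "http://www.uniprot.org/uniprot/%s",
    "entrez_gene": "http://www.ncbi.nlm.nih.gov/gene/%s",
    "pdb": "http://www.rcsb.org/structure/%s",
}

def build_links(record):
    """record: dict mapping resource name -> accession found in the entry."""
    return {db: LINK_TEMPLATES[db] % acc
            for db, acc in record.items() if db in LINK_TEMPLATES}

# Example record with hypothetical cross-references:
print(build_links({"uniprot": "P24385", "entrez_gene": "595"}))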
An evolution of link integration is the Sequence Retrieval System (SRS) (Zdobnov, Lopez, Apweiler, & Etzold, 2002), the most widely used database query and navigation system in the life science community. SRS is a keyword indexing and search system that recognizes the existence of structured fields in source databases and allows maintainers to explicitly relate a field in one database to a differently named field in another resource. In order to add a new data source to SRS, the data source is generally required to be available as a flat file, and a description of the schema or structure of the data source must be available. Access to SRS is through keywords and constraints expressed in a specific SRS query language: SRS is essentially an information retrieval system that returns the records matching the specified keywords and constraints. These records can contain embedded links that a user can follow individually to obtain more detailed information. There is not much help with result organization and post-processing, but a browser-based interface for formulating SRS queries and viewing results is available, through which users can independently access multiple data sources.
SRS also has some limitations: first of all, it is basically a retrieval system that simply returns entries as a plain aggregation, in the sense that it is impossible to perform further transformations on the retrieved results. In addition, SRS is mainly based on flat files and is poorly integrated with more dynamic analysis tools.
The view integration approach is the second independent way to solve a data integration problem. In this approach the information always remains stored in the source databases, and an external environment is built around them in order to create a single large system in which the external sources are linked.
The system determines which individual resources should solve each part of a global query, so that local queries can be sent to the appropriate resources able to answer them. The results of each local query are then retrieved and merged to create a single result for the initial query.
This integrative approach is useful because it is not strictly tied to the data models of the external resources and relies only on the structure of the query itself. The component that dispatches the local queries to the appropriate resources should shield the integration from changes in the external sources and should always be able to retrieve the results.
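To make the idea concrete, the following Python sketch shows a tiny mediator that dispatches parts of a global query to the sources able to answer them and merges the partial results. The source names, fields and returned values are entirely hypothetical mock-ups, not real database interfaces.

# View integration: the information stays in the sources; the mediator splits
# a global query, sends local queries to the appropriate sources and merges
# the answers. The two "sources" below are mock functions.
SOURCES = {
    "gene_db":    lambda gene: {"gene": gene, "chromosome": "3"},
    "protein_db": lambda gene: {"gene": gene, "uniprot_id": "P_EXAMPLE"},
}

def global_query(gene, wanted=("chromosome", "uniprot_id")):
    merged = {"gene": gene}
    for local_query in SOURCES.values():
        partial = local_query(gene)          # local query to one source
        merged.update({k: v for k, v in partial.items() if k in wanted})
    return merged

print(global_query("CDC25A"))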
The most complete example of cross-database linking is the development of the cross-database query languages Kleisli and K2 at the University of Pennsylvania around 1990 (Davidson et al., 2001). This system is based on a language processor that analyses a query, given in one of these languages, to discover which databases need to be accessed to satisfy the request, and generates a series of sub-queries. The engine then manages each sub-query and tries to extract the information from the relevant databases through dedicated drivers, for example the GenBank driver, which can query the NCBI Entrez web interface. After the drivers fetch the data, the Kleisli/K2 query processor transforms and integrates them, and returns the data to the user. Kleisli does not require data schemas to be available, since it has both a nested relational data model and a data exchange format into which external databases and software systems can easily translate. It protects existing queries, via a type inference mechanism, from certain kinds of structural changes in the external data sources. Kleisli is also able to store, update, and manage complex nested data, and it has a good query optimizer.
Although these characteristics make the approach seem very useful, the system failed to spread through the scientific community because many researchers were unsatisfied with its performance: since processing a query is limited by the slowest data source, Kleisli and K2 rarely achieve the performance associated with direct access to the source databases. The reasons why these languages were not adopted by the academic bioinformatics community are more complex, but might include the difficulty of writing and maintaining the component database drivers.
The data warehouse approach consists in the development of a new resource in which heterogeneous data coming from different sources are stored.
A data warehouse is a database constructed to support efficient querying of the data it contains (in contrast to normalized databases designed to support data integrity, which are widely used to maintain primary resources). Many data warehouses used in bioinformatics provide generic query interfaces (for example, computer languages and graphical user interfaces) applicable to all the data they contain, thus enabling the addition of new data without the need for interface redesign. A data warehouse may be built from several different resources, but to allow the construction of queries that filter and extract information derived from these resources, the data must be fitted into a single model that takes into account the relationships between the different sources, which can be done by exploiting the cross-references that many of these sources contain.
It is a useful approach that brings all the data under one roof into a single database, through the development of a unified data model that can accommodate all the information stored in the various source databases and of programs that take data from the source databases, transform them to match the unified data model and load them into the warehouse.
In practice, the first step in the development of a data warehouse is the definition of a unified data model that can store all the information originally contained in the many different source databases. The next step is the development of a set of programs able to extract the data from the source databases, transform them to match the unified data model and then load them into the warehouse.
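A minimal sketch of this extract-transform-load cycle is given below, with mock source records, hypothetical field names and an in-memory SQLite database standing in for the warehouse; real warehouses of course involve far richer models and dedicated loaders.

import sqlite3

# Extract: records exported from two (mock) sources with different field names.
SOURCE_A = [{"symbol": "CDC28", "organism": "Saccharomyces cerevisiae"}]
SOURCE_B = [{"gene_name": "CDK1", "species": "Homo sapiens"}]

# Transform: map each record onto the unified data model (name, organism).
def transform(record):
    return (record.get("symbol") or record.get("gene_name"),
            record.get("organism") or record.get("species"))

# Load: insert the transformed records into the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE gene (name TEXT, organism TEXT)")
for source in (SOURCE_A, SOURCE_B):
    warehouse.executemany("INSERT INTO gene VALUES (?, ?)",
                          [transform(r) for r in source])
print(warehouse.execute("SELECT * FROM gene").fetchall())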
The warehouse can then be used as a new resource for answering any of the questions that the source databases can handle, as well as those that require integrated knowledge that the individual sources do not have. The development of a data warehouse, however, is not as trivial as it might seem. The first problem to face is the constant updating of the warehouse content: new information is continuously added to the source databases, which means that the new data must be inserted into the warehouse in an appropriate way, otherwise the warehouse goes out of date. Moreover, the source data models are constantly being developed: new data types are added, fields and nomenclature change, and the relationships among data types change. This constant flux means that dump, transform and load software written for one version of a database will not necessarily work with a later version, and modifications are constantly needed.
The advantages of the data warehouse approach are nevertheless clear: high efficiency in retrieving the information relevant to a specific query, more information available in a single resource, immediate access to different kinds of information through a single query, better information accuracy and better control over the information sources.
An example of an attempt at the warehouse approach was the Integrated Genome Database (IGD) project (Ritter, Kocab, Senger, Wolf, & Suhai, 1994), which was developed with the aim of combining human sequencing data with the multiple genetic and physical maps. IGD integrated more than a dozen source databases, including GenBank, the Genome Database (GDB) and the databases of many human genetic-mapping projects. The integrated database was distributed to users with a graphical interface. The IGD project survived for almost a year before collapsing. The main reason for its collapse was the constant flux of the source databases: on average, each of the source databases changed its data model twice a year, so the IGD data import system broke down every two weeks and the dumping and transformation programs had to be rewritten. A more recent warehouse project, underway at the University of Pennsylvania, makes use of a generalized model for biological data called the Genomics Unified Schema (GUS) (Bahl, et al., 2002). The immediate goals of this project are more modest than those of IGD, because the aim is to support several targeted and more restricted research projects rather than to become a general public resource.
Ontology-based data integration addresses the problem of making uniform data that share some common semantics but originate from unrelated sources. Heterogeneity is a concept that must be taken into account when working on data integration, and it can be classified into four categories: structural heterogeneity, involving different data models; syntactical heterogeneity, involving different languages and data representations; systemic heterogeneity, involving different hardware and operating systems; and semantic heterogeneity, involving different concepts and their interpretations.
Several methods have been created to address the problem of dealing with different concepts and interpretations. In general, the approaches can be divided into two branches: approaches using ontologies and approaches that do not use ontologies, for example those based on meta-data (Busse, Kutsche, Leser, & Weber, 1999; Nam & Wang, 2002).
The term ontology was proposed by Gruber (Gruber, 1993) as an explicit specification of a conceptualization. A conceptualization, in this definition, refers to an abstract model of how people commonly think about a real thing in the world, and explicit specification means that the concepts and relationships of the abstract model receive explicit names and definitions.
An ontology names and describes the domain-specific entities by using predicates that represent relationships between these entities. It provides a vocabulary with which to represent and communicate domain knowledge, along with a set of relationships among the vocabulary's terms at a conceptual level. Therefore, because of its potential to describe the semantics of information sources and to solve heterogeneity problems, an ontology can be used for data integration tasks.
The ontology-based data integration approach involves the organization of the full set of bioinformatic resources by developing a resourceome (Cannata, Merelli, & Altman, 2005). Each field of interest would be organized using ontologies, in a machine-understandable way. A distributed development approach would be required, so that groups with focused expertise can classify the resources in their area while providing the metadata that would allow easier access to useful existing resources.
Scientists have often attempted to organize collections of bioinformatics resources. Among the most popular are Pedro's List, a list of computer tools for molecular biologists, and the ExPASy Life Sciences Directory, formerly known as Amos's WWW links page. The Bioinformatics Links Directory (http://www.bioinformatics.ubc.ca/resources/links_directory/) today contains more than 700 curated links to bioinformatics resources, organized into eleven main categories, including all the databases and Web servers listed yearly in the dedicated Nucleic Acids Research special issues (Fox et al., 2005). Moreover, the National Center for Biotechnology Information has tried to make access to its suite of tools transparent, with much success. However, the lack of a useful index remains a crucial problem that should be solved as soon as possible.
The attempt to solve such problems using ontologies is an important step towards data integration that is as generalized as possible, across many fields.
First, an overall ontology with the high-level concepts (algorithms, databases, organizations, papers,
people, etc.) must be created, with a set of standard attributes and a standard set of relations between
these concepts (e.g., people publish papers, papers describe algorithms or databases, organizations house
people, etc.). The initial ontology should be compact and built for distributed collaborative extension.
Second, a mechanism for people to extend this ontology with sub-concepts in order to describe their
own resources should be designed. The precise location of a tool within a taxonomy is not critical, for
example the author will place it somewhere based on the location of similar/competing resources or
based on a best-informed guess. Others may create links to the resource from other appropriate locations in the taxonomy in order to ensure that competing interpretations of the appropriate conceptual
location for the resource are accommodated. Thirdly, the formats for the ontologies and the resource
descriptions should be published so enterprising software engineers can create interfaces for surfing,
searching, and viewing the resources. The resulting distributed system of resource descriptions would
be extensible, robust, and useful to the entire biomedical research community.
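As a purely illustrative sketch of what such a compact initial ontology could look like in machine-readable form, the following Python fragment encodes a few high-level concepts, their sub-concept relations and some typed relations, and answers a simple "is-a" query. The concept and relation names are examples only, not a proposed standard.

# High-level concepts and their parent concepts (None = root), plus typed
# relations between concepts; all names are examples.
CONCEPTS = {
    "Resource": None,
    "Database": "Resource",
    "Algorithm": "Resource",
    "Paper": None,
    "Person": None,
    "Organization": None,
}
RELATIONS = [
    ("Person", "publishes", "Paper"),
    ("Paper", "describes", "Algorithm"),
    ("Paper", "describes", "Database"),
    ("Organization", "houses", "Person"),
]

def is_a(concept, ancestor):
    # Walk the sub-concept chain to decide whether concept is a kind of ancestor.
    while concept is not None:
        if concept == ancestor:
            return True
        concept = CONCEPTS.get(concept)
    return False

print(is_a("Database", "Resource"))   # True: a database is a resource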
The ontologies can support resource collections starting from the idea that each individual who
has created or who is maintaining a resource uses a standard ontology to describe the basic features of
that particular resource using the semantic Web, and these are automatically included in a distributed
index of resources. Thus, the index is created by querying the semantic net for descriptions of all available tools, which can then be registered and updated on a regular basis. The development of a browser
for this index could be the final step for the indexing of bioinformatics resources. Adoption of agent
technology may be helpful in overcoming the inherent complexity of this challenge (Berners-Lee et al,
2001). Starting from this point, an interesting example of a resourceome has recently been developed using agent-based technology (Bartocci et al., 2007). This resourceome is built along two orthogonal ontology layers: the resource ontology layer exploits the semantic relationships among the considered resources, while the domain ontology layer provides an organic overview of a scientific domain and allows resources to be mapped to domain concepts. The ontologies are kept alive by agents, which take into account other fundamental issues such as the availability and quality of the resources. Users can more easily identify the proper resources for their needs. At the same time, all the grid services emerging from the lower layer are given precise semantics, framed in the context of the domain ontology. In this context the resourceome concept is very useful in allowing in silico and in vivo scientists to navigate intuitively without getting lost in the ocean of resources.

Data Mining of Systems Biology Relevant Information for Modelling


Mining bioinformatics data is an emerging area at the boundary between systems biology and bioinformatics. Data mining has been recognized as one of the most important information technologies for automating the analysis and interpretation of the data produced by biomedical studies. In systems biology, data mining and data integration are equally important for understanding the main features of a biological system.
Data mining mainly refers to two activities: the extraction of specific information related to some initial data of interest, and the analysis of large-scale data sets in order to develop general hypotheses.
If the initial data is a gene or protein name or ID, a specific record containing much of what is already known about that entity can be identified simply by querying resources such as UniProt (for protein-based queries), NCBI (for information in a genomic context) or SRS (for complex filtering over many resources).
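As an example of this first kind of mining, the short Python sketch below queries the NCBI Entrez programming utilities (E-utilities) for Entrez Gene identifiers matching a gene symbol. The query field tags and the JSON layout follow the public esearch interface as we understand it, so treat them as assumptions and check the current service documentation (real use should also respect NCBI usage limits).

import json
import urllib.parse
import urllib.request

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def entrez_gene_ids(symbol, organism="Homo sapiens"):
    # Build an Entrez Gene query restricted to a gene symbol and an organism.
    term = "%s[sym] AND %s[orgn]" % (symbol, organism)
    params = urllib.parse.urlencode({"db": "gene", "term": term, "retmode": "json"})
    with urllib.request.urlopen("%s?%s" % (EUTILS_ESEARCH, params)) as response:
        result = json.load(response)
    # The identifiers can then be used to fetch full records (e.g. via efetch).
    return result["esearchresult"]["idlist"]

print(entrez_gene_ids("CDC25A"))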
However, if the data consist of a nucleotide or protein sequence, the sequence must be compared against known sequences to identify similar or identical molecules that have already been annotated.
Common algorithms for this purpose include BLAST (McGinnis & Madden, 2004) and FASTA (Pearson, 1994), which are available through the websites and web services of the EBI, NCBI and other bioinformatics service providers. High sequence similarity may be a good signal of functional equivalence, but aspects of a protein's function may also be suggested by the presence of particular domains, even if the protein's overall architecture is not known or the complete function of closely matching sequences is not known. Several methods exist for domain identification in sequences, mostly based on hidden Markov models (Eddy, 2004). InterPro (Mulder et al., 2005) is a curated, integrative resource that combines methods for domain identification from 15 different member databases, in which redundant methods are merged and common annotation is attached.
Another method of data mining is literature mining, the extraction of information from the literature about a specific biological topic.
In Swiss-Prot, Entrez Gene and other well-curated databases, direct links exist from individual records to relevant publications. However, many papers relevant to a gene or protein are not likely to have been directly curated. Further literature can be found by directly searching literature databases with the name associated with the protein, or with terms associated with the biological
concept. Abstracts in MEDLINE are labelled with terms from a controlled vocabulary (medical subject headings, MeSH; Nelson, et al., 2004), and searching with relevant MeSH terms can be used to identify
papers of interest. Unfortunately, most bioinformatics database records are not directly annotated with
MeSH terms, while they are often annotated with terms from the GO vocabulary. GO is more tightly
focused than MeSH, but MeSH is split into 16 principal subsections and covers geography and sociology
in addition to biology, whereas GO only covers three specific aspects of biology. A resource to translate
GO terms automatically into their equivalent MeSH terms is still in development.
Data mining over large data sets can be quite different from extracting data about an individual protein, or a concept from the literature.
The analyses performed are generally similar to those in the single-sequence case, but the development of automated procedures is usually essential because the data volume is large. Many public bioinformatics servers limit the amount of data that a user can request in a single query, but in such cases the underlying software is often available for local installation. Another solution for the data mining of large datasets is given by the Web Service standards proposed by the World Wide Web Consortium. For scientists who have to analyse large quantities of data but who lack programming expertise, a workflow management tool may provide a solution: a tool like Taverna (Oinn, T. et al., 2004) is designed explicitly for use with bioinformatics data, providing a graphical user interface for assembling multiple services, potentially running at diverse locations, into a single data-processing pipeline.
Large data sets enable knowledge discovery through the identification of patterns within the data.
Another example of large-scale data mining can be found in UniProt, where statistical patterns in
curated data sets have been used to apply annotation to non-curated sequences. Statistical predictions
can also be directly tested against the actual annotation in curated records, allowing each type of data
to be validated against the other.

Data Integration Solution for Cell Cycle Data: The Cell Cycle Database
We now address the problem of data integration by considering an important biological process in the context of systems biology and, more generally, of biomedical science: the cell cycle. Recently we developed a data integration system that collects the main information about the genes and proteins involved in the yeast and mammalian cell cycle circuitry. The system allows several classes of information to be retrieved regarding the gene and protein interaction network and the existing mathematical models developed for this biological process. This integrative system is available to support systems biology research on the cell cycle, and it aims to become a useful resource collecting all the information related to current and future models of this process.
We considered the different possible approaches to data integration and decided to implement a data warehouse, the Cell Cycle Database (Alfieri et al., 2007), which collects the most useful and relevant information about the genes and proteins involved in the cell cycle. We started by integrating information from two eukaryotes, the budding yeast Saccharomyces cerevisiae and Homo sapiens. These two organisms were chosen because of the evolutionary conservation of their regulatory mechanisms, the deep knowledge of their cell cycle provided by a large body of experimental data, and the importance of the cell cycle in the context of cancer research in humans.
The relational database is managed by a MySQL server. The Cell Cycle Database is structured in a snowflake schema (Levene & Loizou, 2003): the main data about yeast and human genes and proteins are stored in the core tables, while the auxiliary data about genes, proteins and models are stored in external tables. These tables are all linked to the main table by a one-to-one or one-to-n relationship through the specific identification number (ID) of the genes and proteins. The snowflake schema was chosen in order to facilitate automatic data insertion and automatic updating of the database content. The automatic updating system performs the queries to the public databases and imports the new data into the database; the database administrator can trigger the update simply by entering a gene name in a specific web page, which starts a cascade update of all the tables.
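The following Python/SQL sketch illustrates the snowflake idea in miniature: one core gene table plus one auxiliary table linked back to it through the gene identifier. The real Cell Cycle Database runs on MySQL with its own, much richer schema, so the table and column names here are purely hypothetical.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Core table: one row per gene.
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,
    name      TEXT,
    organism  TEXT
);
-- Auxiliary (external) table linked back to the core table: one-to-n.
CREATE TABLE transcript (
    transcript_id  INTEGER PRIMARY KEY,
    gene_id        INTEGER REFERENCES gene(gene_id),
    sequence       TEXT
);
""")
db.execute("INSERT INTO gene VALUES (1, 'CDC28', 'Saccharomyces cerevisiae')")
db.execute("INSERT INTO transcript VALUES (1, 1, 'ATG...')")
# A query joins the auxiliary data back onto the core table through the gene ID.
print(db.execute("""SELECT g.name, t.transcript_id
                    FROM gene g JOIN transcript t ON g.gene_id = t.gene_id""").fetchall())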
We collect gene information based on the KEGG Pathway Database (Ogata et al., 1999) and Reactome (Vastrik et al., 2007). The database contains the human and yeast genes involved in the complete cell cycle pathway and in the MAP kinase signalling pathway, the human genes involved in the apoptosis pathway from KEGG, and it also integrates more specific information related to mitotic and checkpoint pathways from Reactome. Starting from these data, the database system automatically retrieves the information related to each gene and protein by querying several freely available external biological resources. The information retrieval is carried out by a set of programs that import specific information about genes and proteins into the database.
The data sources selected for the yeast and human gene information are Entrez Gene for the general
information about genes (Maglott et al., 2005), GenBank for the DNA sequences (Benson et al., 2005),
Ensembl Genome Browser for transcript information related to each gene (Hubbard et al., 2005) and Gene Expression Omnibus (GEO) for microarray expression data (Barrett et al., 2005). The data sources specific to the yeast genome are the Saccharomyces Genome Database (SGD) (Cherry et al., 1998), the Comprehensive Yeast Genome Database (CYGD) (Guldener, 2005), the Promoter Database of Saccharomyces cerevisiae (SCPD) (Zhu & Zhang, 1999) and YEASTRACT (Yeast Search for Transcriptional
Regulators And Consensus Tracking) which provides the transcription factors specific for the yeast
genes based on literature references (Teixeira et al, 2006). For human genes there are other specific
data sources, such as dbSNP for the list of Single Nucleotide Polymorphisms (Sherry et al., 2001), the Mammalian Gene Collection (MGC) for cDNA clones associated with each gene (MGC Project Team, 2004), the Database of Transcriptional Start Sites (DBTSS) for information related to the promoter region of the human genes, the promoter sequence and the transcriptional start site position (Yamashita et al., 2006). Moreover, we consider the database TRANSFAC for transcription factors associated with each gene (Matys et al., 2006), UniGene for expression data from EST counts (Miller et al., 1997), QPPD for
PCR primers specific for human genes (Quantitative PCR Primer Database) and Online Mendelian
Inheritance in Man (OMIM) for human genetic disorders (McKusick, 1998).
We also considered different data sources for yeast and human protein information, such as UniProt for the general information about proteins (Bairoch et al., 2005), PDB for protein structures (Kouranov et al., 2006), TRANSPATH for protein complexes (Krull et al., 2006) and InterPro for protein domains
(Mulder et al., 2002). Particular attention has been given to protein-protein interactions: we chose several interaction data sources for yeast and human proteins, such as MINT (Zanzoni et al., 2002), IntAct (Hermjakob et al., 2004), BIND (Bader et al., 2003) and BioGRID (Stark, 2006), in order to better understand the cell cycle interaction network. The integration system has been developed through a data warehousing approach (Stein, 2003), which allows the integration of information stored in different biological databases, and an automatic data retrieval system keeps the database constantly up to date (Davidson et al., 1995). The database integration system (Figure 2) consists of a series of programs used to retrieve the data from several different external databases and to transform and load them into the warehouse data model. In this way all the stored data have the same format, which facilitates database-specific queries.

Figure 2. The data warehouse integrative system

Data Integration and Data Mining for Model Information Retrieval


The main task of modelling in systems biology is to provide a framework for predicting the behaviour of a biological system, based on in silico simulation of human disease biology across the multiple length and time scales of an organism.
However, a full understanding of the responses of a biological system requires an essential step: knowledge of all of its component parts. This is the main reason why the integration of genomic, proteomic and metabolite measurements, in the context of controlled genetic or other external perturbations of complex cell and animal models, is the basis of systems biology efforts at many different research levels (from basic research to pharmaceutical company research).
Considering that mathematical models are often the key elements of systems biology studies, we now address the problem of data integration and data mining by presenting the main model repositories.
Modelling efforts useful for drug discovery and, more generally, for the analysis of the biological properties of such processes must simulate responses at the scale of cell, tissue and organ complexity. At the same time, a sufficient level of detail must be included, in the sense that intervention points accessible to model analysis are available and can be modulated in silico in order to predict a high-level read-out. A deeper level of detail can be reached through the integration of as much information as possible related to all the components of each network.
Models in general can be defined as abstract representations of biological components and processes that mathematically describe their structural and dynamical properties. Biological processes can be represented as networks of reactions, which can be described in deterministic terms by systems of ordinary differential equations (ODEs) in order to simulate their dynamics mathematically. Model simulations can be useful for identifying the emergent properties of the system and for analysing particular features of a biological network. Other methods for the mathematical modelling of a biological system, such as stochastic and discrete approaches (e.g. Boolean networks, Petri nets), are available and suited to different simulation aims, even if ODE-based models are more suitable for temporal simulation purposes.
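As a toy illustration of the ODE formalism (and not one of the published cell cycle models stored in the repositories discussed below), the following sketch integrates a simple two-variable activator-inhibitor system with SciPy; the equations and rate constants are arbitrary and chosen only for the example.

import numpy as np
from scipy.integrate import odeint

# dX/dt = k_syn*S - k_deg*X - k_ind*X*Y   (activator X, driven by a signal S)
# dY/dt = k_ind*X - k_rem*Y               (inhibitor Y, induced by X)
k_syn, k_deg, k_ind, k_rem = 1.0, 0.5, 0.3, 0.2

def dydt(state, t, S):
    X, Y = state
    return [k_syn * S - k_deg * X - k_ind * X * Y,
            k_ind * X - k_rem * Y]

t = np.linspace(0.0, 50.0, 500)
trajectory = odeint(dydt, [0.0, 0.0], t, args=(1.0,))
print(trajectory[-1])   # concentrations of X and Y at the end of the simulation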
These models benefit from the large amount of data in the literature: in order to annotate the different model components, systems biology studies have to tackle the problem of finding information related to all the elements involved. In this scenario, the need to collect information about genes and proteins in a single resource becomes crucial, even though several resources on biological pathways, such as the KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathway Database (Ogata et al., 1999) and Reactome (Vastrik et al., 2007), are already available for different organisms.
The KEGG Pathway Database covers a larger field, being a wide collection of pathway maps for metabolic processes, genetic and environmental information processing, signal transduction pathways and human diseases, for different organisms. For each component of a KEGG pathway a short report is given: the report contains the essential information for both genes and proteins, and basic links to some genomic and proteomic databases are provided.
Reactome is a curated resource of pathway data for different organisms, which relies on information about single reactions grouped into pathways. The Reactome data enlarge the concept of a biochemical reaction to include, for example, the association of two proteins to form a complex, or the transport of a ubiquitinated protein into the proteasome. Since the resource is principally based on the assembly of single reactions, it lacks the complexity of entire pathways, especially in the case of complex pathways (e.g. the cell cycle).
In the large field of databases on the web many repositories of models are available. The BioModels
Database (Le Novere et al., 2006) is the reference database that contains peer-reviewed models in Systems Biology Markup Language (SBML) format (Hucka et al, 2003), an XML-based language for the
storage and exchange of biological models.
JWS Online, a systems biology tool for the simulation of kinetic models from a curated SBML
model database (Olivier & Snoep, 2004), is another interesting model repository. JWS Online allows
the viewing of the kinetic laws of reactions, but it lacks the representation of some important mathematical
structures of the considered model, such as algebraic equations, delay equations and events. There is
also another model database, the CellML repository (Cuellar AA et al., 2003), which stores models in
the CellML format, an alternative XML-based format for the representation of biological models. The
CellML repository contains models that conform to the CellML specification. These models represent
several types of cellular processes, including models of electrophysiology, metabolism, signal transduction
and mechanics. Both JWS Online and BioModels allow model simulation powered by the software Mathematica (web version 2.0), and a static visualization of the simulation results is possible; however, neither gives users the possibility of simulating the ODE system directly. BioModels, CellML and JWS Online contain a considerable number of models from different organisms.
The Cell Cycle Database also has a specific section where cell cycle models are stored: in particular, yeast and mammalian cell cycle models published in the recent literature and based on linear and non-linear differential equation systems are stored in this repository.
The model list in the Cell Cycle Database has been assembled by searching the literature and browsing many specific on-line resources. All the models relevant to cell cycle studies have been collected in the database as XML files encoded in the Systems Biology Markup Language (SBML). In particular, a number of models for which the SBML file is available in BioModels or from the authors' web sites have been directly integrated into the Cell Cycle Database. Published models not yet implemented in SBML have been manually encoded in SBML using the JigCell Model Builder (Vass et al., 2004).
Each model is presented in a report structured in three sections: the publication data, the SBML data
structure, and the numerical simulation part. The simulation section allows users to simulate a model
using the software XPPAUT (Ermentrout, 2002) and to plot results on the fly in order to capture the
dynamical properties of the biological process.

An Example of Data Mining in a Systems Biology Analysis


One of the main topics in systems biology is the mathematical simulation of a biological process, undertaken in order to better understand the behaviour of a living system through mathematical rules and bioinformatics tools. Many bioinformatics tools and databases can help in the data mining workflow of model definition. For instance, databases specifically focused on protein-protein interactions and pathway resources have a crucial role in the definition of the model, and other tools, such as CellDesigner, COPASI or JigCell, allow the model wiring diagram to be implemented. A data mining-oriented workflow for a systems biology study is an innovative approach to the mathematical modelling of a biological process. The use of bioinformatic tools, such as data integration methods and data mining techniques, can help investigators reduce the complexity of the modelling, from the screening of the model components to the definition of the emergent properties of a biological system, which is the final aim of systems biology studies.
Data mining has been recognized as one of the most important information technologies for automating the analysis and interpretation of data, especially in a field as complex as systems biology.
An application of data mining in a specific systems biology study is the definition of a putative network describing the G1 to S transition of the mammalian cell cycle, taking into account the characteristic cellular localization of the key players of this process.
The characterization of the network consists in the detection of the main components participating in the biological process under investigation that have a crucial role in nucleo/cytoplasmic translocation. This characterization is performed through literature- and web-based data mining in order to detect the main components that participate in the G1 to S transition. An extensive literature search is required, using electronic resources such as PubMed from NCBI and other literature searching tools, in order to identify the most recent articles related to the G1 to S transition in mammalian cells. Furthermore, a wide web-based search is necessary in order to identify the main protein-protein interactions involved in the biological process. This step implies browsing many different bioinformatic databases, such as protein-protein interaction resources (BIND, MINT, IntAct), the cell cycle specific database (Cell Cycle Database) and pathway resources (KEGG Pathway Database, Reactome). Once the model components have been identified, a wiring diagram of the G1 to S transition model in mammalian cells can be drawn using the diagram editor CellDesigner.
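A small sketch of the web-based part of this workflow follows, with interaction lists mocked as tuples of gene symbols (for illustration only; real exports from BIND, MINT or IntAct would first need to be parsed and mapped to a common identifier space before merging).

# Merge interaction lists exported from different resources into a single
# non-redundant, order-independent edge set for the wiring diagram.
mint_like = [("CDK4", "CCND1"), ("CDK2", "CCNE1")]
bind_like = [("CCND1", "CDK4"), ("RB1", "E2F1")]

def normalise(pairs):
    # Sort each pair so that A-B and B-A are counted as the same interaction.
    return {tuple(sorted(pair)) for pair in pairs}

network = normalise(mint_like) | normalise(bind_like)
for a, b in sorted(network):
    print(a, "--", b)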
In this way it is possible to combine data integration and data mining techniques and to link them
in order to accomplish a common task, that is the definition of a partially unknown biological process
and its mathematical simulation.

FUTURE TRENDS
Both data integration and data mining offer many possible future developments, especially in the context of systems biology. The importance of automatic methods for linking biological data has grown as the number of known sequences and pathways has increased and the proportion of manually curated database records has fallen. As new, increasingly large-scale experimental techniques continue to be developed, and the number of databases specifically focused on biological problems and holding data from their application grows in parallel, the ability to perform distributed queries over resources stored in different databases becomes increasingly important. However, it remains difficult to integrate data from different resources. A future trend should therefore be new bioinformatics warehousing systems, designed with the explicit aim of supporting queries executed across resources in different locations and able to ensure the efficient execution of such queries. An alternative to accessing a single data warehouse is to retrieve data from multiple sources and merge them as needed. Some projects are developing software to provide solutions to generic problems in analysing bioinformatics data distributed across many resources.
Moreover, there are common problems in attempting to mine information available on the web,
and there is a common interest in the bioinformatics community in adopting the emerging semantic
web technologies, designed to support distributed computing over a wider range of domains. The key
concept of the semantic web (ontology-based resources) is the publication of self-describing data, that is,
data published together with the metadata that describes it. If such publication conforms to standardised
protocols, and if the descriptions themselves are machine-readable, standardised and shared across
resources, then a programmer should be able to retrieve and integrate distributed data by specifying
a logical request to a semantically aware search engine without needing specific knowledge about the
peculiarities of individual sources. This model is particularly attractive, not only due to the highly
dispersed nature of much bioinformatics data, but also because the descriptive standards necessary to
make it work have already been developed for particular sub-domains, such as microarray and molecular
interaction data.

CONCLUSION
We have presented a generalized framework for data integration and data mining in systems biology. The methodologies we propose here can be applied to data of different types and sizes and to different biological problems. In this chapter we focused on the cell cycle and on the specific methodology that we chose in order to accomplish a data integration task in this restricted field.
The benefits of the integrative approach lie in the possibility of revealing the underlying mechanisms of a specific and widely investigated process such as the cell cycle through targeted information exploration. The database that integrates genes and proteins should be helpful to both experimentalists and modellers during their research activity, as information retrieval can occur in a few steps, that is, by querying a single resource with the possibility of directly retrieving heterogeneous kinds of information.

REFERENCES
Alfieri, R., Merelli, I., Mosca, E., & Milanesi, L. (2007). A data integration approach for cell cycle analysis oriented to model simulation in systems biology. BMC Systems Biology, 1, 35.
Apweiler, R., Bairoch, A., & Wu, C. H. (2004). Protein sequence databases. Curr. Opin. Chem. Biol.
8, 76-80.
Bader, G. D., Betel, D., Hogue, C. W. (2003), BIND: The biomolecular interaction network database.
Nucleic Acids Res., 31, 248-250.
Bahl, A. et al. (2002). PlasmoDB: The plasmodium genome resource. An integrated database that provides tools for accessing, analysing and mapping expression and sequence data (both finished and unfinished). Nucleic Acids Res., 30, 87-90.
Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O'Donovan, C., Redaschi, N., & Yeh, L. S. (2005). The universal protein resource (UniProt). Nucleic Acids Res., 33, D154-159.
Baker, P. G., & Brass, A. (1998). Recent developments in biological sequence databases. Curr. Op. Biotech., 9, 54-58.
Barrett, T., Suzek Tugba, O., Troup, D. B., Wilhite, S. E., Ngau, W. C., Ledoux, P., Rudnev, D., Lash,
A. E., Fujibuchi, W., & Edgar, R. (2005). NCBI GEO: Mining millions of expression profiles-database
and tools. Nucl. Acids Res., 33, 562-566.
Bartocci, E., Corradini, F., Merelli, E., & Scortichini, L. (2007), BioWMS: A Web-based workflow
management system for bioinformatics. BMC Bioinformatics, 8(Suppl 1), S2
Benson, D. A. Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Wheeler, D. L. (2005). GenBank. Nucleic
Acids Res., 33: D34-D38.
Berman, H., Henrick, K., & Nakamura, H. (2003). Announcing the worldwide Protein Data Bank.
Nature Struct. Biol., 10, 980.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Sci Am, 284, 34-43.
Busse, S., Kutsche, R. D., Leser, U., & Weber, H. (1999). Federated information systems: Concepts,
terminology and architectures. Technical Report. Nr. 99-9, TU Berlin.
Cannata, N., Merelli, E., & Altman, R. B. (2005). Time to organize the bioinformatics resourceome.
PLoS Comput Biol., 1(7), e76.
Cherry, J. M. et al. (1998). SGD: Saccharomyces genome database. Nucleic Acids Res., 26(1), 73-79.
Cochrane, G. et al.(2006). EMBL nucleotide sequence database: Developments in 2005. Nucleic Acids
Res., 34, D10-D15.
Cuellar, A. A., Lloyd, C. M., Nielsen, P. F., Bullivant, D. P., Nickerson, D. P., & Hunter, P. J. (2003). An
overview of CellML 1.1, A biological model description language. SIMULATION: Transactions of The
Society for Modeling and Simulation International, 79(12), 740-747.
Davidson, S. B. et al (1995). Challenges in integrating biological data sources. JCB, 2, 557-572.
Davidson, S. B. et al. (2001) K2/Kleisli and GUS: Experiments in integrated access to genomic data
sources. IBM Syst. J., 40.
Eddy, S. R. (2004). What is a hidden Markov model? Nature Biotechnol, 22, 1315-1316.
Ermentrout, B. (2002). Simulating, analyzing, and animating dynamical systems: A guide to XPPAUT
for researchers and students. Philadelphia: SIAM.
Etzold, T., Ulyanov, A., & Argos, P. (1996). SRS: Information retrieval system for molecular biology
data banks. Methods Enzymol, 266, 114-128.
Fox, J. A., Butland, S. L., McMillan, S., Campbell, G., & Ouellette, B. F. (2005). The bioinformatics
links directory: A compilation of molecular biology Web servers. Nucleic Acids Res 33, W3-W24.
Geer, R. C., & Sayers, E. W. (2003). Entrez: Making use of its power. Brief Bioinform, 4, 179-184.
Gene Ontology Consortium (2006). The gene ontology (GO) project in 2006. Nucleic Acids Res., 34,
D322-D326.
Gruber, T. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199-220.
Guldener, U. et al. (2005). CYGD: The comprehensive yeast genome database. Nucleic Acids Res, 33,
364-368.
Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., Vingron,
M., Roechert, B., Roepstorff, P., Valencia, A., Margalit, H., Armstrong, J., Bairoch, A., Cesareni, G.,
Sherman, D., & Apweiler, R. (2004). IntAct - An open source molecular interaction database. Nucl.
Acids. Res., 32, 452-455.
Hubbard, T. et al (2005). Ensembl. Nucleic Acids Res., 33, 447-453.
Hucka, M. et al (2003): The systems biology markup language (SBML): A medium for representation
and exchange of biochemical network models. Bioinformatics, 19(4), 524-531.
Jenuth, J. P. (2000) The NCBI. Publicly available tools and resources on the Web. Methods Mol. Biol.,
132, 301-312.
Kersey, P. et al.(2005). Integr8 and genome reviews: Integrated views of complete genomes and proteomes. Nucleic Acids Res., 33, D297-D302.
Kouranov, A., Xie, L., De la Cruz, J., Chen, L., Westbrook, J., Bourne, P. E., & Berman, H. M. (2006).
The RCSB PDB information portal for structural genomics. Nucl. Acids Res., 34, 302-305.
Krull, M., Pistor, S., Voss, N., Kel, A., Reuter, I., Kronenberg, D., Michael, H., Schwarzer, K., Potapov,
A., Choi, C., Kel-Margoulis, O., & Wingender, E. (2006). TRANSPATH: An information resource
for storing and visualizing signaling pathways and their pathological aberrations. Nucleic Acids Res.,
34, D546-D551.
Le Novere, N., Bornstein, B., Broicher, A., Courtot, M., Donizelli, M., Dharuri, H., Li, L., Sauro, H.,
Schilstra, M., Shapiro, B., Snoep, J. L., & Hucka, M. (2006): BioModels database: A free, centralized
database of curated, published, quantitative kinetic models of biochemical and cellular systems. Nucleic
Acids Res., 34, D689-91.
Levene, M., & Loizou, G. (2003). Why is the snowflake schema a good data warehouse design? Information Systems, 28(3), 225-240.
Lopez, R., Duggan, K., Harte, N., & Kibria (2003). A. Public services from the European bioinformatics
institute. Brief Bioinform, 4, 332-340.
Maglott, D., Ostell, J., Pruitt, K. D., Tatusova, T. (2005). Entrez gene: Gene-centered information at
NCBI. Nucleic Acids Res., 33, D54-D58.
Matys, V., Kel-Margoulis, O. V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev,
D., Krull, M., Hornischer, K., Voss, N., Stegmaier, P., Lewicki-Potapov, B., Saxel, H., Kel, A. E., &
Wingender, E. (2006): TRANSFAC and its module TRANSCompel: Transcriptional gene regulation
in eukaryotes. Nucl. Acids Res., 34, 108-110.
McGinnis, S. & Madden, T. L. (2004) BLAST: At the core of a powerful and diverse set of sequence
analysis tools. Nucleic Acids Res., 32, W20-W25.
McKusick, V.A (1998): Mendelian inheritance in man. A catalog of human genes and genetic disorders.
Baltimore: Johns Hopkins University Press. (12th edition).
MGC Project Team (2004): The status, quality, and expansion of the NIH full-length cDNA project:
The mammalian gene collection (MGC). Genome Res., 14, 2121-2127.
Miller, G., Fuchs, R., & Lai, E. (1997). IMAGE cDNA clones, iniGene clustering, and ACeDB: An
integrated resource for expressed sequence information. Genome Res., 7, 1027-1032.
Mulder, N. J. et al (2002). InterPro: An integrated documentation resource for protein families, domains
and functional sites. Brief Bioinform., 3, 225-235.
Mulder, N. J. et al. (2005). InterPro, progress and status in 2005. Nucleic Acids Res., 33, D201-D205.
Nam, Y. & Wang, A. (2002). Metadata integration assistant generator for heterogeneous distributed
databases. In Proceedings International Conference on Ontologies, Databases, and Applications of
Semantics for Large Scale Information Systems, Irvine CA, (pp. 28-30).
Nelson, S. J., Schopen, M., Savage, A. G., Schulman, J. L. & Arluk, N. (2004) The MeSH translation
maintenance system: structure, interface design, and implementation. Medinfo, 11, 67-69.
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 27, 29-34.
Oinn, T. et al.(2004) Taverna: A tool for the composition and enactment of bioinformatics workflows.
Bioinformatics, 20, 3045-3054.

494

Multi-L evel Data Integration and Data Mining in Systems Biology

Olivier, B. G., & Snoep, J. L. (2004). Web-based kinetic modelling using JWS. Online, Bioinformatics,
20, 2143-2144.
Orchard, S. et al. (2006). Autumn 2005 workshop of the human proteome organisation proteomics
standards initiative (HUPO-PSI) Geneva, September, 46, 2005. Proteomics, 6, 738-741.
Pearson, W. R. (1994). Using the FASTA program to search protein and DNA sequence databases.
Methods Mol. Biol., 24, 307-331.
Quantitative PCR Primer Database. (http://web.ncifcrf.gov/rtp/gel/primerdb/).
Ritter, O., Kocab, P., Senger, M., Wolf, D. & Suhai, S. (1994) Prototype implementation of the integrated
genomic database. Comput. Biomed. Res., 27, 97-115.
Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., & Sirotkin, K. (2001).
dbSNP: The NCBI database of genetic variation. Nucleic Acids Res., 29, 308-311.
Stark, C., Breitkreutz, B. J., Reguly, T., Boucher, L., Breitkreutz, A., & Tyers, M. (2006). BioGRID: A
general repository for interaction datasets. Nucleic Acids Res., 34, D535-539.
Stein, L. D. (2003). Integrating biological databases. Nat Rev Genet, 4, 337-345.
Teixeira, M. C., et al. (2006). The YEASTRACT database: A tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucl. Acids Res., 34, 446-451.
Vass, M., Allen, N., Shaffer, C.A., Ramakrishnan, N., Watson, L. T., & Tyson, J. J. (2004). The JigCell
model builder and run manager. Bioinformatics, 20(18), 3680-3681.
Vastrik, I., DEustachio, P., Schmidt, E., Joshi-Tope, G., Gopinath, G. R., Croft, D., De Bono, B.,
Gillespie, M., Jassal, B., Lewis, S., Matthews, L., Wu, G. R., Birney, E., & Stein, L. (2007). Reactome:
A knowledge base of biologic pathways and processes. Genome Biology.
Whetzel, P. L. et al. (2006). The MGED Ontology: A resource for semantics-based description of microarray experiments. Bioinformatics, 22, 866-873.
Wong, L. (2000). Kleisli, a functional query system. J. Funct. Prog., 10, 19-56.
Wong, L. (2002). Technologies for integrating biological data. Brief Bioinform, 3(4), 389-404.
Yamashita, R., Suzuki, Y., Wakaguri, H., Tsuritani, K., Nakai, K., & Sugano, S. (2006). DBTSS: DataBase of human transcription start sites, progress report 2006. Nucl. Acids Res., 34, 86-89.
Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., Helmer-Citterich, M., & Cesareni, G.
(2002). MINT: A molecular INTeraction database. FEBS Letters, 513, 135-140.
Zdobnov, E. M., Lopez, R., Apweiler, R., & Etzold, T. (2002). The EBI SRS server-new features. Bioinformatics, 18, 1149-1150.
Zdobnov, E. M., Lopez, R., Apweiler, R., & Etzold, T. (2002). The EBI SRS server New features.
Bioinformatics, 18, 1149-1150.
Zhu, J., Zhang, M. Q. (1999). SCPD: A promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics, 15(7-8), 607-11.

495

Multi-L evel Data Integration and Data Mining in Systems Biology

Key Terms
Cell Cycle: The series of events that take place in a eukaryotic cell leading to its replication. These events can be divided into two broad periods: interphase, during which the cell grows, accumulating the nutrients needed for mitosis and duplicating its DNA, and the mitotic or M phase, during which the cell splits itself into two distinct cells, often called daughter cells. The cell cycle is a crucial process for the organism's life: it is the process by which a single-celled fertilized egg develops into a mature organism, as well as the process by which hair, skin, blood cells, and some internal organs are renewed.
Data Integration: The process of combining data existing in different sources and providing the user with a unified view of these data. This process is useful in many situations, in particular in the scientific environment, where the necessity arises to combine research results from different bioinformatics repositories. Data integration becomes increasingly important as the volume of data and the need to share existing data grow.
Data Mining: The process through which large amounts of data are sorted with the aim of extracting relevant information from them. The term is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods, especially in the biological context. It can be defined as the nontrivial extraction of previously unknown and potentially useful information from data and databases.
Data Warehouse: The main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems. The data in the data warehouse are organized so that all data elements relating to the same real-world event or object are linked together.
Mathematical Model: An abstract model that uses mathematical language to describe the behaviour of a system. Mathematical models are used particularly in the natural sciences and engineering disciplines (such as physics, biology, and electrical engineering) but also in the social sciences (such as economics, sociology, and political science). A mathematical model can be defined as the representation of the essential aspects of an existing system (or a system to be constructed) which presents knowledge of that system in usable form.


Chapter XXIX

Methods for Reverse Engineering of Gene Regulatory Networks

Hendrik Hache
Max Planck Institute for Molecular Genetics, Germany

Abstract

In this chapter, different methods and applications for reverse engineering of gene regulatory networks that have been developed in recent years are discussed and compared. Inferring gene networks from different kinds of experimental data is a challenging task that has emerged especially with the development of high-throughput technologies. Various computational methods based on diverse principles have been introduced to identify new regulations among genes. Mathematical aspects of the models are highlighted, and applications for reverse engineering are mentioned.

Introduction
Deciphering the structure of gene regulatory networks by means of computational methods is a challenging task that has emerged during the last decades. Large-scale experiments, not only gene expression measurements from microarrays but also promoter sequence searches for transcription factor binding sites and investigations of protein-DNA interactions, have spawned various computational algorithms to infer the structure of the underlying gene regulatory networks. Identifying gene interactions leads to an understanding of the topology of gene regulatory networks and of the functional role of any gene in a particular pathway. Genes with a strong impact on pathways are of particular interest, since they are putative drug targets. Once a network is obtained, in silico experiments can be performed to test hypotheses and generate predictions on disease states or on the behavior of the system under different
conditions (Wierling et al., 2007).
Quantitative gene expression measurements using microarrays were first performed by Schena et al. (1995) for 45 Arabidopsis thaliana genes and, shortly after, for thousands of genes or even a whole
genome (DeRisi et al., 1996; DeRisi et al., 1997). Since that time various methods for the analysis of such large-scale data have been developed. First of all, clustering algorithms were used to partition genes into subsets of co-regulated genes according to their expression profiles (Eisen et al., 1998). It was found that genes belonging to the same cluster have similar biological functions, but this does not provide information about any directed regulatory interactions among these genes. For this purpose, more sophisticated methods were employed to reverse engineer gene networks and regulatory causality from such data. Reverse engineering therefore constitutes an intermediate step from correlative to causative data analysis.
Gardner and Faith (2005) classified reconstruction algorithms into two general strategies: physical approaches and influence approaches. Algorithms of the first group seek to identify interactions between transcription factors and DNA and reveal protein factors that physically control RNA synthesis. These methods, such as the promoter binding analysis performed by Lee et al. (2002), use the genomic sequence information directly. The second strategy, the influence approach, aims to identify causal relationships between RNA transcripts by examining expression profiles. The regulation of the transcription machinery can take place on multiple levels. For instance, it can occur on the DNA, transcriptional, or translational level. Regulation on the DNA or transcriptional level is mainly due to the binding of transcription factors to specific parts of the DNA or to chemical or structural modifications of the DNA. Regulation on the translational level might be due to microRNAs resulting in
the decay of the respective mRNA target (Ruvkun, 2001). Since the quality of currently measured protein concentrations is not sufficiently high for reconstruction purposes, it is assumed in mathematical models that changes in expression as measured by mRNA concentrations can explain changes of other gene transcripts. However, a recent study by Newman et al. (2006) showed that many changes in the protein levels measured at single-cell resolution are not observable by DNA microarray experiments. In computational models many regulation effects are neglected or included as hidden factors. In recent years, a combination of both approaches has been employed by integrating multiple data sources for the construction of priors on networks or parameters (Imoto et al., 2003; Bernard and Hartemink, 2005; Werhli and Husmeier, 2007).
In this chapter, different reverse engineering algorithms following an influence approach that have been proposed in the last decades are discussed. Moreover, relevant mathematical aspects are briefly described, and applications to reveal gene regulatory networks or parts of them are highlighted. Another crucial point discussed in this chapter is the validation of algorithms. Reverse engineering methods have to cope with noisy, high-dimensional, and incomplete data, but the quality and amount of measurements useful for reverse engineering are increasing.

Computational Models and Methods


The following models can be considered as graphical models that define a mathematical structure $M = (G, F, Q)$ with a graph $G$, a set of functions $F$, and a set of parameters $Q$ of these functions. The graph $G = (X, E)$ consists of nodes $X = \{X_1, \ldots, X_n\}$ with real values $x = \{x_1, \ldots, x_n\}$, which are interconnected via edges $E$, see Figure 1. An edge can be directed or undirected. For a directed graph, the $k_i$ direct regulators of node $X_i$ are represented by $X_{pa[i]} = \{X^1_{pa[i]}, \ldots, X^{k_i}_{pa[i]}\}$; otherwise it represents all connected nodes. The associated values are denoted by $x_{pa[i]} = \{x^1_{pa[i]}, \ldots, x^{k_i}_{pa[i]}\}$. A function $f_i \in F$ with $f_i : x_{pa[i]} \mapsto x_i$ is assigned to each node, representing a regulation or dependence relation between the nodes.


Figure 1. Example of a graph. (A) A graph consists of nodes and edges. The nodes are often associated
with the genes and the edges with regulations. Each node value is the corresponding expression value
of the gene. There can be activation and inhibition regulation, e.g., X1 on X3 and X2 on X3, respectively.
Parents of a node are its regulators, e.g., the parents of node X3 are X1 and X2. (B) The adjacency matrix
is the representation of the graph topology.

Table 1: Properties of models. BoN - Boolean networks, ODE - ordinary differential equations, LN - linear networks, NN - neural networks, AN - associated networks, BN - Bayesian networks, DBN - dynamic Bayesian networks. For each model class the table indicates whether it uses continuous or discrete values, is continuous or discrete in time, is deterministic or stochastic, and is static or dynamic.
Whether it is a regulation or a dependence relation depends on the model. The values of the nodes can be discrete or continuous. Discrete models (e.g., Boolean networks and some kinds of Bayesian networks) are often much easier to handle mathematically, in some cases even in closed form.
A computational model for reconstruction purposes should be an approximation of a real biological system, but simplifications have to be made to reduce the complexity and the number of parameters which have to be fitted to the given data. Nevertheless, the model should reflect features of the real system.
For the reconstruction of regulatory networks one has to consider different model properties. Some models can describe dynamical behavior, others use steady-state or perturbation data as inputs. One has to distinguish between deterministic and stochastic models. Some models can handle continuous values, while for others a discretization has to be performed. Measurements of gene expression are carried out at discrete time points; therefore most models are discrete in time, and only a few keep the kinetics continuous. See Table 1 for an overview.


In the following sections, several model types are described and reverse engineering applications
are mentioned.

Boolean Networks
Boolean network models were first introduced by Kauffman (1969) to study dynamic properties of large, randomly constructed gene networks. Shmulevich et al. (2002) give a good overview of Boolean models of gene regulatory networks and show that complex behavior can be modeled by this approach. A Boolean network is a mathematical model where each node is assumed to take one of two values as its state (active or inactive, on or off, expressed or not expressed). The state of each node is determined by the input node values $x_{pa[i]}$. For an $n$-node network there are $2^n$ possible states of the system; for a deterministic dynamical network this means that after at most $2^n$ time steps a previously visited state is reached again and the trajectory becomes periodic. The set of
functions $F$ are Boolean functions $b_i$ (e.g., AND, OR, or NOT):

$x_i = f_i(x_{pa[i]}) = b_i(x_{pa[i]})$    (1)

or a state transition table which determines the state of each node depending on the states of the parent nodes. For a dynamical system, discrete time steps have to be considered:

$x_i[t + \Delta t] = b_i(x_{pa[i]}[t])$    (2)

For $k$ inputs of a node there are $2^{2^k}$ possible Boolean functions.
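As a minimal illustration of Eqs. (1) and (2), the following Python sketch steps a small, hypothetical three-gene Boolean network through time; the update rules and the choice of three genes are invented for the example and are not taken from any particular study.

import itertools

# Hypothetical Boolean update rules b_i for a three-gene example network:
# gene 0 is activated by gene 2, gene 1 by gene 0, and gene 2 is active
# unless both gene 0 and gene 1 are active (an AND-NOT rule).
def update(state):
    x0, x1, x2 = state
    return (x2, x0, not (x0 and x1))

def trajectory(initial, steps):
    states = [tuple(initial)]
    for _ in range(steps):
        states.append(update(states[-1]))
    return states

if __name__ == "__main__":
    # With n = 3 nodes there are 2**3 = 8 possible states, so a cycle
    # must be entered after at most 8 update steps.
    for start in itertools.product([False, True], repeat=3):
        print(start, "->", trajectory(start, 8))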


A strategy to find the correct parent configuration and the interaction functions, i.e., to learn the transition table from an incomplete data table, can be based on mutual information. This statistical measure is defined by:

$I(X_i; X_j) = H(X_i) + H(X_j) - H(X_i, X_j)$    (3)
$\qquad\qquad\; = H(X_i) - H(X_i \mid X_j)$    (4)
$\qquad\qquad\; = H(X_j) - H(X_j \mid X_i)$    (5)

where $H$ is the Shannon entropy, defined in general as:

$H(X) = -\sum_{z \in Z} P(X = z) \log P(X = z)$    (6)

$Z$ is the set of all possible values of a random variable $X$. Entropy can be considered as a measure of uncertainty. The mutual information of two random variables can be expressed through the entropy of one variable and the conditional entropy of the other. It measures the information that is shared between these two variables; in other words, it is the reduction of uncertainty about one variable gained by measuring the other. For statistically independent variables $X_i$ and $X_j$, $I(X_i; X_j) = 0$, since measuring one variable does not change the knowledge of the other, i.e., $H(X_i) = H(X_i \mid X_j)$.
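The entropy and mutual information of Eqs. (3)-(6) can be estimated from discretized expression vectors as in the following sketch; the toy 0/1 profiles are invented for illustration, and a base-2 logarithm is assumed.

from collections import Counter
from math import log2

def entropy(values):
    # H(X) = -sum_z P(X=z) log2 P(X=z), estimated from relative frequencies
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def mutual_information(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    joint = list(zip(x, y))
    return entropy(x) + entropy(y) - entropy(joint)

if __name__ == "__main__":
    # Toy discretized expression profiles (0 = off, 1 = on) over 8 conditions
    x = [0, 0, 1, 1, 0, 1, 1, 0]
    y = [1, 1, 0, 0, 1, 0, 0, 1]   # perfectly (inversely) determined by x
    z = [0, 1, 0, 1, 1, 0, 1, 0]   # unrelated to x in this sample
    print("I(x;y) =", mutual_information(x, y))  # equals H(x): full dependence
    print("I(x;z) =", mutual_information(x, z))  # zero: no shared information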
An algorithm for reverse engineering with Boolean networks based on an information-theoretic
approach using mutual information was introduced by Liang et al. (1998). It is called REVEAL. At the


beginning REVEAL considers for each gene one input node and calculates the mutual information of the
output node and each possible input (k = 1). If one input node can predict the output states completely,
the input node will be considered as the only parent with the corresponding part of the transition table.
If not, all pairs of nodes (k = 2) will be checked as an input set by computing the mutual information
once again. k will be increased until the condition of perfect reconstruction is satisfied.
Akutsu et al. (1999) introduced a simpler, but more time consuming approach. In this approach the
connectivity k is fixed. The algorithm performs an exhaustive search for k nodes as input and a set of
Boolean functions which are consistent with the given state transition table. Akutsu et al. (2000) extended the model by combining a Boolean and a qualitative network. The latter is defined as a network in which each edge carries the label activation or inhibition. Regulations are represented as qualitative rules.
The authors showed that the inference algorithm is robust against noise on simulated data, but they
pointed out that the proposed method does not give much more information than clustering for a small
number of data sets.
A generalization of Boolean networks is given by logical networks, first described by Thomas (1973). The new feature is that the nodes can have more than two states. Wilczynski and Tiuryn (2006) extended
the model to a stochastic dynamical system in a reverse engineering context.

Differential Equation Networks


Using ordinary differential equations (ODEs), gene regulation can be described in high detail, either at steady state or as a dynamical system. ODEs are widely used for realistic forward modeling of biological systems (see e.g., Heinrich and Schuster, 1996). The rate of change of each node value $x_i$ is given by a function $f_i \in F$ that depends on the values of all input nodes:

$\frac{dx_i(t)}{dt} = f_i(x_{pa[i]}(t), u_i(t), Q)$    (7)

Each $f_i \in F$ can have various mathematical forms with certain parameters and can be composed of different terms with different functional roles, such as regulations (activation or inhibition) or degradation, described for instance by a power-law, sigmoid, or linear function. Additionally, an input function $u_i$ for external signals can be included. Furthermore, time-delay dependencies can be incorporated. The complexity of the model can be increased by considering intermediate products of the gene regulation process, such as proteins or metabolites, but at the expense of more parameters (Chen et al., 1999).
If the system is in steady state, i.e., $dx_i(t)/dt = 0$ for all $i$, Eq. (7) can be written as:

$0 = f_i(\bar{x}_{pa[i]}, Q)$    (8)

where $\bar{x}_{pa[i]}$ are the values of all parent nodes in steady state. Note the time-independence of this equation.
ODEs are usually not considered a graphical model, but since Eq. (7) describes direct interactions among genes, a directed graph can be associated with the system and used for visualization of the topology. Only together with the set of functions and parameters describing the interactions do ODEs become a powerful tool for accurate modeling.
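To make Eq. (7) concrete, the sketch below forward-simulates a hypothetical two-gene system with scipy, using sigmoid-like activation and inhibition terms plus linear degradation; the functional forms and all parameter values are illustrative assumptions, not taken from the chapter or from any cited model.

import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, x):
    # dx_i/dt = production modulated by regulators - degradation (Eq. 7 style)
    x1, x2 = x
    act = lambda s: s**2 / (1.0 + s**2)      # sigmoid-like activation
    inh = lambda s: 1.0 / (1.0 + s**2)       # inhibition
    dx1 = 1.0 * inh(x2) - 0.5 * x1           # gene 1 is repressed by gene 2
    dx2 = 1.5 * act(x1) - 0.3 * x2           # gene 2 is activated by gene 1
    return [dx1, dx2]

if __name__ == "__main__":
    sol = solve_ivp(rhs, t_span=(0.0, 50.0), y0=[0.1, 0.1],
                    t_eval=np.linspace(0.0, 50.0, 11))
    for t, x1, x2 in zip(sol.t, sol.y[0], sol.y[1]):
        print(f"t={t:5.1f}  x1={x1:.3f}  x2={x2:.3f}")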


For reverse engineering, one has to choose the functional form of $f_i$. Furthermore, it is necessary to estimate the parameters with respect to the given data by means of optimization methods, such as genetic algorithms (GA) (Wahde and Hertz, 2000), singular value decomposition (SVD) (Yeung et al., 2002), simulated annealing (Chen et al., 2001), or algebraic approaches (Laubenbacher and Stigler, 2004).
Linear networks and neural networks represent special cases of differential equation networks. The functions $f_i$ are linear functions or of sigmoid type for linear networks or neural networks, respectively. The reverse engineering methods based on these network types apply special learning techniques.

Linear Networks
Very simplified models are linear networks. D'haeseleer et al. (1999) used linear networks to infer interactions from mRNA expression. The authors pointed out that a linear model is only a caricature of the real system, but the hope is that one can still draw some interesting conclusions from it. Especially near steady state, linear models are successfully applicable (Gardner et al., 2003; Bansal et al., 2006). A linear dynamical system can be described by:

$\frac{dx_i(t)}{dt} = f_i(x_{pa[i]}(t), u_i(t), Q)$    (9)

$\qquad\quad\; = \sum_j w_{ij}\, x^j_{pa[i]}(t) + u_i(t) + b_i$    (10)

where the functions $f_i$ are linear in the values $x^j_{pa[i]}(t)$ of all parents. Each value is weighted by $w_{ij}$, and the weighted values are summed up to give a combined effect on node $i$. Hence, a weight $w_{ij} \in W$ unequal to zero represents the strength of control of node $j$ on node $i$. Note that there is also a different convention in the literature (e.g., van Someren et al., 2000), where a value $w_{ij} \neq 0$ means an influence of node $i$ on node $j$. Positive values represent activations and negative values inhibitions. The corresponding graph is directed. The degradation of node $X_i$ is implicitly contained in the weight $w_{ii}$, which also represents the self-regulation effect; therefore, self-regulation cannot be modeled separately from degradation. The function $u_i$ contains all additional inputs which do not depend on the values $x_i$. Due to the linear sum, this linear model is also called an additive model.
With a discretization in time:

$x_i[t + \Delta t] = x_i[t] + \Delta t \, f_i(x_{pa[i]}[t], u_i[t])$    (11)

$\qquad\qquad\;\; = x_i[t] + \Delta t \left( \sum_j w_{ij}\, x^j_{pa[i]}[t] + u_i[t] + b_i \right)$    (12)

For $m$ time points (last time point $T$) a compact matrix notation is possible:

$\dot{X} = \tilde{W} \cdot \tilde{X}$    (13)

with:

$\dot{X} = \begin{pmatrix} \dot{x}_1[0] & \cdots & \dot{x}_1[T - \Delta t] \\ \vdots & & \vdots \\ \dot{x}_n[0] & \cdots & \dot{x}_n[T - \Delta t] \end{pmatrix}$, where $\dot{x}_i[t] = (x_i[t + \Delta t] - x_i[t]) / \Delta t$    (14)

$\tilde{W} = [W \;\; b \;\; u[t]]$    (15)

$\tilde{X} = \begin{pmatrix} x_1[0] & \cdots & x_1[T - \Delta t] \\ \vdots & & \vdots \\ x_n[0] & \cdots & x_n[T - \Delta t] \\ 1 & \cdots & 1 \end{pmatrix}$    (16)

In general, Eq. (13) has no unique solution for the extended weight matrix $\tilde{W}$, since there are many more genes than data points, but one can find a least-squares fit:

$\hat{\tilde{W}} = \dot{X} \tilde{X}^T \left( \tilde{X} \tilde{X}^T \right)^{-1}$    (17)

The inclusion of $l$ given time-course data sets into the matrices:

$\dot{X} = [\dot{X}_1 \cdots \dot{X}_l]$ and $\tilde{X} = [\tilde{X}_1 \cdots \tilde{X}_l]$    (18)

improves the least-squares fit of Eq. (17). The resulting matrix can be discretized to obtain a binary or ternary matrix representing the reconstructed network topology.
The model assumes that every node can have an effect on each other node, but in the calculated weight matrix non-regulations should have small values compared to those of the regulations, such that after discretization they will be set to zero.
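A minimal numpy sketch of the least-squares reconstruction of Eq. (17): finite differences approximate the derivative matrix, a row of ones is appended for the bias terms, and the extended weight matrix is estimated with a pseudo-inverse. The toy data are generated from a random sparse linear system purely for illustration, and the discretization threshold is an arbitrary assumption.

import numpy as np

rng = np.random.default_rng(0)
n, m, dt = 5, 40, 0.1                      # genes, time points, time step

# Toy time-course data from a random sparse linear system (illustration only)
W_true = rng.normal(0, 1, (n, n)) * (rng.random((n, n)) < 0.3)
b_true = rng.normal(0, 0.1, n)
X = np.zeros((n, m))
X[:, 0] = rng.random(n)
for t in range(m - 1):
    X[:, t + 1] = X[:, t] + dt * (W_true @ X[:, t] + b_true)

# Least-squares fit of the extended weight matrix (Eq. 17)
Xdot = (X[:, 1:] - X[:, :-1]) / dt              # finite-difference derivatives
Xtil = np.vstack([X[:, :-1], np.ones(m - 1)])   # append a row of ones for the bias b
W_hat = Xdot @ np.linalg.pinv(Xtil)             # = Xdot Xtil^T (Xtil Xtil^T)^-1

W_est, b_est = W_hat[:, :n], W_hat[:, n]
# Discretize to a ternary matrix: keep only clearly non-zero weights
ternary = np.sign(W_est) * (np.abs(W_est) > 0.1)
print(ternary)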
Besides the determination of a least-squares fit of the weight matrix, there are various other methods to solve Eq. (10) or (12), such as genetic algorithms, singular value decomposition, simulated annealing, or algebraic approaches, mentioned in the preceding section. A further method was developed by Gardner et al. (2003). It is called NIR and is based on multiple linear regression analysis of steady-state transcription profiles. The algorithm EXAMINE by Deng et al. (2005) is based on the sparse-network assumption, i.e., that only a few genes are highly connected (Barabási and Oltvai, 2004). A preceding clustering step reduces the data set. The model network is then adaptively changed during the fitting process by the use of a genetic algorithm. Bansal et al. (2006) proposed a network inference method from time-course data after gene perturbations, called TSNI. At first, this algorithm performs a smoothing and interpolation step, followed by a Principal Component Analysis (PCA) that is applied to the data to reduce fluctuations and dimensionality. Finally, the reduced system is solved.

Neural Networks
Originally used in neuroscience as a model for a nervous system, neural networks were also applied for
gene regulatory network models (Vohradsky, 2001; Wahde and Hertz, 2000). A neural network for gene
regulation is usually used as a dynamic model based on an ordinary differential equation system:

$\frac{dx_i(t)}{dt} = f_i(x_{pa[i]}(t), Q)$    (19)

$\qquad\quad\; = a_i \, S\!\left( \sum_j w_{ij}\, x_j(t) + b_i \right) - d_i\, x_i(t)$    (20)

with the parameters: weights $W = \{w_{ij} \mid i, j = 1, \ldots, n\}$, where $w_{ij}$ represents the influence of node $j$ on node $i$, activation strengths $a = \{a_i \mid i = 1, \ldots, n\}$, bias parameters $b = \{b_i \mid i = 1, \ldots, n\}$ acting as delay parameters, and degradation rates $d = \{d_i \mid i = 1, \ldots, n\}$. A weighted sum of the node values at time $t$ over all connected nodes is transferred through a sigmoidal activation function $S(x) = (1 + e^{-x})^{-1}$, which maps the input values to the interval $(0, 1)$. The effects of all regulating nodes are summed up and have a combined effect on the connected node. This is a so-called additive model. In contrast to linear models, self-regulation and degradation are distinguishable in this mathematical model.
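The recurrent neural-network model of Eqs. (19)-(20) can be simulated as follows for a small hypothetical network; the weights, activation strengths, biases, and degradation rates are arbitrary example values.

import numpy as np
from scipy.integrate import solve_ivp

# Example parameters for a 3-gene network (illustrative values only)
W = np.array([[0.0, -2.0, 0.0],    # w_ij: influence of gene j on gene i
              [3.0,  0.0, 0.0],
              [0.0,  2.0, -1.0]])
a = np.array([1.0, 1.2, 0.8])      # activation strengths a_i
b = np.array([0.5, -1.0, 0.0])     # bias/delay parameters b_i
d = np.array([0.4, 0.3, 0.5])      # degradation rates d_i

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rhs(t, x):
    # dx_i/dt = a_i * S(sum_j w_ij x_j + b_i) - d_i x_i   (Eq. 20)
    return a * sigmoid(W @ x + b) - d * x

if __name__ == "__main__":
    sol = solve_ivp(rhs, (0.0, 30.0), [0.1, 0.1, 0.1],
                    t_eval=np.linspace(0.0, 30.0, 7))
    print(np.round(sol.y, 3))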
The simulated annealing algorithm and the genetic algorithm by Sexton et al. (1999) use the differential equations directly without discretization. Other inference methods for neural networks, such as backpropagation through time (BPTT), take into account that expression is measured at discrete time points; therefore a discretization in time of Eq. (20) has to be performed, similar to Eq. (11).
BPTT is described by Werbos (1990) and was applied to genetic data by Hache et al. (2007). The BPTT algorithm is an iterative, gradient-based parameter learning method which minimizes the error function:

$E(x, \hat{x}) = \frac{1}{2} \sum_{i,t} \left( x_i[t] - \hat{x}_i[t] \right)^2$    (21)

by varying the parameters of the model $\{W, a, b, d\}$ during every iteration step. Here $x = (x_1[t], \ldots, x_n[t])$ are the values computed at the end of an iteration, and $\hat{x} = (\hat{x}_1[t], \ldots, \hat{x}_n[t])$ are the given expression data of $n$ mRNAs at $m$ discrete time points $t \in \{t_1 = 0, \ldots, t_m = T\}$. Similar to the linear approach, the resulting matrix $W$ is a matrix of real values, which has to be discretized to obtain a binary or ternary matrix.
A Bayesian approach can extend neural network learning techniques (Lampinen and Vehtari, 2001). A non-uniform prior probability $P(Q)$ of the network parameters can be used. Finding the maximum of the posterior distribution (maximum a posteriori, MAP):

$P(Q \mid D) \propto P(D \mid Q)\, P(Q)$    (22)

is equal to minimizing the negative logarithm thereof. $D$ is the given data set. With a Gaussian assumption for the likelihood $P(D \mid Q)$ and a uniform parameter prior, the minimization of $-\log P(D \mid Q)$ is equal to the minimization of the error function Eq. (21) (maximum likelihood, ML). A Gaussian prior for the weights, $P(w_{ij}) = N(0, \sigma^2)$, results in an additional weight-decay term proportional to $\sum_{ij} w_{ij}^2$ in the error function, which reduces unneeded connections in the learned matrix. More complicated parameter priors can be included as well, but the learning routine has to be adapted accordingly.

Associated Networks
In contrast to other models, associated networks are undirected graphs in which each edge is assigned a statistical dependence or similarity measure of the connected nodes. Pearson correlation or


mutual information, see Eq. (3), are often used to determine this measure. Large values indicate high similarity between the two node value profiles, low values indicate statistical independence. Associated networks cannot be used for the simulation of gene regulation, since the edge values are the result of a statistical analysis of expression data.
Distance measures known from clustering can be used to assign an edge value, such as the Pearson correlation:

$r_{ij} = \frac{\mathrm{cov}(X_i, X_j)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(X_j)}}$    (23)

where cov denotes the covariance and var the variance. The correlation coefficient measures the linear relationship between two variables found in the data. If $\mathrm{cov}(X_i, X_j) = 0$, then $X_i$ and $X_j$ are not linearly correlated, although they can still have a nonlinear relationship. In a second step, usually a pruning process is applied, in which the algorithm seeks to remove edges that correspond to indirect interactions.
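A small numpy sketch of the first step of building an association network from Eq. (23): compute all pairwise Pearson correlations of the expression profiles and keep the edges whose absolute correlation exceeds a threshold. The random toy data and the threshold of 0.7 are assumptions for illustration; a real application would additionally assess statistical significance, e.g., by permutation.

import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 6, 30
# Toy expression matrix: gene 1 follows gene 0, gene 3 mirrors gene 2 (inverted)
expr = rng.normal(size=(n_genes, n_samples))
expr[1] = expr[0] + 0.2 * rng.normal(size=n_samples)
expr[3] = -expr[2] + 0.2 * rng.normal(size=n_samples)

r = np.corrcoef(expr)                 # matrix of pairwise Pearson correlations r_ij
threshold = 0.7
edges = [(i, j, round(float(r[i, j]), 2))
         for i in range(n_genes) for j in range(i + 1, n_genes)
         if abs(r[i, j]) > threshold]
print(edges)   # undirected edges of the association network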
Basso et al. (2005) developed ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks), which uses an information-theoretic approach to calculate a different similarity measure. Statistically significant gene-gene correlations are found by means of mutual information (see Eq. (3)). In contrast to Boolean networks, the data values here are continuous; therefore the probability distribution $P(X)$ of the random variable has to be estimated. This can be implemented by means of a Gaussian kernel estimator (see e.g., Steuer et al., 2002). Mutual information is zero for statistically independent variables, otherwise it is greater than zero. In contrast to the Pearson correlation, it captures not only linear dependencies but also other relationships. Nevertheless, mutual information alone is not a distance measure in the mathematical sense, since it does not fulfill the triangle inequality axiom of a distance.
ARACNe calculates the mutual information of each gene-gene pair. Then all non-significant edges are excluded based on a computed p-value. After that, it eliminates indirect relationships by using the Data Processing Inequality (DPI): in each triplet of fully connected nodes, the edge with the lowest mutual information is removed.
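The DPI pruning step can be sketched as follows, assuming a precomputed symmetric mutual-information matrix (here filled with invented toy values): for every fully connected triplet, the weakest of the three edges is marked for removal. This is a simplified illustration in the spirit of ARACNe and omits the significance filtering and the tolerance parameter of the published algorithm.

import numpy as np
from itertools import combinations

# Hypothetical symmetric mutual-information matrix for 4 genes (toy values)
mi = np.array([[0.0, 0.9, 0.7, 0.0],
               [0.9, 0.0, 0.8, 0.0],
               [0.7, 0.8, 0.0, 0.6],
               [0.0, 0.0, 0.6, 0.0]])

edges = {(i, j) for i, j in combinations(range(len(mi)), 2) if mi[i, j] > 0}
to_remove = set()
for i, j, k in combinations(range(len(mi)), 3):
    trio = [(i, j), (i, k), (j, k)]
    if all(e in edges for e in trio):                    # fully connected triplet
        to_remove.add(min(trio, key=lambda e: mi[e]))    # drop the weakest edge
print("kept edges:", sorted(edges - to_remove))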
To sum up, associated networks are undirected graphs that carry no information about the kind of interaction and cannot detect auto-loops, but they can quickly provide information about possible relationships in large data sets.
Other associated network methods with different pruning steps are used by Schmitt et al. (2004) with time-delayed correlation, de la Fuente et al. (2002) with partial correlation, Butte and Kohane (2000) with relevance networks, and Schäfer and Strimmer (2005) with graphical Gaussian models. See Soranzo et al. (2007) for a comparison of different associated network algorithms.

Static and Dynamic Bayesian Networks


A Bayesian network is a probabilistic graphical network model defined by a directed acyclic graph (DAG), which represents the topology, and a family of conditional probability distributions. In contrast to other models, the nodes represent random variables $X$ and the edges conditional dependence relations between these random variables. Assuming that nodes depend only on their direct parents (Markov assumption), the joint probability distribution of a Bayesian network can be factorized:


$P(X) = \prod_{i=1}^{n} P(X_i \mid X_{pa[i]})$    (24)

For discrete random variables, the $P(X_i \mid X_{pa[i]})$ are multinomial conditional probability distributions. Basically, these are tables of probabilities of the discrete states for each combination of parent states. With a multinomial distribution, nonlinear regulations can be modeled, but the number of free parameters, i.e., the number of entries in all conditional distribution tables, is exponential in the number of parents. Therefore, in Bayesian network models one often restricts the maximum number of possible parents.
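To illustrate the factorization of Eq. (24), the sketch below evaluates the joint probability of a toy discrete Bayesian network X1 -> X3 <- X2 from hand-written conditional probability tables; all probabilities are invented example numbers.

# Toy network: X1 and X2 have no parents, X3 has parents {X1, X2}. States are 0/1.
p_x1 = {0: 0.7, 1: 0.3}
p_x2 = {0: 0.6, 1: 0.4}
# Conditional probability table P(X3 = 1 | X1, X2): an OR-like regulation
p_x3_given = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}

def joint(x1, x2, x3):
    # P(X1, X2, X3) = P(X1) * P(X2) * P(X3 | X1, X2)   (Eq. 24)
    p3 = p_x3_given[(x1, x2)]
    return p_x1[x1] * p_x2[x2] * (p3 if x3 == 1 else 1.0 - p3)

if __name__ == "__main__":
    total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
    print(joint(1, 0, 1), total)   # the total over all states sums to 1.0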
Continuous node values can be used within a linear Gaussian model, which is given by the probability density:

$p(x_i \mid x_{pa[i]}) = N\!\left( \mu(x_{pa[i]}) + b_i,\; \sigma_i^2 \right)$ with $\mu(x_{pa[i]}) = \sum_j w_{ij}\, x^j_{pa[i]}$    (25)

Each random variable $X_i$ is normally distributed around a mean value $\mu(x_{pa[i]})$, which is determined as a sum of weighted parent values. Due to the linear sum, combinatorial effects of regulators, such as cooperative binding, cannot be modeled; only linear relations are considered. To capture nonlinear relationships one can assume a different mean value function $\mu$, e.g., Imoto et al. (2002) used a nonparametric additive regression model based on B-splines to approximate the dependency on the input values.
Static Bayesian networks have several limitations, especially for reconstruction purposes. First, several graphs with different edge directions can be consistent with the same joint probability distribution (Chickering, 2002); they belong to one equivalence class. That means that they are not distinguishable after learning from data, and there is no information about the direction of these edges.
Another major drawback of static Bayesian networks is that no cycles are allowed. Since cyclic regulation pathways can occur in gene regulatory networks, dynamic Bayesian networks were introduced by Friedman et al. (1998). Dynamic Bayesian networks are based on the fact that regulation does not take place instantaneously but with a time delay. By unfolding a Bayesian network over discrete time steps one obtains again a valid Bayesian network, but with a different joint probability distribution:

$P(X) = P(X[0]) \prod_{t > 0} \prod_i P(X_i[t] \mid X_{pa[i]}[t - \Delta t])$    (26)

where $X$ is the set of random variables of the static Bayesian network including all random variables $X_i$ at time $t$, denoted by $X_i[t]$. The temporal process is Markovian and homogeneous in time, which means that a variable $X_i[t]$ is independent of all nodes that are not in time slice $t - \Delta t$, and that the conditional distributions do not change over time. Reversing an edge in a dynamic Bayesian network would result in an invalid network, since the temporal causality would be violated. Therefore, the joint probability distribution can be uniquely factorized and the corresponding structure is unambiguous.
Special kinds of dynamic Bayesian networks are state space models (SSM), where the observed
measurements depend on hidden states, which are not measured, such as proteins, genes which are not
included in the network, degradation, or external signals (Beal et al., 2005; Rangel et al., 2004).
Nonlinear regulations can also be modeled in dynamic Bayesian networks, e.g., Kim et al. (2002) expanded their nonparametric regression model to dynamic models. Nachman et al. (2004) used a transcription model based on Michaelis-Menten kinetics. Rogers et al. (2007) also follow a nonlinear approach described by Michaelis-Menten kinetics to infer transcription factor activity from the expression of target genes.
Learning Bayesian networks from a set of measurements $D$ means finding the network $G^*$ that best matches the given data and the parameters $Q^*$ which maximize the posterior parameter distribution given the network $G^*$. One has to find the posterior distributions of network structures and parameters given the data and choose the network and parameters, respectively, which maximize these distributions:
$G^* = \arg\max_G \{ P(G \mid D) \}$, $\quad Q^* = \arg\max_Q \{ P(Q \mid G^*, D) \}$    (27)

By means of the Bayes rule one can write for the posterior distribution:
$P(G \mid D) = P(D \mid G)\, P(G) \,/\, P(D)$    (28)
$\qquad\quad\;\; \propto P(D \mid G)\, P(G)$    (29)

The normalization constant is given by:

$P(D) = \sum_G P(D \mid G)\, P(G)$    (30)

The marginal likelihood is an integration over the whole parameter space:

$P(D \mid G) = \int P(D \mid Q, G)\, P(Q \mid G)\, dQ$    (31)

This can be interpreted as an averaging of the probability of generating the data D with a graph G
and parameters Q over all possible parameter assignments weighted with the parameter prior P(Q | G)
of the network. An advantage of using Bayesian models is the possibility of integrating priors for the
graphs P(G) and the parameters P(Q | G).
A common approach is to assign a score to each network which evaluates the network with respect to
the data. Such a score is usually based on the posterior distribution given in Eq. (29). Scores which are
based on the marginal likelihood Eq. (31) are not recommended since complex networks receive higher
values than sparse networks. Therefore methods using this score tend to overfit the data. To determine
a score of a network one has to compute the high-dimensional integration in Eq. (31) for each possible
graph. This computation is usually intractable.
Under certain conditions the integral is analytically solvable. For two function families $F$ there are closed forms for the conditional distributions $P(X_i \mid X_{pa[i]})$ and parameter priors $P(Q \mid G)$: a multinomial distribution with a Dirichlet prior results in the BDe score (Heckerman et al., 1995), and a linear Gaussian distribution with a normal-Wishart prior results in the BGe score (Geiger and Heckerman, 1994). If there is no closed form, approximations have to be used, such as the Bayesian information criterion (BIC) score (Schwarz, 1978) or others.
Besides the calculation of an appropriate score, the use of a suitable search algorithm is a crucial point as well. There is, e.g., the hill-climbing algorithm, which searches the graph space for the next graph with a higher score by applying local changes to a graph, such as adding, deleting, or reversing edges. It is possible that the algorithm runs into a local optimum where it gets trapped. Simulated annealing methods have a chance to escape from such a local optimum with a certain probability, which decreases during the process. A further search method is the K2 algorithm developed by Cooper and Herskovits (1992), which is a greedy search algorithm starting with an empty network. This search algorithm requires a prior ordering of the nodes as an input, from which a network structure is constructed.
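A schematic hill-climbing loop over single-edge additions and deletions might look as follows; the score function is a placeholder standing in for a BDe or BIC score computed from data, and the acyclicity check is a simple Kahn-style test, so this only sketches the search strategy described above.

from itertools import permutations

def is_acyclic(edges, n):
    # Kahn-style check: repeatedly remove nodes without incoming edges
    remaining = set(edges)
    nodes = set(range(n))
    while nodes:
        free = [v for v in nodes if not any(e[1] == v for e in remaining)]
        if not free:
            return False
        nodes -= set(free)
        remaining = {e for e in remaining if e[0] in nodes and e[1] in nodes}
    return True

def hill_climb(n, score, max_iter=100):
    """Greedy search over DAGs: try adding or deleting single edges."""
    graph = set()
    best = score(graph)
    for _ in range(max_iter):
        improved = False
        for i, j in permutations(range(n), 2):
            cand = graph ^ {(i, j)}          # toggle edge i -> j
            if (i, j) in cand and not is_acyclic(cand, n):
                continue                      # adding this edge would create a cycle
            s = score(cand)
            if s > best:
                graph, best, improved = cand, s, True
        if not improved:
            break                             # local optimum reached
    return graph, best

if __name__ == "__main__":
    # Placeholder score rewarding one target structure (stand-in for BIC/BDe)
    target = {(0, 1), (1, 2)}
    dummy_score = lambda g: -len(g ^ target)
    print(hill_climb(3, dummy_score))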
Instead of determining the best matching graph $G^*$ in Eq. (27), one may rather be interested in the posterior distribution $P(G \mid D)$ itself, since the distribution is often very flat and cannot be represented adequately by a single network. Markov chain Monte Carlo (MCMC) simulations produce a Markov chain:

$P_{n+1}(G_i) = \sum_j T(G_i \mid G_j)\, P_n(G_j)$    (32)

which converges to the posterior distribution:

$P_n(G) \xrightarrow{\; n \to \infty \;} P(G \mid D)$    (33)

One has to construct an appropriate transition matrix T. See Husmeier et al. (2005) for a further
description of Bayesian networks and MCMC.
The Bayesian network inference application Banjo was developed in the group of Hartemink (see
http://www.cs.duke.edu/~amink/software/banjo/). With this tool static and dynamic data can be analyzed.
Another source of Bayesian network inference methods is the open-source library PNL (http://sourceforge.
net/projects/openpnl) based on Murphy (1998) or the Bayes net toolbox for Matlab (Murphy, 2001).

Other Methods
Besides the models mentioned above, there are some other approaches which differ in principle, such as Petri Nets, decision trees, or hybrid models. The Petri Net model was introduced by Petri (1962). The simplest kind of Petri Net is a directed graph consisting of arcs and two different kinds of nodes, place nodes and transition nodes. An arc connects a place with a transition node or vice versa and is labeled by a positive integer value as a weight. Each place contains tokens. If the number of tokens in each place connected to a transition node is at least as large as the corresponding arc weight, the transition is enabled and can move tokens from the pre-transition to the post-transition places (see Pinney et al. (2003) for a short introduction to Petri Nets). Many extensions have been developed, e.g., hybrid, hierarchical, stochastic, and timed Petri Nets, and some are used to model gene regulation, for instance by Goss and Peccoud (1998) and Matsuno et al. (2000). Marwan et al. (2005) used Petri Nets for reconstruction, where the structure is continually refined by including additional experimental data in the reconstruction method.
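A minimal sketch of the place/transition firing rule described above, for an invented two-transition net: a transition is enabled when every input place holds at least as many tokens as the corresponding arc weight, and firing moves tokens from the pre- to the post-places.

# Marking: tokens per place; each transition lists weighted input and output arcs.
marking = {"gene_inactive": 1, "activator": 2, "gene_active": 0}
transitions = {
    "activate": {"inputs": {"gene_inactive": 1, "activator": 2},
                 "outputs": {"gene_active": 1}},
    "decay":    {"inputs": {"gene_active": 1},
                 "outputs": {"gene_inactive": 1}},
}

def enabled(t, m):
    return all(m[p] >= w for p, w in transitions[t]["inputs"].items())

def fire(t, m):
    assert enabled(t, m), f"transition {t} is not enabled"
    for p, w in transitions[t]["inputs"].items():
        m[p] -= w                      # consume tokens from pre-places
    for p, w in transitions[t]["outputs"].items():
        m[p] += w                      # produce tokens in post-places
    return m

if __name__ == "__main__":
    print(enabled("activate", marking), enabled("decay", marking))  # True False
    fire("activate", marking)
    print(marking)                     # gene_active now holds one token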
Another class of reconstruction methods is based on decision trees, which are predictive models. Soinov et al. (2003) used machine learning techniques to build a decision-tree-related classifier by searching for genes which are relevant for the prediction. With such a classifier one can predict the expression of a gene from the expression data of other genes. No explicit dynamics is included in this model.
One is not restricted to the models mentioned above. Many hybrid models have been proposed and several extensions are possible, such as the inclusion of prior knowledge. With a certain probability one can incorporate known regulations measured in preceding experiments or already published in the literature or in databases. The reverse engineering system then searches for yet unrevealed interactions.


Validation
A very crucial point of a reverse engineering framework is the validation of the inference algorithm. This has to be done by means of artificially generated or real biological data. The validation should identify the strengths and weaknesses of the algorithm and indicate under which conditions it gives reliable results. Each method has to cope with noisy, high-dimensional, and incomplete data; hence the performance on these kinds of data should be validated.
The use of artificially generated data has several advantages. The underlying network is known, as well as the kinetics and the noise level. One can easily change the conditions and analyze the impact on the reconstruction results. An arbitrary number of data sets with diverse network sizes can be generated. Therefore, the scalability with regard to large data sets can be determined.
Nevertheless, the significance of these statistical evaluations is decisively connected with the artificial data generator model. Many authors of a reverse engineering system have adapted their artificial data models for generating test data sets. For instance, linear models were used by van Someren et al. (2000) for their linear approach and by Basso et al. (2005) for the ARACNe algorithm. Yu et al. (2004) added a stochastic term to a linear model to test the dynamic Bayesian network method. The continuous-time neural network method of Wahde and Hertz (2000) used data produced with neural networks, as did Weaver et al. (1999). The performance of the REVEAL algorithm of Liang et al. (1998) was tested with a set of Boolean state transitions.
There is obviously a lack of a standard validation procedure. Therefore, authors such as den Bulcke et al. (2006) have focused on the development of data generators. den Bulcke et al. (2006) proposed the tool SynTReN, which simulates steady-state data by using subnetworks of previously published regulatory networks. The subnetworks show properties of real biological networks, which is crucial to estimate the performance of an algorithm in a real biological context.
On the other hand, the use of experimental data in a test case highlights the actual performance, since the data are not artificially generated with a simplified model. A validation is only possible if the underlying network is known. A gold-standard network can be assembled from gene-gene regulations published in the literature or in databases. But there is still the uncertainty of unrevealed interactions, which can bias the performance assessment.
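For a comparison against a gold-standard network, the inferred and the reference edge sets can be summarized by precision and recall, as in the sketch below; the two adjacency matrices are invented toy examples.

import numpy as np

def precision_recall(inferred, gold):
    """Compare binary adjacency matrices (1 = predicted/known regulation)."""
    inferred, gold = np.asarray(inferred, bool), np.asarray(gold, bool)
    tp = np.sum(inferred & gold)           # correctly predicted regulations
    fp = np.sum(inferred & ~gold)          # predicted but not in the gold standard
    fn = np.sum(~inferred & gold)          # missed regulations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

if __name__ == "__main__":
    gold = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]        # known network (toy)
    inferred = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]    # reconstructed network (toy)
    print(precision_recall(inferred, gold))         # (0.5, 0.5)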

Conclusion
In this chapter models are described that are used for the reconstruction of gene regulatory networks
from gene expression data. Each model is based on different mathematical aspects and shows different
complexities. There are for instance continuous or discrete, deterministic or stochastic models with
simple or complex reaction kinetics, or even no kinetics at all.
Comparative studies of reverse engineering algorithms have been performed, for instance, by Soranzo et al. (2007), Wessels et al. (2001), and Werhli et al. (2006). Bansal et al. (2007) revealed a very low overlap between different methods (ARACNe, Banjo, NIR, clustering). It would be interesting to investigate the overlap in the case of more sophisticated algorithms.
All reverse engineering applications need a preprocessing of the experimental data, such as a normalization and a significance analysis. Some methods can only cope with discretized values; hence a mapping of the continuous expression values to a discrete set of values has to follow. The number of discrete states and an appropriate discretization method have to be chosen carefully, since an inaccurate mapping can result in the loss of a large amount of information.
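One common preprocessing choice, quantile-based discretization of each gene's profile into three states, can be sketched as follows; the number of states and the random toy data are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(2)
expr = rng.normal(size=(4, 12))            # toy expression matrix: genes x samples

def discretize(values, n_states=3):
    # Split each gene's profile at its own quantiles into n_states equally filled bins
    qs = np.quantile(values, np.linspace(0, 1, n_states + 1)[1:-1])
    return np.digitize(values, qs)         # 0 = low, 1 = medium, 2 = high

discrete = np.vstack([discretize(row) for row in expr])
print(discrete)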
Most reconstruction methods are limited in the number of nodes. An analysis of whole-genome expression measurements is not reasonable, since the parameter and network spaces are too large to find an optimal parameter set and a best matching network, respectively, given the data. Large computing power alone cannot solve this problem. The model has to be kept simple, with a sufficiently low parameter dimensionality, and the data input has to be reduced. In a variable selection step one has to decide which genes will be considered and which will be neglected, by means of a significance analysis, correlation analysis, or clustering.
It is a challenging task to extract information about the structure and interactions of the underlying gene regulatory system from gene expression data. Therefore, the quality and quantity of measurements have to be improved together with the performance of the algorithms. Benchmarks with realistic artificial data have to identify those methods which show the best results under different conditions.

Acknowledgment
I would like to thank Ralf Herwig, Christoph Wierling, and Elisabeth Maschke-Dutz for proof-reading
of this chapter and their constructive feedback and comments. This work was funded by the Max Planck
Society and the EMBRACE Network of Excellence.

REFERENCES
Akutsu, T., Miyano, S., & Kuhara, S. (1999). Identification of genetic networks from a small number of
gene expression patterns under the Boolean network model. Pac Symp Biocomput, 17-28.
Akutsu, T., Miyano, S., & Kuhara, S. (2000). Algorithms for inferring qualitative models of biological
networks. Pac Symp Biocomput, 293-304.
Bansal, M., Belcastro, V., Ambesi-Impiombato, A., & di Bernardo, D. (2007). How to infer gene networks from expression profiles. Mol Syst Biol, 3, 78.
Bansal, M., Gatta, G. D., & di Bernardo, D. (2006). Inference of gene regulatory networks and compound
mode of action from time course gene expression profiles. Bioinformatics, 22(7), 815-822.
Barabási, A.-L., & Oltvai, Z. N. (2004). Network biology: Understanding the cell's functional organization. Nat Rev Genet, 5(2), 101-113.
Basso, K., Margolin, A. A., Stolovitzky, G., Klein, U., Dalla-Favera, R., & Califano, A. (2005). Reverse engineering of regulatory networks in human B cells. Nat Genet, 37(4), 382-390.
Beal, M. J., Falciani, F., Ghahramani, Z., Rangel, C., & Wild, D. L. (2005). A Bayesian approach to
reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21(3), 349-356.
Bernard, A., & Hartemink, A. J. (2005). Informative structure priors: Joint learning of dynamic regulatory networks from multiple types of data. Pac Symp Biocomput, 459-470.


Butte, A.J., & Kohane, I.S. (2000). Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput, 418-429.
Chen, T., Filkov, V., & Skiena, S.S. (2001). Identifying gene regulatory networks from experimental
data. Parallel Comput., 27(1-2), 141-162.
Chen, T., He, H.L., & Church, G.M. (1999). Modeling gene expression with differential equations. Pac
Symp Biocomput, 29-40.
Chickering, D.M. (2002). Learning equivalence classes of Bayesian-network structures. J. Mach.
Learn. Res., 2, 445-498.
Cooper, G.F., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks
from data. Machine Learning, 09(4), 309-347.
de la Fuente, A., Brazhnik, P., & Mendes, P. (2002). Linking the genes: Inferring quantitative gene
networks from microarray data. Trends Genet, 18(8), 395-398.
den Bulcke, T.V., Leemput, K.V., Naudts, B., van Remortel, P., Ma, H., Verschoren, A., Moor, B.D.,
& Marchal, K. (2006). SynTReN: A generator of synthetic gene expression data for design and analysis
of structure learning algorithms. BMC Bioinformatics, 7, 43.
Deng, X., Geng, H., & Ali, H. (2005). EXAMINE: A computational approach to reconstructing gene
regulatory networks. Biosystems, 81(2), 125-136.
DeRisi, J., Penland, L., Brown, P.O., Bittner, M.L., Meltzer, P.S., Ray, M., et al. (1996). Use of a cDNA
microarray to analyse gene expression patterns in human cancer. Nat Genet, 14(4), 457-460.
DeRisi, J.L., Iyer, V.R., & Brown, P.O. (1997). Exploring the metabolic and genetic control of gene
expression on a genomic scale. Science, 278(5338), 680-686.
D'haeseleer, P., Wen, X., Fuhrman, S., & Somogyi, R. (1999). Linear modeling of mRNA expression
levels during CNS development and injury. Pac Symp Biocomput, 41-52.
Eisen, M.B., Spellman, P.T., Brown, P.O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 95(25), 14863-14868.
Friedman, N., Murphy, K., & Russell, S. (1998). Learning the structure of dynamic probabilistic networks. In Cooper, G., & Moral, S. (Eds.), UAI 98: Proceedings of the Fourteenth Annual Conference
on Uncertainty in Artificial Intelligence, 139-147, Madison, Wisconsin: Morgan Kaufmann.
Gardner, T.S., di Bernardo, D., Lorenz, D., & Collins, J. J. (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301(5629), 102-105.
Gardner, T. S., & Faith, J. J. (2005). Reverse-engineering transcription control networks. Physics of
Life Reviews, 2(1), 65-88.
Geiger, D., & Heckerman, D. (1994). Learning Gaussian networks. In UAI, 235-243.
Goss, P.J., & Peccoud, J. (1998). Quantitative modeling of stochastic systems in molecular biology by
using stochastic Petri Nets. Proc Natl Acad Sci USA, 95(12), 6750-6755.


Hache, H., Wierling, C., Lehrach, H., & Herwig, R. (2007). Reconstruction and validation of gene regulatory networks with neural networks. In FOSBE 07: Proceedings of the 2nd Foundations of Systems
Biology in Engineering Conference, 319-324.
Heckerman, D., Geiger, D., & Chickering, D.M. (1995). Learning Bayesian Networks: The combination
of knowledge and statistical data. Machine Learning, 20(3), 197-243.
Heinrich, R., & Schuster, S. (1996). The regulation of cellular systems. Springer
Husmeier, D., Dybowski, R., & Roberts, S., (eds.) (2005). Probabilistic modeling in bioinformatics and
medical informatics. Advanced Information and Knowledge Processing. Springer Verlag.
Imoto, S., Goto, T., & Miyano, S. (2002). Estimation of genetic networks and functional structures between
genes by using Bayesian networks and nonparametric regression. Pac Symp Biocomput, 175-186.
Imoto, S., Higuchi, T., Goto, T., Tashiro, K., Kuhara, S., & Miyano, S. (2003). Combining microarrays
and biological knowledge for estimating gene networks via Bayesian Networks. Proc IEEE Comput
Soc Bioinform Conf, 2, 104-113.
Kauffman, S.A. (1969). Metabolic stability and epigenesis in randomly constructed genetic nets. Journal
of Theoretical Biology, 22(3), 437-467.
Kim, S., Imoto, S., & Miyano, S. (2002). Dynamic Bayesian network and nonparametric regression for
nonlinear modeling of gene networks. Genome Informatics, 13, 371-372.
Lampinen, J., & Vehtari, A. (2001). Bayesian approach for neural networks-review and case studies.
Neural Netw, 14(3), 257-274.
Laubenbacher, R., & Stigler, B. (2004). A computational algebra approach to the reverse engineering
of gene regulatory networks. J Theor Biol, 229(4), 523-537.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298(5594), 799-804.
Liang, S., Fuhrman, S., & Somogyi, R. (1998). REVEAL, a general reverse engineering algorithm for
inference of genetic network architectures. Pac Symp Biocomput, 18-29.
Marwan, W., Sujatha, A., & Starostzik, C. (2005). Reconstructing the regulatory network controlling commitment and sporulation in Physarum polycephalum based on hierarchical Petri Net modelling and
simulation. J Theor Biol, 236(4), 349-365.
Matsuno, H., Doi, A., Nagasaki, M., & Miyano, S. (2000). Hybrid Petri Net representation of gene
regulatory network. Pac Symp Biocomput, 341-352.
Murphy, K.P. (1998). Inference and learning in hybrid Bayesian networks. Technical Report UCB/CSD98-990, Computer Science Division (EECS), University of California, Berkeley, CA.
Murphy, K.P. (2001). The Bayes Net toolbox for matlab. Computing Science and Statistics, 33.
Nachman, I., Regev, A., & Friedman, N. (2004). Inferring quantitative models of regulatory networks
from expression data. Bioinformatics, (20 Suppl 1), I248-I256.


Newman, J. R.S., Ghaemmaghami, S., Ihmels, J., Breslow, D.K., Noble, M., DeRisi, J.L., & Weissman, J.S. (2006). Single-cell proteomic analysis of S. cerevisiae reveals the architecture of biological
noise. Nature, 441(7095), 840-846.
Petri, C.A. (1962). Kommunikation mit Automaten. PhD thesis, Institut für Instrumentelle Mathematik,
Technische Hochschule Darmstadt, Bonn.
Pinney, J.W., Westhead, D.R., & McConkey, G.A. (2003). Petri Net representations in systems biology.
Biochem Soc Trans, 31(Pt 6), 1513-1515.
Rangel, C., Angus, J., Ghahramani, Z., Lioumi, M., Sotheran, E., Gaiba, A., et al. (2004). Modeling T-cell
activation using gene expression profiling and state-space models. Bioinformatics, 20(9), 1361-1372.
Rogers, S., Khanin, R., & Girolami, M. (2007). Bayesian model-based inference of transcription factor
activity. BMC Bioinformatics, 8(Suppl 2), S2.
Ruvkun, G. (2001). Molecular biology: Glimpses of a tiny RNA world. Science, 294(5543), 797-799.
Schena, M., Shalon, D., Davis, R.W., & Brown, P.O. (1995). Quantitative monitoring of gene expression
patterns with a complementary DNA microarray. Science, 270(5235), 467-470.
Schmitt, W.A., Raab, M.R., & Stephanopoulos, G. (2004). Elucidation of gene interaction networks
through time-lagged correlation analysis of transcriptional data. Genome Res., 14(8), 1654-1663.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464.
Schäfer, J., & Strimmer, K. (2005). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6), 754-764.
Sexton, R.S., Dorsey, R.E., & Johnson, J.D. (1999). Optimization of neural networks: A comparative
analysis of the genetic algorithm and simulated annealing. European Journal of Operational Research,
114(3), 589-601.
Shmulevich, I., Dougherty, E.R., & Zhang, W. (2002). From Boolean to probabilistic Boolean networks
as models of genetic regulatory networks. Proceedings of the IEEE, 90(11), 1778-1792.
Soinov, L.A., Krestyaninova, M.A., & Brazma, A. (2003). Towards reconstruction of gene networks
from expression data by supervised learning. Genome Biol, 4(1), R6.
Soranzo, N., Bianconi, G., & Altafini, C. (2007). Comparing association network algorithms for reverse
engineering of large-scale gene regulatory networks: Synthetic versus real data. Bioinformatics, 23(13),
1640-1647.
Steuer, R., Kurths, J., Daub, C.O., Weise, J., & Selbig, J. (2002). The mutual information: Detecting and
evaluating dependencies between variables. Bioinformatics, 18(Suppl 2), S231-S240.
Thomas, R. (1973). Boolean formalization of genetic control circuits. Journal of Theoretical Biology,
42(3), 563-585.
van Someren, E.P., Wessels, L.F., & Reinders, M.J. (2000). Linear modeling of genetic networks from
experimental data. Proc Int Conf Intell Syst Mol Biol, 8, 355-366.

Vohradsky, J. (2001). Neural network model of gene expression. FASEB J, 15(3), 846-54.
Wahde, M., & Hertz, J. (2000). Coarse-grained reverse engineering of genetic regulatory networks.
Biosystems, 55(1-3), 129-136.
Weaver, D.C., Workman, C.T., & Stormo, G.D. (1999). Modeling regulatory networks with weight
matrices. Pac Symp Biocomput, 112-123.
Werbos, P.J. (1990). Backpropagation through time: What it does and how to do it. In Proceedings of
the IEEE, 78.
Werhli, A.V., Grzegorczyk, M., & Husmeier, D. (2006). Comparative evaluation of reverse engineering
gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks.
Bioinformatics, 22(20), 2523-2531.
Werhli, A.V., & Husmeier, D. (2007). Reconstructing gene regulatory networks with bayesian networks
by combining expression data with multiple sources of prior knowledge. Stat Appl Genet Mol Biol, 6,
15.
Wessels, L.F., van Someren, E.P., & Reinders, M.J. (2001). A comparison of genetic network models.
Pac Symp Biocomput, 508-519.
Wierling, C., Herwig, R., & Lehrach, H. (2007). Resources, standards and tools for systems biology.
Brief Funct Genomic Proteomic, 6(3), 240-251.
Wilczynski, B., & Tiuryn, J. (2006). Regulatory network reconstruction using stochastic logical networks. In Priami, C., (ed.), CMSB, 142-154.
Yeung, M.K.S., Tegnér, J., & Collins, J.J. (2002). Reverse engineering gene networks using singular
value decomposition and robust regression. Proc Natl Acad Sci U S A, 99(9), 6163-6168.
Yu, J., Smith, V.A., Wang, P.P., Hartemink, A.J., & Jarvis, E.D. (2004). Advances to Bayesian network
inference for generating causal networks from observational biological data. Bioinformatics, 20(18),
3594-603.

Key Terms
Associated Network: Each arc of this network is associated with a similarity measure between the values of the nodes it connects. This measure can be the Pearson correlation, mutual information, or another measure. The node values are vectors of real numbers.
Bayesian Network: This refers to a probabilistic graphical model defined by a set of random variables and a set of conditional probability distributions. These can be multinomial for discrete variables, Gaussian for continuous variables, or other forms.
Boolean Network: This refers to a graphical structure with nodes that can have two discrete states.
A state of a node is determined by the state of other connected nodes. The state of the network is determined by the state of each node.

Joint Probability Distribution: It is the probability distribution of two or more random variables
together, i.e., P(X1,X2,...).
Likelihood Function: It is the probability of the occurrence of a sample configuration, viewed as a function of the parameters of the distribution; the conditional probability distribution of the random variable given the parameters has to be known. L(θ | X = x) = P(X = x | θ) is a likelihood function, where X is a random variable, x is the observed value of X, and θ is a parameter.
Markov Assumption: The conditional probability distribution of the current state depends only on its parents; it is independent of all non-parents. For a dynamical system this means that, given the present state, all following states are independent of all past states.
MCMC: Short for Markov Chain Monte Carlo. It is a class of algorithms for sampling from a probability distribution. This distribution is simulated with a Markov Chain whose equilibrium distribution
is the desired probability distribution.
Mutual Information: It is the amount of information that one random variable carries about another. In other words, it is the reduction of uncertainty about one variable after observing the other. It is symmetric with respect to the two variables.
Neural Network: This refers to a graphical structure with artificial neurons as nodes. The value of each node is determined by the input signals from the connected nodes, passed through a nonlinear transfer function.
Ordinary Differential Equation: It is an equation involving a function of a single independent variable and its derivatives.
Posterior Probability Distribution: It is the conditional probability distribution of a random variable after another variable (or the data) has been observed. It is computed from the prior and the likelihood function.
Prior Probability Distribution: It is the probability distribution of a random variable before any data has been observed. It expresses information about the variable obtained beforehand. It is often simply called the prior.
Reverse Engineering: In general, it is the reconstruction of a system by analyzing its structure, functions, and operations. Reverse engineering of gene regulatory networks is the process of revealing the underlying structure of gene regulation from biological measurements, such as gene and protein expression, or others.

Chapter XXX

Data Integration for Regulatory Gene Module Discovery
Alok Mishra
Imperial College London, UK
Duncan Gillies
Imperial College London, UK

Abstract
This chapter introduces the techniques that have been used to identify the genetic regulatory modules by
integrating data from various sources. Data relating to the functioning of individual genes can be drawn
from many different and diverse experimental techniques. Each piece of data provides information on a
specific aspect of the cell regulation process. The chapter argues that integration of these diverse types
of data is essential in order to identify biologically relevant regulatory modules. A concise review of the
different integration techniques is presented, together with a critical discussion of their pros and cons.
A very large number of research papers have been published on this topic, and the authors hope that
this chapter will present the reader with a high-level view of the area, elucidating the research issues
and underlining the importance of data integration in modern bioinformatics.

Introduction
A network of transcription factors regulating transcription factors or other proteins is called a transcriptional regulatory network or gene regulatory network. The understanding and reconstruction of this
regulation process at a global level is one of the major challenges for the nascent field of bioinformatics (Schölkopf et al., 2004).
Considerable work has been done by molecular biologists over the last few years in identifying the
functions of specific genes. In an ideal world it would be desirable to apply these results in order to build
detailed models of regulation where the precise action of each gene is understood. However, the large number of genes and the complexity of the regulation process mean that this approach has not been feasible.
Research into discovering causal models based on the actions of individual genes has encountered a
major difficulty in estimating a large number of parameters from a paucity of experimental data. Fortunately however, biological organisation opens up the possibility of modelling at a less detailed level.
In nature, complex functions of living cells are carried out through the concerted activities of many
genes and gene products which are organized into co-regulated sets also known as regulatory modules
(Segal et al., 2003). Understanding the organization of these sets of genes will provide insights into
the cellular response mechanism under various conditions. Recently a considerable volume of data on
gene activity, measured using several diverse techniques, has become widely available. By fusing this
data using an integrative approach, we can try to unravel the regulation process at a more global level.
Although an integrated model could never be as precise as one built from a small number of genes in
controlled conditions, such global modelling can provide insights into higher processes where many
genes are working together to achieve a task. Various techniques from statistics, machine learning and
computer science have been employed by researchers for the analysis and combination of the different
types of data in an attempt to identify and understand the function of regulatory modules.
There are two underlying problems resulting from the nature of the available data. Firstly, each
of the different data types (microarray, dna-binding, protein-protein interaction and sequence data)
provides a partial and noisy picture of the whole process. They need to be integrated in order to obtain
an improved and reliable picture of the whole underlying process. Secondly, the amount of data that is
available from each of these techniques is severely limited. To learn good models we need lots of data,
yet data is only available for few experiments of each type. To alleviate this problem many researchers
have taken the path of merging all available datasets before carrying out an analysis. Thus there can
be some confusion regarding the term 'integrative', because it has been used to describe both of these
two very different approaches to data integration: one among datasets of the same type, for example
microarrays, but from different experiments, and the other among different types of data, for example
microarray and DNA binding data.
In the rest of the chapter we will describe various techniques proposed to carry out both of these
types of integration and will discuss their pros and cons. We will review some of the prominent research
following the former approach by Ihmels et al. (2002) and Segal et al. (2005), and work following the
latter approach by Bar-Joseph et al. (2003), Tanay et al. (2004, 2005) and Lemmens et al. (2006).

Background
Biological Background
Higher organisms are made up of various cell types, each of which performs a specific role that
contributes to its overall functioning. The fascinating fact is that each of these cells contains exactly
the same set of genes. The cells of higher organisms, known as eukaryotes, differ from those of the less
evolved prokaryotes in having a well-defined nucleus that carries the genetic material. The remarkable
diversity among the cells is a result of a precisely controlled mechanism of expression and regulation
of a subset of genes in each cell type. The expression of genes into their complements, called m-RNAs
or transcripts, is known as transcription, while the next step of the process, which leads to the creation of a
protein from the intermediate m-RNA is called translation. Proteins can react with each other and influence the regulation of cells from the outside by a process of signal transduction. Like most biological
systems, this whole process is regulated at multiple places. The process begins when some molecules
known as transcription factors (TFs) are activated by a trans-membrane receptor, leading them to bind
to gene regulatory elements and to promote access to the DNA and facilitate the recruitment of RNA
polymerase to the transcriptional start site. The gene regulatory elements of the DNA, also known as
promoter regions, are situated upstream of the gene at a distance which can vary from a few base pairs
to hundreds of base pairs. The regulatory elements contain binding sites for multiple transcription factors
allowing each gene to respond to multiple signalling pathways and facilitate fine-tuning of the m-RNAs
that are produced. Once the transcription factors are bound on the regulatory elements, they can either
promote or inhibit gene expression. In the case of promotion, the process of transcription starts: a protein called RNA polymerase begins to copy the information contained in the gene into messenger RNA
(m-RNA). These m-RNA molecules, being exact replicas of the gene, contain both exons (which will
be used in the later process) and introns (which will be removed). A process known as splicing removes
the introns and the remaining m-RNA, called spliced m-RNA, is transported out of the nucleus into
the cellular material. There it is translated into a polypeptide chain with the help of ribosomes and this
chain then folds into a three-dimensional structure known as protein. A detailed review of the whole
process can be found in any standard textbook on molecular biology.
The previous paragraph gives only a partial picture. Since transcription factors themselves are
proteins, the same process may regulate them. In fact there are genes that code just for transcription
factors. This process is similar to a feedback loop in which transcription factors are regulated by other
transcription factors. In particular, a major goal is to understand how transcription factors affect gene
expression and which groups of genes are co-regulated by certain sets of transcription factors.

Data Sources
Various types of data are used to identify regulatory mechanisms. These are primarily generated by
molecular biologists using experimental techniques. Some of the types currently available are:

• m-RNA expression measured using microarrays
• Transcription factor binding to DNA measured using ChIP-chip (chromatin immunoprecipitation)
• Transcription factor binding motifs from the promoter sequences of genes
• Protein-protein interactions (PPI) measured using co-immunoprecipitation and other techniques

One of the most important sources of data is genome-wide measurement of m-RNA expression levels carried out using microarrays. These have received considerable attention in the last six years and
various technologies for microarray measurement have been developed (Schulze & Downward, 2001).
Microarrays allow simultaneous measurement of the expression levels of a large number of genes. Similar
expression profiles identify genes that may be controlled by a shared regulatory mechanism. Spellman
is one of the microarray pioneers; he used them to study the global expression of genes at various time points in the yeast cell cycle (Spellman et al., 1998). He, along with other researchers (Gasch et al., 2000), also studied the response of yeast genes when subjected to various kinds of stress. Processing microarray data to reduce the errors introduced at various stages is known as normalization. Quackenbush (2006)
provides a good overview of the techniques used for normalization and analysis, while Smyth et al. (2003) discuss in detail the statistical issues involved in normalization.
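As an illustration of what such a step can look like, the sketch below implements quantile normalization on a synthetic intensity matrix; this is only one of several possible procedures and is not a method prescribed by the reviews cited above.

```python
# Sketch of quantile normalization: force every sample (column) to share the
# same empirical intensity distribution. The raw matrix is synthetic.
import numpy as np

def quantile_normalize(expr):
    """expr: genes x samples matrix of raw intensities."""
    order = np.argsort(expr, axis=0)                 # per-sample ranking of genes
    reference = np.sort(expr, axis=0).mean(axis=1)   # mean distribution across samples
    normalized = np.empty_like(expr, dtype=float)
    for j in range(expr.shape[1]):
        normalized[order[:, j], j] = reference       # assign reference values by rank
    return normalized

rng = np.random.default_rng(1)
raw = np.abs(rng.standard_normal((1000, 6))) * np.array([1, 2, 1, 3, 1, 2])  # unequal scales
norm = quantile_normalize(raw)
print(norm.mean(axis=0))                             # sample means are now identical
```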
Another source of data is transcription factor-DNA binding data that is generated as a result of the
chromatin immunoprecipitation (ChIP) technique, also popularly known as the ChIP-chip assay. The technique is used to determine whether proteins, including transcription factors, bind to particular
regions of the chromatin within living cells. Harbison et al. (2004) determined the global genomic occupancy of 203 transcription factors in yeast, which are all known to bind to DNA in the yeast genome.
Lee et al. (2002) produced a similar yeast dataset for a smaller number of transcription factors. Both
these researchers reported results in the form of a confidence value (statistical P value) of a transcription
factor attaching to the promoter region of a gene. The reason behind using statistical techniques was
to average the errors in microarray technology and account for multiple cell populations. One of the
prominent problems with such approaches is that in order to infer whether a transcription factor attached
to the promoter sequence or not, we have to choose an arbitrary artificial threshold of the P-value.
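The sensitivity to this choice can be illustrated with a short sketch on a synthetic P-value matrix; the two cut-offs below are arbitrary and simply show how strongly the number of inferred TF-gene interactions depends on the threshold.

```python
# Sketch: binarize ChIP-chip confidence values (P-values) at different cut-offs.
import numpy as np

rng = np.random.default_rng(2)
pvals = rng.uniform(0, 0.05, size=(2000, 100))   # synthetic genes x transcription factors

for cutoff in (0.001, 0.005):
    binding = pvals <= cutoff                    # binary binding calls
    print(cutoff, int(binding.sum()), "inferred TF-gene interactions")
```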
Transcription factor binding motifs are sequence patterns observed in the intergenic regions of the
genome, usually located upstream of genes. They are thought to serve as the binding sites that allow transcription factors access to the DNA. Initial approaches to identifying these motifs were based on first clustering genes by co-expression, and then looking for common sequences in the upstream regions of the
genes located in the same cluster. Kellis et al. (2003) used comparative genome analysis between three
related yeast species to find these motifs.
Protein-protein interaction (PPI) data for human and other proteins are available as a result of advances in technologies like co-immunoprecipitation, mass spectrometry and yeast two-hybrid assays. There has been tremendous growth in this type of data in recent years.

Data Integration
Plain Clustering
When microarray data started becoming available in the 1990s, a prime goal was to identify sets of
genes that act together functionally to perform certain cellular tasks such as metabolism or cell-cycle
functions. In this early phase of data analysis, various clustering algorithms, e.g. Eisen et al. (1998),
were applied in order to find such gene modules. An assumption behind this clustering approach was
that co-expression implied co-regulation. In other words, if sets of genes were showing similar patterns
of microarray expression they must be co-regulated and hence belong to the same module. So, co-expression was assumed to imply co-regulation and co-regulation was assumed to imply similar function.
However, neither of these assumptions is always correct. The validity of the resulting clusters could
be tested by identifying common promoter elements on the upstream portion of genes within the same
cluster on the assumption that genes are co-regulated because they have common promoter elements.
Another popular way to show validity was by using gene ontology to show that the majority of genes
belonging to a module were similar in function. In these early works no prior information was used to
guide the process of clustering.
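A minimal sketch of this early clustering approach, in the spirit of Eisen et al. (1998) but with an arbitrary expression matrix and cluster number, is given below.

```python
# Sketch: agglomerative clustering of expression profiles with a correlation-
# based distance, yielding putative co-expression modules.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
expr = rng.standard_normal((500, 20))                  # 500 genes x 20 conditions
dist = pdist(expr, metric="correlation")               # 1 - Pearson correlation
tree = linkage(dist, method="average")
modules = fcluster(tree, t=10, criterion="maxclust")   # cut the tree into 10 clusters
print(np.bincount(modules)[1:])                        # number of genes per cluster
```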

Causal Networks
Naturally the research community wanted to model the causal relationships among various genes in much
more detail, and this precipitated a second phase of modelling in which mostly Bayesian networks and
their variants, such as dynamic Bayesian networks, were applied to model the gene regulatory processes
(Friedman at al., 2000; Husmeier, 2003; Murphy & Mian, 1999; Zou & Conzen, 2005). Friedman et al.
(2000) were the first to utilise Bayesian networks for modelling gene expression data and they tried two
types of local distribution, discrete (multinomial) and continuous (linear Gaussian), to express the relation between dependent genes. They tested the work on the microarray expression data of Spellman et al. (1998). When networks that modelled the data accurately were identified, two pairwise features were computed from them: Markov relations and order relations. The Markov relation simply checks whether each gene of a pair is in the Markov blanket of the other. This would imply a direct causal relationship between them, indicating a biological relation. The order relation checks whether X is an ancestor of Y in all the
networks of an equivalence class. This can be determined directly from the directed graph by checking
whether there is a path from X to Y that is directed towards Y consistently. An order relation implies
that the two genes have a role in some more complex regulatory process. Temporal aspects of data were
incorporated into the model by adding a discrete variable as the root. They suggested that non-linear
local/temporal models should be used for better accuracy. Their analysis of the results shows that the
method is sensitive to the choice of local model and in the case of the multinomial distribution is also
sensitive to the discretization method used. Werhli et al. (2006) carried out a comparative study of the
performance of modelling gene regulatory networks using graphical Gaussian models (GGMs), relevance
networks and Bayesian networks. They used both laboratory data as well as simulated data to evaluate
the different approaches. They observed that on both types of data, Bayesian networks outperformed
both relevance networks and graphical Gaussian models.
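The two pairwise features used by Friedman et al. can be illustrated on a single hypothetical directed acyclic graph, as in the sketch below; in the original work they are assessed over an equivalence class (or a bootstrap collection) of networks, which is omitted here.

```python
# Sketch: the order relation (ancestor test) and a Markov-blanket membership
# test on a small, made-up learned DAG.
import networkx as nx

dag = nx.DiGraph([("TF1", "geneA"), ("TF1", "geneB"), ("geneA", "geneC")])

def order_relation(g, x, y):
    # X is an ancestor of Y if there is a directed path from X to Y.
    return nx.has_path(g, x, y)

def in_markov_blanket(g, x, y):
    # Markov blanket of y: its parents, its children, and the other parents of its children.
    parents = set(g.predecessors(y))
    children = set(g.successors(y))
    co_parents = {p for c in children for p in g.predecessors(c)} - {y}
    return x in (parents | children | co_parents)

print(order_relation(dag, "TF1", "geneC"))     # True: TF1 -> geneA -> geneC
print(in_markov_blanket(dag, "geneA", "TF1"))  # True: geneA is a child of TF1
```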
The major difficulty with this finely tuned modelling approach is that, for such a high-dimensional problem involving many thousands of genes, the amount of experimental data available is never enough for accurate modelling. Moreover, it is very hard to deal with the cyclical feedback nature of gene networks
using Bayesian networks since, without the explicit incorporation of time, they only handle acyclic relationships among the variables. The end result of such models was that the performance was not good
and not many verifiable findings were made (Husmeier, 2003). In order to improve upon the results, work
was done to incorporate better prior knowledge in the Bayesian network based modelling. Imoto et al.
(2003) combined PPI, DNA binding, promoter element motifs as well as literature text mining. Tamada
et al. (2003, 2005) also used similar diverse datasets to build Bayesian network models.

Weakly Supervised Module Algorithms


After these initial frustrations in moving from very naive modelling (plain clustering) to highly detailed
modelling (dynamic Bayesian networks), research began to tread a path somewhere in the middle. This
pragmatic approach did yield very good results and is still the basis of current research. One of the
most complete studies using these types of weakly supervised methods was carried out by Segal et
al. (2003). Their method uses gene expression microarray data and very weak prior knowledge in the
form of the names of genes producing the transcription factors, in order to separate genes into sets that
are co-regulated. It takes as input a gene expression data set and a large precompiled set of candidate
regulatory genes and outputs groups of co-regulated genes (modules), their regulators, and a regulation
program that specifies the behaviour of the modules as a function of the regulators' expression and the conditions under which regulation takes place.
The Module Networks algorithm, used by Segal et al. (2003), takes a list of potential regulators
and microarray expression data as input and uses an iterative procedure that searches for a regulation
program for each module (set of genes) and then reassigns each gene to the module whose program
best predicts its behaviour. It uses an iterative procedure, based on the Expectation Maximization (EM)
method that is initialized with the results of another clustering algorithm. For each cluster of genes it
searches for a regulation program that provides the best prediction of the expression profiles of genes
in the module as a function of the expression of a small number of genes from the regulator set. After
identifying regulation programs for all clusters, the algorithm re-assigns each gene to the cluster whose
program best predicts its behaviour. It iterates until convergence, refining both the regulation program
and the gene partition in each iteration.
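A schematic and much simplified sketch of this iteration is shown below; it is not the implementation of Segal et al., and a shallow regression tree fitted to synthetic data merely stands in for the learned regulation program.

```python
# Sketch: alternate between (i) fitting a simple "regulation program" per module
# from regulator expression and (ii) reassigning genes to best-predicted modules.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
expr = rng.standard_normal((300, 40))              # 300 genes x 40 conditions (synthetic)
regulators = expr[:20]                             # pretend the first 20 genes are candidate regulators
modules = rng.integers(0, 5, size=expr.shape[0])   # initial (here: random) gene partition

for _ in range(10):                                # iterate towards convergence
    programs = {m: DecisionTreeRegressor(max_depth=3)
                   .fit(regulators.T, expr[modules == m].mean(axis=0))
                for m in np.unique(modules)}
    labels = sorted(programs)
    errors = np.stack([((expr - programs[m].predict(regulators.T)) ** 2).sum(axis=1)
                       for m in labels], axis=1)
    new_modules = np.array(labels)[errors.argmin(axis=1)]   # reassign each gene
    if np.array_equal(new_modules, modules):
        break
    modules = new_modules

print(np.bincount(modules))                        # module sizes after refinement
```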
In their experiments they compiled a set of regulators from the Saccharomyces Genome Database
(SGD) and the Yeast Proteome Database (YPD) based on annotations that broadly suggest that certain
genes have a regulatory role, as either a transcription factor or a signalling protein. They also identified
more potential regulators by finding genes similar to those above but removing the global regulators
from the list. Microarray data for gene expression for yeast was collected from the Stanford microarray
database. They chose a subset that had significant gene expression change and removed from this set
the cluster known to be generic environmental response genes. Finally, they added all the genes from
the regulator list above. With these two datasets (expression and regulators), they use a module network
learning algorithm (Segal, Pe'er et al., 2005) to find separate sets of regulators and the regulated modules. They obtained modules that showed significant similarity in promoter element motifs as well as
annotations in the gene ontology compiled by the Gene Ontology Consortium (2001).
Ihmels et al. (2002) proposed an algorithm called Signature, which performs bi-clustering, that is to say, clustering genes and conditions together based on expression data. It is unlike later bi-clustering algorithms in that it does not simultaneously generate data partitions but works in steps. The
input to the algorithm is a set of genes and, in the first step, experimental conditions under which these
genes change their expression above a threshold are chosen. In the second stage, all genes that have
changed expression significantly under these conditions are selected. They evaluate the consistency of
their clustering algorithm by analyzing the recurrence of the output gene sets in their resulting modules
when the input is mixed with irrelevant genes. The idea is that the results of any good algorithm should
not deviate too much when slight perturbations are introduced in the data. A module is considered to
be reliable if it is obtained from several distinct slightly perturbed input gene sets. Since it carries out a
refinement of clusters in two stages, there can be no guarantee that the results would be clustered in an
optimal manner. A better formulation might be to use the EM (expectation maximization) algorithm in
order to maximize their objective function.
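The two steps can be sketched as follows on synthetic data with a planted co-expression signal; the thresholds are arbitrary and do not correspond to those used by Ihmels et al.

```python
# Sketch of the Signature idea: from an input gene set, pick the conditions in
# which those genes respond, then pick all genes responding in those conditions.
import numpy as np

rng = np.random.default_rng(5)
expr = rng.standard_normal((1000, 60))       # genes x conditions (log ratios)
expr[:50, :10] += 1.5                        # plant a shared response for the input genes
input_genes = np.arange(50)                  # hypothetical starting gene set

# Step 1: conditions in which the input genes change expression above a threshold.
condition_score = expr[input_genes].mean(axis=0)
conditions = np.where(np.abs(condition_score) > 0.75)[0]

# Step 2: genes that change significantly under the selected conditions.
gene_score = expr[:, conditions].mean(axis=1)
module_genes = np.where(np.abs(gene_score) > 0.75)[0]
print(len(conditions), "conditions and", len(module_genes), "genes in the module")
```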
Despite its success in moving one step ahead from plain clustering algorithms, one of the biggest
shortcomings of this research was that the level of biological prior knowledge used was insignificant: only the names of transcription factors, or the conditions under which experiments were carried out, were
employed. At about this time more significant prior knowledge started becoming available in the form
of ChIP-chip DNA binding data and other sources as described in an earlier section. The next step of
research focused on ways of integrating these datasets in order to find gene modules.

Strongly Supervised Module Algorithms


Bar-Joseph et al. (2003) describe an algorithm for discovering regulatory modules. Their algorithm is
called GRAM (Genetic Regulatory Modules), and combines microarray expression data with DNA-binding data. This was one of the first papers to have combined these two sources in order to achieve better
clusters. DNA-binding data provides direct physical evidence of regulation and thus offers an improvement on previous work where only indirect evidence of interaction, for example promoter sequences,
was used for prior information. The GRAM algorithm begins by performing an exhaustive search over
all possible combinations of transcription factors indicated by the DNA-binding dataset using certain
(strict) threshold P-values. This yields sets of genes that are regulated by sets of transcription factors.
This gene list is filtered by studying their expression patterns to find genes that show co-expression.
These act as seeds for gene modules. The next pass revisits transcription factors and expands the seed
modules by adding genes with a relaxed P-value criterion that show co-expression. GRAM allows a
gene to be part of more than one module. They identified 106 modules with 655 distinct genes regulated by 68 transcription factors. Within a module the role of each transcription factor was identified as
activator or repressor by analysing the correlation between the transcription factor's expression and the expression of the regulated genes. Validation was done by analyzing the promoter sequences of genes in the same
cluster using the TRANSFAC database to identify common sequences.
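The two-pass idea can be sketched as follows for a single transcription-factor combination; the P-values, expression matrix and thresholds are synthetic, and the exhaustive search over factor combinations performed by GRAM is omitted.

```python
# Sketch: seed a module with strictly bound genes, then expand it with genes
# that pass a relaxed binding threshold and are co-expressed with the seed.
import numpy as np

rng = np.random.default_rng(6)
n_genes = 500
pvals = rng.uniform(0, 0.1, size=n_genes)    # binding P-values for one TF combination
expr = rng.standard_normal((n_genes, 30))    # expression profiles

seed = np.where(pvals <= 0.001)[0]           # strict threshold -> seed genes
core_profile = expr[seed].mean(axis=0)       # mean expression of the seed module

corr = np.array([np.corrcoef(expr[g], core_profile)[0, 1] for g in range(n_genes)])
expanded = np.where((pvals <= 0.01) & (corr > 0.7))[0]   # relaxed threshold + co-expression
module = np.union1d(seed, expanded)
print(len(seed), "seed genes expanded to", len(module), "module genes")
```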
Tanay et al. (2004) analysed several diverse datasets in an attempt to reveal the modular organization
of the yeast regulation system. They defined modules as groups of genes with statistically significant
correlated behaviour across the diverse datasets. Their algorithm is called SAMBA (Statistical-Algorithmic Method for Bicluster Analysis) and is an extensible framework that can be easily updated as
new datasets become available. In their analysis they have integrated expression, PPI and DNA-binding
datasets. In SAMBA, all genomic information is modelled as weighted bi-partite graphs. Nodes on one
side of the graph represent genes while the other side represents properties of genes, for example proteins
encoded by them. Edges between property nodes and gene nodes are assigned weights. A module is
a subgraph of this bi-partite graph, and a high-quality module is defined as a 'heavy' subgraph in the
weighted bi-partite graph. The key point is that all sources of data are considered as properties of genes
or proteins encoded by genes and there is one unified representation of all data as a bi-partite graph.
Since their algorithm is based on combinatorial principles rather than graph theoretic (spectral) methods
there are no guarantees of a globally optimum partitioning. For evaluation, they found the biological
significance of resulting clusters by calculating the enrichment score of all gene ontology (GO) terms
associated with the genes of a module and later annotated the modules with the highest valued terms,
that is to say those terms that are shared by the highest number of genes. They also analyzed 600 base
pairs in the upstream promoter region of the genes in a module for common motif enrichment. For each
potential motif they calculated the enrichment score among all the genes of the module. The positive
aspect of their approach is that it utilises all sources of information in one uniform representation and
only requires a measure of similarity of genes across a subset of properties. It also allows overlapping
modules (with common genes), which is not a feature of traditional clustering algorithms. One of the
limitations of their approach is that all sources of data are assigned equal weights and it isn't possible
to weigh them separately according to reliability or importance.
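The unified representation can be sketched as a small weighted bipartite graph; the properties, weights and module score below are illustrative simplifications of the statistical weighting used by SAMBA.

```python
# Sketch: genes on one side, properties (condition responses, TF binding,
# protein interactions) on the other; a module is a heavy bipartite subgraph.
import networkx as nx

g = nx.Graph()
g.add_weighted_edges_from([
    ("geneA", "induced_in_heat_shock", 1.2),
    ("geneA", "bound_by_TF1", 2.0),
    ("geneB", "bound_by_TF1", 1.5),
    ("geneB", "interacts_with_proteinX", 0.8),
    ("geneC", "induced_in_heat_shock", 1.0),
])

def module_weight(graph, genes, properties):
    # Total edge weight between the chosen genes and the chosen properties.
    return sum(d["weight"] for u, v, d in graph.edges(data=True)
               if (u in genes and v in properties) or (v in genes and u in properties))

print(module_weight(g, {"geneA", "geneB"}, {"bound_by_TF1"}))   # 3.5
```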
In a later piece of work Tanay et al. (2005) extended the work described above by investigating
the SAMBA algorithm in more detail. They analysed more diverse datasets and focused more on the
biological significance of the results, explaining them much more fully. The paper mainly describes
a study of fresh data in the context of an extensive compendium of existing datasets using SAMBA.
They proposed that future work should be carried out on integration across species on the basis that
transcription modules are highly conserved among species.
The work of Lemmens et al. (2006) is similar to other module discovery algorithms in that they
propose a very simple and intuitive algorithm to find co-regulated sets of genes that have similar expression profiles, the same binding transcription factors and a commonality of motifs. The principal
difference from other algorithms is that where others used motif information to validate their results,
they have used it in order to find the modules themselves. Their algorithm, known as ReMoDiscovery, works
in two passes. In the first pass, known as the seed discovery step, tightly co-expressed genes having
a minimum number of common transcription factors and a minimum number of common conserved
motifs are put together in separate modules known as seed modules. In the second pass, known as the
seed extension step, the size of the modules is increased by computing the mean of the module's gene
expression and ranking the remainder of the genes in the dataset in order of their decreasing correlation with the mean profile. They compared their algorithm results with SAMBA and GRAM (discussed
separately) and reported their findings. All parameters, such as the cut-offs for the various datasets, have been
chosen without much justification, and the basic idea seems very similar to the work of Bar-Joseph et al.
(2003). Some of the comparison metrics used do not seem very sound, for example average functional
enrichment values have been calculated for the modules without normalizing to account for the size of
the modules. Similarly, summary statistics like minimum and maximum number of genes in modules
do not provide relevant information for comparison of algorithms.
Huang and Pan (2006) investigated a traditional clustering method known as K-medoids which is a
robust version of the K-means clustering method. Unlike K-means, which uses the mean of all genes in a
cluster as its centre, K-medoids uses the most central gene. It is found by locating the one with minimum
average dissimilarity to the rest of the genes. They incorporated prior knowledge by modifying the
distance metric used while clustering. They have used microarray expression data for clustering while
biological knowledge about the known similarity between pairs of genes is derived from gene ontology. Previous approaches to including biological knowledge in distance based clustering methods have
included gene ontology and metabolic pathways to estimate distance, or similarity, measures among
gene pairs and then used these along with microarray expression based distance metrics to create an
average distance, which is later used to cluster expression data. The authors used a shrinkage approach
for the distance metric to shrink it towards zero in cases where there is strong evidence that two genes
are functionally related. Their algorithm has two steps in which the first step uses the shrunk distance
metric to cluster genes whose functionality is known from gene ontology. The second step clusters the
remaining genes. In the second step clustered genes are assigned to either one of the step one clusters
or to a step two cluster, depending on their distance from the medoids. The shrinkage parameter is
chosen using cross validation. They evaluated their algorithm using both simulated as well as real data.
In a later piece of work Pan (2006) used known functions of genes from existing biological research
to assign different prior probabilities for a gene to belong to a cluster. He developed an Expectation
Maximization algorithm for this stratified mixture model.
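The shrinkage of the distance metric can be sketched as follows; the GO-derived relatedness matrix and the shrinkage parameter below are hypothetical, whereas Huang and Pan choose the parameter by cross-validation.

```python
# Sketch: shrink the expression-based distance between two genes towards zero
# when prior (e.g. GO-derived) evidence says they are functionally related.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(7)
expr = rng.standard_normal((100, 20))
d_expr = squareform(pdist(expr, metric="euclidean"))   # genes x genes distances

related = np.zeros_like(d_expr, dtype=bool)
related[:10, :10] = True                    # pretend the first 10 genes share a GO annotation

lam = 0.6                                   # shrinkage parameter (0 = no shrinkage)
d_shrunk = np.where(related, (1 - lam) * d_expr, d_expr)
np.fill_diagonal(d_shrunk, 0.0)
# d_shrunk can now be passed to any distance-based clustering method (e.g. K-medoids).
print(d_shrunk[0, 1], "vs unshrunk", d_expr[0, 1])
```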
The research described above concerns the evaluation of individual techniques to integrate data
from multiple sources. Some researchers have also focussed on creating generic frameworks for data
integration. Troyanskaya et al. (2003) developed a meta framework for integration of diverse sources
of data. We call it 'meta' because it doesn't directly integrate the datasets but uses results from other techniques, such as clustering algorithms, and combines them with other evidence. Their proposed framework is known as MAGIC (Multisource Association of Genes by Integration of Clusters) and is based
on a Bayesian network whose conditional probability tables have been built with the advice of yeast
genetic experts. Given a pair of genes, it outputs the probability that they are functionally related after
weighing the evidence from the various sources. Evaluation of the predictions from the system is done
using gene ontology data.
Most of the techniques that we have described work well for real (numerical) data but are less effective
when dealing with string data, for example gene sequences, or graph data such as protein interactions.
In many cases ad-hoc techniques have been deployed. In an approach to this problem, Lanckriet et al.
(2004) have proposed a framework where such diverse data could be merged in a principled manner. It
is based on kernel methods in which algorithms work on kernel matrices that are derived from pairwise
similarity among variables using so-called kernel functions (Shawe-Taylor & Cristianini, 2004). If a
valid kernel function can be defined to encode the similarity between two variables, then the methods
are applicable regardless of the type of data (strings, vectors, or graphs) being used.
This framework will provide a means to integrate more diverse types of data as and when they become
available in the future. The original paper proposed the framework only for supervised learning but
extensions to unsupervised learning are possible.
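A minimal sketch of this kind of fusion is given below: each data type yields its own gene-by-gene kernel matrix and the fused kernel is their weighted sum. Lanckriet et al. learn the weights by convex optimization; here they are simply fixed, and the data are synthetic.

```python
# Sketch: build one kernel per data type and fuse them as a weighted sum,
# which is again a valid kernel usable by any kernel-based algorithm.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.default_rng(8)
expr = rng.standard_normal((200, 30))                  # expression profiles
binding = (rng.uniform(size=(200, 50)) < 0.05) * 1.0   # binary TF-binding profiles

k_expr = rbf_kernel(expr, gamma=0.05)                  # similarity from expression
k_bind = linear_kernel(binding)                        # similarity from binding data

weights = (0.7, 0.3)                                   # fixed, illustrative kernel weights
k_fused = weights[0] * k_expr + weights[1] * k_bind
print(k_fused.shape)                                   # (200, 200)
```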

Future Trends
One of the biggest challenges in understanding transcriptional regulation is that the whole process is
regulated at multiple points from transcription to actual protein synthesis and it is known that transcription activity (m-RNA concentration) is not a perfect indicator of protein concentration (Griffin
et al., 2002), as there are many downstream factors (m-RNA stability, protein degradation, post-translational modifications, etc.) that affect the process. Since it is still not possible to obtain protein
concentration data for all the available m-RNA data, we must keep this severe limitation in mind when
drawing conclusions from models where we assume that m-RNA expression can be used as a surrogate
for protein activity level.
Another big challenge that inhibits precise modelling of the process is the lack of available data about the 3D structure of chromatin (DNA). Apart from the promoter sequence, the 3D structure of chromatin decides whether a transcription factor is allowed access to a certain position or not. Sometimes
a transcription factor itself facilitates changes in the chromatin structure that allows it access to the
promoter sequence.
Based on the results so far we are far away from a fully comprehensive model of regulation in even
simpler organisms like yeast. Higher organisms pose other challenges because of cell and tissue heterogeneity. Apart from this, multi-cellular organisms are a big challenge as it is very difficult to segregate
the expression of one cell from its neighbouring ones. Most genomic techniques measure average signal
in a sample from a cell population. When analysing a heterogeneous tissue, this is a big concern as
individual signals from different cell types are obfuscated. Moreover, the averaging effect introduces
an additional source of noise as the proportions of different cells are different across samples.
Interpretation of results is very hard because, even though gene ontology databases have contributed
significantly to the creation of a common language to describe properties, we do not have annotations
for all genes and gene products. Without high quality annotations, the best algorithms are rendered
useless as we can never know how accurate they are.
Future research in the area of integration will continue as more data of different types becomes
available. The focus will likely shift towards integration of data from multiple cell types, conditions
and even organisms. Apart from integration techniques, future research is likely to move towards better
validation of the various techniques and the creation of gold standards against which results can be assessed. Another growing area of research is based on more detailed modelling using the reaction kinetics of gene products. This could help in understanding not only qualitative models of regulation but also
detailed quantitative ones.
Most of the research work that we have discussed so far has been validated using data involving
yeast. Simple unicellular organisms have the advantage that the sample of cells used in an experiment
is homogenous. Each cell is assumed to be performing the same regulatory actions. Now that some
understanding of the regulatory mechanisms in simple organisms has been gained, the research focus
is shifting towards the parts of the human genome specifically related to cancer. Human tumour expression data is slowly growing in size and despite all the challenges, positive results have been obtained by
researchers while studying both individual cancers and, with an integrative approach, simultaneously
studying a large cancer compendium of multiple datasets. Segal, Friedman et al. (2004) used microarray data from various types of cancer related experiments to create a cancer module map. In this map
they used the modules to characterize various tumour stages and types based on whether modules
were activated or suppressed. This research highlighted the value of integration as well as the module
level view for analyzing complex medical conditions. More computational approaches are required to
combine experimental data involving cancer in different animals.

Conclusion
Orphanides and Reinberg (2002) argue very explicitly that there is no single model of regulation and
each cell process has evolved its own detailed regulation model. Moreover, we usually observe only a
few snapshots of these processes, which makes it very hard to reconstruct the underlying mechanisms.
The data that is integrated comes from various laboratories where experiments are done under different
conditions and with different platforms. We must be very careful when integrating such data: it should be checked beforehand whether the datasets show similar trends. Otherwise, instead of complementing
each other, various datasets would only add to the noise and obfuscate the meaningful patterns (Mishra
& Gillies, 2007). The above conditions are some of the reasons why, in most of this research, only an insignificant amount of overlap has been observed in the results (Dolinski & Botstein, 2005).
Despite all the challenges, high-throughput technologies have changed the research focus from
studying a handful of genes to studying interactions at the whole genome level. Data integration seems
to be the only approach which can help us understand the underlying processes. We have only begun to
understand regulation quantitatively and have a long way to go before we can construct fully detailed
regulatory network models.

References
Bar-Joseph, Z., Gerber, G. K., Lee, T. I., Rinaldi, N. J., Yoo, J. Y., Robert, F., et al. (2003). Computational
discovery of gene modules and regulatory networks. Nature Biotechnology, 21 (11), 1337-1342.

Gene Ontology Consortium (2001, August). Creating the gene ontology resource: Design and implementation. Genome Research, 11 (8), 1425-1433.
Dolinski, K., & Botstein, D. (2005, December). Changing perspectives in yeast research nearly a decade
after the genome sequence. Genome Research, 15(12), 1611-1619.
Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998, December). Cluster analysis and
display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA,
95(25), 14863-14868.
Friedman, N., Linial, M., Nachman, I., & Pe'er, D. (2000, August). Using Bayesian networks to analyze
expression data. Journal of Computational Biology, 7(3), 601-620.
Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., et al. (2000, December). Genomic expression programs in the response of yeast cells to environmental changes. Molecular
Biology of the Cell, 11 (12), 4241-4257.
Griffin, T. J., Gygi, S. P., Ideker, T., Rist, B., Eng, J., Hood, L., et al. (2002). Complementary profiling of gene expression at the transcriptome and proteome levels in Saccharomyces cerevisiae. Molecular &
Cellular Proteomics, 1(4), 323-333.
Harbison, C. T., Gordon, B. D., Lee, T. I., Rinaldi, N. J., Macisaac, K. D., Danford, T. W., et al. (2004).
Transcriptional regulatory code of a eukaryotic genome. Nature, 431(7004), 99-104.
Huang, D., & Pan, W. (2006). Incorporating biological knowledge into distance-based clustering analysis
of microarray gene expression data. Bioinformatics, 22(10), 1259-1268.
Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., et al. (2000,
July). Functional discovery via a compendium of expression profiles. Cell, 102, 109-126.
Husmeier, D. (2003). Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19(17), 2271-2282.
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y., & Barkai, N. (2002). Revealing modular
organization in the yeast transcriptional network. Nature Genetics, 31, 370-377.
Imoto, S., Higuchi, T., Goto, T., Tashiro, K., Kuhara, S., & Miyano, S. (2003). Combining microarrays
and biological knowledge for estimating gene networks via Bayesian networks. In Proceedings - 2nd
computational systems bioinformatics (pp. 104-113).
Kellis, M., Patterson, N., Endrizzi, M., Birren, B., & Lander, E. S. (2003, May). Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423(6937), 241-254.
Lanckriet, G. R., De Bie, T., Cristianini, N., Jordan, M. I., & Noble, W. S. (2004, November). A statistical framework for genomic data fusion. Bioinformatics, 20(16), 2626-2635.
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., et al. (2002, October).
Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298 (5594), 799-804.
Lemmens, K., Dhollander, T., De Bie, T., Monsieurs, P., Engelen, K., Smets, B., et al. (2006). Inferring
transcriptional modules from chip-chip, motif and microarray data. Genome Biology, 7(5).

Mishra, A., & Gillies, D. (2007). Effect of microarray data heterogeneity on regulatory gene module
discovery. BMC Systems Biology, 1(Suppl 1), S2.
Murphy, K., & Mian, S. (1999). Modelling gene expression data using dynamic bayesian networks.
Tech. rep., MIT Artificial Intelligence Laboratory.
Orphanides, G., & Reinberg, D. (2002, February). A unified theory of gene expression. Cell, 108(4),
439-451.
Pan, W. (2006, April). Incorporating gene functions as priors in model-based clustering of microarray
gene expression data. Bioinformatics, 22(7), 795-801.
Quackenbush, J. (2006). Computational approaches to analysis of DNA microarray data. Methods of
Information in Medicine, 45(Suppl 1), 91-103.
Schulze, A., & Downward, J. (2001, August). Navigating gene expression using microarrays: A technology review. Nat Cell Biol, 3(8).
Schölkopf, B., Tsuda, K., & Vert, J.-P. (Eds.). (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.
Segal, E., Friedman, N., Koller, D., & Regev, A. (2004, Oct). A module map showing conditional activity
of expression modules in cancer. Nature Genetics, 36 (10), 1090-8.
Segal, E., Friedman, N., Kaminski, N., Regev, A., & Koller, D. (2005, Jun). From signatures to models:
Understanding cancer using microarrays. Nature Genetics, 37 Suppl, S38-45.
Segal, E., Pe'er, D., Regev, A., Koller, D., & Friedman, N. (2005). Learning module networks. Journal
of Machine Learning Research, 6(Apr), 557-588.
Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., et al. (2003). Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics, 34(2), 166-176.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University
Press.
Smyth, G. K., Yang, Y., & Speed, T. P. (2003). Statistical issues in cDNA microarray data analysis.
Methods in Molecular Biology, 111-136.
Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., et al. (1998, Dec). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray
hybridization. Molecular Biology of the Cell, 9(12), 3273-97.
Tamada, Y., Bannai, H., Imoto, S., Katayama, T., Kanehisa, M., & Miyano, S. (2005, Dec). Utilizing
evolutionary information and gene expression data for estimating gene networks with Bayesian network
models. Journal of Bioinformatics and Computational Biology, 3(6), 1295-313.
Tamada, Y., Kim, S., Bannai, H., Imoto, S., Tashiro, K., Kuhara, S., et al. (2003). Estimating gene
networks from gene expression data by combining Bayesian network model with promoter element
detection. Bioinformatics, 19(90002), 227-236.

Tanay, A., Sharan, R., Kupiec, M., & Shamir, R. (2004). Revealing modularity and organization in the
yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proceedings
of the National Academy of Sciences U S A, 101(9), 2981-2986.
Tanay, A., Steinfeld, I., Kupiec, M., & Shamir, R. (2005, March). Integrative analysis of genome-wide
experiments in the context of a large high-throughput data compendium. Molecular Systems Biology,
1(1), msb4100005-E1msb4100005-E10.
Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B., & Botstein, D. (2003). A Bayesian
framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces
cerevisiae). Proceedings of the National Academy of Sciences U S A, 100(14), 8348-8353.
Werhli, A. V., Grzegorczyk, M., & Husmeier, D. (2006, October). Comparative evaluation of reverse
engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics, 22(20), 2523-2531.

Key Terms
Bayesian Network, or belief network, is a probabilistic graphical model that represents a set of variables and their probabilistic dependencies. For example, a Bayesian network can be used to calculate
the probability of a disease given the expression levels of certain genes. Expert knowledge is required
in order to specify the structure and probabilistic dependencies among variables (genes and disease).
Chromatin Immunoprecipitation, also popularly known as ChIP, is an experimental method to
determine whether proteins (e.g. transcription factors) bind to certain regions of the chromatin in living cells. When used with
microarrays, the technique is known as ChIP-chip, and is used to identify the binding of proteins on
the entire genome simultaneously.
Clustering is the process of organizing objects into groupings (clusters) where members of one
group are similar to each other but dissimilar to the objects belonging to other groups. In the field of
machine learning it falls under the category of unsupervised learning, as structure must be found
in unlabelled data.
Gene Ontology, also commonly referred to as GO, provides a controlled vocabulary (ontology) to
describe gene and gene product attributes in various organisms. It has three sub parts that describe gene
products in terms of their associated biological processes, cellular components and molecular functions
in a species-independent manner. It was developed to address the need for consistent descriptions of
gene products in different databases (from different or the same organisms).
K-Means Clustering is an algorithm to group (cluster) objects based on certain attributes into a
pre-determined number (K) of groups or clusters. The grouping is done by minimizing the sum of
squares of distances between individual data points and the corresponding cluster centre, which is calculated by averaging all the data within the cluster. It is an iterative procedure that refines the groupings over multiple steps, each of which improves the cluster quality.

Microarray, also known as a gene chip, DNA chip, or gene array, is a glass slide on which there is a grid pattern of small spots, each of which will react with a single gene. Microarrays are commonly
used for measuring expression levels of thousands of genes simultaneously, a technique called expression
profiling. For example, microarrays can be used to identify disease genes by comparing gene expression
in diseased and normal cells.
Protein-Protein Interaction describes the interaction between different protein molecules which
are of central importance for virtually every process in a living cell. Since proteins are gene products,
these interactions when studied along with gene expression data, provide a better understanding of the
underlying processes.

Chapter XXXI

Discrete Networks as a Suitable Approach for the Analysis of Genetic Regulation
Elizabeth Santiago-Cortés
Universidad Nacional Autónoma de México, Mexico
Luis Mendoza
Universidad Nacional Autónoma de México, Mexico

ABSTRACT
Biological systems are composed of multiple interacting elements; in particular, genetic regulatory networks are formed by genes and their interactions mediated by transcription factors. The establishment
of such networks is critical to guarantee the reliability of transcriptional performance in any organism.
The study of genetic regulatory networks as dynamical systems is a helpful methodology to understand
the transcriptional behavior of the genome. From a number of theoretical studies, it is known that
networks present a complex dynamical behavior that includes stability, redundancy, homeostasis, and
multistationarity. In this chapter we present some particular biological processes modeled as discrete
networks to show that the theoretical properties of networks have a clear biological interpretation.

INTRODUCTION
Development of multicellular organisms requires the coordinated accomplishment of many molecular and
cellular processes, like division and differentiation. Regulation of those processes must be very reliable,
capable of resisting fluctuations of the internal and external environments. Without such homeostatic
capacity, the viability of the organism would be compromised. For instance, unrestrained division of
some cells may lead to the appearance of tumors, which may possibly cause death. Cellular processes
are finely controlled by a number of regulatory molecules, among them transcription factors. These are
present inside cells at low quantities, and variations in their concentrations might alter cellular fate.
Modern high-throughput techniques have greatly increased the rate at which genomes are sequenced
and genes are identified. Nonetheless, classic biochemical and physiological studies are necessary to
identify the functions and molecular targets of the coded proteins. Of interest for this chapter are those
genes that code for transcription factors. These proteins bind to cis-regulatory sequences of other genes,
controlling or somehow modifying the transcriptional rate of their targets. If these targets code for other
transcription factors, then an interdependence is created among genes, forming a genetic regulatory network (Kauffman, 1991). The existence of regulatory networks results in the controlled and
coordinated expression of a large group of genes. While these ideas are commonly accepted, biologists
are not usually aware of the global properties of these networks. The reason is that they have some
properties that are not evident or intuitive.
Modeling regulatory networks is very useful to understand how different gene expression patterns
arise and are maintained. All cells in an organism have the same genes, and therefore the same global
genetic regulatory network. Yet, each cell type differs from the others in its particular molecular profile, i.e. in its pattern of transcriptionally active genes and the presence of other molecular markers. In addition, such genetic activation patterns are stable: in a normal situation cells do not differentiate continually from one type into another. These characteristics are due to the global properties of the underlying genetic regulatory networks (Kauffman, 1993; Thomas et al., 1995).
It is a common practice to graphically represent transcriptional regulatory interactions using graphs,
since they are intuitive and easy to understand. However, the knowledge of the connectivity is not
enough to determine the behavior of a regulatory network. For example, it is not possible to know how
many steady states of genetic activation are allowed by a particular network, nor whether those steady
states are stable or not. To know these properties, it is necessary to incorporate the transcriptional rate
of each gene as a function of its regulators. By doing this, a genetic regulatory network is translated
into a dynamical system.
There is a large number of methodologies to analyze regulatory networks as dynamical systems (de
Jong, 2002). Most modelers prefer to represent the dynamical system in the form of a set of ordinary
differential equations that describe the transcriptional rate of genes. However, for most biological systems
there is a lack of quantitative experimental information to fit the whole set of parameters in the system
of equations. In contrast, there is a wealth of published experimental results that include qualitative
information regarding the spatio-temporal activation of genes. Hence, some modelers have opted to
model genetic regulatory networks as discrete dynamical systems.
It might appear at first sight that modeling using discrete variables is somehow inferior to the use of
continuous variables, but it has been shown that continuous and discrete models share many qualitative
dynamic features (Bagley and Glass, 1996; Glass, 1975; Glass and Kauffman, 1973; Muraille et al.,
1996; Mendoza and Xenarios, 2006). In this chapter we present some properties of discrete networks,
as well as some biological examples of regulatory genetic networks modeled as discrete state dynamical
systems. These topics will show the reader that many important aspects of regulatory networks can be
appropriately studied with the use of discrete dynamical systems.


DISCRETE NETWORKS
A network is a system formed by nodes and interactions among them (Figure 1a). These nodes have
states of activation, which are determined by the states of other nodes (Figure 1b). In a genetic regulatory network, nodes represent genes, the states of activation represent the transcriptional activity of
genes, and the interactions among nodes are the transcriptional regulatory relationships among genes.
Discrete networks are dynamical systems that use discrete variables not only to represent the state
of activation of the participating elements, but also to describe time.
In discrete networks events occur at specific time intervals, rather than continuously. Under this
particular assumption, the state of the whole system at a particular time step is a function of its own
state at the preceding time step. Mathematically speaking, the supposition is that the dynamical behavior of the system is governed by an equation of the form S_{t+1} = f(S_t), where S represents the state of the
network. Notice that this is an autonomous deterministic system, meaning that the network does not
have any input from the outside, and that the network state at time t determines a unique successor at
time t+1. S_t is a vector of the form (x_1, x_2, x_3, ..., x_n), which contains the states of activation of all the nodes in
the network, from x_1 to x_n. For example, the state of a binary network is represented by a vector containing binary variables like (0,1,1,0,1), although often a shorthand without parentheses or commas is used
instead, i.e. 01101.
The dynamical behavior of the network, starting from a given time point, is expressed by a succession of network states, namely S_0, S_1, S_2, ..., S_n. It is important to note that the state space is finite; if there
are n nodes in the network, and there are m possible states in a node, then the total number of possible
network states is m^n. Now, the deterministic nature of the system and the finite size of the state space
imply that as time advances a network must eventually repeat a previously visited state, thus cycling
repeatedly around a number of recurrent states.

Figure 1. a) A network has nodes and interactions among them. b) Each node in the network has a state
of activation, which is determined by the states of the nodes interacting with it. c) The discrete network
is a dynamical system that has attractors; in this example there are two: a fixed point (000) and a period-2 cycle (010→101).

The set of states that are cyclically repeated in a trajectory constitutes an attractor (Figure 1c). Attractors can be characterized by their period: if the same network
state is repeated after m time steps, then it is a period-m attractor. Period-1 attractors are also known as
fixed-point attractors. The collection of all network states leading to, or belonging to, a particular attractor is called its basin of attraction. While modeling a particular biological system, the task is to find a
function f such that S_{t+1} = f(S_t) has attractors that qualitatively describe the real experimental system as
closely as possible, under as many conditions as possible.
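
To make these definitions concrete, the following minimal Python sketch uses an invented three-node update rule (not one of the published models discussed below), enumerates all 2^3 network states, iterates S_{t+1} = f(S_t), and reports the attractors together with the sizes of their basins of attraction:

```python
from itertools import product

def f(state):
    """Hypothetical update rule for a three-node Boolean network; the rules
    are invented for illustration and do not come from any published model."""
    x1, x2, x3 = state
    return (x2,                  # node 1 copies node 2
            int(x1 and not x3),  # node 2: activated by node 1, repressed by node 3
            int(x1 or x2))       # node 3: activated by either node 1 or node 2

def attractor_from(state):
    """Iterate S_{t+1} = f(S_t) until a state repeats; the recurrent part of
    the trajectory is the attractor reached from this initial state."""
    trajectory = []
    while state not in trajectory:
        trajectory.append(state)
        state = f(state)
    return tuple(trajectory[trajectory.index(state):])

basins = {}
for state in product((0, 1), repeat=3):          # all 2^3 = 8 network states
    cycle = attractor_from(state)
    # rotate the cycle to a canonical form so equal attractors share one key
    key = min(cycle[i:] + cycle[:i] for i in range(len(cycle)))
    basins.setdefault(key, []).append(state)

for attractor, basin in basins.items():
    kind = "fixed point" if len(attractor) == 1 else "period-%d cycle" % len(attractor)
    print(kind, attractor, "basin size:", len(basin))
```

Every one of the 2^n states falls in exactly one basin, so the basin sizes printed by the sketch always sum to 8; this kind of exhaustive enumeration is what makes discrete models of moderate size fully analyzable.
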
Attractors confer stability to networks, allowing the possibility of resisting some perturbations. To
illustrate the property, take the following trajectory of a binary network: 1110, 1101, 0000, 1111. Suppose that the four states form the whole basin of attraction of the fixed point 1111. Once the network
reaches the attractor, the network will remain at 1111 for as long as no perturbations exist. Now suppose that an external stimulus changes the activation state of the fourth node from 1 to 0, modifying
the state to 1110. It turns out that such a state is part of the basin of attraction of 1111; thus, after three
time steps the system returns to the original fixed point: the perturbation was absorbed by the system.
Notice that a similar situation holds if the third node is perturbed. Even more, if all the elements are
perturbed at once, turning the network state to 0000, it takes only one time step for the network to return
to the original fixed point. Therefore, the existence of basins of attraction guarantees that the effect of
a number of perturbations will die out after a transient response. Of course, not all perturbations have
the same effect. In the previous example an alteration of the attractor on the first node results in 0111,
which is outside the depicted basin of attraction. Such a new activation state necessarily lies in another
basin of attraction, and the network will follow a trajectory ending in an attractor different from 1111.
Therefore, some perturbations may result in a change in the final stable state attained by the network.
In any case, the existence of attractors gives networks the ability to resist some perturbations,
which is a fundamental quality of biological systems.
Networks usually contain feedback loops, or circuits. Their presence is necessary to ensure multistationarity and homeostasis (Thieffry et al., 1995; Thomas, 1978; Plahte et al., 1995; Gouzé, 1998; Snoussi,
1998), which are particularly important properties of biological systems. A feedback loop can be either
positive or negative. If the number of negative interactions in the loop is zero or even, then the circuit
is positive. Conversely, if the circuit has an odd number of negative interactions, then the loop itself is
negative. Negative feedback loops generate homeostasis, in the form of damped or sustained oscillations, while positive feedback loops generate multiple alternative steady states. These are important
characteristics of biological systems. The role of homeostasis in maintaining the internal environment of an
organism is well known, and dates back to the work of Cannon (1929), while the interpretation of
multiple stable steady states as the basis for cellular differentiation goes back to Delbrück (1949).
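
The sign rule just stated is straightforward to compute: a loop is positive when it contains an even number (including zero) of inhibitory interactions, and negative otherwise. A minimal sketch, with a hypothetical three-gene loop as the example:

```python
def loop_sign(interaction_signs):
    """Sign of a feedback loop given the signs (+1 activation, -1 inhibition)
    of its interactions: positive iff the number of negative links is even."""
    negatives = sum(1 for s in interaction_signs if s < 0)
    return "positive" if negatives % 2 == 0 else "negative"

# Hypothetical three-gene loop: A activates B, B activates C, C inhibits A.
print(loop_sign([+1, +1, -1]))   # one negative interaction -> "negative"
```
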

DISCRETE NETWORK MODELS OF SOME BIOLOGICAL SYSTEMS


The intention of the following account is to stress some of the novel results and interpretations emanating
from the approach of modeling regulatory networks as discrete dynamical systems. It is not intended
to be a thorough review of each and every published model; rather, it will be an illustrative description
of the kind of insights obtained from the modeling approach. Of special interest will be the biological
interpretation of the attractors observed in the dynamical systems.


Lambda Phage
The lambda phage is a virus of Escherichia coli that can integrate its genetic material into the host DNA
or multiply in the cytoplasm. The choice between these two fates, known as the lytic and lysogenic
pathways, is controlled by the interaction of many bacterial and viral genes. Thieffry and Thomas (1995)
elaborated a regulatory network incorporating the interactions among genes cI, cII, cro and N of the
lambda bacteriophage. The network established between these four genes constitutes the main control
mechanism that eventually will determine if the host bacterium will lyse or become lysogenic. This
process is very similar to a cellular differentiation process, in which the stable patterns of expression
determine the cellular fate.
Despite the fact that the complete genome sequence of the lambda phage is known, and that there
is a wealth of molecular information in this biological system, modeling of the dynamical behavior of
the network emphasizes some aspects that are not appreciated intuitively. First, the model shows that
the cI and cI-cro circuits are sufficient for the shift of stable expression patterns induced by a change
in temperature. Specifically, the model has two stable expression patterns at low temperature, but only
one at high temperature. Second, the circuit analysis of the model shows that the inclusion of cI-cII
and cI-N-cII negative feedback loops, as well as the cI-cro-cII positive feedback loop, results in the
increase of cooperativity but does not play a crucial role in the decision for or against immunity. This
model is capable of reproducing the effect of many known mutations, but its most important feature is
that it permitted the study of the role and importance of the known feedback loops present in the network
during the infection process of the lambda phage.

Drosophila
Genetic and molecular studies on Drosophila melanogaster have shown the existence of gradient mechanisms in the generation of embryonic patterns. Sanchez et al. (1997) published a regulatory network
established by the genes dorsal, twist, snail, decapentaplegic, sog, toll, and rhomboid, to explain the
establishment of the dorso-ventral genetic pattern of Drosophila. They found that the first four of the
mentioned genes were enough for the creation of a discrete dynamical system to describe the patterning. The model contains two feedback loops, which together generate five steady states. Three of these
states are fixed point attractors, which correspond to the genetic expression found in the ectoderm,
neuroectoderm and mesoderm. This model correctly describes at a qualitative level not only the
observed stable gene expression patterns, but also some transient patterns and the effect of single-gene
mutations on the number and nature of these patterns. Finally, the model made it possible to propose that the
autoregulation of snail is dispensable for the establishment of the dorso-ventral pattern, which was a
non-evident result from the formal analysis of the model.
Genes that establish the antero-posterior pattern of Drosophila have also received some attention
from the modeling community. Burstein (1995) elaborated a network that incorporates maternal, gap,
pair-rule, segment polarity, and homeotic genes. Even though the network contains 16 nodes, the author
was interested in the dynamics of the deformed gene alone. The model made it possible to find the concentrations of the proteins coded by the genes bicoid, hunchback and even-skipped required to establish
the striped expression pattern of deformed. This network model helps understand how relatively few,
broadly expressed gap genes specify organized stripes of downstream genes. Specifically, it shows that
the mechanism depends upon a combination of overlapping patterns and gradient concentrations of the
proteins coded by the gap genes.
Without doubt, the most elaborate models of genetic regulatory networks have been made for Drosophila early embryogenesis. In particular, Bodnar (1997) presented an extraordinary spatio-temporal
discrete model that integrates the genetic and nuclear events from the egg to the syncytial blastoderm,
comprising 13 nuclear divisions. In this model, all genes and their products have four possible states.
The rules controlling the activation of genes depend upon protein concentrations and chromatin states,
while the maternal effects are simulated as the initial state of the system. The spatial effect is incorporated through the protein gradients that activate or repress genes in neighboring nuclei. A network of 11 genes
was sufficient to generate the 16 compartments corresponding to the parasegments formed along the
antero-posterior axis of the Drosophila embryo. In contrast, a network of 14 elements was necessary to
simulate the expression pattern of homeotic genes. Finally, the dorso-ventral expression was modeled
with a network formed by seven nodes. The complexity of this model allowed not only the description of the
early development of the Drosophila embryo, but also the study of some aspects of the
evolution of developmental pathways. For example, by the elimination of two genes, and the addition of
one gene and three connections, it was possible to obtain the patterning of homeotic genes that occurs
in beetles and grasshoppers. This last result shows that the model of a particular network can be easily
modified to study similar networks in related species.

Arabidopsis
Flower development has proven to be a suitable system for network modeling. Mendoza and
Alvarez-Buylla (1998) developed a genetic regulatory network with 11 genes that control the flower
morphogenesis of Arabidopsis thaliana. The model has six attractors, where four of them correspond
to the genetic activity observed in the four floral organs, i.e. sepals, petals, stamens and carpels. A fifth
attractor, in turn, represents the genetic activity of meristems not competent to initiate a flower; it is
a non-flowering state. The sixth attractor corresponds to the prediction of a stable expression pattern
of two non-flowering genes and another two involved in flower development. Notably, the model also
predicted the existence of one interaction, the activation of AG by LFY, which was confirmed independently by an experimental group (Parcy et al., 1998). A subsequent analysis of the model (Mendoza et
al., 1999), using the generalized logical analysis, showed that only two feedback loops of the network,
namely AP1-AG and AP3-PI, are sufficient to obtain the six attractors already mentioned. Moreover, the
analysis predicted the existence of a yet undiscovered activator of the gene LFY.
The differentiation process in the root epidermis of Arabidopsis has also been the subject of modeling. There is one model (Mendoza and Alvarez-Buylla, 2000) that incorporates the genetic regulation
and signal transduction pathways leading to the development of root hairs. Alternate files of hairs and
non-hair cells form the root epidermis of Arabidopsis, although there are multiple mutants that alter the
number and distribution of these hairs. The model that describes this pattern is a discrete dynamical
system that incorporates eight elements, including transcription factors and signaling proteins. Interestingly, two variables represent external inputs to the network: one represents the ethylene available to the cells,
and the other an uncharacterized signal coming from the root cortex. The combination of the
two signals determines the attractor reached by the network, where each attractor represents a genetic
expression pattern leading to the appearance of a different number of root hairs. An important difference
between this model and previous non-network models for the root epidermis is its capacity to describe
and predict the morphological effects of single and multiple mutations, as well as the response to some
pharmacological treatments.

Cell Cycle
The cell cycle in mammalian cells can be roughly seen as the succession of the duplication of the genetic
material in the S phase, and cell division in the M phase, preceded by the G1 and G2 phases, respectively. These events are well characterized at the cellular level, and the main intervening molecules are
known. The large quantity of information regarding the molecules and interactions that constitute the
regulatory network controlling the cell cycle shows that it is a rather complex system. For this reason,
there have been many efforts to model this network, so as to fully understand how it generates and controls
the different phases of the cell cycle. Faure et al. (2006) described this network with
logical rules for the activation of 10 molecules; namely, CycA, CycB, CycD, CycE,
Rb, E2F, p27, Cdc20, UbcH10 and Cdh1. Despite the apparent simplicity of the network, the system has
two attractors that are consistent with the known experimental data. On the one hand, the model has a
stable steady state where Rb, p27 and Cdh1 are active and CycD is inactive. This state corresponds to
the quiescent cellular state. On the other hand, whenever CycD is present, all the trajectories of the network converge towards a unique complex dynamical cycle, which corresponds to the cell
cycle itself. Importantly, this discrete model is in full agreement with more sophisticated continuous
models (Novak and Tyson, 2004).

Neuroendocrine Regulation
Network elements do not necessarily represent genes, proteins or other single molecules. Muraille et al.
(1996) elaborated a network model to study the neuroendocrine regulation of the immune response. The
model contains only four elements representing a pathogen, the immune response, the hypothalamo-pituitary-adrenocortical axis, and the host organism itself. With the use of the generalized logical analysis,
the authors showed that the model has six feedback loops, functional in some regions of the variable space.
The combined functionality of these circuits results in the generation of ten steady states. One of these
is the zero state, where nothing happens. Another represents the state where the pathogen is dead; and
a third one represents the organism's death. The remaining seven steady states are characteristic of the
feedback loops and hence do not correspond to stable expression patterns.
One peculiarity of this particular model of neuroendocrine regulation is that the discrete system
was used as a stepping stone to construct a more refined model using ordinary differential equations.
With the analysis of both the discrete and the continuous models, it was possible to study some types of
well-characterized immune responses, like immunogenicity, toxicity, neuro-hormonal feedback, toxic
shock syndrome, the relation of pathogen with infection, and the relation of stress with the immune
response. One of the important contributions of this work was the demonstration that the elaboration
of a discrete network helps to locate and identify the steady states in a continuous model.

T-Helper Cells
The vertebrate immune system is made of diverse cell populations. One of these is the population of
CD4+ lymphocytes, also known as T-helper cells. These cells have no cytotoxic or phagocytic activity, but they coordinate several cellular and humoral immune responses via the release of cytokines,
influencing the activity of several cell types. In vitro, T-helper cells can be subdivided into precursor
Th0 cells, and effector Th1 and Th2 cells, depending on the pattern of secreted molecules. Various
mathematical models have been proposed to describe the differentiation, activation and proliferation of
T-helper cells. However, early models aimed to describe the cellular interactions of these cells, while
more recent models have been developed to describe the molecular mechanisms that determine the differentiation process of these cells.
There is currently a lack of quantitative data about the levels of expression of the molecules involved
in the differentiation process of T-helper cells. There are, however, vast amounts of qualitative information
regarding the regulatory interactions among many of such molecules. As a result, it has been possible to
reconstruct the basic signaling network that controls the differentiation of T-helper cells. This network
has been studied as a discrete network, implemented as a dynamical system (Mendoza 2006), a Petri
Net (Remy et al., 2006), and a binary decision diagram (Garg et al., 2007). In all cases, it was possible
to obtain the basic differentiation process going from Th0 cells to either Th1 or Th2 cells. Furthermore, these models are capable of describing the effect of null-mutations, or over-expression of some
molecules as reported by several experimental groups. The consistency among the results of several
modeling approaches on the same signaling network shows that the qualitative dynamical behavior of
the network is determined to a large extent by its topology. Another relevant aspect is that the network
has also been modeled as a continuous dynamical system (Mendoza and Xenarios, 2006). Here again,
the results between the discrete and continuous approaches are directly comparable, showing that the
qualitative behavior of the network is strongly determined by the topology of the network, and not by
its actual implementation as a discrete or a continuous system.

CONCLUSION
Regulatory networks are made of macromolecules and the interactions among them. In the particular
case of genetic regulatory networks, the molecules of interest are genes and their coded proteins, while
the relevant interactions are transcriptional relationships from one gene to another. With the advance
of molecular techniques, the availability of gene sequences and the determination of their expression
patterns is increasing at a high rate. Therefore, it is necessary to integrate the large quantity of molecular information into functional networks, and develop models to understand the dynamical behavior of
such networks.
Studies about the properties of discrete state networks as models for genetic regulation date back at
least four decades (Kauffman, 1969; Thomas, 1973; Glass, 1975). These studies were done on abstract,
or idealized networks. It was only recently that models of specific genetic regulatory networks
started to appear. Of all the biological experimental systems, Drosophila melanogaster is by far the
most studied system from the point of view of regulatory networks. Indeed, models for this organism
have reached high sophistication (Kosman et al., 1998). However, other biological networks are
attracting the attention of the scientific community. Hence, it is relevant to show the descriptive and
predictive capacity of the modeling approach with the use of discrete dynamical systems.
As a general rule, when genetic and biochemical studies suggest a sequence of regulatory events,
people use arrows to represent the interactions among macromolecules. Such graph representations
are necessary but not sufficient to understand the dynamical behavior of the pathways or networks. This
chapter was intended to show that discrete networks offer a simple, yet powerful methodology that
permits the study of the collective behavior of complex biological networks. Moreover, this kind of modeling permits a thorough dynamical and analytical study, which gives relevant information to elaborate
more realistic continuous models.
Despite the clear usefulness of discrete networks, it is important to keep in mind the difficulties
behind this type of modeling. A significant problem in elaborating a model of any kind is the correct
inference of the interactions among the nodes. Such a process involves a thorough analysis of a large
quantity of experimental literature to infer the connectivity. This step includes at least two possible
sources of error. The first concerns the discrimination process: what information is relevant and
what is superfluous? It is not evident at all whether a reported experimental result is related to a particular
regulatory process, or rather to a molecular response not expected by the modeler. An ideal solution would
be the formation of interdisciplinary groups with experts in the modeling process and experimentalists
familiar with the biological system. However, this is not a common trend yet; more often than not the
modeler has to gather information from published data. This brings about the second problem. Different
experimental laboratories often work with different methodologies, animal subspecies, plant ecotypes,
etc. Here again, how can one distinguish between changes due to regulatory processes and those due to differences
in the data acquisition methodology? There is no easy answer; one has to balance knowledge
of the experimental system with modeling intuition.
The use of integrative methodologies is necessary to reproduce the complex dynamical spatio-temporal
patterns of biological systems. In particular, the analysis of genetic regulatory networks as dynamical
systems provides us with a suitable tool for the integrative analysis of the large quantity of molecular data
available (Huang, 2004; Huang and Ingber, 2000). Moreover, network models not only synthesize data
but they also permit the elaboration of predictions that are not evident by using the classical conception
of hierarchical static models. Of special importance is the prediction of missing regulatory interactions,
of missing genes, or stable genetic expression patterns; all these are useful guides for the experimental
biologists to continue with the molecular analysis of certain experimental organisms.

REFERENCES
Bagley, R.J., and Glass, L. (1996). Counting and classifying attractors in high dimensional systems. J.
Theor. Biol. 183(3), 269-284.
Bodnar, J.W. (1997). Programming the Drosophila embryo. J. Theor. Biol., 188(4), 391-445.
Burstein, Z. (1995). A network model of developmental gene hierarchy. J. Theor. Biol., 174(1), 1-11.
Cannon, W.B. (1929). Organization for physiological homeostasis. Physiol. Rev., 9(3), 399-431.
de Jong, H. (2002). Modeling and simulation of genetic regulatory systems: a literature review. J.
Comput. Biol., 9(1), 67-103.
Delbrück, M. (1949). Discussion. In: Unités Biologiques Douées de Continuité Génétique. CNRS (Lyon)
(pp.33-34).
Faure, A., Naldi, A., Chaouiya, C., & Thieffry, D. (2006). Dynamical analysis of a generic Boolean
model for the control of the mammalian cell cycle. Bioinformatics, 22(14), e124-e131.
Glass, L. (1975). Classification of biological networks by their qualitative dynamics. J. Theor. Biol.,
54(1), 85-107.
Glass, L., & Kauffman, S.A. (1973). The logical analysis of continuous, non-linear biochemical control
networks. J. Theor. Biol., 39(1), 103-129.
Gouzé, J.L. (1998). Positive and negative circuits in dynamical systems. J. Biol. Systems, 6(1), 11-15.
Huang, S. (2004). Back to the biology in systems biology: What can we learn from biomolecular networks. Brief. Funct. Genomic. Proteomic., 2, 279-297.
Huang, S., & Ingber, D.E. (2000). Shape-dependent control of cell growth differentiation, and apoptosis:
Switching between attractors in cell regulatory networks. Experimental Cell Research, 261(1), 91-103.
Kauffman, S.A. (1969). Metabolic stability and epigenesis in randomly constructed genetic nets. J.
Theor. Biol., 22(3), 437-469.
Kauffman, S.A. (1991). Antichaos and adaptation. Sci. Am., 265(2), 78-84.
Kauffman, S.A. (1993). The origins of order: Self-organization and selection in evolution. Oxford
University Press.
Kosman, D., Reinitz, J., & Sharp, D.H. (1998). Automated assay of gene expression at cellular resolution. Pac. Symp. Biocomput., 6-17.
Mendoza, L. (2006). A network model for the control of the differentiation process in Th cells. BioSystems, 84(2), 101-114.
Mendoza, L., & Alvarez-Buylla, E.R. (1998). Dynamics of the genetic regulatory network for Arabidopsis
thaliana flower morphogenesis. J. Theor. Biol., 193(2), 307-319.
Mendoza, L., & Alvarez-Buylla, E.R. (2000). Genetic regulation of root hair development in Arabidopsis
thaliana: a network model. J. Theor. Biol., 204(3), 311-326.
Mendoza, L., Thieffry, D., & Alvarez-Buylla, E.R. (1999). Genetic control of flower morphogenesis in
Arabidopsis thaliana: a logical analysis. Bioinformatics, 15(7), 593-606.
Mendoza, L., & Xenarios, I. (2006). A method for the generation of standardized qualitative dynamical
systems of regulatory networks. Theor. Biol. Med. Model., 3, 13.
Muraille, E., Thieffry, D., Leo, O., & Kaufman, M. (1996). Toxicity and neuroendocrine regulation of
the immune response: A model analysis. J. Theor. Biol., 183(3), 285-305.
Parcy, F., Nilsson, O., Busch, M.A., Lee, I., & Weigel, D. (1998). A genetic framework for floral patterning. Nature, 395, 561-566.
Plahte, E., Mestl, T., & Omholt, S.W. (1995). Feedback loops, stability and multistationarity in dynamical systems. J. Biol. Systems, 3(2), 409-413.
Sanchez, L., van Helden, J., & Thieffry, D. (1997). Establishment of the dorso-ventral pattern during embryonic development of Drosophila melanogaster: A logical analysis. J. Theor. Biol., 189(4), 377-389.
Snoussi, E.H. (1998). Necessary conditions for multistationarity and stable periodicity. J. Biol. Syst.,
6(1), 3-9.
Thieffry, D., Snoussi, E.H., Richelle, J., & Thomas, R. (1995). Positive loops and differentiation. J. Biol.
Syst., 3(2), 457-466.
Thieffry, D., & Thomas, R. (1995). Dynamical behaviour of biological regulatory networks -II. Immunity
control in bacteriophage lambda. Bull. Math. Biol., 57(2), 277-297.
Thomas, R. (1973). Boolean formalization of genetic control circuits. J. Theor. Biol., 42(3), 563-585.
Thomas, R. (1978). Logical analysis of systems comprising feedback loops. J. Theor. Biol., 73(4), 631-656.
Thomas, R., Thieffry, D., & Kaufman M. (1995). Dynamical behaviour of biological regulatory networks
-I. Biological role of feedback loops and practical use of the concept of the loop-characteristic state.
Bull. Math. Biol., 57(2), 247-276.

KEY TERMS
Attractor: A set of states of a dynamical system which the system approaches asymptotically. The region of the state space towards which all the trajectories of a set of initial states converge.
Dynamical System: A set of equations that describe the change of some variables over time.
Feedback Loop: A circular chain of interactions, such that each element in the loop influences its
own future level of activation. Feedback loops are also known as circuits.
Graph: A collection of points and lines connecting a subset of them. The points of a graph are
commonly known as vertices or nodes. Similarly, lines connecting the vertices of a graph are known
as edges, arcs, or interactions.
Homeostasis: The property of a system to regulate its internal environment, to maintain a stable or
constant condition.
Multistationarity: The property of a dynamical system of having two or more steady states.
Steady State: In a dynamical system, a state where none of the variables in the system changes
with time.


Chapter XXXII

Investigating the Collective Behavior of Neural Networks:
A Review of Signal Processing Approaches


A. Maffezzoli
Politecnico di Milano, Italy
F. Esposti
Politecnico di Milano, Italy
M.G. Signorini
Politecnico di Milano, Italy

ABSTRACT
In this chapter, the authors review the main methods, approaches, and models for the analysis of neuronal network data. In particular, the analysis concerns data from neurons cultivated on Micro Electrode Arrays
(MEA), a technology that allows the analysis of a large ensemble of cells in long-period recordings.
The goal is to introduce the reader to the MEA technology and its significance in both theoretical and
practical aspects of neurophysiology. The chapter analyzes two different approaches to the MEA data
analysis: the statistical methods, mainly addressed to the network activity description, and the system
theory methods, more dedicated to network modeling. Finally, the authors present two original methods
of their own. The first method involves innovative techniques in order to globally quantify
the degree of synchronization and inter-dependence on the entire neural network. The second method is
a new geometrical transformation, performing very fast whole-network analysis; this method is useful
for singling out collective-network behaviours with a low-cost computational effort. The chapter provides an overview of methods dedicated to the quantitative analysis of neural network activity measured
through MEA technology. Until now, many efforts have been devoted to the biological aspects of this problem
without taking into account the computational and methodological signal processing questions. This is
precisely what the authors try to do with their contribution, hoping that it could be a starting point in
an interdisciplinary cooperative research approach.

INTRODUCTION
In recent years neuroscience has been greatly enriched by engineering techniques and methods, and this
scientific exchange supports specific applications of micro- or nano-technology in neurobiology and
molecular biology. This new high-technology approach is called neuroengineering. The engineering
contribution is not simply restricted to instrumentation, but it also supplies various approaches to analyze neuronal activity, studying mathematical models and computer-aided simulation of neurobiological
phenomena in in vitro and in vivo cultures.
The most widespread instrumentation is the so-called Micro Electrode Array (MEA) technology, which complements traditional electrophysiological techniques in neuroscience research (e.g., how the brain stores
and processes information), prosthesis development (using living neurons as components of an integrated
circuit or directly connecting a computer to them), and bio-analytics and information technology. MEA
technology is very helpful to understand the dynamics of a functioning neuronal network, because it
allows understanding of which different processes or components are acting together at the same time,
going beyond the traditional single-neuron approach.
The engineering contribution is not, however, restricted to instrumentation design and its use; it also supplies advanced approaches for the investigation of neuronal activity; among others,
we recall data recording and processing, mathematical modelling and computer-aided simulations of
neurobiological phenomena, with virtual simulations of the behavior of single neurons and clusters of
neurons in both in vitro and in vivo cultures.
To this aim, neuroengineering provides valuable tools to extract the information coming from neurons
over an extremely wide range of scales, from system behavior down to the single neuron. MEA data can be processed
with custom methods of signal processing and pattern recognition or machine learning.
Specific and very promising long-term applications of MEA technologies are chemical and pharmaceutical set-ups, where new drugs are tested on in vitro neuron ensembles. To this aim, a tool able
to implement a method for the evaluation of neuronal network behavior, as a consequence of different
stimuli, would be very useful, making such tests smarter and cheaper.
MEA instrumentation, summing up, complements traditional electrophysiological techniques for:

Fundamental neuroscience research
In-vitro drug assays
Bio-analytics (biosensors)
Prosthesis development
Information technology

and allows:
Long-term cultures
Multi-site extra-cellular recordings/stimulations
Combination with micro-fluidic systems and bio-patterning techniques


BACKGROUND
A Short Description of Micro Electrode Array Technology
There are a number of different MEA types available for extra-cellular multichannel recordings: for
example, the most used are standard MEAs with flat round TiN electrodes in an 8 x 8 (64 electrodes,
of which 4 are not used) layout grid, but there are also MEAs with 3-D shaped Pt electrodes for
acute slices, those with a very low thickness for high-resolution imaging, or with a clustering structure,
or perforated MEAs for easy slice positioning, recording and stimulation.
MEA instrumentation is composed of multiple electrodes, which allow the simultaneous targeting
of several sites for mid- to long-term extra-cellular recording, and possibly non-invasive stimulation,
of continuous spontaneous or evoked in vitro cell activity. Cell lines or tissue slices are placed and
cultivated directly on the MEA and almost all spontaneously active or excitable cells and tissues can be
used, e.g. central or peripheral neurons, heart excitable cells, retina (tissue or single cells), or muscles.
During data acquisition, signals are amplified, filtered and sent to a computer, where data are stored,
analyzed and processed by custom software tools. Commercial MEAs provide low-impedance electrodes
(lower than 1 MΩ at 1 kHz), together with good cellular sealing and high charge injection capacity for
efficient stimulation. The acquisition bandwidth is 10 Hz to 5 kHz, usually reduced to 100 Hz to 2 kHz
for computational cost reasons.
Neural cells can be cultured in vitro and kept alive for several days or months, while preserving their
adaptive properties. Nowadays MEA is the preferred interface technique to perform multisite recording
and/or stimulation of electrically excitable cells.

Figure 1. Micro Electrode Array (MEA) system, both hardware and software for the acquisition of extracellular recordings from a single multichannel array (courtesy from http://www.multichannelsystems.
com/products/measystem/measystemintro.htm)


Figure 2. MEA for the extracellular recording from up to 64 electrodes of in-vitro neuronal activity of
dissociated or slice neuronal cultures (courtesy from http://www.multichannelsystems.com/products/
meaprobes).

Data Description
Data deriving from MEAs are generally internally pre-processed, with pruning and Spike Sorting routines,
and then arranged in the typical form of Point Processes, that is, time series of discrete events
represented by spike trains. We remark also that, for each MEA channel, the dataset gives recordings
of different non-silent neurons, generally up to 3, which were presumably located under the channel and
whose activities were extracted by spike sorting.
Spike sorting consists of a series of techniques used in the analysis of electrophysiological data, in
particular the spike waveforms collected with one or more electrodes, to distinguish the activity of a
single neuron from background electrical noise and from the activities of the neighboring neurons,
exploiting the fact that the spike shapes are unique and reproducible and therefore useful to separate the activity produced
by each neuron. Techniques applied here range from the very simple, but also inaccurate, threshold detection to the more complex and reliable Feature, Principal Component, Clustering, Template Matching
and Wavelet analyses.
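
As an illustration of the simplest of these techniques, the sketch below performs threshold-based spike detection on a synthetic extracellular trace; the threshold rule (a multiple of a robust noise estimate) and all parameter values are assumptions chosen for the example, not a prescription from the methods cited above:

```python
import numpy as np

def detect_spikes(trace, fs, k=5.0, dead_time=0.002):
    """Naive threshold-based spike detection on one extracellular channel.
    trace: 1-D voltage signal; fs: sampling rate (Hz).  The threshold is k
    times a robust noise estimate; dead_time (s) discards refractory violations."""
    noise_sd = np.median(np.abs(trace)) / 0.6745        # robust noise estimate
    threshold = k * noise_sd
    # indices where the signal crosses the negative threshold downwards
    crossings = np.where((trace[1:] < -threshold) & (trace[:-1] >= -threshold))[0] + 1
    spike_times, last = [], -np.inf
    for idx in crossings:
        t = idx / fs
        if t - last >= dead_time:
            spike_times.append(t)
            last = t
    return np.array(spike_times)

# Synthetic example: 1 s of noise with three negative deflections at invented times.
fs = 20000.0
rng = np.random.default_rng(0)
trace = 0.01 * rng.standard_normal(int(fs))
for t in (0.1, 0.35, 0.8):
    trace[int(t * fs)] -= 0.1
print(detect_spikes(trace, fs))                          # ~ [0.1, 0.35, 0.8]
```
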
Spike sorting is thus necessary to extract the most significant information about the spiking activity of
each neuron, but after this pre-processing the Point Process data are still not directly usable for further analyses;
they generally need other transformations, as explained later.
First of all, the list of spike event instants is commonly pre-processed and transformed into a representation more useful for further processing: a simple and fast method that is widely applied, called the binning
procedure, consists of dividing the total time window into sub-windows, also known as time-bins or
simply bins, and building a new discrete time series made of integer values representing how many
spikes are counted inside each subsequent bin (Eden, 2004). It is also possible to take the signals recorded
from various neurons and merge them into one channel only, for example the recordings corresponding
to the Principal Components of a single channel, and then choose whether to process them together or separately.
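
A minimal sketch of the binning procedure just described, turning a hypothetical list of spike times into a discrete series of spike counts per bin:

```python
import numpy as np

def bin_spike_train(spike_times, bin_width, t_start, t_stop):
    """Divide [t_start, t_stop) into bins of width bin_width (seconds) and
    count how many spikes fall inside each bin."""
    edges = np.arange(t_start, t_stop + bin_width, bin_width)
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts

# Hypothetical spike times (seconds) binned at 10 ms resolution.
spikes = [0.012, 0.015, 0.043, 0.044, 0.046, 0.081]
print(bin_spike_train(spikes, bin_width=0.01, t_start=0.0, t_stop=0.1))
# -> [0 2 0 0 3 0 0 0 1 0]
```
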


NEURONAL DATA ANALYSIS


Aims
The analysis of cultured neuron data holds a preeminent place in the study of neuron collective behavior.
In the last century hundreds of studies investigated the cellular characteristics of the neuron, both from
the electrical and the molecular point of view. This impressive effort has produced a detailed knowledge of the
characteristics of the neuron as a cell; unfortunately, it is not enough. In recent years the great attention paid to the potentialities of MEA tools has, in fact, been driven by the awareness that our comprehension of the nervous system
is based on the understanding of neuronal network organization, and not only of neurons. MEA, and
other recent analysis tools such as calcium imaging, are specifically designed for the study of neuronal
collective behaviors. Starting from these considerations it is easy to understand why, both for physiological
and pathological studies, the analytical methods usually employed are linear or non-linear correlation or
entropy estimators or pattern recognizers.

Signal Processing Approaches in Computational Neuroscience


Computational neuroscience develops and uses mathematical models to study how neuronal networks
transmit information in the form of spiking activity. Although these models cover a very wide range of mathematical methods, there are two principal and distinct
approaches to the study of neural systems:

Computational approach: The computational approach is concerned with neural information
processing and with how various forms of adaptive behavior can be implemented in neural networks with evolving synaptic responses: here we ask how networks of spiking neurons allow the
transmission of specific information to perform goal-oriented tasks in multiple environments. It
develops signal processing algorithms and statistical methods to analyze spiking data collected in
neuroscience experiments. The growing complexity of neuroscience experiments makes statistical
data analysis methods a necessary complement of neural network modeling: they allow validation
of neural network model predictions, also testing biologically relevant parameter
values for simulation studies;
Physiological approach: The physiological approach is more concerned with the intrinsic dynamics of single cells and their effects on network behavior, as described by more biophysically realistic
conductance-based models. This approach studies the molecular, cellular and network mechanisms underlying electrical neuronal activity both in normal and in pathological (such as epilepsy,
Alzheimer's and Parkinson's diseases) conditions. Physiological approaches range from reduced
neuron models to more complex biophysical models (Hodgkin and Huxley and their variants) of
individual neurons, networks of neurons or artificial neural network models to study the emergent
behaviors of neural systems, together with computational methods describing synaptic and dendritic
processing.

Methods Overview
From the point of view of the computational approach, Neuroscience data analysis exploits established
statistical and signal processing paradigms as well as data mining and pattern recognition methods
wherever possible. Several standard statistical procedures, widely used in other fields of science have
found their way into various applications in neuroscience.

Statistical Methods
The timing of successive spikes both in in vivo and in vitro neurons seems very irregular and randomly distributed in time and in space. In this sense the study of the timing of spike events or, equivalently, of the Inter Spike Interval (ISI) distribution is very useful to evaluate the richness of information
transferred by neurons, but first the analysis of the stochastic process underlying spike generation is
required (Rieke, 1997). A number of previous studies, experiments and simulations report that the
spikes fired by a neuron are distributed as a Poisson process: this suggests that they arise roughly randomly
during spontaneous, non-stimulated periods and that the numbers of spikes occurring in different time
windows are independent of each other, or more simply that spikes happen randomly during a given time
window. The generation of spikes depends on a spike rate, which can be constant, for Homogeneous Poisson
Models, or time-dependent, for Inhomogeneous Poisson Models. Of course, this is not the case in reality,
e.g. because of absolute refractory periods, when spikes are absent, and relative refractory periods, when the likelihood of spiking decreases, after a single spike or during Bursting behavior, whose statistical features
are definitely non-Poissonian both in time and space.
Although Inhomogeneous Poisson Models are very reliable, it is necessary to model a firing rate
function which not only varies instantaneously in time but also depends on the previous
spike history, thereby contradicting the independence hypothesis reported above. Examples of such models are
the so-called statistical renewal processes, whose ISI distribution is described by a gamma distribution (Heeger, 2000), which becomes an exponential distribution in the particular case where the model of
spike generation is exactly Poissonian.
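
As a toy numerical illustration of these models (all rates and parameters are invented), the sketch below generates spike trains from a renewal process with gamma-distributed ISIs; a shape parameter of 1 reproduces the homogeneous Poisson case, while a larger shape mimics a relative refractory period:

```python
import numpy as np

rng = np.random.default_rng(1)

def renewal_train(rate, duration, shape=1.0):
    """Spike train from a renewal process with gamma-distributed ISIs.
    shape=1 gives exponential ISIs (homogeneous Poisson process); shape>1
    mimics a relative refractory period.  The mean ISI is kept at 1/rate."""
    spikes, t = [], 0.0
    while True:
        t += rng.gamma(shape, 1.0 / (rate * shape))   # mean ISI = 1/rate
        if t >= duration:
            break
        spikes.append(t)
    return np.array(spikes)

poisson_like = renewal_train(rate=10.0, duration=60.0, shape=1.0)
refractory   = renewal_train(rate=10.0, duration=60.0, shape=4.0)

for name, train in (("Poisson (shape=1)", poisson_like), ("gamma (shape=4)", refractory)):
    isi = np.diff(train)
    print("%-18s spikes: %d  mean ISI: %.3f s  CV of ISI: %.2f"
          % (name, len(train), isi.mean(), isi.std() / isi.mean()))
```

The coefficient of variation of the ISIs drops from about 1 for the Poisson-like train to about 0.5 for the gamma train with shape 4, which is one simple way of quantifying a departure from Poisson statistics.
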
Other procedures are based on the likelihood principle: the likelihood is formulated by deriving
the joint distribution of the data, and then considering this joint distribution as a function of the model
parameters with the data fixed. This function helps in estimating the model parameters, constructing confidence bounds, and making inferences about the particular problem under study. The conditional
intensity function can be used to derive the joint probability density of the neural spike train. Evidence
of such behaviors is discussed in (Troy, 1992) and (Softky, 1993).
It also seems very useful to study these data under a Markovian hypothesis, which gives the
possibility of building an alternative machine learning method with predictive applications and of better
understanding the meanings and functions of the various behavioral states alternating in neuronal ensemble recordings of spontaneous or stimulated activity: the main objective of this application is the
construction of a Markovian model trained on experimental data of neural spiking activity on MEAs,
preferably with some external stimulus applied, in order to perform predictive tasks on future
experiments with a similar set-up. At the same time, such a method is very interesting in order to single
out new information about the underlying hidden states that regulate the network response before and
after stimulation. Finally it results in a reliable method to investigate the hypothesis of statistically
dependent walks in ensemble spiking behavior.
Many studies employ classical signal processing techniques, both linear, such as the Cross-Correlation and the Auto-Correlation methods, and non-linear, such as the information theory based
indexes, e.g. Entropy and Mutual Information, or other innovative processing methods able to quantify
the similarity between two data series, such as Dynamic Time Warping. In particular, it is possible to
find in the literature applications of the following linear and non-linear time series indexing techniques:
normalized Cross-Correlation with zero time delay, also known as Normalized Cross-Covariance, and
the Auto-Correlation method (Mood, 1974); Mutual Information, which is a non-linear index that represents
a measure of the average amount of information we can obtain about one random variable by observing another one, quantifying the degree of statistical independence between them (Shannon, 1948;
Ash, 1965; Reza, 1994); Dynamic Time Warping (DTW), a method that iteratively finds an optimal
match between two sequences warped non-linearly in time and that gives a measure of their similarity more reliable than the Cross-Correlation cited above, although less computationally efficient. The first
two techniques give results based on specific hypotheses about the stochastic random processes under
analysis, specifically supposing that they are stationary and also ergodic, which is the reason why we extract
their statistical properties, in practice the first two statistical moments, directly from a single time series
(Bittanti, 1999; Reza, 1994). We also have to take into account that all these indexes are averaged on the
total time duration of the data series, so we can re-sample the data to obtain better significance of the results,
and also separately process recordings which are presumably not stationary, which happens quite often
given the self-adaptive characteristics of neuronal systems, especially after stimuli.
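
The sketch below computes two of the indexes mentioned here, the zero-delay normalized cross-covariance and the mutual information, between binned spike-count series; the data are synthetic, and the binarization used before estimating the mutual information is just one possible discretization choice, not the one prescribed in the cited references:

```python
import numpy as np

rng = np.random.default_rng(2)

def normalized_cross_covariance(x, y):
    """Zero-delay normalized cross-covariance (Pearson correlation)."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float(np.mean(x * y))

def mutual_information(x, y):
    """Mutual information (bits) between two discrete-valued series,
    estimated from the empirical joint histogram."""
    xs, ys = np.unique(x), np.unique(y)
    joint = np.zeros((len(xs), len(ys)))
    for a, b in zip(x, y):
        joint[np.searchsorted(xs, a), np.searchsorted(ys, b)] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mi = 0.0
    for i in range(len(xs)):
        for j in range(len(ys)):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (px[i] * py[j]))
    return mi

# Synthetic binned spike counts: ch1 and ch2 share a common drive, ch3 does not.
common = rng.poisson(1.0, 2000)
ch1 = common + rng.poisson(0.5, 2000)
ch2 = common + rng.poisson(0.5, 2000)
ch3 = rng.poisson(1.5, 2000)

print("corr(ch1, ch2) = %.2f" % normalized_cross_covariance(ch1, ch2))
print("corr(ch1, ch3) = %.2f" % normalized_cross_covariance(ch1, ch3))
print("MI(ch1, ch2)   = %.2f bits" % mutual_information(ch1 > 0, ch2 > 0))
print("MI(ch1, ch3)   = %.2f bits" % mutual_information(ch1 > 0, ch3 > 0))
```
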

System Theory Methods


Very promising is the system-oriented and graph-theoretic approach, given by Small-World and
Scale-Free model theories, which are able to quantify the spatial connectivity and global synchronization of the neuron ensemble, both in a static and in a dynamical manner, thus specifying some of the most
interesting features characterizing their behavior.
Complex networks in nature can show particular topologies, with dense and well localized clustering
of different components of the systems and cliquishness of connections between them: if we interpret
these components as nodes of a graph modeling the entire system, we can apply methods of graph theory
to verify if pairs of nodes are joined by short paths and these nodes are also neighbors of each other,
thus creating a sort of cluster structure: in practice, small-worlds are obtained from a nearest-neighbor model by
adding some long-range connections linking the clusters described above. If a network or a graph shows all
these features, clearly quantifiable thanks to analytical methods, it can be modeled as a Small-World
network (Watts, 1998).
Examples of this sort of network are various and very common: the neural network of Caenorhabditis elegans, the power grid of the western United States, the collaboration graph of film actors, and the visual
cortex, even if the latter is still under study.
The mathematical concept underlying these methods is the following: first, we describe a model as
a graph, with nodes and weighted edges connecting pairs of them. What is most interesting is
the number of nodes connected to a node, the so-called degree of a node, and the degree probability
distribution; the path length between any pair of nodes, calculated as the minimum number
of edges that have to be traversed to connect the two given nodes; and finally the clustering coefficient
of a node, which measures the number of edges connecting its nearest neighbors.
Small-world network features are very interesting: widely distributed and fast signal propagation,
and high computational power. They are economical, in the sense that they can minimize wiring costs
while preserving a high dynamical complexity, supporting both specialized and distributed information processing (Bassett, 2006). Finally, they easily achieve synchronization, whereas in some cases the
original nearest-neighbor model cannot (Wang, 2001).


At the opposite extreme of the very ordered nearest-neighbor and Small-World models there are the random
networks, whose connections between nodes are drawn randomly. Another very interesting model is the
Scale-Free network, where nodes are linked in a hierarchical manner, in a so-called rich-get-richer structure,
and the degree probability distribution follows a power law, so that such a system evolves independently
of time and consequently of dimension or scale, i.e. it has no characteristic scale (Barabasi,
1999).
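
These graph measures are easy to compute with standard tools; the following sketch, which uses the networkx library on purely synthetic graphs, contrasts a small-world graph, a random graph, and a scale-free graph through their characteristic path length, clustering coefficient, and maximum degree (graph sizes are arbitrary choices for the example):

```python
import networkx as nx

def summarize(name, G):
    """Characteristic path length (on the largest connected component) and
    average clustering coefficient of an undirected graph."""
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    L = nx.average_shortest_path_length(giant)
    C = nx.average_clustering(G)
    max_degree = max(d for _, d in G.degree())
    print("%-12s L = %.2f  C = %.3f  max degree = %d" % (name, L, C, max_degree))

n, k = 200, 6                                   # arbitrary sizes for the example
summarize("small-world", nx.connected_watts_strogatz_graph(n, k, 0.1, seed=1))
summarize("random",      nx.gnm_random_graph(n, n * k // 2, seed=1))
summarize("scale-free",  nx.barabasi_albert_graph(n, k // 2, seed=1))
```

One would expect the small-world graph to combine a short path length with a clustering coefficient well above that of the random graph, and the scale-free graph to stand out through its highly connected hubs (large maximum degree).
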

Neurons Models Methods


Besides analytical methods, modeling approaches are also largely implemented in the literature. These
methods try to understand the network features through the simulation of high-level mechanisms or the
replication of hundreds of virtual neurons (Izhikevich, 2007). Obviously, the quality of the observations
depends on the model of neuron chosen. The most widespread models of single neuron activity are
the Hodgkin and Huxley model (Hodgkin, 1952) and the Integrate and Fire model (e.g. Geisler, 1966).
Together with single cell models, synaptogenesis models are often used, i.e. models that are able to
simulate the auto-connection process of dissociated neurons (Hely, 2001; Segev, 2007). Finally, there
are models of the interaction of neuron populations (such as the very famous Synfire model (Abeles, 1991))
and of culture conditions (e.g. Marinaro, 2004). For a review of these modeling techniques, which is beyond
the scope of this chapter, we refer the reader to the bibliography.
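
As an example of the simpler end of this modeling spectrum, a leaky integrate-and-fire neuron can be simulated in a few lines; the parameter values below are generic textbook-style choices, not taken from the works cited above:

```python
import numpy as np

def lif_simulation(I, dt=1e-4, tau=0.02, R=1e7, v_rest=-0.07,
                   v_thresh=-0.05, v_reset=-0.07):
    """Leaky integrate-and-fire neuron driven by an input current I (amperes).
    Membrane equation: tau * dV/dt = -(V - v_rest) + R * I; when V reaches
    v_thresh a spike is recorded and V is reset to v_reset."""
    v = v_rest
    spike_times, trace = [], []
    for step, i_in in enumerate(I):
        v += dt / tau * (-(v - v_rest) + R * i_in)   # forward Euler step
        if v >= v_thresh:
            spike_times.append(step * dt)
            v = v_reset
        trace.append(v)
    return np.array(spike_times), np.array(trace)

# One second of a constant 2.5 nA input current (an invented value).
I = np.full(10000, 2.5e-9)
spike_times, _ = lif_simulation(I)
print("spikes:", len(spike_times), "first spike times:", spike_times[:3])
```
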

Figure 3. Block schema of all methods discussed in the chapter about the processing and treatment of
data recorded from neuronal networks cultured on MEA.


FUTURE TRENDS
Examples of MEA Data Analyses
In the following we present the methods we developed for the analysis of the multivariate neuronal
data described above. The first method involves both linear and non-linear techniques together with
a combinatorial approach based on Genetic Algorithms, in order to globally quantify the degree of
synchronization and inter-dependence on the entire network. The second one is a geometrical method
which transforms the (2D plus time) MEA signal into a 1D plus time signal in order to perform very
fast whole-network analysis: this method is useful for singling out collective-network behaviors with a
low-cost computational effort.

First Approach: Linear and Non-Linear Indexes of Global Network Synchronization


In order to analyze the behavior of the whole network evolving in time, we implemented a global index
of similarity and correlation between MEA channels (Maffezzoli, 2007). We quantified this global index
by finding pairs of MEA channels most correlated based on a specific criterion. Indeed, we took the list
of the N channels on MEA, as at the top or bottom of Figure 1, and we calculated the sum s of all terms
c calculated between pairs of channels adjacent on the list, as defined in the following:

s=

N 1

c (i; i + 1)
i =1

(1)

where i represents the i-th position in the list and c(i; i+1) is a term calculated between channels in
adjacent positions, according to a time series indexing technique, such as Cross-Correlation, Mutual Information or Dynamic Time Warping. We take the sum s as a concise, effective and sub-optimal index
of global similarity and correlation of the neuron ensemble on the MEA.
There are two possible approaches to finding a good value of the global index s:

Calculating the index for all possible permutations of the N channels, and then selecting the optimal
one with the best global index s: with this brute-force approach, in order to calculate
and compare all the different indexes, the number of computations to perform is of factorial order.
Getting a sub-optimal solution thanks to Genetic Algorithms (GA), which help to iteratively find
an ordering of the channels in our list giving a good value, higher or lower depending on the
time series indexing method previously chosen, of the global index s.

So we proceed as follows: first, we give as input to the GA the original and unsorted channel list, as at
the top of Figure 1. Then we choose the time series indexing technique the fitness function will be based
on, as shown in the middle of Figure 1, and then the GA will sort the channels, summing pair-wise terms for
all the adjacent channels on the list, in order to obtain higher values of the index s given by (1), according to
the criterion chosen initially. Finally, after a fixed number of iterations, the GA will give as output a well
sorted list, having a sub-optimal value of s, as at the bottom of Figure 1.
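
A minimal sketch of the index s defined in (1), using the zero-delay normalized cross-covariance as the pair-wise term c and, for a small number of synthetic channels, the brute-force search over all orderings mentioned above (the genetic-algorithm search itself is not reproduced here):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

def c(x, y):
    """Pair-wise term: normalized cross-covariance at zero time delay."""
    return np.corrcoef(x, y)[0, 1]

def global_index(order, channels):
    """Eq. (1): sum of c over channels that are adjacent in the given ordering."""
    return sum(c(channels[order[i]], channels[order[i + 1]])
               for i in range(len(order) - 1))

# Synthetic binned activity for N = 6 channels forming two correlated groups.
N, T = 6, 1000
drive_a, drive_b = rng.poisson(1.0, T), rng.poisson(1.0, T)
channels = [drive_a + rng.poisson(0.3, T) for _ in range(3)] + \
           [drive_b + rng.poisson(0.3, T) for _ in range(3)]

# Brute-force search over all N! orderings (feasible only for small N).
best = max(itertools.permutations(range(N)),
           key=lambda order: global_index(order, channels))
print("best ordering:", best, " s = %.2f" % global_index(best, channels))
```
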
In this manner, we defined a new method giving a sub-optimal index of global similarity and synchronization of the neuron ensemble and we implemented it in a software tool: this index represents an
effective technique to obtain information about the behavior and evolution of the neuronal network during various
experiments. To this aim we applied Genetic Algorithms and three time series indexing techniques:
Cross-Correlation, Mutual-Information and Dynamic Time Warping.

Second Approach: Space Amplitude Transform (SAT) Method


The method is based on the concept of the Space-Amplitude Transform (Esposti, 2007). The method, as
said, allows one to approach intrinsically 2D plus time data, i.e. time recordings from a 2D electrode
array, as a 1D plus time signal in order to speed up and simplify the data analysis. The Space-Amplitude Transform, A(s,R), is a geometric transform that projects from a 2D domain set s(x,y,t), e.g.
the usual Raster plot, to a 1D image set I(r,t) = A(s,R), exploiting an Arrangement algorithm R. In the
domain set s(x,y,t), i.e. in the Raster plot, a specific spike is coded as a 0-1 event that is located
at a spatio-temporal coordinate (x,y,t), i.e. in a specific MEA channel (or trace, if a PCA analysis is
implemented) at a specific temporal instant (a function of the system resolution).
The A(s,R) transform arranges the MEA channels in a 1D list r, according to a chosen arrangement
algorithm R (with r = R(x,y)), and associates each element of the list, r, with an ordinate numeric value
(Figure 2). This operation allows building up a new signal, I(r,t), that associates with a domain space coordinate (x*,y*) a specific numeric value (r*), i.e. an image amplitude.

Figure 4. Block diagram representing the successive steps of our method to find the sub-optimal value s from the sorted channel list in the GA output.

So, in the image set I(r,t), a spike s(x*,y*,t*) of the raster plot is the point (r*,t*) of a new 1D function.
In a graphic representation, the output of the Space-Amplitude Transform can be imagined as the
interpolating function of the Raster plot that locally assumes a value assigned on the basis of the Arrangement algorithm. Because the method involves a temporal sampling, an excessively low resolution can lead to computational problems. In fact, too large a temporal window could code some spikes
as simultaneous. For this reason the best use of this algorithm is in high resolution analysis, e.g. with a
resolution < 1 ms. This high resolution approach is ideal, for example, for intra- and inter-burst analysis.
Starting from a 1D signal, in fact, the burst detection task becomes a trivial frequency analysis with
an appropriate threshold value.
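
The following minimal Python sketch (illustrative only, not the authors' code) conveys the idea: an arrangement algorithm R maps each channel position (x, y) to a rank r, every spike (x, y, t) of the raster becomes a point (r, t) of a 1D signal, and burst candidates are found by thresholding the per-bin spike count of that signal.

import numpy as np

def arrangement_row_major(n_rows, n_cols):
    # A simple arrangement algorithm R: map (x, y) to a rank r in row-major order
    return {(x, y): x * n_cols + y for x in range(n_rows) for y in range(n_cols)}

def space_amplitude_transform(spikes, R):
    # Turn raster spikes (x, y, t) into a 1D-plus-time signal: a list of (r, t) points
    return [(R[(x, y)], t) for (x, y, t) in spikes]

def detect_burst_bins(points, bin_ms=1.0, threshold=1):
    # Toy burst detection: time bins whose total spike count exceeds a threshold
    times = np.array([t for _, t in points])
    if times.size == 0:
        return np.array([], dtype=int)
    edges = np.arange(0.0, times.max() + 2 * bin_ms, bin_ms)
    counts, _ = np.histogram(times, bins=edges)
    return np.flatnonzero(counts > threshold)

# Example: spikes on an 8x8 MEA, times in ms
spikes = [(0, 1, 12.3), (3, 4, 12.4), (7, 7, 250.0)]
R = arrangement_row_major(8, 8)
points = space_amplitude_transform(spikes, R)
burst_bins = detect_burst_bins(points, bin_ms=1.0, threshold=1)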

Application Examples
As an example, we show the application of our methods to a concrete analysis case. We compared different neuronal network behaviors recorded on the MEA after administration of two chemical neuronal
inhibitors: AP5, which is a selective NMDA receptor antagonist, and Tetrodotoxin (TTX),
which blocks action potentials in nerves by binding the pores of voltage-gated sodium channels in
nerve cell membranes.
Results reported in (Schneidman, 2003), (Segev, 2004), (Maffezzoli, 2007) and (Esposti, 2007)
confirm the usefulness of quantitative analysis based on signal processing techniques for
investigating neuronal network behavior under different experimental conditions.

Figure 5. Graphical example of the role of the arrangement algorithm R and the Space-Amplitude
Transform A(s,R) with a 3x3 domain matrix.


Such results are also promising toward the building of an integrated processing system dedicated
to Neuroengineering applications.

REFERENCES
Abeles, M. (1991). Corticonics, neural circuits of the cerebral cortex. Cambridge: Cambridge University
Press.
Ash, R. (1965) Information theory. New York: Dover Publications.
Barabasi, A., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509-512.
Bassett, D. S., & Bullmore, E. (2006). Small-world brain networks. Neuroscientist, 12(6), 512-523.
Berdondini, L., Overstolz, T., de Rooij, N. F., Koudelka-Hep, M., Martinoia, S., Seitz, P., et al. (2002).
High resolution electrophysiological activity imaging of in-vitro neuronal networks. Proceedings of the
2nd Annual International IEEE-EMBS Special Topic Conference on Microtechnologies in Medicine & Biology, 241-244.
Bittanti, S. (1999). Theory of prediction and filtering. Bologna, Italy: Pitagora Editrice. (In Italian)
Deci, E. L., & Ryan, R. M. (1991). A motivational approach to self: Integration in personality. In R. Dienstbier (Ed.), Nebraska Symposium on Motivation, 38., 237-238, Perspectives on motivation. Lincoln:
University of Nebraska Press.
Duda, O. R., Hart, P. E., & Stork, D. G. (2000). Pattern classification. New York: Wiley-Interscience.
Eden, U.T., Frank, L.M., Barbieri, R., Solo, V., & Brown, E.N. (2004). Dynamic analyses of neural
encoding by point process adaptive filtering. Neural Computation, 16(5), 971-998.
Esposti, F., Lamanna, J., & Signorini, M. G. (2007). A new approach to the spatio-temporal pattern identification in neuronal multi-electrode registrations. Proceedings of Neuroscience Today 07 (pp. 21-24).
Geisler, C., & Goldberg, J. (1966). A stochastic model of repetitive activity of neurons. Biophysical Journal, 6, 53-69.
Heeger, D. (2000). Poisson model of spike generation (Handouts for teaching). New York: Center for
Neural Science, New York University.
Hely, T., Graham, B., & van Ooyen, A. (2001). A computational model of dendrite elongation and
branching based on MAP2 phosphorylation. J. Theor. Biol., 210, 375-384.
Hodgkin, A., & Huxley, A. (1952). A quantitative description of membrane current and its application
to conduction and excitation in nerve. J. Physiol., 117, 500-544.
Izhikevich E.M. (2007). Dynamical systems in neuroscience: The geometry of excitability and bursting. The MIT press.
Keogh, E., & Ratanamahatana, C. A. (2004). Exact indexing of dynamic time warping. Knowledge and
Information Systems 7, 358-386.


Maher, M. P., Pine, J., Wright, J., & Tai, Y.-C. (1999). The neurochip: A new multielectrode device for
stimulating and recording from cultured neurons. J. Neurosci. Methods, 87(1), 45-56.
Lewicki, M. S. (1998). A review of methods for spike sorting: the detection and classification of neural
action potentials. Network: Computation in Neural Systems. 9, R53-R78.
Maffezzoli, A., Signorini, M. G., Cerutti, S., Gullo, F., & Wanke, E. (2007). A sub-optimal criterion
to estimate and compare neural spiking activity on micro-electrode array technology. Proceedings of
3rd International IEEE EMBS Conference on Neural Engineering.
Marinaro, M., & Scarpetta, S. (2004). Effects of noise in a cortical neural model. Physical Review E, 70, 041909.
MEA types of Multi Channel Systems, from http://www.multichannelsystems.com/products/meaprobes/meatypes/meatypesintro.htm
Mood, A. M., Graybill, F. A., & Boes, D. C. (1974). Introduction to the theory of statistics. Columbus,
OH, US: McGraw-Hill, Inc.
Reza, F. M. (1994). An introduction to information theory. New York: Dover Publications.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural
code. Cambridge, MA: MIT Press.
Potter, S. M. (2001). Distributed processing in cultured neuronal networks. Progress in Brain Research,
130, 49-62.
Schneidman, E., Bialek, W. & Berry, M. J. II (2003). An information theoretic approach to the functional classification of neurons. In Advances in Neural Information Processing 15, 197-204, S Becker,
S Thrun & K Obermayer, (eds.). Cambridge, MA: MIT Press.
Segev, R., & Ben-Jacob, E. (2007). Self-wiring of neural networks. Available at http://arxiv.org/abs/cond-mat/9710352
Segev, R., Baruchi, I., Hulata, E., & Ben-Jacob, E. (2004). Hidden neuronal correlations in cultured
networks. Physical Review Letters, 92(11), 118102.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27,
379-423 & 623-656.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334-350.
Troy, J. B., & Robson, J. G. (1992). Steady discharges of X and Y retinal ganglion cells of cat under
photopic illuminance. Visual Neuroscience, 9, 535-553.
Wagenaar, D. A., Madhavan, R., Pine, J., & Potter, S. M. (2005). Controlling bursting in cortical cultures
with closed-loop multi-electrode stimulation. J. Neurosci., 25: 680-688.
Wang, X. F., & Chen, G. (2001). Synchronization in small-world dynamical networks. International
Journal of Bifurcation and Chaos, 12, 187-192.


Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of small-world networks. Nature, 393,
440-42.

KEY TERMS

Burst: A burst happens when several neurons in a neighborhood spike approximately simultaneously and
at very high frequencies. Bursts last from 100 to 500 milliseconds, and they seem to be fundamental for
neuronal network synchronization as well as for information transmission among neurons.
Dissociated Cultures: Dissociated cultures are cultures in which neurons, taken from an already
formed brain, are chemically and mechanically treated in order to remove existing connections, and
then placed on MEA devices, allowing the analysis of synaptogenic processes.
Long-Term Depression (LTD): LTD in neurophysiology consists in the weakening of a neuronal
synapse obtained by prolonged low-frequency stimulation of the pre-synaptic neuron. It is another
feature of neuronal plasticity, like LTP, and it is considered to be involved in learning and memory
formation processes.
Long-Term Potentiation (LTP): LTP consists in an increase of the strength of chemical synapses
both in experimental preparations (in vitro) and in living animals (in vivo). It is induced by applying
a series of short, high-frequency electric stimuli to the pre-synaptic neuron, which potentiate the synapse
for minutes to hours. LTP is involved in synaptic plasticity in living animals, providing the foundation
for a highly adaptable nervous system, and thus in memory formation and behavioral learning. LTP was
discovered in the mammalian hippocampus by Terje Lømo in 1966.
MEA: A microelectrode array (MEA) is an arrangement of several, typically 64, electrodes allowing
the targeting of several sites for stimulation and extracellular recording at once.
Neuronal Network: In general, a biological neural network is composed of a group of physically
connected or functionally associated neurons. Connections, called synapses, are usually formed from
axons to dendrites. Connectivity and mean activity of a neuronal network depend on two different sorts
of systemic equilibrium, i.e. Hebbian and homeostatic rules: the first is an activity-dependent potentiation of synaptic strength and the second is a feedback control of the mean activity of the network.
Both mechanisms work through long-term potentiation (LTP) and long-term depression (LTD),
as explained in the related key terms.
Slice Cultures: Slice cultures are neuron cultures in which existing connections are not removed
(compare with: Dissociated cultures). A typical set-up is the analysis of the mouse hippocampus, which is sectioned
into slices that, when placed on MEA devices, allow the analysis of mature connections.
Spikes: An action potential or spike is an electro-chemical discharge traveling along the membrane
of a cell, rapidly carrying information within and between neurons and indeed tissues. An action potential is a rapid change of the polarity of the voltage from negative to positive and then back to negative,
the entire cycle lasting on the order of milliseconds. This cycle shows a rising phase, a falling phase,
and finally an undershoot. After spiking episodes cells are unable to spike for a time called the refractory
period; usually this lasts for 2-5 ms, depending on the neuron type.


Chapter XXXIII

The System for Population Kinetics:
Open Source Software for Population Analysis
Paolo Vicini*
University of Washington, USA

Abstract
This chapter describes the System for Population Kinetics (SPK), a novel Web service for performing
population kinetic analysis. Population kinetic analysis is a widely-used tool for extracting information
about the probability distributions of unknown parameters in kinetic models. The statistical population
model is usually hierarchical, with a nested structure encompassing both variation between subjects and
residual unexplained variation associated with the model predictions. The complexity of the analysis is
largely driven by the nonlinearity of the models employed. Here, we provide a concise introduction to
the topic and a historical perspective for the benefit of the reader who is new to these concepts. Next, we
briefly describe the SPK open source system and its multi-tiered architecture, indicating the user goals
it set out to achieve, and elucidating its practical usage with examples.

Introduction
Population kinetic analysis is an increasingly important tool for modeling and analyzing biomedical
kinetic (time series) data affected by an unfavorable signal-to-noise ratio and relatively short duration.
Population kinetics is characterized by the simultaneous modeling of population typical values of
kinetic parameters and the variability of these kinetic parameters (between subjects) as well as the
residual errors in measurement. Its historical development and use, especially in drug development,
has been extensively reviewed elsewhere (Pillai et al., 2005). Since the pioneering work of Beal and
Sheiner (Beal and Sheiner, 1982) and the development of the NONMEM software (Beal et al., 1989-2006), population kinetics has been invoked as a useful, and sometimes essential, step in understanding the determinants (demographic, clinical and genetic) of biological variation among experimental
subjects. This is particularly useful in the presence of sparse data at the individual level. What appears at
first to be random variability is gradually explained by invoking deterministic covariates in a process
often described as model building (Mandema et al., 1992; Ette and Ludden, 1995). Population kinetic
analysis describes the information available at the population level: both typical values and variability
estimates. By providing reliable population estimates, these can also be used to inform likely kinetic
profiles at the individual subject level. An application where this concept has been applied is individualized, pharmacokinetic-based dosing (Jelliffe et al., 1998; Salinger et al., 2006, among others). Indeed, it
can be argued that the first step towards individualized medicine is the understanding of the magnitude
of variation among subjects in drug disposition and effect, the knowledge of which then allows one to
deploy statistical models that link such observed, quantified variation to other covariates more amenable
to direct measurement. The next step is the individualization of models of drug disposition and effect
through the availability of individual covariates, thus allowing customized prediction of the events
surrounding dose administration (Sheiner and Beal, 1992). Population kinetics is complicated by the
fact that the underlying models for drug disposition and effect (termed pharmacokinetics and pharmacodynamics, or PK-PD, respectively), or indeed any other biological phenomenon, are nonlinear in their
parameters. That the parameters vary among subjects according to unknown probability distributions
adds further layers of complexity. The nonlinear dependence on the parameters prevents the likelihood
function required for model fitting from being written in closed form (Davidian and Giltinan, 1995).
Thus, its optimization requires the solution of a multidimensional integral. Indeed, even numerical
evaluation of the likelihood is extremely demanding, so much so that optimization of the true likelihood
function remains to a large degree impractical.
The availability of the NONMEM software (Beal and Sheiner, 1982), which implements various
linearization-inspired parametric approximations to the maximum likelihood problem, as well as the
appearance of other modeling software tools, has greatly contributed to the impact of population kinetics on the science and practice of drug development (Sheiner and Steimer, 2000). However, since
population kinetics is, in its present realizations, tackled via numerical software predicated on a series
of assumptions and approximations, there remains the need for complementary approaches that build
on modern software development practices. These should supply the user with a variety of established
and novel approximations or approaches to population maximum likelihood, so that the model building
process can be adapted to an ever greater variety of data sets and experimental situations. The System
for Population Kinetics (SPK) is being developed at the University of Washington with these issues in
mind. The SPK has been developed as part of the Core Research and Service missions of the Resource
Facility for Population Kinetics (RFPK), a NIH/NIBIB research resource devoted to the development
and application of modeling and simulation technology to relevant biomedical problems.
This chapter describes the philosophy behind the SPK, its components and its implementation as a
web service, currently available at http://spk.rfpk.washington.edu. The SPK is first and foremost an open
source product, and as such it builds on the availability of many open source tools. Its flexible modular
structure allows rapid deployment of new features and user documentation. With the open source release
of the SPK, it is our hope that this software tool will become a collaborative effort spanning many user
communities and developers associated with population kinetic analysis.


The Software
The SPK software is designed to quickly and efficiently build mathematical and statistical
interpretative models for population kinetic data affected by substantial variation among individuals
and unfavorable signal to noise ratio. Examples of such data are the measurements arising from clinical
and preclinical trials, where the variation among subjects is essentially dictated by intrinsic biological
variability and the signal to noise ratio is low. This unfavorable signal behavior is due to the characteristics of most bioassays, together with the related practical difficulties of carrying out biomedical
experiments. The theoretical framework for the model building process is provided by mixed effects
models (Figure 1), where the variation among individuals coexists with the residual unknown variation
associated with the measurements. Among others, the parametric approach to this methodology has
been well described by Davidian and Giltinan (1995), Vonesh and Chinchilli (1997) and Pinheiro and
Bates (2000). In this mixed effects framework, the population response is characterized by variation
among individuals (between-subject variation, BSV). Such variation can be modeled as the output of
a structural model (based on algebraic or differential equations, built using available knowledge of the
system) conditional on statistical distributions for the model parameters. These statistical distributions
are functions of fixed effects (features that do not change across subjects, such as expected values and
variances) and random effects (features that differ among subjects, such as the individual-specific value
attained by a metabolic parameter of interest). The interplay between fixed and random effects shapes
uncertainty in the population response. A portion of this uncertainty stems from the true, but unknown,

Figure 1. A pictorial representation of the mixed effects modeling framework. The parameters informing
the kinetic model differential equations (top left) are random variables with their probability distributions (top right). The combination of random parameters and differential equation systems provides a
mathematical model which essentially depends on the measures of location and spread of the parameters' probability distributions, i.e. the fixed effects (center). Lastly, random variation (middle right) is
superimposed on the model, giving rise to a nested statistical model (bottom) which provides integrated,
formal descriptions of both between-subject variability and random residual variation.

dA1(t)/dt = -Ka A1(t) + Dose(t)

dA2(t)/dt = +Ka A1(t) - (CL/V) A2(t)

(Ka, CL, V) ~ LN(θ, Ω)

s(t, Ka, CL, V) = A2(t)/V

y = s(t, θ, b)

ε ~ N(0, Σ)

y = s(t, θ, b) + ε


biological variability. Additional random fluctuations, provided for example by noise in the data, superimpose additional uncertainty onto the model output (residual unknown
variation, RUV), so that the quantities being modeled are not assessed directly, but are always affected
by statistical noise.
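
In generic notation (a standard textbook formulation of the mixed effects framework, not necessarily the symbols used internally by the SPK), this nested structure can be summarized as

y_ij = f(t_ij, θ, b_i) + ε_ij,    b_i ~ N(0, Ω),    ε_ij ~ N(0, Σ),

where y_ij is the j-th observation on subject i, f is the structural model, θ collects the fixed effects, the random effects b_i account for BSV and the residual errors ε_ij account for RUV.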
The modeler's task is to separate BSV and RUV from the underlying expected values, thus separating
biological trends from variability and, in turn, variability from extraneous, experimental sources of
noise. Knowledge of the statistical composition of BSV introduces a second modeling step: building a
relationship between BSV features and individually measured variables (covariates). These demographic,
clinical or genetic covariates are related to the observed BSV in an attempt to determine which population features may be associated with or responsible for variability. The ultimate goal of the process is
to develop individualized predictions for drug disposition or therapy effectiveness, so as to lead to an
increased understanding of the subject-specific factors associated with these processes.
Mixed effects models have been extensively studied, and have been applied in a variety of fields.
The traditional approach to parameter estimation is based on the well-known maximum likelihood
method. More precisely, the likelihood being maximized is often a marginal likelihood. This arises
from an appealing feature of mixed effects modeling: the ability to focus estimation efforts on fixed
effects. This is accomplished by, essentially, averaging the random effects across the population and
thus marginalizing the likelihood function. In this way, the likelihood function becomes a function of
the fixed effects alone, thus drastically reducing the dimension of the optimization problem. These and
related issues have been well reviewed in (Ette and Williams, 2004).
As mentioned, an additional complication in the population analysis application domain is that the
BSV model generally appears nonlinearly in the structural model. This causes a nonlinear dependence of
both the fixed effects and the random effects on the structural model function. While in Gaussian linear
mixed effects models the marginalization of the likelihood is straightforward and leads to an analytic
solution, for nonlinear mixed effects the integration cannot be performed as easily, even in the presence of
Gaussian BSV and RUV. In fact, for the typical nonlinear mixed effects problem of biomedical interest,
accurate integration of the likelihood function requires numerical solution of very high-dimensional
integrals, which in turn may require techniques of Monte Carlo simulation (Dartois et al., 2007) or
quadrature approaches (Samson et al., 2007), which can be very computationally intensive and thus
impractical even for relatively simple problems. The key developments in this area are due to Beal and
Sheiner, whose pioneering work on nonlinear mixed effects models, motivated by the need to precisely
estimate population parameters for empirical Bayesian dose individualization, is at the basis of the
widely used NONMEM software (Beal and Sheiner, 1980). Their approach was to develop a battery of
approximations to the nonlinear mixed effects likelihood function, all based on Laplace's approximation
to the integral and various linear approximations to the underlying structural model. Increased accuracy
in the approximation can be achieved at the expense of increased computational requirements. Such
model linearization-based approximations have been independently investigated by our group (Bell,
2001). The key factor here is the required accuracy of the nonlinear model approximation. Beal and
Sheiner's first approach involved approximating the expected value and variance of the model with a
first-order and zero-order Taylor series, respectively, centered at zero random effects. This is normally
referred to as the FO, or First-Order, method. The reasoning here was that misspecification of higher-order
moments, such as that arising from a zero-order approximation to the covariance, would be less dire than
misspecification of expected values. Given the Taylor series approximation, the FO method would be
expected to be reasonably accurate for data arising from small BSV. The FO method allowed approaching
nonlinear mixed effects modeling for models of substantial complexity. Increasing the computational
burden led to the development of an additional method based on linearization of the model function
around individualized estimates of the random effects (the FOCE, or First-Order Conditional Estimation
method). This approach increases the accuracy of the Taylor approximation, but requires the solution
of many more optimization problems, and is thus much more computationally demanding. In addition,
the method can be made more accurate by allowing the individualized random effects to appear in the
expression for the model-predicted observation covariance. This is usually described by stating that
there is interaction between random effects in the expected value and the variance of the data (this is
the FOCE-I, or FOCE with Interaction method). Lastly, another approach to approximating the required
integral is based on calculating individual second derivatives (the Hessian) at an optimized estimate,
as opposed to using averaged second derivatives. This approximation is close in spirit to the Laplacian
approximation to the integral. All these approximations are implemented in the software NONMEM
and described in its user documentation. They are also described with some detail in (Wang, 2007).
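
To fix ideas, the quantity being approximated can be written, in generic notation rather than that of any specific software, as the marginal likelihood obtained by integrating the random effects out,

L(θ, Ω, Σ) = ∏_i ∫ p(y_i | b_i; θ, Σ) p(b_i; Ω) db_i.

The FO method linearizes the structural model around zero random effects, y_i ≈ f_i(θ, 0) + F_i b_i + ε_i with F_i = ∂f_i/∂b_i evaluated at b_i = 0, so that y_i becomes approximately Gaussian with mean f_i(θ, 0) and covariance F_i Ω F_iᵀ + Σ_i and the integral is available in closed form; FOCE performs the same expansion around the individual empirical Bayes estimates of b_i instead of zero.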
Besides these approximate maximum likelihood methods, the SPK also provides the
infrastructure to deal with two-stage approaches and nonparametric population analysis methods. In
addition, one issue that the modeler has to deal with is how to distinguish between the various approximations, or select the best one. While the more accurate approximations are certainly more desirable for
a variety of reasons, the numerical cost for many problems can be challenging or even insurmountable.
On the other hand, the appropriateness of the simpler approximations may be doubtful, especially for
data arising from large BSV. We have addressed this problem by incorporating in the SPK a tool for
likelihood profiling, where the true likelihood function is evaluated via Monte Carlo integration at the
optimal fixed effects estimate provided by the approximation of choice. Evaluation of the likelihood
allows the user to determine whether the approximation has led to a solution that is reasonably representative of the (true but unknown) maximum likelihood solution. Our approach to likelihood profiling
is described below, together with a narrative of the SPK architecture.

The SPK Architecture


The SPK is articulated in a multi-tiered architecture (Vicini and Westhagen, 2004) that proactively uses
available World Wide Web technology. To our knowledge, there are few software tools for biomedical modeling that have been structured as a web service, one example being the Virtual Cell (Moraru
et al., 2002). By and large, however, the biomedical software development and deployment approach
remains that of operating-system specific standalone executables or locally compiled programs. This
poses issues of compatibility across platforms and consistency across compilers, especially when high-level scientific computing is involved and calculations at the limit of numerical accuracy are required.
On the other hand, the choice of establishing SPK as a web service allows great flexibility in terms of
accessible platforms: since all that is required at the user end is a web browser and the Java Runtime
Environment (Sun Microsystems), the user could conceivably use any computing environment, from
Windows to Linux. In addition, since the scientific calculations performed by SPK all take place on a
centralized server, there is no issue of ambiguity of numerical results across platforms, where small
differences in output could be caused by machine round-off or heterogeneous compiler flags, and could
ultimately turn into difficulties of output interpretation for the end user.
The SPK multi-tiered architecture contains a client tier, an application tier and a foundation tier.
The client tier is comprised of a web server and the user application, where the user interacts with the
remainder of the architecture. Since a large part of what the SPK user does is actually develop models
for data analysis, we have called the user application the Model Design Agent (MDA). The third tier
is the CSPK (Computational Server for Population Kinetics). This foundational tier lives on servers
that could be remote from the user, and includes all the compiled computer code necessary for the
calculations of model solution and identification subroutines. The middle tier is the ASPK (Application
Server for Population Kinetics), which is comprised of the SPK Compiler and a job queue server that
maintains the user connection with a database where user models and datasets (together with sample
models and datasets accessible to all users) are stored. The three components of the SPK are described
in detail below.

SPK Computational Tier: The CSPK (Computational Server for Population Kinetics)
The CSPK is a set of C++ routines and libraries that includes customized code and otherwise available open source code for carrying out the scientific computations necessary for simulation and fitting
nonlinear mixed effects models of various kinds. Here, by simulation and fitting we mean to indicate
two steps in the model development process. Given a certain model structure and a set of unknown
parameters, fitting implies the presence of (real or simulated) data which the model parameters are
adjusted to match. The matching is done according to some criterion, which in the case of the SPK is
an approximation to the maximum likelihood problem for nonlinear mixed effects models. As we have
mentioned above, approximating the maximum likelihood problem for nonlinear mixed effects models
requires linearization of the model function, since estimation of the fixed effects can only be done once
the random effects have been integrated out (marginalized). Briefly, the CSPK uses nonlinear regression
to minimize a given likelihood approximation. The SPK optimizer subroutine was developed in-house,
and is based on a modified Gauss-Newton method requiring convergence in the unknown parameters.
This tier also includes a server to provide the user with the current status of the parameter optimization
(trace server). Most recently, the CSPK design has been extended to allow both distributed and parallel computing via PVM (http://www.csm.ornl.gov/pvm/). The derivatives required for estimation are a
combination of numerical implementations for the maximum likelihood approximations (Bell, 2001)
and algorithmic (automatic) differentiation of computer programs as implemented in the open source
software CppAD (C++ Automatic Differentiation, http://www.coin-or.org/CppAD/). The use of automatic
differentiation allows for accurate optimizer gradient calculation and complete generality in the user's
algebraic or ordinary differential equation model. For optimization, the user specifies the number of significant digits that are required in the estimate. Once this is met, the optimization is deemed successful.
Linear algebra calculations are done by interfacing our own code with the LAPACK (http://www.netlib.
org/lapack/) and ATLAS (Automatically Tuned Linear Algebra Software, http://math-atlas.sourceforge.
net/) libraries. Basic arithmetic support is provided by the GNU Multiple Precision Arithmetic Library
(GMP, http://gmplib.org/) and the GNU Scientific Library (GSL, http://www.gnu.org/software/gsl/). The
availability of open source code has been used extensively in the development of the CSPK.
The user's model is parsed using Xerces (http://xerces.apache.org/xerces-c/), and a driver program
allows for translation of the user's specifications into XML and eventually C++. The XML markup language
that is used in SPK has been developed internally to facilitate the definition and parsing of results of
nonlinear mixed effects models. Several markup languages have been developed to better interface
models of various scales and temporal and spatial resolutions, like CellML (Lloyd et al., 2004) and
SBML (Hucka et al., 2003). The internal XML code used by SPK fits in this line of research.
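
As a hedged illustration of the significant-digits stopping rule mentioned above (the precise criterion used by the SPK optimizer may differ), a relative-change test for d significant digits in every parameter can be written as follows; names are illustrative.

import numpy as np

def has_converged(x_new, x_old, significant_digits):
    # True when every parameter agrees with its previous value to the requested
    # number of significant digits (a relative-change test)
    x_new = np.asarray(x_new, dtype=float)
    x_old = np.asarray(x_old, dtype=float)
    tol = 10.0 ** (-significant_digits)
    rel_change = np.abs(x_new - x_old) / np.maximum(np.abs(x_new), 1e-12)
    return bool(np.all(rel_change <= tol))

# Example: ask for 3 significant digits
print(has_converged([1.0002, 5.1], [1.0001, 5.0999], 3))  # True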


Middle Tier: The ASPK (Application Server for Population Kinetics)


The ASPK serves as the communication hub of the SPK. The collection of software programs that form
the ASPK coordinates communication between the client tier and the core tier of the system. It is built
on a web technology that is not that different from those utilized by financial institutions around the
world, including Secure Socket Layer (SSL) technology. The ASPK allows the user to submit models
and data to the CSPK through the MDA, and provides access to databases containing all the users past
runs, models and data, in addition to model and data libraries that the user could employ as templates.
The MySQL (http://www.mysql.com/) database is employed due to its reputation as the open source
standard. Apache Tomcat (http://tomcat.apache.org/) technology is at the basis of the web server. The
web service also includes a job queue server, where a continuously running daemon monitors input
from the users and submits pending jobs to the CSPK when applicable. The ASPK parcels traffic to and
from the database and the CSPK.

Client Tier: The MDA (Model Design Agent)


The MDA is the end user portal to the computational capabilities of the SPK. The MDA was written
using the Java Runtime Environment (JRE, Sun Microsystems). The JRE allows Java applications to
run on the user's local machine, and the almost universal compatibility of the Java language allows for
substantial flexibility in the design and deployment of the MDA. The MDA interacts with the ASPK to
submit the user's jobs (models and data) and retrieve results of past runs and simulations.
For biomedical model development, user-friendliness is often in the eye of the beholder. Model building in this environment might be carried out by scientists who are conversant in the statistical basis of
the process, or by scientists whose proficiency is more biologically-oriented. Ideally, a model design
environment should appeal to both categories of users. Specifically, lack of knowledge of technical aspects
of the maximum likelihood problem at the core of the SPK system should not be a hindrance to the less
technically savvy user, but enough of these quantitative aspects should be maintained in the interface
so that chances of misusing it are low. We have addressed these conflicting priorities by first developing
a wizard interface, where the user is guided step by step in the process of model building. Depending
on initial choices the user makes (individual or population model development, parametric or nonparametric models, and so on), the wizard changes structure and only asks the user for information that is
specifically relevant to the specific modeling exercise. In addition to the text-based model development
interface, we have recently added a graphical model building option where the model is created on the
screen using a set of circles and arrows to represent model compartments and flow and control transfers
across the system's components, thus freeing the user from writing a system of differential equations.
Lastly, the MDA includes capabilities to generate plots and tables from the analysis results.
The MDA is written entirely in Java, thus maximizing its compatibility with heterogeneous systems.
Three XML documents are built in the MDA from the user input (source, model and dataset) and data
specification and transmitted to the web server when a job is submitted. The web server performs revision control, and then forwards the documents to the database. The MDA also includes auto-updating
by Java Web Start and data security by SSL.


The SPK Analysis Flow


The user's point of entry is the MySPK website (http://spk.rfpk.washington.edu), where the user enters
his or her account information. From here the user can deploy the MDA, or check job status, file bug
reports, and perform other such tasks. Most commonly, the user will employ the MDA to associate a
model to a certain dataset. Currently, both model and dataset structure loosely follow the specifications provided by the NONMEM (and especially NM-TRAN) language. Early in the development of
SPK, we decided that, given the wide availability of NONMEM and especially its use in the literature,
it would not be necessary to develop an entirely different model specification and data management
language. Thus, the user can develop models in the SPK using a language with which there should be
some familiarity. The dataset contains all available measurements, together with the dosing and input
schedules. This allows for clear separation between knowledge and interpretation, or, in other words,
between the known inputs and outputs of the study (doses, concentrations, effects), which form the en-

Figure 2. The SPK Job Information window. This window provides access to the various components of
a SPK job, including the model, the dataset, the XML formulation for the job input and the job output,
the job history, from submission to completion, a link to the job parent and the opportunity to download
the run results. Both models and datasets are subject to version control. To enhance collaborative efforts, the job can also be shared with another user: if that feature is selected, the other user will see the
job appear in his/her job list. Lastly, a new job can be initialized from the current input or from the final
output, or a likelihood profiling job can be started.


tire knowledge one has of the underlying system shaping these responses, and whatever interpretative
tools the user may establish (models, either structural, or for BSV or RUV). It may be worthwhile to
note that other modeling software tools such as SAAM II (Barrett et al., 1998) include the dosing and
measurement locations as part of the model instead of the data, reflecting perhaps some philosophical
differences as far as what is known and unknown about the system.
SPK allows version control of both dataset and model (Figure 2). If a dataset is changed by, e.g.
deleting a subject or a data point, this can be saved as a new version of the same dataset. If a model is
changed by, e.g. modifying the BSV model for a parameter from Gaussian to log-normal, it is saved as
a new version of the same model. In the SPK framework, various approximation methods (First-Order,
etc.) and optimizer and ODE integration tolerances (required number of significant digits, etc.) are
considered part of the model, and thus changing them gives rise to a new model version. The merging
of model and dataset, together with the output arising from the required analysis, forms a job, which
is the database unit of the SPK. A job can have different outcomes (successful run, optimizer failure,
integrator failure, ) and can be recalled from the user database when needed. An attractive feature of
the system is the ability to use jobs that have already run to spawn new runs, such as for example using
the output of a successful First-Order run to seed a First-Order Conditional Estimation run. Nonlinear
mixed effects models, as a rule, are extremely sensitive to starting values, so the availability of refined
starting values could be the limiting factor in the success or failure of an analysis. Given the relative ease
with which a First-Order job can be run, this can form a good starting point for a more computationally
demanding, and also more computationally delicate, First-Order Conditional Estimation run. The SPK
has streamlined the process of using jobs as starting points for new jobs, thus facilitating the model
development pipeline. In this context, the originating job is called the parent job, and can be recalled
from the child job at the touch of a button, thereby saving jobs in a tree structure.
On completion of the job, the user has access to a variety of post-run analysis tools, including both
graphical and tabular output (Figure 3). SPK automatically tabulates all parameters, model outputs and
defined additional variables, as well as various flavors of population and individual kinetic model prediction and weighted and unweighted residuals. It can be safely said that tools to qualify the performance
of nonlinear mixed effects models are in their infancy. Just as an example, model performance can be
assessed at the individual level (how well does the model fit individual data) and the population level
(how well does the model account for the specific variability characteristics of the given experimental
data). These and other issues regarding model specification have been covered elsewhere (Ette and Ludden, 1995). The SPK allows for multilayered output where all these features can be assessed.

Usage
An example of usage of the SPK software follows. The hypothetical scenario parallels that of a scientist who is trying to develop the best nonlinear mixed effects model for a set of data, and in doing so
uses various approximation methods. The challenge in this context would be to determine if the solution obtained via a particular method (such as a First-Order approximation) is good enough, when
evaluated against the degree of model nonlinearity, or whether a more complex method (such as the
Expected Hessian) should be employed instead. The tools being used involve model building for the
data set, fixed effects estimation using the approximations available in SPK and likelihood profiling
for the approximations.


Figure 3. Examples of graphical output provided by the SPK. From top and clockwise: the initial wizard
model building window; a histogram of random effects; two examples of likelihood profiling; a plot of
a random effect (V) vs. a clinical covariate (WT); a diagnostic plot of population prediction (at zero
random effects, PRED) vs. data values (DV).

The dataset consists of plasma concentrations of cadralazine (Wakefield et al., 1994) measured in 10 cardiac failure patients after administration of the drug as a 30 mg intravenous (IV) bolus injection. This is well modeled by a kinetic model with two unknown parameters, the apparent volume
of distribution V and the decay rate constant K. We assume that the parameters of the kinetic model for the
i-th subject are log-normally distributed around their mean, given by the vector θ:

V(i) = θ1 exp(b_i,(1)),   K(i) = θ2 exp(b_i,(2))

where the subscript (j) signifies the j-th element of a vector. We assume that the random effects b have
mean zero and covariance:

Var[b(1)] = Ω1,1,   Cov[b(1), b(2)] = Ω1,2 = Ω2,1,   Var[b(2)] = Ω2,2

The error in the data can also be estimated as the variance of the random vector ε:

Var[ε(1)] = Σ1,1
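
For illustration, a minimal simulation of this model under the stated assumptions (30 mg IV bolus, log-normal V and K, additive residual error) might look as follows in Python; the sampling times and the numeric values of θ, Ω and Σ below are placeholders, not the estimates reported in Table 1.

import numpy as np

rng = np.random.default_rng(42)
dose = 30.0                                           # mg, IV bolus
times = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 24.0])    # h, illustrative sampling times
theta = np.array([18.0, 0.15])                        # placeholder typical values of V (L) and K (1/h)
omega = np.array([[0.02, 0.00],                       # placeholder BSV covariance of (b1, b2)
                  [0.00, 0.10]])
sigma = 0.01                                          # placeholder RUV variance

def simulate_subject():
    b = rng.multivariate_normal(np.zeros(2), omega)   # random effects with mean zero
    V = theta[0] * np.exp(b[0])                       # V(i) = theta1 * exp(b_i,(1))
    K = theta[1] * np.exp(b[1])                       # K(i) = theta2 * exp(b_i,(2))
    conc = (dose / V) * np.exp(-K * times)            # one-compartment model after an IV bolus
    return conc + rng.normal(0.0, np.sqrt(sigma), size=times.size)

population = np.array([simulate_subject() for _ in range(10)])    # 10 simulated subjects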


Table 1. Results of First-Order (FO) and Expected Hessian (EH) linearized mixed effects model estimation in a population of 10 subjects administered cadralazine. The fixed effects and the precision
of their estimates (as %CV) are reported. The two estimators provide different estimates, slightly different in some cases
and more so in others. This is due to two reasons: one is that the dataset is relatively small (only 10
subjects), which may impair the accurate and precise estimation of BSV; the other is that the estimation
methods are approximations of the true likelihood function and thus may provide biased
and inconsistent estimates.
FO Estimates
(FO Objective = -46.277)

Value
%CV

EH Estimates
(EH Objective = -39.913)

1,1

1,1

3.11

18.8

0.133

2.23

18.6

0.168

13

43

11

32

1,2

2,2

1,2

2,2

Value

0.0226

0.00386

0.121

0.0490

0.0171

0.0317

%CV

81

144

29

33

63

25

In this problem, there are six population fixed effects (two means, three covariance elements and
one RUV variance) and two individual random effects for each subject. There are several methods to
provide estimates of the fixed effects, each using a particular approximation of the likelihood function.
First-order linearizations of the nonlinear model of the data give rise to tractable likelihoods, and are
most commonly used. The computational complexity increases with the accuracy of the approximation. There are alternative approaches that do not minimize the likelihood. For example, one can apply
an iterative scheme based on EM (Expectation-Maximization)-algorithms, e.g. the Iterative Two Stage
(Steimer et al., 1984). We used both the FO and Expected Hessian (EH; or FOCE, First-Order Conditional
Estimation in NONMEM) methods to approximate the likelihood in optimizing the likelihood function.
The FO method approximates the nonlinear two-compartment model around the fixed effects assuming
random effects equal to zero, while the EH method approximates the nonlinear two-compartment model
around the fixed effects using individualized empirical Bayesian estimators. Since the fixed effects are
estimated parameters, these methods also provide confidence limits for the estimates, and consequently
confidence limits on the mean and variance of the population.
The application of these models allows one to explicitly represent BSV and RUV, together with the
structural model, in one synthetic formulation. It is generally recognized that first-order approximations have the potential for introducing serious biases; however, it has not been clear how to assess the
accuracy of a given estimate. Each estimate is an approximate solution to an approximate optimization
problem. The underlying statistics also suffer by being filtered through these two approximation layers. The validity of these approximations and their associated statistics remains questionable, since the
degree of approximation, and especially its impact on the solution, remains unknown. SPK includes
numerical tools for addressing these issues. In particular, we have developed a novel approach to the post-optimality evaluation of mixed effects models. Briefly, the true marginal likelihood function is directly
estimated at, and in the vicinity of, the fixed effects estimate by Monte Carlo sampling. This provides
an approximate profile of the true likelihood in each component of the fixed effects. As can be seen in
Figure 4, these profiles of the multidimensional likelihood surface provide an intuitive way to assess the


Figure 4. Monte Carlo likelihood evaluation and profiling performed at the optimal solutions of FO
(left six panels) and EH (right six panels) linearized mixed effect model estimation (as shown in Table
1). As can be seen, the FO solution appears seriously biased when the likelihood is profiled at the estimate. The large discrepancy between MC and FO objective values is also cause for concern regarding
the quality of the FO estimate. The original fixed effects are scaled to facilitate optimization: while
alpha-1 and alpha-2 refer to θ1 and θ2, alpha-3, alpha-4 and alpha-5 are monotone increasing transformations of the Cholesky factor of the BSV covariance matrix, and alpha-6 is the natural logarithm
of the RUV variance.

FO Objective = -46.2771, MC Objective = 1.47

EH Objective = -39.9132, MC Objective = -44.49


quality of the estimate that arises from the nonlinear model approximation. This post-optimality check
could be applied in principle to all population analysis approaches, including two-stage methods, thus
providing a universal method to assess the quality of estimates provided by computationally efficient
linearization approaches. This set of methodologies is available to users of the SPK web service.
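
One simple way to perform such a Monte Carlo evaluation of the otherwise intractable marginal likelihood at a candidate fixed effects value is sketched below for the cadralazine model of the previous section; this is a plain Monte Carlo average over the random effects, offered as an illustration rather than as the integration scheme actually implemented in the SPK.

import numpy as np

def mc_neg2_loglik(y, times, theta, omega, sigma, dose=30.0, n_samples=5000, seed=0):
    # Monte Carlo estimate of -2 log marginal likelihood, integrating the random
    # effects out of the conditional Gaussian likelihood by simple sampling
    rng = np.random.default_rng(seed)
    total = 0.0
    for y_i in y:                                               # loop over subjects
        b = rng.multivariate_normal(np.zeros(2), omega, size=n_samples)
        V = theta[0] * np.exp(b[:, 0])
        K = theta[1] * np.exp(b[:, 1])
        pred = (dose / V)[:, None] * np.exp(-np.outer(K, times))     # (n_samples, n_times)
        resid = y_i[None, :] - pred
        loglik_b = -0.5 * (resid ** 2 / sigma + np.log(2.0 * np.pi * sigma)).sum(axis=1)
        m = loglik_b.max()
        log_marginal = m + np.log(np.mean(np.exp(loglik_b - m)))      # stable log-mean-exp
        total += log_marginal
    return -2.0 * total

# Profiling one fixed effect around a candidate estimate (y, times, omega, sigma as above):
# profile = [mc_neg2_loglik(population, times, np.array([v, 0.15]), omega, sigma)
#            for v in np.linspace(15.0, 22.0, 15)]

Plotting such a profile against the corresponding approximate objective, as in Figure 4, gives a visual check of how far the linearized solution sits from the true likelihood surface.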
Several improvements are possible to the current version of the SPK software. Clearly, the next step
would be to provide some guidance to the user in terms of interpretation of the analysis output. The
availability of the SPK under open source licensing provides several avenues along which this could
take place. In the future, existing tools for analysis of covariates and their role in shaping BSV could
also be integrated under the SPK framework. Other possibilities for added features include tools to facilitate experimental design, more sophisticated ODE integration approaches, tools for parallelization
of population analysis runs and further likelihood approximation methods.

Conclusion
We have described here the System for Population Kinetics (SPK), a novel web service for user-driven
model building in population kinetics. The SPK contains a number of features available in other programs, such as approximation methods for the nonlinear mixed effects modeling framework and model
performance analysis capabilities. It also contains features that are novel and not available elsewhere,
among which are likelihood profiling methods for assessing likelihood approximation quality. The SPK
should be a positive addition to the current array of tools for population kinetic analysis, and its availability under open source licensing is a point of strength.

Acknowledgment
This work was partially supported by NIH/NIBIB grant P41 EB001975. Current information about
the SPK is available at http://spk.rfpk.washington.edu. The research, service and training activities of
the Resource Facility for Population Kinetics are described at http://www.rfpk.washington.edu. David
Foster and Paolo Vicini have served as the Principal Investigators of the Resource. The work of Jiaji Du,
Andrew Ernst, Robert Goddard, Sachiko Honda, David Salinger and Mitchell Watrous in the coding
and testing of the SPK; the initial prototyping of the SPK by Bradley Bell; the algorithmic development
by James Burke; the pivotal role of Alan Westhagen in the development and deployment of the three-tiered architecture; the contribution of Hugh Barrett, Robert Bies and Claudio Cobelli in actively using
test versions of the SPK; and the valuable input over the years of the members of the RFPK Advisory
Committee are gratefully acknowledged. NONMEM is a registered trademark of The Regents of the
University of California. Java is a trademark of Sun Microsystems, Inc.

NOTE
* Credit for this work also goes to the Resource Facility for Population Kinetics in the Department of
Bioengineering, University of Washington.


References
Barrett, P. H., Bell, B. M., Cobelli, C., Golde, H., Schumitzky, A., Vicini, P., et al. (1998). SAAM II:
Simulation, analysis, and modeling software for tracer and pharmacokinetic studies. Metabolism, 47,
484-492.
Beal, S. L., & Sheiner, L. B. (1980). The NONMEM System. The American Statistician, 34, 118-119.
Beal, S. L., & Sheiner, L. B. (1982). Estimating population kinetics. Critical Reviews in Biomedical
Engineering, 8, 195-222.
Beal, S. L., Sheiner, L. B., & Boeckmann, A. J. (Eds.) (1989-2006). NONMEM users guides. Ellicott
City, Maryland: Icon Development Solutions.
Bell, B.M. (2001). Approximating the marginal likelihood estimate for models with random parameters.
Applied Mathematics and Computation, 119, 57-75.
Dartois, C., Lemenuel-Diot, A., Laveille, C., Tranchand, B., Tod, M., & Girard, P. (2007). Evaluation
of uncertainty parameters estimated by different population PK software and methods. Journal of
Pharmacokinetics and Pharmacodynamics, 34, 289-311.
Davidian, M., & Giltinan, D. (1995). Nonlinear models for repeated measurement data. Chapman and
Hall.
Ette, E. I., & Williams, P. J. (2004). Population pharmacokinetics II: Estimation methods. Annals of
Pharmacotherapy, 38, 1907-1915.
Ette, E. I., & Ludden, T. M. (1995). Population pharmacokinetic modeling: The importance of informative graphics. Pharmaceutical Research, 12, 1845-1855.
Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H. et al. (2003). The systems
biology markup language (SBML): A medium for representation and exchange of biochemical network
models. Bioinformatics, 19, 524-531.
Jelliffe, R. W., Schumitzky, A., Bayard, D., Milman, M., Van Guilder, M., Wang, X., et al. (1998). Modelbased, goal-oriented, individualised drug therapy. Linkage of population modelling, new multiple
model dosage design, Bayesian feedback and individualised target goals. Clinical Pharmacokinetics,
34, 57-77.
Lloyd, C. M., Halstead, M. D., & Nielsen, P.F. (2004). CellML: Its future, present and past. Progress in
Biophysics and Molecular Biology, 85, 433-450.
Mandema, J. W., Verotta, D., & Sheiner, L. B. (1992). Building population pharmacokinetic-pharmacodynamic models. I. Models for covariate effects. Journal of Pharmacokinetics and Biopharmaceutics,
20, 511-528.
Salinger, D. H., McCune, J. S., Ren, A. G., Shen, D. D., Slattery, J. T., Phillips, B., et al. (2006). Real-time dose adjustment of cyclophosphamide in a preparative regimen for hematopoietic cell transplant:
a Bayesian pharmacokinetic approach. Clinical Cancer Research, 12, 4888-4898.


Moraru, I. I., Schaff, J. C., Slepchenko, B. M., & Loew, L. M. (2002). The virtual cell: An integrated
modeling environment for experimental and computational cell biology. Annals of the New York Academy of Sciences, 971, 595-6.
Pillai, G. C., Mentré, F., & Steimer, J. L. (2005). Non-linear mixed effects modeling - from methodology and software development to driving implementation in drug development science. Journal of
Pharmacokinetics and Pharmacodynamics, 32, 161-183.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-Effects Models in S and S-PLUS. Statistics and Computing Series. New York: Springer-Verlag.
Samson, A., Lavielle, M., & Mentré, F. (2007). The SAEM algorithm for group comparison tests in longitudinal data analysis based on non-linear mixed-effects model. Statistics in Medicine, 30, 4860-4875.
Sheiner, L. B., & Beal, S. L. (1982). Bayesian individualization of pharmacokinetics: Simple implementation
and comparison with non-Bayesian methods. Journal of Pharmaceutical Sciences, 71, 1344-1348.
Sheiner, L. B., & Steimer, J. L. (2000). Pharmacokinetic/pharmacodynamic modeling in drug development. Annual Reviews of Pharmacology and Toxicology, 40, 67-95.
Steimer, J. L., Mallet, A., Golmard, J. L., & Boisvieux, J. F. (1984). Alternative approaches to estimation
of population pharmacokinetic parameters: Comparison with the nonlinear mixed-effect model. Drug
Metabolism Reviews, 15, 265-292.
Vicini, P., & Westhagen, A. (2004, October 2-8). Quantitative modeling and software approaches to the
analysis of complex multilevel systems. Proceedings of the 55th International Astronautical Congress
of the IAF, Vancouver, BC, Canada, IAC-04-IAF-G.1.04
Vonesh, E. F., & Chinchilli, V.M. (1997). Linear and nonlinear models for the analysis of repeated
measurements. New York: Marcel Dekker
Wakefield, J. C., Smith, A. F. M., Racine-Poon, A., & Gelfand, A. E. (1994). Bayesian analysis of linear
and non-linear population models using the Gibbs sampler. Applied Statistics, 43, 201-221.
Wang, Y. (2007). Derivation of various NONMEM estimation methods. Journal of Pharmacokinetics
and Pharmacodynamics, 34, 575-93.

Key Terms
Fixed Effects: In mixed effects models, these are features of the parameters that do not change
across the population. Examples are the central tendency of distributions, or their spread around a
certain mean.
Maximum Likelihood: An approach to estimation predicated on the definition of an appropriate
probability density for the experimental data. This probability density (the likelihood) usually depends
on unknown parameters. The maximum likelihood estimate for the parameters is the value that makes
the likelihood the largest. In numerical analysis, the maximum likelihood problem is most often cast as
a minimization problem where the negative log-likelihood is minimized.


Mixed Effects Model: A statistical model containing random parameters that vary according to a
hierarchy: an example would be between-subject variation and residual unknown variation in a population kinetic model. The mixed effects model comprises fixed effects and random effects, where
the random effects are usually sampled from distributions whose central tendency and spread are fixed
effects.
Open Source: With reference to software development, the open source development model implies
that the source code of a specific software application must be available to users and developers.
Population Kinetic Analysis: The development and identification of kinetic models (i.e., describing
time-dependent phenomena) which, in addition to residual measurement variability, also incorporate a
statistical component describing variation among individuals of a population.
Random Effects: In mixed effects models, these are features of the parameters that change across
the population. An example is the subject-to-subject variation of a certain parameter with respect to
its mean.
Web Service: In its simplest definition, a web service is a set of computer codes that enables communication between software applications.


Section IX

Systems Biology in
Photochemical Processes


Chapter XXXIV

Photosynthesis:

How Proteins Control Excitation


Energy Transfer
Julia Adolphs
Freie Universität Berlin, Germany

Abstract
This chapter introduces the theory of optical spectra and excitation energy transfer of light harvesting
complexes in photosynthesis. The light energy absorbed by protein bound pigments in these complexes
is transferred via an exciton mechanism to the photosynthetic reaction center where it drives the photochemical reactions. The protein holds the pigments in optimal orientation for excitation energy transfer
and creates an energy sink by shifting the local transition energies of the pigments. In this way, the
excitation energy is directed with high efficiency (close to 100 %) to the reaction center. In the present chapter, this energy transfer is studied theoretically. Based on crystal structure data, the excitonic
couplings are calculated taking into account also the polarizability of the protein. The local transition
energies are obtained by two independent methods and are used to predict the orientation of the FMO
protein relative to the reaction center.

INTRODUCTION
In photosynthesis energy from the sunlight is converted to chemical energy (Figure 1). The photons
of the sunlight are absorbed by so-called antenna pigments (chlorophylls, bacteriochlorophylls and
carotenoids) and the excitation energy is transferred to the photosynthetic reaction centre (RC), where
transmembrane charge transfer reactions are driven.
In oxygenic photosynthesis, water is used as an electron source and the electron transfer is accompanied by proton gradients, which drive the production of ATP (Adenosine triphosphate), the universal energy currency, from ADP (Adenosine diphosphate). In this way light energy is converted to chemical energy. As a by-product of the water splitting, oxygen is released, which forms a basis of our life.


Figure 1. Cartoon of the photosynthesis

Oxygenic photosynthesis is performed by higher plants, algae and cyanobacteria. The water splitting of oxygenic photosynthesis requires a relatively high redox potential, which can only be achieved with two RCs connected in series. These two RCs are called photosystems (PS) I and II. Both PSs receive energy from antenna pigments or, in principle, from direct optical excitation. PS II is the first one in the serial connection and is the water splitting part, while PS I is the second part, where the proton gradient drives the ADP to ATP synthesis. The well-known overall equation for oxygenic photosynthesis reads:
$6\,\mathrm{CO_2} + 12\,\mathrm{H_2O} \xrightarrow{\;h\nu\;} \mathrm{C_6H_{12}O_6} + 6\,\mathrm{H_2O} + 6\,\mathrm{O_2}$

where $h\nu$ is the energy of a photon with frequency $\nu$, $h$ is Planck's constant, and C6H12O6 is the chemical formula of glucose. Another process, older in the evolutionary sense1, is anoxygenic photosynthesis, performed by anaerobic bacteria such as green sulfur bacteria. In contrast to the organisms performing oxygenic photosynthesis, they have only one RC. It is called the bacterial reaction centre (bRC) and is structurally similar to PS I. It is able to oxidize H2S and similar compounds. Its reaction is described in simplified form by:
$\mathrm{CO_2} + 2\,\mathrm{H_2S} \xrightarrow{\;h\nu\;} \mathrm{CH_2O} + 2\,\mathrm{S} + \mathrm{H_2O}$

where CH2O is the chemical formula of formaldehyde. Although the scheme of the primary photosynthetic reaction is in the main well understood, the molecular mechanisms are still unclear in many cases. A combined approach of high-resolution structure determination, optical spectroscopy and theory is necessary to understand the building principles of photosynthetic systems and how function and structure of these nano-machines are related. This progress was initiated by the first high-resolution x-ray structure (2.8 Angstrom) determination of a photosynthetic pigment-protein complex (PPC)


Figure 2. (left) Sketch of a BChl a molecule (the phytol chain is truncated at the marked position in the figure). Colour code: carbon atoms (light grey), oxygen (dark grey), magnesium and nitrogen labelled in the figure. The four nitrogen atoms are labelled according to standard nomenclature. (centre) Sketch of the FMO trimer. (right) Dielectric volume: view of the pigments (dark grey) embedded in the protein (light grey); atoms are depicted as spheres with van der Waals radii.

by Fenna and Matthews in 1975 (Fenna & Matthews, 1975), and in 1988 the Nobel Prize in chemistry was awarded to Deisenhofer, Huber and Michel (Deisenhofer et al., 1985) for the determination of the first three-dimensional structure of a photosynthetic reaction centre, namely the RC of the purple bacterium Rhodopseudomonas viridis, which performs anoxygenic photosynthesis.

Photosynthetic Pigments and their Function


The most important photosynthetic pigments are chlorophyll-a (Chl a), appearing in green plants, oxygen
producing algae and cyanobacteria, and bacteriochlorophyll-a (BChl a), appearing in anaerobic bacteria.
The name chlorophyll is derived from Greek: chloros = green and phyllon = leaf. Chl a absorbs most
strongly in the blue and red, but poorly in the green part of the electromagnetic spectrum. This is the
origin of the green colour of chlorophyll containing tissues like plant leaves.
(Bacterio)chlorophyll is a chlorin pigment, which is structurally similar to other porphyrin pigments
such as heme, appearing for instance in hemoglobin, the oxygen-transporting metallo-protein in red
blood cells. In the case of (B)Chl, at the centre of the chlorin ring a magnesium ion is located (Figure
2, left), in contrast to an iron ion in the case of heme. The chlorin ring can have several different side
chains, usually including a long phytol chain2. There are a few different forms that occur naturally: Chl a is present in all oxygen producing organisms; furthermore, Chl b occurs in higher plants and green algae. BChl a is present in most anoxygenic bacteria, while BChl b, c, d, e and g appear in different types of bacteria.
The absorption of (B)Chls is due to the π→π* transitions of the delocalised electronic π-system of the tetrapyrrole3 ring system. In many cases this optical transition can be approximated by a transition dipole moment, whose direction is determined by two of the four nitrogen atoms (NB and ND), see Figure 2 (left).


The RC, where the charge separation takes place, is composed either of Chl a (higher plants, algae, cyanobacteria) or BChl a (anoxygenic bacteria).
However, the major part of the (B)Chls (more than 99.5 %) acts as light-absorbing antennae, funnelling excitation energy to the RC. This increases the efficiency of photosynthesis to nearly 100 %.

The FMO Complex


The Fenna-Matthews-Olson (FMO) protein is a water-soluble complex and was the first PPC that could be crystallised and analysed by x-ray crystallography (Fenna & Matthews, 1975). The resolution of the electron density map has meanwhile been refined to 1.9 Angstrom. The structure of the FMO complex
has a 3-fold rotational symmetry, i.e. it is a trimer (Figure 2, centre). Each of the three monomers contains seven BChl a molecules, as shown in Figure 3 (upper part, left). The BChl a molecules are bound
to the protein scaffold by ligation of their central magnesium atom to histidine, leucine or water bridged
oxygen atoms. The original numbering (Figure 3, upper part, right) of the BChls, chosen by Fenna and
Matthews, is used throughout this chapter.
The FMO complex appears in green sulfur bacteria and mediates the transfer of excitation energy
between the chlorosomes4, which are the main light-harvesting antennae of green sulfur bacteria, and
the membrane-embedded bRC (Figure 3).

Figure 3. (left, top) BChls of the FMO trimer. The BChls 1-7 of one monomeric subunit are highlighted,
numbering as in (Fenna & Matthews, 1975). (right, top) Sketch of the mutual arrangement of the FMO
complex and the reaction centre, as obtained from an electron-microscopic study (Remigy et al., 1999;
Remigy et al., 2002). Additionally, the outer antennae (chlorosomes) are indicated. (bottom) Spatial and temporal relaxation of excitons from the top to the bottom of the complex. As initial condition, the main part of the excitation energy was on BChls 1 and 6. The colour of the π-system of the BChls is varied
between light and dark grey according to the population of the excited states. The enclosed areas mark
the delocalised exciton states that are populated.


This energy transfer from the antennae to the reaction centre is controlled by the protein through systematic changes of the local optical transition energies (so-called site energies) of the pigments. The determination of these site energies was a problem with partly contradictory solutions for about 30 years.
The FMO complex of green sulfur bacteria represents an important model protein for the study of elementary pigment-protein couplings, because it is the simplest PPC appearing in nature. Methods developed on the relatively simple FMO protein will be applied to more complex systems, for instance PS I, in the future.

THEORY OF OPTICAL SPECTRA


Since atomic coordinates of PPCs are available from high-resolution x-ray crystallography, it is possible, with the help of more or less realistic theories, to calculate structure-based optical spectra and to compare them with measured optical spectra. A realistic theory has to describe two quantities: the coupling between pigments (pigment-pigment coupling, Figure 4, centre) and the coupling between each pigment and the protein (pigment-protein coupling, Figure 4, right). Unfortunately, both coupling strengths are in the same range, which is a challenge for theory. In standard theories one uses perturbation theory for one of the two types of couplings, resulting in Förster theory of excitation energy transfer and Kubo/Lax theory of optical spectra in the case of weak inter-pigment couplings, and Redfield theory for transfer and spectra in the case of strong pigment-pigment coupling.
The present theory of optical spectra (Renger & Marcus, 2002) includes both the pigment-pigment
and the pigment-protein coupling beyond perturbation theory. Roughly speaking the peak positions of

Figure 4. (left) Two selected BChls in an FMO monomer. Each pigment interacts with the other pigments via Coulomb pair interactions, and each pigment interacts with its individual protein environment. The pigment-protein coupling is different for each BChl, due to the different surroundings. (centre and right) The two coupling mechanisms of pigment-protein complexes and their standard descriptions: the pigment-pigment coupling leads to delocalised exciton states; the pigment-protein coupling includes a static and a dynamic part.


optical lines are determined by the pigment-pigment coupling and the lineshape is determined by the
pigment-protein coupling, including lifetime broadening and vibrational sidebands. Here a short sketch
of the theory of optical spectra is given.
The theory is based on a standard Hamiltonian for PPCs that describes the pigments as coupled
two-level systems interacting with vibrational degrees of freedom of the pigments and the protein. The
excitonic couplings between the pigments are calculated as Coulomb couplings between their optical
transition dipoles.
The exciton part of the Hamiltonian,

$H_{ex} = \sum_m E_m\,|m\rangle\langle m| + \sum_{m \neq n} V_{mn}\,|m\rangle\langle n|$,

contains the local transition energies $E_m$ (site energies) of the pigments and the excitonic couplings $V_{mn}$. For the calculation of optical spectra, the PPC Hamiltonian is expressed in terms of delocalised exciton states $|M\rangle = \sum_m c_m^{(M)}\,|m\rangle$, where $|c_m^{(M)}|^2$ describes the probability for the m-th pigment to be excited when the PPC is in the M-th exciton state. The exciton coefficients $c_m^{(M)}$ and excitation energies $\hbar\omega_M$ are obtained from the solution of the eigenvalue problem of the exciton Hamiltonian, $H_{ex}\,|M\rangle = \hbar\omega_M\,|M\rangle$. The eigenvalue problem can be solved by diagonalisation of the Hamiltonian in matrix representation, where the diagonal elements are the site energies $E_m$ and the non-diagonal elements are the excitonic couplings $V_{mn}$. The resulting eigenvalues are the excitation energies $\hbar\omega_M$, which give (in a first approach) the positions of the spectral lines, and the eigenvector elements $c_m^{(M)}$, which determine the redistribution of the local transition dipoles $\vec{\mu}_m$ into a collective exciton transition dipole, $\vec{\mu}_M = \sum_m c_m^{(M)}\,\vec{\mu}_m$.
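Because this step is just a matrix diagonalisation, it can be illustrated in a few lines of code. The following sketch (Python with NumPy) builds a 7 x 7 exciton Hamiltonian from hypothetical site energies, couplings and transition dipole directions; all numbers are placeholders for illustration, not the values determined for the FMO complex in this chapter.

import numpy as np

# Hypothetical site energies E_m (cm^-1) and a few excitonic couplings V_mn (cm^-1);
# the numbers are placeholders for illustration only.
E_site = np.array([12445.0, 12520.0, 12205.0, 12335.0, 12490.0, 12640.0, 12450.0])
H = np.diag(E_site)
couplings = {(0, 1): -94.0, (1, 2): 30.0, (2, 3): -58.0, (3, 4): -5.0,
             (4, 5): 89.0, (5, 6): 32.0, (0, 5): -10.0}
for (m, n), V in couplings.items():
    H[m, n] = H[n, m] = V                      # non-diagonal elements V_mn

# Local transition dipole directions (unit vectors); placeholders as well.
rng = np.random.default_rng(0)
mu_local = rng.normal(size=(7, 3))
mu_local /= np.linalg.norm(mu_local, axis=1, keepdims=True)

# Diagonalisation: eigenvalues = exciton energies, eigenvector columns = coefficients c_m^(M).
w, C = np.linalg.eigh(H)

# Collective exciton transition dipoles mu_M = sum_m c_m^(M) mu_m and stick heights |mu_M|^2.
mu_exciton = C.T @ mu_local
stick_height = np.sum(mu_exciton ** 2, axis=1)
for w_M, d_M in zip(w, stick_height):
    print(f"exciton energy {w_M:8.1f} cm^-1   dipole strength {d_M:5.2f}")

The columns of C correspond to the exciton states, so C[m, M]**2 gives the contribution of pigment m to exciton state M, which is exactly the probability interpretation used in the text.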

Expressing the Hamiltonian of exciton-vibrational coupling in terms of exciton eigenstates, one


finds that it contains diagonal and off-diagonal elements, where the former gives rise to vibrational
sidebands of exciton transitions in optical spectra and the latter leads to relaxation between different
exciton states.
The linear absorption reads

$\alpha(\omega) \propto \Big\langle \sum_M |\vec{\mu}_M|^2\, D_M(\omega) \Big\rangle_{dis}$,

where $D_M(\omega)$ is the lineshape function and $\langle \cdots \rangle_{dis}$ denotes an average over static disorder in the site energies. A Gaussian distribution function of width (fwhm) $\Delta_{dis}$ is assumed for these energies, and the disorder average is performed by a Monte Carlo method.

The lineshape function reads

$D_M(\omega) = \Re \int_0^{\infty} dt\; e^{i(\omega-\omega_M)t}\, e^{G_M(t)-G_M(0)}\, e^{-t/\tau_M}$.

It contains vibrational sidebands, described by the time-dependent function $G_M(t)$, and lifetime broadening described by the dephasing time $\tau_M$ ($\Re$ denotes the real part of the integral). The essential property appearing in $G_M(t)$ is the spectral density $J(\omega)$, which describes how strongly a vibrational mode with frequency $\omega$ modulates the transition energy of a pigment. $J(\omega)$ can be extracted from optical spectra (Renger & Marcus, 2002) and describes, beside the vibrational sidebands, the dissipation of excess energy by the protein that occurs during exciton relaxation.
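As an illustration of the Monte Carlo disorder average, the sketch below (Python/NumPy, continuing the diagonalisation example above) draws the site energies from a Gaussian disorder distribution, diagonalises each realisation and dresses the resulting sticks with Gaussians of a common width. It therefore corresponds only to the Gauss-dressed stick spectrum of the list that follows, combined with static disorder, not to the full non-Markovian lineshape theory; all widths and sample numbers are assumptions chosen for illustration.

import numpy as np

def disorder_averaged_spectrum(E_site, V, mu_local, omega, fwhm=100.0,
                               disorder_fwhm=100.0, n_samples=5000, seed=1):
    # Gauss-dressed stick spectrum, averaged over Gaussian static disorder in the site energies.
    # E_site: site energies (cm^-1); V: coupling matrix with zero diagonal (cm^-1);
    # mu_local: local transition dipole vectors; omega: frequency axis (cm^-1).
    rng = np.random.default_rng(seed)
    to_sigma = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))   # fwhm -> standard deviation
    sigma_dis, sigma = disorder_fwhm * to_sigma, fwhm * to_sigma
    spectrum = np.zeros_like(omega)
    for _ in range(n_samples):
        H = np.diag(E_site + rng.normal(0.0, sigma_dis, size=E_site.size)) + V
        w, C = np.linalg.eigh(H)
        strength = np.sum((C.T @ mu_local) ** 2, axis=1)
        for w_M, d_M in zip(w, strength):
            spectrum += d_M * np.exp(-(omega - w_M) ** 2 / (2.0 * sigma ** 2))
    return spectrum / n_samples

With V taken as the off-diagonal part of the Hamiltonian from the previous sketch (V = H - np.diag(E_site)) and omega = np.linspace(12000, 12800, 400), the function returns a smooth, disorder-broadened absorption profile.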
In Figure 5 (left) the simulated spectra for different degrees of theory are shown:

1. Stick-spectra (dark grey bars, top): The sticks are obtained from the solution of the eigenvalue problem of the exciton Hamiltonian only. The stick positions are given by the eigenvalues $\hbar\omega_M$ and the stick heights are given by the squared exciton transition dipole moments $|\vec{\mu}_M|^2$.
2. Gauss-dressed stick-spectra (light grey curve, top): The sticks of 1. are multiplied with Gaussian bell curves and the spectrum is obtained by summing over all contributions from the seven Gaussians, i.e. $\alpha(\omega) \propto \sum_M |\vec{\mu}_M|^2\, D(\omega - \omega_M)$ with a Gaussian distribution of width $\Delta$ (equal for all M).


Figure 5. (left) Resulting spectra for different levels of theory. Top: only the exciton part has been considered. Bottom: pigment-protein coupling is taken into account at two different levels of theory. (right) Absorption (OD), circular dichroism (CD) and linear dichroism (LD) spectra: experiment (Wendling et al., 2002) and simulated spectra with site energies obtained from the fit. The horizontal axes give the wavenumber in cm^-1.

3. Markov approximation (bottom, dark grey curve): In addition to the solution of the exciton eigenvalue problem, it includes a shift of the peak positions and lifetime broadening due to pigment-protein coupling.
4. Non-Markovian theory (bottom, light grey curve): In addition to the Markov approximation, it also includes vibrational sidebands.

The experiment (black circles) is taken from the literature (Wendling et al., 2002).

Excitonic Couplings in a Dielectric Environment


The excitonic couplings between pigments can be calculated via the Coulomb couplings between the transition dipoles. Our calculations (Adolphs & Renger, 2006) show that a description in point dipole approximation is valid for the FMO protein, because the distances between the pigments are relatively large compared to the pigments' extensions. The Coulomb coupling in point dipole approximation is

$V_{mn} = f\, \frac{\mu_{vac}^2}{R_{mn}^3} \left[ \vec{e}_m \cdot \vec{e}_n - 3\,(\vec{e}_m \cdot \vec{e}_{mn})(\vec{e}_n \cdot \vec{e}_{mn}) \right]$,

where $\vec{e}_n$ is the unit vector of transition dipole n, $\vec{e}_{mn}$ is oriented along the connection of the centres of pigments m and n, $R_{mn}$ is the distance between the centres of pigments m


and n, and f is a factor that describes the screening of the Coulomb coupling by the dielectric environment with dielectric constant ε.
As shown in Figure 2 (right), the BChls are embedded in the protein, which has a dielectric constant of around ε = 2, due to the polarizability of the protein. This polarizability must be included in the calculation of the Coulomb couplings; usually this is done by introduction of the dielectric constant ε.
Regarding this screening factor f, the Coulomb couplings considering the dielectricity of the protein and the Coulomb couplings in vacuum (i.e. ε = 1) are related by $V_{mn}^{protein} = f\, V_{mn}^{vacuum}$.
In the case of large distances between the pigments (these distances must be larger than the distances required for valid point dipole coupling!), the screening factor is simply f = 1/ε; for smaller distances, f becomes distance dependent. This means that the calculation of the excitonic couplings in PPCs is in principle non-trivial and must be done numerically with software that is able to solve the so-called Poisson equation.
We did this for the FMO protein and used, instead of the transition dipoles, so-called transition monopole charges (Chang, 1977), which are point charges describing the transition density of the optical transition of the pigments. The numerical solution of the Poisson equation was done with the program MEAD (Bashford, 1990). The interesting result of these calculations was that, in the case of the FMO complex, the factor f can be estimated as f = 0.8 (instead of 0.5 for 1/ε), independent of the distance. This means that for the FMO complex all couplings can be calculated very easily in the point dipole approximation together with a screening factor of f = 0.8.
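Under the assumption of such an effective screening factor, the coupling formula is simple to evaluate. The following Python sketch does so for a single pair of pigments; the vacuum dipole strength and the example geometry are placeholders, not the values used in the chapter's calculations.

import numpy as np

CM1_PER_D2_PER_A3 = 5034.0   # converts Debye^2 / Angstrom^3 into cm^-1

def point_dipole_coupling(r_m, r_n, e_m, e_n, mu_vac_sq=37.0, f=0.8):
    # Excitonic coupling V_mn (cm^-1) in point dipole approximation with screening factor f.
    # r_m, r_n: pigment centres (Angstrom); e_m, e_n: unit vectors of the transition dipoles;
    # mu_vac_sq: squared vacuum transition dipole strength in Debye^2 (placeholder value).
    r = np.asarray(r_n, float) - np.asarray(r_m, float)
    R = np.linalg.norm(r)
    e_mn = r / R
    kappa = np.dot(e_m, e_n) - 3.0 * np.dot(e_m, e_mn) * np.dot(e_n, e_mn)
    return f * CM1_PER_D2_PER_A3 * mu_vac_sq * kappa / R ** 3

# Example: two parallel dipoles 12 Angstrom apart, perpendicular to their connection vector.
print(point_dipole_coupling([0.0, 0.0, 0.0], [12.0, 0.0, 0.0],
                            [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]))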

CALCULATION OF LOCAL TRANSITION ENERGIES


In vacuum each of the seven BChl a molecules of one FMO subunit has the same transition energy. In
the protein each pigment has an individual local transition energy, due to electrostatic interactions with
the local protein environment. In the spectra simulation program, the site energies are input parameters,
which means that there are two independent ways to calculate site energies:
1. Experimental optical spectra can be utilized and the site energies can be treated as fit parameters. For this procedure one has to use a suitable fit algorithm, in our case an evolutionary algorithm.
2. The known atomic details (positions of pigment and protein atoms, charges of the amino acids, dipole moments of the BChls) can be used to calculate the change of the pigment transition energies by their protein environment.

Both methods complement one another. It is not absolutely certain that the genetic algorithm really ends in the global minimum, i.e. it could be that it finds a good set of site energies but not the best one. Furthermore, the site energies from a fit are somewhat indirect, because we do not put in information about the pigment surroundings that are responsible for the different site energies. (Remark: indirect does not mean that we fit the spectra with arbitrary functions like polynomials; the functions we use for the fit are based on a physically reasonable theory!)
Therefore it is a good approach to use a second method for the calculation of site energies, which is completely independent of the fit and which is based directly on the structural data of the protein surroundings of the pigments. If we can calculate the site energies in a direct way, we can understand


how the protein manipulates the pigment transition energies and, ultimately, how the excitation energy transfer through the FMO protein works.
In addition, it is a critical test whether both procedures give similar results. If so, these results will be much more convincing than those achieved by only one of the two methods.
Considering larger PPCs, we hope to develop the direct calculation method far enough that we can use it without the comparison with the fit, because larger complexes like PS I contain 96 Chls, which is far too large a number for a fitting procedure.

Genetic Algorithm: Optimisation Copied from Nature


An often successful way to treat non-linear multidimensional optimisation problems (in the case of
FMO there are seven dimensions) is to use a genetic algorithm as fit routine. Compared to traditional
methods like gradient descent or Newton's method, the genetic algorithm helps to avoid being trapped in a local minimum and is relatively easy to implement.
Its working scheme is shown on the left side of Figure 6. The starting point is N sets of site energies, a so-called population of N chromosomes, where one set is provided (the start set) and N-1 are randomly created. In our case N = 100 is large enough, i.e. we have 100 sets of 7 site energies. For these N
sets of site energies N spectra are calculated and compared with the experimental spectra. According
to the deviation from the experimental spectra, a fitness value is assigned to each of the N sets of site
energies. The higher the fitness value, the better the simulated spectrum fits to the experimental one.
We simply use the inverse quadratic deviation from experiment as fitness value. After determining the
fitness value, a ranking of the chromosomes is done. A random selection of chromosomes is performed,
which considers the ranking with the help of a suitable distribution function, i.e. chromosomes with
higher rank are selected with higher probability. The important point is that chromosomes with low
rank can be selected also, which helps to avoid local minima. With the selected chromosomes so-called
genetic operations are performed:

Figure 6. (left) Schematic illustration of the working scheme of an evolutionary algorithm. (right) The principle of the two main genetic operations, mutation and recombination, each involving a random choice.


1. Reproduction means that some chromosomes are copied to the next cycle without any change. In our case only one chromosome, namely the one with the highest rank, was copied to the next cycle.
2. Recombination (Figure 6, right) is analogous to sexual reproduction in nature: two selected chromosomes are truncated at a random position and then crosswise interchanged; two children arise, which are a mixture of their parents.
3. Mutation (Figure 6, centre) is performed by the random selection and random variation (in a suitable range) of a site energy.

Selection and genetic operations have to be repeated as many times as necessary to produce N new chromosomes. After checking the break condition, either the cycle is continued with the N new chromosomes or the algorithm ends with the highest-ranked chromosome as the result. In the case of the FMO complex, the cycle has to be executed around 100 times to reach a satisfying result.
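A minimal sketch of such an evolutionary fit is given below (Python/NumPy). The spectrum simulation is passed in as a user-supplied function, the fitness is the inverse quadratic deviation from the experimental spectrum as described above, and the population size, mutation width and random initialisation range are assumptions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(2)

def fitness(site_energies, simulate_spectrum, exp_spectrum):
    # inverse quadratic deviation between simulated and experimental spectrum
    diff = simulate_spectrum(site_energies) - exp_spectrum
    return 1.0 / (np.sum(diff ** 2) + 1e-12)

def evolve(start_set, simulate_spectrum, exp_spectrum,
           n_pop=100, n_cycles=100, mut_width=80.0):
    n_sites = start_set.size
    # population: the provided start chromosome plus N-1 random ones
    pop = np.vstack([start_set,
                     start_set + rng.normal(0.0, 200.0, size=(n_pop - 1, n_sites))])
    for _ in range(n_cycles):
        fit = np.array([fitness(c, simulate_spectrum, exp_spectrum) for c in pop])
        order = np.argsort(fit)[::-1]                 # ranking, best chromosome first
        prob = np.linspace(1.0, 0.1, n_pop)           # low ranks keep a small selection probability
        prob /= prob.sum()
        new_pop = [pop[order[0]].copy()]              # reproduction: copy the highest-ranked chromosome
        while len(new_pop) < n_pop:
            i, j = rng.choice(order, size=2, p=prob)  # rank-based random selection
            cut = rng.integers(1, n_sites)            # recombination: crosswise interchange of two parents
            child = np.concatenate([pop[i][:cut], pop[j][cut:]])
            k = rng.integers(n_sites)                 # mutation: random variation of one site energy
            child[k] += rng.normal(0.0, mut_width)
            new_pop.append(child)
        pop = np.array(new_pop)
    fit = np.array([fitness(c, simulate_spectrum, exp_spectrum) for c in pop])
    return pop[np.argmax(fit)]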

Calculation of Site Energies from Atomic Detail


For a microscopic interpretation of the site energies, electrostatic shifts in the site energies due to partial charges of amino acids can be calculated. In a classical picture, a pigment has different permanent dipole moments in its ground and excited states ($\vec{\mu}_{gr}$ and $\vec{\mu}_{ex}$). These permanent dipole moments interact coulombically with any charge (distribution) around them.
The well-known Coulomb interaction of a dipole with a point charge is

$E_{Coul} = \frac{1}{4\pi\varepsilon_0}\, \frac{q\,(\vec{\mu} \cdot \vec{r})}{r^3}$,

where q is the point charge, $\vec{\mu}$ is the dipole moment and $\vec{r}$ is the vector between the point charge and the centre of the dipole.
Since we are interested in the energy difference between the ground and excited state, $\Delta E = E(\vec{\mu}_{ex}) - E(\vec{\mu}_{gr}) = E(\Delta\vec{\mu})$, where $\Delta\vec{\mu} = \vec{\mu}_{ex} - \vec{\mu}_{gr}$, we get

$\Delta E_i = \frac{1}{4\pi\varepsilon_0}\, \frac{1}{\varepsilon_{eff}} \sum_{j=1}^{N} q_j\, \frac{\Delta\vec{\mu}_i \cdot \vec{r}_{ij}}{r_{ij}^3}$,

where $\varepsilon_{eff}$ is an effective dielectric constant, $\Delta\vec{\mu}_i$ is the difference of the permanent dipole moments of the ground and excited state of the i-th pigment, $\vec{r}_{ij}$ is the vector connecting the centre of the i-th pigment with the j-th point charge, and N is the number of point charges. In principle, charged and uncharged amino acid groups can be treated with this method by using their partial charges, for instance from the CHARMM force field (Brooks et al., 1983). From Stark spectra of BChl a (Lockhart & Boxer, 1987), the absolute value of $\Delta\vec{\mu}_i$ is known to be between 1.6 D and 2.4 D; we used $|\Delta\vec{\mu}_i| = 2$ D, oriented approximately along the NB-ND axis of BChl a.
The site energy $E_i$ of pigment i is given by $E_i = E_0 + \Delta E_i$, where the constant $E_0$ is assumed to be equal for all pigments and will be determined from the overall position of the spectrum and comparison with experiment.
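A direct numerical transcription of this electrostatic shift is straightforward; the sketch below (Python/NumPy) evaluates such a shift for one pigment from a list of partial charges. The unit conversion constants are standard, but the charge set, the effective dielectric constant and the difference-dipole magnitude used as defaults are placeholders, not the values of the chapter's calculation.

import numpy as np

COULOMB_CM1 = 116140.0        # e^2 / (4 pi eps0 * Angstrom) expressed in cm^-1
DEBYE_TO_E_ANGSTROM = 0.2082  # 1 Debye in units of e * Angstrom

def site_energy_shift(pigment_centre, delta_mu_dir, charges, positions,
                      delta_mu_debye=2.0, eps_eff=2.0):
    # Electrostatic shift Delta E_i (cm^-1) of one pigment's transition energy.
    # charges: partial charges q_j in units of e; positions: their coordinates (Angstrom);
    # delta_mu_dir: unit vector of the difference dipole (approximately the NB-ND axis).
    dmu = delta_mu_debye * DEBYE_TO_E_ANGSTROM * np.asarray(delta_mu_dir, float)
    shift = 0.0
    for q, pos in zip(charges, positions):
        r = np.asarray(pos, float) - np.asarray(pigment_centre, float)
        shift += q * np.dot(dmu, r) / np.linalg.norm(r) ** 3
    return COULOMB_CM1 * shift / eps_eff

# Example: a single -1 e charge located 10 Angstrom away along the difference-dipole axis.
print(site_energy_shift([0, 0, 0], [1, 0, 0], [-1.0], [[10.0, 0.0, 0.0]]))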
In our first investigation we represented the pigments as electric dipoles and found that it did not
make a big difference whether we took into account charged and uncharged amino acids or only charged


amino acids, which could even be replaced by +e or -e point charges at the position of the amino acid
charge center.

Results
The convincing aspect of our results is that, for both the fit and the electrostatic method, pigment number 3 has the lowest site energy. We performed the calculation for two different FMO complexes (from Chlorobium tepidum and Prosthecochloris aestuarii), and for both complexes and both methods pigment number 3 has the lowest site energy, in accordance with results from the literature (Wendling et al., 2002; Vulto et al., 1998). The highest deviations between the fit results and the electrostatic results are found for pigments 2 and 5, which might be due to the fact that these are the two pigments that are not ligated by histidine and therefore might require a different parameter E0.
The spectra simulated with the site energies from fit (Adolphs & Renger, 2006) are shown in the
right part of Figure 5, together with the experimental spectra (Wendling et al., 2002).
The resulting site energies from atomic detail are not yet satisfying, but they are encouraging enough to continue with a more detailed method. In this method the dipole moments of the ground and excited state of BChl a will be replaced by atomic partial charges of the ground and excited states, calculated with quantum chemical methods (Madjet et al., 2006). Furthermore, the influence of the whole protein will be taken into account. This is a much more realistic description, and our hope to obtain better results is not unrealistic.

CONCLUSION
From electron microscopy (Remigy et al., 1999, 2002) the relative arrangement of the FMO complex between the chlorosomes and the RC is known (Figure 3, right, top). The FMO complex is located between the chlorosomes and the RC and functions as an energy conductor from the chlorosomes to the RC. From linear dichroism spectroscopy it is known that two orientations of the FMO complex are possible, but a decision between them could not be made.
With the site energies from fit, we were able to perform exciton relaxation dynamics calculations
(Adolphs & Renger, 2006), which indicate the temporal excitation energy relaxation of each exciton.
It is possible to assign the pigments' contributions to an exciton state, so one can translate the exciton relaxation into the excitation energy population of the pigments.
In the lower part of Figure 3, the colour of the π-system of the BChls is varied between light and dark grey according to the excitation energy population of the pigments. The enclosed areas mark the delocalised exciton states that are populated. The depicted situations are at times t = 0 fs, 200 fs, 1 ps and 5 ps. After 5 ps the relaxation process is finished.
As one can see, the excitation energy is first located mainly on pigments 1 and 6 (t = 0), then distributes more or less over all seven pigments (t = 200 fs), then the excitation energy on pigments 5-7 disappears (t = 1 ps), and at last the excitation energy is located on pigments 3 and 4. Apparently the spatial evolution of the excitation energy is from pigments 1 and 6 to pigments 3 and 4, which makes clear how the FMO complex must be orientated for efficient energy transfer in relation to the chlorosomes and the RC: BChl 1 is on the chlorosome side and BChl 3 is the linker pigment to the RC.
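This kind of relaxation can be illustrated with a simple Pauli master equation for the exciton populations. The sketch below (Python, using SciPy's matrix exponential) propagates hypothetical downhill relaxation rates between seven exciton states; the rates and the initial condition are placeholders, not the rates obtained in Adolphs and Renger (2006), and translating the exciton populations into pigment populations would additionally require the coefficients $|c_m^{(M)}|^2$.

import numpy as np
from scipy.linalg import expm

n = 7
k = np.zeros((n, n))          # k[M, N]: rate (1/ps) for relaxation from exciton state N to state M
for N in range(1, n):
    k[N - 1, N] = 2.0         # downhill relaxation towards lower-lying exciton states
    k[N, N - 1] = 0.2         # weak thermally activated uphill transfer

# Pauli master equation dP/dt = K P, with K = k - diag(column sums) so that probability is conserved.
K = k - np.diag(k.sum(axis=0))

P0 = np.zeros(n)
P0[-2:] = 0.5                 # initial excitation mainly in the two highest exciton states

for t in (0.0, 0.2, 1.0, 5.0):            # times in ps, matching Figure 3 (bottom)
    P = expm(K * t) @ P0
    print(f"t = {t:4.1f} ps   exciton populations:", np.round(P, 2))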
The direct electrostatic method to calculate local transition energies presented here can be applied to other complexes, although it first has to be improved further, using the FMO complex as a test case.


By use of a new calculation method and comparison with the results of an evolutionary algorithm,
an unambiguous solution was found, describing the path of the excitation energy through the FMO
complex precisely.

REFERENCES
Adolphs, J., & Renger, T. (2006). How proteins trigger energy transfer in the FMO complex of green
sulfur bacteria. Biophys. J., 91, 2778-2797.
Bashford, D. (1990-1998). MEAD. La Jolla, CA: The Scripps Research Institute.
Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., Karplus, M. (1983).
CHARMM: A program for macromolecular energy, minimization and dynamics calculations. J. Comput. Chem. 4, 187-217.
Chang, J.C. (1977). Monopole effects on electronic excitation interaction between large molecules. I.
Application to energy transfer in chlorophylls. J. Chem. Phys., 67, 3901-3909.
Deisenhofer, J., Epp, O., Miki, K., Huber, R., & Michel, H. (1985). Structure of the protein subunits in the
photosynthetic reaction centre of Rhodopseudomonas viridis at 3 Å resolution. Nature, 318, 618-624.
Fenna, R.E., & Matthews, B.W. (1975). Chlorophyll arrangement in a bacteriochlorophyll protein from
Chlorobium limicola. Nature, 258, 573-577.
Knox, R.S., & Spring, B.Q. (2003). Dipole strength in the chlorophylls. Photochem. Photobiol., 77,
497-501.
Lockhart, D.J. & Boxer, S.G. (1987). Magnitude and direction of the change in dipole moment associated with excitation of the primary electron donor in Rhodopseudomonas sphaeroides reaction centers.
Biochemistry, 26, 664-668.
Madjet, M.E., Abdurahman, A., Renger, T. (2006). Intermolecular Coulomb couplings from ab initio
electrostatic potentials: Application to optical transitions of strongly coupled pigments in photosynthetic
antennae and reaction centres. J. Phys. Chem., B, 110, 17268-17281.
Müh, F., Madjet, M.E., Adolphs, J., Abdurahman, A., Rabenstein, B., Ishikita, H., Knapp, E.W., &
Renger, T. (2007). α-helices direct excitation energy flow in the Fenna-Matthews-Olson protein. Proc.
Natl. Acad. Sci. USA, 104(43), 16862-16867.
Remigy, H.W., Stahlberg, H., Fotiadis, D., Wolpensinger, B., Engel, A., Hauska, G., & Tsiotis, G. (1999).
The reaction centre complex from green sulfur bacterium Chlorobium tepidum: A structural analysis
by scanning transmission electron microscopy. J. Mol. Biol., 290, 851-858.
Remigy, H.W., Hauska, G., Müller, S.A., & Tsiotis, G. (2002). The reaction centre from green sulfur
bacteria: Progress towards structural elucidation. Photosynth. Res., 71, 91-98.
Renger, T., & Marcus, R.A. (2002). On the relation of protein dynamics and exciton relaxation in pigment-protein complexes: An estimation of the spectral density and a theory for the calculation of optical
spectra. J. Chem. Phys., 116(22), 9997-10019.


Vulto, S.I.E., de Baat, M.A., Louwe, R.J.W, Permentier, H.P., Neef, T., Miller, M., van Amerongen, H., &
Aartsma, T.J. (1998). Exciton simulations of optical spectra of the FMO complex from the green sulfur
bacterium Chlorobium tepidum at 6 K. J. Phys. Chem. B, 102, 9577-9582.
Wendling, M., Przyjalgowski, M.A., Glen, D., Vulto, S.I.E., Aartsma, T.J., van Grondelle, R., & van
Amerongen, H. (2002). The quantitative relationship between structure and polarized spectroscopy in
the FMO complex of Prosthecochloris aestuarii: Refining experiments and simulations. Photosynth.
Res., 71, 99-123.

Key Terms
Aldehyde Group: A functional group, which consists of a carbon atom which is bonded to a hydrogen
atom and double-bonded to an oxygen atom.

Angstrom: 1 Angstrom = 1 Å = 10^-10 m, distance unit.


Chlorosomes: Large photosynthetic antenna complex found in green sulfur bacteria. They are
ellipsoidal bodies, their length is around 100 to 200 nm, width of 50 to 100 nm and height of 15 - 30
nm. They are mostly composed of BChl (c, d or e) with small amounts of carotenoids and quinones
surrounded by a galactolipid monolayer.
Chromosomes (in the context of genetic algorithms): A set of values for the quantities that have to
be fitted is called a chromosome. In our case a chromosome is a set of seven values for the seven site
energies we are fitting.
Debye: 1 Debye = 1 D, CGS unit for the electric dipole moment. Definition: two charges +e and -e, separated by 1 Angstrom, have a dipole moment of 4.8 D; i.e. 1 D = 3.33564 x 10^-30 C m.
e = 1.60217 x 10^-19 C, elementary charge, charge of a proton.
Electron Volt (eV): 1 eV = 1.60217 x 10^-19 J, energy unit.
Exciton State: Delocalized excited state of pigments.
Formaldehyde: CH2O. The simplest aldehyde: one hydrogen atom bonded to an aldehyde group.
ΔG: Difference in Gibbs (free) energy.

Gaussian Distribution (bell curve): $D(\omega) = \frac{2}{\Delta}\sqrt{\frac{\ln 2}{\pi}}\;\exp\!\left(-\frac{4\ln 2\,\omega^{2}}{\Delta^{2}}\right)$, a bell curve of width (fwhm) $\Delta$.

Genetic Algorithm (evolutionary algorithm): Optimization algorithm that imitates biological evolution to find the global minimum of a non-linear multi-dimensional optimization problem. Genetic algorithms are used in multitudinous optimization problems in mathematics, physics, technical applications and so on, so it is a very useful concept.
Glucose: C6H12O6. A monosaccharide (also known as sugar) and an important carbohydrate in biology.


h = 6.6260755 x 10^-34 J s, Planck's constant.


Hydrogen Sulfide: H2S. A colorless, toxic and flammable gas and responsible for the foul odor of
rotten eggs. It often results from the bacterial break down of organic matter in the absence of oxygen,
such as in swamps and sewers. It also occurs in volcanic gases and natural gas.
Lifetime Broadening: The linewidth of the Lorentzian-shaped spectral lines of the 0-0 transition is determined by the dephasing time of exciton relaxation, $\tau_M$. A longer dephasing time causes narrower spectral lines.
Monte Carlo Methods: A common type of computational algorithms used in many fields, such
as simulation of the behaviour of physical systems. In contrast to other simulation methods, they are
stochastic, i.e. based on (pseudo-)random numbers. In our case we vary the site energies with the help of a Gaussian random distribution of a suitable width (100 cm^-1) and calculate N (= 5000-10000) slightly different spectra. The resulting spectrum is the sum over these randomly varied spectra. This procedure is done to simulate the natural line broadening caused by disordered pigment motions.
Phytol: C20H40O. A natural linear diterpene alcohol. It is an oily liquid that is nearly insoluble in
water, but soluble in most organic solvents.

Poisson Equation: $\nabla \cdot \left[ \varepsilon(\vec{r})\,\nabla \phi(\vec{r}) \right] = -4\pi \rho(\vec{r})$, with electrostatic potential $\phi(\vec{r})$, charge density $\rho(\vec{r})$, dielectric constant $\varepsilon(\vec{r})$ and nabla operator $\nabla = \left(\frac{\partial}{\partial x}, \frac{\partial}{\partial y}, \frac{\partial}{\partial z}\right)$.
Pyrrole: C4H5N. An aromatic organic compound, arranged in a pentagon.


Site Energy: Local transition energy (of a pigment) due to its environment. The so-called vacuum
transition energy is the same for identical pigments. It can be estimated from the pigment transition
energy in solution (Knox & Spring, 2003). We are interested in the local transition energies, which
deviate from each other due to different electrostatic surroundings (protein, water, the other pigments).
There is no way to directly measure the site energy!
Tetrapyrroles: Compounds containing four pyrrole rings.
Vibrational Sidebands: Due to the Franck-Condon principle, the electronic excitation takes place
from the electronic ground state (which is vibrationally equilibrated, i.e. is also in the vibrational
ground state) to vibrational ground and higher states of the excited electronic state. The transition from
the vibrational groundstate of the electronic groundstate to the vibrational groundstate of the excited
electronic state is called 0-0 transition. The energy gap between the electronic groundstate and vibrational excited states of the excited electronic state is larger than that of the 0-0 transition, therefor the
0-0 spectral line is accompanied by spectral lines with higher energy due to 0-1, 0-2, transitions. In
case of the FMO protein, the 0-0 transition is dominating and the sidebands just broaden the spectral
lines on the high energy side.
Wavenumber: Reciprocal wavelength. Unit: cm^-1; 8065.54 cm^-1 = 1 eV. Energy unit typically used
in spectroscopy.


Abbreviations
ADP: Adenosine diphosphate
ATP: Adenosine triphosphate
BChl: Bacteriochlorophyll
bRC: Bacterial reaction centre
CHARMM: Chemistry at HARvard Molecular Mechanics www.charmm.org
Chl: Chlorophyll
FMO: Fenna-Matthews-Olson (Complex)
fwhm: Full width at half maximum
MEAD: Macroscopic Electrostatics with Atomic Detail www.scripps.edu/mb/bashford/
PPC: Pigment protein complex
PS: Photosystem
RC: Reaction centre

ENDNOTES
1. The ability to convert light energy into chemical energy is a huge advantage in evolution. Photosynthesis came up very early in the history of life on earth, which began around 3.5 billion years ago. Oxygenic photosynthesis arose approximately 2 billion years ago, as geological evidence suggests. Anoxygenic photosynthesis came up even earlier.
2. Phytol is a natural linear diterpene alcohol. It is an oily liquid that is nearly insoluble in water, but soluble in most organic solvents. Its chemical formula is C20H40O.
3. Tetrapyrroles are compounds containing four pyrrole rings. Pyrrole is an aromatic organic compound, arranged in a pentagon, with the chemical formula C4H5N.
4. Chlorosomes are large photosynthetic antenna complexes found in green sulfur bacteria. They are ellipsoidal bodies; their length is around 100 to 200 nm, their width 50 to 100 nm and their height 15 to 30 nm. They are mostly composed of BChl (c, d or e) with small amounts of carotenoids and quinones, surrounded by a galactolipid monolayer.


Chapter XXXV

Photodynamic Therapy:
A Systems Biology Approach

Michael R. Hamblin
Massachusetts General Hospital - Boston, USA; Harvard Medical School, USA;
and Harvard-MIT Division of Health Sciences and Technology, USA

Abstract
Photodynamic therapy (PDT) is a rapidly advancing treatment for multiple diseases. PDT involves the
administration of a nontoxic drug or dye known as a photosensitizer (PS), either systemically, locally,
or topically, to a patient bearing a lesion (frequently but not always cancer), followed after some time
by the illumination of the lesion with visible light; in the presence of oxygen, leads to the generation of
cytotoxic species and consequently to cell death and tissue destruction. The light is absorbed by the PS
molecule and the excited state PS transfers energy to ground state molecular oxygen, forming a reactive
oxygen species that oxidize lipids, proteins, and nucleic acids. The resulting damage to essential biomolecules kills target cells by necrosis, apoptosis, or autophagy. When used as a cancer treatment PDT
is known to cause direct tumor cell killing, severe damage to tumor blood vessels, and also produce an
acute inflammatory reaction that can stimulate the immune system to recognize, track down, and even
kill distant tumor cells that could cause metastases. This chapter focuses on studies of PDT that have
employed a systems biology approach. These experiments have frequently been carried out using gene-expression micro-arrays. We will cover protective responses induced by PDT that include activation of
transcription factors, heat shock proteins, antioxidant enzymes, and antiapoptotic pathways. Elucidation
of these mechanisms might result in the design of more effective combination strategies to improve the
antitumor efficacy of PDT. Specific pathways shown to be activated after PDT are heat shock proteins
90, 70, and 27, heme oxygenase, and cyclooxygenase-2.



INTRODUCTION
Photodynamic therapy (PDT) is a rapidly advancing treatment for multiple diseases. PDT involves the
administration of a nontoxic drug or dye known as a photosensitizer (PS) either systemically, locally, or
topically to a patient bearing a lesion (frequently but not always cancer), followed after some time by the
illumination of the lesion with visible light, which, in the presence of oxygen, leads to the generation of
cytotoxic species and consequently to cell death and tissue destruction. The light is absorbed by the PS
molecule and the excited state PS transfers energy to ground state molecular oxygen to form reactive
oxygen species that oxidize lipids, proteins and nucleic acids. The resulting damage to these essential
biomolecules kills target cells by processes characterized by necrosis, apoptosis or autophagy. When
used as a cancer treatment PDT is known to cause combinations of direct tumor cell killing, together
with severe damage to tumor blood vessels. In addition PDT can produce an acute inflammatory reaction
that can stimulate the immune system to recognize, track down and kill distant untreated tumor cells
that could cause metastases. This chapter will focus on studies of PDT that have employed a systems
biology approach. Many cell pathways and signaling systems are engaged after PDT and although many
of these cellular changes have been elucidated by traditional biochemical and cell biology techniques,
the newer technologies of omics are increasingly being brought to bear on this problem. In particular
these technologies involve the use of gene-expression micro-arrays. We will cover protective responses
induced by PDT that include activation of transcription factors, heat shock proteins, antioxidant enzymes and antiapoptotic pathways. Elucidation of these mechanisms might result in the design of more
effective combination strategies to improve the antitumor efficacy of PDT.

OVERVIEW OF PHOTODYNAMIC THERAPY


PDT dates from the early days of the twentieth century when workers used dyes such as eosin together
with light to treat skin cancer (Jesionek, 1903). Hematoporphyrin (HP) was also first used at this time
(Hausman, 1911) and sporadic reports (Figge, 1948) of both selective localization of porphyrins in tumors and regression after exposure to visible light appeared until the 1960s. The modern explosion of
interest in PDT dates from the discovery of hematoporphyrin derivative (HPD) by Lipson and Baldes in
1960 (Lipson, 1960), and was fueled by pioneering studies in both basic science and clinical application
(Dougherty, 1974, Dougherty, 1978, Dougherty, 1979) by Dougherty et al. (notable among many groups).
A semi-purified preparation of HPD known as Photofrin (PF) was the first PS to gain regulatory approval for treatment of various cancers in many countries throughout the world, including the United
States. After experience of treating tumors with HPD-PDT was accumulated, it was realized that this
compound had significant disadvantages, including prolonged skin sensitivity necessitating avoidance
of sunlight for many weeks (Baas, 1995), sub-optimal tumor selectivity (Orenstein, 1996), poor light
penetration into the tumor due to the relatively short wavelength used (630 nm) (Spikes, 1990) and the
fact that it was a complex mixture of uncertain structure (Kessel, 1987).
In recent times much work has been done on developing new PS (Gaullier, 1995, Gomer, 1991a),
and at the present time there is such a great number of potential PS for PDT that it is difficult to decide
which ones are suitable for which particular disease or application. Some PS can easily be prepared
by partial syntheses starting from abundant natural starting materials, such as heme, chlorophyll and
bacteriochlorophyll. This route leads to both economical and environmental advantages compared to


complicated total chemical synthesis (Nyman, 2004). In parallel with the advances in chemistry there has
also been much activity in developing new light sources. These include user-friendly lasers frequently
based on solid state laser diodes, as well as inexpensive light-emitting diodes and filtered broad-band
lamps (Brancaleon, 2002). Advances in knowledge of tissue optics have allowed great improvements to be made in treatment planning and in predicting how the light is distributed within the target tissue or organ, and therefore in optimizing the clinical outcome. It is now recognized that different tissue types have very different optical properties, and even that the same tissue type or organ can vary markedly between individuals in how light is absorbed and scattered. The fact that most PS are also fluorescent
(as well as photochemically active) means that imaging and detection strategies can be applied in PDT
protocols. These techniques are sometimes known as photo(dynamic) detection or diagnosis. They may
be carried out to detect otherwise hidden disease such as dysplasia, to delineate tumor borders, or to
visualize disease in inaccessible areas such as the esophagus, bronchus or colon that can, however, be
reached endoscopically.
Another application of fluorescence imaging and quantification is its ability to improve PDT dosimetry. For instance, fluorescence measurements can be made to quantify the actual amount of PS in the patient's lesion before deciding on the appropriate illumination parameters. Fluorescence measurements can also be made to measure photobleaching (see later) of the PS in the tissue, which under some
circumstances, can be a surrogate marker for optimal completion of the treatment. Although PS are usually selected based on photochemical and pharmacokinetic considerations, in the future there may also
be an additional factor to be taken into consideration involving the need for fluorescence imaging.
In order to make rational choices from among the myriad available PS and light sources available,
it is necessary to understand some of the mechanistic aspects of how PS behave upon illumination and
what happens to the PS when they are put in contact with mammalian cells in tissue culture. The precise
way that PDT influences cellular pathways (see later), is largely governed by where in the cell the PS is
located. This subcellular localization in turn is governed by the chemical nature of the PS (molecular
weight, lipophilicity, amphiphilicity, ionic charge and protein binding characteristics), the concentration
of the PS, the incubation time, the serum concentration and the phenotype of the target cell. One of the
chief attractions of PDT as a therapy is the concept of dual selectivity. Collateral damage to normal tissue can be minimized by increasing the selective accumulation of the PS in the tumor or other diseased
tissue, and by delivering the light in a spatially confined and focused manner. Nevertheless PDT can
have side effects including long-lasting skin photosensitivity, occasional systemic and metabolic disturbances, and excessive tissue destruction at the treated site. It is hoped that advances in mechanistic
understanding of PDT will minimize the risk-to-reward ratio and extend the range of disorders, both serious and minor, that can be treated.

Photosensitizers
Hematoporphyrin derivative or Photofrin was the first PS to be studied in detail. However, it proved
highly frustrating for scientists who attempted to determine its chemical structure and to identify its
components (Kessel, 1982, Kessel, 1989a, Kessel, 1989b). There was significant variation between batches
and attempts to fractionate it into its individual component molecules frequently yielded mixtures as
complicated as the starting material (Kessel, 1987). Although there is good evidence for the presence
of hematoporphyrin oligomers, it is uncertain whether they are predominantly ethers or esters, and whether the side chains are predominantly vinyl or hydroxyethyl groups (Kessel, 1986). When these


uncertainties were combined with other significant deficiencies of the preparation, enthusiasm for its
widespread use was decreased. These deficiencies include a long-lasting skin photosensitivity so that
patients may have to avoid sunlight for as long as eight weeks, the lack of a reasonably-sized absorption band > 650 nm, and the fact that its tumor-localizing properties were not as pronounced as first
thought.
These considerations spurred a large effort amongst organic chemists to develop novel PS that could in
theory be candidates for mediating PDT. The net result is a collection of probably hundreds of compounds
and it can be bewildering to try to choose among them. The characteristics of the ideal PS have been
discussed in recent reviews (Abels, 2004, Allison, 2004). They should have low levels of dark toxicity to both humans and experimental animals and a low incidence of toxicity on administration (e.g. hypotension or
allergic reaction). They should absorb light in the red or far-red wavelengths in order to penetrate tissue
(see later). Absorption bands at shorter wavelengths have less tissue penetration and are more likely to
lead to skin photosensitivity (the power in sunlight drops off at > 600 nm). Absorption bands at high
wavelengths (> 800 nm) mean that the photons will not have sufficient energy for the PS triplet state to
transfer energy to the ground state oxygen molecule to excite it to the singlet state (see later).
They should have relatively high absorption bands (> 20,000-30,000 M^-1 cm^-1) to minimize the dose
of PS needed to achieve the desired effect. Synthesis of the PS should be relatively easy and the starting
materials readily available to make large scale production feasible. The PS should be a pure compound
with a constant composition and a stable shelf life, and be ideally water soluble or soluble in a harmless
aqueous solvent mixture. It should not aggregate unduly in biological environments as this reduces its
photochemical efficiency. The pharmacokinetic elimination from the patient should be rapid, i.e. less
than one day to avoid the necessity for post-treatment protection from light exposure and prolonged skin
photosensitivity. A short interval between injection and illumination is desirable to facilitate outpatient
treatment that is both patient-friendly and cost-effective. Pain on treatment is undesirable, as PDT does
not usually require anesthesia or heavy sedation. Although high PDT activity is thought to be a good
thing, it is possible to have excessively powerful PS that are somewhat unforgiving. With limitations in
effectiveness of both PS and light dosimetry, highly active PS may easily permit treatment overdosage.
It is at present uncertain whether it is better to have a PS tailored to a specific indication and to have
families or portfolios of PS for various diseases or patient types, or to seek one PS that works against
most diseases. Lastly a desirable feature might be to have an inbuilt method of PS dosimetry monitoring
and following response to treatment by measuring in vivo fluorescence and its loss by photobleaching.
The majority of PS used both clinically and experimentally, are derived from the tetrapyrrole aromatic
nucleus found in many naturally occurring pigments such as heme, chlorophyll and bacteriochlorophyll. Tetrapyrroles usually have a relatively large absorption band in the region of 400 nm known as the Soret
band, and a set of progressively smaller absorption bands as the spectrum moves into the red wavelengths
known as the Q-bands. Naturally occurring porphyrins are fully conjugated (non-reduced) tetrapyrroles
and vary in the number and type of side groups particularly carboxylic acid groups (uroporphyrin has
eight, coproporphyrin has four and protoporphyrin has two). Porphyrins have the longest wavelength
absorption band in the region of 630 nm, and it tends to be small. Chlorins are tetrapyrroles with the double bond in one pyrrole ring reduced. This means that the longest wavelength absorption band shifts to the region of 650-690 nm and increases several-fold in height; both these factors are highly desirable
for PDT. Bacteriochlorins have two pyrrole rings with reduced double bonds, and this leads to the absorption band shifting even further into the red, and increasing further in magnitude. Bacteriochlorins
may turn out to be even more effective PS than chlorins, but with relatively few candidate molecules


and some questions about the stability of these molecules upon storage this remains to be seen. There
are a set of classical chemical derivatives generally obtained from naturally occurring porphyrins and
chlorins that include such structures as purpurins, pheophorbides, pyropheophorbides, pheophytins
and phorbins some of which have been studied (a few extensively) as PS for PDT. A second widely
studied structural group of PS is the phthalocyanines (PC), and to a lesser extent, their related cousins
the naphthalocyanines. Again their longest absorption band is at > 650 nm and usually has a respectable
magnitude. As can be imagined the presence of four phenyl groups (or even worse four naphthyl groups)
causes solubility and aggregation problems. PCs are frequently prepared with sulfonic acid groups to
provide water solubility and with centrally coordinated metal atoms. It was found that the asymmetrically substituted disulfonic acids acted as the best PS (compared to mono-, symmetrically di-, tri- and
tetra-substituted sulfonic acids) in both the zinc (Fingar, 1993) and aluminum (Peng, 1990) series of PC
derivatives. Another broad class of potential PS includes completely synthetic, non-naturally-occurring, conjugated pyrrolic ring systems. These comprise such structures as texaphyrins (Sessler, 2000),
porphycenes (Stockert, 2007), and sapphyrins (Kral, 2002). A last class of compounds that have been
studied as PS are non-tetrapyrrole derived naturally occurring or synthetic dyes. Examples of the first
group are hypericin (from St John's wort) (Agostinis, 2002), and from the second group are toluidine
blue O (Stockert, 1996) and Rose Bengal [31]. As yet these compounds have perhaps been more often
studied as agents to mediate antimicrobial photoinactivation (Bottiroli, 1997) rather than as PS designed
to kill mammalian cells for applications such as cancer.

Photophysics
When a PS molecule absorbs a photon of visible light the result is depicted in Figure 1.
This is a diagram originally named after the Polish physicist Aleksander Jablonski that graphically
illustrates the processes of light absorption and energy transfer that are at the heart of PDT. The ground
state PS has two electrons with opposite spins (this is known as singlet state) in the low energy molecular
orbital. Following the absorption of light (photons), one of these electrons is boosted into a high-energy
orbital but keeps its spin (first excited singlet state). This excited state is a short-lived (nanoseconds) species and can lose its energy by emitting light (fluorescence) or by internal conversion into heat. The fact
that most PS are fluorescent has led to the development of sensitive assays to quantify the amount of PS
in cells or tissues, and allows in vivo fluorescence imaging in living animals or patients to measure the
pharmacokinetics and distribution of the PS. The excited singlet state PS may also undergo the process
known as intersystem crossing whereby the spin of the excited electron inverts to form the relatively
long-lived (microseconds) excited triplet state, which has parallel electron spins. The long lifetime of the PS triplet state is explained by the fact that the loss of energy by emission of light (phosphorescence) is a spin-forbidden process, as the PS would have to move directly from a triplet to a singlet state.
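As a rough quantitative aside (a generic back-of-the-envelope estimate, not a figure taken from the studies cited in this chapter), the energy carried by a single photon follows directly from its wavelength:

\[
E = \frac{hc}{\lambda} \approx \frac{1240\ \mathrm{eV\cdot nm}}{\lambda}, \qquad E(630\ \mathrm{nm}) \approx 2.0\ \mathrm{eV}, \qquad E(800\ \mathrm{nm}) \approx 1.55\ \mathrm{eV}.
\]

Since the excitation energy of singlet oxygen is commonly quoted as roughly 0.98 eV above the oxygen triplet ground state, even photons at the red end of the therapeutic window can, in principle, populate a PS triplet state energetic enough to drive the energy transfer described in the next section.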

Photochemistry
The PS excited triplet can undergo three broad kinds of reactions that are usually known as Type I,
Type II and Type III (Figure 2). Firstly, in a Type I reaction, the triplet PS can gain an electron from a
neighboring reducing agent. In cells these reducing agents are commonly either NADH or NADPH. The
PS is now a radical anion bearing an additional unpaired electron. Alternatively two triplet PS molecules
can react together involving electron transfer to produce a pair consisting of a radical cation and a radical


Figure 1. Jablonski diagram illustrating absorption of a photon by the ground-state singlet photosensitizer, which gives rise to the short-lived excited singlet state. This can lose energy by fluorescence (negligible in the case of fullerenes), internal conversion to heat, or by intersystem crossing to the long-lived triplet state. The triplet state can undergo photochemistry as shown in Figure 2.

anion. Radical anions may further react with oxygen with electron transfer to produce reactive oxygen
species, in particular superoxide anion. In a Type II reaction, the triplet PS can transfer its energy directly to molecular oxygen (itself a triplet in the ground state), to form excited-state singlet oxygen. Both Type I and Type II reactions can occur simultaneously, and the ratio between these processes depends on the type of PS used and on the concentrations of substrate and oxygen. A less common pathway is known as Type III, in which the triplet-state PS reacts directly with a biomolecule, thus destroying the PS and damaging the biomolecule. Type III is likely to be oxygen-independent in nature. Type II processes are thought to best conserve the PS molecular structure in a photoactive state, and in some circumstances a single PS molecule can generate 10,000 molecules of singlet oxygen. The PS can in some circumstances also react with the singlet oxygen it produces, in a process known as oxygen-dependent photobleaching. Type I pathways frequently involve initial production of superoxide anion by electron transfer from the triplet PS to molecular oxygen (monovalent reduction) (Bilski, 1993, Ma, 2001). Superoxide is not
particularly reactive in biological systems and does not by itself cause much oxidative damage, but can
react with itself to produce hydrogen peroxide and oxygen, a reaction known as dismutation that can
be catalyzed by the enzyme superoxide dismutase (SOD). Hydrogen peroxide is important in biological systems because it can pass readily through cell membranes and cannot be excluded from cells.
Hydrogen peroxide is actually necessary for the function of many enzymes, and thus is required (like
oxygen itself) for health.
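The Type I and Type II pathways outlined above can be summarized in schematic form (a condensed sketch of the steps named in the text, with ³PS* denoting the triplet-state photosensitizer and SOD the enzyme superoxide dismutase):

\[
\begin{aligned}
\text{Type I:}\quad & {}^{3}\mathrm{PS}^{*} + \mathrm{NAD(P)H} \rightarrow \mathrm{PS}^{\bullet-} + \mathrm{NAD(P)}^{\bullet} + \mathrm{H}^{+}\\
& \mathrm{PS}^{\bullet-} + \mathrm{O}_{2} \rightarrow \mathrm{PS} + \mathrm{O}_{2}^{\bullet-}\\
& 2\,\mathrm{O}_{2}^{\bullet-} + 2\,\mathrm{H}^{+} \xrightarrow{\mathrm{SOD}} \mathrm{H}_{2}\mathrm{O}_{2} + \mathrm{O}_{2} \quad \text{(dismutation)}\\
\text{Type II:}\quad & {}^{3}\mathrm{PS}^{*} + {}^{3}\mathrm{O}_{2} \rightarrow \mathrm{PS} + {}^{1}\mathrm{O}_{2}
\end{aligned}
\]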


Figure 2. Schematic representation of the Type I, Type II and Type III photochemical mechanisms thought
to operate in PDT.

Superoxide is also important in the production of the highly reactive hydroxyl radical (HO•). In this process, superoxide actually acts as a reducing agent, not as an oxidizing agent, because superoxide donates one electron to reduce metal ions (such as ferric iron, Fe3+) that act as the catalyst to convert hydrogen peroxide (H2O2) into the hydroxyl radical (HO•). This reaction is called the Fenton reaction and was discovered over a hundred years ago. It is important in biological systems because most cells have some level of iron, copper, or other metals, which can catalyze this reaction. The reduced metal (ferrous iron, Fe2+) then catalyzes the breaking of the oxygen-oxygen bond of hydrogen peroxide to produce a hydroxyl radical (HO•) and a hydroxide ion (HO−). Superoxide can react with the hydroxyl radical (HO•) to form singlet oxygen, or with nitric oxide (NO•, also a radical) to produce peroxynitrite (OONO−), another highly reactive oxidizing molecule. Like H2O2, HO• passes easily through membranes and cannot be kept out of cells. Hydroxyl radical damage is diffusion rate-limited. This highly reactive radical can add to an organic (carbon-containing) substrate (represented by R below); this could be, for example, a fatty acid, which would form a hydroxylated adduct that is itself a radical. The hydroxyl radical can also oxidize the organic substrate by stealing (abstracting) an electron from it. The resulting oxidized substrate is again itself a radical and can react with other molecules in a chain reaction. For example, it could react with ground-state oxygen to produce a peroxyl radical (ROO•). The peroxyl radical again is highly reactive and can react with another organic substrate in a chain reaction. This type of chain reaction is common in the oxidative damage of fatty acids and other lipids, and demonstrates why radicals such as the hydroxyl radical can cause so much more damage than one might have expected.
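Written out compactly, the iron-catalyzed chemistry and the lipid chain reaction described in this paragraph are (RH denotes the organic substrate referred to as R above):

\[
\begin{aligned}
\mathrm{O}_{2}^{\bullet-} + \mathrm{Fe}^{3+} &\rightarrow \mathrm{O}_{2} + \mathrm{Fe}^{2+}\\
\mathrm{Fe}^{2+} + \mathrm{H}_{2}\mathrm{O}_{2} &\rightarrow \mathrm{Fe}^{3+} + \mathrm{HO}^{\bullet} + \mathrm{HO}^{-} \quad \text{(Fenton reaction)}\\
\mathrm{HO}^{\bullet} + \mathrm{RH} &\rightarrow \mathrm{R}^{\bullet} + \mathrm{H}_{2}\mathrm{O}\\
\mathrm{R}^{\bullet} + \mathrm{O}_{2} &\rightarrow \mathrm{ROO}^{\bullet}\\
\mathrm{ROO}^{\bullet} + \mathrm{RH} &\rightarrow \mathrm{ROOH} + \mathrm{R}^{\bullet} \quad \text{(chain propagation)}
\end{aligned}
\]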

These ROS, together with singlet oxygen produced via the Type II pathway, are oxidizing agents that can
directly react with many biological molecules. Amino acid residues in proteins are important targets that
include cysteine, methionine, tyrosine, histidine, and tryptophan (Grune, 2001, Midden, 1992). Due to
their reactivity, these amino acids are the primary target of an oxidative attack on proteins. The reaction
mechanisms are rather complex and as a rule lead to a number of final products. Cysteine and methionine
are oxidized mainly to sulfoxides, histidine yields a thermally unstable endoperoxide, tryptophan reacts by a complicated mechanism to give N-formylkynurenine, and tyrosine can undergo phenolic oxidative
coupling. Unsaturated lipids typically undergo ene-type reactions to give lipid hydroperoxides (LOOHs
derived from phospholipids and cholesterol) (Bachowski, 1994, Girotti, 1985, Girotti, 1983). DNA can
be oxidatively damaged at both the nucleic bases (the individual molecules that make up the genetic
code) and at the sugars that link the DNA strands by oxidation of the sugar linkages, or cross-linking
of DNA to protein (a form of damage particularly difficult for the cell to repair). Although all cells have
some capability of repairing oxidative damage to proteins and DNA, excess damage can cause mutations or cell death. Of the four bases in nucleic acids, guanine is the most susceptible to oxidation by 1O2.
The reaction mechanism has been extensively studied in connection with oxidative cleavage of DNA
(Buchko, 1995). The first step is a [4 + 2] cycloaddition to the C-4 and C-8 carbons of the purine ring
leading to an unstable endoperoxide (Buchko, 1993). The subsequent complicated sequence of reactions
and the final products depend on whether the guanine moiety is bound in an oligonucleotide or a double
stranded DNA (Ravanat, 1995). Because of the high reactivity and short half-life of singlet oxygen and hydroxyl radicals, only molecules and structures that are proximal to the area of their production (areas of PS localization) are directly affected by PDT. The half-life of singlet oxygen in biological systems is less than about 40 ns and, therefore, the radius of action of singlet oxygen is of the order of 20 nm (Moan, 1991).
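The quoted radius of action can be rationalized with a simple three-dimensional diffusion estimate, r ≈ √(6Dτ); the diffusion coefficient used below is an assumed order-of-magnitude value (comparable to that of oxygen in water) rather than a measured intracellular figure:

\[
r \approx \sqrt{6 D \tau} \approx \sqrt{6 \times (2\times10^{-5}\ \mathrm{cm^{2}\,s^{-1}}) \times (40\times10^{-9}\ \mathrm{s})} \approx 2\times10^{-6}\ \mathrm{cm} \approx 20\ \mathrm{nm}.
\]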

Light Delivery
In PDT it is important to be able to predict the spatial distribution of light in the target tissue. Light is
either scattered or absorbed when it enters tissue and the extent of both processes depends on tissue type
and light wavelength. Tissue optics involves measuring the spatial/temporal distribution and the size
distribution of tissue structures and their absorption and scattering properties. This is rather involved
because the biological tissue is inhomogeneous and the presence of microscopic inhomogeneities (macromolecules, cell organelles, organized cell structure, interstitial layers, etc.) makes it turbid. Multiple
scattering within a turbid medium leads to spreading of a light beam and loss of directionality. Absorption is largely due to endogenous tissue chromophores such as hemoglobin, myoglobin and cytochromes.
Complete characterization of light transport in tissue is a formidable task; therefore, heuristic approaches with different levels of approximation have been developed to model it. Any effort to model light transport also requires accurate values for the optical properties of the tissue.
Scattering is generally the most important factor in limiting light penetration into most tissues and is measured by the scattering coefficient μs (which for soft tissues is in the range 100-1000 cm-1). Absorption is usually of lesser importance and is measured by the absorption coefficient μa (values in the range of 0.1-5 cm-1 for most tissues at green and longer wavelengths). The third parameter necessary to define tissue optical properties is the anisotropy factor g, which characterizes the directionality of scattering. It is possible to use mathematical approaches such
as diffusion theory or Monte Carlo modeling to predict how light will travel into target tissue and the
illumination parameters (fluence, fluence rate, wavelength, angle of incidence) may then be adjusted to
maximize the light dose.
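As a minimal illustration of the diffusion-theory approach mentioned above, the following sketch estimates an effective penetration depth from bulk optical properties. The numerical values are representative examples chosen from the ranges quoted in this section, not measurements of any particular tissue, and real treatment planning would rely on full diffusion or Monte Carlo calculations.

import math

def effective_penetration_depth(mu_a, mu_s, g):
    """Return the effective penetration depth (cm) in the diffusion approximation.

    mu_a : absorption coefficient (cm^-1)
    mu_s : scattering coefficient (cm^-1)
    g    : anisotropy factor (dimensionless)
    """
    mu_s_reduced = mu_s * (1.0 - g)                         # reduced scattering coefficient
    mu_eff = math.sqrt(3.0 * mu_a * (mu_a + mu_s_reduced))  # effective attenuation coefficient
    return 1.0 / mu_eff                                     # depth at which fluence falls to 1/e (37%)

def fluence_fraction(depth_cm, mu_a, mu_s, g):
    """Fraction of the broad-beam fluence remaining at a given depth, assuming
    simple exponential decay with the effective attenuation coefficient."""
    return math.exp(-depth_cm / effective_penetration_depth(mu_a, mu_s, g))

if __name__ == "__main__":
    # Example values only (within the ranges quoted in the text): red light in soft tissue.
    mu_a, mu_s, g = 0.5, 150.0, 0.9   # cm^-1, cm^-1, dimensionless
    delta = effective_penetration_depth(mu_a, mu_s, g)
    print(f"Effective penetration depth: {delta * 10:.1f} mm")
    print(f"Fluence remaining at 5 mm:   {fluence_fraction(0.5, mu_a, mu_s, g):.1%}")

With these illustrative inputs the predicted penetration depth is about 2 mm, consistent with the 1-3 mm figure quoted below for 630 nm light.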

The combination of absorption of lower wavelength light by the important tissue chromophores
(oxy and deoxyhemoglobin and melanin) together with reduced light scattering at longer wavelengths
and the occurrence of water absorption at wavelengths greater than 1300 nm has led to the concept of the optical window in tissue (see Figure 3). In terms of PDT, the average effective penetration depth (intensity reduced to 37%) is about 1-3 mm at 630 nm, the wavelength used for clinical treatment with PF, while penetration is approximately twice that at 700-850 nm (Svaasand, 1984, Wilson, 1985). The
increased penetration depth of longer wavelength light is a major incentive for the development of PS
absorbing at such wavelengths, and a naphthalocyanine (776 nm) (Firey, 1987) and bacteriochlorin (780
nm) (van Leengoed, 1993) fall into this category. The absorption of light by the PS itself can limit tissue
light penetration. This phenomenon has been termed self-shielding and is particularly pronounced
with PS that absorb very strongly at the treatment wavelength (Dougherty, 1991). Many PS are prone to photo-destruction during light exposure, a process called photobleaching (Spikes, 1993). This is thought to happen when the singlet oxygen or other ROS produced upon illumination reacts with the
PS molecule itself in a manner that reduces its efficiency for further photosensitization processes. PS of
different chemical structures have widely varying photobleaching rates and in some cases (particularly
that of PPIX) the first product of photobleaching is actually a better PS than the starting molecule.
Nevertheless photobleaching usually means loss of PDT reactivity but this may still have beneficial
effects regarding the treatment differential. These are based on the following considerations: there exists a threshold PDT dose required to produce tissue necrosis (Grossweiner, 1997). If photobleaching (a process that does not have such a threshold) occurs before this threshold dose is reached, no tissue damage is incurred. This is

Figure 3. Optical window in tissue. Absorption spectra of important tissue chromophores such as water,
oxy- and deoxyhemoglobin and melanin are plotted on a logarithmic scale.
(Plot: absorption, on a logarithmic scale, versus wavelength from 400 to 2000 nm for water, Hb, HbO2 and melanin, with the optical window indicated.)

desirable for normal tissue exposed to therapeutic light but not for the tumor tissue to be treated. Thus,
the net result is that one can achieve greater depth of tumor necrosis while sparing the normal skin.
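The threshold argument can be made concrete with a deliberately simple toy calculation. It assumes that the photodynamic dose accumulates in proportion to the light absorbed by the remaining PS and that photobleaching is first-order in the delivered fluence; all numbers below are hypothetical and dimensionless, chosen only to illustrate the sparing effect, and do not come from the cited dosimetry studies.

import math

def delivered_dose(c0, beta, fluence):
    """Toy photodynamic dose after total fluence H: dose(H) = (c0/beta)*(1 - exp(-beta*H)).

    c0 is the initial PS concentration and beta the photobleaching constant, so the
    accumulated dose can never exceed the cap c0/beta, however much light is given."""
    return (c0 / beta) * (1.0 - math.exp(-beta * fluence))

if __name__ == "__main__":
    beta = 0.05        # hypothetical photobleaching constant (per unit fluence)
    threshold = 10.0   # hypothetical threshold dose required for necrosis
    tissues = (("tumor (higher PS uptake)", 1.0),
               ("normal skin (lower PS uptake)", 0.3))
    for fluence in (50.0, 200.0, 1000.0):
        for label, c0 in tissues:
            dose = delivered_dose(c0, beta, fluence)
            outcome = "necrosis" if dose >= threshold else "spared (bleached out below threshold)"
            print(f"H={fluence:6.0f}  {label:30s} dose={dose:5.1f} (cap {c0/beta:4.1f}) -> {outcome}")

Because the normal-tissue dose saturates below the threshold, the fluence (and hence the depth of effective treatment in the tumor) can be increased without ever pushing the normal skin over the necrosis threshold, which is the differential described in the paragraph above.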

SUBCELLULAR LOCALISATION OF PS
PS uptake by cancer or other cells is crucial for effective PDT. ROS have a short half-life and act close to their site of generation; therefore, to a certain degree, the type of photodamage that occurs in cells loaded with a PS and illuminated depends on the precise subcellular localization of the PS within
the cell. The understanding of PS localization principles is therefore important for choosing the most
effective PS for each application. Confocal laser scanning fluorescence microscopy has made the determination of intracellular location of PS much easier, and gives more sensitivity and better spatial
resolution than earlier non-confocal techniques. Colocalization of subcellular organelle specific probes
with differing fluorescence emission maxima to that of the PS can be used to more closely identify the
site of localization (Woodburn, 1991) and these probes can also be used to identify sites of damage after
illumination (Kessel, 1997). Fluorescence resonance energy transfer (FRET) (Morris, 2003) can also
be used to determine intracellular location of PS. Intracellular distributions in cultured cells have been
determined for a range of PS with widely differing structures. The important structural features are (a)
the net ionic charge, which can range from -4 (anionic) to +4 (cationic), (b) the degree of hydrophobicity, expressed as the logarithm of the octanol/water partition coefficient (log P), and (c) the degree of asymmetry present in the molecule. PS which are hydrophobic and have two or fewer negative charges can diffuse across the plasma membrane and then relocate to other intracellular membranes. These PS also tend to have the greatest uptake into cells in vitro, especially when present at relatively low concentrations in the medium (<1 µM). Those PS which are less hydrophobic and have >2 negative charges tend to be
too polar to diffuse across the plasma membrane, and are therefore taken up by endocytosis. Some PS
distribute very broadly in various intracellular membranes. An example is pyropheophorbide-a methyl
ester that was reported to be localized in endoplasmic reticulum, Golgi apparatus, lysosomes and mitochondria, in NCI-h446 cells (Sun X., 2002b).
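These qualitative structure-localization tendencies can be caricatured as a small decision function. This is only a toy restatement of the rules given in the text (and of the cationic/mitochondrial tendency discussed in the Mitochondria section below), not a validated predictor, and the log P cut-off is an arbitrary illustrative choice.

def likely_uptake_route(net_charge, log_p, hydrophobic_cutoff=1.0):
    """Toy heuristic restating the localization tendencies described in the text.

    net_charge : net ionic charge of the PS (roughly -4 ... +4)
    log_p      : logarithm of the octanol/water partition coefficient
    hydrophobic_cutoff : arbitrary log P above which the PS is treated as hydrophobic here
    """
    hydrophobic = log_p >= hydrophobic_cutoff
    if hydrophobic and net_charge >= -2:
        if net_charge > 0:
            # Cationic, hydrophobic PS tend to end up in mitochondria, driven in part
            # by the mitochondrial membrane potential (see the Mitochondria section).
            return "diffusion across the plasma membrane; mitochondrial accumulation favored"
        return "diffusion across the plasma membrane; relocation to intracellular membranes"
    if net_charge <= -3 and not hydrophobic:
        # Polar PS with more than two negative charges are taken up by endocytosis.
        return "endocytosis; lysosomal/endosomal localization likely"
    return "intermediate case; localization depends on cell type and formulation"

# Hypothetical example values:
print(likely_uptake_route(net_charge=2, log_p=2.5))
print(likely_uptake_route(net_charge=-4, log_p=-1.0))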

Lysosomes
In 1993 lysosomes were proposed to be a critical intracellular target for localization of PS (Geze, 1993).
However, succeeding studies (Berg, 1994) have found that although lysosomally localized PS can lead to
cell killing upon illumination, the relative efficacy is significantly lower than that seen with PS localized
in mitochondria and other organelles (MacDonald, 1999). This may be due to the tendency of PS with
greater degrees of aggregation to accumulate in lysosomes. Woodburn et al. (Woodburn, 1991) studied
intracellular localization, in V79 Chinese hamster lung fibroblasts and C6 glioma cells, of a series of
porphyrins derived from HP and PPIX with side chains chemically modified to give hydrophobic and
anionic or cationic residues at physiological pH. Compounds were selected to represent all combinations of these characteristics and it was found that those with a net cationic character localized in mitochondria, while those with net anionic character localized in lysosomes. As the anionic porphyrins all
carried two negative charges, these results are in accord with previous work suggesting that sensitizers
with a net charge of -2 or greater accumulate in lysosomes. Nagata et al. (Nagata, 2003) showed that the
chlorin-based PS, ATX-S10 (Na) had a primary site of accumulation in lysosomes but cells underwent
apoptosis upon illumination doses leading to 70% cell death, suggesting that apoptotic pathways may
be activated via mitochondrial destabilization following the damage of lysosomes by PDT. The initial
intracellular localization of PS in lysosomes may redistribute due to photodynamic action after only a
small amount of light has been delivered. It was found that exposure of cells preincubated with anionic porphyrins to light doses that inactivated 20% of the cells resulted in relocalization of the sensitizers from the lysosomes to the cytoplasm in general and, more specifically, to the nucleus (Berg, 1991). This behavior was attributed to photodynamic permeabilization of the lysosomal membrane, thus allowing small molecules, including the PS, to leak out into the cytoplasm.

Mitochondria
Mitochondria have been found to be a very important subcellular target for many PS used in PDT
(Morgan, 2001). This is related to the tendency of many PS to produce apoptosis by mitochondrial
damage after illumination (see Section 5). Benzoporphyrin derivative (BPD) is one of the well-studied
mitochondrial-localized PS (Runnels, 1999), however, cellular localization depends on cell type and
BPD formulation (free BPD, liposomal or encapsulated in polycationic liposomes) used [71,72]. Some endothelial cells (ECV304) preferentially accumulated BPD in the perinuclear region, others (HUVEC) in the cytoplasm; polycationic liposomal BPD was mostly deposited in the mitochondria, while free BPD
was also found in the perinuclear region (Takeuchi, 2003). Two meso-tetraphenylporphyrin derivatives bearing cationic -N(CH3)3+ groups on two of the para-phenyl positions, either adjacent (5,10-di[4-(N-trimethylaminophenyl)]-15,20-diphenylporphyrin, DADP-a) or opposite (5,15-di[4-(N-trimethylaminophenyl)]-10,20-diphenylporphyrin, DADP-o), were compared in a study by Kessel et al. (Kessel, 2003). DADP-a
localized in mitochondria, while DADP-o (a much more symmetric molecule) localized in lysosomes,
and led to extensive lysosomal photodamage after irradiation. PS with cationic charges and which are
also hydrophobic can localize in mitochondria (Dummin, 1997); this is thought to be due to the influence
of the mitochondrial membrane potential as well as the lipid bilayer of the membrane (Rashid, 1990).
It is known that carcinoma cell mitochondria preferentially accumulate and retain certain cationic dyes
to a much greater extent than most normal cells. Oseroff et al. (Oseroff, 1986) evaluated 10 rhodamine
and cyanine dyes as carcinoma-specific mitochondrial PS in vitro. The most effective, N,N'-bis(2-ethyl-1,3-dioxolane)kryptocyanine, caused marked, light-dependent killing of human bladder, squamous, and colon carcinoma cell lines after 30-min incubations at 0.01-1 µM, but was minimally toxic to human
keratinocytes and to normal monkey kidney epithelial cells. Dummin et al. (Dummin, 1997) prepared
cationic zinc (II) phthalocyanines with lipophilic side-chains and showed they specifically accumulated
in the inner mitochondrial membranes. On irradiation of the incubated HeLa cells, the cristae were
affected and finally completely destroyed. The respiration stopped and the energy metabolism was
shut down. It was known previously that Pc4 localized in mitochondria and Golgi complexes and ER
(Trivedi, 2000). At early times (0-1 h) after introduction of Pc 4 to LY-R cells, the dye was found in the
mitochondria, lysosomes and Golgi apparatus, as well as other cytoplasmic membranes, but not in the
plasma membrane or the nucleus. Over the next 2 h, there was some loss of Pc 4 from the lysosomes
but an accumulation in the Golgi apparatus and the mitochondria. The exact binding site of Pc4 was
discovered only recently. Pc 4-PDT photodamaged Bcl-2 and Bcl-xL, antiapoptotic proteins interacting
with the permeability transition pore complex that forms at contact sites between the inner and outer
mitochondrial membranes. These complexes and the inner membrane are unique in containing the
phospholipid cardiolipin. Nonyl-acridine orange (NAO) is a specific probe of cardiolipin and Morris et
al. (Morris, 2003) showed evidence for fluorescence resonance energy transfer from NAO to Pc 4, thus
defining an intracellular binding site for the PS.

Plasma Membrane
Compounds that localize in plasma membranes of cultured cells are relatively uncommon in the PDT
field. Aveline and Redmond (Aveline, 1999) used confocal fluorescence microscopy to show that
deuteroporphyrin IX (DP) and its monobromo and dibromo derivatives localized preferentially in
the plasma membrane of L1210 cells. PF shows a dynamic distribution in human carcinoma cells: the
plasma membranes are the main target sites of PF after a brief (3 h) incubation, while the Golgi complex
is affected after prolonged (24 h) incubation (Hsieh, 2003). The effect of PDT on cells with plasma membrane-localized PF was found to be a cessation of proliferation post-PDT at Photofrin doses of less than 7 µg/ml, while at a higher dose (28 µg/ml) plasma membrane disruption and cell swelling were observed
immediately after PDT. Characteristics typical of apoptosis such as phosphatidylserine externalization
and DNA fragmentation were not detected in the cell death process caused by this PDT regime.

Golgi Apparatus and Endoplasmic Reticulum


Teiten et al. (Teiten, 2003) studied Foscan subcellular localization in the MCF-7 human adenocarcinoma
cell line by means of confocal microscopy and microspectrofluorometry. The fluorescence topographic
profiles recorded from cells co-stained with Foscan and organelle-specific fluorescent probes revealed that Foscan shows low localization in lysosomes and weak accumulation in mitochondria. However,
the Foscan fluorescence topographic profile turned out to co-localize perfectly with that obtained for
the endoplasmic reticulum (ER) and the Golgi apparatus. The patterns of fluorescence derived from
confocal microscopy studies were consistent with predominant localization of Foscan in these organelles. Furthermore, evaluation of enzymatic activity of selected organelles immediately after laser light
irradiation (650 nm) indicated the Golgi apparatus and ER as the primary damaged sites resulting from
Foscan-mediated PDT in the MCF-7 cell line.

ALA-Induced PPIX
Recently there has been much interest in a different approach to PDT where, instead of the PS being
administered in a pre-synthesized form, a metabolic precursor is administered and the PS is synthesized
in situ in tumors or cells of the target tissue (Peng, 1997).
The precursor is 5-aminolevulinic acid (ALA) which interacts with the heme biosynthetic pathway
as shown in Figure 4. Almost all types of cells of the human body, with the exception of mature red
blood cells, are equipped with this metabolic machinery. In the first step of the pathway ALA is formed
from glycine and succinyl CoA. The synthesis of ALA by ALA-synthetase is under feedback regulation
by the amount of heme in the cell. The last step in the pathway is the incorporation of iron into PPIX, catalyzed by the enzyme ferrochelatase, and this step is rate-limiting. By adding exogenous ALA, the feedback
inhibition is bypassed, and PPIX will accumulate because of the limited capacity of ferrochelatase to
transform PPIX to heme.
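The bottleneck character of ferrochelatase can be illustrated with a deliberately minimal kinetic sketch, in which PPIX is produced at a rate proportional to the exogenous ALA supply (the upstream feedback having been bypassed) and removed by a saturable ferrochelatase step; all rate constants below are arbitrary illustrative numbers, not measured values.

def simulate_ppix(ala_supply, k_synth=1.0, vmax_fch=0.5, km_fch=0.2,
                  t_end=24.0, dt=0.01):
    """Toy model of PPIX accumulation after exogenous ALA administration.

    d[PPIX]/dt = k_synth * ALA  -  Vmax * [PPIX] / (Km + [PPIX])

    The first term lumps the enzymatic steps from ALA to PPIX; the second is the
    rate-limiting ferrochelatase step converting PPIX to heme.  All parameters
    are arbitrary, dimensionless illustration values.
    """
    ppix, t, trace = 0.0, 0.0, []
    while t <= t_end:
        production = k_synth * ala_supply
        removal = vmax_fch * ppix / (km_fch + ppix)
        ppix += (production - removal) * dt
        trace.append((round(t, 2), ppix))
        t += dt
    return trace

if __name__ == "__main__":
    # With a low ALA supply, ferrochelatase keeps up and PPIX settles at a low
    # steady state; with excess exogenous ALA, production exceeds Vmax and PPIX
    # accumulates steadily -- the basis of ALA-induced photosensitization.
    for ala in (0.3, 2.0):
        trace = simulate_ppix(ala)
        print(f"ALA supply {ala}: PPIX after {trace[-1][0]} h = {trace[-1][1]:.2f}")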
PPIX is formed in the mitochondria of cells, but rapidly diffuses to other intracellular membrane sites.
Gaullier et al. (Gaullier, 1995) found early staining in mitochondria but at later time points the plasma


Figure 4. ALA-induced PPIX. Schematic diagram illustrating the interaction of the heme biosynthesis
pathway with exogenous ALA to give intracellular PPIX. Abbreviations are ALA-D: ALA dehydratase;
ALA-S: ALA synthetase; Coprogen III: coproporphyrinogen III; CPO: coproporphyrinogen oxidase;
FCH: ferrochelatase; HMB: hydroxymethylbilane, PBG-D: porphobilinogren deaminase; protogen
III: protoporphyrinogen; PPO: protoporphyrinogen oxidase; Urogen III: uroporphyrinogen III; UCS:
uroporphyrinogen cosynthase, UGD: uroporphyrinogen decarboxylase.

membrane showed strong staining, and fluorescent spots (shown to be lysosomes by co-localization
experiments with lysosomal probes) were observed within the cytoplasm especially in the perinuclear
region. Fluorescence spectra suggested that the PPIX microenvironments were quite different at short
and long incubation times. In vivo, ALA may be administered orally (van den Boogert, 1998), intravenously (Svanberg, 1996), or topically (Calzavara-Pinton, 1995). The reasons why cancer cells tend to synthesize more PPIX than normal cells have been much investigated. Hypotheses include greater expression of the heme biosynthesis enzymes porphobilinogen deaminase (Gibson, 1998) and coproporphyrinogen oxidase (Ortel, 1998), or reduced expression of ferrochelatase (Van Hillegersberg, 1992), but increased delivery of ALA to the tumor may also play a role, especially in topical application (Szeimies, 1994). Recent attempts to increase the efficacy of ALA-mediated PDT include the use of iron chelators to decrease the amount of PPIX converted to heme by ferrochelatase, by removing the free iron that is necessary for the enzyme to work (Curnow, 1998). Another approach is to administer ALA as one
of various alkyl esters (methyl, pentyl, hexyl or benzyl) in order to increase cellular uptake by making
the molecule more lipophilic (Fotinos, 2006). Since ALA is frequently applied topically to the skin, the ALA methyl ester, which penetrates the skin's natural permeability barrier much better than the polar ALA, recently received approval to treat basal cell skin cancers (Lehmann, 2007).

CELL PATHWAYS ACTIVATED BY PDT


One of the largest areas of research in the field of mechanisms in PDT in recent years has paralleled the
explosion of knowledge in cell biology about signal transduction pathways, transcription factors and
regulation of cell cycle control, inflammation and cell death. Figure 5 gives a graphical representation
of some of these signal transduction pathways activated by PDT.

Calcium
PDT of various cell types in vitro has been shown to raise the levels of free calcium within cells, and this has been associated with cell death and, under certain conditions, with cell survival. The
Ca2+ rise upon PDT has been proposed to occur via the influx of Ca2+ through ion channels (Joshi, 1994,
Penning, 1992), release of Ca2+ sequestered in internal stores in the endoplasmic reticulum (ER) and
mitochondria (Hubmer, 1996), and/or activation of ion exchange mechanisms (Specht, 1991). Exposure
of Chinese hamster ovary (CHO) cells and T24 human bladder transitional carcinoma cells treated with the PS aluminum phthalocyanine (AlPc) to red light caused an immediate increase of cytoplasmic free calcium, [Ca2+]i, which reached a peak within 5-15 min after exposure and then returned to the basal level (approximately 200 nM) (Penning, 1992). Loading the cells with the intracellular calcium chelators quin2 or BAPTA prior to light exposure enhanced cell killing. This indicates that the increased [Ca2+]i, caused by extracellular Ca2+ influx after PDT, contributed to the survival of the treated cells by triggering a cellular rescue response.
Ding et al. (Ding, 2004) showed that PDT with hematoporphyrin monomethyl ether (HMME) induced cell death by apoptosis and necrosis, and that sodium azide (a singlet oxygen quencher) or D-mannitol (a hydroxyl radical scavenger) could protect HeLa cells from this death. Sodium azide or D-mannitol also inhibited the HMME-PDT-mediated [Ca2+]i elevation. Cytochrome c release from mitochondria into the cytosol and caspase 3 activation after HMME-PDT were inhibited by BAPTA/AM (an intracellular calcium chelator). These results suggest that the ROS generated in HeLa cells by HMME-PDT induce apoptosis through [Ca2+]i elevation, which mediates cytochrome c release and caspase 3 activation and thereby initiates apoptosis.
The effects of aminolevulinic acid (ALA)-PDT (induction with 1 mM ALA for 4 h followed by a blue light dose of 18 J/cm2) were studied (Grebenova, 2003) on the human promyelocytic leukemia cell

Figure 5. Signal transduction pathways activated after PDT. Events occur at receptors located at the
plasma membrane and lead to alterations in cellular metabolism. These may tend towards increasing
apoptosis or increasing cell survival.

line HL60 using biochemical and electron microscopy methods. It was seen that ALA-PDT activates the mitochondrial apoptotic pathway; the levels of the endoplasmic reticulum Ca2+-binding chaperones ERp57 and ERp72 and of the anti-apoptotic proteins Bcl-2 and Bcl-xL were decreased, whereas those of the Ca2+-binding protein calmodulin and the stress protein HSP60 were elevated following ALA-PDT. Inhibition of the initiator caspase 9, the executioner caspase 3 and the Ca2+-dependent protease m-calpain did not prevent DNA fragmentation.
PDT using the photosensitizer benzoporphyrin derivative (BPD) has been previously shown to induce rapid apoptosis via a mitochondrial-caspase activation pathway. Granville et al. (Granville, 2001)
analyzed the impact of PDT on other cellular organelles such as the endoplasmic reticulum (ER). The
effect of PDT on [Ca2+]i in control and Bcl-2-overexpressing HeLa cells was assessed. A greater [Ca2+]i
transient was observed for Bcl-2 overexpressing cells in response to PDT. The PDT-induced Ca2+ release
was due to the emptying of Ca2+ from ER and possibly mitochondrial stores and was not due to an influx
of Ca2+ from the medium. Studying the effects of PDT with the hydrophilic tetrasulfonated aluminum phthalocyanine (AlPcS4) in cell cultures of rat bladder epithelial cells using confocal microscopy, Ruck et al. (Ruck, 2000) showed a transient calcium elevation during the irradiation process, especially in the cell nuclei, followed by a more sustained increase.
The activation of the membrane-localized enzymes phospholipase C (PLC) and phospholipase A2
(PLA2) is a very early event in the induction of PDT-induced apoptosis, because intracellular Ca2+ acts as a second messenger in cellular signaling in response to a wide range of stimuli and may link activation of PLA2 to activation of PLC. Upon activation, PLC hydrolyzes phosphatidylinositol 4,5-bisphosphate to produce inositol 1,4,5-trisphosphate (IP3) and diacylglycerol (DAG). It is known that DAG activates protein kinase C (PKC) and that IP3 promotes increases in intracellular Ca2+ (Berridge,
1984). Ca2+ binds to calmodulin in cells, which therefore acts as an intermediary protein that senses
calcium levels and relays signals to various calcium-sensitive enzymes, ion channels and other proteins.
This frequently happens via a complex with a second protein calcineurin. Once the complex is formed,
Ca2+/calmodulin/calcineurin can in turn act to dephosphorylate the transcription factor nuclear factor of activated T cells (NFAT). Activated NFAT can regulate transcription through binding its own
cognate DNA-binding site. One marker of keratinocyte differentiation, the p21 gene, is activated by
NFAT by a different mechanism, with NFAT activating the p21 promoter by acting as a coactivator
for the transcription factors Sp1 and Sp3 (Santini, 2001). In general, it is possible to think that Ca2+ is
a link between many of the pathways activated by PDT and plays an important role in the effect PDT
has on many cellular functions.

Lipid Metabolism
As discussed in the previous section, PDT can lead to rapid changes in intracellular Ca2+ and there are
many interconnected pathways between Ca2+ and lipid metabolism largely due to the activation of phospholipases. The rapid release of arachidonic acid metabolites seen in many cases after PDT (Agarwal,
1993, Henderson, 1989, Penning, 1993) may result from the activation of PLA2, an enzyme activated by
Ca2+. Penning et al. (Penning, 1993) found that HPD-PDT of human bladder cancer cells led to release
of prostaglandin E2 (PGE2) and thromboxane B2 (TXB2), and this release was reduced by calcium chelation with EGTA, resulting in inhibition of PLA2, or by using indomethacin to inhibit cyclooxygenase. These treatments also increased cell death, suggesting that arachidonic acid metabolism can protect cells
from PDT killing. Nevertheless, the role of cyclooxygenases in PDT may depend on the cell type and/or
the PDT dose, since preincubation of C6 glioma cells with indomethacin increased the number of cells
surviving after PDT with HPD, whereas the survival rate for endothelial cells was decreased in the
presence of the inhibitor when higher HPD concentrations were used (Foultier, 1992).
Fingar et al. (Fingar, 1990) showed that there was a PDT dose-dependent increase in serum thromboxane
after Photofrin (PF)-mediated PDT of rats bearing chondrosarcoma tumors. These authors went on to
show (Fingar, 1993) that a thromboxane synthetase inhibitor, a thromboxane receptor antagonist and an
inhibitor of platelet shape change in combination with PF-PDT reduced vasoconstriction, inhibited vessel
permeability and reduced tumor cure. Ferrario et al. (Ferrario, 2002) showed that both porphyrin- and chlorin-based PS were able to elicit cyclooxygenase-2 (COX-2) expression after PDT in
vitro and in vivo. COX-2 (but not COX-1) mRNA and protein levels were increased in radiation-induced
fibrosarcoma (RIF-1), BA (mouse mammary carcinoma) and LLC (Lewis lung carcinoma) cells and in RIF-1 tumors in mice, together with release of PGE2, after PF-PDT. These authors also combined PDT with
the selective COX-2 inhibitor NS-398 in RIF-1 tumors and demonstrated enhanced PDT responsiveness
and decreased induction of both PGE2 and vascular endothelial growth factor in treated tumors.
Ceramide is a stress-induced second messenger that is generated from sphingophospholipids (which
are part of the cell membrane) by sphingomyelinases. These enzymes cleave sphingophospholipids such
as sphingomyelin to yield ceramide and phosphorylcholine, and in addition ceramide can be generated
by de novo synthesis by a ceramide synthase (Bektas, 2004, Yang, 2004). The sphingolipid ceramide has
proven to be a powerful second-signal effector molecule that regulates diverse cellular processes including apoptosis, cell senescence, the cell cycle, and cellular differentiation. Ceramide has been shown to
activate a number of enzymes involved in stress signaling cascades including both protein kinases and
protein phosphatases (Separovic, 1998). Dolgachev et al. (Dolgachev, 2004) showed that the oxidative
stress induced by phthalocyanine (Pc4)-PDT in Jurkat human T lymphoma and CHO cells was accompanied by increases in ceramide with a concomitant decrease in sphingomyelin. Sphingomyelin synthase,
as well as glucosylceramide synthase, was inactivated in a dose-dependent manner and the activity of
serine palmitoyltransferase (the enzyme catalyzing the initial step in the sphingolipid biosynthesis) was
profoundly inhibited after treatment. Niemann-Pick disease lymphoblasts, which are deficient in acid
sphingomyelinase (ASMase) activity, failed to respond to Pc4-PDT with ceramide accumulation and
apoptosis, suggesting that ASMase may be a Pc4-PDT target (Separovic, 1999). However, this finding
appears to be cell-type specific, because in mouse embryonic fibroblasts isolated from ASMase knockout and wild-type mice, Pc4-PDT led to increased caspase 3 activity and subsequent apoptosis in both cell types
(Chiu, 2000). Similarly, ceramide levels were elevated in both cell types post-PDT.

Tyrosine Kinases
Signal transduction networks provide mechanisms for cells to receive external stimuli and respond to
the signals in an appropriate manner. The mitogen-activated protein kinase (MAPK) signaling pathways play an important role in signal transduction in eukaryotic cells, where they modulate many cellular events including mitogen-induced cell cycle progression through the G1 phase, regulation of embryonic development, cell movement and apoptosis, as well as cell and neuronal differentiation (Murray, 1998). These evolutionarily conserved pathways are organized in three-kinase modules consisting
of a MAP kinase, an activator of MAP kinase (MAP kinase kinase or MEK) and a MAP kinase kinase
kinase (MEK kinase or MEKK). There are at least three distinct MAP kinase signal transduction pathways in mammalian cells, each named after the particular MAPK associated with it. These include the
extracellular signal-regulated kinases (ERK1/2), the c-Jun N-terminal kinases/stress-activated protein kinases (JNK/SAPK), and the p38 kinases (analogs of the yeast HOG1 protein).
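The three-kinase module architecture lends itself to the kind of compact dynamical sketch used throughout systems biology. The minimal model below, with entirely hypothetical rate constants, merely shows how a stimulus entering at the MEKK level is relayed through MEK to the MAPK; it is not a model of any specific PDT experiment.

def mapk_module(stimulus, k_act=2.0, k_inact=1.0, t_end=10.0, dt=0.001):
    """Minimal three-tier kinase cascade (MEKK -> MEK -> MAPK).

    Each tier is represented by the fraction of kinase in the active form, driven
    by the active fraction of the tier above (or by the external stimulus for MEKK)
    and reversed by a constant phosphatase activity.  Hypothetical, dimensionless
    parameters; simple Euler integration to steady state.
    """
    mekk = mek = mapk = 0.0
    t = 0.0
    while t < t_end:
        mekk += (k_act * stimulus * (1 - mekk) - k_inact * mekk) * dt
        mek  += (k_act * mekk    * (1 - mek)  - k_inact * mek)  * dt
        mapk += (k_act * mek     * (1 - mapk) - k_inact * mapk) * dt
        t += dt
    return mekk, mek, mapk

if __name__ == "__main__":
    for stim in (0.1, 1.0):
        mekk, mek, mapk = mapk_module(stim)
        print(f"stimulus={stim:3.1f}  active fractions: MEKK={mekk:.2f}, "
              f"MEK={mek:.2f}, MAPK={mapk:.2f}")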
MAPK family activation following BPD-PDT was found in a transformed murine keratinocyte cell
line, Pam212 (Tao, 1996). PDT caused a strong dose- and time-dependent activation of both SAPK and
p38 HOG1 but not ERK. Both L-histidine and N-acetylcysteine showed a significant inhibitory effect on PDT-induced SAPK and p38 HOG1 activation, indicating that the effect was partially mediated by reactive oxygen intermediates (ROI). Western blot analysis performed on the proteins of LY-R and CHO cells at various times following lethal (90-99% cell kill) doses of Pc4-PDT found that, of the three MAPK
types, only the p46 and p54 SAPK/JNKs were activated (Xue L., 1999). PDT did not affect ERK and
p38/HOG activation in LY-R cells. In the case of CHO cells, however, ERK2 was slightly activated at
5 min post-PDT, then declined, and p38/HOG was strongly activated from 5 to 60 min post-PDT. This
study suggests that PDT can stimulate SAPK and p38/HOG cascades and that the latter participates in
both rapid and slow PDT-induced apoptosis.
Hypericin-PDT of human cancer cells led to upregulation of the inducible COX-2 enzyme and the
subsequent release of PGE2 (Hendrickx, 2003). The activation of p38 MAPK alpha and beta mediated
COX-2 upregulation at the protein and messenger levels. The p38 MAPK inhibitor, PD169316, abrogated
COX-2 expression in PDT-treated cells, whereas overexpression of the drug-resistant PD169316-insensitive
p38 MAPK alpha and beta isoforms restored COX-2 levels in the presence of the kinase inhibitor. The
half-life of the COX-2 messenger was drastically shortened by p38 MAPK inhibition in transcriptionally
arrested cells, suggesting that p38 MAPK mainly acts by stabilizing the COX-2 transcript. Hence, the
combination of PDT with pyridinyl imidazole inhibitors of p38 MAPK may improve the therapeutic
efficacy of PDT by blocking COX-2 up-regulation, which contributes to tumor growth by the release of
growth and pro-angiogenic factors, as well as by sensitizing cancer cells to apoptosis.
The role of ERKs in the cell survival after PDT was studied by Tong et al. (Tong, 2002). They
examined the response of ERK1/2 in PF-PDT-resistant (LFS087) and PDT-sensitive (GM38A) cells.
ERK1/2 activity was induced rapidly in both cell types after PDT but was transient in GM38A cells
and by 3 h had returned to a level significantly lower than basal, whereas the induction of ERK1/2
was sustained in LFS087 cells and lasted for at least 11 h. Blocking of the sustained ERK activity with
PD98059, an inhibitor of MAPK/ERK kinase, significantly decreased cell survival of LFS087 after PDT. PDT
also induced the expression of MAPK phosphatase (MKP-1). ALA-PDT of HaCaT cells led to a six-fold
elevation of cellular JNK activity; phosphorylation of p38 MAPK was enhanced to a similar extent. p38
was also phosphorylated by ALA-PDT in the human melanoma cell lines Bro and SkMel-23, applying doses that led to 80-95% cell death after 24 h. The effects of ALA-PDT on MAPKs are similar to
stresses such as UV irradiation or exposure to hydrogen peroxide with respect to activation of JNK and
p38 MAPKs (Klotz, 1998). The epidermal growth factor receptor (EGFR) is a tyrosine kinase involved
in the initiation and progression of various cancers especially their proliferative, angiogenic, invasive,
and metastatic aspects (Castillo, 2004).
Wong et al. (Wong, 2003) used ALA- and PF-PDT on the human cancer cell lines hypopharyngeal carcinoma FaDu, cervical adenocarcinoma HeLa, and hepatocellular carcinoma HepG2, and studied the cells' response to the cytokines IL-6 and EGF after PDT. PDT induced the complete loss of EGFR from the cell membrane.
Another study reported (Ahmad, 2001) the involvement of the EGFR-pathway during antiproliferative responses of Pc4-PDT in A431 cells and during ablation of murine skin papillomas. Pc4-PDT of
A431 cells was found to result in a time-dependent down-modulation of the protein expression and
phosphorylation of EGFR and Shc (an immediate downstream molecule in the EGFR pathway). In chemically induced as well as ultraviolet B radiation-induced squamous papillomas in SENCAR and SKH-1 hairless mice, Pc4-PDT resulted in a time-dependent inhibition of EGFR protein expression and of tyrosine phosphorylation of EGFR and Shc, and in induction of apoptosis, during the regression of these tumors.
The JNKs are a group of MAPKs comprising three protein kinases, JNK1, JNK2, and JNK3, whose genes
are alternatively spliced to create 10 isoforms. JNK binds and phosphorylates the N-terminal activation
domain of transcription factor c-Jun, resulting in increased transcription activity of c-Jun. JNK1 and
JNK2 are ubiquitously expressed, whereas JNK3 is restricted to brain, heart, and testis. JNK can be
activated by treatment of cells with the cytokines tumor necrosis factor-α (TNF-α) and interleukin 1 (IL-1)
or by exposure of cells to a wide variety of environmental stresses (Nakano, 2004). A study (Assefa,
1999) reported that PDT with hypericin induced a strong and persistent activation of the JNK and p38
MAPK signaling pathways while inhibiting ERK2 activity. There was a protective role for the JNK/p38
MAPK pathways during PDT-induced apoptosis.
Protein kinase B (Akt/PKB) is a serine/threonine protein kinase that phosphorylates a variety
of substrates involved in regulation of key cellular functions such as growth control and survival. Akt/
PKB can phosphorylate ATM, and ATM can phosphorylate BRCA1 and p53, and has also been suggested to interact with DNA mismatch repair proteins. Akt/PKB is thought to mediate various insulin-dependent biological processes (Scheid, 2003). The distinct functions of individual Akt/PKB isoforms
still remain to be fully elucidated. Photosensitization of the murine fibroblast cell line NIH 3T3 with
Rose Bengal (RB) increased the phosphorylation of Akt, which is taken as a sign of kinase activation
(Zhuang, 2003). This effect was mediated by activation of PI3-K, but was independent of activation
of growth factor receptors and of focal adhesion kinase (FAK). Indeed, photosensitization with RB
decreased FAK phosphorylation activity, which may explain the reduction in cell adhesion. The effects
of singlet oxygen on ERK and Akt/PKB pathways were analyzed in human dermal fibroblasts (Schieke,
2004). While basal ERK phosphorylation was lowered in cells exposed to either UVA or RB-PDT, Akt
was moderately activated by PDT in a phosphoinositide 3-kinase-dependent fashion. Likewise, both singlet oxygen and UVA induced ceramide generation in human skin fibroblasts. Epidermal growth
factor (EGF)-induced tyrosine phosphorylation of the EGF receptor was strongly attenuated by PDT
but unimpaired by C2-ceramide.

Transcription Factors
Transcription factors are proteins that bind to the enhancer or promoter regions of genes and interact
such that transcription occurs from only a small group of promoters in any cell. Most transcription factors can bind to specific DNA sequences, and these trans-regulatory proteins can be grouped together in
families based on similarities in structure. Within such a family, proteins share a common framework
structure in their respective DNA binding sites, and slight differences in the amino acids at the binding
site can alter the sequence of the DNA to which it binds. Transcription factors act as intracellular third
messengers that couple receptor-generated signals to activation-associated changes in gene expression, often forming large transcriptional complexes with a variety of other transcription factors and
accessory proteins at response elements within the promoters of the genes the transcription of which
they modulate.
Activator protein-1 (AP-1) is a homo- or heterodimeric protein complex composed of the products
from the proto-oncogenes c-jun and c-fos or related family members. The association between c-jun
and c-fos is required for binding to DNA and involves a structural motif known as a leucine zipper.
Hydrophobic interactions between leucines located at every 7th amino acid in an alpha-helical region of each sub-unit hold the two sub-units together. AP-1 binds to DNA sequences (transcription response
elements) in the promoter region of many genes, which are involved in regulating cell proliferation.
AP1 transcription factors are activated by a variety of physical and chemical stresses and have been
related to both induction and prevention of apoptosis, depending on the tissue and on its developmental
stage (Shaulian, 2002). Gomer's laboratory (Luna, 1994) studied the PDT-mediated induction of the
early response genes, c-fos, c-jun, c-myc, and egr-1, in murine RIF-1 cells. Incubation of exponentially
growing cells with porphyrin-based PS in the dark also induced an increase in mRNA levels of early response genes. Nevertheless, the xanthene PS RB produced increased c-fos mRNA levels only following illumination. PDT with PF also transiently increased c-jun, c-myc and egr-1 mRNA in human
adenocarcinoma HeLa cells (Kick, 1996). Furthermore, mRNA stability experiments showed an increased
half-life of c-fos and c-jun transcripts in HeLa cells sensitized with PF, and a concomitant increase in
AP-1-DNA-binding activity was also observed. The AP-1 pathway was found to be responsible for the
rapid increase of IL-6 expression observed after PDT (Kick, 1995). Using the spontaneously transformed murine keratinocyte cell line Pam212, it was shown that PDT with BPD could induce a time- and dose-dependent activation of stress-activated protein kinase (SAPK) and p38 high osmolarity glycerol protein
kinase (p38 HOG1) (Tao, 1996). SAPK and p38 HOG1 pathways are implicated in the transduction of
stress signals and stimulation by inflammatory cytokines and are responsive to stimuli such as TNF-α,
IL-1, UV irradiation, heat, change in osmolarity, and increase in intracellular reactive oxygen species
(ROS). Stress kinases can induce activation of AP-1, and possibly a related transcription factor AP-2,
thus enabling ROS-mediated gene expression. Depending on the pattern of gene expression induced,
cellular responses may range from inflammation, degradation and immunosuppression to triggering
of apoptosis.
The transcription factor nuclear factor kappa B (NF-κB) was initially discovered as a factor in the nucleus of B cells that binds to the enhancer of the kappa light chain of immunoglobulin. NF-κB is present in the cytoplasm as homo- or heterodimers, formed by association of sub-units belonging to the Rel protein family. These complexes are sequestered in the cytoplasm by proteins belonging to the inhibitor of NF-κB (IκB) family. Stimuli leading to NF-κB activation typically initiate a specific signal transduction cascade leading to phosphorylation of IκBs. Once phosphorylated, IκB is selectively ubiquitinated and degraded by the 26S proteasome, releasing NF-κB, which is then translocated to the nucleus where it participates in transcriptional activation. NF-κB has been linked to the regulation of many cellular genes, including those encoding a number of cytokines and growth factors such as IL-1, IL-2, IL-6, IL-8, granulocyte macrophage colony stimulating factor (GM-CSF), and TNF-α. Other genes include adhesion molecules such as intercellular adhesion molecule and E-selectin, and many other proteins involved in various processes, including immune responses, the acute phase reaction and inflammation, such as inducible nitric oxide synthase. Activation of NF-κB upon photosensitization
was first shown in studies using mouse leukemia L1210 cells and PF as a sensitizer (Ryter, 1993). PDT
of the lymphocytic ACH-2 cell line with methylene blue led to the degradation of IκBα and increased NF-κB DNA-binding activity [60], and similar results were found using proflavine (a PS that intercalates into DNA), which did not cause AP-1 activation (Piret, 1995). The activation of NF-κB and its role in cell death upon PDT were investigated using the PS pyropheophorbide methyl ester (PPME) in HCT-116 human colon carcinoma cells, in which PDT resulted in NF-κB activation by triggering the signaling pathway mediated by the IL-1 receptor. This involved degradation of the cytoplasmic IκBα pool, which contributed
to the increase in NF-κB DNA binding (Matroule, 1999). NF-κB activation also occurred in human ECV304 and HMEC-1 endothelial cells after PDT with PPME, and this was prevented by antioxidants. The activation of NF-κB was proposed to occur by a mechanism independent of the activation of IκB kinases and to require the activity of a tyrosine kinase (Volanti, 2002). NF-κB has been shown to either promote or inhibit apoptosis, depending on the cell type and the type of inducer (Kucharczak, 2003). Apoptosis and NF-κB activation were observed with verteporfin PDT in HL60 cells, as shown by transient transfection with an NF-κB-luciferase reporter construct (Granville, 2000). However, less intensive PDT regimens can also affect NF-κB, increasing the antiapoptotic mechanisms of survival.
E2F is a transcription factor that controls the transition from G1 to S phase in the cell cycle. The
induced genes encode DNA replication activities such as DNA polymerase, proliferating cell nuclear
antigen, nucleotide biosynthetic activities including thymidine kinase, thymidylate synthase and ribonucleotide reductase, and cell cycle regulatory activities (Bracken, 2004). E2F also directs the synthesis of both cyclin E and cyclin-dependent kinase 2 (cdk2), creating the kinase activity responsible
for activation of replication (Stevens, 2003). The retinoblastoma (Rb) gene was initially identified as a
genetic locus associated with the development of an inherited eye tumor and the realization that it was
a loss of function of Rb that was associated with disease established the tumor suppressor paradigm.
Subsequent work identified the E2F transcription factor activity as a key target for the growth suppressing action of the Rb protein. Additional work demonstrated that Rb function, including the ability
to interact with E2F, was regulated by phosphorylation and that the primary kinase responsible was
the D-type cyclin-dependent kinases. Studies have shown (Ahmad, 1998) that Pc4-PDT results in an
induction of the cyclin kinase inhibitor WAF1/CIP1/p21 which, by inhibiting cyclins (E and D1), cdk2
and cdk6, results in a G0/G1-phase arrest followed by apoptosis in human epidermoid carcinoma cells
A431. This group went on to show a decrease in the hyper-phosphorylated form of pRb at 3, 6 and 12
h post-PDT with a relative increase in hypo-phosphorylated pRb, which provided evidence for the involvement of pRb-E2F/DP machinery in PDT-mediated cell cycle arrest leading to apoptosis (Ahmad,
1999). Upregulation in WAF1/CIP1/p21 protein levels was also observed upon PDT of human ovarian
carcinoma (OVCAR-3)-bearing athymic nude mice with Pc4 (Colussi, 1999) but not in SW480 colon
cancer xenografts subjected to PDT with the same PS suggesting that this mechanism is tumor-type
dependent (Whitacre, 2000).

Cell Adhesion
Mammalian cells adhere to the extracellular matrix and to each other through specific membrane protein
receptors. These are classified into the following groups: integrins, immunoglobulin G superfamily,
selectins and cadherins. Integrins are ubiquitous trans-membrane adhesion molecules that mediate the
interaction of cells with the extracellular matrix and also control cell-cell interactions. Recent research
indicates that integrins also function as signal transduction receptors triggering a number of intracellular signaling pathways that regulate cell behavior and development (Hynes, 2002). The selectins
mediate transient interactions between leukocytes and endothelial cells or blood platelets. There are
three members of the selectin family: L-selectin, which is expressed on leukocytes; E-selectin, which is
expressed on endothelial cells; and P-selectin, which is expressed on platelets. The selectins recognize
cell surface carbohydrates. One of their critical roles is to initiate the interactions between leukocytes
and endothelial cells during the migration of leukocytes from the circulation to sites of tissue inflammation. The selectins mediate the initial adhesion of leukocytes to endothelial cells (Patel, 2002). This
is followed by the formation of more stable adhesion complexes, in which integrins on the surface of
leukocytes bind to ICAMs, which are members of the Ig superfamily expressed on the surface of endothelial cells. The firmly attached leukocytes are then able to penetrate the walls of capillaries and enter the
underlying tissue by migrating between endothelial cells. Other members of the Ig superfamily mediate
homophilic interactions that lead to selective adhesion between cells of the same type. There are more
than 100 members of the Ig superfamily, which mediate a variety of cell-cell interactions. The fourth
group of cell adhesion molecules, the cadherins, also displays homophilic-binding specificities. They
are not only involved in selective adhesion between embryonic cells but are also primarily responsible
for the formation of stable junctions between cells in tissues. For example, E-cadherin is expressed
on epithelial cells and homophilic interactions between E-cadherins lead to the selective adhesion of
epithelial cells to one another; during cancer progression, however, E-cadherin mediated adhesion is
frequently lost (Cavallaro, 2004). Other members of the cadherin family mediate selective adhesion of
other cell types (George, 2004).
Alterations in the attachment of cancer cells to the substratum and to each other are amongst the important consequences of PDT. These changes are largely caused by damage to adhesion molecules
located in cell membranes. Several adhesion molecules have been reported to be involved in the PDT
response; however, the specific molecular pathways depend on the cell line, fluence rate and PS used (Ball, 2001).
BPD-PDT mediated changes in adhesive properties of cells were studied by several groups. Decreased
adhesion to collagen IV, fibronectin, laminin and vitronectin, as well as loss of β1 integrin-containing
focal adhesion plaques were detected in ovarian cancer cells (Runnels, 1999). A decreased expression
of CD44V6, its lectins (aHt1.3, PNA, SNA) and MHC-I molecules was observed in colon cancer (Rousset, 1999). In the latter case, BPD- and HPD-mediated PDT were compared; however, no difference was
detected. Vonarx et al. (Vonarx, 1995) investigated the effect of HPD-PDT on colon cancer cells with different in vivo metastatic potentials. HPD-PDT increased the adhesiveness of both cell lines on
plastic and decreased it on extracellular matrix, which correlated with decreased metastatic potential.
ALA-PDT of human adenocarcinoma (WiDr) cells inhibited the attachment of suspended cells to a
plastic substratum. It also induced an increase of intercellular space in dense colonies, formation of lamellipodia between the cells, and redistribution of αVβ3 integrin. The distribution of another adhesion molecule, E-cadherin, was not changed under these conditions [81]. Pyridinium zinc (II) phthalocyanine and polyhaematoporphyrin were found to significantly decrease the efficiency of trypsinization of
RIF-1, HT29 human colonic carcinoma, and ECV304 human umbilical vein endothelial cells adhering
to plastic when PDT was carried out (Ball, 2001). Meta-tetrahydroxyphenylchlorin (mTHPC, Foscan)
however did not show this effect. This observation was partly explained by an increased activity of the
enzyme tissue-transglutaminase in the cells. Similar results were reported by Uzdensky et al. (Uzdensky,
2004) who showed that sublethal PDT of human WiDr adenocarcinoma cells and D54Mg glioblastoma
cells with ALA-PPIX or disulfonated tetraphenylporphyrin (TPPS2a) inhibited their trypsin-induced
detachment from a plastic substratum.
Studies discussed above show that PDT-induced changes in adhesion could lead to a decrease of
tumor metastatic potential. However, in one report BPD-mediated PDT of orthotopic rat prostate cancer
increased the level of metastasis (Momma, 1998). The adhesion changes induced by PDT are different
for cancer and normal cells and their substrates. Neutrophil adhesion to endothelial cells is enhanced
by PF-PDT, with the αLβ2, αMβ2 or αXβ2 receptors of the neutrophils involved in this process
(de Vree, 1996). Platelet adhesion to the extracellular matrix and fibrinogen was significantly decreased
after PDT of these substrates; however, PDT of collagen resulted in significantly increased platelet adhesion, with large aggregate formation (Fungaloi, 2002). Volanti et al. (Volanti, 2004) used pyropheophorbide-a methyl ester (PPME) to induce the expression of ICAM-1 and vascular cell adhesion molecule
(VCAM)-1 via activation of NF-κB in HMEC-1 cells. Increased ICAM-1 and VCAM-1 expression at
the protein level was not observed, although IL-6 was secreted. Using specific chemical inhibitors, they
showed that the lack of ICAM-1 and VCAM-1 expression was the consequence of their degradation by
lysosomal proteases. The proteasome and calpain pathways were not involved. All these observations
were consistent with the fact that no adhesion of granulocytes was observed in these conditions.

Cytokines
Cytokines are small secreted proteins that mediate and regulate immunity, inflammation, and hematopoiesis. They must be produced de novo in response to a stimulus. They generally (although not
always) act over short distances and short time spans and at very low concentration. They act by binding to specific membrane receptors, which then signal the cell via second messengers, often tyrosine
kinases, to alter its behavior (gene expression). Responses to cytokines include increasing or decreasing
expression of membrane proteins (including cytokine receptors), proliferation, and secretion of effector molecules. The vascular effect that causes hemorrhagic tumor necrosis after PDT was originally thought to be mediated by TNF-α, as administration of this cytokine had been shown to cause similar
vascular effects and direct tumor effect in mouse models. The first description of cytokine production
by PDT was reported by Evans et al. (Evans, 1990), who measured the energy-dependent production of TNF-α by macrophages treated with PDT using the L929 assay, and TNF-α production was inhibited at the highest PDT doses. TNF-α gene transcription increased in keratinocytes treated with Pc4 and
light (Anderson, 1997), and photodynamic activation with mTHPC of the monocyte cell line U937, differentiated into macrophages, also induced a dose-dependent production of TNF-α (Coutier, 1999). However, although anti-TNF-α antibodies and pentoxifylline, an inhibitor of cytokine transcription,
prevented cutaneous photosensitivity in adult C3H/HeN mice injected with Pc4, none of these agents
affected Pc4 PDT-induced tumor (RIF-1) regression (Coutier, 1999). PF can induce immune responses
in the absence of light; a research group compared its effect with PPIX and showed these porphyrins
produced lymphocyte proliferation and secretion of IL-2, IL-3, TNF-α and interferon-γ (IFN-γ) by human or murine mononuclear cells without any activating light (Herman, 1996). Combined stimulation of cells by mitogens and porphyrins maintained an optimal ionic balance of potassium, sodium and chloride in the lymphocytes. In the cells thus treated, there was a significant increase in intracellular
calcium, important for lymphokine secretion. They proposed that the effect of PF on the immune system involves enhanced cytokine secretion, which may account for the subsequent tumor eradication by
PDT (Herman, 1996). Treating LLC cells with mono-L-aspartyl chlorin e6 (NPe6) and light increased expression of the mRNA of IL-2, IL-6, and TNF-α 6 h later. Cytokine gene-transfected cells, namely LLC-IL-2, LLC-IL-6, and LLC-TNF-α cells, were then treated with PDT. The IL-6 gene-transfected LLC-IL-6
cells were significantly more sensitive and showed higher levels of apoptosis than the parent LLC cells
and other cytokine gene-transfected cells, demonstrating that IL-6 expression plays a role in cellular
sensitivity to PDT (Usuda, 2001).
The effects of PDT on the activity of the IL-10 gene promoter and on IL-10 mRNA stability were studied using the murine keratinocyte line Pam212 (Gollnick, 2001). In vitro, PDT induced IL-10 mRNA and protein expression in Pam212 cells, which correlated with an increase in AP-1 DNA-binding activity and activation of the IL-10 gene promoter by PDT. Deletion of an AP-1 response element from the IL-10 gene promoter was shown to abrogate the PDT-induced promoter activity, indicating that the
AP-1 response element is critical to IL-10 induction by PDT. In addition, PDT resulted in an increase
in IL-10 mRNA stability, which may also contribute to the increased IL-10 expression in Pam212 cells
following PDT (Gollnick, 2001).
Neutrophils have become recognized as important contributors to the effectiveness of tumor eradication by PDT. The ability of PDT using the PS 2-[1-hexyloxyethyl]-2-devinyl pyropheophorbide-a
(HPPH) to induce proinflammatory cytokines and chemokines, as well as adhesion molecules known
to be involved in neutrophil migration, was examined in mice. In this study the researchers found that
HPPH-PDT induced neutrophil migration into the treated tumor, which was associated with a transient,
local increase in the expression of macrophage inflammatory protein (MIP)-2 (the murine equivalent of
IL-8). A similar increase was detected in functional expression of adhesion molecules, i.e. E-selectin
and ICAM-1, and both local and systemic expression of IL-6 was detected. The kinetics of neutrophil
immigration mirrored those observed for the enhanced production of chemokines, IL-6 and adhesion
molecules. Subsequent studies showed that PDT-induced neutrophil recruitment is dependent upon the
presence of MIP-2 and E-selectin, but not on IL-6 (Gollnick, 2003). Cecic et al. used a mouse SCCVII
squamous cell carcinoma model to investigate the activity of neutrophils in tumors treated by PDT
(Cecic, 2001). They saw a massive and sustained sequestration of these cells in PDT-treated tumors
but also revealed their activated state, as evidenced by the presence of released myeloperoxidase. Among
the adhesion molecules expressed on tumor vascular endothelium, ICAM-1 appears to be of primary
importance in the invasion of neutrophils into PDT-treated tumors, because its functional blocking with
monoclonal antibodies reduced the tumor cure rate. A marked upregulation of its ligands CD11b/CD18
and CD11c/CD18 found on neutrophils associated with PDT-treated tumors supports this assumption.
IL-1β activity was critical for the therapeutic outcome, since its neutralization diminished the cure rates of PDT-treated tumors. No significant effect was observed with anti-IL-6 and anti-TNF-α treatment
(Sun J., 2002a).
In a BALB/c mouse model, PDT delivered to normal and tumor tissue in vivo caused marked changes
in the expression of the cytokines IL-6 and IL-10 but not TNF-α. IL-6 mRNA and protein were strongly
enhanced in the PDT-treated EMT6 tumor. PDT also increased IL-6 mRNA in exposed spleen and skin.
These data suggest that the general inflammatory response to PDT may be mediated at least in part by
IL-6. In addition, IL-6 may modulate the local antitumor immune response. In contrast, IL-10 mRNA
in the tumor decreases following PDT. IL-10 is markedly induced in the skin of mice exposed to a PDT
regime and it plays a role in the observed suppression of cell-mediated responses seen following PDT
(Gollnick, 1997). PF-based PDT of mouse EMT6 tumors induces neutrophilia. In addition to complement fragments (direct mediators) released as a consequence of PDT-induced complement activation,
there are many secondary mediators that all arise as a result of complement activity: IL-1β, TNF-α, IL-6, IL-10, G-CSF and KC, thromboxane, prostaglandins, leukotrienes, histamine, and coagulation
factors (Cecic, 2002).
PDT-induced cytokines have been measured in patients. IFN-γ, IL-1, IL-2, and TNF-α were assayed in the urine of four patients treated with PDT for bladder cancer, in seven control patients undergoing transurethral surgical procedures, and in five healthy control subjects. Quantifiable concentrations of all cytokines, except IFN-γ, were measured in urine samples from the PDT patients with
the highest light energies, while no urinary cytokines were found in the PDT patient who received the
lowest light energy nor in any of the control subjects (Nseyo, 1990). Serum samples from patients treated
on a Phase I clinical trial of PDT for mesothelioma were examined at the maximally tolerated dose of Foscan for evidence of a cytokine-mediated inflammatory response. Patients underwent pleurectomy or extrapleural pneumonectomy followed by intraoperative PDT of the thorax. IFN-γ, TNF-α and IL-12 showed no elevation, but IL-1β, IL-6, IL-8 and IL-10 levels were elevated after surgery and PDT. IL-1β
showed a statistically significant variation from baseline after surgery and IL-6, after PDT. The results
suggest a systemically mediated inflammatory response resulting from thoracic surgery followed by
PDT (Yom, 2003).

Hypoxia and Angiogenesis


Tumor hypoxia is associated with malignant progression; resistance to chemotherapy, PDT and radiotherapy; increased metastasis and poor prognosis. Many of these effects in hypoxic tumor cells are
mediated by oxygen-regulated transcriptional activation of a specific set of genes whose relation to
therapy resistance is only poorly understood. The hypoxia-inducible factor (HIF)-1 is a master transcriptional activator of oxygen-regulated genes, which are involved in anaerobic energy metabolism,
angiogenesis, cell survival, cell invasion, and drug resistance. HIF-1 is a heterodimer composed of two subunits, HIF-1α bound to HIF-1β; whereas the latter subunit is constitutively expressed, HIF-1α is rapidly degraded under normoxic conditions (Semenza, 2004). Hypoxia stabilizes HIF-1α, allowing the
formation of the transcriptionally active HIF-1 complex, which binds to the hypoxia response element
found in the promoter region of specific genes, including the vascular endothelial growth factor (VEGF)
gene (Mazure, 2004). HIF-1 is a positive factor for tumor growth and is constitutively upregulated in
several tumor types (Yeo, 2004). Because PDT is capable of rapidly consuming significant amounts of
tissue oxygen and also shutting down the blood supply to the tumor that delivers oxygen, the treatment
may itself produce severe levels of hypoxia. The possibility of PDT producing an angiogenic response
in which new blood vessel formation occurs was first observed in 1989 (Benstead, 1989). The tails of
mice injected with meso-tetra(sulphonatophenyl)porphine and illuminated showed an increase in the number of blood vessels at a low light dose but not at a dose that led to widespread necrosis. Deininger
et al. (Deininger, 2002) used Western blotting to analyze secretion of regulators of angiogenesis to the
supernatants of cell lines following Hypocrellin-A and -B PDT. Both proangiogenic factors (VEGF)
and antiangiogenic factors (sFlt-1, angiostatin, p43, allograft inflammatory factor-1, and connective
tissue growth factor) were detected. PF-mediated PDT induced expression of the HIF-1α subunit of the heterodimeric HIF-1 transcription factor and also increased protein levels of the HIF-1 target gene VEGF within treated transplantable BA mouse mammary carcinomas (Ferrario, 2000). PDT treatment
of BA tumor cells grown in culture resulted in a small increase in VEGF expression above basal levels,
indicating that PDT-mediated hypoxia and oxidative stress could both be involved in the overexpression of VEGF.
Tumor-bearing mice treated with combined antiangiogenic therapy (IM862 or EMAP-II) and PDT had improved tumoricidal responses compared with individual treatments. This study showed that PDT-induced VEGF expression in tumors decreased when either of the antiangiogenic compounds IM862 or EMAP-II was included in the PDT treatment protocol, and suggested that combination procedures using
antiangiogenic treatments could improve the therapeutic effectiveness of PDT. A synthetic inhibitor of
matrix metalloproteinases (MMP), Prinomastat, also increased the percentage of long-term cures after
PDT with PF (Ferrario, 2004). PDT increased the expression and activation of MMP-1, -3, -8 and -9 in
BA tumors subjected to PDT with PF, and also changed the expression of MMP regulators. Jiang et al. (Jiang, 2004) studied angiogenesis produced in normal rat brains by PF-PDT. VEGF expression increased within the PDT-treated hemisphere 1 week after
treatment and remained elevated for 6 weeks. Three-dimensional morphologic analysis of vasculature
within PDT-treated and contralateral brain demonstrated PDT-induced angiogenesis that continued
for 4 weeks after PDT. Similar studies found increased expression of VEGF and also of PCNA (a marker of proliferation) in tumor vessels of mice bearing NR-S1 squamous cell carcinomas up to 2 days post-PDT (Uehara, 2001). Hypoxia-induced genes have also been studied in human tumors treated with PDT. PF-mediated PDT induced expression of the HIF-1α subunit of the heterodimeric HIF-1 transcription factor and its target gene, VEGF, within tumors from patients receiving PF-PDT for early-stage esophageal cancers. High HIF-1α expression was associated with a poor response to treatment (Koukourakis, 2001b).

MECHANISMS OF CELL DEATH IN PDT


Although PDT can mediate many signaling events in cells, its main purpose as generally employed is
nevertheless to kill undesirable and especially cancerous cells. Recent research has elucidated many
pathways whereby mammalian cells can die, and some of the ways that PDT can initiate these processes.
The concentration, physicochemical properties and subcellular location of the PS, the concentration of
oxygen, the appropriate wavelength and intensity of the light, and the cell type specific properties may
all influence the mode and extent of cell death (see Figure 6).
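The light doses quoted throughout this chapter are fluences expressed in J/cm2, which follow directly from the irradiance delivered by the light source and the exposure time. A minimal sketch of this arithmetic is given below; the irradiance values and exposure times are hypothetical and are not taken from any of the cited studies.

```python
# Illustrative only: relation between irradiance, exposure time and fluence
# (light dose) as commonly reported in the PDT literature. The numbers used in
# the example calls are hypothetical.

def fluence_j_per_cm2(irradiance_mw_per_cm2: float, exposure_s: float) -> float:
    """Fluence (J/cm2) = irradiance (mW/cm2) x time (s) / 1000."""
    return irradiance_mw_per_cm2 * exposure_s / 1000.0

def exposure_time_s(target_fluence_j_per_cm2: float, irradiance_mw_per_cm2: float) -> float:
    """Exposure time (s) needed to deliver a target fluence at a given irradiance."""
    return target_fluence_j_per_cm2 * 1000.0 / irradiance_mw_per_cm2

if __name__ == "__main__":
    print(fluence_j_per_cm2(50, 70))    # 50 mW/cm2 for 70 s -> 3.5 J/cm2
    print(exposure_time_s(10, 50))      # 200.0 s needed to deliver 10 J/cm2
```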

Modes of Cell Death


Kerr was the first in 1972 to provide evidence (Kerr, 1972) that cells may undergo at least two distinct
types of cell death: The first type is known as necrosis, a violent and quick form of degeneration affecting extensive cell populations, characterized by cytoplasm swelling, destruction of organelles and
disruption of the plasma membrane, leading to the release of intracellular contents and inflammation.
Necrosis has been referred to as accidental cell death, caused by physical or chemical damage and has
generally been considered an unprogrammed process. It is characterized by a pyknotic nucleus, cytoplasmic swelling, and progressive disintegration of cytoplasmic membranes, all of which lead to cellular
fragmentation and release of material into the extracellular compartment. In necrosis, decomposition is
principally mediated by proteolytic activity, but the precise identities of proteases and their substrates
are poorly known.
A different type of cell death was termed apoptosis; it was identified in single cells usually surrounded by healthy-looking neighbors and is characterized by cell shrinkage and blebbing of the plasma membrane, while the organelles and plasma membrane retain their integrity for quite a long period. In vitro, apoptotic cells
are ultimately fragmented into multiple membrane-enclosed spherical vesicles. In vivo, these apoptotic
bodies are scavenged by phagocytes, inflammation is prevented, and the cells die in relative immunological silence, whereas the control of necrosis can reside within or outside the cell. Apoptosis requires transcriptional activation of specific genes and includes the activation of endonucleases, consequent DNA degradation into oligonucleosomal fragments, and activation of caspases.
Since these first descriptions of necrosis and apoptosis, it has become evident that the situation
is somewhat more complicated with alternative modes of cell death being described. These include
mitotic cell death (Castedo, 2004), in which cells exhibit multiple aberrations including retardation at G2/M,
increased cell volume, and multinucleation; programmed necrosis (Bizik, 2004), involving the processes


Figure 6. Cellular signaling pathways leading to apoptosis in cells after PDT. Initial targets of PDT-generated ROS depend on PS localization and include mitochondria, lysosomes, endoplasmic reticulum,
plasma membrane and PS binding to Bcl-2.

of induction, commitment, and execution of necrosis triggered solely by the biological stimulus induced
by the three-dimensional arrangement of the culture, the cysteine cathepsin (cathepsin B or L)-mediated
lysosomal death pathway (Leist, 2001), and autophagic cell death (Yu, 2004), in which a normal function
used to degrade components of the cytoplasm is involved and which is characterized by autophagosomes,
autolysosomes, electron-dense membranous autophagic vacuoles, myelin whorls, multivesicular bodies,
as well as engulfment of entire organelles.
Caspases are intracellular endopeptidases that employ cysteine at the active site and cleave their
targets at aspartic acid residues. The caspases are synthesized as zymogens and these precursors are
converted into active enzymes via oligomerization-induced autoprocessing for initiator caspases (nos.
1, 2, 4, 5, 8, 9, 10 and 14), while effector caspases (nos. 3, 6 and 7) are activated by other proteases, including initiator caspases and granzyme B (Shi, 2004a). Proteolytic cleavage of cellular substrates by
effector caspases determines the features of apoptotic cell death (Shi, 2004b) and can be initiated by
three different pathways involving caspase 8 (death receptor activation), the endoplasmic reticulum
stress pathway involving activation of caspase 12, and the mitochondrial pathway, in which release of
proteins (including cytochrome c) by mitochondria into the cytoplasm leads to activation of caspase 9 and
downstream cleavage of caspases 3, 7 or 6. Effector caspases cleave and inactivate proteins that protect living cells from apoptosis, such as the DNA repair protein poly(ADP-ribose) polymerase, ICAD/DFF45
(inhibitor of caspase-activated DNase, the nuclease responsible for DNA fragmentation), or the antiapoptotic Bcl-2 or Bid (a Bcl-2 homolog), which then promotes apoptosis via the mitochondria. At least
18 Bcl-2 proteins have been isolated with either pro- or anti-apoptotic activity (Martinou, 2001). The
antiapoptotic members (such as Bcl- 2 and Bcl-xL) prevent the release of cyto c from the mitochondria
and the subsequent procaspase activation. Caspases can be activated by an intrinsic pathway, triggered
by various environmental insults and developmental programs, which occurs in mitochondria (Green,
1998). It involves release of cytochrome c and other apoptogenic proteins from the mitochondrial intermembrane space into the cytosol. Cytosolic cyto c acts as a cofactor in the formation of a complex
with Apaf-1, procaspase 9 and dATP/ATP, termed the apoptosome, that leads to the activation of caspase 9
and subsequent activation of executioner caspases and cell death commitment. A mechanism has been
described in which caspase 8 (activated by ligation of death receptors) cleaves Bid (a BH3 only member of
the Bcl-2 family) leading to mitochondrial release of cytochrome c and thence to activation of procaspase
9 and thereby amplifying apoptotic signaling (Bossy-Wetzel, 1999). There have been recent suggestions
that non-caspase proteases such as leukocyte elastase inhibitor (LEI)-DNase II (Torriglia, 2000) can
trigger a form of programmed cell death different from the traditional pathways of apoptosis.
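Purely as a compact restatement of the classification just described, the sketch below encodes the initiator and effector caspases and the three initiation routes as a plain data structure; it performs no modeling, and the labels simply mirror the text (Shi, 2004a).

```python
# A plain data structure summarizing the caspase classification and the three
# apoptosis-initiation routes described in the text; no simulation is attempted.

INITIATOR_CASPASES = {1, 2, 4, 5, 8, 9, 10, 14}   # activated by oligomerization-induced autoprocessing
EFFECTOR_CASPASES = {3, 6, 7}                     # activated by initiator caspases or granzyme B

INITIATION_PATHWAYS = {
    "death receptor": {"initiator": 8, "trigger": "ligation of death receptors"},
    "ER stress":      {"initiator": 12, "trigger": "endoplasmic reticulum stress"},
    "mitochondrial":  {"initiator": 9,
                       "trigger": "cytochrome c release and apoptosome (Apaf-1/dATP) formation"},
}

def effectors_downstream_of(pathway):
    """All three initiation routes converge on the effector caspases 3, 6 and 7."""
    if pathway not in INITIATION_PATHWAYS:
        raise ValueError(f"unknown pathway: {pathway}")
    return sorted(EFFECTOR_CASPASES)

print(effectors_downstream_of("mitochondrial"))   # [3, 6, 7]
```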

Apoptosis and Necrosis after PDT


Because of the intense interest involving cell death mechanisms, workers in the field of PDT have
looked at the occurrence of apoptosis and necrosis both in vitro and in vivo (Agostinis, 2004). Figure 6
illustrates some of the cellular and molecular signaling pathways that have been determined to occur in
cells treated with PDT in vitro. Agarwal et al. were the first to report apoptosis after PDT with chloroaluminum phthalocyanine in mouse lymphoma L5178Y cells, and found a rapid induction of apoptosis
mediated by phospholipase C activation (Agarwal, 1991). The crucial factors in determining the type
of cell death (e.g. apoptosis or necrosis) that occurs after PDT are the cell type, the subcellular localization of the PS, and the light dose applied to activate it locally. In general, it is believed that lower-dose PDT
leads to more apoptosis, while higher doses lead to proportionately more necrosis (Plaetzer, 2002). Nagata
et al. (Nagata, 2003) used the amphiphilic PS ATX-S10(Na) and human malignant melanoma cells and
found that light doses that led to less than 70% cytotoxicity induced mainly apoptosis; by contrast, most
cells appeared necrotic with doses that induced 99% cytotoxicity. A common feature of the apoptotic
program initiated by PDT is the rapid release of mitochondrial cytochrome c into the cytosol followed
by activation of the apoptosome and procaspase 3. With PS localized in the plasma membrane, the photosensitization process can rapidly switch the balance towards necrotic cell death likely due to loss of
plasma membrane integrity and rapid depletion of intracellular ATP (Kessel, 2000). It is also possible
that high doses of PDT can photochemically inactivate essential enzymes and other components of the
apoptotic cascade such as caspases. For instance, Lavie et al. (Lavie, 1999) used the perylenequinones
(hypericin and dimethyl tetrahydroxyhelianthrone) and found high dose PDT inhibited apoptosis by
interfering with lamin phosphorylation, or by photodynamic cross-linking of lamins.
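The observation that light doses giving below about 70% cytotoxicity favor apoptosis while doses giving 99% cytotoxicity favor necrosis presupposes a fitted dose-response curve from which such doses can be read off. The sketch below illustrates one common way of doing this with a logistic fit; the dose and cytotoxicity values are invented and do not come from Nagata et al. or any other study cited here.

```python
# Illustrative logistic dose-response fit (hypothetical data): estimate the
# light doses giving ~70% and ~99% cell kill, the thresholds discussed above.
import numpy as np
from scipy.optimize import curve_fit

def logistic(dose, d50, slope):
    """Fraction of cells killed as a function of light dose (J/cm2)."""
    return 1.0 / (1.0 + np.exp(-slope * (dose - d50)))

dose = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])      # J/cm2 (hypothetical)
kill = np.array([0.05, 0.15, 0.45, 0.80, 0.97, 1.00])  # fraction killed (hypothetical)

(d50, slope), _ = curve_fit(logistic, dose, kill, p0=[2.0, 1.0])

def dose_for_kill(fraction):
    """Invert the fitted logistic to get the dose producing a target kill fraction."""
    return d50 + np.log(fraction / (1.0 - fraction)) / slope

print(f"dose for 70% kill: {dose_for_kill(0.70):.1f} J/cm2")
print(f"dose for 99% kill: {dose_for_kill(0.99):.1f} J/cm2")
```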
Oleinick's laboratory (Xue L. Y., 2001) compared Pc4-mediated PDT of MCF7 cells that lack caspase
3 with the same cell line with caspase 3 transfected back in. They found apoptotic indicators only in the
caspase expressing cells which also showed more loss of viability by an assay involving reduction of a
tetrazolium dye; however, both cell lines showed an equal degree of cytotoxicity in a clonogenic assay.
Pc4 is a PS that localizes in intracellular membranes, especially mitochondria. Pc4-PDT photodamages Bcl-2 and Bcl-xL, antiapoptotic proteins that interact with the permeability transition pore complex that forms at contact sites between the inner and outer mitochondrial membranes (Morris, 2003). These complexes and the inner membrane are distinctive in containing the phospholipid cardiolipin, and it has been suggested that Pc4 resides near cardiolipin-containing sites, which are primarily on the inner mitochondrial membrane (Kessel, 1999); the consequent photodamage of Bcl-2 and Bcl-xL explains the induction of apoptosis by this PS. The importance of Bcl-2 as a target of PDT was emphasized in the study by Usuda et al. (Usuda, 2003), who used transfection of wild-type Bcl-2 or certain deletion mutants
in either a transient or a stable mode. Overexpression of Bcl-2 decreased apoptosis and cell death, and
inhibited the activation-associated conformational change of the proapoptotic protein Bax, and higher
doses of Pc4 and light were required to activate Bax in cells expressing high levels of Bcl-2.
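The comparison above between tetrazolium-dye viability and clonogenic survival rests on the standard surviving-fraction calculation, in which colony counts are normalized to the plating efficiency of untreated control cells. A minimal sketch with hypothetical counts (not data from the cited studies):

```python
# Standard clonogenic-assay arithmetic with hypothetical counts: the surviving
# fraction after PDT is normalized to the plating efficiency of untreated cells.

def plating_efficiency(colonies: int, cells_seeded: int) -> float:
    return colonies / cells_seeded

def surviving_fraction(colonies: int, cells_seeded: int, control_pe: float) -> float:
    return plating_efficiency(colonies, cells_seeded) / control_pe

control_pe = plating_efficiency(colonies=180, cells_seeded=200)       # 0.90
sf = surviving_fraction(colonies=45, cells_seeded=1000, control_pe=control_pe)
print(f"surviving fraction after PDT: {sf:.3f}")                      # ~0.050
```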
Treatment with BPD (verteporfin) and broad spectrum fluorescent light rapidly produced apoptosis in
murine P815 mastocytoma cells (Granville, 1998). Fragmentation of DNA, a fundamental characteristic
of cells undergoing apoptosis, was evident within 3 h following the PDT. Western immunoblot analysis
using the specific antiphosphotyrosine monoclonal antibody 4G10 indicated that molecular species of
>200 kDa were phosphorylated on tyrosine residues during or immediately following the irradiation
of cells loaded with BPD. Increased tyrosine phosphorylation of a 15 kDa protein was evident by 15
min post irradiation. In the absence of light, BPD did not affect the status of cellular protein tyrosine
phosphorylation or cause DNA fragmentation. The status of IκB and NF-κB proteins was evaluated
for promyelocytic leukemia HL-60 cells treated at different intensities of PDT (Granville, 2000). The
action of BPD-MA and visible light irradiation was assessed. At a BPD concentration that produced
the death of a high proportion of cells after illumination, evidence of caspase 3 and caspase 9 processing and of poly(ADP-ribose) polymerase cleavage was present within whole cell lysates. The general
caspase inhibitor Z-Val-Ala-Asp-fluoromethylketone (ZVAD.fmk) effectively blocked these apoptosis
related changes.
The participation of the mitochondrial permeability transition pore (MPT) in PDT-induced apoptosis
is mainly based on the experimental evidence that pharmacological inhibitors of the MPT can counteract PDT-mediated cell death induced by compounds that preferentially accumulate in mitochondria
(Salet, 1997). Certain PS, such as PPIX and other porphyrins, exhibit high affinity for some components of the MPT pore, in particular for the outer membrane peripheral benzodiazepine receptor (PBR)
that was proposed to be a crucial target for mitochondrial PDT (Kessel, 2001). However, subsequent
studies showed that binding to the PBR was not correlated with PDT efficacy of different classes of PS
(Dougherty, 2002, Morris, 2002). BPD has been shown to target the adenine nucleotide translocator in
the mitochondrion, and to induce in a light-dependent fashion the permeabilization of the MPT pore
encapsulated into liposomes, which reconstitute the function of the MPT-pore (Belzacq, 2001). Studies
with lutetium texaphyrin (Lutex) have shown that this PS induces apoptosis through a selective modulation of members of the Bcl-2 family (Renno, 2000). In a different study it has been shown that Lutex
binds to lysosomes of EMT6 cells in vitro and produces apoptosis in EMT6 tumors in vivo, indicating
that lysosomally bound PS can induce apoptosis upon photoactivation (Woodburn, 1997). Canete et al.
(Canete, 2004) showed that the PS palladium(II)-tetraphenylporphycene caused only necrotic cell death
in A-549 cells even at low doses despite the cells undergoing apoptosis in response to serum starvation,
and that this PS could produce apoptosis in a different cell line (HeLa cells). Blank et al. demonstrated
that hypericin could induce mitotic cell death by targeting Hsp90 for ubiquitinylation and thus lowering
levels of its client proteins mutant p53, Cdk4, Raf-1, and Plk and disturbing multiple cellular functions
(Blank, 2003). Thibaut et al. (Thibaut, 2002) used apoptosis inhibitors (BAPTA-AM, Forskolin, DSF, and Z-VAD-fmk) to study PDT of murine melanoma B16-A45 cells with mTHPC. Although all inhibitors tested blocked PDT-induced apoptosis, none produced a significant modification of the phototoxic
effect of mTHPC on B16 cells. It has been suggested that apoptosis and necrosis share common initiation
pathways and that the final outcome is determined by the presence of an active caspase. This implies
that apoptosis inhibition reorients cells to necrosis, i.e. those cells sufficiently damaged by PDT appear
to be killed, regardless of the mechanism involved.
Mitochondria were shown to be the main targets of mTHPC in some reports (Chen, 2000). PDT of
mTHPC-sensitized murine leukemia cells (M1 and JCS) caused rapid appearance of the apoptogenic protein cytochrome c in the cytosol (especially pronounced in JCS cells), which correlated well with the
extent of apoptotic cell death. Electron microscopy revealed the loss of integrity of the mitochondrial
membrane and the appearance of chromatin condensation as early as 1 h after light irradiation. PF, a
representative of the first generation PS, is a preparation consisting of HPD. Studies described in several
reports revealed the complicated effects of HPD/PF PDT on cells. Depending on the incubation conditions
PDT can result in apoptosis (He, 1994) or necrosis (Dellinger, 1996) of target cells. Biochemical analysis
indicated that HpD/PF PDT elicited lipid peroxidation and enzyme inactivation on plasma membranes,
as well as mitochondrial damage and inactivation of mitochondrial enzymes (Bachowski, 1991).
The role of apoptosis-related proteins was also studied in the response of human malignancies to
PDT (Koukourakis, 2001a). Paraffin-embedded material from 37 patients with early esophageal cancer
treated with PDT (intravenous injection of HPD) was studied immunohistochemically for p53 protein
nuclear accumulation and Bcl-2 cytoplasmic expression. Positive Bcl-2 and p53 expression was noted
in 27% and 39% of patients, respectively. Of tumors with Bcl-2 expression, 63% responded completely to PDT versus 23% of cases with no Bcl-2 expression (p = 0.02). No association of p53, T-stage or
histology grade with response to PDT or PDT/RT was noted. Bcl-2 protein expression was proposed
to be associated with favorable response to PDT and could be used as a predictor of cancer response
to PDT. This finding was explained by studies showing that PDT induces selective degradation of the
Bcl-2 protein, leading to apoptosis by decreasing the Bcl-2/bax ratio (Usuda, 2003).

Autophagy in PDT
As mentioned above, there has recently been intense interest in autophagy as a pathway of cell death, particularly in cancer cells that are deficient in some essential components of the apoptotic machinery (Kondo, 2006). The proapoptotic proteins Bax and Bak act as a gateway for caspase-mediated cell death. Mammalian target of rapamycin (mTOR), a downstream effector of Akt, plays a critical
role in cell proliferation, growth and survival, and inhibition of mTOR induces autophagy. Cell death
induced by ionizing radiation in Bax/Bak-/- double knockout (DKO) MEF cells was more pronounced
than in wild-type cells despite the DKO cells being unable to undergo apoptosis (Kim, 2006). In DKO
cells there was an increase in the pro-autophagic proteins ATG5-ATG12 and Beclin-1. Phosphatase and
tensin homolog deleted from chromosome 10 (PTEN) is a lipid phosphatase that is frequently mutated in
cancer. Loss of PTEN leads to constitutive activation of the phosphatidylinositol 3-kinase/serine-threonine kinase (Akt) signal transduction pathway and has been associated with resistance to chemotherapy.
PTEN can stimulate autophagy but can also be inhibited by Bax/Bak (Moretti, 2007).
Buytaert et al. showed (Buytaert, 2006b) that photodamage to the sarco(endo)plasmic reticulum Ca2+-ATPase (SERCA) pump and consequent loss of ER Ca2+ homeostasis led to cell death in hypericin-photosensitized cells. In Bax/Bak double knockout cells, a nonapoptotic pathway dependent on sustained autophagy is found after PDT (Buytaert, 2007), suggesting that the decision to die in this paradigm of oxidative stress is taken upstream of Bax-dependent mitochondrial outer membrane permeabilization (MOMP) and that the irreversible photodamage to the ER acts as a trigger for an autophagic cell death pathway in apoptosis-deficient cells (Buytaert, 2006a).
Murine leukemia L1210 cells and human prostate Bax-deficient DU-145 cells treated with PDT had
ER damage and loss of Bcl-2 function (Kessel, 2006). Both apoptosis and autophagy occurred in L1210
cells after ER photodamage with the latter predominating after 24 hr. Western blots demonstrated
processing of LC3-I to LC3-II, a marker for autophagy. In DU145 cells, PDT initiated only autophagy.
Phosphatidylinositol-3-kinase inhibitors suppressed autophagy in both cell lines. Kessel and Arroyo
(Kessel, 2007) treated L1210 cells with a porphycene photosensitizer that causes photodamage to the
endoplasmic reticulum (ER) where Bcl-2 was among the PDT targets. In wild-type cells, they observed
a rapid wave of autophagy, presumed to represent the recycling of some damaged organelles, followed
by apoptosis. Using shRNA technology, they created a Bax knockdown line (L1210/Bax(-)) where a
marked decrease in apoptosis after photodamage or pharmacologic inactivation of Bcl-2 function was
observed, but this did not affect PDT efficacy. Loss of viability was associated with a highly-vacuolated
morphology consistent with autophagic cell death. It appears that attempts at extensive recycling of
damaged organelles are associated with cell death, and that this phenomenon is amplified when apoptosis is suppressed.
Xue et al. (Xue L. Y., 2007b) investigated the occurrence of autophagy after PDT with the photosensitizer Pc 4 in human cancer cells that are deficient in the pro-apoptotic factor Bax (human prostate
cancer DU145 cells) or the apoptosis mediator caspase-3 (human breast cancer MCF-7v cells) and in
apoptosis-competent cells (MCF-7c3 cells that stably overexpress human pro-caspase-3 and Chinese
hamster ovary CHO 5A100 cells). Further, each of the cell lines was also studied with and without stably
overexpressed Bcl-2. Autophagy was identified by electron microscopic observation of the presence of
double-membrane-delineated autophagosomal vesicles in the cytosol and by immunoblot observation
of the Pc 4-PDT dose- and time-dependent increase in the level of LC3-II, a component of the autophagosomal membrane. Autophagy was observed in all of the cell lines studied, whether or not they
were capable of typical apoptosis and whether or not they overexpressed Bcl-2. The presence of stably
overexpressed Bcl-2 in the cells protected against PDT-induced apoptosis and loss of clonogenicity in
apoptosis-competent cells (MCF-7c3 and CHO 5A100 cells). In contrast, Bcl-2 overexpression did not
protect against the development of autophagy in any of the cell lines or against loss of clonogenicity in
apoptosis-deficient cells (MCF-7v and DU145 cells). Furthermore, 3-methyladenine and wortmannin,
inhibitors of autophagy, provided greater protection against loss of viability to apoptosis-deficient than
to apoptosis-competent cells (Xue L. Y., 2007a).

Bystander Effect
A different mechanism of cell death was described by Dahle et al. (Dahle, 1997), who showed that
during in vitro PDT some cells die by direct effect, but adjacent cells suffer lethal cell damage which
is propagated through a chain of adjacent cells, termed the bystander effect. Treatment of MDCK II
cells with the lipophilic PS tetra(3-hydroxyphenyl)porphyrin and light was found to induce a rapid
apoptotic response in a large fraction of the cells. Furthermore, the distribution of apoptotic cells in
microcolonies of eight cells was found to be different from the expected binomial distribution. There
was an overabundance of microcolonies that had responded to the treatment as a single unit, that is, in which either all or no cells were dead, indicating that the cells are not inactivated independently, but
that the bystander effect is involved in the cell death. This observation disagrees with the common view
that cells are inactivated only by direct damage and indicates that communication between cells in a
colony plays a role in PDT induction of apoptosis. The degree of bystander effect was higher for cells
dying by necrosis than for cells dying by apoptosis. Initially it was thought that the process was mediated
through gap junctional intercellular communication during or shortly after irradiation, (Dahle, 2000,
Dahle, 1999), but when this hypothesis was tested by treatment of microcolonies with 30 µM dieldrin, an
inhibitor of gap junctional intercellular communication, there was no reduction of the bystander effect.
However, workers in the field of ionizing radiation where the bystander effect is also observed, showed
(Shao, 2003) that it may be mediated both by gap-junctional communication and also by generation of
diffusible ROS that can be released into the medium and act on neighboring cells.
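The departure from a binomial distribution that Dahle et al. took as evidence for a bystander effect can be illustrated numerically: under independent inactivation, the number of dead cells per eight-cell microcolony should be binomially distributed, so an excess of all-or-nothing colonies argues for intercellular propagation of damage. The counts below are hypothetical, not the published data, and the analysis is a generic sketch rather than the authors' exact statistical procedure.

```python
# Hypothetical illustration of a Dahle-type analysis: compare observed colony-level
# death counts to the binomial expectation for independent cell inactivation.
import numpy as np
from scipy.stats import binom, chisquare

# observed number of microcolonies with k = 0..8 dead cells (made-up numbers)
observed = np.array([30, 8, 5, 4, 3, 4, 5, 9, 32])
n_colonies = observed.sum()
n_cells = 8

# estimate the per-cell death probability from the pooled data
p_hat = (observed * np.arange(n_cells + 1)).sum() / (n_colonies * n_cells)

expected = n_colonies * binom.pmf(np.arange(n_cells + 1), n_cells, p_hat)

# chi-square goodness of fit (one parameter estimated from the data)
chi2, _ = chisquare(observed, expected, ddof=1)
print(f"p_hat = {p_hat:.2f}, chi-square = {chi2:.1f}")
print("all-or-nothing colonies observed:", observed[0] + observed[-1],
      "expected:", round(expected[0] + expected[-1], 1))
```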

DNA Damage
Sunlight gives rise to DNA damage by two mechanisms. On the one hand, DNA directly absorbs light
in the UVC and UVB range of the spectrum (up to 320 nm). The absorption gives rise to characteristic
photoproducts, especially the formation of pyrimidine dimers. Their pre-mutagenic properties have
been well established. On the other hand, some so far unidentified cellular constituents (probably porphyrins or flavins) act as endogenous PS that react directly with DNA or give rise to the formation of
reactive oxygen species. These reactions result in oxidative DNA damage, which is also known to be
pre-mutagenic. The contribution of the indirect (PS-mediated) mechanisms to the cancer risk induced
by direct sunlight is not very well known. It is anticipated that the indirect mechanisms will not be as
effective as direct DNA excitation, but that they will make an important contribution to the genotoxicity
of sunlight in the longer wavelength range where DNA has little or no absorption.
The photodynamic alteration of DNA is a singlet oxygen-mediated process, while the ionizing radiation degradation is mediated by hydroxyl radicals generated by ionization of water. Damage to DNA
has been shown in several studies with in vitro PDT (Oleinick, 1998), however this DNA damage has
not been directly linked to lethal effects. PDT has been shown to cause DNA base oxidative damage,
strand breaks and cross-links. The mutagenic potential varies between cell types, possibly reflecting
differences in repair capacity or damage surveillance mechanisms. Porphyrin- and/or metalloporphyrin-mediated cleavage of nucleic acids occurs via oxidative attack on the sugar moiety or nucleobase modifications that lead to strand scission, or by a photo-induced mechanism involving either the porphyrin excited state or singlet oxygen.
Free radical reactions have been suggested to be involved at several points in the multistep process
of chemically induced carcinogenesis. Singlet oxygen can be readily generated inside cells and reacts
efficiently with DNA causing single strand breaks. Its preferential reaction with the guanine moiety in
DNA leads mainly to one-G deletions in the DNA sequence. The mutagenicity of singlet oxygen depends
on formation of lipid peroxidation products. In general, the most potent PS are usually lipophilic and
they do not accumulate in the nucleus. Therefore, despite causing mutagenic products in in vitro assays,
this may not occur in vivo. It has long been known that 5-ALA is capable of causing DNA damage,
but it is not certain if this phenomenon is light dependent or not (Fuchs, 2000). When ferritin or metals
are present, a catalyzed oxidation of ALA produces reactive oxygen species that can damage plasmid
DNA in vitro, and increases the steady-state level of 8-oxo-7,8-dihydro-2'-deoxyguanosine in liver,
spleen and kidney. The DNA damage could be partially inhibited by SOD, catalase, DTPA, mannitol and melatonin. 4,5-Dioxovaleric acid, the final oxidation product of ALA, alkylates guanine moieties
within both nucleoside and isolated DNA, producing two diastereoisomeric adducts (Di Mascio, 2000,
Douki, 1998). It is possible that these mechanisms could be involved in the increase in liver cancer
observed amongst sufferers of acute intermittent porphyria who have elevated levels of 5-ALA due to
enzyme deficiencies (Onuki, 2002).
Photosensitization of various types of cells by hematoporphyrin or phthalocyanines results in DNA
lesions, such as single strand breaks (Cadet, 1986, Fiel, 1981). The mutagenicity of PF-PDT may be
related to the repair capabilities as well as to the p53 status of the cell line. Woods et al. (Woods, 2004)
studied PDT with human HaCaT keratinocytes using the standard alkaline comet assay protocol to detect DNA strand breaks. They used PF (1 µg/mL) and 630 nm laser light and showed a dose-dependent increase in DNA migration (comet tails) starting as low as 1 J/cm2; however, the breaks produced at the
higher irradiation doses (10 and 25 J/cm2) could have been caused by cell death. PF treatment in the
absence of light did not result in increased DNA migration. A similar comet assay with tail moment
calculation was used to evaluate DNA damage and repair in murine glioblastoma C6 cells after PDT
with m-THPC (Rousset, 2000). There were no changes in the tail moment of C6 cells in the absence of light, whereas m-THPC-PDT (1 µg/mL) induced DNA damage immediately after irradiation. The mean value
increased with the light dose (0, 10 or 25 J/cm2) and incubation time (every hour from 1 to 4 h), but the
cells were capable of significant DNA repair after 4 h, and no residual DNA damage was evident after
24-h post-treatment incubation at 37°C. An increase in the light dose appeared to be less genotoxic than
an increase in the m-THPC dose for similar toxicities. Overall, the presently available data indicate that
the risk for secondary skin carcinoma after topical ALA-PDT seems to be low, but further studies must
be carried out to evaluate the carcinogenic risk of ALA-PDT in conditions predisposed to skin cancer.
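The "tail moment" used in these comet-assay studies is conventionally the Olive tail moment, the fraction of DNA in the comet tail multiplied by the distance between the head and tail intensity centroids. The cited papers used dedicated image-analysis software, so the sketch below is only a worked example of the underlying calculation with invented values.

```python
# Minimal sketch of the Olive tail moment used in comet-assay scoring
# (hypothetical values; real studies derive these from image analysis).

def olive_tail_moment(tail_dna_percent: float, centroid_distance_um: float) -> float:
    """Olive tail moment = (% DNA in tail / 100) x head-to-tail centroid distance (um)."""
    return (tail_dna_percent / 100.0) * centroid_distance_um

# comets scored at increasing light doses (J/cm2): (% DNA in tail, centroid distance in um)
comets = {0: (5.0, 2.0), 10: (35.0, 12.0), 25: (60.0, 20.0)}
for dose, (tail_pct, dist) in comets.items():
    print(f"{dose:>2} J/cm2 -> tail moment {olive_tail_moment(tail_pct, dist):.1f}")
```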
We can conclude that the DNA-damaging effects of PDT are dependent not only on all the variables
implicit in PDT but also on the cellular mechanisms of repair and survival. For treatments giving equal
levels of cell survival, DNA damage has been shown to be less for PDT-treated cells compared with
those that have been X-irradiated. Although DNA, RNA and protein synthesis are affected following
PDT, recovery occurs suggesting that such damage may not necessarily be lethal.

SYSTEMS BIOLOGY STUDIES IN PDT


Systems biology is defined as the ability to obtain, integrate and analyze complex data from multiple experimental sources using interdisciplinary tools. In popular parlance it has become known as a collection of omics. These omics are:





•	Genomics: Study of the entire DNA sequence of organisms and fine-scale genetic mapping efforts.
•	Transcriptomics: Whole-cell or tissue gene expression measurements by DNA microarrays.
•	Proteomics: Complete identification of proteins and protein expression patterns of a cell or tissue through two-dimensional gel electrophoresis or mass spectrometry.
•	Metabolomics: Identification and measurement of all small-molecule metabolites within a cell or tissue.
•	Glycomics: Identification of the entirety of all carbohydrates in a cell or tissue.
•	Kinomics: The totality of protein kinases in a cell.
•	Interactomics: Determining protein-protein interactions (in theory encompassing all interactions between all molecules within a cell).
•	Fluxomics: Deals with the dynamic changes of molecules within a cell over time.
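A toy illustration of the data-integration idea behind these omics is sketched below: measurements from different platforms are joined on a common gene identifier so that, for example, transcript and protein changes after PDT can be inspected side by side. The gene symbols and fold changes are invented and serve only to show the structure of such an integration step.

```python
# Toy multi-omics integration: join transcriptomic and proteomic fold changes
# (PDT-treated vs control) on a shared gene identifier. All values are invented.

transcriptomics = {"HSPA1A": 4.2, "JUN": 3.1, "FOS": 2.8, "MYC": 0.4}    # mRNA fold change
proteomics      = {"HSPA1A": 2.0, "JUN": 1.6, "MYC": 0.7, "GAPDH": 1.0}  # protein fold change

integrated = {
    gene: {"mrna_fc": transcriptomics[gene], "protein_fc": proteomics[gene]}
    for gene in transcriptomics.keys() & proteomics.keys()
}

for gene, changes in sorted(integrated.items()):
    print(gene, changes)
```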

Although many of the pathways and cell signaling cascades induced after PDT (discussed previously) have been elucidated by traditional biochemical and cell biology techniques, the
newer technologies of omics are increasingly being brought to bear on this problem.

Gene Expression Studies in PDT


Verwanger et al (Verwanger, 1998) measured transient changes of the expression level of (proto) oncogenes c-myc and bcl-2 in normal and transformed human fibroblasts at different times following
photodynamic treatment with 5-aminolaevulinic acid-stimulated endogenous protoporphyrin IX and
low-dose irradiation using quantitative reverse transcriptase polymerase chain reaction (RT-PCR).
No irreversibly increased (proto) oncogene expression was found, since the over-expression of c-myc
and bcl-2 is transient. They found an interaction of bcl-2 and c-myc associated with an increase in the proliferative activity of transformed cells, a possible role of bcl-2 in counteracting processes that could be at least precursors of apoptosis induction, and a higher constitutive expression of both genes in transformed than in normal fibroblasts. The first use of DNA gene expression
arrays was by this group in 2002 (Verwanger, 2002). They used ALA-induced PPIX in the squamous
cell carcinoma cell line A-431 after treatment with light. Radioactively labeled cDNAs from untreated
and treated cells were hybridized onto UniGene cDNA array filters containing lysed bacterial colonies
with inserts representing approximately 32000 different human transcripts. Differentially expressed
genes were identified and verified on sub-arrays containing only the candidate genes. They found increased expression of Hsp70 and of the immediate early genes p55-c-fos and c-jun. Increased expression
of heme oxygenase-1 following dark incubation was not further increased by irradiation and therefore
probably caused by the need for heme degradation.
A subsequent study from the same group (Ruhdorfer, 2007) used the same cells and PS (A-431 with
ALA-PPIX). Cells were incubated for 16 h with 100 µg/ml ALA and irradiated with a fluence of 3.5 J/cm2, resulting in 50% survival at 8 h post treatment. RNA was isolated at 1.5, 3, 5 and 8 h post treatment
and from 3 controls (untreated, light only and dark), radioactively labeled by reverse transcription with 33P-dCTP and hybridized onto macroarray PCR filters containing PCR products of 2135 genes, which
were selected for relevance in carcinogenesis, stress response and signal transduction. Verification of
observed expression changes was carried out by real-time RT-PCR. They found a strong induction of
expression of the immediate early genes c-jun and c-fos as well as decreased expression of genes involved in proliferation such as myc, genes involved in apoptosis such as Fas associated via death domain
(FADD) and the fibronectin gene for cell adhesion.
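The array studies described above ultimately reduce to ranking genes by differential expression between treated and control samples. The sketch below shows one generic way this is done (log2 fold change with a Welch t-test and a Benjamini-Hochberg correction); it is not the pipeline used by Verwanger or Ruhdorfer, and the intensities, gene symbols and replicate numbers are simulated.

```python
# Minimal differential-expression sketch with simulated log2 intensities
# (not the pipeline of the cited studies): fold change + Welch t-test + BH correction.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
genes = ["JUN", "FOS", "HSPA1A", "MYC", "FADD", "FN1"]

# log2 expression for 3 control and 3 PDT-treated replicates per gene
control = {g: rng.normal(8.0, 0.2, 3) for g in genes}
treated = {g: rng.normal(8.0, 0.2, 3) for g in genes}
treated["JUN"] = treated["JUN"] + 2.0    # simulate induction
treated["FN1"] = treated["FN1"] - 1.5    # simulate repression

results = []
for g in genes:
    log2_fc = treated[g].mean() - control[g].mean()
    p = ttest_ind(treated[g], control[g], equal_var=False).pvalue
    results.append((g, log2_fc, p))

# Benjamini-Hochberg adjusted p-values
results.sort(key=lambda r: r[2])
m = len(results)
adj = [p * m / (i + 1) for i, (_, _, p) in enumerate(results)]
for i in range(m - 2, -1, -1):           # enforce monotonicity of adjusted values
    adj[i] = min(adj[i], adj[i + 1])

for (g, fc, _), q in zip(results, adj):
    print(f"{g:7s} log2FC={fc:+.2f}  BH-adjusted p={min(q, 1.0):.3f}")
```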
Gene expression changes were observed in the HEL and HL-60 cell lines after the stimulation of
protoporphyrin IX synthesis by ALA administration and photodynamic process induction (Belickova,
2000). Isolated ribonucleic acids were radiolabeled by reverse transcription, and the cDNA obtained was
hybridized to membrane macroarrays containing 588 gene probes. Besides changes in the activity of
genes thought to be involved in programmed cell death and DNA repair processes, increased
or diminished transcription activity was also observed in several other genes; the reason for this phenomenon was not clear. The activation of programmed-cell-death genes appeared after the ALA load application, indicating the toxic effect of ALA. The gene expression changes observed in the two cell lines differed substantially; only a few of them were common to both cell lines.
Cekaite and colleagues (Cekaite, 2007) investigated the changes in the transcriptome after hexyl-ALA-mediated PDT by using transcriptional exon evidence oligo microarrays. They confirmed deviations in the steady-state expression levels of previously identified early defense response genes and extended this to include unreported PDT-inducible gene groups, most notably the metallothioneins and histones. HAL-PDT-mediated stress also altered expression of genes encoded by mitochondrial DNA. The ATF3
alternative isoform (deltaZip2) was up-regulated, while the full-length variant was not changed by the
treatment. Results were independently verified by two different technological microarray platforms.
Good microarray, RT-PCR and Western immunoblotting correlation for selected genes supported these
findings.
Zawacka-Pankau et al observed (Zawacka-Pankau, 2007) the induction of p53 target pro-apoptotic
genes, e.g. puma (p53-upregulated modulator of apoptosis) and bak, in protoporphyrin IX (PpIX)-treated
cells in the absence of light. In addition, p53-independent growth suppression by PpIX was detected
in p53-negative cells. PDT treatment (2 J/cm2) of HCT116 cells induced p53-dependent activation of
pro-apoptotic gene expression followed by growth suppression and induction of apoptosis. PpIX binds
to p53 and disrupts the interaction between p53 tumor suppressor protein and its negative regulator
HDM2 in vitro and in cells.
The expression of genes encoding complement proteins C3, C5, and C9 was studied following tumor PDT mediated by photosensitizer Photofrin using the mouse Lewis lung carcinoma (LLC) model
(Stott, 2007). Treated tumors and the livers of host mice were collected at different times after PDT
and the expression of the investigated genes was analyzed by RT-PCR. The results showed a significant
up-regulation of C3, C5, and C9 genes in PDT-treated tumors at 24 h after therapy, while no significant
increase in the expression of these genes was found in the liver tissues. The expression of C3, C5, and
C9 genes also became up-regulated in untreated tumor-associated macrophages (TAMs) co-incubated
in vitro with PDT-treated LLC cells. This effect was abolished or drastically reduced in the presence of
antibodies blocking heat shock protein 70 (HSP70), Toll-like receptor (TLR) 2 and TLR4, and specific
peptide inhibitors of the TIRAP adapter protein and the transcription factor NF-κB.
ALA-PDT was found to lead to perturbation of the Hsp90/p23 multichaperone complex, of which Bcr-Abl is a client protein (Pluskalova, 2006). Bcr-Abl protein was suppressed whereas the bcr-abl mRNA level was not affected. Furthermore, several changes were observed in cytoskeleton organization: ALA-PDT disrupted the filamentous actin structure and affected certain cytoskeleton-organizing proteins involved in the cell response to the treatment. Among these proteins, Septin2, which plays a
role in maintaining actin bundles, was suppressed. Another one, PDZ-LIM domain protein 1 (CLP36)
was altered. This protein acts as an adaptor molecule for LIM-kinase which phosphorylates and thus
inactivates cofilin. Cofilin was indeed dephosphorylated and could thus be activated and operate as an
actin-depolymerizing factor.
Ruiz-Galindo et al. (Ruiz-Galindo, 2007) investigated the expression profiles of genes involved in heme biosynthesis in the human retinoblastoma WERI-Rb-1 and Y79 cell lines by reverse transcriptase-polymerase chain reaction (RT-PCR). Expression levels were highest for protoporphyrinogen oxidase
(PPOX), uroporphyrinogen synthase and aminolevulinic acid synthase. Ferrochelatase expression showed
a reduction compared to PPOX. PpIX levels were 15- and 18-fold higher in WERI-Rb-1 and Y79 cells,
respectively, following induction by delta-aminolevulinic acid.
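Several of the studies in this section quantified transcripts by real-time RT-PCR. Relative expression from such experiments is commonly reported with the 2^-ΔΔCt method, sketched below with hypothetical Ct values and a hypothetical GAPDH reference gene; the cited papers may well have used a different normalization scheme.

```python
# The 2^(-ddCt) relative-quantification calculation commonly used for real-time
# RT-PCR (hypothetical Ct values; not taken from the cited studies).

def fold_change_ddct(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    dct_treated = ct_target_treated - ct_ref_treated   # normalize to reference gene
    dct_control = ct_target_control - ct_ref_control
    ddct = dct_treated - dct_control
    return 2.0 ** (-ddct)

# e.g. PPOX transcript vs a GAPDH reference, ALA-treated vs untreated cells
print(fold_change_ddct(ct_target_treated=22.0, ct_ref_treated=18.0,
                       ct_target_control=25.0, ct_ref_control=18.5))   # ~5.7-fold
```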


PDT-Induced Protective Response


In response to many stresses, including heat, oxidizing conditions, and exposure to toxic compounds,
all cells produce a common set of heat shock proteins (Hsps) and glucose-regulated proteins (Grps).
Experiments in E. coli, yeast, fruit flies and mice have shown that increased expression of these proteins can protect the organism against stress-induced damage. Most, but not all, heat shock proteins
are molecular chaperones that bind and stabilize proteins at intermediate stages of folding, assembly,
translocation across membranes and degradation. Heat shock proteins have been classified by molecular
weight, for example, Hsp70 for the 70-kDa heat-shock protein. The transcription of genes belonging
to the Hsp family is regulated by a mechanism involving the binding of heat-shock factors (HSFs) to
specific heat-shock elements (HSEs). In nonstress conditions, the transcription factor HSF is found in
the cytoplasm, in a monomeric form, associated with Hsp70 (Morimoto, 1993). During cellular stress
Hsp70 binds to denatured proteins, freeing HSF that trimerizes and migrates to the nucleus, where it
binds to HSE.
Gomer et al. first showed (Gomer, 1991b) elevated levels of mRNA encoding Grps, as well as increases in Grp protein synthesis, after mouse RIF-1 cells were incubated with PF for 16 h (but not 1
h) and illuminated. In separate experiments, a transient elevation of Grp mRNA levels was observed
in transplanted mouse mammary carcinomas following in vivo PDT treatments. They went on to show
that in vitro PDT with mono-aspartyl chlorin e6 or tin etiopurpurin, but not with PF, increased both
Hsp70 mRNA and protein levels in RIF-1 cells (Gomer, 1996). PDT of RIF-1 tumors in mice gave an
increased expression of Hsp70 using all three PS. In vitro PDT of RIF cells with NPe6 induced HSF-1
binding to HSE, and transiently induced the expression of a reporter gene containing an inducible Hsp70
promoter (Luna, 2000). This led to a proposal that PDT could be a light activated targeted inducer of
specific gene expression (for instance suicide genes) if the gene of interest could be linked to the heat
shock or Grp promoters (Luna, 2002). Hanlon et al. reported that PF-PDT of HT29 and RIF-1 cells and
their PDT resistant sublines led to increase of the mitochondrial Hsp60 and suggested that this protein
may contribute to PDT resistance (Hanlon, 2001). The same group subsequently implicated Hsp27 in
mediating this PDT-resistance as shown by the creation of cells stably overexpressing Hsp27 (Wang,
2002).
In vivo and in vitro studies, using a mouse mammary sarcoma (EMT6) cell line stably transfected
with a plasmid consisting of the gene for green fluorescent protein (GFP) under the control of an Hsp70
promoter, showed that sublethal doses of mTHPC-PDT-induced Hsp70- driven GFP expression (Mitra,
2003). Recently Jalili and a group from Poland demonstrated upregulation of Hsp27, Hsp60, Hsp72/73,
Hsp90, and Grp78 after PF-PDT of mouse C26 cells and linked this protein expression to the effectiveness of immature dendritic cell mediated immunotherapy (Jalili, 2004).
Heme oxygenases (HO) degrade heme to carbon monoxide, iron and biliverdin, which is subsequently
reduced to bilirubin by biliverdin reductase. Not only does HO catalyze the removal of the dangerous
heme molecules, which can generate harmful radicals when in the free form, but also the products of
HO activity can act as neurotransmitters, regulate vascular tone and protect cells from various insults.
The HO gene contains binding sites for several transcription factors, including an AP-1 consensus
sequence (Alam, 1992) that may contribute to an up-regulation of gene expression since this transcription factor may be activated in PDT (see above). PDT of Chinese hamster fibroblast cells (V-79) with
PF or with RB increased HO protein levels (Gomer, 1991c). Other workers have shown that HO can
be induced in the dark after incubation of cells with HPD or zinc phthalocyanine (ZnPC) (Bressoud, 1992). Lin and Girotti showed that pre-incubation of cells with hemin could cause resistance to PDT
by inducing expression of HO1 (Das, 2000). Increases in expression of anti-oxidant enzymes may also
be caused by PDT. Studies using human adenocarcinoma HeLa cells showed induction of manganese
superoxide dismutase (MnSOD) mRNA following photosensitization with PF (Das, 2000). Studies from
Poland using murine colon-26 (C26) cells showed that PF-PDT increased the protein levels of MnSOD,
but not of Cu,Zn-SOD (Golab, 2003). Transient transfection of the T24 bladder cancer cell line with
the MnSOD gene, but not with the Cu,Zn-SOD gene, or pretreatment of C26 and T24 cells with a cell
permeable SOD mimetic, resulted in a considerable decrease in the effectiveness of PDT with PF. These
results suggest that inhibition of SOD activity may be effective in potentiating the antitumor effectiveness of PDT (Golab, 2003). They then tested 2-methoxyestradiol (2ME), a SOD-inhibitor capable of
potentiating the antitumor effects of PDT. The combination produced retardation of tumor growth and
prolongation of the survival of tumor-bearing mice. A study (Lu, 2004) showed that human glutathione
S-transferase (GST) isoforms GSTP1-1 (P1-1) and GSTA1-1 (A1-1) bind with high affinity to hypericin
(HYP) and differentially quench its PDT properties; this antioxidant effect was attributed to the classic ligandin activity of GSTs, wherein non-substrate planar aromatic anions are sequestered on, and
inhibit, the enzyme.

Studies with PDT-Resistant Cells


A group in Canada has devoted some effort to comparing PDT resistant HT29 human colon cancer
cells (selected by regrowing surviving cells from sequential PDT treatments), with their wild-type
counterparts, with the aim of discovering which genes and pathways are important in susceptibility
to PDT induced cell death. They compared gene expression profiles between the Photofrin-resistant
cell line (HT29-P14) and its parental cell line HT29 using DNA-microarray analysis (Wang, 2002).
A significant up-regulation of Hsp27 was found in HT29-P14 cells. They then transfected HT29 cells
with human Hsp27 cDNA, and stably transfected cells (H13) showed increased survival after Photofrin-PDT, suggesting that the up-regulation of Hsp27 is related to the induced resistance to Photofrin-PDT. Phosphorylation of Hsp27 may play an important role in cytoprotection, and an increased Hsp27
phosphorylation level was found in both resistant and overexpressing cells after PDT. The activation of
the phosphorylation of Hsp27 induced by PDT was not mediated by the p38 mitogen-activated protein
kinase. Next they used messenger RNA (mRNA) differential display to identify genes that were differentially expressed in the parental HT29 cells compared with their resistant variants (Shen, 2005). In
comparison with parental HT29 cells, mRNA expression was increased in the PDT-resistant cell variants
for BNIP3, estrogen receptor-binding fragment-associated gene 9, Myh-1c, cytoplasmic dynein light
chain 1, small membrane protein I and differential dependent protein. In contrast, expression in the
PDT-resistant variants was downregulated for NNX3, human HepG2 3' region MboI complementary
DNA, glutamate dehydrogenase, hepatoma-derived growth factor and the mitochondrial genes coding for 16S ribosomal RNA (rRNA) and nicotinamide adenine dinucleotide (NADH) dehydrogenase
subunit 4. The reduction for mitochondrial 16S rRNA in the PDT-resistant variants was confirmed by
Northern blotting, and the elevated expression of the proapoptotic BNIP3 in the PDT-resistant variants
was confirmed by Northern and Western blotting analysis. They also examined the expression of some
additional apoptosis-regulating genes using Western blotting, and showed an increased expression of
Bcl-2 and Hsp27 and a downregulation of Bax in the PDT-resistant variants. In addition, the mutant p53
levels in the parental HT29 cells were reduced substantially in the PDT-resistant variants.
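
To illustrate the type of comparison underlying such microarray and differential-display studies, the short Python sketch below ranks genes by log2 fold change between a resistant and a parental cell line. It is a minimal illustration only: the gene labels and expression values are hypothetical placeholders, not data from Wang (2002) or Shen (2005).

```python
# Minimal sketch of a fold-change screen between a PDT-resistant cell line and
# its parental line. The expression values are hypothetical normalized
# intensities, not data from the cited studies.
import math

parental  = {"HSP27": 1.0, "BNIP3": 0.8, "MT_16S_rRNA": 2.0, "BAX": 1.2}
resistant = {"HSP27": 3.1, "BNIP3": 2.4, "MT_16S_rRNA": 0.7, "BAX": 0.5}

def log2_fold_change(res, par):
    """log2(resistant / parental); positive means up in the resistant variant."""
    return math.log2(res / par)

changes = {gene: log2_fold_change(resistant[gene], parental[gene]) for gene in parental}

# Rank genes by magnitude of change and flag those beyond a +/- 1 log2 cutoff.
for gene, lfc in sorted(changes.items(), key=lambda kv: -abs(kv[1])):
    status = "up" if lfc > 1 else "down" if lfc < -1 else "unchanged"
    print(f"{gene:>12}: log2 fold change = {lfc:+.2f} ({status} in resistant cells)")
```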

Since BNip3 is a potent inducer of apoptosis, they next tested whether these PDT-resistant cells were cross-resistant to other cytotoxic agents (Dregoesc, 2007). PDT-resistant HT29
cell lines showed a significant increase in cisplatin sensitivity and an increase in both spontaneous and
cisplatin-induced apoptosis compared to parental HT29 cells. In addition, the cisplatin sensitivity of the
PDT-resistant HT29 variants and several other clonal variants of HT29 cells correlated with increased
BNip3 and decreased mutant p53 protein levels, but not Hsp27 protein levels. Finally they investigated
whether the PDT-resistant cells were cross-resistant to ultraviolet light (UV) treatment (Zacal, 2007).
The HT29 PDT-resistant variants showed cross-resistance to long-wavelength UVA (320-400 nm) but
not to short-wavelength UVC (200-280 nm) light. Cell sensitivity to UVA or UVC was then correlated
with Hsp27, BNip3 and mutant p53 protein levels in the PDT-resistant variants as well as in several
clonal variants of HT29 cells that express different levels of Hsp27, BNip3 and mutant p53. Increased
expression of Hsp27 and BNip3 and decreased expression of mutant p53 correlated with increased
resistance to UVA. In contrast, increased expression of Hsp27 and BNip3 correlated with increased
sensitivity to UVC, whereas increased expression of mutant p53 showed no significant correlation with
sensitivity to UVC.
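
The kind of correlation analysis described above can be illustrated with a minimal sketch relating protein levels to survival after a fixed UV dose across cell variants. All numbers below are hypothetical placeholders, not the measurements reported by Zacal (2007); only the form of the analysis is shown.

```python
# Minimal sketch of correlating protein levels with UVA survival across several
# cell variants. All values are hypothetical placeholders; only the form of the
# analysis (a simple correlation) is illustrated.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Relative protein levels (arbitrary densitometry units) in five variants.
hsp27   = [1.0, 2.5, 3.0, 1.8, 2.9]
bnip3   = [1.0, 2.0, 2.6, 1.5, 2.4]
mut_p53 = [1.0, 0.6, 0.4, 0.8, 0.5]
# Surviving fraction of each variant after a fixed UVA dose (higher = more resistant).
uva_survival = [0.20, 0.45, 0.55, 0.35, 0.50]

for name, levels in [("Hsp27", hsp27), ("BNip3", bnip3), ("mutant p53", mut_p53)]:
    print(f"{name:>10} vs UVA survival: r = {pearson(levels, uva_survival):+.2f}")
```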

CONCLUSION
Systems biology includes a set of powerful new techniques capable of generating large amounts of data
on molecules, proteins, gene expression, signaling, molecular interactions and many other processes
occurring inside cells. Since PDT induces many complicated and interlocking processes in treated cells,
a systems biology approach would appear ideal to tackle this otherwise daunting investigation. As the
new techniques used in the various omics disciplines become more familiar to scientists working in
the field of PDT, we expect to see an increasing number of studies using high throughput technology
to address the effect of PDT on both cancer and normal cells.

References
Abels, C. (2004). Targeting of the vascular system of solid tumours by photodynamic therapy (PDT).
Photochem Photobiol Sci., 3, 765-771.
Agarwal, M. L., Clay, M. E., Harvey, E. J., Evans, H. H., Antunez, A. R., & Oleinick, N. L. (1991).
Photodynamic therapy induces rapid cell death by apoptosis in L5178Y mouse lymphoma cells. Cancer
Res., 51, 5993-6.
Agarwal, M. L., Larkin, H. E., Zaidi, S. I., Mukhtar, H., & Oleinick, N. L. (1993). Phospholipase activation triggers apoptosis in photosensitized mouse lymphoma cells. Cancer Res., 53, 5897-902.
Agostinis, P., Buytaert, E., Breyssens, H., & Hendrickx, N. (2004). Regulatory pathways in photodynamic therapy induced apoptosis. Photochem. Photobiol. Sci., 3, 721-729.
Agostinis, P., Vantieghem, A., Merlevede, W., & de Witte, P. A. (2002). Hypericin in cancer treatment:
More light on the way. The International Journal of Biochemistry & Cell Biology, 34, 221-241.

Ahmad, N., Feyes, D. K., Agarwal, R., & Mukhtar, H. (1998). Photodynamic therapy results in induction
of WAF1/CIP1/P21 leading to cell cycle arrest and apoptosis. Proc Natl Acad Sci USA, 95, 6977-82.
Ahmad, N., Gupta, S., & Mukhtar, H. (1999). Involvement of retinoblastoma (Rb) and E2F transcription factors during photodynamic therapy of human epidermoid carcinoma cells A431. Oncogene, 18,
1891-6.
Ahmad, N., Kalka, K., & Mukhtar, H. (2001). In vitro and in vivo inhibition of epidermal growth factor
receptor-tyrosine kinase pathway by photodynamic therapy. Oncogene,20, 2314-2317.
Alam, J., & Den, Z. (1992). Distal AP-1 binding sites mediate basal level enhancement and TPA induction of the mouse heme oxygenase-1 gene. J.Biol.Chem., 267, 21894-21900.
Allison, R. R., Downie, G. H., Cuenca, R., Hu, X.-H., Childs, C. J., & Sibata, C. H. (2004). Photosensitizers in clinical PDT. Photodiag Photodynam Ther., 1, 27-42.
Anderson, C., Hrabovsky, S., McKinley, Y., Tubesing, K., Tang, H. P., Dunbar, R. et al. (1997). Phthalocyanine photodynamic therapy: Disparate effects of pharmacologic inhibitors on cutaneous photosensitivity and on tumor regression. Photochem Photobiol., 65, 895-901.
Assefa, Z., Vantieghem, A., Declercq, W., Vandenabeele, P., Vandenheede, J. R., Merlevede, W. et al.
(1999). The activation of the c-Jun N-terminal kinase and p38 mitogen-activated protein kinase signaling
pathways protects HeLa cells from apoptosis following photodynamic therapy with hypericin. J.Biol.
Chem., 274, 8788-8796.
Aveline, B. M., & Redmond, R. W. (1999). Can cellular phototoxicity be accurately predicted on the
basis of sensitizer photophysics? Photochem Photobiol., 69, 306-16.
Baas, P., van Mansom, I., van Tinteren, H., Stewart, F. A., & van Zandwijk, N. (1995). Effect of N-acetylcysteine on Photofrin-induced skin photosensitivity in patients. Lasers Surg Med., 16, 359-67.
Bachowski, G. J., Korytowski, W., & Girotti, A. W. (1994). Characterization of lipid hydroperoxides
generated by photodynamic treatment of leukemia cells. Lipids, 29, 449-459.
Bachowski, G. J., Pintar, T. J., & Girotti, A. W. (1991). Photosensitized lipid peroxidation and enzyme
inactivation by membrane-bound merocyanine 540: Reaction mechanisms in the absence and presence
of ascorbate. Photochem.Photobiol, 53, 481-491.
Ball, D. J., Mayhew, S., Vernon, D. I., Griffin, M., & Brown, S. B. (2001). Decreased efficiency of trypsinization of cells following photodynamic therapy: evaluation of a role for tissue transglutaminase.
Photochem.Photobiol., 73, 47-53.
Bektas, M., & Spiegel, S. (2004). Glycosphingolipids and cell death. Glycoconj. J., 20, 39-47.
Belickova, M., Bruchova, H., Cajthamlova, H., Hrkal, Z., & Brdicka, R. (2000). Genes involved in the
destruction of leukaemic cells by induced photosensitivity. Folia Biol (Praha), 46, 131-5.
Belzacq, A. S., Jacotot, E., Vieira, H. L., Mistro, D., Granville, D. J., Xie, Z. et al. (2001). Apoptosis
induction by the photosensitizer verteporfin: Identification of mitochondrial adenine nucleotide translocator as a critical target. Cancer Res. 61, 1260-1264.

Benstead, K., & Moore, J. V. (1989). Quantitative histological changes in murine tail skin following
photodynamic therapy. Br. J. Cancer, 59, 503-509.
Berg, K., Madslien, K., Bommer, J. C., Oftebro, R., Winkelman, J. W., & Moan, J. (1991). Light induced
relocalization of sulfonated meso-tetraphenylporphines in NHIK 3025 cells and effects of dose fractionation. Photochem Photobiol, 53, 203-10.
Berg, K., & Moan, J. (1994). Lysosomes as photochemical targets. Int. J. Cancer, 59, 814-822.
Berridge, M. J., & Irvine, R. F. (1984). Inositol trisphosphate, a novel second messenger in cellular signal
transduction. Nature, 312, 315-321.
Bilski, P., Motten, A. G., Bilska, M., & Chignell, C. F. (1993). The photooxidation of diethylhydroxylamine by rose bengal in micellar and nonmicellar aqueous solutions. Photochemistry and Photobiology,
58, 11-18.
Bizik, J., Kankuri, E., Ristimaki, A., Taieb, A., Vapaatalo, H., Lubitz, W. et al. (2004). Cell-cell contacts trigger programmed necrosis and induce cyclooxygenase-2 expression. Cell Death Differ., 11,
183-195.
Blank, M., Mandel, M., Keisari, Y., Meruelo, D., & Lavie, G. (2003). Enhanced ubiquitinylation of heat
shock protein 90 as a potential mechanism for mitotic cell death in cancer cells induced with hypericin.
Cancer Res., 63, 8241-8247.
Bossy-Wetzel, E., & Green, D. R. (1999). Caspases induce cytochrome c release from mitochondria by
activating cytosolic factors. J.Biol.Chem., 274, 17484-17490.
Bottiroli, G., Croce, A. C., Balzarini, P., Locatelli, D., Baglioni, P., Lo Nostro, P. et al. (1997). Enzymeassisted cell photosensitization: A proposal for an efficient approach to tumor therapy and diagnosis.
The rose bengal fluorogenic substrate. Photochem Photobiol, 66, 374-83.
Bracken, A. P., Ciro, M., Cocito, A., & Helin, K. (2004). E2F target genes: unraveling the biology.
Trends Biochem. Sci., 29, 409-417.
Brancaleon, L., & Moseley, H. (2002). Laser and non-laser light sources for photodynamic therapy.
Lasers Med. Sci., 17, 173-186.
Bressoud, D., Jomini, V., & Tyrrell, R. M. (1992). Dark induction of haem oxygenase messenger RNA
by haematoporphyrin derivative and zinc phthalocyanine; agents for photodynamic therapy. J. Photochem. Photobiol. B., 14, 311-318.
Buchko, G. W., Cadet, J., Ravanat, J. L., & Labataille, P. (1993). Isolation and characterization of a
new product produced by ionizing irradiation and type I photosensitization of 2-deoxyguanosine in
oxygen-saturated aqueous solution: (2S)-2,5-anhydro-1-(2-deoxy-beta-D-erythro-pentofuranosyl)-5-guanidinylidene-2-hydroxy-4-oxoimidazolidine. Int J Radiat Biol., 63, 669-76.
Buchko, G. W., Wagner, J. R., Cadet, J., Raoul, S., & Weinfeld, M. (1995). Methylene blue-mediated
photooxidation of 7,8-dihydro-8-oxo-2-deoxyguanosine. Biochim Biophys Acta., 1263, 17-24.

Buytaert, E., Callewaert, G., Hendrickx, N., Scorrano, L., Hartmann, D., Missiaen, L. et al. (2006a). Role
of endoplasmic reticulum depletion and multidomain proapoptotic BAX and BAK proteins in shaping
cell death after hypericin-mediated photodynamic therapy. Faseb J., 20, 756-8.
Buytaert, E., Callewaert, G., Vandenheede, J. R., & Agostinis, P. (2006b). Deficiency in apoptotic effectors Bax and Bak reveals an autophagic cell death pathway initiated by photodamage to the endoplasmic
reticulum. Autophagy, 2, 238-40.
Buytaert, E., Dewaele, M., & Agostinis, P. (2007). Molecular effectors of multiple cell death pathways
initiated by photodynamic therapy. Biochim Biophys Acta., 1776, 86-107.
Cadet, J., Berger, M., Decarroz, C., Wagner, J. R., van Lier, J. E., Ginot, Y. M. et al. (1986). Photosensitized reactions of nucleic acids. Biochimie., 68, 813-834.
Calzavara-Pinton, P. G. (1995). Repetitive photodynamic therapy with topical delta-aminolaevulinic
acid as an appropriate approach to the routine treatment of superficial non-melanoma skin tumours. J
Photochem Photobiol B., 29, 53-7.
Canete, M., Ortega, C., Gavalda, A., Cristobal, J., Juarranz, A., Nonell, S. et al. (2004). Necrotic cell death
induced by photodynamic treatment of human lung adenocarcinoma A-549 cells with palladium(II)tetraphenylporphycene. Int.J.Oncol., 24, 1221-1228.
Castedo, M., Perfettini, J. L., Roumier, T., Andreau, K., Medema, R., & Kroemer, G. (2004). Cell death
by mitotic catastrophe: a molecular definition. Oncogene., 23, 2825-2837.
Castillo, L., Etienne-Grimaldi, M. C., Fischel, J. L., Formento, P., Magne, N., & Milano, G. (2004).
Pharmacological background of EGFR targeting. Ann.Oncol., 15, 1007-1012.
Cavallaro, U., & Christofori, G. (2004). Multitasking in tumor progression: signaling functions of cell
adhesion molecules. Ann. N.Y. Acad. Sci., 1014, 58-66.
Cecic, I., & Korbelik, M. (2002). Mediators of peripheral blood neutrophilia induced by photodynamic
therapy of solid tumors. Cancer Lett., 183, 43-51.
Cecic, I., Parkins, C. S., & Korbelik, M. (2001). Induction of systemic neutrophil response in mice by
photodynamic therapy of solid tumors. Photochem Photobiol., 74, 712-20.
Cekaite, L., Peng, Q., Reiner, A., Shahzidi, S., Tveito, S., Furre, I. E. et al. (2007). Mapping of oxidative stress responses of human tumor cells following photodynamic therapy using hexaminolevulinate.
BMC Genomics., 8, 273.
Chen, J. Y., Mak, N. K., Yow, C. M., Fung, M. C., Chiu, L. C., Leung, W. N. et al. (2000). The binding
characteristics and intracellular localization of temoporfin (mTHPC) in myeloid leukemia cells: phototoxicity and mitochondrial damage. Photochem.Photobiol., 72, 541-547.
Chiu, S. M., Davis, T. W., Meyers, M., Ahmad, N., Mukhtar, H., & Separovic, D. (2000). Phthalocyanine
4-photodynamic therapy induces ceramide generation and apoptosis in acid sphingomyelinase-deficient
mouse embryonic fibroblasts. Int.J.Oncol., 16, 423-427.

Colussi, V. C., Feyes, D. K., Mulvihill, J. W., Li, Y. S., Kenney, M. E., Elmets, C. A. et al. (1999). Phthalocyanine 4 (Pc 4) photodynamic therapy of human OVCAR-3 tumor xenografts. Photochem Photobiol.,
69, 236-41.
Coutier, S., Bezdetnaya, L., Marchal, S., Melnikova, V., Belitchenko, I., Merlin, J. L. et al. (1999). Foscan
(mTHPC) photosensitized macrophage activation: enhancement of phagocytosis, nitric oxide release
and tumour necrosis factor-alpha-mediated cytolytic activity. Br. J. Cancer., 81, 37-42.
Curnow, A., McIlroy, B. W., Postle-Hacon, M. J., Porter, J. B., MacRobert, A. J., & Bown, S. G. (1998).
Enhancement of 5-aminolaevulinic acid-induced photodynamic therapy in normal rat colon using hydroxypyridinone iron-chelating agents. Br J Cancer., 78, 1278-82.
Dahle, J., Bagdonas, S., Kaalhus, O., Olsen, G., Steen, H. B., & Moan, J. (2000). The bystander effect
in photodynamic inactivation of cells. Biochim. Biophys. Acta., 1475, 273-280.
Dahle, J., Kaalhus, O., Moan, J., & Steen, H. B. (1997). Cooperative effects of photodynamic treatment
of cells in microcolonies. Proc Natl Acad Sci USA, 94, 1773-8.
Dahle, J., Steen, H. B., & Moan, J. (1999). The mode of cell death induced by photodynamic treatment
depends on cell density. Photochem.Photobiol., 70, 363-367.
Das, H., Koizumi, T., Sugimoto, T., Yamaguchi, S., Hasegawa, K., Tenjin, Y. et al. (2000). Induction of
apoptosis and manganese super-oxide dismutase gene by photodynamic therapy in cervical carcinoma
cell lines. Int. J. Clin. Oncol., 5, 97-103.
de Vree, W. J., Fontijne-Dorsman, A. N., Koster, J. F., & Sluiter, W. (1996). Photodynamic treatment of
human endothelial cells promotes the adherence of neutrophils in vitro. Br. J. Cancer, 73, 1335-1340.
Deininger, M. H., Weinschenk, T., Morgalla, M. H., Meyermann, R., & Schluesener, H. J. (2002). Release
of regulators of angiogenesis following Hypocrellin-A and -B photodynamic therapy of human brain
tumor cells. Biochem. Biophys. Res. Commu, 298, 520-530.
Dellinger, M. (1996). Apoptosis or necrosis following Photofrin photosensitization: influence of the
incubation protocol. Photochem. Photobiol., 64, 182-187.
Di Mascio, P., Teixeira, P. C., Onuki, J., Medeiros, M. H., Dornemann, D., Douki, T. et al. (2000). DNA
damage by 5-aminolevulinic and 4,5-dioxovaleric acids in the presence of ferritin. Arch. Biochem.
Biophys., 373, 368-374.
Ding, X., Xu, Q., Liu, F., Zhou, P., Gu, Y., Zeng, J. et al. (2004). Hematoporphyrin monomethyl ether
photodynamic damage on HeLa cells by means of reactive oxygen species production and cytosolic
free calcium concentration elevation. Cancer Lett., 216, 43-54.
Dolgachev, V., Farooqui, M. S., Kulaeva, O. I., Tainsky, M. A., Nagy, B., Hanada, K. et al. (2004). De
novo ceramide accumulation due to inhibition of its conversion to complex sphingolipids in apoptotic
photosensitized cells. J. Biol. Chem., 279, 23238-23249.
Dougherty, T. J. (1974). Activated dyes as antitumor agents. J Natl Cancer Inst, 52, 1333-6.

Dougherty, T. J., Kaufman, J. E., Goldfarb, A., Weishaupt, K. R., Boyle, D., & Mittleman, A. (1978).
Photoradiation therapy for the treatment of malignant tumors. Cancer Res, 38, 2628-35.
Dougherty, T. J., Lawrence, G., Kaufman, J. H., Boyle, D., Weishaupt, K. R., & Goldfarb, A. (1979).
Photoradiation in the treatment of recurrent breast carcinoma. J Natl Cancer Inst, 62, 231-7.
Dougherty, T. J., & Potter, W. R. (1991). Of what value is a highly absorbing photosensitizer in PDT. J
Photochem Photobiol B, 8, 223-5.
Dougherty, T. J., Sumlin, A. B., Greco, W. R., Weishaupt, K. R., Vaughan, L. A., & Pandey, R. K. (2002).
The role of the peripheral benzodiazepine receptor in photodynamic activity of certain pyropheophorbide ether photosensitizers: Albumin site II as a surrogate marker for activity. Photochem. Photobiol,
76, 91-97.
Douki, T., Onuki, J., Medeiros, M. H., Bechara, E. J., Cadet, J., & Di Mascio, P. (1998). DNA alkylation
by 4, 5-dioxovaleric acid, the final oxidation product of 5-aminolevulinic acid. Chem Res Toxicol, 11,
150-7.
Dregoesc, D., Rybak, A. P., & Rainbow, A. J. (2007). Increased expression of p53 enhances transcription-coupled repair and global genomic repair of a UVC-damaged reporter gene in human cells. DNA
Repair (Amst), 6, 588-601.
Dummin, H., Cernay, T., & Zimmermann, H. W. (1997). Selective photosensitization of mitochondria
in HeLa cells by cationic Zn (II) phthalocyanines with lipophilic side-chains. J Photochem Photobiol
B, 37, 219-29.
Evans, S., Matthews, W., Perry, R., Fraker, D., Norton, J., & Pass, H. I. (1990). Effect of photodynamic
therapy on tumor necrosis factor production by murine macrophages. J Natl Cancer Inst, 82, 34-9.
Ferrario, A., Chantrain, C. F., von Tiehl, K., Buckley, S., Rucker, N., Shalinsky, D. R. et al. (2004). The
matrix metalloproteinase inhibitor prinomastat enhances photodynamic therapy responsiveness in a
mouse tumor model. Cancer Res, 64, 2328-2332.
Ferrario, A., Von Tiehl, K., Wong, S., Luna, M., & Gomer, C. J. (2002). Cyclooxygenase-2 inhibitor
treatment enhances photodynamic therapy- mediated tumor response. Cancer Res, 62, 3956-61.
Ferrario, A., von Tiehl, K. F., Rucker, N., Schwarz, M. A., Gill, P. S., & Gomer, C. J. (2000). Antiangiogenic treatment enhances photodynamic therapy responsiveness in a mouse mammary carcinoma.
Cancer Res., 60, 4066-9.
Fiel, R. J., Datta-Gupta, N., Mark, E. H., & Howard, J. C. (1981). Induction of DNA damage by porphyrin
photosensitizers. Cancer Res, 41, 3543-3545.
Figge, F. H., Weiland, G. S., & Manganiello, L. O. (1948). Affinity of neoplastic, embryonic and traumatized tissues for porphyrins and metalloporphyrins. Proc Soc Exp Biol Med, 68, 640.
Fingar, V. H., Wieman, T. J., & Doak, K. W. (1990). Role of thromboxane and prostacyclin release on
photodynamic therapy-induced tumor destruction. Cancer Res, 50, 2599-603.

Fingar, V. H., Wieman, T. J., Karavolos, P. S., Doak, K. W., Ouellet, R., & van Lier, J. E. (1993). The
effects of photodynamic therapy using differently substituted zinc phthalocyanines on vessel constriction, vessel leakage and tumor response. Photochem Photobiol, 58, 251-8.
Firey, P. A., & Rodgers, M. A. (1987). Photo-properties of a silicon naphthalocyanine: a potential photosensitizer for photodynamic therapy. Photochem Photobiol, 45, 535-8.
Fotinos, N., Campo, M. A., Popowycz, F., Gurny, R., & Lange, N. (2006). 5-Aminolevulinic acid derivatives in photomedicine: Characteristics, application and perspectives. Photochem Photobiol, 82,
994-1015.
Foultier, M. T., Patrice, T., Yactayo, S., Lajat, Y., & Resche, F. (1992). Photodynamic treatment of normal
endothelial cells or glioma cells in vitro. Surg Neurol, 37, 83-8.
Fuchs, J., Weber, S., & Kaufmann, R. (2000). Genotoxic potential of porphyrin type photosensitizers
with particular emphasis on 5-aminolevulinic acid: Implications for clinical photodynamic therapy.
Free Radic. Biol. Med, 28, 537-548.
Fungaloi, P., Statius van Eps, R., Wu, Y. P., Blankensteijn, J., de Groot, P., van Urk, H. et al. (2002).
Platelet adhesion to photodynamic therapy-treated extracellular matrix proteins. Photochem. Photobiol,
75, 412-417.
Gaullier, J. M., Geze, M., Santus, R., Sa e Melo, T., Maziere, J. C., Bazin, M. et al. (1995). Subcellular
localization of and photosensitization by protoporphyrin IX in human keratinocytes and fibroblasts cultivated with 5-aminolevulinic acid. Photochem. Photobiol, 62, 114-122.
George, S. J., & Dwivedi, A. (2004). MMPs, cadherins, and cell proliferation. Trends Cardiovasc Med,
14, 100-105.
Geze, M., Morliere, P., Maziere, J. C., Smith, K. M., & Santus, R. (1993). Lysosomes, a key target of
hydrophobic photosensitizers proposed for photochemotherapeutic applications. J. Photochem. Photobiol. B, 20, 23-35.
Gibson, S. L., Cupriks, D. J., Havens, J. J., Nguyen, M. L., & Hilf, R. (1998). A regulatory role for porphobilinogen deaminase (PBGD) in delta-aminolaevulinic acid (delta-ALA)-induced photosensitization?
Br J Cancer, 77, 235-43.
Girotti, A. W. (1985). Mechanisms of lipid peroxidation. J Free Radic Biol Med, 1, 87-95.
Girotti, A. W. (1983). Mechanisms of photosensitization. Photochem Photobiol, 38, 745-51.
Golab, J., Nowis, D., Skrzycki, M., Czeczot, H., Baranczyk-Kuzma, A., Wilczynski, G. M. et al. (2003).
Antitumor effects of photodynamic therapy are potentiated by 2-methoxyestradiol. A superoxide dismutase inhibitor. J Biol. Chem., 278, 407-414.
Gollnick, S. O., Evans, S. S., Baumann, H., Owczarczak, B., Maier, P., Vaughan, L. et al. (2003). Role
of cytokines in photodynamic therapy-induced local and systemic inflammation. Br J Cancer, 88,
1772-1779.

Gollnick, S. O., Lee, B. Y., Vaughan, L., Owczarczak, B., & Henderson, B. W. (2001). Activation of the
IL-10 gene promoter following photodynamic therapy of murine keratinocytes. Photochem Photobiol,
73, 170-7.
Gollnick, S. O., Liu, X., Owczarczak, B., Musser, D. A., & Henderson, B. W. (1997). Altered expression of
interleukin 6 and interleukin 10 as a result of photodynamic therapy in vivo. Cancer Res, 57, 3904-9.
Gomer, C. J. (1991a). Preclinical examination of first and second generation photosensitizers used in
photodynamic therapy. Photochem Photobiol, 54, 1093-107.
Gomer, C. J., Ferrario, A., Rucker, N., Wong, S., & Lee, A. S. (1991b). Glucose regulated protein induction and cellular resistance to oxidative stress mediated by porphyrin photosensitization. Cancer Res,
51, 6574-9.
Gomer, C. J., Luna, M., Ferrario, A., & Rucker, N. (1991c). Increased transcription and translation of
heme oxygenase in Chinese hamster fibroblasts following photodynamic stress or Photofrin II incubation. Photochem Photobiol, 53, 275-9.
Gomer, C. J., Ryter, S. W., Ferrario, A., Rucker, N., Wong, S., & Fisher, A. M. (1996). Photodynamic
therapy-mediated oxidative stress can induce expression of heat shock proteins. Cancer Res, 56, 2355-60.
Granville, D. J., Carthy, C. M., Jiang, H., Levy, J. G., McManus, B. M., Matroule, J. Y. et al. (2000). Nuclear
factor-kappaB activation by the photochemotherapeutic agent verteporfin. Blood, 95, 256-262.
Granville, D. J., Levy, J. G., & Hunt, D. W. (1998). Photodynamic treatment with benzoporphyrin derivative monoacid ring A produces protein tyrosine phosphorylation events and DNA fragmentation in
murine P815 cells. Photochem Photobiol, 67, 358-62.
Granville, D. J., Ruehlmann, D. O., Choy, J. C., Cassidy, B. A., Hunt, D. W., van Breemen, C. et al.
(2001). Bcl-2 increases emptying of endoplasmic reticulum Ca2+ stores during photodynamic therapyinduced apoptosis. Cell Calcium, 30, 343-350.
Grebenova, D., Kuzelova, K., Smetana, K., Pluskalova, M., Cajthamlova, H., Marinov, I. et al. (2003).
Mitochondrial and endoplasmic reticulum stress-induced apoptotic pathways are activated by 5-aminolevulinic acid-based photodynamic therapy in HL60 leukemia cells. J Photochem Photobiol B, 69,
71-85.
Green, D. R., & Reed, J. C. (1998). Mitochondria and apoptosis. Science, 281, 1309-1312.
Grossweiner, L. I. (1997). PDT light dosimetry revisited. J Photochem Photobiol B, 38, 258-268.
Grune, T., Klotz, L. O., Gieche, J., Rudeck, M., & Sies, H. (2001). Protein oxidation and proteolysis by
the nonradical oxidants singlet oxygen or peroxynitrite. Free Radic Biol Med, 30, 1243-1253.
Hanlon, J. G., Adams, K., Rainbow, A. J., Gupta, R. S., & Singh, G. (2001). Induction of Hsp60 by
Photofrin-mediated photodynamic therapy. J Photochem Photobiol B, 64, 55-61.
Hausmann, W. (1911). Die sensibilisierende Wirkung des Hämatoporphyrins. Biochem Z, 30, 276.

He, X. Y., Sikes, R. A., Thomsen, S., Chung, L. W., & Jacques, S. L. (1994). Photodynamic therapy
with photofrin II induces programmed cell death in carcinoma cell lines. Photochem Photobiol, 59,
468-73.
Henderson, B. W., & Donovan, J. M. (1989). Release of prostaglandin E2 from cells by photodynamic
treatment in vitro. Cancer Res, 49, 6896-900.
Hendrickx, N., Volanti, C., Moens, U., Seternes, O. M., de Witte, P., Vandenheede, J. R. et al. (2003).
Up-regulation of cyclooxygenase-2 and apoptosis resistance by p38 MAPK in hypericin-mediated
photodynamic therapy of human cancer cells. J Biol Chem, 278, 52231-52239.
Herman, S., Kalechman, Y., Gafter, U., Sredni, B., & Malik, Z. (1996). Photofrin II induces cytokine
secretion by mouse spleen cells and human peripheral mononuclear cells. Immunopharmacology, 31,
195-204.
Hsieh, Y. J., Wu, C. C., Chang, C. J., & Yu, J. S. (2003). Subcellular localization of Photofrin determines
the death phenotype of human epidermoid carcinoma A431 cells triggered by photodynamic therapy:
when plasma membranes are the main targets. J Cell Physiol, 194, 363-375.
Hubmer, A., Hermann, A., Uberriegler, K., & Krammer, B. (1996). Role of calcium in photodynamically
induced cell damage of human fibroblasts. Photochem Photobiol, 64, 211-215.
Hynes, R. O. (2002). Integrins: bidirectional, allosteric signaling machines. Cell, 110, 673-687.
Jalili, A., Makowski, M., Switaj, T., Nowis, D., Wilczynski, G. M., Wilczek, E. et al. (2004). Effective
photoimmunotherapy of murine colon carcinoma induced by the combination of photodynamic therapy
and dendritic cells. Clin Cancer Res, 10, 4498-4508.
Jesionek, A., & von Tappeiner, H. (1903). Zur Behandlung der Hautcarcinome mit fluorescierenden Stoffen. Muench Med Wochenschr, 47, 2042.
Jiang, F., Zhang, Z. G., Katakowski, M., Robin, A. M., Faber, M., Zhang, F. et al. (2004). Angiogenesis
induced by photodynamic therapy in normal rat brains. Photochem Photobiol, 79, 494-498.
Joshi, P. G., Joshi, K., Mishra, S., & Joshi, N. B. (1994). Ca2+ influx induced by photodynamic action
in human cerebral glioma (U-87 MG) cells: possible involvement of a calcium channel. Photochem
Photobiol, 60, 244-248.
Kerr, J. F., Wyllie, A. H., & Currie, A. R. (1972). Apoptosis: a basic biological phenomenon with wideranging implications in tissue kinetics. Br J Cancer, 26, 239-257.
Kessel, D. (1982). Components of hematoporphyrin derivatives and their tumor-localizing capacity.
Cancer Res, 42, 1703-6.
Kessel, D. (1989a). On the purity and definition of oligomeric HPD formulations. J Photochem Photobiol
B, 3, 637-8.
Kessel, D. (1989b). Probing the structure of HPD by fluorescence spectroscopy. Photochem Photobiol,
50, 345-50.

Kessel, D. (1986). Proposed structure of the tumor-localizing fraction of HPD (hematoporphyrin derivative). Photochem Photobiol, 44, 193-6.
Kessel, D., Antolovich, M., & Smith, K. M. (2001). The role of the peripheral benzodiazepine receptor
in the apoptotic response to photodynamic therapy. Photochem Photobiol, 74, 346-9.
Kessel, D., & Arroyo, A. S. (2007). Apoptotic and autophagic responses to Bcl-2 inhibition and photodamage. Photochem Photobiol Sci, 6, 1290-5.
Kessel, D., Luguya, R., & Vicente, M. G. (2003). Localization and photodynamic efficacy of two cationic
porphyrins varying in charge distributions. Photochem Photobiol, 78, 431-435.
Kessel, D., & Luo, Y. (1999). Photodynamic therapy: a mitochondrial inducer of apoptosis. Cell Death
Differ, 6, 28-35.
Kessel, D., Luo, Y., Deng, Y., & Chang, C. K. (1997). The role of subcellular localization in initiation
of apoptosis by photodynamic therapy. Photochem Photobiol, 65, 422-6.
Kessel, D., & Poretz, R. D. (2000). Sites of photodamage induced by photodynamic therapy with a
chlorin e6 triacetoxymethyl ester (CAME). Photochem Photobiol, 71, 94-6.
Kessel, D., & Thompson, P. (1987). Purification and analysis of hematoporphyrin and hematoporphyrin
derivative by gel exclusion and reverse-phase chromatography. Photochem Photobiol, 46, 1023-5.
Kessel, D., Vicente, M. G., & Reiners, J. J., Jr. (2006). Initiation of apoptosis and autophagy by photodynamic therapy. Autophagy, 2, 289-90.
Kick, G., Messer, G., Goetz, A., Plewig, G., & Kind, P. (1995). Photodynamic therapy induces expression
of interleukin 6 by activation of AP-1 but not NF-kappa B DNA binding. Cancer Res, 55, 2373-2379.
Kick, G., Messer, G., Plewig, G., Kind, P., & Goetz, A. E. (1996). Strong and prolonged induction of
c-jun and c-fos proto-oncogenes by photodynamic therapy. Br J Cancer, 74, 30-36.
Kim, K. W., Mutter, R. W., Cao, C., Albert, J. M., Freeman, M., Hallahan, D. E. et al. (2006). Autophagy
for cancer therapy through inhibition of pro-apoptotic proteins and mammalian target of rapamycin
signaling. J Biol Chem, 281, 36883-90.
Klotz, L. O., Fritsch, C., Briviba, K., Tsacmacidis, N., Schliess, F., & Sies, H. (1998). Activation of JNK
and p38 but not ERK MAP kinases in human skin cells by 5-aminolevulinate-photodynamic therapy.
Cancer Res, 58, 4297-4300.
Kondo, Y., & Kondo, S. (2006). Autophagy and cancer therapy. Autophagy, 2, 85-90.
Koukourakis, M. I., Corti, L., Skarlatos, J., Giatromanolaki, A., Krammer, B., Blandamura, S. et al.
(2001a). Clinical and experimental evidence of Bcl-2 involvement in the response to photodynamic
therapy. Anticancer Res, 21, 663-668.
Koukourakis, M. I., Giatromanolaki, A., Skarlatos, J., Corti, L., Blandamura, S., Piazza, M. et al. (2001b).
Hypoxia inducible factor (HIF-1a and HIF-2a) expression in early esophageal cancer and response to
photodynamic therapy and radiotherapy. Cancer Res, 61, 1830-1832.

Kral, V., Davis, J., Andrievsky, A., Kralova, J., Synytsya, A., Pouckova, P. et al. (2002). Synthesis and
biolocalization of water-soluble sapphyrins. J Med Chem, 45, 1073-8.
Kucharczak, J., Simmons, M. J., Fan, Y., & Gelinas, C. (2003). To be, or not to be: NF-kappaB is the
answer--role of Rel/NF-kappaB in the regulation of apoptosis. Oncogene, 22, 8961-8982.
Lavie, G., Kaplinsky, C., Toren, A., Aizman, I., Meruelo, D., Mazur, Y. et al. (1999). A photodynamic
pathway to apoptosis and necrosis induced by dimethyl tetrahydroxyhelianthrone and hypericin in
leukaemic cells: possible relevance to photodynamic therapy. Br J Cancer, 79, 423-432.
Lehmann, P. (2007). Methyl aminolaevulinate-photodynamic therapy: a review of clinical trials in the
treatment of actinic keratoses and nonmelanoma skin cancer. Br J Dermatol, 156, 793-801.
Leist, M., & Jaattela, M. (2001). Four deaths and a funeral: from caspases to alternative mechanisms.
Nat Rev Mol Cell Biol, 2, 589-598.
Lipson, R. L., & Baldes, E. J. (1960). The photodynamic properties of a particular hematoporphyrin
derivative. Arch Dermatol, 82, 508.
Lu, W. D., & Atkins, W. M. (2004). A novel antioxidant role for ligandin behavior of glutathione Stransferases: attenuation of the photodynamic effects of hypericin. Biochemistry, 43, 12761-12769.
Luna, M. C., Chen, X., Wong, S., Tsui, J., Rucker, N., Lee, A. S. et al. (2002). Enhanced photodynamic
therapy efficacy with inducible suicide gene therapy controlled by the grp promoter. Cancer Res, 62,
1458-1461.
Luna, M. C., Ferrario, A., Wong, S., Fisher, A. M., & Gomer, C. J. (2000). Photodynamic therapy-mediated oxidative stress as a molecular switch for the temporal expression of genes ligated to the human
heat shock promoter. Cancer Res, 60, 1637-44.
Luna, M. C., Wong, S., & Gomer, C. J. (1994). Photodynamic therapy mediated induction of early response genes. Cancer Res, 54, 1374-80.
Ma, J., & Jiang, L. (2001). Photogeneration of singlet oxygen (1O2) and free radicals (Sen*-, O2*-) by
tetra-brominated hypocrellin B derivative. Free Radical Research, 35, 767-777.
MacDonald, I. J., Morgan, J., Bellnier, D. A., Paszkiewicz, G. M., Whitaker, J. E., Litchfield, D. J. et al.
(1999). Subcellular localization patterns and their relationship to photodynamic activity of pyropheophorbide-a derivatives. Photochem Photobiol, 70, 789-97.
Martinou, J. C., & Green, D. R. (2001). Breaking the mitochondrial barrier. Nat Rev Mol Cell Biol, 2,
63-67.
Matroule, J. Y., Bonizzi, G., Morliere, P., Paillous, N., Santus, R., Bours, V. et al. (1999). Pyropheophorbide-a methyl ester-mediated photosensitization activates transcription factor NF-kappaB through the
interleukin-1 receptor-dependent signaling pathway. J Biol Chem, 274, 2988-3000.
Mazure, N. M., Brahimi-Horn, M. C., Berta, M. A., Benizri, E., Bilton, R. L., Dayan, F. et al. (2004).
HIF-1: master and commander of the hypoxic world. A pharmacological approach to its regulation by
siRNAs. Biochem Pharmacol, 68, 971-980.

Midden, W. R., & Dahl, T. A. (1992). Biological inactivation by singlet oxygen: distinguishing O2(1
delta g) and O2(1 sigma g+). Biochim Biophys Acta, 1117, 216-222.
Mitra, S., Goren, E. M., Frelinger, J. G., & Foster, T. H. (2003). Activation of heat shock protein 70
promoter with meso-tetrahydroxyphenyl chlorin photodynamic therapy reported by green fluorescent
protein in vitro and in vivo. Photochem Photobiol, 78, 615-622.
Moan, J., & Berg, K. (1991). The photodegradation of porphyrins in cells can be used to estimate the
lifetime of singlet oxygen. Photochem Photobiol, 53, 549-553.
Momma, T., Hamblin, M. R., Wu, H. C., & Hasan, T. (1998). Photodynamic therapy of orthotopic prostate
cancer with benzoporphyrin derivative: local control and distant metastasis. Cancer Res, 58, 5425-31.
Moretti, L., Attia, A., Kim, K. W., & Lu, B. (2007). Crosstalk between Bak/Bax and mTOR signaling
regulates radiation-induced autophagy. Autophagy, 3, 142-4.
Morgan, J., & Oseroff, A. R. (2001). Mitochondria-based photodynamic anti-cancer therapy. Adv Drug
Deliv Rev, 49, 71-86.
Morimoto, R. I. (1993). Cells in stress: transcriptional activation of heat shock genes. Science, 259,
1409-1410.
Morris, R. L., Azizuddin, K., Lam, M., Berlin, J., Nieminen, A. L., Kenney, M. E. et al. (2003). Fluorescence resonance energy transfer reveals a binding site of a photosensitizer for photodynamic therapy.
Cancer Res, 63, 5194-5197.
Morris, R. L., Varnes, M. E., Kenney, M. E., Li, Y. S., Azizuddin, K., McEnery, M. W. et al. (2002). The
peripheral benzodiazepine receptor in photodynamic therapy with the phthalocyanine photosensitizer
Pc 4. Photochem Photobiol, 75, 652-661.
Murray, A. W. (1998). MAP kinases in meiosis. Cell, 92, 157-159.
Nagata, S., Obana, A., Gohto, Y., & Nakajima, S. (2003). Necrotic and apoptotic cell death of human
malignant melanoma cells following photodynamic therapy using an amphiphilic photosensitizer, ATXS10(Na). Lasers Surg Med, 33, 64-70.
Nakano, H. (2004). Signaling crosstalk between NF-kappaB and JNK. Trends Immunol, 25, 402-405.
Nseyo, U. O., Whalen, R. K., Duncan, M. R., Berman, B., & Lundahl, S. L. (1990). Urinary cytokines
following photodynamic therapy for bladder cancer. A preliminary report. Urology, 36, 167-171.
Nyman, E. S., & Hynninen, P. H. (2004). Research advances in the use of tetrapyrrolic photosensitizers
for photodynamic therapy. J Photochem Photobiol B, 73, 1-28.
Oleinick, N. L., & Evans, H. H. (1998). The photobiology of photodynamic therapy: cellular targets and
mechanisms. Radiat Res, 150, S146-56.
Onuki, J., Teixeira, P. C., Medeiros, M. H., Dornemann, D., Douki, T., Cadet, J. et al. (2002). Is 5-aminolevulinic acid involved in the hepatocellular carcinogenesis of acute intermittent porphyria? Cell
Mol Biol, 48, 17-26.

Orenstein, A., Kostenich, G., Roitman, L., Shechtman, Y., Kopolovic, Y., Ehrenberg, B. et al. (1996). A
comparative study of tissue distribution and photodynamic therapy selectivity of chlorin e6, Photofrin
II and ALA-induced protoporphyrin IX in a colon carcinoma model. Br J Cancer, 73, 937-44.
Ortel, B., Chen, N., Brissette, J., Dotto, G. P., Maytin, E., & Hasan, T. (1998). Differentiation-specific
increase in ALA-induced protoporphyrin IX accumulation in primary mouse keratinocytes. Br J Cancer, 77, 1744-51.
Oseroff, A. R., Ohuoha, D., Ara, G., McAuliffe, D., Foley, J., & Cincotta, L. (1986). Intramitochondrial
dyes allow selective in vitro photolysis of carcinoma cells. Proc Natl Acad Sci USA, 83, 9729-33.
Patel, K. D., Cuvelier, S. L., & Wiehler, S. (2002). Selectins: critical mediators of leukocyte recruitment.
Semin Immunol, 14, 73-81.
Peng, Q., Moan, J., Nesland, J. M., & Rimington, C. (1990). Aluminum phthalocyanines with asymmetrical lower sulfonation and with symmetrical higher sulfonation: a comparison of localizing and
photosensitizing mechanism in human tumor LOX xenografts. Int J Cancer, 46, 719-26.
Peng, Q., Warloe, T., Berg, K., Moan, J., Kongshaug, M., Giercksky, K. E. et al. (1997). 5-Aminolevulinic
acid-based photodynamic therapy. Clinical research and future challenges. Cancer, 79, 2282-308.
Penning, L. C., Keirse, M. J., VanSteveninck, J., & Dubbelman, T. M. (1993). Ca(2+)-mediated prostaglandin E2 induction reduces haematoporphyrin-derivative-induced cytotoxicity of T24 human bladder
transitional carcinoma cells in vitro. Biochem J, 292 ( Pt 1), 237-240.
Penning, L. C., Rasch, M. H., Ben-Hur, E., Dubbelman, T. M., Havelaar, A. C., Van der Zee, J. et al.
(1992). A role for the transient increase of cytoplasmic free calcium in cell rescue after photodynamic
treatment. Biochim Biophys Acta, 1107, 255-260.
Piret, B., Legrand-Poels, S., Sappey, C., & Piette, J. (1995). NF-kappa B transcription factor and human immunodeficiency virus type 1 (HIV-1) activation by methylene blue photosensitization. Eur J
Biochem, 228, 447-455.
Plaetzer, K., Kiesslich, T., Krammer, B., & Hammerl, P. (2002). Characterization of the cell death modes
and the associated changes in cellular energy supply in response to AlPcS4-PDT. Photochem Photobiol
Sci, 1, 172-177.
Pluskalova, M., Peslova, G., Grebenova, D., Halada, P., & Hrkal, Z. (2006). Photodynamic treatment
(ALA-PDT) suppresses the expression of the oncogenic Bcr-Abl kinase and affects the cytoskeleton
organization in K562 cells. J Photochem Photobiol B, 83, 205-12.
Rashid, F., & Horobin, R. W. (1990). Interaction of molecular probes with living cells and tissues. Part
2. A structure-activity analysis of mitochondrial staining by cationic probes, and a discussion of the
synergistic nature of image-based and biochemical approaches. Histochemistry, 94, 303-8.
Ravanat, J. L., & Cadet, J. (1995). Reaction of singlet oxygen with 2-deoxyguanosine and DNA. Isolation and characterization of the main oxidation products. Chem Res Toxicol, 8, 379-88.
Renno, R. Z., Delori, F. C., Holzer, R. A., Gragoudas, E. S., & Miller, J. W. (2000). Photodynamic therapy
using Lu-Tex induces apoptosis in vitro, and its effect is potentiated by angiostatin in retinal capillary
endothelial cells. Invest Ophthalmol Vis Sci, 41, 3963-3971.
Rousset, N., Keminon, E., Eleouet, S., Le Neel, T., Auget, J. L., Vonarx, V. et al. (2000). Use of alkaline
Comet assay to assess DNA repair after m-THPC-PDT. J Photochem Photobiol B, 56, 118-131.
Rousset, N., Vonarx, V., Eleouet, S., Carre, J., Kerninon, E., Lajat, Y. et al. (1999). Effects of photodynamic therapy on adhesion molecules and metastasis. J Photochem Photobiol B, 52, 65-73.
Ruck, A., Heckelsmiller, K., Kaufmann, R., Grossman, N., Haseroth, E., & Akgun, N. (2000). Lightinduced apoptosis involves a defined sequence of cytoplasmic and nuclear calcium release in AlPcS4photosensitized rat bladder RR 1022 epithelial cells. Photochem Photobiol, 72, 210-216.
Ruhdorfer, S., Sanovic, R., Sander, V., Krammer, B., & Verwanger, T. (2007). Gene expression profiling
of the human carcinoma cell line A-431 after 5-aminolevulinic acid-based photodynamic treatment.
Int J Oncol, 30, 1253-62.
Ruiz-Galindo, E., Arenas-Huertero, F., & Ramon-Gallegos, E. (2007). Expression of genes involved
in heme biosynthesis in the human retinoblastoma cell lines WERI-Rb-1 and Y79: implications for
photodynamic therapy. J Exp Clin Cancer Res, 26, 195-200.
Runnels, J. M., Chen, N., Ortel, B., Kato, D., & Hasan, T. (1999). BPD-MA-mediated photosensitization in vitro and in vivo: cellular adhesion and beta1 integrin expression in ovarian cancer cells. Br J
Cancer, 80, 946-53.
Ryter, S. W., & Gomer, C. J. (1993). Nuclear factor kappa B binding activity in mouse L1210 cells following photofrin II-mediated photosensitization. Photochem Photobiol, 58, 753-6.
Salet, C., Moreno, G., Ricchelli, F., & Bernardi, P. (1997). Singlet oxygen produced by photodynamic
action causes inactivation of the mitochondrial permeability transition pore. J Biol Chem, 272, 21938-43.
Santini, M. P., Talora, C., Seki, T., Bolgan, L., & Dotto, G. P. (2001). Cross talk among calcineurin,
Sp1/Sp3, and NFAT in control of p21(WAF1/CIP1) expression in keratinocyte differentiation. Proc Natl
Acad Sci USA, 98, 9575-9580.
Scheid, M. P., & Woodgett, J. R. (2003). Unravelling the activation mechanisms of protein kinase B/Akt.
FEBS Lett, 546, 108-112.
Schieke, S. M., von Montfort, C., Buchczyk, D. P., Timmer, A., Grether-Beck, S., Krutmann, J. et al.
(2004). Singlet oxygen-induced attenuation of growth factor signaling: possible role of ceramides. Free
Radic Res, 38, 729-737.
Semenza, G. L. (2004). Hydroxylation of HIF-1: oxygen sensing at the molecular level. Physiology, 19,
176-182.
Separovic, D., Mann, K. J., & Oleinick, N. L. (1998). Association of ceramide accumulation with photodynamic treatment-induced cell death. Photochem Photobiol, 68, 101-9.
Separovic, D., Pink, J. J., Oleinick, N. A., Kester, M., Boothman, D. A., McLoughlin, M. et al. (1999).
Niemann-Pick human lymphoblasts are resistant to phthalocyanine 4-photodynamic therapy-induced
apoptosis. Biochem Biophys Res Commun, 258, 506-12.

Sessler, J. L., & Miller, R. A. (2000). Texaphyrins: new drugs with diverse clinical applications in radiation and photodynamic therapy. Biochem Pharmacol, 59, 733-9.
Shao, C., Furusawa, Y., Kobayashi, Y., Funayama, T., & Wada, S. (2003). Bystander effect induced by
counted high-LET particles in confluent human fibroblasts: a mechanistic study. FASEB J, 17, 1422-1427.
Shaulian, E., & Karin, M. (2002). AP-1 as a regulator of cell life and death. Nat Cell Biol, 4, E131-6.
Shen, X. Y., Zacal, N., Singh, G., & Rainbow, A. J. (2005). Alterations in mitochondrial and apoptosisregulating gene expression in photodynamic therapy-resistant variants of HT29 colon carcinoma cells.
Photochem Photobiol, 81, 306-13.
Shi, Y. (2004a). Caspase activation, inhibition, and reactivation: a mechanistic view. Protein science:
A Publication of the Protein Society, 13, 1979-1987.
Shi, Y. (2004b). Caspase activation: revisiting the induced proximity model. Cell, 117, 855-858.
Specht, K. G., & Rodgers, M. A. (1991). Plasma membrane depolarization and calcium influx during
cell injury by photodynamic action. Biochim Biophys Acta, 1070, 60-68.
Spikes, J. D. (1990). Chlorins as photosensitizers in biology and medicine. J Photochem Photobiol B,
6, 259-74.
Spikes, J. D., & Bommer, J. C. (1993). Photobleaching of mono-L-aspartyl chlorin e6 (NPe6): a candidate
sensitizer for the photodynamic therapy of tumors. Photochem Photobiol, 58, 346-50.
Stevens, C., & La Thangue, N. B. (2003). E2F and cell cycle control: a double-edged sword. Arch Biochem Biophys, 412, 157-169.
Stockert, J. C., Canete, M., Juarranz, A., Villanueva, A., Horobin, R. W., Borrell, J. I. et al. (2007).
Porphycenes: facts and prospects in photodynamic therapy of cancer. Curr Med Chem, 14, 997-1026.
Stockert, J. C., Juarranz, A., Villanueva, A., & Canete, M. (1996). Photodynamic damage to HeLa cell
microtubules induced by thiazine dyes. Cancer Chemother Pharmacol, 39, 167-9.
Stott, B., & Korbelik, M. (2007). Activation of complement C3, C5, and C9 genes in tumors treated by
photodynamic therapy. Cancer Immunol Immunother, 56, 649-58.
Sun, J., Cecic, I., Parkins, C. S., & Korbelik, M. (2002a). Neutrophils as inflammatory and immune effectors in photodynamic therapy-treated mouse SCCVII tumours. Photochem Photobiol Sci, 1, 690-695.
Sun, X., & Leung, W. N. (2002b). Photodynamic therapy with pyropheophorbide-a methyl ester in human lung carcinoma cancer cell: efficacy, localization and apoptosis. Photochemistry and photobiology,
75, 644-651.
Svaasand, L. O. (1984). Optical dosimetry for direct and interstitial photoradiation therapy of malignant
tumors. Prog Clin Biol Res, 170, 91-114.
Svanberg, K., Liu, D. L., Wang, I., Andersson-Engels, S., Stenram, U., & Svanberg, S. (1996). Photodynamic therapy using intravenous delta-aminolaevulinic acid-induced protoporphyrin IX sensitisation
in experimental hepatic tumours in rats. Br J Cancer, 74, 1526-33.

Szeimies, R. M., Sassy, T., & Landthaler, M. (1994). Penetration potency of topical applied delta-aminolevulinic acid for photodynamic therapy of basal cell carcinoma. Photochem Photobiol, 59, 73-6.
Takeuchi, Y., Kurohane, K., Ichikawa, K., Yonezawa, S., Nango, M., & Oku, N. (2003). Induction of
intensive tumor suppression by antiangiogenic photodynamic therapy using polycation-modified liposomal photosensitizer. Cancer, 97, 2027-2034.
Tao, J., Sanghera, J. S., Pelech, S. L., Wong, G., & Levy, J. G. (1996). Stimulation of stress-activated
protein kinase and p38 HOG1 kinase in murine keratinocytes following photodynamic therapy with
benzoporphyrin derivative. J Biol Chem, 271, 27107-15.
Teiten, M. H., Bezdetnaya, L., Morliere, P., Santus, R., & Guillemin, F. (2003). Endoplasmic reticulum
and Golgi apparatus are the preferential sites of Foscan localisation in cultured tumour cells. Br J Cancer, 88, 146-152.
Thibaut, S., Bourre, L., Hernot, D., Rousset, N., Lajat, Y., & Patrice, T. (2002). Effects of BAPTA-AM,
Forskolin, DSF and Z.VAD.fmk on PDT-induced apoptosis and m-THPC phototoxicity on B16 cells.
Apoptosis, 7, 99-106.
Tong, Z., Singh, G., & Rainbow, A. J. (2002). Sustained activation of the extracellular signal-regulated
kinase pathway protects cells from photofrin-mediated photodynamic therapy. Cancer Res, 62, 5528-5535.
Torriglia, A., Perani, P., Brossas, J. Y., Altairac, S., Zeggai, S., Martin, E. et al. (2000). A caspase-independent cell clearance program. The LEI/L-DNase II pathway. Ann N Y Acad Sci, 926, 192-203.
Trivedi, N. S., Wang, H. W., Nieminen, A. L., Oleinick, N. L., & Izatt, J. A. (2000). Quantitative analysis
of Pc 4 localization in mouse lymphoma (LY-R) cells via double-label confocal fluorescence microscopy.
Photochem Photobiol, 71, 634-639.
Uehara, M., Inokuchi, T., Sano, K., & ZuoLin, W. (2001). Expression of vascular endothelial growth
factor in mouse tumours subjected to photodynamic therapy. Eur J Cancer, 37, 2111-2115.
Usuda, J., Azizuddin, K., Chiu, S. M., & Oleinick, N. L. (2003). Association between the photodynamic
loss of Bcl-2 and the sensitivity to apoptosis caused by phthalocyanine photodynamic therapy. Photochem Photobiol, 78, 1-8.
Usuda, J., Okunaka, T., Furukawa, K., Tsuchida, T., Kuroiwa, Y., Ohe, Y. et al. (2001). Increased cytotoxic effects of photodynamic therapy in IL-6 gene transfected cells via enhanced apoptosis. Int J
Cancer, 93, 475-480.
Uzdensky, A. B., Juzeniene, A., Kolpakova, E., Hjortland, G. O., Juzenas, P., & Moan, J. (2004). Photosensitization with protoporphyrin IX inhibits attachment of cancer cells to a substratum. Biochem
Biophys Res Commun, 322, 452-457.
van den Boogert, J., van Hillegersberg, R., de Rooij, F. W., de Bruin, R. W., Edixhoven-Bosdijk, A.,
Houtsmuller, A. B. et al. (1998). 5-Aminolaevulinic acid-induced protoporphyrin IX accumulation in tissues: pharmacokinetics after oral or intravenous administration. J Photochem Photobiol B, 44, 29-38.

Van Hillegersberg, R., Van den Berg, J. W., Kort, W. J., Terpstra, O. T., & Wilson, J. H. (1992). Selective
accumulation of endogenously produced porphyrins in a liver metastasis model in rats. Gastroenterology, 103, 647-51.
van Leengoed, H. L., Schuitmaker, J. J., van der Veen, N., Dubbelman, T. M., & Star, W. M. (1993).
Fluorescence and photodynamic effects of bacteriochlorin a observed in vivo in sandwich observation
chambers. Br J Cancer, 67, 898-903.
Verwanger, T., Sanovic, R., Aberger, F., Frischauf, A. M., & Krammer, B. (2002). Gene expression
pattern following photodynamic treatment of the carcinoma cell line A-431 analysed by cDNA arrays.
Int J Oncol, 21, 1353-9.
Verwanger, T., Schnitzhofer, G., & Krammer, B. (1998). Expression kinetics of the (proto) oncogenes
c-myc and bcl-2 following photodynamic treatment of normal and transformed human fibroblasts with 5aminolaevulinic acid-stimulated endogenous protoporphyrin IX. J Photochem Photobiol B, 45, 131-5.
Volanti, C., Gloire, G., Vanderplasschen, A., Jacobs, N., Habraken, Y., & Piette, J. (2004). Downregulation
of ICAM-1 and VCAM-1 expression in endothelial cells treated by photodynamic therapy. Oncogene,
23, 8649-8658.
Volanti, C., Matroule, J. Y., & Piette, J. (2002). Involvement of oxidative stress in NF-kappaB activation
in endothelial cells treated by photodynamic therapy. Photochem Photobiol, 75, 36-45.
Vonarx, V., Foultier, M. T., Xavier de Brito, L., Anasagasti, L., Morlet, L., & Patrice, T. (1995). Photodynamic therapy decreases cancer colonic cell adhesiveness and metastatic potential. Res Exp Med
(Berl), 195, 101-116.
Wang, H. P., Hanlon, J. G., Rainbow, A. J., Espiritu, M., & Singh, G. (2002). Up-regulation of Hsp27
plays a role in the resistance of human colon carcinoma HT29 cells to photooxidative stress. Photochem
Photobiol, 76, 98-104.
Whitacre, C. M., Feyes, D. K., Satoh, T., Grossmann, J., Mulvihill, J. W., Mukhtar, H. et al. (2000).
Photodynamic therapy with the phthalocyanine photosensitizer Pc 4 of SW480 human colon cancer
xenografts in athymic mice. Clin Cancer Res, 6, 2021-7.
Wilson, B. C., Jeeves, W. P., & Lowe, D. M. (1985). In vivo and post mortem measurements of the attenuation spectra of light in mammalian tissues. Photochem Photobiol, 42, 153-62.
Wong, T. W., Tracy, E., Oseroff, A. R., & Baumann, H. (2003). Photodynamic therapy mediates immediate loss of cellular responsiveness to cytokines and growth factors. Cancer Res, 63, 3812-3818.
Woodburn, K. W., Fan, Q., Miles, D. R., Kessel, D., Luo, Y., & Young, S. W. (1997). Localization and
efficacy analysis of the phototherapeutic lutetium texaphyrin (PCI-0123) in the murine EMT6 sarcoma
model. Photochem Photobiol, 65, 410-5.
Woodburn, K. W., Vardaxis, N. J., Hill, J. S., Kaye, A. H., & Phillips, D. R. (1991). Subcellular localization of porphyrins using confocal laser scanning microscopy. Photochem Photobiol, 54, 725-32.
Woods, J. A., Traynor, N. J., Brancaleon, L., & Moseley, H. (2004). The effect of photofrin on DNA
strand breaks and base oxidation in HaCaT keratinocytes: a comet assay study. Photochem Photobiol,
79, 105-113.
Xue, L., He, J., & Oleinick, N. L. (1999). Promotion of photodynamic therapy-induced apoptosis by
stress kinases. Cell Death Differ, 6, 855-864.
Xue, L. Y., Chiu, S. M., Azizuddin, K., Joseph, S., & Oleinick, N. L. (2007a). Protection by Bcl-2 against
apoptotic but not autophagic cell death after photodynamic therapy. Autophagy, 4.
Xue, L. Y., Chiu, S. M., Azizuddin, K., Joseph, S., & Oleinick, N. L. (2007b). The death of human cancer
cells following photodynamic therapy: apoptosis competence is necessary for Bcl-2 protection but not
for induction of autophagy. Photochem Photobiol, 83, 1016-23.
Xue, L. Y., Chiu, S. M., & Oleinick, N. L. (2001). Photodynamic therapy-induced death of MCF-7 human breast cancer cells: a role for caspase-3 in the late steps of apoptosis but not for the critical lethal
event. Exp Cell Res, 263, 145-155.
Yang, J., Yu, Y., Sun, S., & Duerksen-Hughes, P. J. (2004). Ceramide and other sphingolipids in cellular
responses. Cell Biochem Biophys, 40, 323-350.
Yeo, E. J., Chun, Y. S., & Park, J. W. (2004). New anticancer strategies targeting HIF-1. Biochem Pharmacol, 68, 1061-1069.
Yom, S. S., Busch, T. M., Friedberg, J. S., Wileyto, E. P., Smith, D., Glatstein, E. et al. (2003). Elevated
serum cytokine levels in mesothelioma patients who have undergone pleurectomy or extrapleural pneumonectomy and adjuvant intraoperative photodynamic therapy. Photochem Photobiol, 78, 75-81.
Yu, L., Lenardo, M. J., & Baehrecke, E. H. (2004). Autophagy and Caspases: A New Cell Death Program. Cell Cycle, 3, 1124-1126.
Zacal, N., & Rainbow, A. J. (2007). Photodynamic therapy resistant human colon carcinoma HT29 cells
show cross-resistance to UVA but not UVC light. Photochem Photobiol, 83, 730-7.
Zawacka-Pankau, J., Issaeva, N., Hossain, S., Pramanik, A., Selivanova, G., & Podhajska, A. J. (2007).
Protoporphyrin IX interacts with wild-type p53 protein in vitro and induces cell death of human colon
cancer cells in a p53-dependent and -independent manner. J Biol Chem, 282, 2466-72.
Zhuang, S., & Kochevar, I. E. (2003). Singlet oxygen-induced activation of Akt/protein kinase B is
independent of growth factor receptors. Photochem Photobiol, 78, 361-371.

key terms
5-ALA: 5-aminolevulinic acid, a small amino acid that is a natural precursor of PPIX.
Apoptosis: Programmed cell death characterized by nuclear condensation and DNA fragmentation.
Autophagy: Programmed cell death characterized by autodestruction using lysosomal machinery.
BPD: Benzoporphyrin derivative, PS also known as Visudyne or Verteporfin, used clinically by ophthalmologists.

Bystander Effect: Phenomenon in which uninjured cells surrounding a dying cell also die.
HPD or PF: Hematoporphyrin derivative or Photofrin, the first clinically used PS, derived from ox blood.
m-THPC: m-tetrahydroxyphenylchlorin, PS also known as Foscan, used clinically for cancer.
Necrosis: Non-programmed cell death characterized by membrane bursting.
1O2: Singlet oxygen, excited-state reactive form of oxygen.
PC: Phthalocyanine, PS containing a tetrapyrrole structure but made synthetically.
PDT: Photodynamic therapy, treatment combining dyes and light to destroy cells.
Photobleaching: Light dependent alteration of PS molecule.
PPIX: Protoporphyrin IX, naturally occurring PS whose production in cells is increased when 5-ALA is administered.
PS: Photosensitizer, dye (frequently a tetrapyrrole) used in PDT.
3PS: Triplet state, a relatively long-lived excited state of the PS.
RB: Rose Bengal, PS with xanthene structure and strong pink color.
ROS: Reactive oxygen species, a general term for oxidizing species such as singlet oxygen, superoxide anion, hydrogen peroxide and hydroxyl radical.
SOD: Superoxide dismutase, an antioxidant enzyme that converts superoxide to hydrogen
peroxide and oxygen.


Chapter XXXVI

Modeling of Porphyrin
Metabolism with PyBioS
Andriani Daskalaki
Max Planck Institute for Molecular Genetics, Germany

abstract
Photodynamic Therapy (PDT) involves administration of a photosensitizer (PS) either systemically or
locally, followed by illumination of the lesion with visible light. PDT of cancer is now evolving from
experimental treatment to a therapeutic alternative. Clinical results have shown that PDT is at least as
efficacious as standard treatments of malignancies of the skin and Barrett's esophagus. Hemes and heme proteins are vital components of essentially every cell in virtually all eukaryotic organisms. Protoporphyrin IX (PpIX) is produced in cells via the heme synthesis pathway from the substrate aminolevulinic acid (ALA). Exogenous administration of ALA induces accumulation of PpIX, which can be used as
a photosensitiser for tumor detection or photodynamic therapy. Although the basis of the selectivity
of ALA-based PDT or photodiagnosis is not fully understood, it has sometimes been correlated with
the metabolic rate of the cells, or with differential enzyme expression along the heme biosynthetic pathway in cancer cells. An in silico analysis by modeling may be performed in order to determine the functional roles of genes coding for enzymes of the heme biosynthetic pathway, such as ferrochelatase. Modeling and simulation systems are a valuable tool for the understanding of complex biological systems.
With PyBioS, an object-oriented modelling software for biological processes, we can analyse porphyrin
metabolism pathways.

INTRODUCTION
The use of PDT for curative treatments of superficial tumors of the skin and for palliative treatments of disseminated tumors of skin and oral mucosa is well known (Daskalaki 2002). PDT is also efficacious as a treatment of malignancies of Barrett's oesophagus (Foroulis and Thorpe 2006).
PDT is based on a photochemical process, where photosensitizers (PS) act cytotoxic by generation of
1
O2 after laser irradiation.
The use of fluorescence measurements as quantitative indicators for PpIX accumulation after exogenous ALA administration is suitable for differentiating neoplastic, necrotic and inflammatory tissues
from normal tissues. The modulation of ALA-induced PpIX accumulation and expression will provide
more diagnostic information and more accuracy for the diagnosis of unhealthy tissue, especially in
border-line cases. The modulation of fluorescence characteristics of ALA-induced PpIX with NAD has
been used for differentiation between fibroblasts and fibrosarcoma cells (Ismail et al. 1997).
The flow of substrates into the porphyrin pathway is controlled by the synthesis of δ-aminolevulinic
acid (ALA), the first committed precursor in the porphyrin pathway. Although light is required to trigger
ALA synthesis and differentiation of chloroplasts (Reinbothe and Reinbothe, 1996), a feedback inhibition
of ALA synthesis by an end product of the porphyrin pathway is thought to be involved in the regulation
of influx into the pathway (Wettstein et al., 1995; Reinbothe and Reinbothe, 1996). Both the nature of
the product and the mechanism involved in effecting feedback inhibition remain unknown, probably
because there have been no porphyrin pathway mutants identified so far that affect both chlorophyll
and heme biosyntheses. Thus, modelling of the porphyrin pathway may fill this gap and allow researchers
to address these questions.
Downey (2002) tried to show how the porphyrin pathway may be an integral part of all disease
processes through a model. Analytical techniques capable of measuring porphyrins in all cells are
needed. Data gathered from plant and animal studies need to be adapted to humans where possible. An
inexpensive, accurate and rapid analysis needs to be developed so porphyrins can be measured more
routinely.
The committed step for porphyrin synthesis is the formation of 5-aminolevulinate (ALA) by condensation of glycine (from the general amino acid pool) and succinyl-CoA (from the TCA cycle), in the
mitochondrial matrix. This reaction is catalyzed by two different ALA synthases, one expressed ubiquitously (ALAS1) and the other one only expressed in erythroid precursors (ALAS2) (Ajioka, 2006).
Heme inhibits the activity of ALA synthetase, the first and rate-limiting enzyme of the biosynthetic
pathway, thereby preventing normal cells from being flooded by excess production of their own porphyrins.
This negative feedback control can be bypassed in certain malignant cells exposed to an excess amount
of ALA, which is metabolised leading to overproduction of PpIX.
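As a purely illustrative sketch of this regulatory logic, the following Python fragment encodes a simple heme-dependent inhibition term for ALA synthase; the rate-law form and the constants v_max and k_i are assumptions chosen for demonstration, not values from the model described in this chapter.

# Hypothetical inhibition of ALA synthase activity by free heme.
# v_max and k_i are illustrative placeholders, not measured constants.
def alas_rate(heme, v_max=1.0, k_i=0.1):
    """ALA production rate, which falls as the free heme concentration rises."""
    return v_max / (1.0 + heme / k_i)

if __name__ == "__main__":
    for heme in (0.0, 0.05, 0.1, 0.5):
        print(f"heme = {heme:4.2f}  ->  ALA synthesis rate = {alas_rate(heme):.3f}")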
Excess accumulation of PpIX occurs because of the enzyme configuration in malignant cells (Kirby
2001). The enzyme ferrochelatase (FECH) catalyzes insertion of an iron atom into PpIX forming heme
which is not photoreactive. However, cancer cells have a relatively low activity of ferrochelatase which
leads to an excess accumulation of PpIX (Schoenfeld 1988). Another factor leading to augmented PpIX
synthesis is an increased activity of the rate-limiting enzyme porphobilinogen deaminase in various
malignant tissues (Wilson 1991). Kemmner W et al. (2008) recently showed that in malignant tissue a
transcriptional down-regulation of FECH occurs, causing endogenous PpIX accumulation. Furthermore,
accumulation of intracellular PpIX because of FECH small interfering RNA (siRNA) silencing provides
a small-molecule-based approach to molecular imaging and molecular therapy.


Kemmner W et al. (2008) demonstrated accumulation of the heme precursor PpIX in gastrointestinal tumor tissues. To elucidate the mechanisms of PpIX accumulation, the expression of the relevant enzymes of the heme synthetic pathway has been studied. Kemmner W et al. (2008) described a significant down-regulation of FECH mRNA expression in gastric, colonic, and rectal carcinomas. Accordingly, in an
in vitro model of several carcinoma cell lines, ferrochelatase down-regulation and loss of enzymatic
activity corresponded with an enhanced PpIX-dependent fluorescence. Silencing of FECH using siRNA
technology led to a maximum 50-fold increased PpIX accumulation.
Bhasin G et al. (1999) investigated the hypothesis that inhibition of ferrochelatase will cause in situ
build up of high PpIX concentrations which may act as a putative agent for photodestruction of cancer
cells. The parenteral administration of lead acetate, a known inhibitor of ferrochelatase, to mice bearing cutaneous tumors (papillomas and carcinomas) caused a six-fold enhancement in the concentration
of PpIX in tumors within a period of one month. A significant reduction in tumor size was observed
starting as early as day one following the treatment.
5-Aminolevulinate synthase (ALAS) is a mitochondrial enzyme that catalyzes the first step of the
heme biosynthetic pathway. Mitochondrial import as well as synthesis of the nonspecific ALAS isoform
(ALAS1) is regulated by heme through a feedback mechanism (Munakata 2004).

Mutations
A deficiency of FECH activity underlies the excess accumulation of protoporphyrin that occurs in
erythropoietic protoporphyria (EPP). In some patients, protoporphyrin accumulation causes liver damage that necessitates liver transplantation. Mutations of the codons for two of the [2Fe-2S] cluster ligands in patients with EPP support the importance of the iron-sulfur center for the proper functioning of mammalian

Figure 1. Schematic illustration of PpIX biosynthesis. One part of the synthesis is localized in the mitochondrion, the other part in the cytoplasm. Not all biosynthetic steps are shown in this scheme.


FECH and, at least in humans, its absence has a direct clinical impact (Schneider-Yin et al. 2000). Chen et al. (2002) studied patients who developed liver disease with mutations in the FECH gene. Recent attempts to increase the efficacy of ALA-mediated PDT include the use of iron chelators, which decrease the amount of PpIX converted to heme by FECH by removing the free iron that is necessary for the enzyme to work (Curnow, 1998; Ferreira et al., 1999).
X-linked sideroblastic anemias (XLSAs), a group of severe disorders in humans characterized by
inadequate formation of heme in erythroblast mitochondria, are caused by mutations in the gene for
erythroid eALAS, one of two human genes for ALAS (Astner et al., 2005). Cloning and expression of
the defective gene for delta-aminolevulinate dehydratase (ALAD) from patients with ALAD deficiency
porphyria (ADP) were performed by Akagi et al. (2000).
Mitchell et al. (2001) studied Escherichia coli and human porphobilinogen synthase (PBGS) mutants.

MODELING OF PORPHYRIN METABOLISM


Quantitative modeling studies of pathways have been successfully applied to understand complex cellular processes (Schoeberl 2002, Klipp 2005).
Particular attention has been paid to the way in which PpIX is distributed and accumulated in cells under the effect of ALA (Gaullier et al. 1995). For the induction of a clinical effect it is important to know the kinetics of PpIX accumulation in cells as influenced by the applied ALA dose. The cellular content of this photosensitizer precursor should be optimal for the induction of the photodestructive effect following light exposure of the targeted neoplastic lesions. The kinetics of PpIX formation under the effect of exogenous ALA is thought to result from circumventing the bottleneck linked to the synthesis of endogenous ALA, the level of which remains under the control of free heme (Kennedy and Pottier, 1992). Considering that these problems may not only be of theoretical significance, but may also have practical value for establishing the conditions of photodynamic therapy, we have to define the kinetics of PpIX accumulation in different cells under the effect of various concentrations of ALA.

Pathway Databases
Pathway databases can act as a rich source for such reaction graphs, because a reaction graph is simply a pathway. The Reactome pathway database (Vastrik et al., 2007) has been used as a key starting point for kinetic modelling since it contains detailed reaction graphs, which describe the series of biochemical
events involved in the models and their relationships. The graphs establish a framework for the models
and suggest the kinetic coefficients that should be obtained experimentally.
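A minimal sketch of how such a reaction graph could be held in memory before kinetic laws are attached is shown below; the two reactions and their field names are a hypothetical, heavily truncated illustration, not the Reactome or ConsensusPathDB schema.

# Hypothetical in-memory representation of a reaction graph: species are nodes,
# and each reaction links substrate species to product species and names a catalyst.
reactions = [
    {"id": "R_ALAD", "substrates": ["5-aminolevulinate"], "products": ["porphobilinogen"],
     "catalyst": "ALAD homooctamer"},
    {"id": "R_FECH", "substrates": ["protoporphyrin IX", "Fe2+"], "products": ["heme"],
     "catalyst": "FECH homodimer"},
]

# Collect the species set implied by the reaction list.
species = sorted({s for r in reactions for s in r["substrates"] + r["products"]})
print(species)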
Kamburov et al. (2006) developed ConsensusPathDB (Figure 2), a database that helps users to
summarize and verify pathway information and to enrich a priori information in the process of model
annotation. The database model allows integration of information on metabolic, signal transduction and
gene regulatory networks. Cellular reaction networks are stored in a PostgreSQL database and can be
accessed under http://pybios.molgen.mpg.de/CPDB.
By forward modelling we integrate all interactive properties of molecular components to understand
systems behavior (Westerhoff et al. 2008). The forward-modeling approach supports the formulation


Figure 2. CPDB database. ConsensusPathDB assists the development, expansion and refinement of
computational models of biological systems and the context-specific visualization of models provided
in SBML.

of hypotheses, e.g. for in silico knock-out experiments. Thus, to construct a model of the porphyrin
metabolism pathway, one should consider one enzymatic or transport step at a time, should comb the
literature for information about this enzyme, its cofactors and modulators, and should translate this
information into a mathematical rate law which could be a Michaelis-Menten, among a wide variety of
possibilities. The collection of all rate laws governs the dynamics of this model. Comparisons of model
responses with biological observations support the validity of the proposed model or suggest adjustments in assumptions or parameter values. This forward process may lead to model representations of
the pathway exhibiting the same features as reality, at least qualitatively, if not quantitatively.
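As a small example of this rate-law translation step, the following sketch defines a generic Michaelis-Menten rate and applies it to a single hypothetical FECH-catalyzed step; the parameter values are placeholders that would have to be replaced by literature or experimental estimates.

def michaelis_menten(substrate, v_max, k_m):
    """Michaelis-Menten rate law for a single enzymatic step."""
    return v_max * substrate / (k_m + substrate)

# Placeholder constants for a FECH-catalyzed PpIX-to-heme step (illustrative only).
v_ppix_to_heme = michaelis_menten(substrate=2.0, v_max=1.5, k_m=0.8)
print(v_ppix_to_heme)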
The porphyrin metabolism model was assembled, simulated and analysed with PyBioS. PyBioS is an
object-oriented tool for modeling and simulation of cellular processes. This tool has been established
for the modelling of biological processes using metabolic pathways from databases like KEGG and the
Reactome database.
Modeling and simulation techniques are valuable tools for the understanding of complex biological
systems. The platform is implemented as a Python-product for the Zope web application server environment.

PyBioS acts as a model repository and supports the generation of large models based on publicly
available information like the metabolic data of the KEGG database. An ODE-system of this model may
be generated automatically based on pre- or user-defined kinetic laws and used for subsequent simulation
of time course series and further analyses of the dynamic behavior of the underlying system.

Modeling with PyBioS


A model of a disease-relevant pathway, such as porphyrin metabolism, has been employed to study the
relationship between basic enzymes and products in the biosynthetic pathway.
Visualization of the porphyrin metabolism interaction network (Figure 3) was enabled by automatically generated graphs that include information about the objects, reactions and mass- and information-flow.
The model includes a total of 16 reactions and 42 objects. It is composed of an ordinary differential equation system with 14 state variables and 16 parameters. The law of mass action has been applied to describe the reaction rates of porphyrin metabolism. Time-dependent changes of the concentration of participating
proteins and protein complexes are determined by a system of differential equations.
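The full 16-reaction model is not reproduced here, but the sketch below illustrates the same mass-action principle on a drastically reduced, hypothetical three-species chain (ALA, PpIX, heme) integrated with SciPy; all rate constants and initial concentrations are arbitrary illustrative values, not the parameters of the PyBioS model.

import numpy as np
from scipy.integrate import odeint

# Reduced, hypothetical mass-action chain: ALA -> PpIX (lumped intermediate steps),
# PpIX -> heme (catalyzed by FECH), heme -> degradation. All rate constants are
# illustrative values only.
K = {"ala_supply": 0.3, "ala_to_ppix": 0.5, "fech": 1.0, "heme_deg": 0.2}

def rhs(y, t, k, fech_level=1.0):
    ala, ppix, heme = y
    v1 = k["ala_to_ppix"] * ala          # lumped conversion of ALA to PpIX
    v2 = k["fech"] * fech_level * ppix   # FECH-catalyzed formation of heme from PpIX
    v3 = k["heme_deg"] * heme            # heme degradation
    return [k["ala_supply"] - v1, v1 - v2, v2 - v3]

t = np.linspace(0.0, 50.0, 501)
y0 = [1.0, 0.0, 0.0]                     # initial ALA, PpIX and heme concentrations
wild_type = odeint(rhs, y0, t, args=(K, 1.0))
print("steady-state heme (wild type):", wild_type[-1, 2])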
Mutations related to the FECH and ALAS genes have been analyzed by simulating knockouts of these genes in the mathematical model in order to study the effects of the mutations on the concentration of heme.
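Continuing the reduced sketch above, an in silico FECH knockout can be mimicked simply by setting the FECH activity factor to zero and comparing the resulting heme and PpIX time courses; this only illustrates the idea of such a perturbation experiment.

# In silico FECH knockout: rerun the reduced model with the FECH activity factor set to zero.
fech_knockout = odeint(rhs, y0, t, args=(K, 0.0))
print("final PpIX (wild type vs. knockout):", wild_type[-1, 1], fech_knockout[-1, 1])
print("final heme (wild type vs. knockout):", wild_type[-1, 2], fech_knockout[-1, 2])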

Figure 3. A part of the porphyrin metabolism pathway illustrated as a network diagram in PyBioS. Catalysis of heme formation by the enzyme ferrochelatase is shown in the network graphic.


Figure 4. Ferrochelatase (FECH) catalyzes the terminal step of the heme biosynthetic pathway. Graphical illustration of the time course of the PpIX, heme and ferrochelatase (FECH) concentrations. FECH catalyzes the production of heme from PpIX.

Table 1. Parameters in the model

Parameter symbol | Biological meaning
massi_k (FECH homodimer) | Heme production rate constant
deg_degradation_k (heme [mitochondrial matrix]) | Heme degradation rate constant
massi_k (ALAS homodimer) | 5-aminolevulinate production rate constant
massi_k (PPO homodimer (FAD cofactor)) | Protoporphyrin IX production rate constant
massi_k (CPO homodimer) | Protoporphyrinogen IX production rate constant
ComplexFormationReversible_kf (ALAD homooctamer) | Formation of complex rate constant
massi_k (ALAD homooctamer (Zinc cofactor)) | Porphobilinogen production rate constant
massi_k (Porphobilinogen deaminase) | Hydroxymethylbilane (HMB) production rate constant
massi_k (UROD homodimer) | Coproporphyrinogen III production rate constant
Degradation_degradation_k (ALAD homooctamer (Pb and Zn bound)) | Degradation rate constant
Degradation_degradation_k (Coproporphyrinogen I) | Degradation rate constant
massi_k (Uroporphyrinogen-III synthase) | Uroporphyrinogen III production rate constant
massi_k (Protoporphyrin IX) | Export rate constant for Protoporphyrin IX
massi_k (Coproporphyrinogen) | Export rate constant for Coproporphyrinogen III
massi_k (ALAD) | Formation of ALA Dehydratase inactive complex
massi_k (Hydroxymethylbilane) | Dissociation rate constant for Hydroxymethylbilane


Table 2. State variables (proteins) in the model

State variable | Biological meaning
5-aminolevulinate [cytosol] | 5-aminolevulinate in the cytosol
5-aminolevulinate [mitochondrial matrix] | 5-aminolevulinate in the mitochondrial matrix
ALAD homooctamer (Pb and Zn bound) [cytosol] | ALA Dehydratase inactive complex in the cytosol
ALAD homooctamer (Zinc cofactor) [cytosol] | ALA Dehydratase in the cytosol
ALAS homodimer [mitochondrial matrix] | ALA Synthetase in the mitochondrial matrix
Coproporphyrinogen I [cytosol] | Coproporphyrinogen I in the cytosol
Coproporphyrinogen III [cytosol] | Coproporphyrinogen III in the cytosol
Coproporphyrinogen III [mitochondrial intermembrane space] | Coproporphyrinogen III in the mitochondrial intermembrane space
CPO homodimer [mitochondrial intermembrane space] | Coproporphyrinogen oxidase in the mitochondrial intermembrane space
FECH homodimer (2Fe-2S cluster) [mitochondrial matrix] | Ferrochelatase in the mitochondrial matrix
heme [mitochondrial matrix] | Heme in the mitochondrial matrix
PPO homodimer (FAD cofactor) [mitochondrial intermembrane space] | Protoporphyrinogen oxidase in the mitochondrial intermembrane space
UROD homodimer [cytosol] | Uroporphyrinogen III-Decarboxylase in the cytosol
Uroporphyrinogen-III synthase [cytosol] | Uroporphyrinogen-III synthase in the cytosol

Figure 5. A diagram summarizing heme production. Illustration of (A) heme production and (B) the PpIX time course in the case of a ferrochelatase mutation. FECH inhibition is indicated by a blunted line. The simulation analysis of the model indicates that FECH inhibition causes a decrease of heme (B).



Figure 6. Illustration of the 5-aminolevulinate production time course for ALAS. ALAS inhibition is indicated by a blunted line. The simulation analysis of the model indicates that ALAS inhibition due to an ALAS mutation causes a decrease of 5-aminolevulinate.


CONCLUSION
The modeling and simulation platform PyBioS has been used for the in silico analysis of the porphyrin metabolism pathway.
This model of the porphyrin metabolism pathway should be used for hypothesis generation by forward modeling. The model should also be perturbed to test the influence of gene knock-outs and mutations on the behaviour of the model system. Knock-out experiments can be performed in order to determine the functional roles of genes coding for enzymes of the heme biosynthetic pathway, such as ferrochelatase, by studying the defects caused by the resulting mutation. A next step should be the integration of experimental data into the kinetic model of this pathway. The results of the in silico experiments have to be compared with the experimental data to decide which kind of perturbation caused the phenotype of the investigated system. Thus, we should be able to test mutations of enzymes playing an important role in the heme biosynthetic pathway.

References
Ajioka, R. S., Phillips, J. D., & Kushner, J. P. (2006). Biosynthesis of heme in mammals. Biochimica et Biophysica Acta (BBA) - Molecular Cell Research, 1763(7), 723-736.
Akagi, R., Shimizu, R., Furuyama, K., Doss, M. O., & Sassa, S. (2000, March). Novel molecular defects of the delta-aminolevulinate dehydratase gene in a patient with inherited acute hepatic porphyria.
Hepatology, 31(3), 704-8.
Astner, I., Schulze, J. O., van den Heuvel, J., Jahn, D., Schubert, W.-D., & Heinz, D.W. (2005). Crystal
structure of 5-aminolevulinate synthase, the first enzyme of heme biosynthesis, and its link to XLSA
in humans. EMBO J, 24, 3166-3177
Bhasin, G., Kausar, H., & Athar, M. (1999, November, December). Ferrochelatase, a novel target for
photodynamic therapy of cancer. Oncol Rep, 6(6),1439-42.
Chen, F.-P., Risheg, H., Liu Y., & Bloomer, J. (2002, February). Ferrochelatase gene mutations in erythropoietic protoporphyria: Focus on liver disease. Cell Mol Biol (Noisy-le-grand), 48(1), 83-9
Daskalaki, A. (2002). The use of photodynamic therapy in dentistry. Clinical and experimental studies.
Diss. Berlin: FU.
Downey, D. C. (2002). The porphyrin pathway: The final common pathway? Medical Hypotheses,
59(6), 615-621.
Ferreira, G. C., Franco, R., & Jos, J. (1999). Ferrochelatase: A new iron-sulfur center containing enzyme. 3.3 Steady - State kinetic properties of ferrochelatase. Iron metabolism. Wiley-VCH.
Foroulis, C. N., & Thorpe, J. A. C. (2006). Photodynamic therapy (PDT) in Barrett's esophagus with
dysplasia or early cancer. Eur J Cardiothorac Surg, 29, 30-34
Gaullier, J.M., Geze, M., Santus, R., Sa, M.T., Maziere, J.C., Bazin, M., Morliere, P., & Dubertret, L.
(1995). Subcellular localization of and photosensitization by protoporphyrin IX in human keratinocytes
and fibroblasts cultivated with 5-aminolevulinic acid. Photochem Photobiol, 62, 114-122

Ismail, M.S., Dressler, C., Strobele, S., Daskalaki, A., Philipp, C., Berlien, H-P, Weitzel, H., Liebsch,
M., & Spielmann, H. (1997). Modulation of 5-ALA-induced PplX xenofluorescence intensities of a
murine tumour and non-tumour tissue cultivated on the chorio-allantoic membrane. Lasers in Medical
Science, 12, 218-225.
Kamburov, A., Wierling, C., Lehrach, H., & Herwig, R. (2006, December 1-2). ConsensusPathDB - Database for matching pathway annotation. Systems Biology, Proceedings of Computational Proteomics
Joint Annual RECOMB 2005 Satellite Workshops on Systems Biology and on Regulatory Genomics,
San Diego, CA, USA.
Kemmner, W., Wan, K., Rüttinger, S., Ebert, B., Macdonald, R., Klamm, U., & Moesta, K.T. (2008, February). Silencing of human ferrochelatase causes abundant protoporphyrin-IX accumulation in colon
cancer. FASEB J., (2), 500-9. Epub 2007, Sep 17.
Kennedy, J. C., & Pottier, R. H. (1992). Endogenous protoporphyrin IX, a clinically useful photosensitizer for photodynamic therapy. J Photochem Photobiol B, 14, 275-292.
Kirby, I., Bland, J., Daly, M., Constantine, & Karakousis, P. (2001). Surgical oncology: Contemporary
principles and practice. New York: McGraw-Hill.
Klipp, E., Herwig, R., Kowald, A., Wierling, C., & Lehrach, H. (2005). Systems biology in practice.
Concepts, implementation and application. WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Klipp, E., Nordlander, B., Kruger, R., Gennemark, P., Hohmann, S. (2005). Integrative model of the
response of yeast to osmotic shock. Nat Biotechnol, 23, 975-982.
Mitchell, L. W., Volin, M., Martins, J., & Jaffe, E. K. (2001) Mechanistic implications of mutations to
the active site lysine of porphobilinogen synthase. J Biol Chem, 12, 276(2), 1538-44.
Munakata, H., Sun, J-Y., Yoshida, K., Nakatan, T., Honda, E., Hayakawa, S., Furuyama, K., & Hayashi,
N. (2004). Role of the heme regulatory motif in the heme-mediated inhibition of mitochondrial import
of 5-aminolevulinate synthase. J Biochem, 136(2), 233-238.
Reinbothe, S., & Reinbothe, C. (1996).The regulation of enzymes involved in chlorophyll biosynthesis.
Eur J Biochem, 237, 323-343 .
Vastrik, I., D'Eustachio, P., Schmidt, E., Joshi-Tope, G., Gopinath, G., Croft, D., de Bono, B., Gillespie, M., Jassal, B., Lewis, S., Matthews, L., Wu, G., Birney, E., & Stein, L. (2007). Reactome: A knowledge
base of biologic pathways and processes. Genome Biology, 8, R39.
Von Wettstein, D., Gough, S., & Kannangara, C.G. (1995). Chlorophyll biosynthesis. Plant Cell, 7,
1039-1057.
Westerhoff, H. V. et al. (2008). Systems biology towards life in silico: Mathematics of the control of
living cells. J Math Biol.
Wierling, C., Herwig, R., & Lehrach, H. (2007). Resources, standards and tools for systems biology. Briefings in Functional Genomics and Proteomics, doi:10.1093/bfgp/elm027.
Wilson, J.H.P., van Hillegersberg, R., van den Berg, J.W.O., Kort, W.J., & Terpstra, O.T. (1991). Photodynamic therapy for gastrointestinal tumors. Scand J Gastroenterol, 26(Suppl 188), 20-25.


Schneider-Yin, X., Gouya, L., Dorsey, M., Rüfenacht, U., & Deybach, J.-C. (2000). Mutations in the iron-sulfur cluster ligands of the human ferrochelatase. Blood, 96, 1545-1549.
Schoeberl, B., Eichler-Jonsson, G., Gilles, E.D., Muller, G. (2002). Computational modeling of the
dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors. Nat Biotechnol, 20, 370-375.
Schoenfeld, N., Epstein, O., Lahav, M., Mamet, R., Shaklai, M., & Atsmon A. (1988). The heme biosynthetic pathway in lymphocytes of patients with malignant lymphoproliferative disorders. Cancer
Lett, 43, 43-48.

KEY TERMS
ALA: The first committed precursor of the porphyrin pathway.
Ferrochelatase: Also known as FECH is an enzyme involved in porphyrin metabolism, converting
protoporphyrin IX into heme.
Forward Modeling: Modeling approach to study the interaction of processes to produce a response.
Heme: Iron-containing porphyrin that is extensively found in nature, e.g. as the prosthetic group of hemoproteins such as hemoglobin.
Knockout Experiment: An experiment in which an organism is engineered to lack the expression and activity of one or more genes.
Mutation: Changes to the nucleotide sequence of the genetic material of an organism. Mutations
can be caused by copying errors in the genetic material during cell division, by exposure to ultraviolet
or ionizing radiation, chemical mutagens, or viruses.
Porphyrin: Porphyrins are heterocyclic macrocycles, consisting of four pyrrole subunits (tetrapyrrole)
linked by four methine (=CH-) bridges. The extensive conjugated porphyrin macrocycle is chromatic
and the name itself, porphyrin, is derived from the Greek word for purple.
PyBioS: PyBioS is a system for the modeling and simulation of cellular processes. It is developed
at the Max-Planck-Institute for Molecular Genetics in the department of Prof. Lehrach.


Section X

Modeling Cellular Physiology


Chapter XXXVII

Interference Microscopy for


Cellular Studies
Alexey R. Brazhe
Technical University of Denmark, Denmark
and Moscow State University, Russia
Nadezda A. Brazhe
Technical University of Denmark, Denmark
and Moscow State University, Russia

Georgy V. Maksimov
Moscow State University, Russia
Erik Mosekilde
Technical University of Denmark, Denmark
Olga V. Sosnovtseva
Technical University of Denmark, Denmark

Alexey N. Pavlov
Saratov State University, Russia

abstract
This chapter describes the application of interference microscopy and double-wavelet analysis to noninvasive study of cell structure and function. We present different techniques of phase and interference
microscopy and discuss how variations in the intrinsic optical properties of a cell can be related to the
intracellular processes. Particular emphasis is given to the newly developed phase modulation laser
interference microscope. We show how this setup, combined with wavelet analysis of the obtained data
series, can be applied to live cell imaging to investigate the rhythmic intracellular processes and their
mutual interactions. We hope that the discussion will contribute to the understanding and learning of
new methods for non-invasive investigation of intracellular processes.

INTRODUCTION
There is a significant and rapidly growing interest in the development of new experimental techniques
that will allow us to perform non-invasive studies on live cells with a spatial and temporal resolution
that is sufficient to reveal the motion of intracellular structures and to simultaneously follow cellular
processes that take place in different compartments and on different time scales.
Most cells are practically transparent to light and limited information is directly available from
conventional amplitude microscopy. To examine processes within the cell various forms of staining
have to be used. However, such dye-based approaches only allow the investigation of a few processes
at a time and, moreover, the staining affects normal cellular processes.
Local refractive index, an intrinsic optical property of biological objects, provides additional valuable information. Although a cell often doesn't absorb light efficiently, various cellular structures may
have different values of the refractive index and therefore retard light beams that propagate through
the object differently. The idea of exploiting the associated phase shifts underlies a number of different
microscopy setups. A main advantage of the phase imaging technique is that no staining is required to
visualize the transparent structures in the cell. Moreover, phase imaging allows the spatial resolution
to exceed the Rayleigh limit, which is impossible in amplitude microscopy. Besides techniques for
imaging it is also essential to develop non-invasive techniques that can be used to examine the dynamics
of the cellular processes and their mutual interactions.
The possibility of using the intrinsic optical properties as a non-invasive probe of neuronal properties was first considered by Hill and Keynes (Hill and Keynes, 1968). They observed changes in the
light scattering intensity of a nerve fibre during electrical activity. Cohen (Cohen, 1969) found that
the intrinsic optical properties depend on the ion currents through the plasma membrane, and Stepnoski (Stepnoski et al., 1991) observed that the intensity of light scattered by neurons depends on their
membrane potential. By now it is clear that the intrinsic optical properties of a cell also depend on
the organization of the cytoskeleton and on the location of the various organelles (Haller, 2001). Cells
exhibit dynamic processes over a broad range of different time scales and across a variety of cellular
compartments. Moreover, these processes interact with one another to produce mutual modulation. It
appears worthwhile to examine the possibility of time-resolved phase measurement for non-invasive
studies of cellular processes. Such studies must, of course, be accompanied by a development of new
mathematical methods capable of unravelling the complexity of the interacting processes.
The following section presents a general background for our discussion, describing a number of
different approaches to phase imaging and providing an introduction to the use of wavelet analysis of
rhythmic phenomena in non-stationary time series. This is followed by a section on cell visualization
where we examine several cell types in order to clarify the kinds of information that one can obtain in
such studies. Finally, we present the results of a time-resolved interference microscopy study of several
cell types.

GENERAL BACKGROUND
Phase Microscopies
Phase Contrast
Historically, the first technique applied to convert phase shifts in the light passing through a transparent specimen into amplitude or contrast changes was phase contrast microscopy. This technique was
invented by Frits Zernike in the 1930s, and in 1953 he received the Nobel prize in physics for this invention.


The core idea of the phase contrast design is to place a condenser annulus and a phase plate in conjugate aperture planes (Bennett, 1951). When the light wave hits an object with refractive index different
from the medium, part of the light diffracts and/or refracts to produce a new light wave. The original
(surrounding) wave is reduced in intensity and retarded by the phase plate while the diffracted wave
is left mostly unchanged. Recombination of the two waves on the image plane results in their mutual
interference thus enhancing the contrast in transparent objects (Ross, 1967). Phase contrast microscopy
is mainly used in microbiology and for studies of single cell algae and protista. The technique is qualitative and introduces several artefacts. At places where the spatial refractive index gradient is high, bright
halos emerge, thus making it hard to determine the exact boundaries of the object. Such halos are often
observed near the cell membrane and around some of the organelles. Moreover, an object of uniform
refractive index doesn't necessarily appear uniform in the phase contrast image, but a shade-off arises
near the centre because it is impossible to fully separate the surrounding and diffracted waves. Phase
contrast techniques also imply a very small aperture which reduces their sensitivity and accuracy.

Differential Interference Contrast


This type of microscopy is used to visualize local gradients of refractive index in a specimen (Murphy,
2001). The spatial resolution is better than for the phase contrast microscopy, the aperture of the lens
system is higher and the resulting image lacks the artefacts specific to phase contrast microscopy. The
light from the source is split by a Nomarski-modified Wollaston prism into two beams polarized at 90° to each other and slightly sheared. After focusing, these two displaced beams go through adjacent points of the specimen, the distance being typically about 0.2 μm. The beams thus experience slightly
different optical path lengths. In reality many pairs of beams go through the specimen and the light thus
carries two slightly displaced bright-field images made by beams with different polarization. The beams
then pass through the objective lens system and are focused on a second Nomarski prism in which they
are combined into a single beam. This leads to interference which elucidates phase differences in the
beam pairs associated with the variation of the refractive index along the sample. Thus one obtains a
first-derivative of the phase image of an object. This kind of setup, known as a shearing interferometer,
thus visualizes refractive non-uniformities of the sample.
The resulting images look like gray-scale objects under oblique illumination with strong light and
dark shadows on corresponding faces. The direction of apparent illumination is determined by the
orientation of the Wollaston prisms. Features parallel to this direction are not visible. To obtain full
information on the object one has to image the sample at different shear directions. The main limitation
of this technique is that a cell has to be relatively thin and transparent and the differences between the
optical paths in adjacent points have to be less than half of the wavelength. This technique is not quantitative in its traditional setups as the local measured light intensities depend not only on interference
but also on absorbance by the specimen.

Interferometers
Quantitative information on the refractive index values in the sample can be obtained with microscopes
based on double-beam interferometers. In these setups one of the beams passes through the sample while
the other one, the reference beam, does not. If the speed of light propagation in the sample differs
from the speed in the medium, then the optical path difference between the beams is:


Δ = (ns − nm) z										(1)

where ns and nm are the refractive indices of the sample and of the medium respectively and z is the
thickness of the sample. The two beams form an interference image on the detector. Mach-Zehnder
and Linnik interferometers are usually used for interference microscopy. The latter is simpler and is
similar to a Michelson interferometer (Figure 1a), only with microscope objectives added. The Mach-Zehnder interferometer has a more complex design with two beam splitters (Figure 1b) (Zetie, 2000).
Contrary to the Linnik interferometer, which is a reflected-light type setup where the sample is placed
on the mirror and light passes through the sample twice, the Mach-Zehnder interferometer is usually a
transmitted-light type setup and light passes through the object only once. This allows the investigator
to study specimens of greater optical and physical thickness. The Mach-Zehnder scheme also offers
an improved sensitivity, and another advantage is that the two branches of the interferometer can be
adjusted independently.
Broad-band (e.g. white) light has low temporal coherence and a precise matching of the two optical
paths is needed to obtain an interference image. Such twin-matched optical systems are hard to set
up and use. If a laser light source with long temporal coherence is used, the two optical paths can be of
unequal length and still produce interference patterns.

Figure 1. Schemes of Michelson (a), Mach-Zehnder (b), and modified Mach-Zehnder (c) interferometers.
L is the light source, D is the detector, PM is the phase modulation mirror, BS are beamsplitters, M are mirrors, O are objectives and WP are wave plates.


Two approaches are usually applied to measure optical path differences (phase height) in any point
of an object (Davies, 1954). In the first approach the optical path difference between the object beam and
the reference beam is varied linearly across the field producing bands (fringes) of interference across
the field. This is done by an optical wedge in the reference beam. Introducing an object to the system
results in the displacement of the fringes which is related to the optical path difference brought in by
the object. In order to correctly reconstruct the optical path lengths through the object one needs to
use several images made under different band positions. In the second approach the background field
is made uniformly illuminated. Introducing an object then results in changes in the light intensity according to the formula:
I = A0² + A1² + 2 A0 A1 cos(φ0 − φ1)								(2)

where A0, φ0 and A1, φ1 are the amplitudes and phases of the reference and object beams, respectively. Four raw images taken at different φ0 values are needed to reconstruct the φ1 values. So, in both approaches,
full interference images are recorded and optical path lengths through different points of the object are
then reconstructed. There are two possible problems in these approaches. First, the distribution of light
intensities in the image can be altered as a result of light refraction. Second, the measured intensity I can
be very small in some regions of the object thus lowering the accuracy of results, even if the intensity
of the object beam A1 is high. Modern applications may involve complicated mathematical procedures
to perform the reconstruction and avoid some of the problems.
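One common variant of the second approach is four-step phase shifting; the sketch below simulates four intensity values recorded at reference-phase offsets of 0, π/2, π and 3π/2 and recovers the object phase, with the amplitudes and the test phase being arbitrary assumptions rather than data from the setups described here.

import numpy as np

# Simulate four interferograms I_k = A0^2 + A1^2 + 2*A0*A1*cos(phi1 - offset_k)
# with reference-phase offsets 0, pi/2, pi, 3*pi/2 (values chosen for illustration).
A0, A1 = 1.0, 0.7
phi1_true = 1.2                      # object-beam phase to be recovered (test value)
offsets = np.array([0.0, np.pi / 2, np.pi, 3 * np.pi / 2])
I = A0**2 + A1**2 + 2 * A0 * A1 * np.cos(phi1_true - offsets)

# Four-step reconstruction for this sign convention: tan(phi1) = (I2 - I4) / (I1 - I3).
phi1_rec = np.arctan2(I[1] - I[3], I[0] - I[2])
print(phi1_rec)   # approximately 1.2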

Phase-Modulation Laser Interference Microscopy


The results discussed in the following sections were obtained with an interference microscope where the
optical path lengths in the sample are obtained in another way. The phase modulation microscope MIM2.1, based on the modified Mach-Zehnder scheme (Figure 1c) and developed by the Russian company
Amphoralabs Ltd. (www.amphoralabs.ru) (Andreev, 2003; Andreev, 2005) was used. In this setup the
problems that we discussed above are eliminated by a clever approach to phase measurement. The
main idea is that the phase is determined independently for each pixel of the photo-sensitive camera.
The length of the reference beam is harmonically modulated at 500 Hz. For each pixel, the intensity
I is measured as a function of time and the position of the mirror corresponding to the maximal rate
of change in I is determined. The phase shift of the reference beam in this position of the modulation
mirror is offset by π/2 relative to the phase of the object beam. Thus, optical path lengths are obtained
independently for each point of the object. We will further use the term phase height for the optical
path difference:

Φ = (φobj − φ0) λ / (2π) − Φ0									(3)

here φ0 is the initial phase, φobj the phase in the presence of the object, λ the wavelength of the laser light, and Φ0 a constant phase shift determined by the choice of the reference point. Phase heights in all points
of an object define a phase image. For heterogeneous objects, as living cells are, the phase height in
each pixel i is given by:


Φi = ∫_0^Z (ni(z) − nm) dz − Φ0								(4)

with ni(z) being the refractive index of the cell at point i at a distance z from the mirror and nm the constant refractive index of the surrounding physiological saline. Z is the upper limit of integration
that is chosen as a point just above the cell. The lateral resolution of the method depends on the laser
wavelength and the numerical aperture (NA) according to the Abbe formula (Brandon, 1999):
D = 0.61 λ / NA

For the results to be presented the NA value is 0.15 and λ = 532 nm; thus D = 2.16 μm. The power of the laser light per cell is 2 mW.
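A quick numerical check of this resolution estimate, using only the wavelength and numerical aperture quoted above:

# Lateral resolution from the Abbe estimate D = 0.61 * lambda / NA,
# using the wavelength and numerical aperture given in the text.
wavelength_um = 0.532   # 532 nm expressed in micrometres
NA = 0.15
D = 0.61 * wavelength_um / NA
print(f"D = {D:.2f} um")   # about 2.16 um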
Both the physical size of the cell and its refractive index contribute to the measured phase height.
In static imaging one can remove this ambiguity by comparing the results for several physiological
salines with known refractive indices or for lasers of different wavelengths. When measuring variations
of the phase height in a point of a cell, the main input is mostly associated with changes in the refractive index rather than with changes in the cell shape. To show this, we can consider a general form of
differential for phase height:
dΦ = z dn + (n − nm) dz

For a typical cell we have n − nm ≈ 0.02, n ≈ 1.4, and z ≈ 10 μm. If we consider 1% changes in z and n, then z dn = 0.14 μm, and (n − nm) dz = 0.002 μm.
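The same 1% perturbation argument can be restated numerically with the typical values quoted above:

# Relative contributions to a change in phase height, d(Phi) = z*dn + (n - n_m)*dz,
# for the typical values quoted in the text and 1% perturbations of n and z.
n, n_m, z = 1.40, 1.38, 10.0        # refractive indices and thickness (um); n - n_m ~ 0.02
dn, dz = 0.01 * n, 0.01 * z         # 1% changes
print(f"z * dn         = {z * dn:.3f} um")          # ~0.14 um
print(f"(n - n_m) * dz = {(n - n_m) * dz:.3f} um")  # ~0.002 um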
In our experiments with cell imaging the working field was 27 × 27 μm, and images of 256 × 256 pixels were obtained. One of the pixels was chosen as a reference pixel where the optical phase difference was assumed to be zero. In the other pixels Φ was calculated relative to this value. The time required to measure the phase in each pixel is 2 ms. The setup allows us to perform measurements of phase heights
in arbitrarily selected individual pixels and along lines or rectangular sets of pixels. The sampling rate
for each acquisition channel will then be 500Hz divided by the total number of pixels. One can therefore
adjust the sampling rate for the obtained data series in the dynamical studies.
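For example, with the 500 Hz modulation frequency the per-channel sampling rate scales as follows (pixel counts chosen arbitrarily for illustration):

# Per-channel sampling rate = modulation frequency / number of monitored pixels.
modulation_hz = 500.0
for n_pixels in (1, 10, 50):
    print(n_pixels, "pixel(s):", modulation_hz / n_pixels, "Hz per pixel")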

Wavelet Analysis
Cellular activity involves processes at many different time scales, occurring in various compartments. Many cell processes are interrelated and depend on each other. It is already established that cellular
electric activity, modification of plasma membrane structures, cytoplasm compartmentalization, as well
as position and shape of organelles alter the local refractive index (Cohen, 1968; Stepnoski, 1991; Haller,
2001; Rappaz, 2005). Evidently, regular cooperative processes in the cell result in regular changes of the
intrinsic optical properties, i.e. in the refractive index. Therefore, analysis of the variations of the refractive
index provides information about the frequencies of the regular cellular processes. An important step
in the analysis of biological data is the choice of an appropriate mathematical tool. We demonstrate how
the combination of interference microscopy with advanced time-series analysis can be applied to study
multimode dynamical phenomena in cells.
Spectral analysis of biological time series is often based on the application of a wavelet-transformation (Grossmann, 1984; Daubechies, 1992). The advantages of this approach in comparison with the
classical Fourier-transform have been widely discussed (Kaiser, 1994). The wavelet-transform of a
signal x(t) is obtained as follows:

Tx(a,t) = (1/√a) ∫ x(u) ψ*((u − t)/a) du							(5)

Here ψ is a mother function that is soliton-like with zero average, Tx(a,t) are the wavelet coefficients and a is a time scale parameter. The asterisk denotes the complex conjugate of the wavelet function. The details of this transform (e.g., the choice of ψ) depend on the problem to be solved. In the analysis
of rhythmic components, the Morlet function is typically considered. A simplified expression for the
Morlet function has the form:

ψ(ξ) = π^(−1/4) exp(2πi f0 ξ) exp(−ξ²/2)							(6)

The parameter f0 allows us to search for a compromise between the localizations of the wavelet in the time and frequency domains. In our work f0 = 1 or 5, depending on the frequency band. The relation between the scale a and the central frequency f of the mother function in this situation is f = 1/a.
Besides the coefficients Tx(a,t), the energy density of the signal x(t) in the time-scale plane can be estimated: Ex(a,t) ~ |Tx(a,t)|². Following the definition used by Kaiser (Kaiser, 1994), the coefficient of proportionality between Ex(a,t) and |Tx(a,t)|² depends on both the scale and the shape of the mother wavelet, although in some works the simple expression Ex(a,t) = |Tx(a,t)|² is used. Note that the moduli of the original wavelet coefficients Tx(a,t) estimated from Eq. (5) do not correspond to actual amplitudes of the rhythmic components. To study amplitude variations, it is possible to slightly change the definition of the wavelet transform or to make corrections for the energy density Ex(a,t). In the present work we consider Ex(a,t) = C a⁻¹ |Tx(a,t)|², where C is a parameter that depends on the mother wavelet function.
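A minimal numerical version of Eqs. (5) and (6), assuming a synthetic amplitude-modulated test signal and a direct (and deliberately slow) discretization of the integral, might look as follows; it is a sketch of the transform itself, not the implementation used for the measurements reported below.

import numpy as np

def morlet(xi, f0=1.0):
    # Simplified Morlet mother function, as in Eq. (6).
    return np.pi ** -0.25 * np.exp(2j * np.pi * f0 * xi) * np.exp(-xi ** 2 / 2)

def cwt(x, t, scales, f0=1.0):
    # Direct discretization of Eq. (5): Tx(a,t) = a^(-1/2) * integral of x(u) psi*((u - t)/a) du.
    dt = t[1] - t[0]
    T = np.zeros((len(scales), len(t)), dtype=complex)
    for i, a in enumerate(scales):
        for j, tj in enumerate(t):
            T[i, j] = np.sum(x * np.conj(morlet((t - tj) / a, f0))) * dt / np.sqrt(a)
    return T

# Synthetic test signal: a 1 Hz component whose amplitude is modulated at 0.1 Hz.
t = np.linspace(0, 20, 1000)
x = (1 + 0.5 * np.sin(2 * np.pi * 0.1 * t)) * np.sin(2 * np.pi * 1.0 * t)

scales = 1.0 / np.linspace(0.5, 2.0, 30)       # f = 1/a for the Morlet with f0 = 1
T = cwt(x, t, scales)
E = scales[:, None] ** -1 * np.abs(T) ** 2     # energy density, up to the constant C
print("scale of maximal energy ~", scales[np.argmax(E.mean(axis=1))])  # close to 1, i.e. the 1 Hz rhythm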

CELL VISUALIZATION
To demonstrate the use of phase-modulation laser interference microscopy for cell visualization we chose
three different types of cells: erythrocytes, mast cells and neurons. Erythrocytes can be regarded as
tough cells due to the strong submembrane cytoskeleton. They have a simple intracellular structure,
and their refractive index (RI) is determined mainly by the distribution of haemoglobin and by the
structure of the cytoskeleton net. On the contrary, mast cells and, especially, neurons are soft cells
with complicated cytoplasmic compartmentalization and therefore inhomogeneous distribution of the
refractive index. Such cells can hardly be visualized by means of atomic force microscopy, a technique
that has become quite popular during recent years.

Preparation
The studied samples were human erythrocytes, rat mast cells and isolated neurons of the pond snail
Lymnaea stagnalis and the medical leech Hirudo medicinalis. Erythrocytes were taken from the blood
of healthy donors and from patients with heart failure, functional NYHA class II, at the stage of
decompensation. Our experiments were performed in accordance with the standards of the Ethics Committee of the A.L.Myasnikov Institute of Clinical Cardiology. Preparation of mast cells and neurons
was done as described by Brazhe et al. (Brazhe, 2005; Graevskaya, 2001). During the experiments the
cells were placed in a containment chamber with a mirror bottom layer and filled with an appropriate
physiological solution (Brazhe, 2006). In order to avoid photodamage cells were tested for the effect
of laser light. Neurons and mast cells absorb weakly in the region of the used laser light (532nm), so
their photodamage is minimal. We also note that the used laser light did not cause lysis of any of the
studied cells.

Phase Images
Figure 2 shows a typical photograph in transmitted light (a) and the phase height image (b) of an erythrocyte from the blood of a healthy donor. It has the shape of a normal erythrocyte: a discocyte. The
typical toroidal form is clearly seen on the phase height image. The discocyte has a smooth shape (that
is also seen from the photograph) and a homogeneous distribution of the refractive index, indicating
a uniform hemoglobin distribution. On the contrary, the phase height image of the erythrocyte of the patient with heart failure (Figure 2d) has a rough toroidal form with protuberances. It is important that such structures cannot be seen in the photograph (Figure 2c), which appears similar to that of the erythrocyte of a healthy person (Figure 2a). The observed protuberant toroidal shape results from complex changes of the cytoskeleton structure and from an inhomogeneous distribution of hemoglobin in the cytoplasm
and in the submembrane region. These changes can partly be caused by pathological processes in erythrocytes in connection with the heart failure. There is evidence that severe hypoxia and blood system
diseases affect the plasma membrane fluidity and hemoglobin properties of erythrocytes (Rodnenkov,
2005). Another explanation of the protuberant toroidal form is that it reflects the initial phase of the transformation of the discocyte into an echinocyte. This example demonstrates that phase-modulation laser

Figure 2. Photographs in the transmitted light (a, c) and phase images of human erythrocytes (b, d) from
the blood of a healthy donor (a, b) and patient with heart failure (c, d). Erythrocytes have the discocytic
form. Cell visualization was performed by means of phase-modulation laser interference microscopy.
Figure 2a: bar is 10 μm; Figure 2b: x and y bars are 10 μm, z bar is 100 nm.


interference microscopy (PM-LIM) can be used as an additional technique to monitor erythrocytes, as it is sensitive to changes in cell shape and intrinsic optical properties and it can reveal differences between healthy and non-healthy cells that cannot be seen in photographs obtained in transmitted light (Figure 2a-d) (Brazhe, 2006).
Our next example shows an application of the interference microscopy to the study of neurons in
different functional states. Figure 3a presents a photograph of a pond snail neuron in transmitted light, while Figure 3b is a phase image of a part of the same neuron under normal conditions (i.e. in normal
physiological solution). The neuron has a smooth shape after isolation and only a small axon hillock is
seen on the cell top. The granular structure of the cytoplasm can be observed on the photograph; however, it seems to be similar in density in different cell parts (Figure 3a). On the contrary, the phase image (Figure 3b) reveals the complex inhomogeneous structure of the neuron cytoplasm. One can clearly see
the difference between various neuron parts that originates from the difference in the local refractive
index values due to the special landscape of organelles, cytoskeleton and submembrane structures. K+-depolarization produced by bath application of KCl at high concentration increases the phase height of
the neuron (there are more red and white areas on the phase image, Figure 3c). An explanation of the
observed changes can be the following: a solution with high K+ concentration produces depolarization
of the plasma membrane, short-term activation of Na+ and Ca2+-channels and prolonged activation of
the Na+/Ca2+-exchanger (Hille, 1992). Influx of Ca2+ into the cytoplasm triggers Ca2+-sensitive signaling

Figure 3. Photograph in the transmitted light of a neuron under normal conditions (a), its phase images
under normal conditions (b) and in the physiological solution with high K+ concentration (c). The frame in Figure 3a shows the region used for registration of the phase image. (d) and (e) are a photograph in transmitted light of a mast cell and its phase image, respectively. X bars on all figures are 10 μm. Colour and grey bars represent the scale for the phase images in nm.


pathways and affects cytoskeleton structure (Metuzals, 1981; Mironov, 2005). This results in complex
reorganizations of the cytoplasm and in changes of the local refractive index (Haller, 2001).
Our third example concerns mast cells, one of the most useful objects to study the effect of newly
synthesized anti-allergic drugs. Mast cells possess an extremely intensive exocytosis, which can be easily stimulated by many factors (red light, pressure, neurotransmitters, Ca2+ ionophores, etc.). Therefore
the visualization of live mast cells without significant artificial modification of their state is a difficult
task. Mast cells have three major compartments: cytoplasm, large nucleus with one or several nucleoli
and vesicles of different size (up to 1 μm in diameter) with mediators (histamine, serotonin, etc.) (Yen,
1994). These structures can only be guessed from the photograph in the transmitted light (Figure3d),
but they are much more pronounced in the phase image (Figure3e). The large nucleus (light-grey region in the image) is well distinguished from the surrounding cytoplasm (dark grey region) and a small
nucleolus is clearly seen in the nucleus region (white and black hillock in the 2 o'clock position). Several
exocytotic vesicles can also be observed as smaller hillocks close to the nucleus. Thus, any dynamical
changes of intracellular compartmentalization of mast cells caused by external or internal stimuli can
be traced and visualized by interference microscopy.

STUDY OF CELLULAR DYNAMICS


Characteristic Frequencies in the Dynamics of the Refractive Index
Since different types of cells diverge in their properties and processes, it is natural to expect different
dynamics of the refractive index. In the following examples we compare the dynamics of the refractive
index for excitable (neurons) and for non-excitable cells (mast cells and erythrocytes). Neurons have the
most active processes at the plasma membrane, and the local changes of the refractive index associated
with these processes should exceed the changes observed for the other cells. Mast cells represent a cell
type with an active vesicular transport and exocytosis, whereas erythrocytes are cells with a dense packing of hemoglobin and a rigid submembrane structure. Due to the intracellular compartmentalization,
the processes in cytoplasm, submembrane and membrane regions are localised. Study of the refractive
index dynamics in various parts of the cell can therefore provide information about the different spatially
separated processes. Here we will consider variations of the refractive index in the boundary region of
the neurons and mast cells. As the membrane/cytoplasm volume ratio is higher for the boundary region
of the cell than for the centre, the contribution of the membrane and submembrane processes into the
RI dynamics is also higher for the cell boundary. Hence, in the case of the boundary region (membrane
region) we explore mainly membrane and submembrane processes. In the experiments with erythrocytes
we do not distinguish different cellular regions.

Neurons
Wavelet analysis of data (time-dependence of the local refractive index in a certain part of the cell)
gives a matrix of wavelet coefficients showing the time-dependence of the frequencies and their power.
With the wavelet coefficients one can calculate instantaneous and averaged power spectra. Figures 4a, b show time-averaged power spectra of the refractive index variations in the membrane region of a neuron. The low and high frequency bands are represented separately since different values of f0 are
used for their calculation (f0 = 1 and f0 = 5, respectively) and because the involved rhythmic components
have very different powers. Several well-defined rhythms can be observed in the low frequency range
around 0.1, 0.2-0.4, 1 and 2-3Hz (Figure 4a). The structure of the high frequency range is not so clear,
however, there is a group of pronounced frequencies around 10 and 20-25 Hz (Figure 4b). We associate low frequencies with plasma membrane processes and high frequencies with cytoplasmic events. There is evidence for such an assumption. Firstly, low frequency rhythms are much more pronounced in
the membrane region than in the centre of neurons, while the 20-25Hz rhythms display an opposite
behaviour (Brazhe, 2006). Secondly, there are data of independent experiments on the same type of
neurons showing the existence of regular processes. Szucs and co-workers (1999) found that frequencies in the range of 0.2-0.4Hz depend on the activity of Ca2+-channels. Another group showed that
neurons of invertebrates possess intrinsic electric activity with 1, 1.5-3Hz frequencies (Schutt, 2000).
The suggestion that the high frequencies (20-25 Hz) originate from cytoplasmic processes is in
accordance with experimental data on vesicle movements in neurons (8-40Hz) obtained by light scattering measurements (Landowne, 1969).

Mast Cells
The structure of the power spectra in the membrane region of mast cells differs significantly from the
structure of the spectra in neurons. Rhythms around 1-2 Hz are broader than the corresponding rhythms in neurons, and

Figure 4. Power spectra calculated via the wavelet technique for the low (a, c, e) and high (b, d, f)
frequency bands of the refractive index variations in the membrane region of the neuron (a, b), mast
cell (c, d) and erythrocyte (e, f). Measurements are performed by means of phase-modulation laser
interference microscopy.


peaks at 0.1-0.4 Hz, which have the highest power in neurons, are absent. Besides, mast cells possess a new rhythm around 4-6 Hz (Figure 4c). Contrary to neurons, the power of the high frequency peaks (24 and 26 Hz) in the membrane region of the mast cell essentially exceeds that of the peaks in the low frequency
range (Figure 4d).
It is natural to assume that the highest peaks in the power spectra belong to the most active cellular
processes. As the most active processes in neurons are related to the changes of the membrane potential,
the highest peaks in their spectra should be rhythms that originate from the processes underlying electric
activity (0.1-0.4 and 1-3Hz). For mast cells the major processes are vesicle transport and exocytosis
(Dvorak, 1991). Thus, rhythms with the highest power (24 and 26Hz) are associated with transport and
exocytosis processes.

Erythrocytes
Figures 4e, f present the low (e) and high (f) frequency ranges of the power spectra for the RI dynamics
of erythrocytes. The spectrum structure is different from the structure observed for neurons and mast
cells, and the magnitude of the main peaks is much lower. We assume that this is due to the rigid submembrane and membrane structure that prevents strong changes in the membrane and cytoplasm and,
therefore, eliminates significant changes of the refractive index (Discher, 1995).

Modulation Properties of Rhythms


Cellular processes are interrelated and influence one another. It is obvious that interaction of processes
underlying rhythmic variations of RI will result in the modulation of rhythms by slower oscillations.
Thus, our next step is to investigate each frequency range with a view to detecting such interactions. This was done on the refractive index variations in the membrane region of neurons. Figure 5a shows a so-called skeletonogram, i.e., the time-dependence of the main frequency components in the low frequency range (<5 Hz). All observed rhythmic components (0.1, 0.3, 1, and 2-4 Hz) are stable during the
whole time of observation. Rhythms between 0.1Hz and 0.3Hz maintain constant values in time while
rhythms between 1Hz and 2-4Hz demonstrate variations caused by slower components. In the high
frequency range there is a large number of coexisting rhythmic components with quite non-stationary
behaviour and different modulation properties (Sosnovtseva, 2005). As it was mentioned above, modulation of fast oscillations by slower signals can be considered as a form of nonlinear interaction between
specified modes. This phenomenon can be used for identification of the observed frequencies with
particular cellular processes and for description of processes interactions. In order to study interaction
properties of different rhythms the approach of double-wavelet analysis can be applied (Sosnovtseva,
2005). The time dependence of the instantaneous frequency ffast(t) is considered as an input signal for
the next wavelet-transform (Eq.5). Again, the wavelet coefficients and the energy density are estimated
and the simplified visualization of the energy density is considered. The energy density will contain
information about all modes involved in the modulation process. We can examine how the features of
the frequency modulation change in time. By analogy, instead of the instantaneous frequency of the fast
dynamics we can take the instantaneous amplitude of this oscillatory mode and, in this way, it is possible to study the properties of amplitude modulation of the fast rhythm as well. This approach allows
one to characterize the non-stationary dynamics of a modulated signal, i.e. to detect all components

667

Interference Microscopy for Cellular Studies

Figure 5. (a) Typical time-dependence of the main frequency components in the dynamics of the refractive
index in the membrane region of neuron. Calculations were performed by means of a wavelet technique.
(b)Typical spectrum of amplitude modulation of rhythms 1 Hz (black line) and 3 Hz (grey line) and (c) the
scatter plot of the main maxima of the amplitude modulation of 1 and 3 Hz rhythms. (d)Typical spectrum
of amplitude modulation of rhythms 11 Hz (black line) and 17 Hz (grey line) and (e) the scatter plot of
the main maxima of the amplitude modulation of 11 and 17 Hz rhythms. Calculations were performed
by means of a double-wavelet technique.

that are involved in the modulation, estimate their contributions, and analyze whether the modulation
properties change during the observation time.
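
For readers who wish to reproduce the double-wavelet procedure described above, a minimal Python sketch is given below. It is only an illustration under stated assumptions: the sampling rate, the frequency bands and the data file name are placeholders, and the PyWavelets package is used here instead of the authors' own implementation.

# Minimal sketch of double-wavelet analysis (illustration only, not the
# authors' original code). Assumptions: a 1-D refractive-index time series
# sampled at fs Hz, and the PyWavelets package (pywt) for the CWT.
import numpy as np
import pywt

def cwt_power(signal, freqs, fs, wavelet="cmor1.5-1.0"):
    """Wavelet energy density E(f, t) evaluated at the requested frequencies."""
    fc = pywt.central_frequency(wavelet)        # centre frequency of the wavelet
    scales = fc * fs / freqs                    # map frequencies to scales
    coefs, _ = pywt.cwt(signal, scales, wavelet, sampling_period=1.0 / fs)
    return np.abs(coefs) ** 2

fs = 100.0                                      # assumed sampling rate, Hz
signal = np.loadtxt("ri_timeseries.txt")        # hypothetical data file

# Step 1: track the instantaneous frequency of a fast rhythm (e.g. 10-30 Hz)
fast_freqs = np.linspace(10.0, 30.0, 200)
power_fast = cwt_power(signal, fast_freqs, fs)
f_fast = fast_freqs[np.argmax(power_fast, axis=0)]   # ridge: f_fast(t)

# Step 2: wavelet-transform f_fast(t) itself to reveal the slow modulation modes
slow_freqs = np.linspace(0.1, 5.0, 100)
mod_power = cwt_power(f_fast - f_fast.mean(), slow_freqs, fs)
mod_spectrum = mod_power.mean(axis=1)           # time-averaged modulation spectrum

print("dominant modulation frequency:", slow_freqs[np.argmax(mod_spectrum)], "Hz")

Replacing the instantaneous frequency by the instantaneous amplitude of the fast mode in step 2 would, in the same spirit, yield amplitude-modulation spectra of the kind shown in Figure 5b, d.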
Figure 5b illustrates typical spectra of the amplitude modulation for the 1 and 3 Hz rhythms. The modulation
spectra of these rhythms clearly differ from each other. Statistical analysis also shows that the rhythmic
components at 1 Hz (black circles) and 2-4 Hz (white circles) are clearly separated with respect to
modulation frequency in the case of amplitude modulation (Figure 5c). In the case of frequency modulation
we also observed a clear separation of the 1 and 3 Hz rhythms (Sosnovtseva, 2005). Such a well-defined
separation indicates different biological regulatory mechanisms of the 1 and 2-4 Hz rhythmic activities.
We suggest that the change of the neuron membrane potential at rest has two components (the 1 and
2-4 Hz rhythms) that are modulated in different ways.
Double-wavelet analysis of the high-frequency rhythms reveals a more complex structure of the modulation
spectra; however, there was no difference between the amplitude and frequency modulation properties
of rhythms around 20 Hz (Figure 5d, e). The absence of a clear distinction between the modulation
rhythms allows us to assume that they originate from a common biological regulation.
Analysis of rhythm modulations can help us to correlate the observed frequencies with certain cellular
processes. Moreover, a change of modulation can indicate a change of the underlying cellular processes.
We showed that the amplitude modulation of the 1 Hz rhythm depends on the membrane potential and
changes in opposite ways under membrane depolarization and hyperpolarization (Brazhe, 2006). Based
on this finding we suggest that the 1 Hz rhythm can be used for studies of changes of the membrane
potential. This can be important for the investigation of the electric properties of small neurons, glial
cells and thin nerve fibres that cannot be studied by traditional microelectrode techniques.

CONCLUSION
This chapter briefly described different types of phase microscopy that allow the visualization of transparent
cells without the use of contrast agents or fluorescent dyes. Here we also showed that phase-modulation laser
interference microscopy has several advantages over other techniques in this field. Firstly, in the PM-LIM
technique values of the optical path length (and, therefore, information about the refractive index) are
obtained independently for each point of the object. Secondly, it is possible to register long time series
of data with a chosen sampling rate and to analyze regular variations of the refractive index in a wide
range of frequencies.
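
As a reminder of how the measured phase carries refractive-index information, a standard relation between the optical path difference and the phase shift can be written as follows (λ is the wavelength, h(x, y) the local specimen thickness and n_medium the refractive index of the surrounding medium; these symbols are introduced here only for illustration):

\[
\mathrm{OPD}(x,y) = \int_{0}^{h(x,y)} \left[\, n_{\mathrm{cell}}(x,y,z) - n_{\mathrm{medium}} \,\right] dz,
\qquad
\Delta\varphi(x,y) = \frac{2\pi}{\lambda}\,\mathrm{OPD}(x,y).
\]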
The chapter illustrated how phase-modulation laser interference microscopy combined with
advanced wavelet analysis can be used for non-invasive cell visualization and studies of cellular
dynamics. We showed that interference microscopy can be applied to investigations of different cell
types, not only excitable cells but also non-excitable ones, such as erythrocytes and mast cells. Phase images
of cells provided new data about their compartmentalization and about membrane and submembrane structures
under normal conditions and under the influence of stimuli (chemicals, drugs, etc.). As an example, phase images
of erythrocytes reveal changes of cytoskeleton structure and hemoglobin distribution that can hardly
be observed by conventional microscopy. Interference microscopy can also be applied to studies of
ion and neurotransmitter effects on nerve cells, as their phase images change under modulation of the
cell's functional state.
Besides all the advantages of phase-modulation laser interference microscopy in the field of cellular
visualization, its most significant application is the study of cellular dynamics. To our knowledge, no other
works investigating the characteristic frequencies of refractive index dynamics and their
correlation with cellular processes have appeared. Most other studies focus on the dynamical
morphometry of cells (Rappaz, 2005) or on the observation of relatively slow changes of cell
thickness (<0.5 Hz) (Popescu, 2006). Our approach showed that registration of the time dependence of
the refractive index in various parts of the cell allows the investigation of processes taking place in the plasma
membrane, submembrane and cytoplasm regions. Our results demonstrated that the RI dynamics differs
among neurons, mast cells and erythrocytes and depends on the cellular compartment (Brazhe, 2006).
We propose that (i) the low-frequency dynamics in neurons and mast cells relates to processes in the plasma
membrane and submembrane regions (e.g., changes of the physical-chemical properties of the plasma
membrane), and (ii) the well-distinguished high frequencies (e.g., 24 and 26 Hz for mast cells) correspond to
vesicular transport. By means of double-wavelet analysis we revealed the presence of nonlinear interactions
in the form of frequency and amplitude modulation of the fast rhythms (1 and 3 Hz, >10 Hz) by the
slower processes (0.1, 0.2-0.4, 1 Hz). Further analysis of the refractive index dynamics and of the relations
between the observed frequencies of the refractive index changes and identified cellular processes will
be useful for a better understanding of the function and interaction of processes on different time scales
in cellular compartments at rest as well as under the influence of external stimuli.
We conclude that cell imaging and the investigation of cellular dynamics with a combination of
phase-modulation interference microscopy and wavelet analysis are promising approaches to non-invasive,
non-stained cell studies, cell visualization and the unravelling of the relations between different spatial
and temporal processes under normal and pathological conditions.

REFERENCES
Allen, R. D., David, G. B., & Nomarski, G. (1969). The Zeiss-Nomarski differential equipment for
transmitted light microscopy.
Andreev, V. A., & Indukaev, K. V. (2003). The problem of subrayleigh resolution in interference microscopy. Journal of Russian Laser Research, 24, 220-236.
Andreev, V. A., & Indukaev, K. V. (2005). Phase modulation microscope MIM-2.1 for measurements
of surface microrelief. General principles of design and operation. Journal of Russian Laser Research,
26, 380-393.
Bennett, A. H., Jupnik, H., Osterberg, H., & Richards, O. W. (1951). Phase microscopy. Principles and
Applications. New York: John Wiley & Sons, Inc.; London: Chapman & Hall, Ltd.
Brandon, D., & Kaplan, W. D. (1999). Microstructural characterization of materials. Chichester, West
Sussex, UK: JohnWiley & Sons Ltd.
Brazhe, N. A., Erokhova, L. A., Churin, A. A., & Maksimov, G. V. (2005). Investigation of different-scale membrane processes under nitric oxide influence. Journal of Biological Physics, 31, 533-546.
Brazhe, N. A., Brazhe, A. R., Pavlov, A. N., Erokhova, L. A., Yusipovich, A. I., Maksimov, G. V.,
Mosekilde, E., & Sosnovtseva, O. V. (2006). Unraveling cell processes: Interference imaging interwoven
with data analysis. Journal of Biological Physics, 32, 191-208.
Cohen, L. B., Keynes, R. D., & Hille, B. (1968). Light scattering and birefringence changes during nerve
activity. Nature, 218, 438-441.
Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia: S.I.A.M.
Davis, H. G., Wilkins, M. H. F., Chayen, J., & La Cour, L. F. (1954). The use of the interference microscope to determine dry mass in living cells and as a quantitative cytochemical method. Quarterly
Journal of Microscopical Science, 95(3), 271-304.
Discher, D. E., Winardi, R., Schischmanoff, P. O., Parra, M., Conboy, J. G., & Mohandas, N. (1995).
Mechanochemistry of protein 4.1's spectrin-actin binding domain: Ternary complex interactions, membrane
binding, network integration, structural strengthening. The Journal of Cell Biology, 130, 897-907.
Dvorak, A. M. (1991). Basophil and mast cell degranulation and recovery. New York: Plenum Press.
Graevskaya, E. E., Akhalaya, M. Y., & Goncharenko, E. N. (2001). Effects of cold stress and epinephrine on degranulation of peritoneal mast cells in rats. Bulletin of Experimental Biology and Medicine,
131, 333-335.
Grossmann, A., & Morlet, J. (1984). Decomposition of Hardy functions into square integrable wavelets
of constant shape. S.I.A.M. Journal of Mathematical Analysis, 15, 723-736.
Haller, M., Mironov, S. L., & Richter, D. W. (2001). Intrinsic optic signals in respiratory brain stem
regions of mice: Neurotransmitters, neuromodulators, and metabolic stress. Journal of Neurophysiology, 86, 412-421.

Hill, D.K., & Keynes, R.D. (1949). Opacity changes in stimulated nerve. Journal of Physiology (London), 108, 278-281.
Hille, B. (1992). Ion channels of excitable membranes. Washington: University of Washington.
Kaiser, G. (1994). A friendly guide to wavelets. Boston: Birkhauser.
Landowne, D., & Cohen, L.B. (1969). Changes in light scattering during synaptic activity in the electric
organ of the skate Raia erinacea. The Biological Bulletin, 137, 407-408.
Metuzuls, J., Montpetit, V., & Clapin, D. F. (1981). Organization of the neurofilamentous Network. Cell
Tissue Research, 214, 455-482.
Mironov, S.L., Ivannikov, M.V., & Johansson, M. (2005). Ca2+i signaling between mitochondria and
endoplasmic reticulum in neurons is regulated by microtubules: From mitochondrial permeability
transition pore to Ca2+-Induced Ca2+ release. Journal of Biological Chemistry, 280, 715-721.
Murphy, D. (2001). Differential interference contrast (DIC) microscopy and modulation contrast microscopy. In Fundamentals of Light Microscopy and Digital Imaging (pp. 153-168). New York: Wiley-Liss.
Popescu, G., Badizadegan, K., Dasari, R. R., & Feld, M. S. (2006). Observation of dynamic subdomains
in red blood cells. Journal of Biomedical Optics Letters, 11, 040503-1-3.
Rappaz, B., Marquet, P., Cuche, E., Emery, Y., Depeursinge, C., & Magistretti, P. J. (2005). Measurement
of the integral refractive index and dynamic cell morphometry of living cells with digital holographic
microscopy. Optics Express, 13, 9361-9373.
Rodnenkov, O. V., Luneva, O. G., Ulyanova, N. A, Maksimov, G. V., Rubin, A.B., Orlov, S. N., & Chazov,
E. I. (2005). Erythrocyte membrane fluidity and haemoglobin haemoporphyrin conformation: Features
revealed in patients with heart failure. International Journal Pathophysiology, 11, 209-213.
Ross, K. (1967). Phase contrast and interference microscopy for cell biologists. London, England:
Edward Arnold Publishers, Ltd.
Schutt, A., Bullock, T. H., & Basar, E. (2000). Odor input generates 1.5Hz and 3Hz spectral peaks in
the Helix pedal ganglion. Brain Research, 879, 73-87.
Sosnovtseva, O. V., Pavlov, A. N., Brazhe, N. A., Brazhe, A. R., Erokhova, L. A., Maksimov, G. V.,
& Mosekilde, E. (2005). Interference microscopy under double-wavelet analysis: a novel approach to
studying cell dynamics. Physical Review Letters, 94, 218103-1-4.
Stepnoski, R. A., LaPorta, A., Raccuia-Behling, F., Blonder, J. E., Slusher, R. E., & Kleinfeld, D. (1991).
Noninvasive detection of changes in membrane potential in cultured neurons by light scattering. Proceedings of the National Academy of Sciences USA, 88, 9382-9386.
Szucs, A., Molnar, G., & S-Rozsa, K. (1999). Periodic and oscillatory firing patterns in identified nerve
cells of Lymnaea stagnalis L. Acta Biologica Hungarica, 50, 269-278.
Yen, A., Mathieu-Costello, O., Gigli, I., & Barrett, K. E. (1994). Inhibition of mast cell mediator secretion induced by protoporphyrin plus long-wave ultraviolet light: A morphometric and ultrastructural
analysis. Journal of Allergy and Clinical Immunology, 93, 909-918.

Zetie, K. P., Adams, S. F., & Tocknell, R. M. (2000). How does a Mach-Zehnder interferometer work?
Physics Education, 35, 46-48.

Key Terms
Amplitude Modulation: Variations in the amplitude of the studied rhythms introduced by external
influence, e.g. another rhythmical process.
Frequency Modulation: Variations in the instantaneous frequency of a studied rhythm introduced
by external influence.
Interference Microscopy: One of the subtypes of phase microscopy, based on interferometers. In
general, it allows quantitative measurement of the refractive index.
Intrinsic Optical Properties: The sum of various optical properties originating from structural and
dynamical features of the specimen, such as the refractive index (RI), light scattering, absorbance and autofluorescence.
Phase Microscopy: A collective name for microscopy techniques aimed at the visualization of changes
in the phase of transmitted light introduced by a specimen, thus allowing optically transparent structures to be contrasted.
Phase-Modulation Interference Microscopy (PM-LIM): Interference microscopy in which the
optical path length of the reference beam is harmonically modulated, which makes independent measurements of the refractive index in each point of the sample feasible.
Phase Image: A result of interference microscopy. Each point of the phase image represents optical
path difference or phase shift in a corresponding point of the sample.
Wavelet Analysis: A computational technique for data-series analysis based on the wavelet transform.
Wavelets are time-localised oscillating functions. The wavelet transform is a convolution of the signal with
dilated and translated copies of the wavelet function. It resolves the spectral properties of the signal in
the time domain.


Chapter XXXVIII

Fluorescence Imaging of
Mitochondrial Long-Term
Depolarization in Cancer Cells
Exposed to Heat-Stress
Cathrin Dressler
Laser- und Medizin-Technologie GmbH, Berlin, Germany
Olaf Minet
Charité Universitaetsmedizin Berlin, Germany
Urszula Zabarylo
Charité Universitaetsmedizin Berlin, Germany
Jürgen Beuthan
Charité Universitaetsmedizin Berlin, Germany

ABSTRACT
This chapter deals with the mitochondria's stress response to heat, which is the central agent of thermotherapy. Thermotherapies function by inducing lethal heat inside target tissues. Spatial and temporal
instabilities of temperature distributions in targets require optimized treatment protocols and reliable
temperature-control methods during thermotherapies. Since solid cancers present predominant targets
of thermotherapy, we analyzed hyperthermic stress-induced effects on mitochondrial transmembrane
potentials in breast cancer cells (MX1). Heat sensitivities and stress reactions may differ greatly among
tissue species and tissue dignities (benign or malignant status); it is therefore very important to investigate
tissue-specific stress responses systematically. Even though this chapter contributes only a small piece
to the elucidation of systemic cellular heat-stress mechanisms, it may help deepen the basic knowledge
about systemic stress responses. In addition, the data presented here might support the optimization of
treatment protocols applied during thermotherapy, particularly LITT and hyperthermia.

INTRODUCTION
Heat represents a primarily environmental physical stress factor inducing multiform stress reactions
in cells and tissues exposed to unphysiologically elevated temperatures. Molecular mechanisms triggered by heat stress include the enhanced synthesis of so-called heat shock proteins (HSP), which provide
limited protection to cells and organisms against heat-induced damage [Lindquist, 1986; Sonna, 2002;
Takayama, 2003; Kregel, 2002]. HSP are also involved in other stress responses initiated by non-thermal
stress factors, like oxidative stress, energy and nutrient depletion, or drug toxicity [Kregel, 2002]. The
effectiveness of HSP-mediated damage protection depends on the stress factor dose, the exposure time,
and the cell species encountering stress. Hyperthermic stress in particular induces cell-type-specific
stress reactions involving complex networks of genetic and biochemical changes [Sonna, 2002]. Detailed
knowledge about molecular HSP functions and the resulting metabolic interactions does not really exist
so far, although they are under intensive investigation worldwide [Kregel, 2002].
Systems biology tools can aid the understanding of HSP regulation: HSP interact with multiple key
components of signalling pathways that regulate growth and development. The molecular relationships
between heat shock proteins and various signalling proteins appear to be critical for the normal function of signal transduction pathways. The relative levels of these proteins may be important, as too little
or too much Hsp70 or Hsp90 can result in aberrant growth control, developmental malformations and
cell death. Although the functions of HSP as molecular chaperones have been well characterized, their
complementary role as 'stress-induced' proteins that monitor changes and alter the biochemical environment of the cell remains elusive. Genetic and molecular interactions between HSP, their co-chaperones
and components of signalling pathways suggest that crosstalk between these proteins can regulate
proliferation and development by preventing or enhancing cell growth and death, as the levels of HSP
vary in response to environmental stress or disease.
The field of clinical cancer therapy comprises two major groups of ablative thermotherapy
methods: 1) laser-induced thermotherapy (LITT), working with locally confined temperatures between
50°C and 150°C to induce target tissue coagulation [Gewiese, 1994; Roggan, 2001; Nikfarjam, 2003],
and 2) hyperthermia, working with temperatures below 45°C [van der Zee, 2002]. Combined modality treatments such as hyperthermia-assisted radiotherapy and/or chemotherapy are under intensive
scientific investigation today [van der Zee, 2002; Debes, 2004; Hehr, 2003; Colombo, 2003]. Another
approach to interstitial thermoablation has been published by Jordan et al. (2006) and Johannsen (2007),
who suggested the injection of magnetic nanoparticles into a tumor followed by the application of
alternating-current magnetic fields inducing elevated temperatures inside the particle-loaded tissue [Campbell,
2007; Johannsen, 2007; Jordan, 2006]. Therapy preconditions for this modern nanomedicine approach,
concerning sufficient target loading with nanoparticles as well as feasible ways of providing
homogeneous particle distributions, were not discussed in this context.
Generally, thermotherapy is successful if the temperature induced in a target tissue is high enough to
cause lethal effects and to induce thermal destruction [Gewiese, 1994; Roggan, 2001; Nikfarjam,
2003; van der Zee, 2002; Debes, 2004; Hehr, 2003; Colombo, 2003]. Since heat transfer and distribution
inside a target tissue usually are irregular and exhibit spatial and temporal instabilities, there is
always a risk that certain areas of the target will survive the thermotherapeutic intervention and cause
persistence or recurrence of the disease [Nikfarjam, 2003; Gellermann, 2005; Mack, 2004]. Thresholds for
thermal damage in human tissues vary among tissue species as well as among normal and diseased

tissues [Dewhirst, 2003; Park, 2005]. Nevertheless, a computer-aided dosimetry model was developed
to support irradiation planning for clinical LITT applications [Roggan, 2001].
Especially in the temperature range beneath 50°C (122°F), therapeutic success is predominantly
influenced by the tissue-specific response properties of the target volume. At subcoagulating temperature
levels, specific stress-response mechanisms, including the complex reaction cascades of apoptosis,
might be activated in single cells or confined tissue areas. In any case, the outcome will always be either
cell survival or cell death resulting from apoptosis or necrosis. The signaling pathways, molecular
genetics, biochemical reactions and cell phenomenologies of apoptotic and necrotic death processes are
entirely different from one another and should be strictly kept apart [Vaux, 2002; Proskuryakov, 2003].
Modulation of signaling pathways enables cells to switch between both death processes [Proskuryakov,
2003].
Whole-body hyperthermia or fever does not necessarily place the human organism in a critical situation
unless body temperatures above 40°C are reached. Nevertheless, fever has been proposed as a curing
agent for centuries. A considerable number of historical reports about tumor regressions under febrile
conditions have been cited and commented on by Hobohm [Hobohm, 2001; 2005]. Concepts of oncological
fever therapy were attributed to the common notion that cancers are more sensitive to heat than normal
tissues because of peculiar intrinsic conditions, mainly concerning deviant metabolism, disorganized
vascular perfusion and vessel permeability when compared with surrounding tissues [Campbell, 2007;
Hobohm, 2001]. Satisfactory evidence supporting the statement that cancer tissues exhibit higher
thermosensitivity than healthy tissues has not been assembled yet. Such reasoning is complicated by the
circumstance that tissues in general, owing to their species, physiologies and molecular genetics, show
extremely different response patterns when exposed to heat stress.
Several reports have demonstrated mitochondria to be major targets of hyperthermic stress inside
eukaryotic cells [Funk, 1999; Huckriede, 1995; Lai, 1996; Macouillard-Poulletier, 1998, 2000]. Mitochondria are highly dynamic organelles that frequently move inside cells and exhibit morphological as
well as biochemical changes during physiological cell metabolism and stress responses [Jakobs, 2006].
Consequently, these organelles play a central role in the cell's fate when it is affected by physical,
biochemical or environmental stress factors. In the present study the heat responses of mitochondria
in MX1 breast carcinoma cells were investigated within the hyperthermic range between 40°C
and 56°C. Spatial mitochondrial distributions were imaged by structure labeling with mitochondria-specific
MitoTracker Green FM (MTG). Functional states of mitochondria were analyzed by membrane
potential-dependent fluorescence labeling, using the mitochondrial transmembrane potential (ΔΨm)
sensor JC-1.
ΔΨm is established by the proton pumps of the mitochondrial electron transport chain and is significantly
involved in energy-providing processes and calcium metabolism [Duchen, 1998]. In intact, highly
polarized mitochondria JC-1, a lipophilic cation, accumulates as orange-red fluorescing J-aggregates,
while in damaged cells with depolarized mitochondrial membranes the dye forms green fluorescent
monomers. Consequently, the relation between red and green fluorescence intensities describes ΔΨm and
gives evidence of the physiological or pathological state of the mitochondria inside the cells under
investigation [Cossarizza, 1993]. Absorption and fluorescence properties of the monomer and multimer forms of
the JC-1 sensor dye are illustrated in Figure 1A. The mitochondrial labeling characteristics of MTG, on
the other hand, are not affected by ΔΨm, since this dye is essentially nonfluorescent in aqueous solutions
and becomes fluorescent upon accumulation in the lipid environment of mitochondria [http://probes.invitrogen.com/handbook/sections/1202.html].
Therefore MTG fluorescence is largely independent of morphological or physiological alterations.
Spectral properties of MTG are given in Figure 1B.

Figure 1. Absorption and fluorescence spectra of the JC-1 monomer and J-aggregate (A) [http://probes.
invitrogen.com/servlets/spectra?fileid=3168p82] and of MitoTracker Green FM dissolved in methanol (B)
[http://probes.invitrogen.com/servlets/spectra?fileid=7514moh]. Both dyes were purchased from Molecular Probes, Invitrogen (Germany).

Depending on the stress temperature the cells were exposed to, the mitochondria were more or less
depolarized, which led to an increase of green fluorescent JC-1 monomers accompanied by a decrease
of orange-red J-aggregates. This effect was rather weak under comparatively mild hyperthermic stress
conditions between 40°C and 45°C, while under severe stress at 50°C or 56°C the response was more
pronounced and clearly documented by a very low ΔΨm. Mitochondria in MX1 cells consequently react
to heat stress with a ΔΨm depolarization of temperature-dependent extent.


MATERIAL AND METHODS


Cell cultivation: Human undifferentiated breast cancer cells of the species MX1 (Deutsches Krebsforschungszentrum, Germany) were employed as a tissue model. Cells were maintained at 37°C in RPMI 1640 medium
supplemented with 20 mM HEPES buffer, 10% (v/v) heat-inactivated fetal calf serum and 1% antibiotic-antimycotic solution containing 100 U/mL penicillin, 100 µg/mL streptomycin and 0.25 µg/mL amphotericin B (PAA Laboratories GmbH, Germany). Monolayer cultures were dissociated with 0.05%
trypsin-0.02% EDTA (PAA Laboratories GmbH, Germany). All other culture medium components and
solutions were purchased from Biochrom KG seromed, Germany. Experimental cells were grown in glass
chamber slides (Nunc GmbH & Co. KG, Germany) until sub-confluent cell densities were achieved.
Thermal stressing: Heat-stress treatments were performed in a temperature-regulated water bath
at the temperatures given with the results (40°C, 42°C, 45°C, 50°C, or 56°C) for 30 min each. Control
cells were not submitted to hyperthermic stress but were continuously kept at 37°C. Immediately after
stressing, cells were submitted to subsequent experimental processing.
Mitochondrial fluorescence labeling: Mitochondrial structures were visualized by labeling with the
organelle-specific MitoTracker Green FM (MTG). MTG was diluted in culture medium and applied at
a final concentration of 75 nM. The mitochondrial transmembrane potential sensor JC-1 (5,5′,6,6′-tetrachloro-1,1′,3,3′-tetraethylbenzimidazolylcarbocyanine iodide) was used to label mitochondria in a transmembrane potential-dependent manner. The JC-1 stock solution was prepared in anhydrous dimethyl
sulphoxide and diluted in supplemented culture medium to a final concentration of 9.4 µM. In
physiologically polarized cells JC-1 accumulates at mitochondria as red fluorescent J-aggregates, while
in cells with depolarized mitochondria the dye forms green fluorescent monomers [http://probes.invitrogen.com/servlets/spectra?fileid=3168p82]. Incubation periods with either sensor dye were 30 min at
37°C. Both dyes were purchased from Molecular Probes, Invitrogen, Germany. Labeling procedures
were performed following the heat treatments according to the manufacturer's recommendations
[http://probes.invitrogen.com/media/publications/159.pdf].
Fluorescence microscopy: Mitochondrial fluorescence was imaged with a wide-field epifluorescence
microscope Axiovert 200M (Carl Zeiss, Germany) combined with a digital color camera AxioCam
MRc (Carl Zeiss, Germany). JC-1-labeled cells were imaged with the red fluorescence channel (excitation,
546±12 nm band-pass filter; detection, >590 nm long-pass filter) and the green fluorescence channel (excitation,
450-490 nm band-pass filter; detection, >515 nm long-pass filter). Illumination times required for image acquisition with respect to optimal fluorescence imaging were digitally recorded by the AxioCam
MRc camera for every image (variation range, 268-2354 ms).
Statistical evaluation: Ratios of the fluorescence amplitudes measured in the red and the green channel were evaluated for every image frame after correcting the amplitudes with the illumination times
applied in each case. The average value in every temperature group was used for the statistical assessment
of red and green fluorescence intensities.
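
To make the described correction concrete, the short Python sketch below computes an illumination-time-corrected red/green ratio for a single image pair; the array names, sizes and example exposure times are placeholders, and this is not the authors' actual processing pipeline.

# Sketch of the red/green ratio correction described above (illustration only).
# Assumption: red and green fluorescence images of the same field, acquired
# with known illumination times (in ms) and loaded as NumPy arrays.
import numpy as np

def corrected_red_green_ratio(red_img, green_img, t_red_ms, t_green_ms):
    """Ratio of mean amplitudes, each normalized by its own exposure time."""
    red_rate = red_img.astype(float).mean() / t_red_ms       # counts per ms
    green_rate = green_img.astype(float).mean() / t_green_ms
    return red_rate / green_rate

# Example with synthetic images (exposure times taken from panel A of Figure 3)
rng = np.random.default_rng(0)
red = rng.poisson(120, size=(512, 512))
green = rng.poisson(80, size=(512, 512))
print(corrected_red_green_ratio(red, green, t_red_ms=268, t_green_ms=645))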

RESULTS
After labeling MX1 mitochondria with MTG, more or less even cytoplasmic distributions were observed
inside intact control cells (37°C). Because of the comparatively high organelle numbers per cell, single
mitochondria could not be distinguished from each other at the optical magnification applied in Figure
2. Under mild stress conditions at 40°C the mitochondrial fine structure was already diminished when
compared with the control. In the 42°C group, aggregation of mitochondria was obvious. This process
was enhanced in the 45°C group, clearly supported by the changes of cell morphologies and sizes,
since MX1 cells were smaller and rounded after stress at 45°C. Treatments under more severe stress conditions
(50°C or 56°C) resulted in pronounced diffuse pan-cellular distributions, although cell morphologies
again resembled those of the control cells as well as of the 40°C and 42°C stress groups. This clearly indicated
that 30 min at 45°C induced an active heat-stress response, during which the cells shrank and rounded up,
while higher stress temperatures no longer permitted active cell responses (Figure 2).

Figure 2. Mitochondrial fluorescence in heat-stressed MX1 cells after labeling with MitoTracker Green
FM (Molecular Probes, Invitrogen, Germany). Panel A shows control cells incubated at 37°C; panels
B-F show heat-stressed cells treated for 30 min each with the temperature indicated at every panel
(B, 40°C; C, 42°C; D, 45°C; E, 50°C; F, 56°C).


Since MTG labeling did not provide any information about the functional states of heat-stressed mitochondria, MX1 cells were then labeled with the mitochondrial membrane potential sensor JC-1 in
order to monitor heat-stress-induced effects on ΔΨm. The fluorescence micrographs in Figure 3 clearly

Figure 3. Green (left) and red (right) fluorescence of heat-stressed MX1 cells after mitochondrial transmembrane potential-dependent labeling with JC-1 (Molecular Probes, Invitrogen, Germany). Panel A
shows control cells incubated at 37°C; panels B-F show heat-stressed cells treated for 30 min each with
the temperatures indicated at every panel (B, 40°C; C, 42°C; D, 45°C; E, 50°C; F, 56°C). Illumination
times employed for acquiring the images were: panel A) 645 ms green, 268 ms red; panel B) 580 ms green,
295 ms red; panel C) 711 ms green, 633 ms red; panel D) 571 ms green, 656 ms red; panel E) 765 ms green,
1172 ms red; panel F) 677 ms green, 2345 ms red.


showed the membrane potential-correlated labeling of mitochondria in MX1 cells. With increasing stress
temperatures of 40°C, 42°C, 45°C, 50°C, or 56°C, the ΔΨm-dependent orange-red and green fluorescence
in JC-1-labeled MX1 cells was altered when compared with control cells. The orange-red fluorescence
of J-aggregates at mitochondria with high ΔΨm was imaged in both fluorescence channels, while the
green fluorescence of monomeric JC-1, of course, was only imaged in the green channel.
Whereas the mitochondria in the control group and the 40°C, 42°C, and 45°C stress groups exhibited
orange-red fluorescence, the signal color in the red channel changed in the 50°C and 56°C stress
groups: hyperthermic temperatures of 50°C or 56°C caused JC-1 to show deep red fluorescence (Figure 3).
A direct readout of the fluorescence intensities imaged in both channels for every object was not possible,
since the acquisition times used for making the images were digitally adjusted for optimal
contrast in fluorescence imaging. Therefore, the ratio of the mean fluorescence amplitudes measured in
the red and the green channel was corrected with the inverse ratio of the acquisition times to obtain the
real red/green ratio for every region of interest or every image pair, respectively. The average red/green
ratios of the control group and the different heat-stress groups were plotted against the temperatures
the MX1 cells were treated with (Figure 4). The red/green fluorescence ratios were shown to decrease with
increasing stress temperature in a nonlinear manner, because the orange-red fluorescence intensities
discontinuously decreased while the green fluorescence intensities increased, and consequently the red/green
ratios declined.
The nonlinear interrelation between mitochondrial depolarization and stress temperature is represented
by the exponential function shown in Figure 4. The red-to-green fluorescence intensity ratios
decayed rather rapidly in the temperature range 40°C-45°C and switched over to a slower
decay at higher temperatures, following an exponential curve progression as illustrated in Figure 4.
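
To illustrate how such an exponential decay can be quantified, a minimal curve-fitting sketch is shown below. The ratio values are invented placeholders that only follow the trend reported in the text (about 1.7 in the control group down to about 0.2 at 56°C), and the chosen parametrization is one plausible form of an exponential decay, not the authors' exact function.

# Illustrative exponential fit of red/green ratios versus stress temperature.
# The data points are hypothetical placeholders consistent with the trend
# described in the text; R(T) = a*exp(-b*(T - 37)) + c is one plausible model.
import numpy as np
from scipy.optimize import curve_fit

temperature = np.array([37.0, 40.0, 42.0, 45.0, 50.0, 56.0])   # degrees C
ratio = np.array([1.7, 1.5, 1.3, 0.65, 0.5, 0.2])              # hypothetical

def decay(T, a, b, c):
    return a * np.exp(-b * (T - 37.0)) + c

params, _ = curve_fit(decay, temperature, ratio, p0=(1.5, 0.2, 0.2))
a, b, c = params
print(f"fit: R(T) = {a:.2f} * exp(-{b:.2f} * (T - 37)) + {c:.2f}")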
In Figure 5 the fluorescence intensities in the green and red channel were separated and illustrated
as monochromatic 3D profiles.


Figure 4. Average ratios of the fluorescence amplitudes measured in the red and the green fluorescence channel. Red/green ratios indicate changes of the quantitative relation between orange-red fluorescent
J-aggregates in MX1 cells with high ΔΨm and green fluorescent JC-1 monomers in cells with low ΔΨm.
Standard deviations are given as mean square deviations.

Figure 5. Green and red fluorescence distributions in heat-stressed MX1 cells after mitochondrial membrane potential-dependent labeling with JC-1 (Molecular Probes, Invitrogen, Germany). The regions of
interest (left) were divided into red (center) and green (right) data fractions for the green fluorescence
channel (at the top of every panel) and the red fluorescence channel (at the bottom of every panel) and
illustrated as 3D fluorescence distributions. Panel A shows control cells incubated at 37°C; panels B-F
show heat-stressed cells treated for 30 min each with the temperatures indicated at every panel
(B, 40°C; C, 42°C; D, 45°C; E, 50°C; F, 56°C). Two representative cells are shown in every panel.


Generally, these results indicated that the amount of red fluorescent J-aggregates accumulating
at mitochondria with comparatively high transmembrane potentials decreased with increasing stress
temperature, in contrast to the enhanced green fluorescence of JC-1 monomers in cells with more or less
depolarized ΔΨm. In the control group the red/green ratio was approximately 1.7. Under mild heat-stress
conditions at 40°C or 42°C the red/green ratios decreased only slightly compared with control cells. The
ratios in these groups were clearly above 1, denoting that the orange-red signal intensity of the J-aggregate
fluorescence was higher than the green fluorescence signal intensity. With higher stress temperatures of
45°C and 50°C the ratios drastically decreased to values below 0.7, because, accordingly, the red fluorescence
intensities were weaker than the green fluorescence intensities. Exposing cells to 56°C heat resulted in
pan-cellular green and red fluorescence distributions, in which the green signal intensities were much higher
than the red signal intensities, the latter representing only residual background fluorescence. Therefore, in the
56°C stress group the average red/green ratio was again drastically decreased, reaching a value of approximately 0.2 and showing that the red fluorescence intensity was approximately 20% of the green fluorescence
intensity measured in this experimental group. The decrease of the red signal intensity resulted from the massive
depolarization of ΔΨm under exposure to 56°C heat, accompanied by structural organelle destruction
enhancing the green fluorescence of JC-1 monomers inside severely damaged MX1 cells.
The fluorescence intensity distributions inside cells imaged in the red and the green fluorescence
channel were derived as three-dimensional (3D) illustrations. Characteristic examples of 3D fluorescence intensity profiles are shown in Figure 5. Here the fluorescence amplitudes were corrected with
the respective acquisition times in order to compare the distributions of absolute signal intensities with
each other. Each 3D image is z-scaled to the maximum of the fluorescence intensity. The red fluorescence
intensity maxima (RFIM) were highest and nearly constant in the temperature range between 37°C and
42°C. From a stress temperature of 45°C on, the RFIM drastically declined while the green fluorescence
intensity maxima (GFIM) exceeded the RFIM. This again indicated an amplified depolarization of the
mitochondria in response to hyperthermic stress.

DISCUSSION
The aim of the present study was to investigate mitochondrial responses to heat stress in the temperature
range between 40°C and 56°C (30 min each) in the breast cancer cell species MX1. In this subcoagulative
stress range, obvious mitochondria-involving stress responses were expected. As a stress-sensitive indicator,
the mitochondrial ΔΨm was analyzed by fluorescence microscopy after selective staining of mitochondria
with JC-1. The ΔΨm results were correlated with MTG-mediated structure analysis. Several studies have
documented ΔΨm to be a parameter sensitive to various kinds of environmental impairments [Keshavan,
2004; Zunino, 2006; Lieven, 2003; Cossarizza, 1993]. Since heat also represents an environmental stress
factor, we were interested in the cellular ΔΨm response to elevated temperatures.
Hyperthermia has been proposed as a potential therapeutic modality in clinical cancer therapy
[van der Zee, 2002]. The cellular mechanisms involved in heat-stress responses and their impact on
various subcellular structures have so far only been described for selected tissue models. Funk et al.
studied heat-stress-induced changes of mitochondrial morphologies in astrocytes and MDCK cells by
video-enhanced contrast microscopy using a perfusion cell chamber system [Funk, 1999]. In their study
the morphological alterations inside mitochondria exposed to moderate hyperthermic stress conditions
were revealed to be reversible. Recovery of mitochondrial changes subsequent to heat treatment was
also detected in microglial cells [Macouillard-Poulletier, 2000]. Metabolic investigations on microglial
cells revealed a drop of the physiological ATP content by 60% 1 h after a 20 min heat shock at 45°C,
indicating that heat stress scales down the energy resource ATP in this cell type [Macouillard-Poulletier, 1998].
In heat-shocked rats, failure of energy metabolism and ATP depletion were detected as
the earliest cell-damaging factors of ischemic insult [Wang, 2005]. These data emphasize the interplay
of energy-supplying processes during cellular stress responses.
Several subcellular structures have already been analyzed in MX1 cells after heat stress. In a previous
study the F-actin cytoskeleton was shown to be thermally more sensitive than the plasma membranes,
since F-actin fibers exhibited morphological alterations under comparatively mild stress at 40°C or
42°C, while plasma membrane morphologies were not affected by these conditions. Only temperatures
exceeding 42°C induced detectable morphological changes in plasma membranes [Dressler, 2005; Beuthan,
2004]. These results did not give any information about the functionality of the investigated cell
components. It had previously been shown that MX1 cell viability was not attenuated after stress at 40°C
or 42°C, whereas exposure to 45°C or higher temperatures increasingly diminished cell viability.
Comparing these results, it can be concluded that the reorganization of the F-actin cytoskeleton in MX1 cells
resulted from an active stress response under hyperthermic stress at 40°C or 42°C, which is supposed to be
compensated during cellular recovery from heat stress. It was also demonstrated earlier that MX1 cells
undergo necrosis during a 30 min treatment at 56°C, but not at 50°C or lower temperatures, because cells
exposed to 50°C did not exhibit necrotic phenotypes [Beuthan, 2004]. Therefore, complete depolarization
of mitochondria in cells stressed at 56°C was a consequent result of necrosis, a process that was clearly
initiated during the 50°C treatment. Mitochondrial permeability transition in general is an early sign of the
initiation of cellular apoptosis or necrosis, provoking a collapse of the electrochemical gradient across
mitochondrial membranes [Crompton, 1999; Kim, 2003; Cossarizza, 1993].
Cellular responses to sublethal stress temperatures in particular, as used in different therapeutic hyperthermia applications, are essential for understanding the tissue-specific effects induced by heat.
Mild or severe heat stress is not unequivocally definable and depends on cell species, tissue origin, cell
cycle, developmental stage, as well as exposure time. There possibly exists only a rough borderline between
lethal and sublethal heat stress [Park, 2005], which might be influenced by intrinsic and environmental
factors.
As the central factories of cellular energy metabolism, mitochondria present cardinal targets of hyperthermic stress and other non-thermal stressors [Funk, 1999; Huckriede, 1995; Lai, 1996; Macouillard-Poulletier, 2000]. Mitochondrial dysfunction leads to a marked decrease of cellular ATP levels and
phosphorylation efficiencies [Wang, 2005], thus causing problems with the energy supply.
Depolarization of mitochondria was detected in MX1 cells after heat stress, and a nonlinear
interrelation between the mitochondrial depolarization dynamics and the stress temperature was revealed
[Dressler, 2006]. This response was reflected by an exponential decay function describing the average
ratios of fluorescence amplitudes measured for the membrane potential sensor JC-1 in the red and the
green emission channel (Figure 4). The standard deviation, which is comparatively large in this case,
should be reduced by increasing the number of investigated micrographs.
In myocytes, an exponentially proceeding depolarization of mitochondria was also detected after
the application of FCCP (carbonyl cyanide 4-(trifluoromethoxy)phenylhydrazone). In that study, the decrease
of ΔΨm-dependent tetramethylrhodamine ethyl ester (TMRE) fluorescence was measured when spontaneous
transient ΔΨm depolarizations were observed [O'Reilly, 2003]. Those depolarizations were reversible
and not permanent as in our study [Dressler, 2006]. Apparently mitochondrial depolarizations follow
exponential dynamics.
Only the red fluorescent J-aggregates measure the ΔΨm-dependent accumulation at mitochondria,
whereas the green fluorescence depends on the passive binding of JC-1 monomers to any cellular membrane.
It should be considered that the fluorescence intensity ratios of the red and green emissions relate to
phenomena occurring in different cellular regions [Bernardi, 1999]. Since the orange-red fluorescence of
aggregated JC-1 strictly depends on high ΔΨm, the fluorescence intensity ratios measured in this study
reflect the heat-induced depolarization of mitochondria.
The nonlinear response of MX1 mitochondria to hyperthermic stress corresponds very well with
our results obtained using various microscopic techniques to investigate heat-stress responses of different
subcellular structures in MX1 cells [Dressler, 2005]. In our previous studies, nonlinear responses
of plasma membranes and cytoskeletons in heat-stressed MX1 cells were also observed [Dressler, 2005;
Beuthan, 2004].
Applications of heat as a therapeutic agent are in general complicated by spatial and temporal variations
of the temperature distribution inside a target volume and the surrounding structures [Dewhirst, 2003].
It is also possible that different subcellular components in a certain cell species exhibit different heat
sensitivities. Consequently, thermal destruction-inducing therapies should apply very precise and specific
intervention protocols concerning the volume and location of the diseased tissue, blood supply and
perfusion dynamics, heat dose, application geometry and control techniques, as well as the patient's
physical condition. Only a well-balanced interplay of all relevant parameters may ensure a successful
thermotherapy [Roggan, 2001].

Connecting Experimental Data to Models


The use of systems biology tools helps to enrich the a priori information available for laboratory work.
Cellular reaction networks are stored in several databases (ConsensusPathDB, BioCarta) that help
users to summarize and verify cellular processes, pathways and systemic interrelations.
Understanding HSP-associated pathways requires the integration of different kinds of functional
annotations. The HSP network is illustrated as an interactive graphic model of cellular pathways (Figure
6). ConsensusPathDB assists the development, expansion and refinement of computational models of
biological systems and the context-specific visualization of models provided in SBML. The database model
allows the integration of information on metabolic, signal transduction and gene regulatory networks.
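
As a small illustration of how such SBML-encoded pathway models can be inspected programmatically, the Python sketch below uses the python-libsbml package to load a model and list its species and reactions; the file name is a placeholder, and the snippet is not a specific ConsensusPathDB or BioCarta API call.

# Minimal sketch for inspecting an SBML pathway model, e.g. one exported from
# a pathway database (illustration only; the file name is hypothetical).
import libsbml

doc = libsbml.readSBML("hsp_pathway_model.xml")
if doc.getNumErrors() > 0:
    doc.printErrors()

model = doc.getModel()
if model is not None:
    print("species:", model.getNumSpecies(), "reactions:", model.getNumReactions())

    # List the molecular species (e.g. an HSP or a transcription factor) ...
    for species in model.getListOfSpecies():
        print("species:", species.getId(), species.getName())

    # ... and the reactions connecting them (reactants -> products).
    for reaction in model.getListOfReactions():
        reactants = [r.getSpecies() for r in reaction.getListOfReactants()]
        products = [p.getSpecies() for p in reaction.getListOfProducts()]
        print("reaction:", reaction.getId(), reactants, "->", products)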
The heat shock response depends on a complex regulatory network involving 21 known transcription factors and 4 HSP families. It is well known that HSP and heat shock transcription factors (Hsfs) are involved
in the cellular response to various forms of stress besides heat [Swindell, 2007]. However, the role of HSP
and Hsfs under cold and non-thermal stress conditions is not well understood, and it is unclear which
types of stress interact least or most strongly with HSP and Hsf response pathways. To address this issue, transcriptional response profiles should be analyzed and evaluated in the near future.
Figure 6. Stress induction of HSP regulation. Pathway information is provided by BioCarta [http://cgap.
nci.nih.gov/Pathways/BioCarta/h_hsp27Pathway]


CONCLUSION
Our fluorescence investigations of mitochondrial responses to hyperthermic stress in MX1 breast
cancer cells revealed a comparatively high thermosensitivity, since mitochondrial morphologies and
cellular distributions, as visualized by MTG labeling, already changed after exposure to a mild stress
temperature of 40°C (30 min). These morphological changes were not directly accompanied by ΔΨm
depolarization, because JC-1-labeled mitochondria exhibited only minor ΔΨm reductions after stress in
the temperature range between 40°C and 45°C when compared with control cells. More severe stress
conditions at 50°C or 56°C induced complete mitochondrial depolarization, detected by a massive decrease
of the orange fluorescence intensities shifting to weak red fluorescence. Orange fluorescence was emitted
by J-aggregates accumulated at highly polarized mitochondria, while in cells containing depolarized
organelles JC-1 formed green fluorescent monomers. Heat-induced ΔΨm depolarization of mitochondria
was monitored by the absolute fluorescence intensity ratios evaluated from the maximum fluorescence
amplitudes measured in the red and the green fluorescence image of every object (red:green ratio). Our
results should not be considered groundbreaking novelties, but they may nevertheless support further
developments in the field of thermotherapeutic dosimetry and irradiation planning (LITT). In any case,
the biology of heat still presents an open field of unresolved questions, and all experimental data will
contribute more or less essential pieces to the puzzle of systemic stress-response mechanisms.

References
Bernardi, P., Scorrano, L., Colonna, R., Petronilli, V., & Di Lisa, F. (1999). Mitochondria and cell death.
Eur. J. Biochem., 264, 687-701.
Beuthan, J., Dressler, C., & Minet, O. (2004). Laser-induced fluorescence detection of quantum dots
redistributed in thermally stressed tumor cells. Laser Physics, 14(2), 213-219.
Campbell, R. B. (2007). Battling tumors with magnetic nanotherapeutics and hyperthermia, turning up
the heat. Nanomed, 2(5), 649-652.
Colombo, R., Salonia, A., Da Pozzo, L. F., Naspro, R., Freschi, M., Paroni, R., Pavone-Malasco, M.,
& Rigatti, P. (2003). Combination of intravesical chemotherapy and hyperthermia for the treatment of
superficial bladder cancers: Preliminary and clinical experience. Crit Rev Oncol/Hematol, 47(2), 127-139.
Cossarizza, A., Baccarani-Contri, M., Kalashnikova, G., & Franceschi, C. (1993). A new method for the
cytofluorimetric analysis of mitochondrial membrane potential using the J-aggregate forming lipophilic
cation 5,5′,6,6′-tetrachloro-1,1′,3,3′-tetraethylbenzimidazolcarbocyanine iodide (JC-1). Biochem Biophys
Res Commun., 197(1), 40-45.
CPDB (ConsensusPathDB). Retrieved from http://pybios.molgen.mpg.de/CPDB
Crompton, M. (1999). The mitochondrial permeability transition pore and its role in cell death. Biochem
J, 341(2), 233-249.


Debes, A., Willers, R., Göbel, U., & Wessalowski, R. (2004). Role of heat treatment in childhood cancers: Distinct resistance profiles of solid tumor cell lines towards combined thermotherapy. Pediatr.
Blood Cancer, 45(5), 663-669.
Dewhirst, M. W., Viglianti, B. L., Lora-Michiels, M., Hanson, M., & Hoopes, P. J. (2003). Basic principles of thermal dosimetry and thermal thresholds for tissue damage from hyperthermia. Int. J. Hyperthermia, 19(3), 267-294.
Dressler, C., Minet, O., Novkov, V., Müller, G., & Beuthan, J. (2005). Microscopical heat stress investigations under application of quantum dots. J Biomed Optics, 10, 1-9.
Dressler, C., Beuthan, J., Mller, G., Zabarylo, U., & Minet, O. (2006). Fluorescence imaging of heatstress induced mitochondrial long-term depolarization in breast cancer cells. J Fluor, 16, 689-695.
Duchen, M. R., Leyssens, A., & Crompton, M. (1998). Transient mitochondrial depolarizations reflect
focal sarcoplasmic reticular calcium release in single cardiomyocytes. J Cell Biol, 142(4), 975-988.
Funk, K. R. H. W., Nagel, F., Wanka, F., Krinke, H. E., Gölfert, F., & Hofer, A. (1999). Effects of heat
shock on the functional morphology of cell organelles observed by video-enhanced microscopy. Anat
Rec, 255(4), 458-64.
Gellermann, J., Wlodarczyk, W., Hildebrandt, B., Ganter, H., Nicolau, A., Rau, B., Tilly, W., Fähling,
H., Nadobny, J., Felix, R., & Wust, P. (2005). Noninvasive magnetic resonance thermography of recurrent rectal carcinoma in a 1.5 Tesla hybrid system. Cancer Res, 65(13), 5872-5880.
Gewiese, B., Beuthan, J., Fobbe, F., Stiller, D., Müller, G., Böse-Landgraf, J., Wolf, K-J., & Deimling,
M. (1994). Magnetic resonance imaging-controlled laser-induced interstitial thermotherapy. Investig
Radiol, 29(3), 345-351.
Hehr, T., Wust, P., Bamberg, M., & Budach, W. (2003). Current and potential role of thermoradiotherapy
for solid tumours. Onkologie, 26 (3), 295-302.
Hobohm, U. (2001). Fever and cancer in perspective. Cancer Immunol Immunother, 50, 391-396.
Hobohm, U. (2005). Fever therapy revisited. Br J Cancer, 92, 421-425.
Invitrogen. Probes for Mitochondria. Retrieved from http://probes.invitrogen.com/handbook/
sections/1202.html
Invitrogen. JC-1/pH 8.2. Retrieved from http://probes.invitrogen.com/servlets/spectra?fileid=3168p82
Invitrogen. JC-1. Retrieved from http://probes.invitrogen.com/media/publications/159.pdf
Invitrogen. MitoTracker Green FM/MeOH. Retrieved from http://probes.invitrogen.com/servlets/
spectra?fileid=7514moh
Huckriede, A., Heikema, A., Sjollema, K., Briones, P., & Agsteribbe, E. (1995). Morphology of the mitochondria in heat shock protein 60 deficient fibroblasts from mitochondrial myopathy patients. Effects
of stress conditions. Virchows Arch, 427(2), 159-65.
Jakobs, S. (2006). High resolution imaging of live mitochondria. Biochim Biophys Acta, 1763, 561-575.


Johannsen, M., Gneveckow, U., Thiesen, B., Taymoorian, K., Cho, C. H., Waldöfner, N., Scholz, R.,
Jordan, A., Loening, S., & Wust, P. (2007). Thermotherapy of prostate cancer using magnetic nanoparticles: Feasibility, imaging, and three-dimensional temperature distribution. Eur Urol, 52, 1653-1662.
Jordan, A., Scholz, R., Maier-Hauff, K., van Landeghem, F. K., Waldoefner, N., Teichgraeber, U.,
Pinkernelle, J., Bruhn, H., Neumann, F., Thiesen, B., von Deimling, A., & Felix, R. (2006). The effect
of thermotherapy using magnetic nanoparticles on rat malignant glioma. J. Neurooncol., 78(1), 7-14.
Keshavan, P., Schwemberger, S. J., Smith, D. L. H., Babcock, G. F., & Zucker, S. D. (2004). Unconjugated bilirubin induces apoptosis in colon cancer cells by triggering mitochondrial depolarization. Int.
J. Cancer, 112(3), 433-445.
Kim, J-S., He, L., & Lemasters, J. L. (2003). Mitochondrial permeability transition: A common pathway
to necrosis and apoptosis. Biochem. Biophys. Res. Commun., 304(3), 463-470.
Kregel, K. C. (2002). Molecular biology of thermoregulation. Invited review. Heat shock proteins:
Modifying factors in physiological stress responses and acquired thermotolerance. J. Appl. Physiol.,
92(5), 2177-2186.
Lai, Y. K., Lee, W. C., Hu, C. H., & Hammond, G. L. (1996). The mitochondria are recognition organelles of cell stress. J. Surg. Res., 62(1), 90-94.
Lieven, C. J., Vrabec, J. P., & Levin, L. A. (2003). The effects of oxidative stress on mitochondrial
transmembrane potential in retinal ganglion cells. Antioxidants & Redox Signaling, 5(5), 641-646.
Lindquist, S. (1986). The heat-shock response. Ann. Rev. Biochem., 55, 1151-1191
Mack, M., Straub, R., Eichler, K., Söllner, O., Lehnert, T., & Vogl, T. (2004). Breast cancer metastases
in liver: Laser-induced interstitial thermotherapy - local tumor control rate and survival data. Radiol,
233(2), 400-409.
Macouillard-Poulletier de Gannes, F., Merle, M., Canioni, P., & Voisin, P-J. (1998). Metabolic and
cellular characterization of immortalized human microglial cells under heat stress. Neurochem. Intl.,
33(1), 61-73.
Macouillard-Poulletier de Gannes, F., Leducq, N., Diolez, P., Belloc, F., Merle, M., Canioni, P., & Voisin,
P-J. (2000). Mitochondrial impairment and recovery after heat shock treatment in a human microglial
cell line. Neurochem. Intl., 36(3), 233-241.
Nikfarjam, M. & Christophi, C. (2003). Interstitial laser thermotherapy for liver tumours. Brit. J. Surg.,
90(9), 1033-1047.
O'Reilly, C. M., Fogarty, K. E., Drummond, R. M., Tuft, R. A., & Walsh Jr., J. V. (2003). Quantitative
analysis of spontaneous mitochondrial depolarizations. Biophys. J., 85(5), 3350-3357.
Park, H. G., Han, S. I., Oh, S. Y., & Kang, H.S. (2005). Cellular responses to mild heat stress. Cell. Mol.
Life Sci., 62(1), 10-23.
Proskuryakov, S. Y., Konoplyannikov, A. G., Gabai, V. L. (2003). Necrosis: A specific form of programmed cell death? Experim. Cell Res., 283, 1-16.

690

Fluorescence Imaging of Mitochondrial L ong-Term Depolarization in Cancer Cells

Roggan, A., Ritz, J-P., Knappe, V., Germer, C-T., Isbert, C., Schdel, D., & Mller, G. (2001). Radiation
planning of thermal laser treatment. Med. Laser Appl., 16(2), 65-72.
Sonna, L. A., Fujita, J., Gaffin, S. L., & Lilly, C. M. (2002). Effects of heat and cold stress on mammalian
gene expression. J. Appl. Physiol., 92(4), 1725-1742.
Swindell, W. R., Huebner, M. & Weber, A. P. (2007). Transcriptional profiling of Arabidopsis heat shock
proteins and transcription factors reveals extensive overlap between heat and non-heat stress response
pathways. BMC Genomics, 8, 125
Takayama, S., Reed, J. C., & Homma, S. (2003). Heat-shock proteins as regulators of apoptosis. Oncogene, 22(56), 9041-9047.
van der Zee, J. (2002). Heating the patient: A promising approach? Ann. Oncol., 13(8), 1173-1184.
Vaux, D. L. (2002). Apoptosis and toxicology-what relevance? Toxicol., 181-182, 3-7
Wang, J-L., Ke, D-S., & Lin, M-T. (2005). Heat shock pretreatment may protect against heatstrokeinduced circulatory shock and cerebral ischemia by reducing oxidative stress and energy depletion.
Shock, 23(2), 161-167
Zunino, S. J. & Storms, D. H. (2006). Resveratrol-induced apoptosis is enhanced in acute lymphoblastic leukemia cells by modulation of the mitochondrial permeability transition pore. Cancer Lett. 240,
123-134

Key Terms
Depolarization: This term describes the process or act of neutralizing polarity. In biology, D. is a decrease in the absolute value of a cell's membrane potential. Thus, changes in membrane voltage in which the membrane potential becomes less positive or less negative are both depolarizations.
Heat Shock Proteins (HSP): This term summarizes a group of proteins that are present in all cells in all life forms and are induced by various types of environmental stress, such as heat, cold or oxygen deprivation. HSP act like chaperones, controlling the shape and location of proteins inside cells exposed to physiological or stress conditions.
Hyperthermia: In general, this term describes a condition of elevated body temperature that might cause heat stroke in an advanced state. In clinical applications, H. is intentionally induced for the thermotherapy of cancers. Local, regional, and whole-body H. need to be differentiated.
Laser-Induced Thermotherapy (LITT): LITT is a minimally invasive method for the treatment of malignant and benign tumors in different organs (i.e., liver, lung, brain, head and neck area, abdomen, prostate). The tumor is not removed by LITT but ablated in situ, while surrounding normal tissue is spared. After puncture, the laser radiation is directed into the target tissue via flexible optical waveguides and appropriate application systems. Because of light absorption, temperatures between 45 °C and 100 °C are achieved inside the target volume, inducing massive protein coagulation and destruction of the irradiated tissue.

Mitochondrial Transmembrane Potential (ΔΨm): ΔΨm is an electrical potential difference (voltage) between the interior and exterior of mitochondrial membranes. The voltage results from different electrolyte concentrations separated by the mitochondrial membrane. ΔΨm governs ion fluxes across mitochondrial membranes.

Section XI

Tools for Molecular Networks

Chapter XXXIX

Protein Interactions and Diseases


Athina Theodosiou*
Biomedical Research Foundation of the Academy of Athens, Greece
Charalampos Moschopoulos*
Biomedical Research Foundation of the Academy of Athens, Greece
Marc Baumann
Biomedicum, Helsinki University, Finland
Sophia Kossida
Biomedical Research Foundation of the Academy of Athens, Greece,
and Biomedicum, Helsinki University, Finland

ABSTRACT
In recent years, scientists have begun to understand the significance of proteins and protein interactions. Their direct connection with human diseases is now unquestionable, and proteomics has become a field of great research interest. In this chapter, we present a detailed description of the nature of protein interactions and describe the most important methodologies used for their detection. Moreover, we review the mechanisms that lead to diseases and involve protein interactions, and refer to specific diseases such as Huntington's disease and cancer. Lastly, we give an overview of the most popular computational methods used for the prediction or treatment of these diseases.

INTRODUCTION
The recent completion of many genome-sequencing projects of various organisms, from viruses to
mammals, is undoubtedly the greatest triumph of molecular biology since the discovery of the DNA
double helix. After the complete genome sequencing of many organisms, including humans, the focus of molecular biology has gradually shifted from genomes to proteomes, in order to explore
and discover the function of proteins (Eisenberg et al., 2000; Pandey et al., 2000).
One of the great challenges in the protein field is to reconstruct the complete protein interaction
network within the cells, the so-called interactome. There is great difficulty in achieving this goal as
the nature of the protein interactions varies depending on many environmental conditions that affect
the cell (Nooren et al., 2003). However, because protein interactions play a vital role in the basic functions of an organism's cells, analysis of these networks will unravel the secrets of the pathways in which the interactions in question are detected and ultimately provide insights into how diseases develop (Sam et al., 2007).
Several methods, which will be detailed within this book chapter, exist for the detection of protein
interactions. In recent years, new high-throughput methodologies have been used to detect large numbers of protein interactions in a single experiment (Piehler, 2005). Unfortunately, these methods are error-prone; therefore, the generated data need further analysis (Droit et al., 2005). Today, large numbers of protein interactions from many organisms are stored in large on-line databases and are available for academic purposes.
These data are useful for better understanding the connection between protein interactions and diseases (Chen, 2006). In this chapter, we present a detailed description of protein interactions and a full overview of the approaches that take advantage of these data to better understand specific diseases.
The chapter is organized as follows: the first section reviews the nature of protein interactions and
various experimental and computational methods for detecting and predicting them. The most important databases used for storing and integrating protein interactions and protein interactions associated with disease are described, and recent information about the human interactome is presented. The
second section describes mechanisms of protein interactions that have been shown to lead to disease
and the third section describes the computational methods that are used for the holistic understanding
of specific diseases.

PROTEIN INTERACTIONS
Introduction
One of the goals of systems biology is to understand the behaviour of biological systems by studying the molecules that are involved in them. Therefore, it is of great importance to determine the interactions taking place among these molecules. The study of protein interactions has been vital to the understanding of how proteins function within the cell, where they interact with other proteins, metabolites
and nucleic acids. More specifically, protein interactions are crucial for forming structural complexes,
for extra-cellular signalling, for intra-cellular signalling, for cell communication and for several other
aspects of cellular function.
The characterization of protein interactions is essential for understanding the molecular mechanisms of biological pathways and disease processes. Complete knowledge of these pathways will help
us to understand how diseases, such as cancer, develop. Since almost all processes are regulated by
multiple complexes, the absence of some interactions or the complete absence of physical interactions
can be the cause of disease in humans (Ryan and Matthews, 2005).

The following section focuses on general information about protein interactions. A brief description of physical protein interactions is presented, divided into two major groups: protein-protein interactions and protein-DNA interactions. Another distinction is made between predicted and experimentally measured interactions (Uetz et al., 2005). Furthermore, several experimental and computational methods for predicting protein interactions are reviewed. Moreover, several databases used to store and integrate protein interactions and interactions related to diseases are explored. In addition, the latest information available on the complete human interactome is discussed.

Protein-Protein Interactions


Most proteins live and function in very complex environments and have many potential binding partners. Some proteins are very selective about their binding partners, while others are more open-minded and can interact with different kinds of proteins, making the binding more competitive. This so-called multi-specific binding between two protein families is very common in regulatory pathways and networks (Nooren et al., 2003). There is an important distinction between the types of protein-protein interactions. They can be classified according to the proteins involved in the interactions (structural or functional), or they can be classified based on their physical properties. From the structural point of view, protein-protein interactions can occur between identical or non-identical chains (homo- or hetero-oligomers). In addition, depending on the stability and mechanism of formation of a protein-protein complex, they can be subdivided into non-obligate (short-lived) complexes and obligate (stable) complexes. Furthermore, they can also be divided into transient and permanent interactions, based on the lifetime
of the complex. Last but not least, protein-protein interactions can be classified based on their functional
role. Common functional classes are the enzyme-inhibitor complexes, antibody-protein complexes and
protein-receptor complexes.

Protein-DNA Interactions


Protein-nucleic acid interactions play an important role in various cellular processes such as transcriptional regulation, recombination, genome rearrangement, replication, repair and DNA modification. A classification of protein-DNA complexes has been attempted by various authors (Harrison, 1991; Luisi, 1995; Luscombe et al., 2000).
The process of transcription is mediated by a number of protein-protein and protein-DNA complexes. The protein factors modulating gene transcription are the transcriptional regulators, which bind to specific DNA sequences called promoter sequences. Several transcription factor-DNA interactions have revealed new insights into the molecular basis of cancer and other human diseases (Tan et al., 1998). Genome-wide protein-DNA interactions may be measured using chromatin immunoprecipitation (ChIP) in conjunction with expression microarrays (Lee et al., 2002). In contrast with protein-protein interactions, protein-DNA interactions are not obligate, as both the proteins and the DNA exist in isolation (Jones et al., 1999).

Methodologies to Detect Protein Interactions


In recent years, a huge variety of methodologies has been developed to detect protein interactions. For the detection of protein-protein and protein-nucleic acid interactions, scientists follow
different strategies and methodologies. These can be separated into two main categories: experimental methods and computational methods. The computational methodologies are used to predict potential protein interactions, to validate the results of high-throughput interaction screens, and to analyze the protein networks inferred from interaction databases. Various data mining procedures, pattern recognition techniques, and neural or Bayesian networks are used to predict protein interactions (Valencia et al., 2002).
The experimental methods can be separated into in vitro methods, which are performed in biological laboratories, and in vivo methods, which are performed in living cells (Figure 1). Because of the nature of the proteins and the different interaction types, many parameters have to be taken into consideration when in vitro methods are applied. The in vivo methods have the advantage that they are applied in a natural cellular environment. Below, an overview of the most popular experimental techniques is provided.

In Vitro Methods for Protein-Protein Interactions


The classical methods for detecting protein-protein interactions are co-immunoprecipitation (Adams et al., 2002) and pull-down assays (Vikis et al., 2004). The first uses an antibody specific to the protein of interest, which is added during cell lysis. The sample is incubated for several hours, during which the antibody forms a complex with the protein in question. All remaining proteins are washed away and, in the end, the complex is separated. Any other proteins bound to the protein in question are separated with it. Finally, Western blot analysis is used to identify the proteins (Hall, 2004).

Figure 1. In vitro and in vivo methodologies for detecting protein-protein interactions

The pull-down assay method is similar to co-immunoprecipitation, with the exception that a protein, and not an antibody, is used as the bait. Usually, Glutathione-S-transferase (GST)-fusion proteins
are used as baits.
A refinement of the above-mentioned methods is the Tandem Affinity Purification (TAP) technique (Rigaut et al., 1999; Puig et al., 2001). TAP uses a tag, consisting of two IgG binding domains and a calmodulin binding peptide separated by a TEV protease cleavage site, which is fused to the protein in question. This tag is selected to maintain the expression of the fusion protein at, or close to, its natural level. Two different affinity steps are followed in order to separate the protein complex from the TAP tag and some associated components. The purified protein complexes are subsequently analyzed by SDS-PAGE and mass spectrometry. Although the TAP technique is very useful, it cannot detect transient interactions (Piehler, 2005).
Recently, a new technology called protein microarrays has been developed (Zhu et al., 2001; Templin et al., 2003; Stoll et al., 2005). A protein microarray is a piece of glass on which many different probes are fixed. By combining small volumes of protein with these probes, valuable conclusions about protein interactions can be drawn. Many probes can be fixed on a protein microarray, so massive amounts of information can be obtained from a single experiment.

In Vivo Methods for Protein-Protein Interactions


The most popular technique is the so-called yeast two-hybrid system (Y2H) (Fields et al., 1989). In this technique, the "bait" protein is fused to a DNA-binding domain (DBD) and the "fish" proteins are fused to an activation domain (AD). The AD and DBD are parts of a transcription factor. If the two proteins interact, a reporter gene is transcriptionally activated and a color reaction can be recorded on specific media. This method can be applied in various organisms with many variations (Toby et al., 2002). However, the Y2H method does not offer any information about the kind of interaction determined between the proteins, and it is prone to errors. For this reason, the Y2H system is usually used in combination with other techniques (for example, Johnsson et al., 2003). The Y2H system is also used to detect protein-DNA interactions (Joung et al., 2000).
Another commonly used in vivo technique is the phage display method (Smith et al., 1997; Willats, 2002). This method integrates multiple genes from a gene bank into phages, which are subsequently added to a small plastic dish containing the protein of interest. The dish is washed, and the phages displaying proteins that interact with the protein of interest remain attached to the dish. The DNA extracted from the interacting phages contains the sequences of the interacting proteins. The limit on the size of the displayed protein sequence is a drawback of this method.
Although these two methods detect protein interactions in vivo, they do not function in real time. Currently, the most powerful technique for real-time detection in living cells is fluorescence resonance energy transfer (FRET) (Yan et al., 2003). FRET describes an energy transfer mechanism between two fluorescent molecules. Fluorescence microscopy techniques and flow cytometry can be combined with FRET to take advantage of its features (Kenworthy, 2001; Chan et al., 2004). However, these techniques
are experimentally and technically very demanding.

Methods to Detect Protein-Nucleic Acid Interactions


The most commonly used technique for the detection of protein-nucleic acid interactions is called
Electrophoretic Mobility Shift Assay (EMSA) (Fried, 1989; Jing et al., 2004). The assay is based on the
fact that protein-DNA complexes migrate more slowly through a native polyacrylamide or agarose gel
than unbound DNA. The individual protein-DNA complexes can be visualized as discrete bands within
the gel using chemiluminescence or radioisotopic detection. The method has the ability to recognize all the proteins and DNA fragments that interact with each other. A variation of this method is the technique called Supershift Assay (Denissova et al., 2000), which uses antibodies to make the complexes between proteins and DNA more stable within the gel.
Another relatively old technique is the one called DNA Footprinting (Petri et al., 1997). The method
uses an enzyme that can cut or modify DNA at every base pair. However, the fragments of DNA that
interact with a protein are protected from these changes. At the end of the experiment, the DNA is
examined in order to record the changes that occurred. The unmodified remaining parts are the ones
interacting with the protein.

Human Protein Interactome


Proteins interact with other proteins to form complexes, and complexes are part of an extensive network. The so-called interactome network is the complete collection of all physical protein-protein interactions that can take place within a cell. The first large-scale protein interaction studies were done in yeast (Uetz et al., 2000; Ito et al., 2001), and more recently in the fly (Giot et al., 2003) and the worm (Li et al., 2004). After these studies, the research community has put more emphasis on the human interactome. A comprehensive and accurate mapping of the human protein interaction network (Colland et al., 2004; Stelzl et al., 2005) has been constructed. Interaction maps were constructed from
literature (Ramani et al., 2005) and from experimental approaches (Rual et al., 2005; Stelzl et al., 2005).
A catalog of all human protein-protein interactions is seen as a crucial prerequisite to understand how
cells function and to decipher the general principles governing this function. Importantly, such information should also enhance the understanding of complex disease processes such as cancer. In various bioinformatics analyses, information concerning the human interactome has been collected and maps have been constructed by identifying conserved orthologous interactions (Lehner and Fraser, 2004). However,
transferring interaction information from model organisms to humans has been shown to be a difficult
task (Bork et al., 2004; Ramani et al., 2005).
In order to understand disease mechanisms and signalling cascades, smaller protein interaction networks, representing parts of the human interactome, were generated. For instance, the interaction network for Huntington's disease included 186 interactions (Goehler et al., 2004) and the network for the transforming growth factor-β signalling pathway contained 755 interactions (Colland et al., 2004). Moreover, a study of the interaction attributes of all known human cancer genes has been attempted, which showed that cancer proteins display a different global topology from non-cancer proteins (Jonsson and Bates, 2006). This study clearly demonstrated the central role of cancer proteins within the human interactome. The human protein interactome has also been shown to reveal information about potential new target genes responsible for genetic diseases (Xu and Li, 2006).

Protein Interaction Graphs


The new biological methodologies have generated large amounts of data concerning protein-protein interactions. A very efficient way of summarizing these new datasets is by forming protein interaction graphs (Figure 2). These graphs provide a valuable tool that helps in better understanding the functional organization of the proteome.

Figure 2. A subgraph of the human interactome derived from the DIP database, as represented with the help of the Cytoscape tool

A graph is represented as G = (V, E), where V is the set of graph vertices and E is the set of graph edges. In a protein interaction graph, the vertices represent the proteins and the edges represent the pairwise interactions between two proteins. A protein interaction graph can be weighted or unweighted. In a weighted graph, each edge connecting two proteins is characterized by a number that represents the validity of the connection between these two proteins. In an unweighted protein interaction graph, this number is assumed to be equal to 1 for all edges of the graph.
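As a concrete illustration of this representation, the weighted and unweighted variants can be sketched with a general-purpose graph library such as NetworkX; the protein names and confidence scores below are hypothetical placeholders, not data from any of the databases discussed in this chapter.

# A minimal sketch of a protein interaction graph G = (V, E) using NetworkX.
# Proteins are vertices; each pairwise interaction is an edge carrying an
# optional confidence weight. Identifiers and scores are hypothetical.
import networkx as nx

interactions = [
    ("TP53", "MDM2", 0.92),
    ("TP53", "EP300", 0.74),
    ("MDM2", "UBE3A", 0.41),
]

G = nx.Graph()
for protein_a, protein_b, confidence in interactions:
    G.add_edge(protein_a, protein_b, weight=confidence)

# The unweighted variant simply treats every edge weight as equal to 1.
G_unweighted = nx.Graph(G.edges())

print(G.number_of_nodes(), G.number_of_edges())   # vertices and edges
print(G["TP53"]["MDM2"]["weight"])                # validity of one connection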
Generally, protein interaction graphs are undirected and unweighted. Some properties have been identified as common to the protein interaction graphs of all organisms. First of all, they are all scale-free. Moreover, similar proteins have been shown usually to interact with each other and to lie within a short distance of each other in the interaction graph. Finally, there are few vertices with many interactions and many vertices with few interactions. This means that if some proteins are eliminated, the topology of the protein interaction graph does not change, which confirms the robustness of organisms, as they can afford to lose some proteins without jeopardizing the existence or even the normal function of the network.
In protein interaction graphs, dense subgraphs are valuable since they provide details concerning the functionality of the subgraph proteins and the composition of protein complexes. Given the mathematical representation of a graph, algorithms derived from graph theory are well suited to isolating these dense areas.
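One way to make this concrete is sketched below, where a standard routine from graph theory, k-clique community detection, is applied to a toy interaction graph to pull out its densest region; this is only one of many possible dense-subgraph algorithms and is not the specific method of any study cited in this chapter.

# Extracting densely connected subgraphs from a protein interaction graph
# with k-clique community detection (one of several possible choices).
import networkx as nx
from networkx.algorithms.community import k_clique_communities

G = nx.Graph()
# Hypothetical toy network: one dense complex (A-D) plus peripheral proteins.
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("C", "D"),
    ("D", "E"), ("E", "F"),
])

# Members of the same community are candidates for the same protein complex.
for community in k_clique_communities(G, 3):
    print(sorted(community))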

Protein Interaction Databases


The amount of data that has been derived from high-throughput approaches, automated text mining techniques, and/or manually from the scientific literature has been stored in databases called protein-protein interaction databases. These databases are valuable resources for researchers, from which
they can easily retrieve and analyze the stored data (Suresh et al., 2005). Usually these databases include
data of protein interactions obtained from many organisms. The most popular databases that include
data concerning human protein interactions are HPRD (Peri et al., 2003), BIND (Alfarano et al., 2005),
MINT (Zanzoni et al., 2002) and IntAct (Hermjakob et al., 2005). A more comprehensive review of
protein interaction databases to date is presented in Table 1.
All these databases support the PSI-MI format (http://psidev.sourceforge.net/mi/xml/doc/user), a standard format for protein-protein interaction data. The HPRD database has recorded about three times more human protein interactions than the other databases, while the remaining databases contain almost the same number of human protein interactions. Moreover, there is a significant difference in the total number of protein-protein interactions among the various protein-protein interaction databases (Mathivanan et al., 2006), because the data for each database were derived using different methods. Finally, all these databases contain entries recording disease genes from the OMIM database that have at least one protein-protein interaction.
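As an illustration of how such standardized records can be consumed programmatically, the sketch below reads interaction pairs from a file in the tab-delimited PSI-MI TAB (MITAB) flavor of the PSI-MI standard; the file name is a hypothetical placeholder, and the assumption that the first two tab-separated columns hold the interactor identifiers follows the common MITAB layout rather than the export format of any particular database in Table 1.

# A minimal sketch for reading PSI-MI TAB (MITAB) formatted interaction
# records, assuming the common layout in which the first two tab-separated
# columns hold the identifiers of interactor A and interactor B.
import csv

def read_mitab_pairs(path):
    pairs = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip empty lines and header/comment lines
            pairs.append((row[0], row[1]))  # interactor A, interactor B
    return pairs

# Example usage with a hypothetical export file:
# pairs = read_mitab_pairs("human_interactions.mitab.txt")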
Apart from databases that store data obtained from experimental methods, there are other databases that store protein interactions predicted by computational methods. The most significant one is the Online Predicted Human Interaction Database (OPHID) (Brown et al., 2005), which combines the data stored in the HPRD, BIND and MINT databases with in silico predicted
data. The STRING database has integrated known and predicted interactions from a variety of sources
as well (Von Mering et al., 2007).

Table 1. The most important protein interaction databases

Database | Features | Web link | Reference
BIND | Binary molecular interactions, molecular complexes and pathways | http://www.blueprint.org/bind/bind.php | Alfarano et al., 2005
DIP | PPI data manually curated from literature | http://dip.doe-mbi.ucla.edu/ | Xenarios et al., 2000
HPRD | Human PPIs, information about post-translational modifications, subcellular localization, protein domain architecture, tissue expression and human disease associations | http://www.hprd.org/ | Peri et al., 2003
MINT | Experimentally verified protein interactions | http://mint.bio.uniroma2.it/mint/Welcome.do | Zanzoni et al., 2002
MIPS | Mammalian interaction data manually curated from literature | http://mips.gsf.de/proj/ppi/ | Pagel et al., 2005
IntAct | Interactions, experimental methods and literature citations of human proteins; no species restriction | http://www.ebi.ac.uk/intact/site/index.jsf | Hermjakob et al., 2005
PDZBase | PPIs involving proteins with PDZ domains, confirmed by in vitro and in vivo experiments | http://icb.med.cornell.edu/services/pdz/start | Beuming et al., 2005
Reactome | Pathways and biochemical reactions in humans | http://www.genomeknowledge.org/ | Joshi-Tope et al., 2005
STRING | Known and predicted protein-protein interactions from various organisms | http://string.embl.de/ | Von Mering et al., 2007
OPHID | Predicted human protein-protein interactions | http://ophid.utoronto.ca/ophid/index.html | Brown et al., 2005

The HPRD and OMIM databases, due to their importance in relation to this chapter, are presented in more detail in the following paragraphs.

OMIM
Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005) is an online database focusing on
human genes and genetic disorders. Initially, it was based on Dr. Victor A. McKusick's book entitled Mendelian Inheritance in Man. Today, the online database OMIM is distributed electronically by the National Center for Biotechnology Information (NCBI). It is updated daily and provides links to a variety of related resources. OMIM catalogues all known diseases with a genetic component and links them to the relevant human genes, when this information is available. Each entry has textual information and is accompanied by references. Many other databases, including HPRD, are based on the entries provided by OMIM, which strengthens the trust in the quality of the data included within
the database.

HPRD
The Human Protein Reference Database (HPRD) (Peri et al., 2003; Peri et al., 2004) is an online database
providing information about human proteins. This database was developed by Dr. Akhilesh Pandey's team at Johns Hopkins University and the Institute of Bioinformatics. The database includes domain architecture, protein functions, protein-protein interactions, post-translational modifications, subcellular localization and disease associations of genes (Mishra et al., 2006). HPRD also reports interactions of proteins with nucleic acids and small molecules. HPRD is a curated database in which data are
derived manually by expert biologists reading the published literature. The larger part of HPRD data
is derived from in vitro methods.
The property that makes HPRD a very important database is that it contains information about the connection of many proteins with diseases. This kind of information is obtained from the OMIM database, where the disease genes corresponding to these proteins are annotated; HPRD thus connects proteins with diseases based on OMIM information. Moreover, HPRD has information about protein modifications, which are very important as they are related to diseases. The identification of
protein modifications can lead to the design of new and more effective drugs.

PROTEIN INTERACTIONS AND DISEASES


Introduction
One of the major goals of biological science, with a great impact on our society, is to improve our understanding of the many diseases that currently exist. Some diseases are caused by a single gene, where one and only one gene has a dramatic influence on the disease phenotype. In contrast, more complex diseases, such as diabetes and heart disease, are caused by a combination of multiple genes, environmental factors and behaviors. The latter scenario is more frequently encountered within the population (reviewed in Pevsner, 2003). Many human diseases can be the result of abnormal protein-protein interactions or the result of the loss of essential interactions (Ryan and Matthews, 2005). In this section, we
review the mechanisms leading to diseases and involving protein interactions.
It is possible that protein interactions are the cause of pathological processes, for example in Huntington's disease (Li and Li, 2004), Alzheimer's disease and prion diseases (Cohen and Prusiner, 1998), and several types of human cancer (zur Hausen, 2000). Moreover, interactions between viral proteins can occur during replication, and human viruses can be assembled in the host cells (Loregian et al., 2002). Here, we present, in more detail, some case studies of possible links between protein interactions and
diseases.

Huntington's Disease and Protein Interactions


Huntington's disease belongs to the family of inherited neurodegenerative diseases that are caused by expansion of CAG repeats encoding polyQ tracts in the associated disease proteins (reviewed in Li and Li, 2004). This expansion in the huntingtin (htt) protein produces an altered form of the Htt protein, the mutant huntingtin (mHtt). Strong evidence indicates that the aggregation of mutant htt is linked to disease progression (Davies et al., 1997). Therefore, proteins that interact with htt and influence its aggregation are possible modulators of disease pathogenesis.
The htt protein has many interaction partners and a range of functions including anti-apoptotic
effects, transcription regulation, cellular trafficking and neuronal development (Harjes and Wanker,
2003). It has been shown that the htt protein interacts with the p53 transcription factor. It is believed that the expanded repeat of the htt protein causes aberrant transcriptional regulation through its interaction with cellular transcription factors like p53, which may result in neuronal dysfunction and cell death (Steffan et al., 2000). Many huntingtin-interacting proteins have been characterized, but how these proteins function within an interaction network under normal conditions, or how their dysregulation affects physiology, is still unknown (Borrell-Pages et al., 2006).

Cancer and Protein-Protein Interactions


Cancer is a multi-step process that generally occurs when cell division gets out of control. The transformed cell has six acquired capabilities (Hanahan and Weinberg, 2000). These are self-sufficiency in growth signals, insensitivity to anti-growth signals, evasion of apoptosis, limitless replicative potential, sustained angiogenesis, and tissue invasion and metastasis.
In this section, we will examine examples of abnormal protein interactions from pathogens leading to cancer, and different approaches in which protein-protein interaction inhibitors influence the activities that the transformed cell is capable of. The ultimate goal is the development of new therapies for various human diseases.

Pathogen-Host Interactions
A major mechanism leading to disease is the interaction of virus components with cellular proteins. Several types of human papillomaviruses (HPVs) infect humans and can lead to cervical cancer and several other types of carcinomas (zur Hausen, 2000; Baseman, 2005). Papillomaviruses are double-stranded DNA viruses that infect epithelial cells, taking over the cell machinery for their own replication and survival. Interactions of viral oncoproteins with growth-regulating host cell proteins have been
reported. HPV genomes, particularly those generally regarded as high-risk genomes, code for at least three proteins with growth-stimulating and transforming properties (E5, E6, E7). E5 is a protein found in the Golgi apparatus and in the plasma membrane (Burkhardt et al., 1989). This protein interacts with various transmembrane proteins such as the epidermal growth factor receptor (Hwang et al., 1995). A number of interactions have been reported for the E6 and E7 proteins that give stronger evidence, compared to E5, for their functions as oncoproteins. The E6 and E7 proteins were shown to immortalize and transform cells in culture (Vousden, 1994). The E6 protein binds to the cellular protein p53, an interaction mediated by the E6-associated ubiquitin ligase (reviewed in zur Hausen, 2000), whereas the E7 protein binds to the pRB family of pocket proteins, resulting in the loss of normal control over cell cycle progression. Another interaction of the E7 protein is that it inactivates the cyclin-dependent kinase inhibitors p21CIP-1 and p27KIP-1 (reviewed in zur Hausen, 2000). This is believed to be one of the major factors in the growth stimulation of cells infected by papillomaviruses.

Protein-Protein Interaction Inhibitors


There has been particular interest in inhibiting specific protein-protein interactions in order to develop therapies for various human diseases, and particularly cancer (reviewed in Arkin, 2005). Protein complexes of cells and microbes have come to be regarded as genuine drug targets, and several approaches have been developed to generate inhibitors that block abnormal protein-protein interactions. For example, antibodies and therapeutic proteins are widely used as antagonists of extracellular protein complexes; antibodies against growth factors or their receptors, for instance, are widely used to treat several types of cancer (Hinoda et al., 2004). However, antibodies cannot block intracellular targets due to their large size; therefore, smaller molecules such as cross-linked peptides and peptide mimetics are more likely to target both intracellular and extracellular protein-protein interactions (Zhao and Chmielewski, 2005).
An interesting interaction is the one between p53 and murine double-minute 2 (MDM2). Substantial progress and effort have been made to develop inhibitors of the p53-MDM2 interaction (reviewed in Klein and Vassilev, 2004). p53 is a sequence-specific DNA-binding transcription factor that regulates the cell cycle and also functions as a tumour suppressor. The main function of MDM2 is to regulate the protein level and activity of the tumour suppressor p53. In the normal cell, p53 is usually inactive and kept at a low level, bound to the protein MDM2, which prevents its action and promotes its degradation by acting as a ubiquitin ligase. p53 is usually activated by cancer-causing agents such as stress signals or DNA damage and takes on an active role as a transcription regulator, which leads to DNA repair and, in some cases, apoptosis (Jin, 2001). In cancers, MDM2 is often found over-expressed; it binds to p53, blocks the transcription activation domain and the function of p53 in general, and promotes the growth of tumours (Chen et al., 1996). Restoring p53 function by inhibiting its interaction with MDM2 is viewed as a viable anticancer strategy (Chene, 2003; Ventura et al., 2007).
Evading programmed cell death, termed apoptosis, is one of the major acquired capabilities of transformed cells. Several protein-protein complexes, such as the Bcl-2 family of proteins, have a potential role in regulating apoptosis (reviewed in Arkin, 2005). Over-expression of Bcl-2 has been observed in several types of cancer (Buolamwini, 1999), and all proteins of the family act by forming complexes with other members of the family (Fry and Vassilev, 2005). The development of inhibitors of these proteins as potential anticancer therapeutics has been explored previously (reviewed in Enyedy et al., 2001), but obtaining small-molecule inhibitors has proved difficult owing to the necessity of targeting a protein-protein interaction. Evidence has suggested that inhibiting Bcl-2 could reverse resistance to chemotherapy.

Another example is a small-molecule inhibitor that binds to the papillomavirus E2 protein and was designed to prevent the binding of the E1 protein and thereby block viral replication (reviewed in Ryan and Matthews, 2005).

APPLICATIONS OF COMPUTATIONAL METHODS IN PROTEIN INTERACTIONS AND DISEASES
Protein interactions contain a great amount of information that can be used to unravel the mechanisms of diseases. Research is focusing on proteins and on the identification of active pathways that are related to a specific disease. Usually, computational methods are used to predict the evolution of a disease or to design new, more potent drugs.
In order to obtain results of high quality, computational methods need good-quality input data. As far as protein interaction data are concerned, this is very difficult to achieve because the experimental methods are error-prone. To overcome this issue, the data fusion technique was introduced for the implementation of computational methods. Data fusion is the process of putting together information obtained from many different sources, such as protein interaction databases and experimental data. This technique generates input sets that are more reliable than those obtained from a single source, such as an experimental method or a protein database.
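A very simple realization of this idea is sketched below: interaction pairs reported by several hypothetical sources are merged, and each pair is scored by the number of independent sources supporting it, so that multiply supported interactions can be treated as more reliable input. The source names and protein identifiers are placeholders.

# A minimal data-fusion sketch: merge protein interaction pairs from several
# sources and keep a per-pair support count. All names are placeholders.
from collections import Counter

def normalize(pair):
    # Treat (A, B) and (B, A) as the same undirected interaction.
    return tuple(sorted(pair))

sources = {
    "database_1": [("P1", "P2"), ("P2", "P3")],
    "database_2": [("P2", "P1"), ("P3", "P4")],
    "experiment_1": [("P1", "P2"), ("P3", "P4")],
}

support = Counter()
for pairs in sources.values():
    for pair in set(map(normalize, pairs)):
        support[pair] += 1

# Keep only interactions reported by at least two independent sources.
reliable = {pair for pair, count in support.items() if count >= 2}
print(sorted(reliable))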
In recent years, several computational approaches to this problem have emerged, showing that there is increasing interest in this field within the scientific community. The well-known Pacific Symposium on Biocomputing conference has a session dedicated to the impact of protein interactions on diseases. An overview of the most significant computational approaches described at the Pacific Symposium on Biocomputing is presented below.

Computational Applications
Machine learning techniques and classifiers have been broadly used to reveal the connection between proteins and specific diseases (Terribilini et al., 2006; Xu et al., 2006). These methods utilize two kinds of datasets: a training dataset and an evaluation dataset. These datasets contain data about protein properties, which are chosen each time depending on the selected approach. The records in these datasets carry a tag that identifies whether the particular record is connected with a disease or not. All records within these datasets have been verified experimentally.
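The sketch below illustrates this training and evaluation setting with a generic classifier from the scikit-learn library; the feature vectors and disease tags are random placeholders standing in for experimentally verified protein properties, not data from the studies cited above.

# A minimal sketch of the classification setting described above: proteins are
# represented by feature vectors and tagged as disease-associated (1) or not
# (0). Features and labels here are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 10))           # 200 proteins, 10 hypothetical properties
y = rng.integers(0, 2, size=200)    # disease-association tag per protein

X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("evaluation accuracy:", accuracy_score(y_eval, clf.predict(X_eval)))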
On many occasions, the structural properties of the proteins are used. In this way, the protein interactions involved in a disease can be predicted more accurately (Ye et al., 2006; Kelly et al.,
2007). Furthermore, through the structural analysis of the interacting proteins, the disease mechanism
can be better understood and more potent drugs could ultimately be designed.
The study of the whole protein interaction network of an organism can complement the above-mentioned goal. As we saw earlier, a protein interaction network can be represented as a graph. The data forming the protein interaction network are obtained from protein databases or the scientific literature with the help of data mining techniques. Subsequently, algorithms can determine the significance of each protein with respect to a specific disease (Chen, 2006; Gonzalez et al., 2007). Various algorithms derived from graph theory can be applied to detect protein pathways (Bandyopadhyay et al., 2006) or proteins that play a vital role in a disease (Toyoda et al., 2000).
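A minimal instance of such an algorithm is sketched below: proteins in a toy interaction graph are ranked by combining their degree with their shortest-path distance to a set of known disease proteins. The network, the seed set and the scoring rule are illustrative assumptions, not the methods of the studies cited above.

# Ranking proteins in an interaction graph by proximity to known disease
# proteins (a simple illustration of network-based prioritization).
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("HTT", "HAP1"), ("HTT", "TP53"), ("TP53", "MDM2"),
    ("MDM2", "UBE3A"), ("HAP1", "KLC1"),
])
disease_proteins = {"HTT"}  # hypothetical seed set of known disease proteins

scores = {}
for node in G:
    if node in disease_proteins:
        continue
    # Smaller distance to a seed and higher degree give a higher score.
    dist = min(nx.shortest_path_length(G, node, seed) for seed in disease_proteins)
    scores[node] = G.degree(node) / dist

for protein, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(protein, round(score, 2))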

Computational methods are also used to simulate the behavior of living cells (Troncale et al., 2006), as biological processes are often difficult to study by in vivo or in vitro methods. In addition, the design of new tools has been shown to help guide in vivo and/or in vitro experiments (Cook et al., 2007). The initial results of this fruitful exchange between experimental and computational approaches seem very promising for the future of the field of protein interactions and diseases.

NOTE
* Both authors contributed equally to this work.

REFERENCES
Adams, P. D., Seeholzer, S., & Ohh, M. (2002). Identification of associated proteins by coimmunoprecipitation. In Protein-Protein Interactions, edited by E. Golemis. Cold Spring Harbor Laboratory
Press. (pp. 59-74).
Alfarano, C., Andrade, C. E., Anthony, K., Bahroos, N., Bajec, M., Bantoft, K., Betel, D., Bobechko, B., Boutilier, K., Burgess, E., et al. (2005). The biomolecular interaction network database and related
tools 2005 update. Nucleic Acids Research, 33, 418-424.
Arkin, M. (2005). Protein-protein interactions and cancer: Small molecules going in for the kill, Current
opinion in chemical biology, 9, 317-324.
Bandyopadhyay, S., Kelley, R., & Ideker, T. (2006). Discovering regulated networks during HIV-1
latency and reactivation. Pacific Symposium on Biocomputing, 11, 354-366.
Baseman, J. G., & Koutsky, L.A. (2005). The epidemiology of human papillomavirus infections. J Clin
Virol, 32(Suppl 1), S16-24.
Beuming, T., Skrabanek, L., Niv, M. Y., Mukherjee, P., & Weinstein, H. (2004). PDZBase: A proteinprotein interaction database for PDZ-domains. Bioinformatics, 21, 827-828.
Bork, P., Jensen, L. J., von Mering, C., Ramani, A. K., Lee, I., & Marcotte, E. M. (2004). Protein interaction networks from yeast to human. Current Opinion in Structural Biology, 14, 292-299.
Borrell-Pages, M., Zala, D., Humbert, S., & Saudou, F. (2006). Huntington's Disease: From Huntingtin
function and dysfunction to therapeutic strategies. Cell Mol Life Sci, 63, 2642-2660.
Brown, K., & Jurisica, I. (2005). Online Predicted Human Interaction Database. Bioinformatics, 21,
2076-2082.
Buolamwini, J. K. (1999). Novel anticancer drug discovery. Curr. Opin. Chem. Biol., 3, 500-509.
Burkhardt, A., Willingham, M., Gay, C., Jeang, K. T., & Schlegel, R. (1989). The E5 oncoprotein of bovine
papillomavirus is oriented asymmetrically in Golgi and plasma membranes. Virology, 170, 334-339.

Chan, F. K., & Holmes, K. L. (2004). Flow cytometric analysis of fluorescence resonance energy transfer: a tool for high-throughput screening of molecular interactions in living cells. Methods Mol Biol,
263, 281-292.
Chen, J. Y. (2006). Mining Alzheimer Disease Relevant Proteins from Integrated Protein Interactome
Data. Pacific Symposium on Biocomputing, 11, 367-378.
Chen, J., Wu, X., Lin, J., & Levine, A. J. (1996). mdm-2 inhibits the G1 arrest and apoptosis functions of the p53 tumor suppressor protein. Molecular and Cellular Biology, 16, 2445-2452.
Chene, P. (2003). Inhibiting the p53-MDM2 interaction: an important target for cancer therapy. Nature
Reviews, 3, 102-109.
Cohen, F.E., & Prusiner, S.B. (1998). Pathologic conformations of prion proteins. Annual review of
Biochemistry, 67, 793-819.
Colland, F., Jacq, X., Trouplin, V., Mougin, C., Groizeleau, C., Hamburger, A., Meil, A., Wojcik, J.,
Legrain, P., & Gauthier, J.M. (2004). Functional proteomics mapping of a human signaling pathway.
Genome research, 14, 1324-1332.
Cook, D., Wiley, J., & Gennari. (2007). CHALKBOARD: Ontology-Based Pathway Modeling and
Qualitative Inference of Disease Mechanisms. Pacific Symposium on Biocomputing, 12, 16-27.
Davies, S. W., Turmaine, M., Cozens, B. A., DiFiglia, M., Sharp, A. H., Ross, C. A., Scherzinger, E.,
Wanker, E. E., Mangiarini, L., & Bates, G. P. (1997). Formation of neuronal intranuclear inclusions
underlies the neurological dysfunction in mice transgenic for the HD mutation. Cell, 90, 537-548.
Denissova, N., Pouponnot, C., Long, J., He, D., & Liu, F. (2000). Transforming growth factor β-inducible independent binding of SMAD to the Smad7 promoter. PNAS, 97, 6397-6402.
Droit, A., Poirier, G., & Hunter, J. (2005). Experimental and bioinformatic approaches for interrogating protein-protein interactions to determine protein function. Journal of Molecular Endocrinology
34, 263-280.
Eisenberg, D., Marcotte, E.M., Xenarios, I., & Yeates, T.O. (2000). Protein function in the post-genomic
era. Nature, 405, 823-826.
Enyedy, I. J., Ling, Y., Nacro, K., Tomita, Y., Wu, X., Cao, Y., Guo, R., Li, B., Zhu, X., Huang, Y., Long,
Y. Q., Roller, P. P., Yang, D., & Wang, S. (2001). Discovery of small-molecule inhibitors of Bcl-2 through
structure-based computer screening. Journal of Medicinal Chemistry, 44, 4313-4324.
Fields, S., & Song, O. (1989). A novel genetic system to detect protein protein interactions. Nature, 340,
245-246.
Fried, M. G. (1989). Measurement of protein-DNA interaction parameters by electrophoresis mobility
shift assay. Electrophoresis, 10(5-6), 366-376.
Fry, D.C., & Vassilev, L.T. (2005). Targeting protein-protein interactions for cancer therapy. Journal of
molecular medicine (Berlin, Germany), 83, 955-963.

Giot, L., Bader, J.S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y.L., Ooi, C.E., Godwin, B.,
Vitols, E., Vijayadamodar, G., Pochart, P., Machineni, H., Welsh, M., Kong, Y., Zerhusen, B., Malcolm,
R., Varrone, Z., Collis, A., Minto, M., Burgess, S., McDaniel, L., Stimpson, E., Spriggs, F., Williams,
J., Neurath, K., Ioime, N., Agee, M., Voss, E., Furtak, K., Renzulli, R., Aanensen, N., Carrolla, S.,
Bickelhaupt, E., Lazovatsky, Y., DaSilva, A., Zhong, J., Stanyon, C. A., Finley, R. L., Jr., White, K. P.,
Braverman, M., Jarvie, T., Gold, S., Leach, M., Knight, J., Shimkets, R. A., McKenna, M. P., Chant, J.,
& Rothberg, J. M. (2003). A protein interaction map of Drosophila melanogaster. Science, 302, 1727-1736.
Goehler, H., Lalowski, M., Stelzl, U., Waelter, S., Stroedicke, M., Worm, U., Droege, A., Lindenberg,
K.S., Knoblich, M., Haenig, C., Herbst, M., Suopanki, J., Scherzinger, E., Abraham, C., Bauer, B., Hasenbank, R., Fritzsche, A., Ludewig, A. H., Bussow, K., Coleman, S. H., Gutekunst, C. A., Landwehrmeyer,
B. G., Lehrach, H., & Wanker, E. E. (2004). A protein interaction network links GIT1, an enhancer of
huntingtin aggregation, to Huntington's disease. Molecular Cell, 15, 853-865.
Gonzalez, G., Uribe, J. C., Tari, L., Brophy, C., & Baral, C. (2007). Mining dene-disease relationships
from biomedical literature: Weighting protein protein interactions and connectivity measures. Pacific
Symposium on Biocomputing, 12, 28-39.
Hall, R. A. (2004). Studying protein-protein interactions via blot overlay or far western blot. In Protein-Protein Interactions, Methods and Applications, Methods in Molecular Biology, 261, Humana Press,
Totowa, N.J., (pp. 167-174).
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., & McKusick, V. A. (2005). Online Mendelian
inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids
Research, 33, D514-D517.
Hanahan, D., & Weinberg, R. A. (2000). The hallmarks of cancer. Cell, 100, 57-70.
Harjes, P., & Wanker, E. E. (2003). The hunt for huntingtin function: Interaction partners tell many
different stories. Trends in Biochemical Sciences, 28, 425-433.
Harrison, S. C. (1991). A structural taxonomy of DNA-binding domains. Nature, 353, 715-719.
Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., Vingron, M.,
Roechert, B., Roepstorff, P., Valencia, A., et al. (2004). IntAct: An open source molecular interaction
database. Nucleic Acids Research, 32, 452-455.
Hinoda, Y., Sasaki, S., Ishida, T., & Imai, K. (2004). Monoclonal antibodies as effective therapeutic
agents for solid tumors. Cancer science, 95, 621-625.
Hwang, E. S., Nottoli, T., & Dimaio, D. (1995). The HPV16 E5 protein: Expression, detection, and stable
complex formation with transmembrane proteins in COS cells. Virology, 211, 227-233.
Ito, T., Tashiro, K., & Kuhara, T. (2001). Systematic analysis of Saccharomyces cerevisiae genome: Gene
network and protein-protein interaction network. Tanpakushitsu kakusan koso, 46, 2407-2413.
Jin, S., & Levine, A. J. (2001). The p53 functional circuit, Journal of cell science, 114, 4139-4140.

Jing, D., Beechem, J. M., & Patton, W. F. (2004). The utility of a two-color fluorescence electrophoretic
mobility shift assay procedure for the analysis of DNA replication complexes. Electrophoresis, 25(15),
2439-2446.
Johnsson, N., & Varshavsky, A. (1994). Split ubiquitin as a sensor of protein interactions in vivo. Proceedings of the National Academy of Sciences of the United States of America, 91, 10340-10344.
Jones, S., van Heyningen, P., Berman, H. M., & Thornton, J. M. (1999). Protein-DNA interactions: A
structural analysis. Journal of Molecular Biology, 287, 877-896.
Jonsson, P. F., & Bates, P. A. (2006). Global topological features of cancer proteins in the human interactome. Bioinformatics (Oxford, England), 22, 2291-2297.
Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath,
G. R., Wu, G. R., Matthews, L., Lewis, S., Birney, E., & Stein, L. (2005). Reactome: A knowledgebase
of biological pathways. Nucleic Acids Research, 33, 428-432.
Joung, J., Ramm, E., & Pabo, C. (2000). A bacterial two-hybrid selection system for studying protein-DNA and protein-protein interactions. Proc Natl Acad Sci U S A, 97(13), 7382-7387.
Kelly, L., Karchin, R., & Sali, A. (2007). Protein interactions and disease phenotypes in the ABC transporter superfamily. Pacific Symposium on Biocomputing, 12, 51-63.
Kenworthy, A. K. (2001). Imaging protein-protein interactions using fluorescence resonance energy
transfer microscopy. Methods, 24, 289-296.
Klein, C., & Vassilev, L. T. (2004). Targeting the p53-MDM2 interaction to treat cancer. British Journal
of Cancer, 91, 1415-1419.
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., Zeitlinger, J., Jennings, E. G., Murray, H. L., Gordon, D. B.,
Ren, B., Wyrick, J. J., Tagne, J. B., Volkert, T. L., Fraenkel, E., Gifford, D. K., & Young, R. A. (2002).
Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799-804.
Lehner, B., & A. G. Fraser. (2004). A first-draft human protein-interaction map. Genome Biol, 5(9),
R63.
Li, S. H., & Li, X. J. (2004). Huntingtin-protein interactions and the pathogenesis of Huntington's disease. Trends Genet, 20, 146-154.
Li, S., Armstrong, C. M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P. O., Han, J. D., Chesneau,
A., Hao, T., Goldberg, D. S., Li, N., Martinez, M., Rual, J. F., Lamesch, P., Xu, L., Tewari, M., Wong,
S.L., Zhang, L. V., Berriz, G.F., Jacotot, L., Vaglio, P., Reboul, J., Hirozane-Kishikawa, T., Li, Q., Gabel,
H. W., Elewa, A., Baumgartner, B., Rose, D. J., Yu, H., Bosak, S., Sequerra, R., Fraser, A., Mango, S.
E., Saxton, W. M., Strome, S., Van Den Heuvel, S., Piano, F., Vandenhaute, J., Sardet, C., Gerstein, M.,
Doucette-Stamm, L., Gunsalus, K. C., Harper, J. W., Cusick, M. E., Roth, F. P., Hill, D. E., & Vidal, M.
(2004). A map of the interactome network of the metazoan C. elegans. Science, 303, 540-543.
Loregian, A., Marsden, H. S., & Palu, G. (2002). Protein-protein interactions as targets for antiviral
chemotherapy. Reviews in medical virology, 12, 239-262.

Luisi, B. F. (1995). DNA-protein interaction at high resolution. In DNA-Protein: Structural Interactions, edited by Lilley, D. M. J. New York: Oxford University Press. (pp. 1-48).
Luscombe, N. M., Austin, S. E., Berman, H. M., & Thornton, J. M. (2000). An overview of the structures
of protein-DNA complexes. Genome Biology, 1, REVIEWS001.
Mathivanan, S., Periaswamy, B., Gandhi, T. K. B, Kandasamy, K., Suresh, S., Mohmood, R., Ramachandra, Y. L., & Pandey, A. (2006). An evaluation of human protein-protein interaction data in the public
domain. BMC Bioinformatics, 7(Suppl 5), S19.
Mishra, G. R., Suresh, M., Kumaran, K., et al., (2006). Human protein reference database--2006 update.
Nucleic Acids Research, 34, D411-D414.
Nooren, I. M. A., & Thornton, J. M. (2003). Diversity of protein-protein interactions. The EMBO Journal, 22(14),
3486-3492.
Pagel, P., Kovac, S., Oesterheld, M., Brauner, B., Dunger-Kaltenbach, I., Frishman, G., Montrone, C.,
Mark, P., Stümpflen, V., Mewes, H. W., Ruepp, A., & Frishman, D. (2005). The MIPS mammalian protein-protein interaction database. Bioinformatics, 21, 832-834.
Pandey, A., & Mann, M. (2000). Proteomics to study genes and genomes. Nature, 405, 837-846.
Peri, S., Navarro, J. D., Amanchy, R., Kristiansen, T. Z., Jonnalagadda, C. K., Surendranath, V., Niranjan, V., Muthusamy, B., Gandhi, T. K., Gronborg, M., et al., (2003). Development of human protein
reference database as an initial platform for approaching systems biology in humans. Genome Research,
13, 2363-2371.
Peri S., Navarro, J. D., Kristiansen, T. Z. et al., (2004). Human protein reference database as a discovery
resource for proteomics. Nucleic Acids Research, 32, D497-D501.
Petri V, & Brenowitz, M. (1997). Quantitative nucleic acids footprinting: Thermodynamic and kinetic
approaches. Current Opinion in Biotechnology, 8(1), 36-44.
Pevsner, J. (2003). Bioinformatics and functional genomics. Hoboken, NJ: John Wiley & Sons Inc.
Phillips, A. C., & Vousden, K. H. (1997). Analysis of the interaction between human papillomavirus type 16 E7 and the TATA-binding protein, TBP. Journal of General Virology, 78, 905-909.
Piehler, J. (2005). New methodologies for measuring protein interactions in vivo and in vitro. Current
Opinion in Structural Biology, 15, 4-14 .
Puig, O., Caspary, F., Rigaut, G., Rutz, B., Bouveret, E., Bragado-Nilsson, E., Wilm, M., Seraphin, B.
(2001). The tandem affinity purification (TAP) method: A general procedure of protein complex purification. Methods, 24, 218-229.
Ramani, A. K., Bunescu, R. C., Mooney, R. J., & Marcotte, E.M. (2005). Consolidating the set of known
human protein-protein interactions in preparation for large-scale mapping of the human interactome.
Genome Biology, 6, R40.
Rigaut, G., Shevchenko, A., Rutz, B., Wilm, M., Mann, M., & Seraphin, B. (1999). A generic protein
purification method for protein complex characterization and proteome exploration. Nat Biotechnology,
17, 1030-1032.
Rual, J. F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., Berriz, G. F., Gibbons,
F. D., Dreze, M., Ayivi-Guedehoussou, N., Klitgord, N., Simon, C., Boxem, M., Milstein, S., Rosenberg,
J., Goldberg, D. S., Zhang, L. V., Wong, S. L., Franklin, G., Li, S., Albala, J.S., Lim, J., Fraughton, C.,
Llamosas, E., Cevik, S., Bex, C., Lamesch, P., Sikorski, R. S., Vandenhaute, J., Zoghbi, H. Y., Smolyar,
A., Bosak, S., Sequerra, R., Doucette-Stamm, L., Cusick, M. E., Hill, D. E., Roth, F. P., & Vidal, M.
(2005). Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437,
1173-1178.
Ryan, D. P., & Matthews, J. M. (2005). Protein-protein interactions in human disease. Current Opinion
in Structural Biology, 15, 441-446.
Sam, L., Liu, Y., Li, J., Friedman, C., & Lussier, Y. (2007). Discovery of protein interaction networks
shared by disease. Pacific Symposium on Biocomputing, 12, 76-87.
Smith, G., & Petrenko, V. (1997). Phage Display. Chem. Rev., 97, 391-410.
Steffan, J. S., Kazantsev, A., Spasic-Boskovic, O., Greenwald, M., Zhu, Y. Z., Gohler, H., Wanker, E.
E., Bates, G. P., Housman, D. E., & Thompson, L. M. (2000). The Huntingtons disease protein interacts
with p53 and CREB-binding protein and represses transcription. Proceedings of the National Academy
of Sciences of the United States of America, 97, 6763-6768.
Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F. H., Goehler, H., Stroedicke, M., Zenkner,
M., Schoenherr, A., Koeppen, S., Timm, J., Mintzlaff, S., Abraham, C., Bock, N., Kietzmann, S.,
Goedde, A., Toksoz, E., Droege, A., Krobitsch, S., Korn, B., Birchmeier, W., Lehrach, H., & Wanker,
E. E. (2005). A human protein-protein interaction network: a resource for annotating the proteome.
Cell, 122, 957-968.
Stoll, D., Templin, M. F., Bachmann, J., Joos, T. O. (2005). Protein microarrays: Applications and future
challenges. Current Opinion in Drug Discovery and Development, 8(2), 239-52.
Suresh, S., Sujatha, Mohan, S., Mishra, G., Hanumanthu, G. R., Suresh, M., Reddy, R., & Pandey, A.
(2005). Proteomic resources: Integrating biomedical information in humans. Gene, 364, 13-18.
Tan, S., & Richmond, T. J. (1998). Eukaryotic transcription factors. Current Opinion in Structural
Biology, 8, 41-48.
Templin, M. F., Stoll, D., Schwenk, J. M., Potz, O., Kramer, S., & Joos, T. O. (2003). Protein microarrays: Promising tools for proteomic research. Proteomics, 3(11), 2155-66.
Terribilini, M., Lee, J-H., Yan, C., Jernigan, R. L., Carpenter, S., Honavar, V., & Dobbs, D. (2006).
Identifying interaction sites in Recalcitrant proteins: Predicted protein and RNA binding sites in rev
proteins of HIV-1 and EIAV agree with experimental data. Pacific Symposium on Biocomputing, 11,
415-426.
Toby, G. G., & Golemis, E. A. (2001). Using the yeast interaction trap and other two-hybrid-based approaches to study protein-protein interactions. Methods, 24, 201-217.
Toyoda T., & Takigawa Y. (2000). Selection of candidate genes for polygenic diseases by utilizing protein-protein interaction networks. Genome Informatics, 11, 286-288.

711

Protein Interactions and Diseases

Troncale, S., Tahi, F., Campard, D., Vannier, J-P., & Guespin, J. (2006). Modeling and simulation with
hybrid functional Petri Nets of the role of interleukin-6 in human early haematopoiesis. Pacific Symposium on Biocomputing, 11, 427-438.
Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R., Lockshon, D., Narayan, V.,
Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M., Johnston, M., Fields, S., & Rothberg, J. M. (2000). A comprehensive analysis
of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-627.
Uetz, P., & Finley, R. L., Jr. (2005). From protein networks to biological systems. FEBS Letters, 579,
1821-1827.
Valencia, A., & Pazos, F. (2002). Computational methods for the prediction of protein interactions.
Current Opinion in Structural Biology. 12, 368-373.
Ventura, A., Kirsch, D. G., McLaughlin, M. E., Tuveson, D. A., Grimm, J., Lintault, L., Newman, J.,
Reczek, E. E., Weissleder, R., & Jacks, T. (2007). Restoration of p53 function leads to tumour regression in vivo. Nature, 445, 661-665.
Vikis, H. G., & Guan, K.-L. (2004). Glutathione-s-transferase-fusion based assays for studying proteinprotein interactions. In Protein-Protein Interactions, Methods and Applications, Methods in Molecular
Biology, 261, Fu, H. (ed.). Totowa, NJ: Humana Press. (pp. 175-186).
Von Mering, C., Jensen, L. J., Kuhn, M., Chaffron, S., Doerks, T., Kruger, B., Snel, B., & Bork, P. (2007).
STRING 7-Recent developments in the integration and prediction of protein interactions. Nucleic Acids
Research,, D358-D362.
Vousden, K .H. (1994). Interactions between papillomavirus proteins and tumor suppressor gene products. Advances in Cancer Research, 64, 1-24.
Willats, W. (2002). Phage display: Practicalities and prospects. Plant Molecular Biology, 50, 837-855.
Xenarios, I., Rice, D. W., Salwinski, L., Baron, M. K., Marcotte, E. M., & Eisenberg, D. (2000). DIP:
The Database of Interacting Proteins. Nucleic Acids Research, 28, 289-291.
Xu, J., Li, Y. (2006). Discovering disease genes by topological features in human protein-protein interaction network. Bioinformatics, (22), 2800-2805.
Yan, Y., & Marriott, G. (2003). Analysis of protein interactions using fluorescence technologies. Current
Opinion in Chemical Biology, 7,.635-640.
Ye, Y., Li, Z., & Godzik A. (2006). Modeling and analyzing three-dimensional structures of human
disease proteins. Pacific Symposium on Biocomputing, 11, 439-450.
Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., Helmer-Citterich, M., Cesareni, G.
(2002). MINT: a Molecular INTeraction database. FEBS Lett, 513, pp.135-140.
Zhao, L., & Chmielewski, J. (2005). Inhibiting protein-protein interactions using designed molecules.
Current Opinion in Structural Biology, 15, 31-34.
Zhu, H., & Snyder, M.. (2001). Protein arrays and microarrays. Current Opinion in Chemical Biology,
5, 40-45.
712

Protein Interactions and Diseases

Zur Hausen, H. (2000). Papillomaviruses causing cancer: evasion from host-cell control in early events
in carcinogenesis. Journal of the National Cancer Institute, 92, 690-698.

KEY TERMS
ChIP: Abbreviation for Chromatin Immunoprecipitation. It refers to a procedure used to determine
whether a given protein binds to or is localized to a specific DNA sequence in vivo.
Databases: A database is a collection of records or data that are stored in a computer in such a way
that a computer program can easily select desired pieces of data.
Diseases: The term disease refers to an abnormal condition of a living organism that impairs function. It may include disabilities, disorders, syndromes and infections.
Protein Complexes: A protein complex is a group of two or more associated proteins, formed by protein-protein interaction. It is usually stable over time and is a form of quaternary structure.
Protein Interactions: Protein interactions refer to the association of protein molecules with proteins, DNA or any other molecule, and to the study of these associations from the perspectives of biochemistry, networks and signal transduction.
Protein Network: A protein network is a map of protein-protein interactions. The network is usually presented as a graph where nodes indicate proteins and links between them indicate the interactions between the proteins.
SDS-PAGE: Abbreviation for sodium dodecyl sulfate polyacrylamide gel electrophoresis. This is a
technique used in biochemistry, genetics and molecular biology to separate proteins according to their
electrophoretic mobility.
TAP: Abbreviation for Tandem Affinity Purification. It involves the fusion of the TAP tag to the
target protein of interest and the introduction of the construct into the cognate host cell or organism.


Chapter XL

The Breadth and Depth of BioMedical Molecular Networks:
The Reactome Perspective
Bernard de Bono
European Bioinformatics Institute, UK
and University of Malta, Malta

abstract
From a genetic perspective, disease can be interpreted in terms of a variation in molecular sequence or expression (dose) that impairs normal physiological function. To understand thoroughly the knock-on effect such pathological changes may have, it is crucial to map out the physiological relationship
affected genes maintain with their functional neighbors. The goal of the Reactome project is to build
such a network knowledgebase for all human genes. Constructing a map of such extent and scope requires a considerable range of expertise, so this project collaborates with field experts to integrate their
pathway knowledge into a single quality-checked human model. This resource dataset is systematically
cross-referenced to major molecular and literature databases, and is accessible to the community in a
number of well-established formats. As an evolving network systems resource, Reactome is also starting to provide increasingly powerful and robust tools to investigate tissue-specific biology and steer
targeted drug design.

INTRODUCTION
A major theme emerging from biomedical research in recent years is the multifactorial origin of many
diseases (for example, Talmud 2004; Barnetche, Gourraud et al. 2005). This feature is thought to reflect
the concerted evolution of a number of genes responsible for our survival on the one hand, and rapidly
changing environmental pressures on the other. Therefore, disease is seen as a reporter phenotype for
evolutionary and environmental change.


As the effort to establish the genetic basis of disease intensifies, single genes and their products are
under close scrutiny to determine their biological role and their individual contribution to pathology
and morbidity. The challenge today is to integrate this accumulated knowledge to provide the bigger picture: a global functional context in which every human gene has a well-defined role. A matrix of this nature, which describes the function of all genes in relation to each other, requires an eloquent grammar for the step-wise depiction of biological processes in detail.
From a medical standpoint, the notion of disease is based on a deviation from normal function. In
the course of studying a specific pathology, establishing how to identify and quantify this deviation depends on a proper definition of function under normal conditions. Establishing a standardized approach
to describe all large scale biological processes therefore creates a common platform that connects and
relates all known disease mechanisms at a molecular level. Furthermore, such an approach provides a
unique opportunity to reclaim and integrate applicable knowledge generated from studies using model
organisms that are of relevance to human disease. In this review I discuss the basic principles and limitations of the methods employed to depict human molecular physiology.

Background: Depicting normal human gene function


A key step to integrating knowledge about gene function is to develop a unified method to describe
the properties of their products. The Gene Ontology (GO) Consortium (Harris, Clark et al. 2004) has
developed a successful interdisciplinary project to catalogue and standardize a vocabulary of terms depicting biological activity and localization of expressed products. Each term is supported by a text-based
description that illustrates and defines a biological property, with which any number of gene products
may be associated. This annotation strategy has achieved considerable coverage of a large number of
genes from a wide variety of model organism databases (http://geneontology.org).
This qualitative relational classification of descriptive GO terms provides vital bearings on the
functional landscape of sequence molecules. GO actively maintains three distinct ontologies of terms, namely Molecular Function, Biological Process and Cellular Component. GO's stated objective is to keep its ontologies strictly orthogonal to each other, thus minimizing the descriptive overlap of these vocabularies.
Both Molecular Function and Biological Process terms represent some form of biological activity associated with gene products: a process can be seen as a recognized series of functions. Some
activity terms deal with the movement of biological entities (for example, protein transporter activity
(GO:0008565)), a significant proportion with a molecular conversion of some kind (for example, adenylosuccinate lyase activity (GO:0004018)), while others are concerned with assembly (for example,
actin cable formation (GO:0045011)). In many ways, GO activity terms relate to a structural change
that has a biological implication.
The Cellular Component ontology describes subcellular and extracellular locations, representing
a higher level of structural complexity, starting from macromolecular assemblies. For example, the GO
term actin cable (GO:0030482) is defined as a long bundle of actin filaments, comprising filamentous
actin and associated proteins, found in cells.
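The identifiers quoted above can also be inspected programmatically once a copy of the ontology is at hand. The following is a minimal sketch that assumes the third-party goatools Python package and a locally downloaded go-basic.obo file (both are assumptions about the reader's environment, not part of GO itself); it looks up a few of the example terms and reports which of the three ontologies each belongs to.

```python
# Minimal sketch: look up a few of the GO terms quoted in the text.
# Assumes the goatools package is installed and go-basic.obo is available
# locally; identifiers may become obsolete in newer GO releases.
from goatools.obo_parser import GODag

godag = GODag("go-basic.obo")   # parse the ontology once

for go_id in ("GO:0008565",     # protein transporter activity
              "GO:0004018",     # adenylosuccinate lyase activity
              "GO:0030482"):    # actin cable
    term = godag.get(go_id)
    if term is None:            # absent or obsolete in this release
        print(go_id, "not found in this ontology release")
    else:
        print(go_id, "|", term.name, "|", term.namespace)
```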
In practice, GO terms describe where gene products locate themselves, as well as giving an indication of what their role is and how it is carried out. The wording employed in GO terms also provides a unique insight into the conceptual relationship of structure and activity and the difficulties often encountered in
distinguishing the two. For instance, it is possible to find examples in which the biological purpose of a structure is considered on a par with its activity (for example, nutrient reservoir activity (GO:0045735) and structural constituent of chitin-based cuticle (GO:0005214) are both Molecular Function terms).
An activity term may also be qualified by a timeframe (for example, Biological Process term: activation of plasma proteins during acute inflammatory response (GO:0002541)). In such scenarios,
the structure containing an activity may also be seen to denote time span, in that a particular process
is taken to occur within the lifetime of its housing structure (for example, Biological Process term:
proteolysis within endosome associated with antigen processing and presentation (GO:0002499)).
Analogous situations where structures assume the role of progress milestones are a familiar feature in
development (for example, Biological Process term: multi-layer follicle stage, oogenesis (GO:0048162))
and tumour staging (for example, colon cancer (Sarma 1988)). Furthermore, the use of quantifiable structural change between sequence molecules, given a mathematical model of evolution, forms the basis of estimating species divergence times (for example, Lecomte, Vuletich et al. 2005; Jansen, Devaere et al. 2006; see also Box 1).
A number of high level structures are also referred to in GO activity terms. These include references
to elements from the Cell Component ontology (for example, Molecular Function term: axon guidance
receptor activity (GO:0008046)), as well as standard taxonomic terms (for example, Biological Process
term: neuroblast fate determination (sensu Nematoda and Protostomia) (GO:0043347)).
The above examples seem to suggest that, in certain cases, the same biological property is used in
more than one orthogonal ontology. This feature is especially evident in the case of structural properties that are used concurrently as a context, as a timeframe, as a location, as well as an activity. From
this perspective, it may be legitimate to consider the precise role structure should take in the depiction
of biological activity, particularly if a demonstrable structural change is at the basis of a consistent
depiction of a biological event. This approach should also be coupled with more precise criteria that
distinguish a simple assembly of gene products from a higher order Cell Component term (for example,
1-phosphatidylinositol-4-phosphate kinase, class IA complex (GO:0005943)), as well as which collection
of Molecular Functions should be mapped onto a higher order Biological Process term. A step-based
approach to these problems, discussed in the following section, may well supply some of the answers.

The Reactome Knowledgebase


The rationale for representing biological processes in pathway form is based on the notion that a crisp
depiction of function depends on clear description of structural change. This approach is less vulnerable
to potential ambiguities that stem from the structure-function duality illustrated above.
The Reactome (Vastrik, D'Eustachio et al. 2007) (http://reactome.org) model of human molecular biology consists of a broad descriptive graph of Reaction nodes (Figure 1). Each node recounts the conversion of one or more input Entities into a resultant output, often brought about by some participating catalyst that brokers this step. Input, output and catalyst Entities represent definite biological structures that are cross-referenced to appropriate accession identifiers when available (for example, proteins, small molecules). In contrast, the GO strategy uses an approach in which the definition of the Molecular Function term adenylosuccinate lyase activity (GO:0004018) is the "Catalysis of the reaction: N6-(1,2-dicarboxyethyl)AMP = fumarate + AMP", specified in free text.


All Entities and Reactions are contained within a housing structure called a Compartment, which corresponds to a selection of well-established, non-overlapping cellular locations featured in the Cellular Component ontology from GO.
A set of Reactions in Reactome, usually consecutive and interlinked, can be grouped to form a Pathway. Pathways and Reactions are generically called Events. GO terms from both Molecular Function
and Biological Process vocabularies can also be linked directly to the action of catalysts (as in the
above case of adenylosuccinate lyase), Reactions and Pathways, providing a valuable rational mapping
between terms from these two ontologies (Figure 2). Both Events and Entities may be further qualified
through the association of key literature references, as well as the annotation with original diagrams
and summaries to highlight items of interest.
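To make this vocabulary concrete, the sketch below expresses the Reaction/Entity/Pathway relationships described above in plain Python. It is an illustrative simplification only, not the actual Reactome schema; the class names and fields are invented for readability, and the example step is loosely modelled on the autophosphorylation reaction of Figure 1 (the ChEBI and UniProt identifiers shown are real, but their use here is purely illustrative).

```python
# Illustrative sketch only, not the actual Reactome schema: the Reaction /
# Physical Entity / Pathway vocabulary described above, in plain Python.
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class PhysicalEntity:
    name: str
    accession: str       # e.g. a UniProt or ChEBI identifier
    compartment: str     # a GO Cellular Component location

@dataclass
class Reaction:
    name: str
    inputs: List[PhysicalEntity]
    outputs: List[PhysicalEntity]
    catalysts: List[PhysicalEntity] = field(default_factory=list)
    literature: List[str] = field(default_factory=list)   # e.g. PubMed identifiers

@dataclass
class Pathway:
    name: str
    events: List[object] = field(default_factory=list)    # Reactions or sub-Pathways

# A single step loosely modelled on Figure 1 (receptor autophosphorylation):
atp = PhysicalEntity("ATP", "ChEBI:15422", "cytosol")
adp = PhysicalEntity("ADP", "ChEBI:16761", "cytosol")
bound_receptor = PhysicalEntity("hormone-bound insulin receptor", "UniProt:P06213", "plasma membrane")
phospho_receptor = PhysicalEntity("phosphorylated insulin receptor", "UniProt:P06213", "plasma membrane")

r3 = Reaction("receptor autophosphorylation",
              inputs=[bound_receptor, atp],
              outputs=[phospho_receptor, adp],
              catalysts=[bound_receptor])

cascade = Pathway("Insulin Receptor Cascade", events=[r3])
print(len(cascade.events), "event(s) in", cascade.name)
```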
It has to be pointed out that this large scale, objective and computationally accessible representation
of function in Reactome is based strictly on the pathway connectivity and corresponding structural
conversions. While this network is able to tell a clear, consistent and well-defined story, the value of extra links, commentaries and graphical enhancements in making this resource human-readable should not be underestimated.

Building Content
The complexity of building the network underlying human physiology stands as a firm reminder that "everything is connected to everything else" (the First Law of Ecology; Commoner 1971). At a molecular
level, this approach provides a meaningful context for the interaction of human proteins with each other
and with other types of molecule (not necessarily produced by human synthetic machinery - for example,
significant amounts of Vitamin K are produced by bacterial flora (Hill 1997)).
Creating a detailed functional model to gain further understanding of the behaviour of this vast
physiological network is a formidable task. Representation, even of the most basic and fundamental of
processes, such as energy balance, requires extensive mechanistic knowledge of both cause and effect,
spanning a number of molecular pathways and organ systems. A list of top-level pathways is featured
on the main panel of the Reactome website shown in Figure 3.
A pathway model provides the starting point for a number of investigations. With increasing volumes
of high-throughput protein interaction and gene expression results, it is crucial to interpret such data
in the functional context of a standard pathway reference framework. The ability to map experimental results onto a curated model is therefore a key step to gaining insight through the correlation with
pathway-specific knowledge. The structure of the model is also indicative of the expected behaviour of
its components. Modeling of protein and small molecule connectivity thus provides a way to analyse
crosstalk and feedback loops that determine the functional interdependencies between network elements
(Klamt, Saez-Rodriguez et al. 2006). Therefore, integrating detailed knowledge of physiological mechanisms enables the logical analysis of their pathways, as well as the identification of optimal intervention
target points for further scientific enquiry and biotechnological development.
However, the creation of a molecular model of sufficient quality and breadth to address, for instance, energy balance, is hampered by a number of production issues. The first is securing the biological expertise necessary to describe molecular mechanisms ranging from carbohydrate metabolism and lipid
synthesis to the physiology of growth and the regulation of appetite and temperature (Jungermann and
Barth 1996; Hill 2006; Trayhurn and Bing 2006). Secondly, given the resources invested in such an undertaking, the model then requires (a) regular maintenance and updating, as well as (b) packaging in
a manner that is accessible to and adaptable by the scientific community.
The main objective of the Reactome Knowledgebase is to provide a scalable solution to these production issues by integrating verifiable curated pathway data into a unified human model under constant
expert and editorial scrutiny. This supervision also has to ensure that the model is scientifically sound
and consistent throughout as, from a human biology perspective, knowledge is often:
a. Lacking: Detailed structural and functional characterization of individual genes and their products is a very time consuming task, such that only a fraction of human genes have been subjected to thorough analysis.
b. Dispersed: Although much of this knowledge is carefully written up and recorded, it is also scattered over a number of literature sources in disparate formats, emphasis, styles and levels of quality. This renders the interpretation of such information largely inaccessible to computational recovery methods (for example, text mining).
c. Inapplicable: For a number of practical and ethical reasons, a substantial amount of research in molecular biology has been carried out on non-human (model) organisms. This poses a problem of applicability. For instance, are pathways in a mouse hepatic cell line identical to those in the human liver? What lessons learnt from the study of eye development in Drosophila are applicable to human embryology?

In collaboration with Reactome, the expert biologist plays a central role in overcoming these restrictions, while ensuring the quality of the model. For instance, identifying knowledge from model organisms that is applicable to humans would provide a combined solution to problems (a) and (c), making
up for the shortcomings of direct investigation (the computational techniques involved are discussed
in Box 1). In those non-human cases in which applicability to humans is ascertained by the expert, the Reactome protocol is to construct Events pertaining to the model organism interactions first. These are annotated using the original literature reference as evidence. The corresponding human molecules are then selected to create a new set of inferred Events that point to the equivalent lower organism
annotation as evidence.
The overall objective of the collaborating expert is to extend the Reactome graph model by creating new Pathways on a particular topic module in the style of a formal literature review process. The
structure and content of Pathways and Reactions embedded in this review are constructed under the
direct supervision of the expert to reflect current consensus in the field. This publication process is
completed through the review by a second independent expert who checks for quality and clarity and
suggests refinements, prior to release.

Bricks and Mortar: The Modelling Kit


Given the relationship of function with biological structure, much of the eloquence of the process
grammar in Reactome depends on the descriptive properties of the Physical Entity representation of
structure (Figure 1).
Where possible, Entities in Reactome are strictly referenced using external accession identifiers. For instance, participating proteins are cross-referenced with accession identifiers to a number of well-established databases (for example, UniProt (Wu, Apweiler et al. 2006), KEGG (Kanehisa, Goto et al. 2006), EnsEMBL (Birney, Andrews et al. 2006), and Entrez Gene (Wheeler, Barrett et al. 2007)). In the case of molecules that are not sequence-based, the ChEBI database (Chemical Entities of Biological Interest (de Matos, Ennis et al. 2007)) plays a key role by providing expertise and curatorial support to the addition of new small molecules in Reactome.

Box 1.
The need to draw reliable inferences from homologous biological systems has spurred considerable
efforts to create rigorous frameworks in comparative biology and quantify structural differences between
species. A considerable body of comparative work in molecular biology has been based on sequence
alignment and the premise that in most cases, 3D structure (and consequent function) is far better preserved than primary sequence (Dunbrack 2006). Orthology maps based on this approach can bridge
corresponding homologous genes between model organisms and human, thus providing across-species indications about genes that are likely to be functionally equivalent (for example, the OrthoMCL resource (Chen, Mackey et al. 2006)). Increasingly sophisticated graph-based methods have been developed that specifically track the process of sequence evolution:
a. Sequence profiles: The ability to find matching sequences that share a common evolutionary origin has been greatly enhanced by the use of sequence profiles. In protein biology, for instance, a family alignment can be converted into a hidden Markov Model (HMM) graph with so-called match and insert nodes representing homologous amino acid positions (Figure 4). Such profiles can match sequences, and more recently other sequence profiles, allowing the detection of novel distant proteins that are likely to have similar structural and functional properties in common with known ones (Debe, Danzer et al. 2006).
b. Phylogenetic trees: Given a model of evolution, these simple graphs depict relationships between molecules based on sequence differences. A functional relationship has often been shown to exist between two sequence families that have very similar phylogenetic trees to each other. Improved methods that compare phylogenetic trees have recently enhanced the detection of shared evolutionary constraints, and thus the large-scale prediction of functional relationships between sequence molecules (Jothi, Cherukuri et al. 2006).

Graph-based orthology calculations focus on the quantitative assessment of structural change during the process of mutation-driven molecular evolution (usually over considerable timescales). Pathway
graphs, by and large, are concerned with processes over a shorter time period, such as the qualitative
depiction of cell signaling cascades and small molecule metabolism that typically take place over the
lifetime of a cellular housing structure.
Advances in graph analysis methods in sequence evolution have contributed valuable insights into
biological systems through the creation of increasingly powerful homology detection and other useful prediction tools. However, the development of similar quantitative approaches to analyse pathway
processes on a large scale has yet to overcome two considerable obstacles:
1. Distance functions between entities: Given a model of evolution, it is possible to calculate a mathematical distance (i.e. an unambiguous and quantifiable measure of difference) between sequences, based upon the mutations that distinguish one from the other. In the case of small molecules, a number of distance methods have been developed using a number of approaches (for example, the Tanimoto coefficient method based on the analysis of chemical group fingerprints (Godden, Stahura et al. 2005); a short worked sketch follows this list). In the more eventful Pathway model, gene products may assume a number of structural states given the various changes in chemical composition they may undergo (for example, fragmentation, glycosylation, phosphorylation, etc). It may be difficult, therefore, to create a distance function that can reconcile and quantify this vast spectrum of changes spanning both proteins and small molecules.
2. Rate equations: Kinetic pathway models use rate equations to provide a detailed and dynamic
mathematical description suitable for simulating a small system of molecules. They are usually
the result of detailed work on a small scale in an attempt to answer very specific questions about
a particular aspect of molecular biology. A significant collection of such models is now available
online (for example, the BioModels resource (Le Novere, Bornstein et al. 2006)). A series of practical limitations, however, confine the level of structural detail and number of molecules they can
incorporate, although the community is focusing its efforts to ensure the required compatibility
standards for such models to be merged (Le Novere, Finney et al. 2005).
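As a concrete illustration of the fingerprint idea in point 1 above, the sketch below computes a Tanimoto (Jaccard) coefficient between two molecules represented as sets of chemical-group features. The feature sets are invented placeholders; in practice the fingerprints would be bit vectors produced by a cheminformatics toolkit.

```python
# Tanimoto (Jaccard) similarity between two fingerprint feature sets.
# The feature sets here are invented placeholders; real fingerprints are
# bit vectors generated by a cheminformatics toolkit.
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Shared features divided by the union of features (0 = disjoint, 1 = identical)."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

glucose_like  = {"hydroxyl", "pyranose_ring", "primary_alcohol"}
fructose_like = {"hydroxyl", "furanose_ring", "primary_alcohol"}

print(f"Tanimoto similarity: {tanimoto(glucose_like, fructose_like):.2f}")  # 0.50
```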

While the re-arrangement of structure in a small molecule can be described in terms of a change from
one ChEBI accession ID to another, it is more complicated to represent protein modification in discrete
form. The states of phosphorylation or palmitoylation of a protein, to mention just two instances, cannot be distinguished on the basis of a change in its UniProt accession ID. The same holds true for protein cleavage into fragments, as well as for sequence polymorphism. The start and end amino acid positions are recorded in the case of fragmentation. If a particular residue is modified, the nature of the new chemical
group is referred to in terms of the corresponding ChEBI accession ID.
In Reactome, any shift from the original form of the protein results in the creation of a new Physical Entity that, however, will still retain a pointer to its primary UniProt ID. This feature enables the
interaction tracking of different states of the same molecule across the entire model, notwithstanding
the number of modifications it may have gone through.
A number of biological processes are strictly partitioned such that the transition of a molecule from
one compartment to another may have profound effects, and is therefore held under strict control (as
in the case of signaling triggered by the influx of calcium ions into the cytosol). As compartment type
is one of the basic defining features of a Physical Entity, a transport Reaction is able to simply map, as
input and output, two distinct Entities that refer to the same molecular accession ID but have different
localization properties.
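A minimal sketch of this bookkeeping follows. It assumes nothing about the actual Reactome implementation: entities are simply keyed on their reference accession, modifications, fragment boundaries and compartment, so that a transport Reaction relates two states that share one accession but differ in location (the insulin accession P01308 and the residue numbers are used for illustration only).

```python
# Illustrative sketch only (not the real Reactome schema): different states of
# one protein share a primary UniProt accession but differ in modifications,
# fragment boundaries or compartment.
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

@dataclass(frozen=True)
class EntityState:
    uniprot: str                                  # pointer to the primary accession
    compartment: str                              # GO Cellular Component location
    modifications: FrozenSet[str] = frozenset()   # e.g. chemical groups as ChEBI IDs
    fragment: Optional[Tuple[int, int]] = None    # start/end residues if cleaved

# Example states of one molecule (insulin, UniProt P01308; residue numbers are
# illustrative only):
precursor = EntityState("UniProt:P01308", "endoplasmic reticulum")
a_chain   = EntityState("UniProt:P01308", "secretory granule", fragment=(90, 110))
secreted  = EntityState("UniProt:P01308", "extracellular region", fragment=(90, 110))

# A transport Reaction simply relates two states with the same accession and
# fragment but different compartments:
assert a_chain.uniprot == secreted.uniprot
assert a_chain.compartment != secreted.compartment

# All states remain traceable through the shared primary accession:
states = [precursor, a_chain, secreted]
print(sum(s.uniprot == "UniProt:P01308" for s in states), "tracked states of P01308")
```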
The notion of ascribing functional equivalence to different structures is a key feature of the Reactome toolkit, as a specific role in a Reaction may be assumed equally well by a number of molecules.
For instance, different isoforms of regulatory and catalytic components of an enzyme dimer may exist
(for example, PI3K). Another example may involve a large family of hormones binding differentially
to a corresponding set of related receptors (for example, FGF receptors). On similar lines, it may be required to represent the number of different molecules transported by the same membrane channel
(for example, bile salts co-transport with sodium). The use of Sets in Reactome does away with the
necessity of depicting every possible combinatorial Reaction instance, without losing any of the detail
such an Event is required to convey.
The formation of molecular complexes is a mainstay in representing a number of biological scenarios.
All types of Physical Entity, including small molecules, proteins, Sets and any other complex can be
used as a component for assembly. Such complexes are also linked to GO Cellular Component terms
and literature references, where applicable.
The potential descriptive space of the Physical Entity is therefore considerable, being roughly the
product of (1) the set of small molecules and chemical groups, (2) all possible protein fragments in all
species, and (3) all cellular compartments. Any number of Physical Entities may feature in one Reaction,
in an input, output or catalytic role. The skill essential to the Reactome curatorial process is matching
the requirements of the expert biologist using the appropriate descriptive instruments from this data
model palette.

Using Reactome
The layout of the Reactome website takes the form of a regular and hierarchical presentation of pathway annotation (Figure 5). Pathways are presented as a series of crosslinked panels containing author-reviewed diagrams and summaries, together with hyperlinked Reaction depictions. Less experienced users can access extensive documentation on how best to use Reactome resources. Through the website, every Event can be individually exported in a number of well-established formats such as SBML (http://sbml.org), BioPAX (http://biopax.org) and Cytoscape (http://cytoscape.org), or repackaged in PDF/RTF for printing and perusal.
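Once exported, such files can be consumed with standard tooling outside Reactome. As a hedged example, the sketch below uses the python-libsbml bindings (an assumed third-party dependency, not part of Reactome itself) to open one exported SBML file, whose file name here is a placeholder, and list its reactions.

```python
# Sketch: inspect a pathway exported from Reactome in SBML format.
# Assumes the python-libsbml package is installed; "pathway.sbml" is a
# placeholder name for one exported Event.
import libsbml

doc = libsbml.readSBML("pathway.sbml")
if doc.getNumErrors() > 0:
    doc.printErrors()                 # report parsing problems, then continue

model = doc.getModel()
if model is not None:
    print("species:  ", model.getNumSpecies())
    print("reactions:", model.getNumReactions())
    for i in range(model.getNumReactions()):
        reaction = model.getReaction(i)
        print(" -", reaction.getId(), reaction.getName())
```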
The Reactome site also provides a number of other services and specific query functions. Visualization of the global context for every human molecular event is provided at the top graphical panel
of the website (Figure 3), known as the Sky, which lays out Pathway constellations of all Reactions
for user point-and-click interaction. The related SkyPainter provides a simple interface to highlight
Reactions on the Sky given a submitted list of recognized accession IDs for sequence, small molecule,
and other established data types (for example, GO, InterPro, Affymetrix, MIM etc). SkyPainter allows
exploration, for example, of the influence and effects of differential gene expression, as its identifiers are
linked to numeric qualifiers. Multiple numeric columns attached to a particular submission are rendered as an animated movie.
In order to provide more user-friendly and rapid data retrieval, Reactome has recently launched a
BioMart service that allows the formulation and handling of complex queries. A complex query may take the form of, for example, "which genes are involved in the apoptosis pathway?" BioMart is a simple,
federated query system designed to allow efficient and speedy data retrieval, even from complex queries
(Kasprzyk, Keefe et al. 2004).
Given that Reactome specifically curates human Pathways, it may be useful to determine what proportion of these Events may also be found in lower organisms. The OrthoMCL method (Li, Stoeckert et al. 2003) uses protein sequence matching and clustering to produce a set of orthologs and recent paralogs between two species. The resulting orthology maps play a central role in automatically linking inferred human Events to the underlying lower organism data, and in suggesting pathways across which their products may interact.


The Reactome toolkit consists of a number of complementary features, developed to facilitate access, maintenance, updating and broadcasting of pathway material. Both the Reactome database and toolkit are open access, so it is possible to install Reactome software locally (http://www.reactome.org/download/index.html) and to carry out customized operations for analysis and curation. Reactome software uses MySQL (which serves all Event network and Physical Entity data) and Application Programming Interface (API) tools (available as Perl, Java and SOAP-based web services kits). The Perl and Java API classes drive the website and curation clients respectively. However, Reactome data tables can also be accessed and manipulated directly by integrating such APIs into any implementation.

Conclusion
The classification of disease has long been a preoccupation and practice in the pursuit of both cause and
cure. Before the advent of microscopy, surgical definitions of pathology were based on the palpable abnormality of anatomical structure (for example, developmental anomalies, hypertrophy, tumour growth, atrophy) and compartmental connectivity (for example, ulceration, obstruction, oedema, laceration) (Russell, Williams et al. 2004). More recently, studies of changes in key physiological parameters such as volume, pressure, tension, and temperature have improved understanding of the role played by controlled adaptations in the electrical and material properties of tissue fabric.
This introduction of concepts from studies in cellular and molecular biology has added depth and nuance to the manner in which normality and disease are perceived. The genetic basis of disease is rooted in the pathological variation of sequence (i.e. somatic or inherited mutations; for example, Greenman, Stephens et al. 2007) or dosage. The latter may be altered through (a) the mis-expression of genes that belong to the host, or the significant expression of non-self genes (i.e. infecting parasites or mutated self genes), or (b) compartmental breakdown (blockages or leaks). Both mechanisms lead to an altered
molecular localization and concentration.
The wealth of accessible information ingrained in the human Reactome events is therefore opportune, now that graph-based methods are increasingly relying on verified interaction networks to predict
protein function (Sharan, Ulitsky et al. 2007) and consequent disease phenotypes (Lage, Karlberg et al.
2007). The key integrative thrust of the Reactome effort is focused on establishing a deep and robust
connectivity between established external databases, as well as providing ample scope and means to
interpret this molecular network in a genomic context.
The interplay between genes and the environment plays an important role in the modern understanding of disease. The current definition of normality is increasingly based on the function of the highest
frequency alleles and the highest frequency gene combinations within a specific geographical population
(for example, case control studies (WTCCC 2007); the selective advantage of heterozygous sickle cell
anemia in malarial regions (Williams 2006)). Furthermore, specific pathologies appear to be associated
with disease gene modules that show a higher likelihood of physical interactions between their products,
as well as higher expression profiling similarity for their transcripts (Goh, Cusick et al. 2007).
In addition, ambitious exercises that survey and classify all known normal activity in molecular biology have shown that, for biological activity to be defined in terms of the ability to change structure, a clear and consistent distinction between the two notions has to be kept. Yet, it is often not straightforward to
separate what a molecule is from what it does.


The successful elucidation of major disease processes depends on making full use of knowledge
acquired so far in the discovery of novel associations between the sequence and functional properties of key human molecules. However, the selective recovery and interpretation of information from
the literature is largely inaccessible to computational mining methods. Although much of biological
knowledge is carefully written up and recorded, it is also dispersed over a number of literature sources
in disparate formats, emphasis, styles and indeed levels of quality. Expertise is therefore required to
reclaim knowledge that is credible, well established and reliable. The collaboration between field biologists and the Reactome editorial style of curation guarantees more objectivity to this process and ensures
consistent standards throughout the model. In Reactome, the unit of knowledge sought for inclusion is
that molecular interaction or modification that has a definite and manifest biological purpose.
The principal function and purpose of many nascent polypeptide chains is to form a stable three-dimensional structure. The use of motifs of primary structure in hidden Markov Models has had a major
impact in identifying a broader range of sequences with the same ability, as well as our understanding of
the process of protein evolution. In Reactome, a finely granular data model maps a number of molecular
processes in a step-wise manner by tracking the series of structural conversions along interconnected
Events. Given the mapping of Reactions and Pathways to GO function and process terms, it is likely
that future developments in the quantitative comparison of molecular structures may identify those
recurring motifs of structural change that define specific biological activities.

Acknowledgment
Reactome is a joint project between the European Bioinformatics Institute (EBI), Cold Spring Harbor
Laboratories (CSHL) and the Gene Ontology Consortium (GO). It is supported by a grant from the
US National Institutes of Health (P41 HG003751), a grant from the European Union Sixth Framework
Programme (LSHG-CT-2003-503269), and a subcontract from the EBI Industry Programme.
The author acknowledges the Reactome team for a number of insightful discussions about pathway
informatics and helpful comments about the manuscript: Ewan Birney, Imre Vastrik, Esther Schmidt,
Bijay Jassal and David Croft at EBI, Lincoln Stein, Peter D'Eustachio, Gopal Gopinath, Marc Gillespie,
Lisa Matthews and Guanming Wu at CSHL, and Suzanna Lewis at GO.

References
Barnetche, T., Gourraud, P. A., et al. (2005). Strategies in analysis of the genetic component of multifactorial diseases; biostatistical aspects. Transpl Immunol, 14(3-4), 255-66.
Birney, E., Andrews, D., et al. (2006). Ensembl 2006. Nucleic Acids Res, 34(Database issue), D556-61.
Chen, F., Mackey, A. J., et al. (2006). OrthoMCL-DB: Querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res, 34(Database issue), D363-8.
Commoner, B. (1971). The closing circle. New York: Knopf.


de Matos, P., Ennis, M., et al. (2007). ChEBI - Chemical entities of biological interest. NAR Molecular
Biology Database Collection, (646).
Debe, D. A., Danzer, J. F., et al. (2006). STRUCTFAST: Protein sequence remote homology detection
and alignment using novel dynamic programming and profile-profile scoring. Proteins, 64(4), 960-7.
Dunbrack, Jr., R. L. (2006). Sequence comparison and protein structure prediction. Curr Opin Struct
Biol, 16(3), 374-84.
Godden, J. W., Stahura, F. L., et al. (2005). Anatomy of fingerprint search calculations on structurally
diverse sets of active compounds. J Chem Inf Model, 45(6), 1812-9.
Goh, K. I., Cusick, M. E., et al. (2007). The human disease network. Proc Natl Acad Sci USA, 104(21),
8685-90.
Greenman, C., Stephens, P., et al. (2007). Patterns of somatic mutation in human cancer genomes.
Nature, 446(7132), 153-8.
Harris, M. A., Clark, J., et al. (2004). The gene ontology (GO) database and informatics resource. Nucleic
Acids Res, 32(Database issue), D258-61.
Hill, J. O. (2006). Understanding and addressing the epidemic of obesity: An energy balance perspective. Endocr Rev, 27(7), 750-61.
Hill, M. J. (1997). Intestinal flora and endogenous vitamin synthesis. Eur J Cancer Prev, 6(Suppl 1),
S43-5.
Jansen, G., Devaere, S., et al. (2006). Phylogenetic relationships and divergence time estimate of African
anguilliform catfish (Siluriformes: Clariidae) inferred from ribosomal gene and spacer sequences. Mol
Phylogenet Evol, 38(1), 65-78.
Jothi, R., Cherukuri, P. F., et al. (2006). Co-evolutionary analysis of domains in interacting proteins
reveals insights into domain-domain interactions mediating protein-protein interactions. J Mol Biol,
362(4), 861-75.
Jungermann, K., & Barth, C. A. (1996). Energy metabolism and nutrition. Comprehensive Human
Physiology, 2, 1425-1457.
Kanehisa, M., Goto, S., et al. (2006). From genomics to chemical genomics: New developments in
KEGG. Nucleic Acids Res, 34(Database issue), D354-7.
Kasprzyk, A., Keefe, D., et al. (2004). EnsMart: A generic system for fast and flexible access to biological data. Genome Res, 14(1), 160-9.
Klamt, S., Saez-Rodriguez, J., et al. (2006). A methodology for the structural and functional analysis
of signaling and regulatory networks. BMC Bioinformatics, 7, 56.
Lage, K., Karlberg, E. O., et al. (2007). A human phenome-interactome network of protein complexes
implicated in genetic disorders. Nat Biotechnol, 25(3), 309-16.


Le Novere, N., Bornstein, B., et al. (2006). BioModels database: A free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems. Nucleic Acids Res,
34(Database issue), D689-91.
Le Novere, N., Finney, A., et al. (2005). Minimum information requested in the annotation of biochemical models (MIRIAM). Nat Biotechnol, 23(12), 1509-15.
Lecomte, J. T., Vuletich, D. A., et al. (2005). Structural divergence and distant relationships in proteins:
evolution of the globins. Curr Opin Struct Biol, 15(3), 290-301.
Russell, R. C. G., Williams, N. S., et al. (2004). Bailey and Love's short practice of surgery. Hodder Arnold.
Sarma, D. P. (1988). Dukes' classification of rectal cancer. South Med J, 81(3), 407-8.
Sharan, R., Ulitsky, I., et al. (2007). Network-based prediction of protein function. Mol Syst Biol, 3,
88.
Talmud, P. J. (2004). How to identify gene-environment interactions in a multifactorial disease: CHD
as an example. Proc Nutr Soc, 63(1), 5-10.
Trayhurn, P., & Bing, C. (2006). Appetite and energy balance signals from adipocytes. Philos Trans R
Soc Lond B Biol Sci, 361(1471), 1237-49.
Vastrik, I., D'Eustachio, P., et al. (2007). Reactome: A knowledgebase of biological pathways and processes. Genome Biol, 8(3), R39.
Wheeler, D. L., Barrett, T., et al. (2007). Database resources of the National Center for Biotechnology
Information. Nucleic Acids Res, 35(Database issue), D5-12.
Williams, T. N. (2006). Human red blood cell polymorphisms and malaria. Curr Opin Microbiol, 9(4),
388-94.
WTCCC (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000
shared controls. Nature, 447(7145), 661-78.
Wu, C. H., Apweiler, R., et al. (2006). The Universal Protein Resource (UniProt), an expanding universe of protein information. Nucleic Acids Res, 34(Database issue), D187-91.

KEY TERMS
Biological Process: A biological process, as described by this set of GO terms, occurs through one
or more ordered assemblies of molecular functions. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one
distinct step.
Cellular Component: The cellular component ontology describes locations, at the levels of subcellular structures and macromolecular complexes.


Gene Ontology (GO): A collaborative effort to address the need for consistent descriptions of gene
products in different databases. It has developed three structured controlled vocabularies (ontologies)
that describe gene products in terms of their associated biological processes, cellular components and
molecular functions in a species-independent manner.
Homology: In biology, entities or their functional systems that share common ancestry are said to
be homologous.
Kinetic Model: A model is a conceptual representation of a system or set of experimental observations. A kinetic model permits the simulation of such a system to observe the behaviour of its quantitative features.
Molecular Function: A set of GO terms describing activities, such as catalytic or binding activities,
that occur at the molecular level. These represent activities rather than the molecules or complexes that
perform the actions. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products.
Orthology: Homology (see above) as applied to features shared between species.
Pathway: In Reactome, a Pathway is any grouping of related Reactions or Pathways, collectively
known as Events. An Event may be a member of more than one Pathway.
Physiology: The science of large-scale behaviour of biological entities and their related chemical
and physical phenomena.
PubMed: PubMed is a service of the U.S. National Library of Medicine that includes a large number
of citations from life science journals for biomedical articles back to the 1950s. PubMed includes links
to full text articles and other useful resources.
Reaction: In Reactome, a Reaction is defined as the conversion of input substrate molecules to output
product molecules in a single step.


Appendix
Figures
Figure 1. The interactions of five Complexes, one Set and two small molecules described over three
consecutive Reactions (filled blue circles) that form part of the larger Insulin Receptor Cascade Pathway.
Protein rectangles are shaded blue, small molecules are white. The reaction R3 depicts the autophosphorylation of the hormone-bound receptor. The Set represents three human IRS proteins that play a
role in binding the activated receptor at this stage in the cascade.

Figure 2. Reactome establishes a deep and robust connectivity between external dictionaries. Unique
and complex relationships are captured between Gene Ontology terms and curated database entries
for proteins and small molecules.


Figure 3. The Reactome home page showing the Transcription Pathway highlighted on the Sky

Figure 4. A sketch representing a type of hidden Markov Model used in sequence matching. M, I and
D labels denote Match, Insert and Delete nodes respectively


Figure 5. The Human Apoptosis Pathway in Reactome. The top-most panel on this web page shows the
highlighted set of Reactions in the Sky that form part of this Pathway. The hierarchy of Events that
make up this process is accessible through the panel that follows. A summary compiled by the expert
biologists for this module is also available. A series of links to related molecular and genomic database
records, as well as to orthology-based pathway predictions in lower organisms, is found in the lower
part of this page. A number of export options are visible in the very last section.


Section XII

Mathematical Modeling Approaches


Chapter XLI

Entropy and Thermodynamics in Biomolecular Simulation

Jorge Numata
Freie Universität Berlin, Germany

abstract
Thermodynamics is one of the best established notions in science. Some recent work in biomolecular modeling has sacrificed its rigor in favor of trendy empirical methods. Even in cases where physics-based energy functions are used, entropy is forgotten or left for later versions. This text gives an overview of the utility of a more rigorous treatment of thermodynamics at the molecular level in order to
understand protein folding and receptor-ligand binding. An intuitive understanding of thermodynamics
is conveyed: enthalpy is the quantity of energy, while entropy stands for its quality. Recent advances in
entropy estimation from information theory and physical chemistry are outlined as they apply to biological thermodynamics. The different enthalpic, entropic, and kinetic driving forces behind protein folding
and binding are detailed. Finally, some medical applications enabled by an understanding of the free
energy folding funnel concept are outlined, such as HIV-1 protease folding inhibitors.

Introduction
Thermodynamics is not new and it's not trendy. Thermodynamics doesn't carry the glow of bioinformatics, pharmacokinetic modeling or metabolic networks. Should we care? In this chapter, I will argue
for the importance of thermodynamics and entropy in the study of biomolecules.
The field of systems biology deals with the emerging dynamical behavior of complex networks. It is
often possible to dispense with details and still understand the basic features of signaling and reaction
networks. But this should ideally not happen at the cost of ignoring thermodynamic principles. It is the
intention of this chapter to demystify the notions of thermodynamics relevant to biology and help the
reader acquire an intuitive grasp on the meaning of entropy.


The enzymes present in a given cellular compartment determine which reactions are catalyzed and
their kinetic rates. But they do not determine the directions of reactions or the amount of energy that is stored, transferred, or is required to synthesize a given reactant (Alberty, 2006). This is predicted by
thermodynamics as the direction of spontaneous processes, such as protein association events and the
extent of biochemical reactions. Thermodynamics quantifies equilibrium, phase changes and stability
using unmeasurable quantities like energy and entropy. These are coupled to experimentally measurable
ones, like temperature and pressure, through mathematical relationships. This way, thermodynamics
creates a self-consistent system of explanation for physicochemical transformations in micro- and
macromolecular systems.
The concept of free energy has established itself as the main criterion to predict if, and to what
extent, a process will occur in a spontaneous way. It represents the evolution of the qualitative idea of
chemical affinity, widespread until the 19th century. Free energy allows us to establish the equilibrium in chemical reactions and physicochemically driven processes such as non-covalent association.
Important processes governed by non-covalent bonding are hormone binding to receptors, mRNA codon recognition on the ribosome (Almlöf, Andér, & Åqvist, 2007) and protein-protein interactions.
Free energy allows us to predict the strength of such non-covalent interactions and the corresponding
equilibrium constants.
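In standard notation (a generic reminder rather than anything specific to the works cited), the quantitative link between free energy, its enthalpic and entropic contributions, and the equilibrium constant of a binding or folding process is:

```latex
\Delta G = \Delta H - T\,\Delta S,
\qquad
\Delta G^{\circ} = -RT\,\ln K_{\mathrm{eq}},
\qquad
K_{\mathrm{eq}} = e^{-\Delta G^{\circ}/RT}
```

so that a more negative standard free energy change corresponds to a larger equilibrium constant and hence a stronger non-covalent association.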
A reliable calculation of equilibrium constants for elementary reactions among biochemical metabolites and kinetic rates of enzymes from first principles would be an invaluable advance for the field of
systems biology. The methods presented in this review carry such potential.
The theoretical estimation of protein folding and ligand binding free energy is the subject of many
excellent reviews (Chipot, Andricioaei, Hummer et al., 2007; Gilson & Zhou, 2007; Lazaridis & Karplus,
2003; Shakhnovich, 2006; van Gunsteren, Bakowies, Baron et al., 2006). This review will concentrate
on conveying an intuitive understanding of entropy as it pertains to proteins at the molecular level,
sacrificing generality but not compromising on rigor.
Many interesting developments are occurring in the field of high-throughput virtual screening and
prediction of protein-protein interactions. In order to gain computational speed, some physical details
will inevitably be lost. My point in insisting on a thermodynamic treatment of biomolecular interactions is that we shouldn't forget what generations of chemists have learned and replace it with ad-hoc
formulations. Simplifications are useful and inevitable, so let us be guided by thermodynamics to propose models with solid foundations in physical chemistry. Simplified approaches to predict free energy
which nevertheless retain a solid theoretical foundation have earned an important place in engineering
thermodynamics of liquid mixtures (Mueller & Gubbins, 2001) and protein-protein binding (Audie &
Scarlata, 2007).

An intuitive notion of entropy


Internal energy and enthalpy quantify energy. Entropy measures the quality of that energy; the lower
its entropy, the more useful that energy is.
At first sight, the Earth seems to be kept alive by the energy arriving from the Sun. This is a superficial
understanding, because in the steady state the amount of energy arriving from the Sun and the amount radiated back are equal (ignoring for a moment the Earth's internal energy sources). If the energy stayed, the Earth would become unbearably warmer every day. As noted by Schrödinger (1944) in an
article directed to a lay audience, life is maintained by a constant influx of low entropy. He coined the term "negentropy", which in this context means that organisms are constantly expelling high entropy and feed on negative entropy to survive. The Earth receives low-entropy electromagnetic radiation, which partly trickles down through the food chain and metabolic networks, and is ultimately emitted back as high-entropy radiation. Yellow, high-frequency light arrives at the Earth; infrared, low-frequency radiation gets emitted. Consider Planck's blackbody radiation formula:

E = hν

where E = energy, h = Planck's constant and ν = frequency. The arriving high-frequency photons carry more energy per photon than those leaving. To keep the energy balanced in the steady state, more photons leave than arrive. A larger number of photons means more degrees of freedom, and thus higher entropy. For more on this, see Chap. 27 of (Penrose, 2005).
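A minimal numerical illustration of this bookkeeping in Python, using E = hν; the wavelengths chosen for the incoming and outgoing radiation are illustrative assumptions, not values given in the text:

    # For the same energy flux, more low-frequency photons must leave the Earth
    # than high-frequency photons arrive, because each photon carries E = h*nu.
    h = 6.626e-34        # Planck's constant, J*s
    c = 2.998e8          # speed of light, m/s

    wavelength_in = 580e-9    # roughly yellow sunlight, m (assumed value)
    wavelength_out = 10e-6    # roughly thermal infrared, m (assumed value)

    E_in = h * c / wavelength_in     # energy per incoming photon, J
    E_out = h * c / wavelength_out   # energy per outgoing photon, J

    # Outgoing photons needed per incoming photon to balance the energy budget:
    print(E_in / E_out)              # ~17, i.e. more photons and more entropy out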
Entropy is dissipation and irreversibility. This can be illustrated with a waterfall analogy. A small
amount of water at the top of a mountain falls into the ocean, which is flat and all at the same level.
In this process, its energy is not lost - it just becomes more dispersed. As it falls down, it may or may
not be used to drive an industrial or biological process.
Entropy is a measure of disorder. A crystal is an example of a low entropy material because of its
predictable regularity. This is not to say that particles in a crystal are static, but their displacements
due to thermal energy are relatively small. If we take a crystal vibrating in a lattice and heat it, it will
become a liquid and its entropy will rise. If we heat it further, it may become a gas that fills the whole
room. Can you now predict where each molecule is? The entropy (uncertainty) is now much larger than
in the crystal. We cannot go back to the crystal in a predictable fashion, putting everything back into
place the way it was before. This irreversibility results in an arrow of time that points just in one direction. (Bragg, Atkins, Grady et al., 2004).
Finally, entropy is a measure of multiplicity and variability within a system. It is a counting of states on a logarithmic scale. Entropy was rediscovered in the context of telecommunications by (Shannon & Weaver, 1948) to provide a measure of channel transmission capacity. Shannon's ideas went on to
become the foundation of information theory, a whole branch of applied mathematics closely related
to statistics.
An intuitive connection between the "quality of energy" understanding of entropy and the "multiplicity of states" view from information theory can be gained with the following example. Consider how we rub our hands together on a cold day. We use high-quality energy gained from food to apply
very directed work, which is a collective effort of many muscle cells applying a force in the same direction (low multiplicity). It gets transformed into low quality energy that we perceive as a rise in
temperature. This transformed energy has a high multiplicity because it quickly becomes spread out
in all directions and involves the random, undirected vibrations of many particles. It is of lower quality
because it cannot be completely turned back into directed motion, as dictated by the Second Law of
Thermodynamics.
Today, the concept of entropy has found widespread application in science and engineering. The
generality of thermodynamics has afforded it a place in physical chemistry (Dill & Bromberg, 2003),
engineering (Bejan, 1997), astrophysics (Hawking, 1976) and of course the life sciences (Atkins & Paula,
2006). Entropy also lives a parallel life in statistics (Jaynes & Bretthorst, 2003) and information theory
(Cover & Thomas, 2006; Shannon & Weaver, 1948). However, entropy is still often misunderstood,
ignored or pointed to as the cause of inexplicable results.
Misconceptions about energy and entropy


A folded protein in solution represents the optimized balance between minimum energy (enthalpy) and
maximum multiplicity of configurations and conformations (entropy) of all interacting particles (water,
ions, ligands and proteins). Here is a small list of misconceptions and wrong ideas regarding protein
thermodynamics which are nevertheless widespread:

•	Proteins have a special attraction for the minimum energy state. Collections of particles, including proteins, do not have an inherent preference for low-energy states. It just happens that energy is available in limited quantities at a given temperature. If a certain amino acid side chain is in a very high energy state, it leaves no energy for the other particles. High-energy conformations are thus not intrinsically unfavorable; they are just very unlikely.
•	The protein folding problem is in essence an optimization problem: we just need to efficiently and correctly locate the global minimum of the potential energy. For some decades, practical methods for obtaining the potential energy of large molecules from their conformations have been available in the form of quantum chemical calculations and molecular mechanics force fields. In some lucky cases, a favorable potential energy (enthalpic contribution) dominates the binding of rigid ligands to rigid receptors. In general, however, free energy must account for both enthalpic and entropic contributions. This means that receptor-ligand binding and protein folding cannot be directly reduced to an optimization problem solvable through minimization algorithms.
•	Energy is a perfectly understood concept. Actually, we have no mechanistic understanding of what energy is. We know that energy is conserved and we know, through the first law of thermodynamics, that it is interconvertible. The first law of thermodynamics is nevertheless only the statement of a fact, not a description of a mechanism. Understanding the mechanism of conservation is, however, not necessary for the application of the principle.
•	Entropy is a mysterious force of nature. Despite its reputation as an enigmatic property, entropy is in many ways less abstract than energy. It is a counting of states: a measure of the multiplicity of ways a system can arrange itself. Nature tends to spread out into all available possibilities. This spreading out grows with time in real, irreversible processes (and, as a matter of fact, it defines the arrow of time itself).
•	Entropy drives all systems into disorder. The first definition of entropy we hear is that it is a measure of disorder. This is actually correct. But the search for maximum entropy in one part of a system may be the driving force for organization in another. This is indeed the case in protein folding, where the hydrophobic effect maximizes the entropy of the water by collapsing the protein chain into a more orderly, low conformational entropy state. The net result is the minimum free energy of the whole system.

Thermodynamics: from steam engines to actin filaments


Although thermodynamics was born in the realm of industrial plants, its wide applicability has afforded
it a place in biology. For example, the thermodynamics of actin filaments has been explored (Sept &
McCammon, 2001). Actin proteins are self-assembling units that play key roles in muscle cells and in the

formation and reshaping of the cytoskeleton. The tools of computer simulation can be used to calculate
the thermodynamic driving forces behind processes like actin nucleation and polymerization.
Thermodynamics consists of a set of tools to reason about energies and entropies. The basic building
blocks are two laws and some multivariate calculus:
1st law (energy balance):

dU = δq + δw    (1)

where U = internal energy, q = heat and w = work. The d indicates that U is a path-independent state variable, while δ means that heat and work depend on the path taken.

2nd law (total entropy never decreases):

dS ≥ 0    (2)

where S = entropy (Boltzmann, 1896; Clausius, 1865).


The First Law simply states that energy in all forms is conserved, and that it can be exchanged
through heat and work. The Second Law can be seen as half a conservation law, because entropy can
be created but not destroyed. (Falk, 1985)
By combining the First and Second Laws at constant number of particles (N), Volume (V) and
Temperature (T), we may obtain an expression describing the approach to equilibrium of any process.
This criterion is what we mean by free energy:

Free energy differential:

dF = dU − T dS ≤ 0    (3)

Laws of Thermodynamics in Lay Terminology


1st Law: It is impossible to obtain something from nothing, but one may break even
2nd Law: One may break even but only at the lowest possible temperature
3rd Law: One cannot reach the lowest possible temperature
Implication: It is impossible to obtain something from nothing, so one must optimize resources
- From (Annamalai & Puri, 2002)

Free energy in processes involving proteins


Protein folding and receptor-ligand binding occur in a spontaneous and specific way because the folded
and bound states have a lower free energy than their denatured and unbound counterparts, respectively.
The Helmholtz free energy change ΔF or the Gibbs free energy change ΔG = ΔF + PΔV predicts the equilibrium constant (Keq) for folding and binding. For macromolecules solvated in incompressible fluids like water, the volume term PΔV is negligible:

ΔG ≈ ΔF = ΔU − TΔS = −kB T ln Keq

All processes occur at temperatures higher than −273.15 °C, so the entropic term −TΔS will always play a role. For macromolecules and soft matter in general, understanding the driving forces that together result in a given free energy or binding constant requires consideration of flexibility and motion. Regarding molecules as static entities has thwarted advances in the understanding of biological thermodynamics.
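As a small worked illustration of the relation ΔG = −kB T ln Keq above, the sketch below converts between free energy differences and equilibrium constants; the numerical inputs are arbitrary examples, not values from the text:

    import math

    R = 1.987e-3   # gas constant, kcal/(mol*K)

    def keq_from_dG(dG, T=298.15):
        """Equilibrium constant from a free energy difference: K = exp(-dG/(R*T))."""
        return math.exp(-dG / (R * T))

    def dG_from_keq(K, T=298.15):
        """Free energy difference from an equilibrium constant: dG = -R*T*ln(K)."""
        return -R * T * math.log(K)

    print(keq_from_dG(-5.0))    # dG = -5 kcal/mol gives K of roughly 5e3
    print(dG_from_keq(1.0e6))   # K = 1e6 corresponds to about -8.2 kcal/mol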

Timeline of paradigms in protein flexibility and binding


Protein flexibility is quantified by conformational entropy. If we are able to develop a reliable way of
estimating the conformational entropy of macromolecules, we are in fact quantifying their conformational flexibility. But the importance of this flexibility is sometimes ignored in structural biology and
molecular mechanics studies. A timeline of the paradigm shift is shown in Table 1.

Predicting ligand binding and protein-protein interactions


The difference in free energy between two states tells us if a process will occur spontaneously, and to
what extent. For a thermodynamic state function such as free energy to be meaningful, the start and
end states should be clearly defined.
Table 1. Timeline of paradigms

1894: Lock-and-key hypothesis of Emil Fischer. Protein and ligand are specific and reciprocal. Enzymes are rigid templates of attractive and repulsive regions. (Demetrius, 1998)

1954: First computational chemistry simulations: Monte Carlo of hard spheres. (Rosenbluth & Rosenbluth, 1954)

1958: Induced-fit model of Koshland. Ligands induce conformational changes in enzymes that cause a tighter fit.

1959: X-ray crystal structure of a protein by Max Perutz. This is an invaluable advance, but it reinforced in many the paradigm of the frozen, rigid structure.

1962: "Everything that living things do can be understood in terms of the jiggling and wiggling of atoms." (Feynman, 1962)

1965: Allosteric modulation model of Monod, Wyman & Changeux: applicable to enzymes that, aside from the active site, also possess other effector sites. Binding at all sites can cause conformational changes. (Monod, Wyman, & Changeux, 1965)

1977: First molecular dynamics (MD) simulation published, on the protein BPTI. (McCammon, Gelin, & Karplus, 1977) Molecular dynamics has accelerated the understanding of protein motion as essential for function and not just accessory.

1981: Public release of MD simulation programs. (Brooks et al., 1983) Conformational motion modes may be separated with Principal Component Analysis, also known as the quasi-harmonic approximation to entropy, pioneered by (Karplus & Kushick, 1981).

2001: Network of coupled promoting motions, a paradigm to explain the function of enzymes. (Billeter, Webb, Agarwal et al., 2001; Hammes-Schiffer, 2006) Collective modes are reservoirs of entropy.

Present: Sequence encodes dynamics, which encodes function. (Bahar, Chennubhotla, & Tobi, 2007) Binding stabilizes sets of functional conformational ensembles: the solvated ensemble consists of conformational clusters (whose populations depend on their relative free energy), and upon binding one of these clusters becomes favored. By selecting an already existing collective motion mode, the entropy penalty upon binding is reduced. This is called conformational selection. (Lange, Lakomek, Farès, et al., 2008)

For example, the stability of a protein against unfolding is given by ΔG_unfolding = G_denatured − G_folded. If ΔG_unfolding is positive, thermodynamics will favor the folded state.
Similarly, the binding free energy for a ligand-receptor complex is:

ΔG_binding = G_complex − (G_ligand + G_protein) = −RT ln Ka

A factor-of-ten increase in the binding affinity constant Ka makes ΔG_binding about 1.4 kcal/mol more negative at room temperature. This additional stability can come from either enthalpic or entropic contributions within the whole system (ligand, receptor protein, solvent, etc.).
It is essential to report experimental binding free energies with respect to a standard concentration C0, which is arbitrary (Mihailescu & Gilson, 2004). Likewise, theoretical methods that estimate binding free energy should produce concentration-dependent predictions. If C0 is changed from 1 M to 1 nM, then every experimental standard free energy of binding will be reported as RT ln(10^9) ≈ 12 kcal/mol more positive (Gilson & Zhou, 2007). Consistent with the mass-action law, a favorable binding free energy becomes more negative with increasing concentration. The equilibrium constant Ka is:

Ka = exp(−ΔG° / (RT)) = (C_complex C0 / (C_receptor C_ligand))_eq

The dependency on C0 is only removed when the relevant quantity is a difference of binding free energies, ΔΔG.
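The sketch below illustrates this standard-state bookkeeping; the association constant used is an arbitrary example of a nanomolar binder, not a value from the text:

    import math

    R, T = 1.987e-3, 298.15     # kcal/(mol*K), K

    def dG_standard(Ka, C0=1.0):
        """Standard binding free energy for an association constant Ka (1/M),
        referred to an arbitrary standard concentration C0 (M)."""
        return -R * T * math.log(Ka * C0)

    Ka = 1.0e9                         # assumed nanomolar binder, 1/M
    print(dG_standard(Ka, C0=1.0))     # about -12.3 kcal/mol at the 1 M standard state
    print(dG_standard(Ka, C0=1.0e-9))  # about 0 kcal/mol if C0 is redefined as 1 nM
    print(R * T * math.log(1.0e9))     # the ~12 kcal/mol shift quoted in the text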

Conformational entropy
The net enthalpic (ΔH) and entropic (−TΔS) contributions from all particles (solute and solvent) almost cancel out in natural or properly engineered proteins (Zoete, Meuwly, & Karplus, 2005).

Figure 1. Proteins are molecular machines with motion networks that catalyze reactions. They obey the
laws of thermodynamics, as first laid out for steam engines. Photo of a sculpture taken by the author at
Tacheles, Berlin in 2006.

Stability against unfolding is typically around ΔG_unfolding = 5 to 15 kcal/mol (Keq = 10^−4 to 10^−11). Upon folding, the solute becomes more rigid and loses conformational entropy. This unfavorable contribution is typically −TΔS_conformational = 10 to 100 kcal/mol. Any estimation of free energy lacking this contribution will grossly overestimate stability against unfolding.
The entropy S_conformational can be defined in terms of the probabilities of microstates. A microstate is an individual conformation of a molecule:

S_conformational = −kB Σ_i p_i ln p_i

Here, p_i represents the net probability of occupancy of a given microstate, including all correlations and statistical dependencies that give rise to it. Some flawed methods to estimate the conformational entropy of proteins incorrectly assume independent motion of side chains and additive rotamer entropies (Pickett & Sternberg, 1993). It is known from information theory that ignoring correlations and statistical dependencies causes an overestimation of entropy (Jaynes, 1965).
Some of the microstates of a molecule are strongly quantized, as is the case for fast vibrations with frequencies much larger than kB T / ℏ. Others behave more classically, with closely spaced energy levels (Keeler, 2005), such as low-frequency soft modes (the largest reservoirs of vibrational entropy). Still others arise from very closely spaced, practically indistinguishable energy levels, such as rotational and translational entropy. Linear correlation, and more generally non-linear statistical dependence among all modes, reduces their individual contributions and thwarts additivity. As we need statistics to define entropy through the p_i, it is convenient to review how probability entered the equations of physical chemistry and thermodynamics.

The statistical in mechanics


Statistics deals with uncertainty and probabilities. For many, the mention of statistics brings up the idea
of a defeat of determinism. Shouldn't science be about fixed relationships and deterministic models?
Statistics is much more than smudging over our ignorance. If we want to predict the behavior of large
collections of particles (such as a protein), we need to quantify their collective behavior with statistics.
But if we want to understand the behavior of a single particle with quantum detail, it turns out we also
need probabilities. It seems that a purely deterministic description suffices only for very simple classical systems.
Chemistry and thermodynamics were regarded for a long time as exclusively macroscopic sciences. At the time when Ludwig Boltzmann was alive (late 1800s), it was hotly debated whether chemical matter is particulate or a continuum. One of his main opponents was Walther Nernst. Boltzmann settled the argument by reconciling the atomic theory of matter with the theory of heat. As chemical systems were proven to be made up of enormous numbers of molecular or atomic units, Boltzmann's statistical approach, which took into account the stochastic nature of microscopic processes (in which sharply defined macroscopic physical values become distributions), proved to be the accurate way to treat them (Lazar, 2003). But Boltzmann himself never lived to see the success of his theories. It is a sad fact that he committed suicide in 1906 in the midst of a scientific consensus that matter is a continuum.
Albert Einstein once said, thinking about the ranking of scientific theories, that thermodynamics is "the only physical theory of universal content that, within the framework of its basic notions, will never be toppled." This came from a man who overthrew classical mechanics twice. His published PhD dissertation (Einstein, 1906) deals with deterministic equations. This seems to have been a compromise, as his PhD advisor Alfred Kleiner would not accept his molecular kinetic treatment of fluids, allegedly because of its statistical nature (Uffink, 2006). This idea is supported by the fact that during the previous years he had been publishing papers about entropy and thermodynamics with a strong statistical component (Einstein, 1903).
Another realm of research that involves statistical distributions and probabilities is quantum mechanics. Usually, in physics textbooks, the birth of quantum mechanics is credited to Planck's resolution of the discrepancy between experiment and theory for the energy distribution of black-body radiation through the introduction of quantized energy. It was Planck's deep insight into thermodynamics that enabled him to solve the problem by formulating it in terms of the entropy of oscillators. Max Planck's favorite topic seemed to be the Second Law of Thermodynamics. For years, Planck had upheld a macroscopic view of entropy and of matter as a continuum. But with his resolution of the black-body radiation problem, Planck was not only introducing the Wirkungsquantum, but at the same time recognizing the need for a statistical treatment. (Kragh, 2000) "I was, however, at that time still too far oriented towards the phenomenological aspect to come to closer quarters with the connection between entropy and probability [...] I busied myself with the task of elucidating a true physical character for the [entropy] formula, and this problem led me automatically to a consideration of the connection between entropy and probability, that is, Boltzmann's trend of ideas," said Planck in his Nobel prize lecture (Planck, 1918). He continued throughout his life to study and teach thermodynamics (Planck, 1945).
The success of statistical mechanics and quantum physics has given statistics a new standing in science. Rather than being a replacement for mechanistic understanding, statistics is a powerful tool for
inductive and logical reasoning with complex data. (Jaynes & Bretthorst, 2003)
For all the success of statistics as a scientific tool, not everyone agrees about its indispensability. Recently I had the chance to ask Nobel prize winner Robert Laughlin whether he thought information theory will play a larger role in theoretical developments for the emergent behavior of soft and biological matter at the mesoscopic, or "middle way", scale (Laughlin, Pines, Schmalian et al., 2000). His answer was a clear no, and his explanation: "because it's not physical." He then pointed to the assumption of equal probabilities of molecular microstates as useful but unfounded. Although the maximum entropy method (Jaynes, 1957) provides a mathematical formalism and generalization of "equal a priori probabilities" and the "principle of insufficient reason", in the end this assumption is not physically cemented.
However, it is possible to make a sharp distinction in statistical mechanics between the physical and the statistical. We formulate our partial knowledge into a physical model. This model should deliver a correct enumeration of the states of a system and their properties. The statistical part is a straightforward example of inference. (Jaynes, 1957)
Arieh Ben-Naim argues that entropy can be reduced to plain common sense, as:
1. The Second Law is basically a law of probability [as Boltzmann established].
2. The laws of probability are basically the laws of common sense [as Laplace said].
3. It follows from (1) and (2) that the Second Law is basically a law of common sense - nothing more.
(Ben-Naim, 2007)

Entropy measures the multiplicity of microstates


A microstate is an individual conformation of a molecule. According to the Boltzmann distribution, a
microstate of lower energy will be more populated than one of higher energy. Expressing the Boltzmann
distribution in terms of the probability of microstates i and j:

p_i / p_j = exp( −(E_i − E_j) / (kB T) )    for T > 0 K    (4)

E is a microscopic energy inherent to the conformation; it is a function of (N, V) but not of the temperature or entropy themselves (T, S). Higher temperatures mean that the thermal energy kB T allows significant occupation of higher energy levels. The system acquires an internal energy U according to which microstates are actually populated. The macroscopic (average) internal energy U is the average over all microstates:

U = ⟨E⟩ = Σ_{k=1..t} p_k E_k    (5)
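A short sketch of Eqs. (4)-(5), computing Boltzmann populations and the average internal energy for a handful of illustrative microstate energies (the energy values are assumptions, not from the text):

    import math

    def boltzmann_populations(energies, kBT=1.0):
        """Normalized populations p_i = exp(-E_i/kBT) / Z over the listed states."""
        weights = [math.exp(-E / kBT) for E in energies]
        Z = sum(weights)              # partition function restricted to these states
        return [w / Z for w in weights]

    energies = [0.0, 1.0, 2.0]        # illustrative microstate energies, in units of kBT
    p = boltzmann_populations(energies)
    U = sum(pi * Ei for pi, Ei in zip(p, energies))   # Eq. (5): U = <E> = sum p_k E_k
    print(p, U)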

The toy model in Figure 2 captures many interesting properties of energy and entropy. This molecule has positively and negatively charged ends, which attract and may bind with a favorable energy −ε. Since the compact conformation has the lowest energy, one could reach the conclusion that it is the most populated. But what if the microstates of the open type are more numerous? This multiplicity of states (W) is quantified by the entropy as S = kB ln(W).

Figure 2. A 4-bead toy model of a folding molecule. The compact state is unique and contains one attractive interaction with a favorable energy of −ε. The four open microstates have no long-range interactions. The entropy can be calculated as S_compact = kB ln(1) and S_open = kB ln(4). Based on Fig. 10.1 of (Dill & Bromberg, 2003)

The concept of free energy uses both U and S to assess the stability of each macrostate. A macrostate
is a collection of microstates with the same energy:
Helmholtz free energy: F = U − TS
We are interested in the relative stability between two such macrostates:

ΔF_compact→open = ΔU − TΔS = −kB T ln Keq

The equilibrium constant Keq gives the ratio of open to compact conformations:

Keq = exp(−ΔF / (kB T)) = exp(−ΔU / (kB T)) exp(TΔS / (kB T))

The unfolding (compact → open) free energy for the toy model of Figure 2 is:

ΔF_compact→open = (0 − (−ε)) − T (kB ln 4 − kB ln 1) = ε − kB T ln 4

Keq for Figure 2 is then:

Keq = exp(−ε / (kB T)) exp(kB T ln 4 / (kB T)) = 4 exp(−ε / (kB T))
The entropy looks difficult to interpret when expressed as ΔS = kB ln(4) = 2.75 cal/(mol K). But its contribution to the equilibrium constant is a dimensionless number measuring the multiplicity W = 4. This means simply four times more open than compact conformations. In this light, entropy is more intuitive and easier to understand than energy itself!
Energy and entropy can be interpreted likewise in protein folding. A conformational entropy change of 33.3 cal/(mol K) means there exist W_denatured/W_native = exp(TΔS/(kB T)) = 1.9×10^7 times more denatured conformations than native ones. I use the word denatured and not unfolded because the denatured state is not a random chain. Seen from this chain conformational entropy contribution alone, folding is unfavorable. For folding to occur, other interactions, such as the hydrophobic entropy gain and favorable enthalpic interactions, have to compensate for this large conformational entropy loss.
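Both of these numerical statements can be checked directly. The sketch below evaluates the toy-model equilibrium constant of Figure 2 for two assumed contact energies, and the multiplicity ratio implied by the 33.3 cal/(mol K) example:

    import math

    kB, T = 1.987e-3, 300.0    # kcal/(mol*K), K

    def keq_toy(eps):
        """Open/compact equilibrium constant of the 4-bead model: 4*exp(-eps/(kB*T))."""
        return 4.0 * math.exp(-eps / (kB * T))

    print(keq_toy(0.5))    # weak contact (0.5 kcal/mol): open states win on multiplicity
    print(keq_toy(2.0))    # strong contact (2 kcal/mol): the compact state dominates

    # Multiplicity ratio implied by a conformational entropy change of 33.3 cal/(mol K):
    dS = 33.3e-3           # kcal/(mol*K)
    print(math.exp(dS / kB))   # roughly 1.9e7 denatured conformations per native one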

Stabilization by entropy
In Figure 3 we see a situation where two conformers of a protein, x1 and x2, differ in their internal energies. A conformer is a collection of macrostates with similar energies and conformations around a major energy well. Conformer 1 has a lower internal energy U1, for instance due to having more favorable residue contacts than conformer 2. Because the energy difference U2 − U1 is no larger than the thermal energy kB T, significant interconversion will occur.
Although each individual high-energy macrostate with E ≈ U2 around the basin x2 is less likely (see Eq. 4), there are many more such states. The sheer multiplicity of states around x2 allows it to be significantly populated. High multiplicity means high entropy, depicted here as the width of the well, S2 > S1.
Figure 3. The horizontal axis is a collective reaction coordinate measuring folding, for example
the radius of gyration. The conformation at x1 is more compact and has more energetically favorable
contacts than the one at x2. However, x2 may be favored by entropy. Figure based on (van Gunsteren et
al., 2006)

A biological example of conformational entropic stabilization, in this case of the folded state, occurs
in hyperthermophilic organisms. Their genes code for proteins enriched in positively charged amino
acids. But the positive charge is achieved more frequently through lysines rather than arginines. This
significant bias was recently explained (Berezovsky, Chen, Choi et al., 2005) through the much higher
number of rotamers available to lysine. Because the effective conformational freedom of lysine is
greater, it constitutes a reservoir of conformational entropy in the folded state, thereby stabilizing the
protein as a whole.
It should be mentioned that it is actually the total free energy that tends to a minimum, and this includes the solvent, ions and any other relevant degrees of freedom and their interactions.

Enthalpy-entropy compensation in protein stability and binding

Protein folding and binding involve non-covalent interactions, which operate within very narrow free energy ranges. For biological processes, the net difference between states is often between 5 and 15 kcal/mol. This is about 10 times more than the thermal energy at physiological temperature, RT = 0.6 kcal/mol. Ken Dill set an error goal of 0.1 kcal/mol per amino acid for the estimation of the free energy of proteins (Dill, 1997). Such precision would allow us to use theoretical methods for folding, binding or other processes with large conformational changes.
Enthalpy-entropy compensation is just enough for folding. Proteins have evolved in such a way that
the changes in enthalpy from intramolecular interactions, electrostatic solvation free energy, a favorable
hydrophobic effect and an unfavorable conformational entropy loss almost cancel out. This razor thin
window for compensation is bad news for simulation-based calculation of thermodynamic variables.
Each one of the aforementioned factors can contribute 10 to 100 kcal/mol. The methods for estimating these enthalpic and entropic contributions must therefore be extremely precise in order not to miss the total free energy, which for a spontaneous process may be almost zero, but is always negative. Hence the quest for more precise methods.
The literature abounds with examples of half-successful attempts at estimating the free energy of unfolding from energetic contributions. For example, (Pace, Shirley, McNutt et al., 1996) estimate the unfolding free energy of RNase T1 from simulations as ΔG_NU = −15 kcal/mol, whereas the experimental value is +9 kcal/mol. Because the sign is negative, the simulation actually predicts that the denatured conformation is more stable. Likely many more examples could be found on the hard drives of theoreticians who didn't dare to publish them.

Configurational energy landscape, a.k.a. potential energy surface

Statistics can help us enumerate configurations and calculate entropy. But we still need a physical model
to provide their energy (enthalpy). For small molecules, a fully quantum mechanical description is feasible. For macromolecules, however, a meaningful sampling of configurational space is best achieved
by employing classical potentials. It was recognized in the 1960s that the potential energy of a protein
can be practically evaluated as a function of its coordinates (Karplus, 2006), and that this function could
be obtained from a combination of quantum chemical data and experiments. This potential energy is
called a force-field in the context of molecular dynamics simulation programs. In other contexts, the
potential energy surface is called a configurational energy landscape.
All thermodynamic properties may be obtained from the partition function. This quantity represents the number of states that are effectively accessible to the system at the given conditions. For the
conformational degrees of freedom of a molecule, the partition function may be obtained by locating
all the minima on the potential energy surface, and estimating the density of states around them, with a
method such as normal mode analysis (NMA) or Quasi-harmonic entropy with supralinear corrections
(Numata, Wan, & Knapp, 2007). Local minima are stationary points on the potential energy surface with
just positive or zero NMA frequencies. The free energy surface is a result of the network of stationary
points in the potential energy surface (enthalpy) and the density of states around them (entropy).

Free energy landscape, a.k.a. folding funnel


The free energy landscape or folding funnel is not the same as the potential energy landscape because
it includes entropy. It is the resulting free energy from energetic (enthalpic) and entropic components.
It represents the free energy surface including the solvent (e.g. water) and the solvated protein. This
folding funnel has thousands of dimensions even for a small peptide.
The folding funnel is attributed to Ken Dill and Hue Sun Chan (Dill & Chan, 1997). The idea is
conceptually very useful and has become popular. (Brooks_III, Gruebele, Onuchic et al., 1998) It has
also been misinterpreted. In a conversation with John Chodera (former coworker of Ken Dill), he pointed
out that some have naïvely interpreted the funnel by thinking that a simple mathematical optimization algorithm could solve the protein folding problem. Of course, in actuality the optimization of the free energy landscape is not a simple problem, and its result is not necessarily a single populated conformer.

Figure 4. Free energy folding funnel. This 3-dimensional version of the folding funnel that you may be familiar with is a cartoon. It depicts the change in enthalpy plus effective solvation energy (vertical axis) and conformational entropy (both horizontal axes) during protein folding.

As always, when projecting objects in reduced dimensions, there are ambiguities. One can consider
the example of a 2D map of the Earth. The position of the North Pole is not uniquely defined on the map.
In the free energy surface, one can ascribe a coordinate system locally to any one basin and perhaps a
few similar neighboring ones, but this does not apply throughout the configurational space. (Laughlin
et al., 2000). This projection into three dimensions is nevertheless conceptually useful. Figure credit:
Peter G. Wolynes (UCSD).

Kinetic constants from a molecular perspective


To obtain the kinetics of a molecular process, for example of a conformational transition, we must
locate all transition states that link the local minima. These are points in the potential energy surface
that possess a single imaginary frequency. Once these points are known, methods from statistical rate
theory such as kinetic Monte Carlo may be employed to quantify transitions. Steepest descent paths
provide an approximation of the actual routes the transitions will follow under thermal energy. Kinetic
constants are, in this light, accessible from knowledge of the minima and saddle points in the potential
energy surface. (Wales, 2005)
In order to study proteins, the potential energy surface should be an effective one that includes
interactions among all particles, like water, cofactors, ions, etc. For this purpose, an implicit model
approach such as Generalized Born (Chen, Im, & Brooks, 2006) is useful. It can, however, falsify the
kinetic transitions because of the lack of solvent viscosity.
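As a hedged illustration of how a barrier on such a surface translates into a rate, the sketch below uses the generic Eyring transition-state-theory expression; this is a textbook formula, not the kinetic Monte Carlo scheme cited above, and the barrier height and transmission coefficient are assumptions:

    import math

    def eyring_rate(dG_barrier, T=298.15, kappa=1.0):
        """Transition-state-theory rate k = kappa*(kB*T/h)*exp(-dG_barrier/(R*T)).
        kappa is a transmission coefficient, assumed to be 1 here."""
        kB = 1.380649e-23    # J/K
        h = 6.626e-34        # J*s
        R = 1.987e-3         # kcal/(mol*K); the barrier is given in kcal/mol
        return kappa * (kB * T / h) * math.exp(-dG_barrier / (R * T))

    # An assumed 10 kcal/mol barrier gives roughly 3e5 crossings per second at 298 K.
    print(eyring_rate(10.0))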

Enthalpic, entropic and kinetic stabilization


Most processes are controlled by a balance between enthalpy and entropy, and enabled by kinetics
relevant to the timescales of cellular processes. For contrast, some extreme examples of processes

where either enthalpy, entropy or kinetics alone plays the major role, overriding the other influences,
are shown in Table 2.
Unimolecular reactions of small molecules need not consider entropy changes. Frequently, organic
chemists working with small molecules think of reaction free energies only in terms of internal energy
changes. This is admissible for unimolecular reactions and those obeying the Hammond principle
(Fersht, 1999) because covalent bond rearrangement in molecules made from tens of atoms can be
reasonably described by enthalpic changes. As Tomasz Borowski (Jagiellonian University) recently put it, calculating entropy from quantum chemistry in this field is almost cosmetic, as it usually doesn't tip the balance for reactions.
The situation is very different for non-covalent binding of organic macromolecules consisting of
hundreds of atoms, like proteins and DNA. Both enthalpic and entropic contributions are sizeable and
neither should be neglected a priori.

Contributions to protein stability from molecular interactions and conformational freedom

Protein thermodynamic stability is a delicate business, involving the interplay and correlations within
the protein and its environment. The following table (Table 3) of energetic contributions to free energy
is very rough and meant only for gaining a qualitative perspective.

Additivity of free energy


Contrary to earlier suggestions (Dill, 1997; Mark & van Gunsteren, 1994), it is theoretically sound to
mathematically define additive group contributions for bonds, interactions and solvation free energy
(Lazaridis & Karplus, 2003). Conformational entropy, on the contrary, depends on statistical correlations and dependencies spanning at least the whole protein complex, and cannot be generally divided
into group contributions (Numata et al., 2007).

Table 2. Examples of processes where a single contribution dominates

Enthalpic: Unimolecular chemical reactions (Hammond principle).

Entropic: Conformational entropy controls transport properties (diffusion, thermal conductivity, shear viscosity) of liquids (Laughlin et al., 2000; Kauzmann, 1943). Lipid membrane composition per leaflet (Poznansky & Lange, 1978).

Kinetic: Binding of the HIV protease inhibitor Lopinavir (Shuman, Hämäläinen, & Danielson, 2004). Beginning of amyloid fibril formation (dimerization) (Hwang, Zhang, Kamm et al., 2004).

Table 3. Rough energetic contributions to protein folding free energy (values in kcal/mol)

Covalent bond: −80 per bond
Hydrogen bond (included in electrostatic and water desolvation): −2 to +1 per bond (Day, 1996; Zoete et al., 2005)
van der Waals interactions: −1 per bond
Electrostatic interactions: −5 per bond
Water electrostatic desolvation and the hydrophobic effect upon folding or binding: +3 per amino acid
Conformational entropy −TΔS (coil to folded chain, rigidification): +1 to +10 per amino acid

Positive values are destabilizing. Compare to the thermal energy kB T = 0.6 kcal/mol.
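A back-of-envelope sketch of why this table implies a precision problem for simulations: the individual opposing terms are large while the net folding free energy is small. All magnitudes below are invented, loosely inspired by the ranges in the table:

    favorable_enthalpy = -520.0        # packing + electrostatics + H-bonding, kcal/mol (assumed)
    desolvation_penalty = +310.0       # water desolvation bookkeeping, kcal/mol (assumed)
    entropy_penalty = +200.0           # conformational entropy loss (-T*dS), kcal/mol (assumed)

    dG_fold = favorable_enthalpy + desolvation_penalty + entropy_penalty
    print(dG_fold)                     # -10 kcal/mol: tiny compared with each term

    # A 1% relative error in a single large term already exceeds the net result:
    print(0.01 * abs(favorable_enthalpy))   # ~5 kcal/mol of uncertainty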

Covalent bonds and disulphide bridges


From the table, it is evident that covalent interactions are the strongest. It is only possible to break them with the concerted effort of enzymes, and very rarely by thermal energy alone. This does not mean that covalent bonds directly stabilize the folded state with their energy.
When calculating free energy differences, it is crucial to define the end states very precisely. Otherwise, one could think that a disulphide bridge contributes 80 kcal/mol to folding stability. The proper
way is to define the folding free energy between the denatured chain and the native folded conformation.
Both denatured and folded states already contain all covalent bonds, including the disulphide bridge,
so covalent bonds don't contribute much to the folded-denatured difference in internal energy except
through strain. Molecular mechanics calculations show that covalent bonding energy is very similar in
folded and denatured states, corresponding to different but relaxed conformations.
Also noteworthy is that the unfolding free energy is a state function. This means that the stability of the protein is independent of the exact path followed by the cellular synthesis mechanism that created the protein and its disulphide bridges. All we need to analyze are the ensembles of the folded and the denatured protein, both of which contain the same covalent bonds.
(Gelpí, Orozco, Rueda et al., 2007) ran MD trajectories of hundreds of proteins in explicit water and analyzed the conformational entropy with the method of (Schlitter, 1993). Several proteins contained 1 to 3 disulphide bridges. The authors conclude that disulphide bridges do not restrict conformational space sampling and do not lower the entropy of the folded state compared to proteins without these bridges. The stabilization from strategically placed disulphide bridges should then come from lowering the entropy of the denatured state, making the folding entropy penalty smaller.

Electrostatic, van der Waals interactions and hydrogen bonds

Electrostatic interactions are modeled based on the experimental observation expressed in Coulomb's law. Van der Waals forces are caused by the momentary polarization of molecules and are much shorter ranged; they are often modeled in molecular mechanics by the attractive part of a Lennard-Jones potential.

Hydrogen bonds (H-bonds) are not explicitly modeled in modern molecular dynamics force fields (MacKerell Jr., 2004). Instead, hydrogen bonds arise naturally from electrostatic and van der Waals interactions and solvation free energy. They are thus not a separate energetic contribution in the current analysis, but are worth discussing because the debate in the literature about their role in protein stability has a long history.
One could naïvely think that the contribution of a hydrogen bond to folding stability can be evaluated as the interaction energy between the CO and NH groups. But the free energy should always be defined as a difference between two states: for example, folded vs. denatured. Even the enthalpy of an intramolecular side chain-side chain hydrogen bond cannot be generalized, because orientation and environment can play a big role. One can say the maximum strength of such a hydrogen bond is between 2 and 10 kcal/mol. But this is not the net free energy contribution.
The denatured state is partially unfolded, so many potential hydrogen bonding partners in the polypeptide chain are satisfied by hydrogen bonds to water. Anthony Day (Day, 1996) mentions that "When the protein folds, these protein-to-water H-bonds are broken, and only some are replaced by (often sub-optimal) intra-protein H-bonds. McDonald & Thornton showed that while only 1.3% of backbone amino groups and 1.8% of carbonyl groups in proteins fail to H-bond (without any obviously compensating interactions), 80% of main chain carbonyls fail to form a second hydrogen bond. Thus, if one considers enthalpy terms alone, it would appear that hydrogen bonding is destabilizing to folded protein structure." Proteins often prefer water hydrogen bonds, with their flexible and optimal orientations, to intramolecular ones.
Some authors suggest that H-bonds should generally make a favorable contribution to protein stability (Pace et al., 1996), estimated as −1.5 ± 1.0 kcal/mol (Day, 1996). More recent studies based on calculating total solvation energies (for example through Poisson-Boltzmann) and total electrostatic energies in the two states (folded-denatured or bound-unbound) suggest otherwise (Lazaridis & Karplus, 2003). Because these contributions added together are frequently positive (unfavorable), it can be concluded that hydrogen bonds don't necessarily make a big contribution to the folding free energy. However, they are still important in providing geometric directionality (Zoete et al., 2005). Using a combination of analysis of experimental data and simulations, (Honig, Yang, & Sharp, 1992) also concluded that the effect of favorable hydrogen bonds is largely cancelled out by other factors.

Electrostatic solvation
Electrostatic solvation free energy stems from the charge interactions between the solute and the solvent
(e.g. water), plus possibly ions. It can be seen as the free energy change between vacuum and water. The
Poisson Boltzmann (PB) equation and its approximations such as Generalized Born (GB) are methods
rooted in physics to calculate the electrostatic solvation free energy. There exist many reviews on this
topic (Koehl, 2006; Warshel, Shara, Kato et al., 2006).
The catalytic activity of most proteins depends crucially on pH and the redox potential of the solution. The protonation states of titratable groups in the protein are variable and a function of the solvent
environment and ionic strength, as well as of the interaction among titratable groups themselves. Protonation and deprotonation of these side-chains may be calculated using electrostatic solvation free
energy and a Monte Carlo sampling scheme (Ullmann & Knapp, 1999). The Monte Carlo importance
sampling results in a Boltzmann weighting of the titration pattern. Additionally, the accompanying

hydrogen-bonding and salt-bridge conformational changes may also be considered in a self-consistent way (Kieseritzky & Knapp, 2007).
PB and GB model the electrostatic solvation free energy, which includes an entropic component.
However, this is not the whole configurational entropy of the water. PB and GB are mean-field approaches
which do not account for directional interactions with the solute that alter configurational freedom. At
room temperature, the hydrophobic effect is mostly due to a change in this kind of entropy, and is thus
not included.

The hydrophobic effect


The hydrophobic effect is the major driving force affecting protein stability (see Chap. 17 (Fersht,
1999)). Ironically, much more effort has been dedicated to understanding the electrostatic aspects of
solvation.
At room temperature, the hydrophobic effect is primarily due to a maximization of the entropy of
water. The hydrophobic effect arises from a complex interplay of geometric, probabilistic and energetic
effects and remains to this day not fully understood.
Oil drops in water can be used as a model for the behavior of macromolecular solutes such as the
residues of a protein. Let us discuss the situation of oil and water at room temperature. After thoroughly
shaking a bottle with oil and water, oil drops are spread and to a good degree mixed with water. Closed
systems tend to maximize entropy, reaching a state where configurational freedom is highest, while
striking a balance with the enthalpic (energetic) cost. For the oil drops, this state of maximum disorder
would mean spreading out evenly through the liquid in one phase. At temperatures below ~25 °C, the
enthalpy of transfer of oil to water is, surprisingly, favorable. But as we know from experience, the oil
drops gather slowly and form a separate phase. Why?
From the point of view of the water, being mixed with the oil is not so comfortable. Water molecules
in the vicinity of oil prefer to align specifically pointing to other water molecules, because oil cannot form
hydrogen bonds with them. This state with fewer possibilities of arrangement lowers the entropy of the
solution, but is compensated by keeping the hydrogen bonds. The first-shell water molecules arrange in
a network of tangentially oriented O-H bonds. However, the number of such configurations is limited,
i.e. much smaller than the number of possible configurations water molecules can have in bulk water,
thus reducing the entropy. (Rispens, 2004). The oil droplets join together not because they intrinsically
dislike water. The oil collects in one phase to minimize contact area with the water, to minimize the
disturbance. This way the free energy of both phases is optimized.
The situation in the more constrained residues of a protein is similar. The formation of a hydrophobic
core is now recognized as an extremely important driving force in protein folding. At mesophilic temperatures, it also occurs in order to maximize solvent entropy. The net effect at physiological temperature
is a compact protein structure with a lower exposed surface area than the denatured conformation.

Experimentalists' Concept of Hydrophobicity


A well established proxy for the hydrophobic effect is total solvent accessible surface area. As a first
approximation, this area is contributed by all components of the protein equally. If one were to suggest
to an experimental protein scientist that all amino acids are approximately equally hydrophobic, this would be

met with surprise and downright rejection. The conflict is one of definition. The idea of hydrophobicity
that experimentalists are more familiar with is a combination of the hydrophobic effect (water reordering) and solvation free energy. This measures the effective behavior, and makes some side chains more
hydrophilic and some more hydrophobic. There are many tables available with the so-called hydropathicity
index of amino acids. The problem is that the effective behavior might not be transferable to all cases.
The tables differ very much in their classifications. For example, 5 amino acids are ranked in some
scales as most hydrophobic and then again as least hydrophobic in others. (Southall, Dill, & Haymet,
2002) It is more useful to separate this contribution into hydrophobic (water reordering) free energy and
electrostatic solvation free energy to make it more general and explore its microscopic origins.
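A hedged sketch of the surface-area proxy mentioned at the start of this section: the hydrophobic contribution is approximated as a surface-tension-like coefficient times the change in nonpolar solvent-accessible area. The coefficient (values of roughly 5-30 cal/(mol A^2) appear in the literature) and the areas are assumptions for illustration only:

    gamma = 0.025                      # assumed coefficient, kcal/(mol*A^2)

    sasa_nonpolar_unfolded = 12000.0   # assumed nonpolar accessible area, A^2
    sasa_nonpolar_folded = 4000.0      # assumed nonpolar area still exposed when folded

    dG_hydrophobic = gamma * (sasa_nonpolar_folded - sasa_nonpolar_unfolded)
    print(dG_hydrophobic)              # about -200 kcal/mol of favorable burial in this toy case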

Conformational freedom
As stated from several points of view in this review, conformational freedom of the protein and its
ligands makes an important contribution to stability. Loss of conformational entropy opposes folding
and binding.
One frequently unappreciated fact is that many proteins in the cell stably exist in a high-entropy state. They populate a partially folded state sometimes called a "highly structured molten globule". Some of them are speculated to acquire a more rigid structure only upon ligand binding. Such proteins are underrepresented in the literature and in the PDB database. In experiments, they are hard to discern from on-path folding intermediates. In molecular dynamics simulations, their very high flexibility would demand prohibitively long trajectories to acquire reliable statistics (Gruebele, Ervin, Larios et al., 2002).

Theoretical methods to calculate conformational entropy


Conformational entropy may be estimated from energy minimized conformations (as in the case of
Normal Mode Analysis), or from thermodynamically equilibrated collections of conformations from
molecular dynamics (MD) or Monte Carlo (MC) simulations. Some methods to calculate conformational
entropy include:

•	Single-conformation normal mode analysis (NMA). Best done in an implicit solvent to include solvation free energy. It estimates a local harmonic density of states around the minimum. (Hollup, Salensminde, & Reuter, 2005)
•	Quasi-harmonic approximation (QHA). It applies Principal Component Analysis to decompose the covariance matrix of atomic fluctuations; the quantum entropy is then calculated from the harmonic oscillator entropy formula. (Andricioaei & Karplus, 2001) A minimal covariance-based sketch in this spirit is given after this list.
•	Classical entropy from the k-nearest-neighbor algorithm (kNN). It employs a technique from information theory to estimate statistical dependence beyond linear correlation. (Hnizdo, Darian, Federowicz et al., 2007)
•	Quasi-harmonic (QHA) entropy corrected with kNN. (Numata et al., 2007)
•	Mining Minima 2 (M2): decomposition of entropy contributions from local energy wells. (Chang, Chen, & Gilson, 2007)
•	Other methods. (Meirovitch, 2007)
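For concreteness, here is a minimal sketch in the spirit of the covariance-based (quasi-harmonic/Schlitter-type) estimators above, applied to a synthetic "trajectory" of Gaussian fluctuations. The atom count, masses and fluctuation amplitude are assumptions; a real application would use MD coordinates:

    import numpy as np

    kB = 1.380649e-23       # J/K
    hbar = 1.0545718e-34    # J*s
    T = 300.0

    n_atoms, n_frames = 5, 2000
    rng = np.random.default_rng(0)

    # Synthetic Cartesian "trajectory": ~0.5 Angstrom RMS Gaussian fluctuations (assumed).
    coords = rng.normal(scale=0.5e-10, size=(n_frames, 3 * n_atoms))     # meters
    masses = np.full(3 * n_atoms, 12.0 * 1.66054e-27)                    # carbon-like atoms, kg

    sigma = np.cov(coords, rowvar=False)       # positional covariance matrix, m^2
    M = np.diag(masses)

    # Schlitter-type upper bound: S <= (kB/2) * ln det[ 1 + (kB*T*e^2/hbar^2) * M * sigma ]
    arg = np.eye(3 * n_atoms) + (kB * T * np.e**2 / hbar**2) * (M @ sigma)
    S_molecule = 0.5 * kB * np.linalg.slogdet(arg)[1]      # J/K per molecule

    print(S_molecule * 6.02214e23 / 4184.0, "kcal/(mol*K)")  # converted to molar units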

Experimental estimates of conformational entropy


Thermodynamic variables may be measured in the laboratory for example through microcalorimetry
(T dependence) and potentiometric titrations (pH dependence) (Pfeil & Privalov, 1976). For proteins,
calorimetric measurements yield total enthalpy and entropy of folding or association. However, it is not
straightforward to separate this into contributions coming from solvent and solute. This was tried with
limited success. (Makhatadze & Privalov, 1994)
An alternative method probes the internal dynamics of the protein upon binding with several binding
target domains. Six short peptides binding to the same protein (Calmodulin) were found by isothermal
titration calorimetry to have similar affinities (and thus similar free energies of binding). However, the total
enthalpic and entropic components of this free energy differ widely. Figure 5 shows this behavior.
The entropy value obtained from isothermal titration calorimetry experiments that can be read in the
figure above is a net entropy change for all of the following: Calmodulin protein, short peptide, water
(hydrophobic effect) and calcium ions. An experimental measurement of the isolated conformational
entropy of the protein or peptide is much more difficult to obtain.
Entropy is related to conformational variations, a proxy of which is the order parameter, a number
from 0 to 1 that indicates the freedom of a coordinate. In an innovative approach, order parameters are
estimated from NMR experiments. (Igumenova, Frederick, & Wand, 2006) This order parameter was
found to change for several side chain methyl groups upon binding of Calmodulin to the peptides, but
not for the Calmodulin backbone. The conformational entropy is then estimated as a function of the
order parameter.

Figure 5. Enthalpy (ΔH), entropy (−TΔS) and free energy (ΔG) changes for the formation of six calcium-saturated CaM-peptide complexes at 35 °C, as determined by isothermal titration calorimetry. Notice that the binding free energy is almost identical among ligands, but the enthalpic and entropic contributions vary. In the case of nNOS(p) there is even a slight net entropy gain upon binding (Frederick, Marlow, Valentine et al., 2007).

Although this experiment is groundbreaking, it has several limitations. The most conspicuous absence is the conformational entropy change of the small binding peptides, which will surely oppose binding in a significant way because of the rigidification that occurs upon formation of the complex. Nevertheless, the authors found a strong correlation between the total calorimetric entropy change and the Calmodulin conformational entropy change estimated from NMR order parameters (Frederick et al., 2007).

Medical applications of protein thermodynamics


Most drugs currently on the market are agonists or antagonists that directly bind to the active site of a
protein. Antagonists suppress a cellular response, for example by competitively binding to the active
site but not activating a response. Antagonists provoke a conformational change that is not compatible
with catalytic activity or signal transduction. Agonists, on the other hand, bind a group of active sites
and activate a cellular response. Agonists elicit a conformational change that enables signal transduction and resembles the one occurring upon binding of the natural hormones or binding partners of that
receptor.
But not all drugs need to bind to the active site. Allosteric modulators bind outside the receptor
binding site proper and induce a change in binding affinity, and thus a change in activity. Allosteric
antagonists can induce a decrease in binding affinity of agonists by inducing a conformational change. It
is also possible that the allosteric deformation of the receptor increases affinity for an agonist, resulting
in allosteric synergism (H. Lüllmann, 2000). Allosteric modulators directly affect the relevant protein
motion networks.
Most HIV-1 protease inhibitors to date are antagonists of the active site. Design of drugs less susceptible to resistance may be accomplished by altering the thermodynamics of stability and folding of
the protease (PR) dimer. Allosteric inhibitors bind to residues whose dynamics are coupled to the flap
opening-closing collective motion network. They may either keep the flap open or closed shut, inhibiting its cleaving activity. (Perryman, Lin, & McCammon, 2004) Another strategy is to inhibit folding
of the PR dimer; this has been achieved with peptides that bind and reshape the free energy landscape
to make inactive conformations thermodynamically stable. (Broglia, Levy, & Tiana, 2008)
Alzheimer's, Huntington's (Numata, 2005) and Creutzfeldt-Jakob (prion) diseases (Chiti & Dobson, 2006) all share protein misfolding and aggregation as a common feature. Recent experiments have lent credibility to the hypothesis that β-amyloid aggregates are causal in the pathogenesis, at least in Alzheimer's disease (Meyer-Luehmann, Spires-Jones, Prada et al., 2008). The normal folded and the
aggregated misfolded conformations represent two local minima in the free energy landscape. The
misfolded conformation is much lower in entropy, but is stabilized by enthalpy, mostly through tight
van der Waals interactions in a so-called steric zipper (Sawaya, Sambashivan, Nelson et al., 2007). The
two free energy minima are separated by a kinetic barrier to oligomerization. The design of compounds
that block aggregation will hopefully be assisted by a detailed understanding of the thermodynamics
and kinetics of misfolding.

Conclusion
An understanding of energies and entropies can lead to better methods to predict the phenomena that make
life possible. This information can assist us in the design of pharmacologically active substances. Entropy
has often been neglected in biomolecular simulation methods because it is perceived as mysterious and
intractable. It is hoped that this review may have helped the reader grasp an intuitive understanding of
it. Theoretical and experimental methods to estimate entropy are undergoing rapid advances, bringing
closer the prospect of a firmly grounded thermodynamic treatment of biological processes.

Acknowledgment
The author would like to thank Lev Levitin (Boston University) and Dennis J. Diestler (University of
Nebraska-Lincoln) for insightful and entertaining discussions about entropy.

References
Alberty, R. A. (2006). Biochemical thermodynamics - applications of mathematica.
Almlöf, M., André, M., & Åqvist, J. (2007). Energetics of codon-anticodon recognition on the small
ribosomal subunit. Biochemistry, 46, 200-209.
Andricioaei, I., & Karplus, M. (2001). On the calculation of entropy from covariance matrices of the
atomic fluctuations. J Chem Phys, 115, 6289-6292.
Annamalai, K., & Puri, I. K. (2002). Advanced thermodynamics engineering: CRC Press.
Atkins, P., & Paula, J. D. (2006). Physical chemistry for the life sciences. New York: WH Freeman and
Co./Oxford University Press.
Audie, J., & Scarlata, S. (2007). A novel empirical free energy function that explains and predicts protein-protein binding affinities. Biophys Chem, 129, 198-211.
Bahar, I., Chennubhotla, C., & Tobi, D. (2007). Intrinsic dynamics of enzymes in the unbound state and
relation to allosteric regulation. Curr Opin Struct Biol., 17, 633-640.
Bejan, A. (1997). Advanced engineering thermodynamics (2nd edition). New York: Wiley Interscience.
Ben-Naim, A. (2007). Entropy demystified: The second law reduced to plain common sense.
Berezovsky, I. N., Chen, W. W., Choi, P. J., & Shakhnovich, E. I. (2005). Entropic stabilization of proteins and its proteomic consequences. PLoS Computational Biology, 1(4), e47.
Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S., & Karplus, M. (1983).
CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J Comput Chem, 4(2), 187-217.

Billeter, S. R., Webb, S. P., Agarwal, P. K., Iordanov, T., & Hammes-Schiffer, S. (2001). Hydride transfer
in liver alcohol dehydrogenase: Quantum dynamics, kinetic isotope effects, and role of enzyme motion.
J Am Chem Soc, 123, 11262-11272.
Boltzmann, L. (1896). Vorlesungen über Gastheorie. Leipzig: J.A. Barth.
Bragg, M., Atkins, P., Grady, M., & Gribbin, J. (2004). BBC radio 4 -in our time- the second law of
thermodynamics.
Broglia, R. A., Levy, Y., & Tiana, G. (2008). HIV-1 protease folding and the design of drugs which do
not create resistance. Current Opinion in Structural Biology, 18, 60-66.
Brooks III, C. L., Gruebele, M., Onuchic, J. N., & Wolynes, P. G. (1998). Chemical physics of protein
folding. PNAS, 95, 11037-11038.
Chang, C.-E. A., Chen, W., & Gilson, M. K. (2007). Ligand configurational entropy and protein binding. PNAS, 104(5), 1534-1539.
Chen, J., Im, W., & Brooks, C. L. (2006). Balancing solvation and intramolecular interactions: Toward
a consistent generalized born force field (CMAP opt. for GBSW). J Am Chem Soc, 128, 3728-3736.
Chipot, C., Andricioaei, I., Hummer, G., Pande, V., Mark, A., & Simonson, T. (2007). Free energy
calculations: Theory and applications in chemistry and biology.
Chiti, F., & Dobson, C. M. (2006). Protein misfolding, functional amyloid, and human disease. Annu.
Rev. Biochem., 75, 333-366.
Clausius, R. (1865). Über verschiedene für die Anwendung bequeme Formen der Hauptgleichungen der
mechanischen Wärmetheorie. Annalen der Physik, 201, 353-400.
Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.).
Day, A. (1996). The source of stability in proteins (Doctoral presentation at Birkbeck University, London). from http://www.cryst.bbk.ac.uk/PPS2/projects/day/TDayDiss/
Demetrius, L. (1998). Role of Enzyme-Substrate Flexibility in Catalytic Activity: an Evolutionary
Perspective. J Theor. Biol., 194, 175-194.
Dill, K. A. (1997). Additivity principles in biochemistry. J Biol Chem, (272), 701-704.
Dill, K. A., & Bromberg, S. (2003). Molecular driving forces: Garland Science.
Dill, K. A., & Chan, H. S. (1997). From Levinthal to pathways to funnels. Nature Struc Biol, 4, 10-19.
Einstein, A. (1903). Eine Theorie der Grundlagen der Thermodynamik. Annalen der Physik, 14(S1), 135-153.
Einstein, A. (1906). Eine neue Bestimmung der Moleküldimensionen. Annalen der Physik, 19.
Falk, G. (1985). Entropy, a resurrection of caloric-a look at the history of thermodynamics. Eur J Phys,
6, 108-115.

Fersht, A. (1999). Structure and mechanism in protein science: A guide to enzyme catalysis and protein
folding. Freeman.
Feynman, R. (1962). The Feynman lectures on physics, 1, 3-6.
Frederick, K. K., Marlow, M. S., Valentine, K. G., & Wand, A. J. (2007). Conformational entropy in
molecular recognition by proteins. Nature, 448, 325-330.
Gelpí, J. L., Orozco, M., Rueda, M., Ferrer-Costa, C., Meyer, T., Pérez, A., Camps, J., & Hospital, A.
(2007). A consensus view of protein dynamics. PNAS, 104, 796-801.
Gilson, M. K., & Zhou, H.-X. (2007). Calculation of protein-ligand binding affinities. Annu. Rev. Biophys. Biomol. Struct., 36, 21-24.
Gruebele, M., Ervin, J., Larios, E., Osváth, S., & Schulten, K. (2002). What causes hyperfluorescence:
Folding intermediates or conformationally flexible native states? Biophys J, 83(1), 473-483.
Lüllmann, H., Mohr, K., Ziegler, A., & Bieger, D. (2000). Color atlas of pharmacology (2nd ed.). Stuttgart:
Thieme.
Hammes-Schiffer, S. (2006). Enzyme motions inside and out. Science, 312, 208-209.
Hawking, S. (1976). Black holes and thermodynamics. Phys Rev D, 13, 191-197.
Hnizdo, V., Darian, E., Federowicz, A., Demchuk, E., Li, S., & Singh, H. (2007). Nearest-neighbor
nonparametric method for estimating the configurational entropy of complex molecules. J Comput
Chem, 28(3), 655-668.
Hollup, S. M., Salensminde, G., & Reuter, N. (2005). WEBnm@: A Web application for normal mode
analyses of proteins. BMC Bioinformatics, 6(52).
Honig, B., Yang, A.-S., & Sharp, K. A. (1992). Analysis of the heat capacity dependence of protein
folding. J Mol Biol, 227(3), 889-900.
Hwang, W., Zhang, S., Kamm, R. D., & Karplus, M. (2004). Kinetic control of dimer structure formation in amyloid fibrillogenesis. PNAS, 101(35), 12916-12921.
Igumenova, T. I., Frederick, K. K., & Wand, A. J. (2006). Characterization of the fast dynamics of protein
amino acid side chains using NMR relaxation in solution. Chem Rev, 106(5), 1672-1699.
Jaynes, E. T. (1957). Information theory and statistical mechanics (part 1). Phys. Rev., 106(4), 620-630.
Jaynes, E. T. (1965). Gibbs vs. Boltzmann entropies. Am J Phys, 33(5), 391-398.
Jaynes, E. T., & Bretthorst, G. L. (2003). Probability theory: The logic of science.
Karplus, M. (2006). Spinach on the ceiling: A theoretical chemist's return to biology (Autobiography).
Annu. Rev. Biophys. Biomol. Struct., 35, 1-47.
Karplus, M., & Kushick, J. N. (1981). Method for estimating the configurational entropy of macromolecules. Macromolecules, 14(2), 325-332.

Kauzmann, W. (1943). The nature of the glassy state and the behavior of liquids at low temperatures.
Chem Rev, 43(2), 219-256.
Keeler, J. (2005). Molecular energy levels and thermodynamics (Handout for Part IB Chemistry A at
Cambridge University, UK). from http://www-teach.ch.cam.ac.uk/teach/IBA/
Kieseritzky, G., & Knapp, E. W. (2007). Optimizing pKA computation in proteins with pH adapted
conformations. Proteins: Structure, Function, and Bioinformatics, (Early online).
Koehl, P. (2006). Electrostatics calculations: Latest methodological advances. Current Opinion in
Structural Biology, 16, 142-151.
Kragh, H. (2000). Max Planck: the reluctant revolutionary. Physics World.
Lange, O., Lakomek, N. A., Farès, C., Schröder, G. F., Walter, K. F. A., Becker, S., Meiler, J., Grubmüller,
H., Griesinger, C., & de Groot, B. L. (2008). Recognition dynamics up to microseconds revealed from an
RDC-derived ubiquitin ensemble in solution. Science, 320, 1471-1475.
Laughlin, R. B., Pines, D., Schmalian, J., Stojkovic, B. P., & Wolynes, P. (2000). The middle
way. PNAS, 97(1), 32-37.
Lazar, T. (2003). Book Reviews: Molecular driving forces: Statistical thermodynamics in chemistry and
biology. By K. A. Dill, S. Bromberg. Macromolecular Chemistry and Physics, 204(14), 1800.
Lazaridis, T., & Karplus, M. (2003). Thermodynamics of protein folding: A microscopic view. Biophysical Chem, 100, 367-395.
MacKerell Jr., A. (2004). Empirical force fields for biological macromolecules: Overview and issues.
J Comput Chem, 25, 1584-1604.
Makhatadze, G. I., & Privalov, P. L. (1994). Hydration effects in protein unfolding. Biophys Chem, 51,
291-309.
Mark, A. E., & van Gunsteren, W. F. (1994). Decomposition of the free energy of a system in terms
of specific interactions: Implications for theoretical and experimental studies. J Mol Biol, 240(2), 167-176.
McCammon, J., Gelin, B., & Karplus, M. (1977). Dynamics of folded proteins. Nature, 267, 585-590.
Meirovitch, H. (2007). Recent developments in methodologies for calculating the entropy and free energy
of biological systems by computer simulation. Curr Opin Struct Biol., 17, 181-186.
Meyer-Luehmann, M., Spires-Jones, T. L., Prada, C., Garcia-Alloza, M., Calignon, A. d., Rozkalne, A.,
Koenigsknecht-Talboo, J., Holtzman, D. M., et al. (2008). Rapid appearance and local toxicity of amyloid
beta plaques in a mouse model of Alzheimers disease. Nature, 451, 720-724.
Mihailescu, M., & Gilson, M. K. (2004). On the theory of noncovalent binding. Biophys J, 87, 23-26.
Monod, J., Wyman, J., & Changeux, J.-P. (1965). On the nature of allosteric transitions: A plausible model.
J Mol Biol, 12, 88-118.

Mueller, E. A., & Gubbins, K. E. (2001). Molecular-based equations of state for associating fluids: A
review of SAFT and related approaches. Ind. Eng. Chem. Res., 40, 2193-2211.
Numata, J. (2005). Conformational search of Huntingtin in the early steps of aggregation (MSc. Thesis).
Freie Universität Berlin.
Numata, J., Wan, M., & Knapp, E.-W. (2007). Conformational entropy of biomolecules: Beyond the
quasi-harmonic approximation. Genome Informatics, 18, 192.
Pace, C., Shirley, B., McNutt, M., & Gajiwala, K. (1996). Forces contributing to the conformational
stability of proteins. FASEB Journal, 10, 75-83.
Penrose, R. (2005). The road to reality: A complete guide to the laws of the universe: Vintage books
USA.
Perryman, A. L., Lin, J.-H., & McCammon, J. A. (2004). HIV-1 protease molecular dynamics of a wild-type and of the V82F/I84V mutant: Possible contributions to drug resistance and a potential new target
site for drugs. Prot Sci, 13, 1108-1123.
Pfeil, W., & Privalov, P. (1976). Thermodynamic investigations of proteins: I. Standard functions for
proteins with lysozyme as an example. Biophys Chem, 4(1), 23-32.
Pickett, S. D., & Sternberg, M. J. (1993). Empirical scale of side-chain conformational entropy in protein
folding. J Mol Biol, 231(3), 825-839.
Planck, M. (1918). Nobel lecture: The genesis and present state of development of the quantum theory
[Electronic Version], from http://nobelprize.org/nobel_prizes/physics/laureates/1918/planck-lecture.
html
Planck, M. (1945). Treatise on thermodynamics (German 7th, English 1st ed.): Dover.
Poznansky, M., & Lange, Y. (1978). Transbilayer movement of cholesterol in phospholipid vesicles under
equilibrium and non-equilibrium conditions. Biochim Biophys Acta, 506(2), 256-264.
Rispens, T. (2004). Cycloadditions in weakly and highly organized aqueous media. Rijksuniversiteit
Groningen.
Rosenbluth, M. N., & Rosenbluth, A. W. (1954). Further results on monte carlo equations of state (Hard
spheres). J Chem Phys, 22, 881-884.
Sawaya, M. R., Sambashivan, S., Nelson, R., Ivanova, M. I., Sievers, S. A., Apostol, M. I., Thompson,
M. J., Balbirnie, M., et al. (2007). Atomic structures of amyloid cross-β spines reveal varied
steric zippers. Nature, 447, 453-457.
Schlitter, J. (1993). Estimation of absolute and relative entropies of macromolecules using the covariance
matrix. Chem Phys Letters, 215, 617-621.
Schrödinger, E. (1944). What is life? The physical aspect of the living cell.
Sept, D., & McCammon, J. A. (2001). Thermodynamics and Kinetics of Actin Filament Nucleation.
Biophys J, 81(2), 667-674.

Shakhnovich, E. (2006). Protein folding thermodynamics and dynamics: Where physics, chemistry and
biology meet. Chem. Rev., 106, 1559-1588.
Shannon, C. E., & Weaver, W. (1948). A mathematical theory of communication. Bell Syst. Tech. J.
Shuman, C. F., Hämäläinen, M. D., & Danielson, U. H. (2004). Kinetic and thermodynamic characterization of HIV-1 protease inhibitors. J Mol Recognit, 17(2), 106-119.
Southall, N. T., Dill, K. A., & Haymet, A. D. J. (2002). A view of the hydrophobic effect. J Phys Chem
B, 106, 521-533.
Uffink, J. (2006). Insuperable difficulties: Einstein's statistical road to molecular physics. Studies in
History and Philosophy of Modern Physics.
Ullmann, M., & Knapp, E. W. (1999). Electrostatic models for computing protonation and redox equilibria in proteins. Eur Biophys J, 28, 533-551.
van Gunsteren, W. F., Bakowies, D., Baron, R., Chandrasekhar, I., Christen, M., Daura, X., Gee, P.,
Geerke, D. P., et al. (2006). Biomolecular modeling: Goals, problems, perspectives. Angew. Chem. Int.
Ed., 45(25), 4064-4092.
Wales, D. J. (2005). Energy landscapes and properties of biomolecules. Physical Biology, 2, S86-S93.
Warshel, A., Sharma, P. K., Kato, M., & Parson, W. W. (2006). Modeling electrostatic effects in proteins.
Biochimica et Biophysica Acta, 1764(11), 1647-1676.
Zoete, V., Meuwly, M., & Karplus, M. (2005). Study of the insulin dimerization: Binding free energy
calculations and per-residue free energy decomposition. Proteins: Structure, Function, and Bioinformatics, 61, 79-93.

KeY TermS
Configurational Energy Landscape: Also known as the potential energy surface. It is a high-dimensional surface depicting the potential energy of a molecule against each distance coordinate.
Configurational Entropy: The conformational entropy plus the entropy due to spatial rearrangements. For example, water has an internal molecular conformational freedom (vibrational, rotational
and translational) and can also become rearranged in several hydrogen bonding constellations, giving
it configurational freedom.
Conformational Entropy: A measure of the internal freedom of a molecule, having contributions
from vibrations (like stretching), rotations and translations. It is a logarithmic measure of the density
of states (multiplicity) of a macrostate or conformer.
Conformer: Collection of macrostate conformations of a molecule or protein with similar energies
around a major energy well. Well-known examples are the boat and chair conformers of cyclohexane
and glucose.

Entropy: A counting of the states available to a system on a logarithmic scale. In thermodynamics, this is multiplied by Boltzmann's constant k_B, which is only due to our arbitrary choice for the units of temperature. Entropy in information theory and thermodynamics is equally a measure of the multiplicity of states.
Folding Funnel: Also known as the free energy landscape. In its reduced 3D projection, it is a cartoon of what actually is a high-dimensional free energy surface. The folding funnel is a representation of the change in enthalpy (vertical axis) and conformational entropy (both horizontal axes) during
protein folding.
Free Energy: A criterion for stability which predicts the direction of a process connecting two states.
A system will undergo a process spontaneously if it lowers its free energy. For example, a partially unfolded, denatured protein in a high free energy state will fold to its native state with lower free energy.
Free energy encompasses together energetic (enthalpic) and entropic driving forces.
Hydrophobic Effect: The tendency of large solutes to gather together when solvated in associating
liquids. The hydrophobic effect causes phase separation of oil drops in water at room temperature mostly
to avoid disturbing the hydrogen bonding pattern of water. The disturbance of hydrogen bonding by
oil causes a lowering of entropy. Phase separation occurs to raise the water entropy and lower the total
free energy of the whole solution. The hydrophobic effect is the main driving force for protein folding,
causing polypeptide chains to collapse onto themselves. Generalized to associating solvents other than
water, it is known as the solvophobic effect.
Information Theory: A branch of applied mathematics, closely related to statistics, where the central
quantity is entropy. It is widely used in communications and thermodynamics to analyze the multiplicity
of states and arrangements of a system.
Macrostate: A collection of individual conformations (microstates) with the same energy. Experimentally accessible.
Microstate: Individual, unique conformation of a molecule with a given energy. Experimentally
indistinguishable from other conformations with identical energy and similar conformation.

Chapter XLII

Model Development and Decomposition in Physiology
Isabel Reinecke
Zuse Institute Berlin, Germany
Peter Deuflhard
Zuse Institute Berlin, Germany; Freie Universität Berlin, Germany;
and DFG Research Center Matheon, Germany

abstract
In this chapter, some model development concepts that can be used for mathematical modeling in
physiology, as well as a graph theoretical approach to a decomposition technique that simplifies
parameter estimation, are presented. This is illustrated by the human menstrual cycle. First some modeling
fundamentals are introduced that are applied to the model development of the human menstrual cycle.
Then it is shown how a complex mathematical model can be handled if a large number of parameters
are used where the parameter values are not known for the most part. A method is presented to divide
the model into smaller, disjoint model parts. At the same time, it is shown how this technique works in
the case of the human menstrual cycle. The principles for model development and decomposition can
be used for other physiological models as well.

Introduction
This chapter presents how a complex mathematical model of physiological processes (e.g. control systems in the human body) can be developed. Complexity makes successful parameter estimation difficult,
which is why a way to simplify this problem is shown.
In the first part of this chapter, Background, some modeling techniques are presented which are used
in the second part, Development of a Complex Mathematical Model for the Human Menstrual Cycle. In
the third part, Decomposition of Complex Mathematical Models in Physiology, it is demonstrated, how
complex mathematical models can be divided reasonably into smaller parts in order to simplify parameter
estimation. This decomposition method is applied to the model for the human menstrual cycle.

Background
First, some principal modeling concepts are introduced that could be useful in the modeling of physiological processes and that are used to construct the complex mathematical model of the human menstrual
cycle. The concept of compartmentalization of the considered body parts and how the connections between the compartments can be modeled, for example, via receptor binding and feedback mechanisms,
is described. If the biochemical mechanisms are known, simple reaction kinetics can be used and if
enzymes catalyze the reaction, simple enzyme kinetics are applied. Taking into account the fact that the
different elements of the system influence each other with a certain delay, delay differential equations
instead of ordinary differential equations are used.

Compartmentalization of the Human Body


The human body is not a closed homogeneous system; it consists of organs and tissues etc. in which
different processes take place. In order to reduce biological complexity, the body parts that are essential
in the processes are extracted and divided into discrete body elements, referred to as compartments
(Andersen, 1991) that are interconnected via the shared blood system (Luecke & Wosilait, 1979), here
called transport compartment. The characteristics of compartments are that isolated processes take
place, but at the same time they can interact with each other. The model formulating the relations
between these compartments is called a compartmental model (Takeuchi et al., 2007), and this process
of organizing the human body into compartments is referred to as compartmentalization, which is the
concept of pharmacokinetic modeling (Andersen, 1991). More precisely, physiologically based compartments are used in this context since the compartments are based on the actual anatomy and physiology
(Andersen, 1991).

Interfaces between the Compartments


The compartments are non-closed systems and can influence each other. The question arises, how exactly
the exchange takes place and what possibilities there are for interrelations between compartments. On
the one hand, it is possible that only coarse interrelations such as inhibiting or stimulating effects are
known. Then these feedback mechanisms can be modeled by Hill functions. On the other hand, it is
possible to model on a biochemical basis via e.g. receptor binding.

Receptor Binding and Recycling


It is often the case that the hormone which is synthesized in one compartment reaches another compartment through the blood circulation and binds to its corresponding receptor. They form a complex and
thereby the receptor becomes activated:
R_{activatable} + H \rightarrow (H\text{-}R_{activated})

Usually, the receptor does not return directly to the activatable state after dissociating from the
hormone, but becomes first desensitized or inactivatable:
R_{activatable} \rightarrow R_{activated} \rightarrow R_{inactivatable} \rightarrow R_{activatable}

Feedback Regulation
If a substrate S stimulates another substrate P, we have:
S increases \;\Rightarrow\; P increases,
S decreases \;\Rightarrow\; P decreases,

where the italics denote the corresponding concentrations now and in the following. The inhibition of
S on P can be described by:

S increases \;\Rightarrow\; P decreases,
S decreases \;\Rightarrow\; P increases.

A simple method to model these observations mathematically is the application of Hill functions. They
take values between 0 and 1 and are thus bounded. Additionally, Hill functions provide a sort of
switch, making it possible to model the requirement that the concentration has to exceed a certain threshold value in
order to be effective. This switch becomes sharper the higher the Hill coefficient is. At the switch,
the values of the Hill function change from values near 0 to values near 1.
The Hill function for stimulation and the resulting dynamics of P if S is stimulatory are given by:

h^+(S; T, n) := \frac{(S/T)^n}{1 + (S/T)^n} = \frac{S^n}{T^n + S^n}, \qquad \frac{d}{dt}P = \alpha_{p^+}\, h^+(S; T, n), \quad \alpha_{p^+} \in \mathbb{R}_+

and the Hill function for inhibition and the resulting dynamics of P if S is inhibitory are given by:

h^-(S; T, n) := \frac{1}{1 + (S/T)^n} = \frac{T^n}{T^n + S^n}, \qquad \frac{d}{dt}P = \alpha_{p^-}\, h^-(S; T, n), \quad \alpha_{p^-} \in \mathbb{R}_+

where T ∈ ℝ_+ denotes the threshold value and n ≥ 1 the Hill coefficient. It holds that h^+(T^+; T^+, n) = h^-(T^-; T^-, n) = 1/2.
If S exerts an inhibitory effect at low concentrations and a stimulatory effect at high concentrations
on the dynamics of P, then we have the biphasic Hill function (Reinecke & Deuflhard, 2007):
h^{-,+}(S) := h^{-,+}(S; T^-, T^+, n) := h^-(S; T^-, n) + h^+(S; T^+, n), \qquad T^- < T^+

The minimum is given by TS , + :=(T T +) 2, i.e. TS , + is the threshold value for the switch from inhibition to stimulation (Reinecke & Deuflhard, 2007). The biphasic effect is shown in Figure 1.
The dynamics of P are given by:

\frac{d}{dt}P = \alpha_{p^{-,+}}\, h^{-,+}(S; T^-, T^+, n), \qquad \alpha_{p^{-,+}} \in \mathbb{R}_+

Figure 1. Biphasic Hill function h^{-,+}(\cdot) with the threshold value T_S^{-,+}
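Because these Hill terms recur in nearly every feedback equation of the model below, a minimal sketch may be useful. It assumes NumPy; the function names and example values are illustrative choices for this sketch, not notation from Reinecke & Deuflhard (2007).

```python
import numpy as np

def hill_plus(s, threshold, n):
    """Stimulatory Hill function h+(S; T, n) = S^n / (T^n + S^n)."""
    s = np.asarray(s, dtype=float)
    return s**n / (threshold**n + s**n)

def hill_minus(s, threshold, n):
    """Inhibitory Hill function h-(S; T, n) = T^n / (T^n + S^n)."""
    s = np.asarray(s, dtype=float)
    return threshold**n / (threshold**n + s**n)

def hill_biphasic(s, t_minus, t_plus, n):
    """Biphasic Hill function: inhibitory at low S, stimulatory at high S (T- < T+)."""
    return hill_minus(s, t_minus, n) + hill_plus(s, t_plus, n)

# At the threshold S = T the stimulatory Hill function equals 1/2,
# and the biphasic minimum lies near the geometric mean sqrt(T- * T+).
print(hill_plus(2.0, 2.0, 4))                      # 0.5
print(hill_biphasic(np.sqrt(1.0 * 9.0), 1.0, 9.0, 4))
```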

Metabolic Processes
Many processes in the human body are only partly explored, and this limits the modeling possibilities.
In this context, neither molecular processes nor genetic nor spatial effects in cells are incorporated.
Detailed modeling is considered here only where the biochemical reaction mechanisms are known to a certain extent.

Reaction Kinetics
If reaction mechanisms are given, then they can be modeled by simple reaction kinetics if no enzymes
are involved.
Consider the irreversible reaction

S \xrightarrow{k} P.

Then simple reaction kinetics for the concentrations of the product P and of the substrate S can be assumed:

\frac{d}{dt}P = k\, S, \qquad \frac{d}{dt}S = -\frac{d}{dt}P, \qquad k \in \mathbb{R}_+.

If we consider the reversible overall reaction

S \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} P,

then simple reaction kinetics lead to the product and substrate rates:

\frac{d}{dt}P = k_1\, S - k_{-1}\, P, \qquad \frac{d}{dt}S = -\frac{d}{dt}P, \qquad k_1, k_{-1} \in \mathbb{R}_+.

Enzyme Kinetics
Many reactions in the human body are catalyzed by enzymes. To describe the dynamics in this case,
the mechanism called Michaelis-Menten is chosen. This is the simplest and most common approach
for enzyme catalyzed reactions.
By introducing the enzyme E, the mechanism for the irreversible overall reaction (see above) is
given by:
E + S \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} (E\text{-}S) \xrightarrow{k_2} P + E

where (E-S) denotes the complex. Assuming constant total enzyme concentration and the quasi-stationary state for (E-S), the reaction rates for the product P and the substrate S can be modeled by:
\frac{d}{dt}P = k_2\, E_{total}\, \frac{S}{K_M + S}, \qquad \frac{d}{dt}S = -\frac{d}{dt}P, \qquad K_M := \frac{k_{-1} + k_2}{k_1}, \quad k_1, k_{-1}, k_2 \in \mathbb{R}_+

where E_{total} := E + (E\text{-}S) denotes the total enzyme concentration and K_M the Michaelis-Menten constant.
Compared to the simple reaction kinetics, we notice that the dynamics of P depends on S in a saturating (hyperbolic) fashion instead of linearly.
In the case of the reversible reaction, we have:
E + S \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} (E\text{-}S) \underset{k_{-2}}{\overset{k_2}{\rightleftharpoons}} P + E

Assuming again a constant total enzyme concentration and quasi-stationary state for (E-S), the
dynamics of P depends no longer only on S, but also on P:

\frac{d}{dt}P = E_{total}\, \frac{k_2\, S - \dfrac{k_{-1} k_{-2}}{k_1}\, P}{K_M + S + \dfrac{k_{-2}}{k_1}\, P}, \qquad \frac{d}{dt}S = -\frac{d}{dt}P, \qquad k_{-2} \in \mathbb{R}_+

For simplification of notation, we introduce for the irreversible reaction:


f_{irrev}(S, E; p^{irrev}) := p_1^{irrev}\, E\, \frac{S}{p_2^{irrev} + S}, \qquad p^{irrev} := (p_1^{irrev}, p_2^{irrev})^T \in \mathbb{R}_+^2

It is p_1^{irrev} = k_2 and p_2^{irrev} = K_M. For the reversible reaction, we define:

f_{rev}(P, S, E; p^{rev}) := E\, \frac{p_1^{rev}\, S - p_2^{rev}\, P}{p_3^{rev} + S + p_4^{rev}\, P}, \qquad p^{rev} := (p_1^{rev}, \ldots, p_4^{rev})^T \in \mathbb{R}_+^4

It is p_1^{rev} = k_2, p_2^{rev} = \dfrac{k_{-1} k_{-2}}{k_1}, p_3^{rev} = K_M, and p_4^{rev} = \dfrac{k_{-2}}{k_1} (Reinecke & Deuflhard, 2007).

Normally, the reactions in the human body are reversible. But to simplify the modeling of this mechanism, it is possible to neglect the reverse reaction if the rate constant of the reverse reaction is very small compared to the rate constant of the forward reaction, k_{-1} \ll k_1. The advantage is a lower number of parameters.
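Since f_irrev and f_rev are reused in the steroidogenesis equations later in this chapter, a small sketch of the two rate laws may help; all parameter values below are invented for illustration and do not come from the model.

```python
def f_irrev(S, E, p):
    """Irreversible Michaelis-Menten rate: p[0]*E*S / (p[1] + S), with p = (k2, KM)."""
    return p[0] * E * S / (p[1] + S)

def f_rev(P, S, E, p):
    """Reversible Michaelis-Menten rate: E*(p[0]*S - p[1]*P) / (p[2] + S + p[3]*P)."""
    return E * (p[0] * S - p[1] * P) / (p[2] + S + p[3] * P)

# With no product present, the reversible rate reduces to the irreversible one,
# because the p[1]*P and p[3]*P terms vanish.
p_irrev = (1.0, 0.5)                 # (k2, KM), illustrative values
p_rev = (1.0, 0.2, 0.5, 0.1)
print(f_irrev(S=2.0, E=1.0, p=p_irrev))      # 0.8
print(f_rev(P=0.0, S=2.0, E=1.0, p=p_rev))   # 0.8
```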

Delay Differential Equations


By the compartmentalization, essential spatial effects are incorporated. Usually the dynamics of the
concentrations are well described by differential equations.
If there are body parts where only few molecules participate in the processes as e.g. in the cell, then
the modeling by differential equations is not adequate. Cells play a certain role in the modeling of the
human menstrual cycle, e.g. in the follicular development. However, a high number of cells are involved
in this case and it is the total effect that is interesting, rather than the processes within a single cell.
In order to consider the time delays, e.g. due to transport effects, they can be incorporated into the differential equations by using dependency on t − τ instead of t, with the delay τ ∈ ℝ_+. The system of delay differential equations is given by

\frac{d}{dt}y(t) = f\big(t,\, y(t),\, y(t-\tau_1),\, \ldots,\, y(t-\tau_m)\big)

where y: \mathbb{R} \to \mathbb{R}^n, f: \mathbb{R} \times \mathbb{R}^{n(m+1)} \to \mathbb{R}^n, n, m \in \mathbb{N}, and \tau_i \in \mathbb{R}_+, i = 1, \ldots, m, denote the constant delays.
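The delay enters the right-hand side only through stored past values, which the following toy integrator illustrates. It is a plain explicit Euler scheme with a history buffer, assumed here for illustration only; it is not the solver used for the menstrual-cycle model, and all names are placeholders.

```python
import numpy as np

def integrate_dde(f, history, tau, t_end, dt=0.01):
    """Explicit-Euler integration of dy/dt = f(t, y(t), y(t - tau)) for scalar y.

    'history' supplies y(t) for t <= 0; a fixed step keeps delayed values on the grid.
    Toy scheme, only meant to show how a constant delay enters the right-hand side.
    """
    n_steps = int(round(t_end / dt))
    lag = int(round(tau / dt))
    ts = np.arange(n_steps + 1) * dt
    ys = np.empty(n_steps + 1)
    ys[0] = history(0.0)
    for i in range(n_steps):
        t = ts[i]
        y_delayed = history(t - tau) if i < lag else ys[i - lag]
        ys[i + 1] = ys[i] + dt * f(t, ys[i], y_delayed)
    return ts, ys

# Example: the linear test equation dy/dt = -y(t - 1) with constant history y = 1.
ts, ys = integrate_dde(lambda t, y, y_lag: -y_lag, history=lambda t: 1.0, tau=1.0, t_end=5.0)
print(ys[-1])
```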

Development of a Complex Mathematical Model for the Human


Menstrual Cycle
This section shows an example of how to model processes in the human body, namely the modeling
of the human menstrual cycle. Using models for feedback mechanisms, reaction and enzyme kinetics,
we obtain differential equations for the dynamics of the involved elements as, for example, hormones,
enzymes, receptors and follicular masses. Moreover, a special feature of the human menstrual cycle
is presented. From the hypothalamus, the hormone GnRH is released in pulses that cannot simply be
modeled by use of differential equations but by use of a stochastic approach.
Despite the relevance of this subject, few mathematical models exist dealing with the human menstrual cycle or parts of it (e.g. Blum et al., 2000; Chávez-Ross et al., 1997; Clément et al., 2001; Harris,
2001; Lacker & Akin, 1988; Washington et al., 2004). One of these models (Harris, 2001) is taken as
basis in the further modeling since it is the most elaborate of the models for the human menstrual cycle.
It comprises the dynamics in the pituitary and in the ovaries and calculates progesterone, estradiol,

and inhibin, essential components of the control system, as linear combinations of follicular masses.
Unfortunately, it offers no modeling approach for the GnRH pulse generator in the hypothalamus. A
complex mathematical model for the human menstrual cycle that is based on this model is developed
in Reinecke & Deuflhard (2007). It is presented in the following.
Basically the processes take place in three different areas, the hypothalamus, the pituitary, and the
ovaries (compartments) that are interconnected through the blood circulation (transport compartment).
The compartment model of this control system is shown in Figure 2. On the basis of this figure, the
model is developed bit by bit by replacing the compartment and the connections between compartments
by a detailed modeling approach. The final model is presented in a figure at the end of this section.

Hypothalamus
It is possible that the substrates are not released at a constant rate into the blood system, but in pulses.
The question arises whether the deterministic modeling approach by differential equations can capture this
effect. Modeling by a stochastic approach seems to be more suitable in this case.

Figure 2. A compartmental model of the human menstrual cycle. There are three compartments (solid),
the hypothalamus, the pituitary and the ovaries that share the blood system. It is distinguished between
the pituitary portal system (hypophysic blood system) and the remaining blood system both represented
by transport compartments (dashed). In the hypothalamus the hormone GnRH, in the pituitary the gonadotropins luteinizing hormone (LH) and follicle simulating hormone (FSH), and in the ovaries the
steroids progesterone (P4) and estradiol (E2) as well as the hormone inhibin (Ih) are synthesized.

In the hypothalamus, the processes are determined by a mechanism called GnRH pulse generator
resulting in pulsatile release of GnRH into the pituitary portal system. The GnRH neurons are integrated
in the neural network and represent the output neurons (Herbison, 1997). GnRH is stored in the neurons
and released when the neurons become stimulated.
This neural network could be modeled explicitly (Gordan et al., 1998). The main reason for avoiding
this is the uncertainty of the exact processes in the hypothalamus concerning the regulation within the
menstrual cycle (Herbison, 1997; Skinner et al., 1998). Another reason is the immense effort required
for modeling the neural network in a decent manner. Instead, a stochastic approach is chosen to model
the overall effect of the pulsatile release of GnRH (Keenan & Veldhuis, 1998).
Basically there are two components of this mechanism that are modeled:
GnRH pulse frequency (pulse time points)
GnRH pulse mass (the amount of GnRH that is stored between two pulse time points and that is
released per pulse)

The pulse time points are modeled by a renewal process:


T_0 = 0, \qquad T_{j+1} = T_j + S_{j+1}, \qquad j \in \mathbb{N}_0

where \{S_j\}_{j \in \mathbb{N}} denotes a sequence of independent and identically distributed random variables with the
distribution function f. If flexibility in the regulation of the pulse pattern is desired, the most natural
approach, the choice of the Poisson process is not suitable, as opposed to the Weibull density for the
survival time between two pulses:
f(s) = P\big[s \mid \lambda(\cdot)\big] = \gamma\,\lambda(s)\,\big(\lambda(s)\, s\big)^{\gamma-1} \exp\!\big(-(\lambda(s)\, s)^{\gamma}\big)

If a sort of stochastic time transformation of a Weibull renewal process is exerted, i.e. transformation of the deterministic term \lambda(s)\, s into the stochastic term \int_{T_{j-1}}^{s} \lambda(r)\, dr, we obtain:

f(s) = P\big[s \mid T_{j-1}, \lambda(\cdot)\big] = \gamma\,\lambda(s)\,\Big(\int_{T_{j-1}}^{s} \lambda(r)\, dr\Big)^{\gamma-1} \exp\!\Big(-\Big(\int_{T_{j-1}}^{s} \lambda(r)\, dr\Big)^{\gamma}\Big)

Through the stochastic time transformation, the pulse time points become dependent on each other.
Feedback regulation of the GnRH pulse frequency and mass by progesterone and estradiol represents the interface from the ovaries to the hypothalamus. The steroids are released into the blood and reach
the hypothalamus through the blood circulation. They influence the work of the neural network in a
complicated and not fully known way. Only inhibitory, stimulatory or a combined influence can be
observed. These feedback mechanisms are modeled by Hill functions.
The function λ(·) describes the pulse intensity and allows the feedback regulation of the pulse time
points. Progesterone has a negative effect and estradiol a positive effect (at high concentrations) on the
GnRH pulse frequency (Greenspan & Strewler, 1997; Keck et al., 2002), Eq. 1 in Table 2 at the end of
this chapter (Reinecke & Deuflhard, 2007):

\frac{d}{dt}\Lambda(t) = \lambda(t) = h^-\big(P4(t-\tau_{P4});\; T^{freq}_{P4}, n^{freq}_{P4}\big)\,\Big(1 + h^+\big(E2(t-\tau_{E2});\; T^{freq}_{E2}, n^{freq}_{E2}\big)\Big)\,\lambda_{max}, \qquad \lambda_{max} \in \mathbb{R}_+

where \Lambda(s) := \int_0^s \lambda(r)\, dr denotes the cumulative pulse intensity function.
The pulse time points are calculated by:


\int_{T_{j-1}}^{T_j} \lambda(r)\, dr = \big(-\ln(1 - U_j)\big)^{1/\gamma}, \qquad U_j \sim U[0,1], \quad \gamma \in \mathbb{R}_+
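Given a pulse intensity λ(·) (for example the gnrh_pulse_intensity sketch above), the inversion formula can be evaluated numerically. The sketch below accumulates the cumulative intensity with a simple Riemann sum until the Weibull-type quantile is reached; the function name, step size, and constant-intensity example are assumptions of the sketch.

```python
import numpy as np

def next_pulse_time(lam, t_prev, gamma, dt=0.01, rng=None):
    """Draw the next pulse time point T_j given the last one T_{j-1} = t_prev.

    Accumulates the pulse intensity lam(t) numerically until it reaches the
    Weibull-type quantile (-ln(1 - U))**(1/gamma). Purely illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    target = (-np.log(1.0 - rng.uniform())) ** (1.0 / gamma)
    cumulative, t = 0.0, t_prev
    while cumulative < target:
        cumulative += lam(t) * dt      # left Riemann sum of the intensity
        t += dt
    return t

# Example with a constant intensity of one pulse per unit time and gamma = 2.
rng = np.random.default_rng(1)
times = [0.0]
for _ in range(5):
    times.append(next_pulse_time(lambda t: 1.0, times[-1], gamma=2.0, rng=rng))
print(times)
```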

Estradiol affects also the amount of GnRH pulse mass MGnRH,j that is stored between two pulses
Tj-1 and Tj. It can be assumed that estradiol is inhibitory at low concentrations and stimulatory at high
concentrations, which is why the biphasic Hill function is applied (Reinecke & Deuflhard, 2007), Eq.
2 in Table 2:

\frac{d}{dt}M_{GnRH}(t) = \Big(h^-\big(E2(t-\tau_{E2});\, T_{E2,1}, n_{E2,1}\big) + h^+\big(E2(t-\tau_{E2});\, T_{E2,2}, n_{E2,2}\big)\Big)\, M_{max}, \qquad M_{max} \in \mathbb{R}_+,

M_{GnRH,j} = M_{GnRH}(T_j) - M_{GnRH}(T_{j-1})

GnRH Concentration in the Pituitary Portal System


The GnRH concentration mainly plays a role in the pituitary portal system (Moenter et al., 1992) and
it is negligible in the remaining blood system. In addition to the release of GnRH in pulses, there is a basal GnRH secretion at a constant rate b_GnRH. The GnRH mass is distributed in the blood volume V_PPS. It is assumed that the distribution does not occur at a constant rate but attains a peak shortly after the start of the release, that is, the last pulse time point, which can be modeled by use of a Gamma density. GnRH binds to its receptor in the pituitary at the rate α_GnRH and is eliminated at the rate β_GnRH.
Thus, we obtain the following differential equation (Reinecke & Deuflhard, 2007), Eq. 3 in Table 2:
\frac{d}{dt}GnRH(t) = \frac{b_{GnRH}}{V_{PPS}} + \frac{M_{GnRH,j}}{V_{PPS}}\, g(t - T_j) - \alpha_{GnRH}\, GnRH(t)\, R_{GnRH}(t) - \beta_{GnRH}\, GnRH(t)

where t ∈ [T_j, T_{j+1}) for all j ∈ ℕ and:

g(t) = \begin{cases} \dfrac{a^m}{\Gamma(m)}\, t^{m-1} \exp(-a\, t) & \text{if } t \geq 0 \\ 0 & \text{if } t < 0 \end{cases} \qquad a, m \in \mathbb{R}_+,\; m > 1

denotes the standard Gamma density. The masses of the foregoing pulses are not considered since it is
assumed that:
\sum_{i=1}^{j-1} \frac{M_{GnRH,i}}{V_{PPS}}\, g(t - T_i) \;\ll\; \frac{M_{GnRH,j}}{V_{PPS}}\, g(t - T_j), \qquad t \in [T_j, T_{j+1})

Pituitary
The hormone GnRH reaches the pituitary through the pituitary portal system, where it binds to its receptor, thereby stimulating the release of LH and FSH, which are synthesized in the pituitary.

Interface from Hypothalamus to Pituitary


The receptor binding can be modeled as presented in the section Background, where (R_GnRH-d) denotes the
desensitized state. To make the total receptor concentration more dynamic, we assume that GnRH exerts
a negative feedback not only on the number of free receptors but also on the total number of receptors
(Reinecke & Deuflhard, 2007). The dynamics are described by (Eqs. 4 to 6 in Table 2):

\frac{d}{dt}R_{GnRH} = r_4\,\frac{T_{GnRH}}{T_{GnRH} + GnRH} + r_3\,(R_{GnRH}\text{-}d) - r_1\, R_{GnRH}

\frac{d}{dt}(GnRH\text{-}R_{GnRH}) = r_1\, R_{GnRH} - r_2\,(GnRH\text{-}R_{GnRH})

\frac{d}{dt}(R_{GnRH}\text{-}d) = r_2\,(GnRH\text{-}R_{GnRH}) - r_3\,(R_{GnRH}\text{-}d), \qquad r_1, r_2, r_3, r_4, T_{GnRH} \in \mathbb{R}_+
where the first term in the equation for R_GnRH is bounded by r_4 and decreases with increasing GnRH concentration.

Synthesis of FSH and LH


The changes in the stored gonadotropin depend on two effects: the synthesis and the release. The synthesis as well as the release of the gonadotropins is regulated by progesterone, estradiol, and inhibin
serving as interface from the ovaries to the pituitary. Inhibitory and stimulatory and a combination of
these effects can be observed that are modeled by the Hill functions presented above. It is assumed that there is
a basal synthesis (Heinze et al., 1998) at the rate b_LH for LH and b_FSH for FSH. In both cases, the release
is proportional to the receptor complex concentration and its stored mass. The following assumptions
for the synthesis and release of LH and FSH can be made (Reinecke & Deuflhard, 2007):
Progesterone inhibits the LH synthesis and stimulates the LH release (Harris, 2001).
Estradiol stimulates the LH synthesis and inhibits the LH release at low concentration and stimulates
it at high concentration (Anderson, 1996; Rasgon et al., 2003; Swerdloff et al., 1972; Yen, 1991).
Progesterone stimulates the FSH release (Harris, 2001).
Estradiol inhibits the FSH release at low concentration and stimulates it at high concentration
(Anderson, 1996; Rasgon et al., 2003; Swerdloff et al., 1972; Yen, 1991).
Inhibin inhibits the FSH synthesis and release (Harris, 2001; Keck et al., 2002; Magoffin & Jakimiuk, 1997).

From this it follows for the stored LH in the pituitary (Eq. 7 in Table 2):

\frac{d}{dt}P_{LH} = syn_{LH}\big(P4(t-\tau_{P4,P}),\, E2(t-\tau_{E2,E})\big) - rel_{LH}\big(P4(t-\tau_{P4,P}),\, E2(t-\tau_{E2,E}),\, (GnRH\text{-}R_{GnRH}),\, P_{LH}\big)

where:

syn_{LH} = b_{LH} + h_1^-(P4)\, h_1^+(E2)\, syn_{LH,max}, \qquad syn_{LH,max} \in \mathbb{R}_+

rel_{LH} = h_3^+(P4)\,\big(h_2^-(E2) + h_2^+(E2)\big)\, rel_{LH,max}\,(GnRH\text{-}R_{GnRH})\, P_{LH}, \qquad rel_{LH,max} \in \mathbb{R}_+

and for the FSH pool (Eq. 9 in Table 2):

\frac{d}{dt}P_{FSH} = syn_{FSH}\big(Ih(t-\tau_{Ih})\big) - rel_{FSH}\big(P4(t-\tau_{P4,P}),\, E2(t-\tau_{E2,E}),\, Ih(t-\tau_{Ih}),\, (GnRH\text{-}R_{GnRH}),\, P_{FSH}\big)

where:

syn_{FSH} = b_{FSH} + h_3^-(Ih)\, syn_{FSH,max}, \qquad syn_{FSH,max} \in \mathbb{R}_+

rel_{FSH} = h_5^+(P4)\,\big(h_4^-(E2) + h_4^+(E2)\big)\, h_5^-(Ih)\, rel_{FSH,max}\,(GnRH\text{-}R_{GnRH})\, P_{FSH}, \qquad rel_{FSH,max} \in \mathbb{R}_+
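To show how the synthesis and release terms combine, here is a sketch of the Eq. 7-style balance for the stored LH pool. The Hill factors are passed in precomputed, and every name and number is an invented placeholder rather than the original parameterization.

```python
def lh_pool_rhs(p_lh, gnrh_complex, b_lh, syn_max, rel_max, hill_terms):
    """Eq. 7-style balance for the stored LH pool P_LH: basal plus regulated
    synthesis minus regulated, (GnRH-R_GnRH)-driven release.  'hill_terms'
    bundles precomputed Hill factors h1-(P4), h1+(E2), h3+(P4), h2-(E2)+h2+(E2)."""
    h1m_p4, h1p_e2, h3p_p4, h2_e2 = hill_terms
    syn_lh = b_lh + h1m_p4 * h1p_e2 * syn_max
    rel_lh = h3p_p4 * h2_e2 * rel_max * gnrh_complex * p_lh
    return syn_lh - rel_lh

# Example: a strong GnRH-receptor signal with permissive Hill factors drains the pool.
print(lh_pool_rhs(p_lh=5.0, gnrh_complex=0.8, b_lh=0.1, syn_max=2.0,
                  rel_max=1.5, hill_terms=(0.5, 0.9, 0.7, 1.1)))
```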

FSH and LH Concentrations in the Blood


The gonadotropins FSH and LH leave the pituitary and enter the blood circulation. To obtain their
concentration, their mass is divided by the volume of distribution VB. In the blood, the gonadotropins
are metabolized proportionally to their concentration. The LH and the FSH concentration is given by
(Eq. 8 in Table 2):

\frac{d}{dt}LH = \frac{1}{V_B}\, rel_{LH}\big(E2(t-\tau_{E2,E}-\tau_{LH,1}),\, P4(t-\tau_{P4,P}-\tau_{LH,1}),\, (GnRH\text{-}R_{GnRH})(t-\tau_{LH,1}),\, P_{LH}(t-\tau_{LH,1})\big) - clear_{LH}(LH)

where:

clear_{LH} = \alpha_{LH}\, LH, \qquad \alpha_{LH} \in \mathbb{R}_+

and (Eq. 10 in Table 2):

\frac{d}{dt}FSH = \frac{1}{V_B}\, rel_{FSH}\big(E2(t-\tau_{E2,E}-\tau_{FSH,1}),\, P4(t-\tau_{P4,P}-\tau_{FSH,1}),\, Ih(t-\tau_{Ih}-\tau_{FSH,1}),\, (GnRH\text{-}R_{GnRH})(t-\tau_{FSH,1}),\, P_{FSH}(t-\tau_{FSH,1})\big) - clear_{FSH}(FSH)

where:

clear_{FSH} = \alpha_{FSH}\, FSH, \qquad \alpha_{FSH} \in \mathbb{R}_+

Ovaries
A couple of immature follicles start the follicular development in every menstrual cycle normally resulting in ovulation at mid-cycle which enables reproduction. The processes mainly take place in two
cell types of the follicles, the granulosa and the theca cells. These cells express different receptors and

enzymes (Conley & Bird, 1997; Greenspan & Strewler, 1997; Strauss III, 1999; Strauss III & Williams,
1999, Chabbert-Buffet & Bouchard, 2002). In these follicles, progesterone and estradiol as well as other
steroids are synthesized.

Follicular Development
The growth and maturation of the follicles in the ovaries are not modeled via receptor binding but directly.
We assume that the follicular development, divided into nine stages, is influenced only by FSH and LH
and by the follicular masses themselves. In the follicular phase (secondary to Graafian follicle), the follicles mature and
grow resulting normally in one dominant follicle. During ovulation (ovulatory and luteinizing follicle),
the ovum is released and the follicle becomes a corpus luteum and finally a corpus albicans. Growth is
assumed for the first three phases Fs, Ft, and Fg (Reinecke & Deuflhard, 2007):

Secondary follicle (Eq. 11 in Table 2):

\frac{d}{dt}F_s(t) = \big(d_1\, h^+(FSH;\, T^{foll}_{FSH}, n^{foll}_{FSH}) + d_2\, FSH^{\nu_1} - d_3\, FSH^{\nu_2}\big)\, F_s(t)

Tertiary follicle (Eq. 12 in Table 2):

\frac{d}{dt}F_t(t) = d_3\, FSH^{\nu_2}\, F_s(t) + \big(d_4\, FSH^{\nu_3}\, LH^{\nu_4} - d_5\, FSH^{\nu_5}\, LH^{\nu_6}\big)\, F_t(t)

Graafian follicle (Eq. 13 in Table 2):

\frac{d}{dt}F_g(t) = d_5\, FSH^{\nu_5}\, LH^{\nu_6}\, F_t(t) + \big(d_6\, LH^{\nu_7} - d_7\, LH^{\nu_8}\big)\, F_g(t)

Ovulatory and luteinizing follicle (Eqs. 14 and 15 in Table 2):

\frac{d}{dt}M_o(t) = d_7\, LH^{\nu_8}\, F_g(t) - d_8\, M_o(t)

\frac{d}{dt}M_l(t) = d_8\, M_o(t) - d_9\, M_l(t)

Corpus luteum (Eqs. 16 to 18 in Table 2):

\frac{d}{dt}L_e(t) = d_9\, M_l(t) - d_{10}\, LH^{\nu_9}\, L_e(t)

\frac{d}{dt}L_m(t) = d_{10}\, LH^{\nu_9}\, L_e(t) - d_{11}\, LH^{\nu_{10}}\, L_m(t)

\frac{d}{dt}L_l(t) = d_{11}\, LH^{\nu_{10}}\, L_m(t) - d_{12}\, L_l(t)

Corpus albicans (Eq. 19 in Table 2):

\frac{d}{dt}L_a(t) = d_{12}\, L_l(t) - d_{13}\, L_a(t)

The exponents ν_i in these equations can reduce or increase the influence of the gonadotropins LH and FSH on the different follicular stages.

Interface from Pituitary to Ovaries


The receptor binding and recycling is described in the section entitled Background. In this case, there
are two inactivatable states: phosphorylated, denoted by (FSH-R_FSH-p), and internalized, denoted by R_i^FSH (Clément et al., 2001). The receptor recycling for the FSH receptors in the ovaries can be modeled by (Eqs.
20 to 23 in Table 2):

\frac{d}{dt}R_{FSH} = k_-^{FSH}\,(FSH\text{-}R_{FSH}) + k_r^{FSH}\, R_i^{FSH} - k_+^{FSH}\, FSH^{\nu_{FSH,2}}\, R_{FSH}

\frac{d}{dt}(FSH\text{-}R_{FSH}) = k_+^{FSH}\, FSH^{\nu_{FSH,2}}\, R_{FSH} - (\rho_{FSH} + k_-^{FSH})\,(FSH\text{-}R_{FSH})

\frac{d}{dt}(FSH\text{-}R_{FSH}\text{-}p) = \rho_{FSH}\,(FSH\text{-}R_{FSH}) - k_i^{FSH}\,(FSH\text{-}R_{FSH}\text{-}p)

\frac{d}{dt}R_i^{FSH} = k_i^{FSH}\,(FSH\text{-}R_{FSH}\text{-}p) - k_r^{FSH}\, R_i^{FSH}, \qquad k_-^{FSH}, k_+^{FSH}, k_r^{FSH}, k_i^{FSH}, \rho_{FSH}, \nu_{FSH,2} \in \mathbb{R}_+

and analogously for the LH receptors (Reinecke & Deuflhard, 2007), Eqs. 24 to 27 in Table 2:

\frac{d}{dt}R_{LH} = k_-^{LH}\,(LH\text{-}R_{LH}) + k_r^{LH}\, R_i^{LH} - k_+^{LH}\, LH^{\nu_{LH,2}}\, R_{LH}

\frac{d}{dt}(LH\text{-}R_{LH}) = k_+^{LH}\, LH^{\nu_{LH,2}}\, R_{LH} - (\rho_{LH} + k_-^{LH})\,(LH\text{-}R_{LH})

\frac{d}{dt}(LH\text{-}R_{LH}\text{-}p) = \rho_{LH}\,(LH\text{-}R_{LH}) - k_i^{LH}\,(LH\text{-}R_{LH}\text{-}p)

\frac{d}{dt}R_i^{LH} = k_i^{LH}\,(LH\text{-}R_{LH}\text{-}p) - k_r^{LH}\, R_i^{LH}, \qquad k_-^{LH}, k_+^{LH}, k_r^{LH}, k_i^{LH}, \rho_{LH}, \nu_{LH,2} \in \mathbb{R}_+
In both cases, simple reaction kinetics is used as presented in the section entitled Background, where
the reaction from the activatable receptor to the activated receptor (complex) is assumed to be reversible
as opposed to the remaining reactions that are assumed to be irreversible (Clément et al., 2001).
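The four-state recycling cycle conserves the total receptor number, which the following sketch of an Eqs. 20-23-style right-hand side makes explicit. All parameter names and values are illustrative placeholders, not the original symbols.

```python
def fsh_receptor_rhs(state, fsh, k_minus, k_plus, k_r, k_i, rho, nu):
    """Rates for a four-state receptor recycling cycle (free, bound,
    phosphorylated, internalized), patterned after Eqs. 20-23."""
    r_free, r_bound, r_phos, r_int = state
    bind = k_plus * fsh**nu * r_free
    d_free = k_minus * r_bound + k_r * r_int - bind
    d_bound = bind - (rho + k_minus) * r_bound
    d_phos = rho * r_bound - k_i * r_phos
    d_int = k_i * r_phos - k_r * r_int
    return (d_free, d_bound, d_phos, d_int)

# The four rates sum to zero, so the total receptor number is conserved.
print(sum(fsh_receptor_rhs((1.0, 0.2, 0.1, 0.05), fsh=2.0,
                           k_minus=0.1, k_plus=0.3, k_r=0.05, k_i=0.2, rho=0.4, nu=1.0)))
```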

Total Enzyme Concentrations and Enzyme Activation


Enzymes catalyze the steroidogenesis. Their concentrations depend on the follicular stage (Chabbert-Buffet & Bouchard, 2002; Strauss III, 1999). The total enzyme concentrations are assumed to be
proportional to the follicular masses Fs,...,Ll (Reinecke & Deuflhard, 2007):

3\beta\text{-}HSD_{total} = f_{lincom}(F_t, \ldots, L_l;\, e_1), \qquad e_1 = (e_{11}, \ldots, e_{17})^T \in \mathbb{R}_+^7

17\beta\text{-}HSD_{total} = f_{lincom}(F_s, \ldots, L_l;\, e_2), \qquad e_2 = (e_{21}, \ldots, e_{28})^T \in \mathbb{R}_+^8

P450scc_{total} = f_{lincom}(F_t, \ldots, L_l;\, e_3), \qquad e_3 = (e_{31}, \ldots, e_{37})^T \in \mathbb{R}_+^7

P450_{17\text{-}OH,total} = f_{lincom}(F_t, \ldots, L_l;\, e_4), \qquad e_4 = (e_{41}, \ldots, e_{47})^T \in \mathbb{R}_+^7

P450arom_{total} = f_{lincom}(F_s, \ldots, L_l;\, e_5), \qquad e_5 = (e_{51}, \ldots, e_{58})^T \in \mathbb{R}_+^8

where:

f_{lincom}(A; p) = \sum_{i=1}^{n} p_i\, A_i, \qquad p \in \mathbb{R}_+^n, \quad A \in \mathbb{R}_+^n, \quad n \in \mathbb{N}

They depend on Fs if they are expressed in the granulosa cells only.


The enzymes are activated proportionally to the total enzyme concentration and the receptor complex
concentration, and cleared proportionally to their own concentration (Eqs. 28 to 32 in Table 2):
\frac{d}{dt}\, 3\beta\text{-}HSD_a = a_1\, 3\beta\text{-}HSD_{total}\,(LH\text{-}R_{LH}) - a_2\, 3\beta\text{-}HSD_a

\frac{d}{dt}\, 17\beta\text{-}HSD_a = a_3\, 17\beta\text{-}HSD_{total}\,(FSH\text{-}R_{FSH}) - a_4\, 17\beta\text{-}HSD_a

\frac{d}{dt}\, P450scc_a = a_5\, P450scc_{total}\,(LH\text{-}R_{LH}) - a_6\, P450scc_a

\frac{d}{dt}\, P450_{17\text{-}OH,a} = a_7\, P450_{17\text{-}OH,total}\,(LH\text{-}R_{LH}) - a_8\, P450_{17\text{-}OH,a}

\frac{d}{dt}\, P450arom_a = a_9\, P450arom_{total}\,(FSH\text{-}R_{FSH}) - a_{10}\, P450arom_a, \qquad a_i \in \mathbb{R}_+, \; i = 1, \ldots, 10

Steroidogenesis
A good source for the steroidogenesis (the biosynthesis of the steroids) is the KEGG database (KEGG
PATHWAY Database, 2006). These reactions are catalyzed by enzymes which is why Michaelis-Menten kinetics presented in the section entitled Background is applied. The dynamics of the gestagens
pregnenolone and progesterone occurring in the theca and granulosa cells are modeled by (Eqs. 33 and
34 in Table 2):
\frac{d}{dt}\, preg = c_1\, P450scc_a\, chol - f_{rev}(prog, preg, 3\beta\text{-}HSD_a;\, p_1) - f_{irrev}(preg, P450_{17\text{-}OH,a};\, k_1) - c_2\, preg

\frac{d}{dt}\, prog = f_{rev}(prog, preg, 3\beta\text{-}HSD_a;\, p_1) - f_{irrev}(prog, P450_{17\text{-}OH,a};\, k_2) - c_3\, prog

where chol denotes the cholesterol concentration that is assumed to be constant. In the theca cells, the
dynamics of the gestagens 17-hydroxypregnenolone and 17-hydroxyprogesterone and of the androgens
DHEA, androstenedione, and testosterone are given by (Eqs. 35 to 39 in Table 2):
\frac{d}{dt}\, 17preg = f_{irrev}(preg, P450_{17\text{-}OH,a};\, k_1) - f_{rev}(17prog, 17preg, 3\beta\text{-}HSD_a;\, p_2) - f_{irrev}(17preg, P450_{17\text{-}OH,a};\, k_3) - c_4\, 17preg

\frac{d}{dt}\, 17prog = f_{irrev}(prog, P450_{17\text{-}OH,a};\, k_2) + f_{rev}(17prog, 17preg, 3\beta\text{-}HSD_a;\, p_2) - f_{irrev}(17prog, P450_{17\text{-}OH,a};\, k_4) - c_5\, 17prog

\frac{d}{dt}\, DHEA = f_{irrev}(17preg, P450_{17\text{-}OH,a};\, k_3) - f_{irrev}(DHEA, 3\beta\text{-}HSD_a;\, k_5) - c_6\, DHEA

\frac{d}{dt}\, andro = f_{irrev}(17prog, P450_{17\text{-}OH,a};\, k_4) + f_{irrev}(DHEA, 3\beta\text{-}HSD_a;\, k_5) - f_{rev}(test, andro, 17\beta\text{-}HSD_a;\, p_3) - f_{irrev}(andro, P450arom_a;\, k_6) - c_7\, andro

\frac{d}{dt}\, test = f_{rev}(test, andro, 17\beta\text{-}HSD_a;\, p_3) - f_{irrev}(test, P450arom_a;\, k_7) - c_8\, test
Finally, the dynamics of the estrogens estrone and estradiol occurring only in the granulosa cells
are described by (Eqs. 40 and 41 in Table 2):
\frac{d}{dt}\, estro = f_{irrev}(andro, P450arom_a;\, k_6) - f_{rev}(estra, estro, 17\beta\text{-}HSD_a;\, p_4) - c_9\, estro

\frac{d}{dt}\, estra = f_{irrev}(test, P450arom_a;\, k_7) + f_{rev}(estra, estro, 17\beta\text{-}HSD_a;\, p_4) - c_{10}\, estra
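As a small illustration of how the kinetics functions compose in these equations, the following sketch evaluates Eq. 33/34-style rates for the pregnenolone-progesterone pair. The Michaelis-Menten helpers are repeated here so the snippet runs standalone; every parameter value and name is invented for the example.

```python
def f_irrev(S, E, p):
    """Irreversible Michaelis-Menten rate (as in the earlier sketch)."""
    return p[0] * E * S / (p[1] + S)

def f_rev(P, S, E, p):
    """Reversible Michaelis-Menten rate (as in the earlier sketch)."""
    return E * (p[0] * S - p[1] * P) / (p[2] + S + p[3] * P)

def preg_prog_rhs(preg, prog, chol, p450scc_a, hsd3b_a, p450c17_a,
                  c1, c2, c3, p1, k1, k2):
    """Eq. 33/34-style rates: production from cholesterol, reversible
    conversion preg <-> prog via 3beta-HSD, irreversible loss via the
    17-hydroxylase, and linear clearance."""
    to_prog = f_rev(prog, preg, hsd3b_a, p1)          # net preg -> prog flux
    d_preg = (c1 * p450scc_a * chol - to_prog
              - f_irrev(preg, p450c17_a, k1) - c2 * preg)
    d_prog = (to_prog - f_irrev(prog, p450c17_a, k2) - c3 * prog)
    return d_preg, d_prog

print(preg_prog_rhs(preg=1.0, prog=0.5, chol=1.0, p450scc_a=0.2, hsd3b_a=0.3,
                    p450c17_a=0.1, c1=0.5, c2=0.05, c3=0.05,
                    p1=(1.0, 0.2, 0.5, 0.1), k1=(1.0, 0.5), k2=(1.0, 0.5)))
```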

Inhibin and Steroid Concentrations in the Blood


The steroids are synthesized in the cells of the follicles, the theca and granulosa cells. At first, they enter
the follicular fluid in the ovary, but then a part of them leaves the ovary and reaches the blood system
where their mass is distributed in the blood volume VB. The steroids are cleared proportionally to its
concentration in the blood (Eqs. 42 to 49 in Table 2):
s
d
steroid B (t ) = steroid steroid (t
dt
VB

steroid

steroid

steroid B (t ), ssteroid ,

steroid

IR +

where steroid_O denotes the steroid concentration in the ovaries, steroid_B the steroid concentration in the blood, and τ_steroid the delay, where steroid stands for progesterone, 17-hydroxypregnenolone, 17-hydroxyprogesterone, DHEA, androstenedione, testosterone, estrone, and estradiol.
The inhibin concentration is modeled as linear combination of follicular masses analogously to
Harris (2001), (Eq. 50 in Table 2):

Ih(t) = ih_1 + ih_2\, F_t(t) + ih_3\, L_m(t) + ih_4\, L_l(t), \qquad ih_i \in \mathbb{R}_+, \; i = 1, \ldots, 4

Final Model
The final model consists of 49 delay differential equations describing the dynamics of the GnRH pulse
generator, hormones, receptors, active enzymes, and follicular masses. Additionally, 6 auxiliary equations model the dynamics of the total enzyme and inhibin concentrations. The model is visualized in
Figure 3 and the model equations are listed in Table 2.

Decomposition of Complex Mathematical Models in Physiology


When modeling complex physiological processes and control systems in the human body such as the
human menstrual cycle, a large number of parameters can arise. Often their values are not known, which
is why parameter estimation is necessary, albeit expensive and possibly not successful. That is why it
could be useful to consider possibilities to simplify the problem of parameter estimation.
Suppose that the mathematical model consists of a system of (delay) differential equations:

\frac{d}{dt}y = f(y, y_{\tau_1}, \ldots, y_{\tau_m};\, p) \qquad (1)

where y: \mathbb{R} \to \mathbb{R}^n, y =: (y_1, \ldots, y_n)^T, f: \mathbb{R}^{n(m+1)} \to \mathbb{R}^n, f =: (f_1, \ldots, f_n)^T, n \in \mathbb{N}, m \in \mathbb{N}_0, and \tau_i \in \mathbb{R}_+,
i = 1, \ldots, m, denote the constant delays if m \geq 1, and p \in \mathbb{R}_+^{n_p}, n_p \in \mathbb{N}, the vector of parameters. We call
y1,...,yn the elements of the mathematical model.
In the following a method is presented that uses the availability of experimental data for the decomposition of the complex mathematical model. This method is applied implicitly in Harris (2001) where
the mathematical model, consisting of 13 differential equations, is divided into two disjoint model parts.
Parameter estimation is performed separately for the two parts using approximations of experimental
data as input and afterwards they are recomposed to a dynamic model. In order to simplify parameter
estimation or even to make it possible, decomposition of the complex model could be helpful. Instead
of a high-dimensional problem, we obtain several smaller models that can be treated more easily and
successfully.
The procedure for the parameter estimation can be summarized in the following way:
1. Model decomposition into disjoint model parts
2. Parameter estimation for a selection of parameters by using experimental data as input
3. Recomposition of the model parts to the original model with the estimated parameter values

A graph theoretical approach can help identifying these model parts. Graphs are frequently used to
model a binary relationship between objects (Godsil & Royle, 2001). We base our definition of a graph
on the direct dependency of the right hand side of the system on its elements. If the model parts are
determined, the corresponding differential equations belonging to the elements of these model parts
can be solved separately if the required input is available.
At first the definition and method of the model decomposition is presented. In the subsequent sections
the different steps to obtain the model decomposition are presented. This procedure is shown for the
model of the human menstrual cycle presented in the preceding section and the results are visualized.

Representation of the Mathematical Model by a Graph


First some terms that are used to describe the graph theoretical approach are introduced (Diestel, 2005;
Beineke & Wilson, 1997; Godsil & Royle, 2001):
Figure 3. The dependencies within the complex model. An arrow is given, when the dynamics of the
element where the arrow ends is dependent on the element where the arrow starts.
HY P OT HA L A MUS
G NR H
P UL S E
G E NE R ATOR

G nR H mass

G nR H f r eq

P IT UITA R Y P OR TA L S Y S T E M
G nR H

P IT UITA R Y
R GnRH
R GnRH -d

rec eptors
(G nR H-R GnRH )

P LH
PFS H

B L OOD
E2

FS H
P4

LH
Ih

17-pregB

DHE A B

tes tB

17-progB

es troB

androB

OVA R IE S
Fs
R LH

R LH
i

R Fi S H

R FS H

rec eptors

Ft
(LH-R LH )

(F S H-R F S H )

(LH-R LH -p)

(F S H-R F S H -p)

follic ular development

Fg

Mo

P 450s cc tota l

P 450 17 -OH,total

3 -HS D tota l

17 -HS D tota l

P 450aromtota l

enzymes

Ml
P 450s cc a

P 450 17 -O H,a

3 -HS D a

Le

Lm

17 -HS D a

P 450aroma

T HE C A
G R A NUL OS A

cholO

Ll

pregO

17-pregO

DHE A O

La

progO

17-progO

androO

tes tO

es troO

es tra O

T WO
CELL
T HE OR Y

s teroids

Definition 1. A graph is a pair G = (V, E) of disjoint sets with E ⊆ [V]^2, where [V]^2 denotes the set of all subsets of V that consist of two elements. The elements of V are called the vertices of the graph G and the elements of E its edges, where an edge is an unordered pair of distinct vertices. An edge e = \{v_i, v_j\} ∈ E (short e = v_iv_j or e = v_jv_i) connects the two vertices v_i, v_j ∈ V.
A directed graph is a graph G = (V, E) together with the two maps init: E → V and ter: E → V. To every edge e, an initial vertex init(e) and a terminal vertex ter(e) are assigned. A directed edge is also called an arc. If init(e) = ter(e), then we call e a loop.
The system of differential equations is represented here by the directed graph G = (V, E). The vertices are then given by V := \{v_1, \ldots, v_n\}, the set of elements, where v_i := arg(y_i), i = 1, \ldots, n, corresponds to the element y_i of the system, i = 1, \ldots, n, respectively. An arc from one vertex to a second vertex is given if the right-hand side of the element represented by the second vertex is directly dependent on the element represented by the first vertex:

E = \Big\{ v_i v_j \in [V]^2 \;\Big|\; \frac{\partial f_j}{\partial y_i} + \sum_{k=1}^{m} \frac{\partial f_j}{\partial y_{\tau_k,i}} \not\equiv 0 \Big\}.

Define the set of parameters by P := \{p_1, \ldots, p_{n_p}\}.


The definition of the adjacency matrix is needed in the following (Diestel, 2005; Beineke & Wilson,
1997; Godsil & Royle, 2001).
Definition 2. Two vertices vi , v j V of the graph G are called adjacent if vi v j E . The adjacency
matrix A = (aij ) nn of G is defined by:

1 if vi v j E
aij :=
0 else.
In our definition of G, loops are possible, but they are not important for this approach, which is why they are neglected. The following definition is also needed (Diestel, 2005):
Definition 3. If V' ⊆ V and E' ⊆ E, then G' = (V', E') is a subgraph of G, written G' ⊆ G.
That means that here the graph G is reduced to the subgraph G' = (V, E') ⊆ G, where E' := E \setminus \{e ∈ E \mid init(e) = ter(e)\}. The corresponding adjacency matrix is then given by A' := (a'_{ij})_{n \times n}, where a'_{ij} := a_{ij} for all i, j = 1, \ldots, n, i ≠ j, and a'_{ii} := 0 for all i = 1, \ldots, n, i.e. the diagonal of A is set to zero.
In order to perform parameter estimation, experimental data are necessary for fitting the parameters. As in the case of the human menstrual cycle (Reinecke & Deuflhard, 2007), experimental data
over the cycle are not given for all elements of the system. The following definition
is useful and simplifies the formulations. From now on the elements and their corresponding vertices
are used equivalently.
Definition 4. We say an element v ∈ V of the system in Eq. 1 is called an exp-element if usable
experimental data are given for this element over the considered time span. Otherwise it is called a
non-exp-element.
Suppose that we have N exp-elements v_1^{exp}, \ldots, v_N^{exp} ∈ V, where v_j^{exp} = v_{i_j}, j = 1, \ldots, N, with 1 ≤ i_1 < \cdots < i_N ≤ n and N ∈ ℕ, N ≤ n. Then V^{exp} := \{v_1^{exp}, \ldots, v_N^{exp}\} is called the set of exp-elements.
Based on these exp-elements we decompose the set of elements. It is essential that experimental data for at least two elements from \{y_1, \ldots, y_n\} are available, i.e. N ≥ 2, in order to be able to obtain
at least two model parts.
Summarized, we require the following properties from the model parts that emerge through the model decomposition:
The model parts, denoted by V_i^{part}, i = 1, \ldots, Ñ, and the parts of the corresponding parameters, denoted by P_i^{part}, i = 1, \ldots, Ñ, where Ñ ≤ N, should be disjoint in order to avoid double parameter estimation.
Every model part V_i^{part}, i = 1, \ldots, Ñ, should contain at least one exp-element in order to be able to perform parameter estimation.
Moreover, there should be at least one model part V_i^{part}, i ∈ \{1, \ldots, Ñ\}, that needs, if at all, only input from exp-elements, i.e. V_i^{inp} = V_i^{inp,exp}, no approximations of elements from other model parts that are non-exp-elements, i.e. V_i^{inp,app} = ∅, and that furthermore does not need parameter input, i.e. P_i^{inp} = ∅, in order to have a model part to start the parameter estimation.
It is possible that not all elements can be assigned to a model part. These elements form the rest set V^{rest} and the corresponding parameters the parameter rest set P^{rest}.
Definition 5. We define as the model decomposition the partition of the set of elements V into the model parts V_1^{part}, \ldots, V_{Ñ}^{part} ⊆ V and the rest set V^{rest} ⊆ V, and the partition of the parameters into the parameter parts P_1^{part}, \ldots, P_{Ñ}^{part} ⊆ P and the parameter rest set P^{rest} ⊆ P, Ñ ≤ N, Ñ ≥ 2, if:

(i) \; V = \bigcup_{i=1}^{Ñ} V_i^{part} \cup V^{rest}

(ii) \; V_i^{part} \cap V_j^{part} = \emptyset, \quad i, j = 1, \ldots, Ñ, \; i \neq j

(iii) \; V_i^{part} \cap V^{rest} = \emptyset, \quad i = 1, \ldots, Ñ

(iv) \; P_i^{part} \cap P_j^{part} = \emptyset, \quad i, j = 1, \ldots, Ñ, \; i \neq j

(v) \; P_i^{part} \cap P^{rest} = \emptyset, \quad i = 1, \ldots, Ñ

(vi) \; V_i^{part} \cap V^{exp} \neq \emptyset, \quad i = 1, \ldots, Ñ

(vii) \; V^{rest} \cap V^{exp} = \emptyset

(viii) \; \exists\, i \in \{1, \ldots, Ñ\}: \; V_i^{inp,app} = \emptyset \;\wedge\; P_i^{inp} = \emptyset

The model presented in the preceding section consists of 49 elements. Experimental data are available for the elements y1 (GnRH pulse frequency), y8 (LH), y10 (FSH), y42 (progesterone), y43 (estradiol),
y44 (17-hydroxypregnenolone), y45 (17-hydroxyprogesterone), y46 (DHEA), y47 (androstenedione), y48
(testosterone), and y49 (estrone). Since the dynamics of inhibin, for which experimental data are also available, is not described by a differential equation (it is calculated as a linear combination of the follicular
masses y12, y17, and y18), it is represented by the element y50. The additional arcs e for y50 are given if:

\frac{\partial f_i}{\partial y_{50}} \not\equiv 0 \;\Rightarrow\; init(e) = v_{50}, \; ter(e) = v_i

\frac{\partial y_{50}}{\partial y_i} \not\equiv 0 \;\Rightarrow\; init(e) = v_i, \; ter(e) = v_{50}

Hence there is a total of n = 50 elements. Furthermore, there are n_p = 208 parameters and N = 12 exp-elements:

V^{exp} = \{v_1, v_8, v_{10}, v_{42}, v_{43}, v_{44}, v_{45}, v_{46}, v_{47}, v_{48}, v_{49}, v_{50}\}
The graph for the model in Table 2 is visualized in Figure 4.

Determination of Predecessors and Grouping


Interpolations of the experimental data for these exp-elements can replace the exp-elements in the simulation of the mathematical model. That is why the connections between these exp-elements and the remaining elements can be eliminated. Then all predecessors of the vertices representing exp-elements, the sets of predecessors $V_j^{pre}$, $j = 1, \dots, N$, are determined in order to obtain a first, not necessarily unique, assignment of the elements to the exp-elements.
The elimination of arcs leads to a further reduction of the graph $G$. Define $\tilde{G} := (V, \tilde{E}) \subseteq G$, where:

$$\tilde{E} := E \setminus \{e \in E \mid init(e) \in V^{exp}\}$$

The corresponding values of the adjacency matrix are set to zero in order to obtain the adjacency matrix of the subgraph $\tilde{G}$: $\tilde{A} := (\tilde{a}_{ij})_{n \times n}$, where $\tilde{a}_{ij} := a_{ij}$ for all $i \notin V^{exp}$ and $\tilde{a}_{ij} := 0$ for all $i \in V^{exp}$, $j = 1, \dots, n$.
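To illustrate this reduction step, the following small sketch (written in Python/NumPy; the toy matrix and the chosen exp-element indices are arbitrary and not taken from the menstrual cycle model) zeroes the rows of the exp-elements in the adjacency matrix:

import numpy as np

# toy adjacency matrix of a directed graph with n = 5 elements;
# A[i, j] = 1 means there is an arc from vertex i+1 to vertex j+1
A = np.array([[0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0]])

V_exp = [3, 5]                            # 1-based indices of the exp-elements (arbitrary here)

A_tilde = A.copy()
A_tilde[[i - 1 for i in V_exp], :] = 0    # remove all arcs starting in exp-elements

print(A_tilde)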

Sets of Predecessors
Another definition for the relation of two vertices is needed in order to assign the elements to the exp-elements (Diestel, 2005; Beineke & Wilson, 1997; Godsil & Royle, 2001).
Definition 6. A walk (of length $k$) is a non-empty graph $P = (V, E)$ of the form:

$$V = \{v_0, v_1, \dots, v_k\}, \qquad E = \{v_0 v_1, v_1 v_2, \dots, v_{k-1} v_k\}$$

It is called a $v_i v_j$-walk if the first vertex of the walk is $v_i$ and the last is $v_j$. A walk is called a path if its vertices are pairwise distinct.
Definition 7. The vertex $v_i \in V$ is a predecessor (of order $k \in \mathbb{N}$) of $v_j \in V$ if there is a $v_i v_j$-walk (of length $k$) in $\tilde{G}$.

Figure 4. Visualization of the graph for the model given in Table 2 (without loops)

Then all predecessors of an exp-element form a set:

Definition 8. The sets of predecessors for $v_j^{exp} \in V^{exp}$, $j = 1, \dots, N$, are given by:

$$V_j^{pre} := \{v \in V \mid (v = v_j^{exp}) \vee (\exists\ v v_j^{exp}\text{-path in } \tilde{G})\}$$

The set of parameters belonging to the element $v_i$, denoted by $P_i$, $i = 1, \dots, n$, is defined by:

$$P_i := \left\{ p_j \in P \ \middle|\ \frac{\partial f_i}{\partial p_j} \not\equiv 0 \right\}$$

Definition 9. The set of parameters belonging to the set of predecessors $V_j^{pre}$, $j = 1, \dots, N$, is given by:

$$P_j^{pre} := \bigcup_{l=1}^{n_j^{pre}} P_{V_j^{pre}(l)}, \qquad n_j^{pre} := |V_j^{pre}|$$

where $V_j^{pre}$ is understood as a tuple and $V_j^{pre}(l)$ denotes its $l$-th component if the elements of $V_j^{pre}$ are sorted by ascending index, for all $l = 1, \dots, n_j^{pre}$.
The elements that cannot be assigned to a set of predecessors form a separate set.

Definition 10. The elements that belong to no set of predecessors form the rest set:

$$V^{rest} := V \setminus \bigcup_{j=1}^{N} V_j^{pre}$$

and the corresponding parameters the parameter rest set:

$$P^{rest} := P \setminus \bigcup_{j=1}^{N} P_j^{pre}$$

The vertex $v_l$, $l \in \{1, \dots, n\}$, is a predecessor of order 1 of the exp-element $v_j^{exp} = v_{i_j}$, $j \in \{1, \dots, N\}$, if $a_{l,j}^{pre,1} := \tilde{a}_{l,i_j} = 1$, i.e. if there is an arc from $v_l$ to $v_{i_j}$. Predecessors of order 2 of the exp-elements are given by:

$$a_{l,j}^{pre,2} := \max\left( a_{l,j}^{pre,1},\ \tilde{a}_{l,1}\, a_{1,j}^{pre,1},\ \dots,\ \tilde{a}_{l,n}\, a_{n,j}^{pre,1} \right), \qquad l = 1, \dots, n, \quad j = 1, \dots, N,$$

since if $a_{l,j}^{pre,2} = 1$, there is an arc from $v_l$ to $v_{i_j}$ or a path $\{v_l v_{l'}, v_{l'} v_{i_j}\} \subseteq \tilde{E}$ of length 2 for an $l' \in \{1, \dots, n\}$. Generally, the vertex $v_l$ is a predecessor of order $k$, $k \ge 2$, of $v_{i_j} \in V^{exp}$ if $a_{l,j}^{pre,k} = 1$, where:

$$a_{l,j}^{pre,k} := \max\left( a_{l,j}^{pre,k-1},\ \tilde{a}_{l,1}\, a_{1,j}^{pre,k-1},\ \dots,\ \tilde{a}_{l,n}\, a_{n,j}^{pre,k-1} \right), \qquad l = 1, \dots, n, \quad j = 1, \dots, N.$$

We define:

$$A^{pre,k} := \left( a_{l,j}^{pre,k} \right)_{l=1,\dots,n,\ j=1,\dots,N} \in \mathbb{R}^{n \times N}, \qquad k = 1, \dots, n-1,$$
$$A^{pre} := A^{pre,n-1} =: \left( a_{l,j}^{pre} \right)_{l=1,\dots,n,\ j=1,\dots,N}.$$

It follows that the vertex $v_l$, $l \in \{1, \dots, n\}$, is a predecessor of $v_{i_j} \in V^{exp}$, $j \in \{1, \dots, N\}$, if $a_{l,j}^{pre} = 1$. Thus the sets of predecessors can be determined by:

$$V_j^{pre} = \{v_j^{exp}\} \cup \left\{ v_l \in V \ \middle|\ a_{l,j}^{pre} = 1 \right\}, \qquad j = 1, \dots, N.$$

The algorithmic calculation of $A^{pre}$ and of the sets of predecessors $V_j^{pre}$, $j = 1, \dots, N$, is shown in Algorithm 1. By definition, each set of predecessors contains one and only one exp-element, whereas the rest set contains no exp-elements.
By executing Algorithm 1 for the model in Table 2, we obtain the sets of predecessors shown in Box 1. Only the vertex $v_{19}$ is not a predecessor of any element of $V^{exp}$; thus the rest set consists of this single element:

$$V^{rest} = \{19\}$$

Groupings
Under certain conditions it can be useful to merge sets of predecessors, which is presented in this section. Consider the case that two sets of predecessors match except for their exp-elements:

Algorithm 1. Algorithm for the determination of $A^{pre}$ and of the sets of predecessors $V_j^{pre}$, $j = 1, \dots, N$.

for j = 1,...,N do
    for l = 1,...,n do
        a^pre_{l,j} = ã_{l,i_j};
    end
    for k = 2,...,n-1 do
        for l1 = 1,...,n do
            for l2 = 1,...,n do
                a^pre_{l1,j} = max(a^pre_{l1,j}, ã_{l1,l2} * a^pre_{l2,j});
            end
        end
    end
    V_j^pre = {v_j^exp};
    for l = 1,...,n do
        if a^pre_{l,j} = 1 then
            V_j^pre = V_j^pre ∪ {v_l};
        end
    end
end
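For readers who prefer executable code, the following Python/NumPy sketch mirrors Algorithm 1; the function name and interface are ours and not part of the original formulation. It computes A^pre by repeated Boolean matrix products and collects the sets of predecessors:

import numpy as np

def predecessor_sets(A_tilde, exp_indices):
    """A_tilde: n x n reduced adjacency matrix (rows of exp-elements zeroed).
    exp_indices: 1-based indices i_1 < ... < i_N of the exp-elements.
    Returns A_pre (n x N) and the predecessor sets V_pre[j] as 1-based index sets."""
    n = A_tilde.shape[0]
    N = len(exp_indices)
    # order-1 predecessors: the columns i_j of the reduced adjacency matrix
    A_pre = A_tilde[:, [i - 1 for i in exp_indices]].astype(int)
    for _ in range(2, n):                       # orders 2, ..., n-1
        A_pre = np.maximum(A_pre, (A_tilde @ A_pre > 0).astype(int))
    V_pre = []
    for j in range(N):
        members = {l + 1 for l in range(n) if A_pre[l, j] == 1}
        members.add(exp_indices[j])             # each set contains its own exp-element
        V_pre.append(members)
    return A_pre, V_pre

Called with the 50 x 50 adjacency matrix of the reduced graph and exp_indices = (1, 8, 10, 42, 43, 44, 45, 46, 47, 48, 49, 50), this sketch should reproduce the sets listed in Box 1.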

Box 1.

j    V_j^pre
1    1
2    2, 3, 4, 5, 6, 7, 8
3    2, 3, 4, 5, 6, 9, 10
4    11, 12, 13, 14, 15, 16, 17, 18, 24, 25, 26, 27, 28, 30, 31, 33, 34, 42
5    11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 43
6    11, 12, 13, 14, 15, 16, 17, 18, 24, 25, 26, 27, 28, 30, 31, 33, 34, 35, 36, 44
7    11, 12, 13, 14, 15, 16, 17, 18, 24, 25, 26, 27, 28, 30, 31, 33, 34, 35, 36, 45
8    11, 12, 13, 14, 15, 16, 17, 18, 24, 25, 26, 27, 28, 30, 31, 33, 34, 35, 36, 37, 46
9    11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 47
10   11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 48
11   11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 49
12   11, 12, 13, 14, 15, 16, 17, 18, 50

$$V_j^{pre} \setminus V^{exp} = V_{j'}^{pre} \setminus V^{exp}, \qquad j, j' \in \{1, \dots, N\},\ j \ne j'$$

By the concept of grouping that we present in the following, we avoid that the emerging disjoint model parts consist of $V_{j_1}^{pre}$ and $\{v_{j_2}^{exp}\}$, i.e. that one of these model parts consists only of the exp-element.
Definition 11. The sets of predecessors $V_{j_1}^{pre}, \dots, V_{j_l}^{pre}$, $l \in \{1, \dots, N\}$, for $j_1, \dots, j_l \in \{1, \dots, N\}$ form a grouping

$$V_j^{group} := \bigcup_{k=1}^{l} V_{j_k}^{pre}$$

if they match except for the exp-elements:

$$V_{j_k}^{pre} \setminus V^{exp} = V_{j_{k'}}^{pre} \setminus V^{exp}, \qquad k, k' \in \{1, \dots, l\},\ k \ne k'$$

If $l = 1$ then $V_j^{group} = V_{j_1}^{pre}$.

There are $N' \le N$ groupings, and the assignment of the sets of predecessors to groupings is unique. Each grouping contains at least one exp-element.
The calculation of the groupings $V_j^{group}$, $j = 1, \dots, N'$, is presented in Algorithm 2.
By executing Algorithm 2 for the model in Table 2, we obtain the groupings shown in Box 2. The number of sets is reduced from $N = 12$ to $N' = 9$. It is $V_j^{group} = V_j^{pre}$ for $j = 1, \dots, 4$, $V_5^{group} = V_5^{pre} \cup V_{11}^{pre}$,

Algorithm 2. Algorithm for the determination of the groupings

for j = 1,...,N do
    V_j^length = |V_j^pre|;
    V_j^temp = V_j^pre;
end
N' = 0;
for j = 1,...,N do
    if |V_j^temp| > 0 then
        N' = N' + 1;
        V_{N'}^group = V_j^pre;
        for l = j+1,...,N do
            if |V_l^temp| > 0 then
                if V_j^pre \ V_j^exp = V_l^pre \ V_l^exp then
                    V_{N'}^group = V_{N'}^group ∪ V_l^exp;
                    V_l^temp = ∅;
                end
            end
        end
    end
end
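A compact Python counterpart of Algorithm 2 (again with names chosen by us) merges predecessor sets that coincide up to their exp-elements:

def groupings(V_pre, exp_indices):
    """Merge predecessor sets that match except for the exp-elements (Algorithm 2).
    V_pre: list of predecessor sets (1-based element indices), one per exp-element.
    exp_indices: the exp-element index i_j belonging to V_pre[j]."""
    exp_set = set(exp_indices)
    groups, used = [], [False] * len(V_pre)
    for j, Vj in enumerate(V_pre):
        if used[j]:
            continue
        group = set(Vj)
        for l in range(j + 1, len(V_pre)):
            if not used[l] and Vj - exp_set == V_pre[l] - exp_set:
                group |= {exp_indices[l]}       # only the exp-element is added
                used[l] = True
        groups.append(group)
    return groups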

$V_6^{group} = V_6^{pre} \cup V_7^{pre}$, $V_7^{group} = V_8^{pre}$, $V_8^{group} = V_9^{pre} \cup V_{10}^{pre}$, $V_9^{group} = V_{12}^{pre}$.

Determination of Disjoint Model Parts

Our objective is a disjoint partition of the model. First of all, it is necessary to define the criteria for a reasonable partition of the model. For example, it could be desired that the maximal dimension of the model parts is not too large or that the maximal number of parameters to be estimated is rather small. We define as the first criterion to determine the grouping with the minimal number of parameters. If the minimum is not unique, then the second criterion is applied: among the groupings with the minimal number of parameters, the grouping with the minimal number of elements is chosen. If the minimum is still not unique, the grouping with the smallest index is chosen, since in this case the choices are considered to be equivalent.
In order to apply the first criterion it is necessary to define the set of parameters belonging to one grouping:

$$P_j^{group} := \bigcup_{l=1}^{n_j^{group}} P_{V_j^{group}(l)}, \qquad n_j^{group} := |V_j^{group}|$$

where $V_j^{group}(l)$ is defined as the $l$-th component if the elements of $V_j^{group}$ are sorted by ascending index.
The idea consists in removing the elements of the first chosen grouping from the other groupings. Then the next grouping is chosen according to the criteria mentioned above and, again, the elements of this grouping are removed from the remaining groupings. This procedure is continued until there is only one grouping left.
That means that first:

Box 2.

j    V_j^group
1    1
2    2, 3, 4, 5, 6, 7, 8
3    2, 3, 4, 5, 6, 9, 10
4    11, 12, 13, 14, 15, 16, 17, 18, 24, 25, 26, 27, 28, 30, 31, 33, 34, 42
5    11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 43, 49
6    11, 12, 13, 14, 15, 16, 17, 18, 24, 25, 26, 27, 28, 30, 31, 33, 34, 35, 36, 44, 45
7    11, 12, 13, 14, 15, 16, 17, 18, 24, 25, 26, 27, 28, 30, 31, 33, 34, 35, 36, 37, 46
8    11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 47, 48
9    11, 12, 13, 14, 15, 16, 17, 18, 50

$$j_{min}^1 := \arg\min_j |P_j^{group}|$$

is calculated. If $|j_{min}^1| > 1$, then:

$$j_{min}^{1,1} := \arg\min_{j \in j_{min}^1} |V_j^{group}|$$

is determined. If $|j_{min}^{1,1}| > 1$, then the element with the smallest index is taken: $j_{min}^{1,1} := j_{min}^{1,1}(1)$. For simplicity, in the following we denote both $j_{min}^1$ and $j_{min}^{1,1}$ by $j_{min}^1$. Define as the model and parameter part for $j_{min}^1$:

$$V_{j_{min}^1}^{part} := V_{j_{min}^1}^{group}, \qquad P_{j_{min}^1}^{part} := P_{j_{min}^1}^{group}.$$

Then the other groupings of both the elements and the parameters are reduced:

$$V_j^{group} \setminus V_{j_{min}^1}^{group}, \qquad P_j^{group} \setminus P_{j_{min}^1}^{group}, \qquad j \ne j_{min}^1.$$

In the second step the same procedure as above must be conducted. Assume for simplicity that $j_{min}^2$ is unique. Then it is:

$$j_{min}^2 := \arg\min_{j \ne j_{min}^1} \left| P_j^{group} \setminus P_{j_{min}^1}^{group} \right|$$

and the emerging model and parameter parts are given by:

$$V_{j_{min}^2}^{part} := V_{j_{min}^2}^{group} \setminus V_{j_{min}^1}^{group}, \qquad P_{j_{min}^2}^{part} := P_{j_{min}^2}^{group} \setminus P_{j_{min}^1}^{group}.$$

Again, the remaining groupings of elements and parameters are reduced:

$$\left( V_j^{group} \setminus V_{j_{min}^1}^{group} \right) \setminus V_{j_{min}^2}^{group} = V_j^{group} \setminus \bigcup_{l=1}^{2} V_{j_{min}^l}^{group}, \qquad j \ne j_{min}^1,\ j \ne j_{min}^2,$$
$$\left( P_j^{group} \setminus P_{j_{min}^1}^{group} \right) \setminus P_{j_{min}^2}^{group} = P_j^{group} \setminus \bigcup_{l=1}^{2} P_{j_{min}^l}^{group}, \qquad j \ne j_{min}^1,\ j \ne j_{min}^2.$$

This procedure can be continued until there is only one element grouping and one parameter grouping left, which form (after reduction) the last model and parameter part.
Definition 12. Generally, the model parts and the parameter parts are defined by:

$$V_{j_{min}^1}^{part} := V_{j_{min}^1}^{group}, \qquad P_{j_{min}^1}^{part} := P_{j_{min}^1}^{group}$$

where:

$$j_{min}^1 := \arg\min_{j \in \arg\min_k |P_k^{group}|} |V_j^{group}| \ (1)$$

and for $l = 2, \dots, N'$:

$$V_{j_{min}^l}^{part} := V_{j_{min}^l}^{group} \setminus \bigcup_{l'=1}^{l-1} V_{j_{min}^{l'}}^{group}, \qquad P_{j_{min}^l}^{part} := P_{j_{min}^l}^{group} \setminus \bigcup_{l'=1}^{l-1} P_{j_{min}^{l'}}^{group},$$

where:

$$j_{min}^l := \arg\min_{j \,\in\, \arg\min_{k \ne j_{min}^{l'},\, l' < l} \left| P_k^{group} \setminus \bigcup_{l'=1}^{l-1} P_{j_{min}^{l'}}^{group} \right|} \left| V_j^{group} \setminus \bigcup_{l'=1}^{l-1} V_{j_{min}^{l'}}^{group} \right| \ (1)$$

The algorithm for the determination of the model parts is presented in Algorithm 3.
It follows from their construction that the emerging model parts and the parameter parts are disjoint:

$$V_i^{part} \cap V_j^{part} = \emptyset, \qquad P_i^{part} \cap P_j^{part} = \emptyset, \qquad i, j = 1, \dots, N',\ i \ne j$$

Moreover, the rest set is disjoint from all model parts and the decomposition of the model is complete:

$$\sum_{j=1}^{N'} n_j^{part} + n^{rest} = n, \qquad n^{rest} := |V^{rest}|, \quad n_j^{part} := |V_j^{part}|, \quad j = 1, \dots, N'$$

Analogously to the groupings, each model part contains at least one exp-element.
The model parts can be regarded as disjoint subgraphs of $\tilde{G}$. Define:

$$E_j^{part} := \{v v' \in \tilde{E} \mid v, v' \in V_j^{part}\}$$

and $G_j^{part} := (V_j^{part}, E_j^{part})$ for all $j = 1, \dots, N'$.
By executing Algorithm 3 for the model in Table 2, we obtain the disjoint model parts in Box 3.

Determination of Required Input

Finally, it is necessary to determine which input is necessary in order to solve the model parts separately.
Definition 13. The sets of required input for the model part $V_j^{part}$, $j = 1, \dots, N'$, are defined by:

$$V_j^{inp} := \{v \in V \setminus V_j^{part} \mid \exists\, v' \in V_j^{part}: v v' \in \tilde{E}\} = \{v_l \in V \setminus V_j^{part} \mid \exists\, v_{l'} \in V_j^{part}: \tilde{a}_{l l'} = 1\}$$

and the parameter input by:

$$P_j^{inp} := \bigcup_{l=1}^{n_j^{part}} P_{V_j^{part}(l)} \setminus P_j^{part}.$$

Two types of input can be distinguished: $V_j^{inp,app}$, $j = 1, \dots, N'$, for the input by approximations of non-exp-elements from other model parts, and $V_j^{inp,exp}$, $j = 1, \dots, N'$, for the input by approximations of experimental data. It is:

Algorithm 3. Algorithm for the determination of the model parts $V_j^{part}$ and of the parameter parts $P_j^{part}$, $j = 1, \dots, N'$.

for j = 1,...,N' do
    V_j^part = V_j^group;
    P_j^part = ∪_{l=1,...,|V_j^group|} P_{V_j^group(l)};
end
N_min = {1,...,N'};
while N_min ≠ ∅ do
    % first criterion:
    j_min = arg min_{j=1,...,|N_min|} |P^part_{N_min(j)}|;
    if |j_min| > 1 then
        % second criterion:
        j_min = j_min( arg min_{j=1,...,|j_min|} |V^part_{N_min(j_min(j))}| );
        if |j_min| > 1 then
            j_min = j_min(1);
        end
    end
    for l = 1,...,|N_min| do
        if l ≠ j_min then
            V^part_{N_min(l)} = V^part_{N_min(l)} \ V^part_{N_min(j_min)};
            P^part_{N_min(l)} = P^part_{N_min(l)} \ P^part_{N_min(j_min)};
        end
    end
    N_min = N_min \ j_min;
end
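The greedy selection of Algorithm 3 can be sketched in Python as follows; this is a minimal illustration, assuming the groupings and the element-wise parameter sets P_i are available as Python sets, and the function and variable names are ours:

def model_parts(groups, param_sets):
    """Greedy determination of disjoint model parts (in the spirit of Algorithm 3).
    groups: list of element groupings (sets of 1-based element indices).
    param_sets: dict mapping element index -> set of parameter indices P_i."""
    V_part = [set(g) for g in groups]
    P_part = [set().union(*(param_sets[i] for i in g)) for g in groups]
    remaining = set(range(len(groups)))
    order = []
    while remaining:
        # first criterion: fewest parameters; second: fewest elements; then smallest index
        j = min(remaining, key=lambda k: (len(P_part[k]), len(V_part[k]), k))
        order.append(j)
        for l in remaining - {j}:
            V_part[l] -= V_part[j]      # remove already assigned elements ...
            P_part[l] -= P_part[j]      # ... and parameters from the other groupings
        remaining.remove(j)
    return [V_part[j] for j in order], [P_part[j] for j in order]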

$$V_j^{inp} = V_j^{inp,exp} \cup V_j^{inp,app}, \qquad V_j^{inp,exp} \cap V_j^{inp,app} = \emptyset, \qquad j \in \{1, \dots, N'\}.$$

The sets $V_j^{inp,app}$ and $V_j^{inp,exp}$, $j = 1, \dots, N'$, are given by:

$$V_j^{inp,app} := V_j^{inp} \setminus V^{exp}, \qquad V_j^{inp,exp} := V_j^{inp} \setminus V_j^{inp,app}.$$
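The required input of Definition 13 can also be computed directly from the reduced adjacency matrix. The following sketch assumes NumPy-style indexing and uses names of our own choosing:

def required_input(V_part, A_tilde, param_sets, P_part, exp_indices):
    """Element and parameter input required by one model part (Definition 13).
    V_part, P_part: element/parameter sets of the part; A_tilde: reduced adjacency matrix."""
    n = A_tilde.shape[0]
    V_inp = {l + 1 for l in range(n)
             if (l + 1) not in V_part
             and any(A_tilde[l, v - 1] == 1 for v in V_part)}
    V_inp_app = V_inp - set(exp_indices)     # non-exp-elements: approximations needed
    V_inp_exp = V_inp - V_inp_app            # exp-elements: experimental data suffice
    P_inp = set().union(*(param_sets[i] for i in V_part)) - P_part
    return V_inp_exp, V_inp_app, P_inp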

Box 3.

j    V_j^part
1    1
2    2, 3, 4, 5, 6, 7, 8
3    9, 10
4    24, 25, 26, 27, 28, 30, 31, 33, 34, 42
5    40, 41, 43, 49
6    35, 36, 44, 45
7    37, 46
8    20, 21, 22, 23, 29, 32, 38, 39, 47, 48
9    11, 12, 13, 14, 15, 16, 17, 18, 50

Due to the construction of the model parts, there is a $j \in \{1, \dots, N'\}$ such that $V_j^{inp,app} = \emptyset$ and $P_j^{inp} = \emptyset$. Thus we can start the parameter estimation with this model part, as it does not need any input from other model parts (at most input of approximations of experimental data). After parameter estimation of this model part, approximations of its elements can be generated that can be used as input for the remaining model parts in the further parameter estimation. If there are several model parts that do not require input (except from exp-elements), then their parameter estimation can be performed in parallel.
Box 4 displays the required input for the model parts of the model in Table 2. Hence it is possible to start parameter estimation with the two model parts $V_1^{part}$ and $V_9^{part}$, since $V_1^{inp,app} = P_1^{inp} = \emptyset$ and $V_9^{inp,app} = P_9^{inp} = \emptyset$.

Result for the Model of the Human Menstrual Cycle

As a result, we obtain the different disjoint model parts as well as the arcs between the model parts. There is an arc from a first model part to a second model part if there is an arc from one vertex of the first model part to a vertex of the second model part. The result is presented in Table 1 as well as visualized in Figure 5. We can see that the model parts can be classified into three groups of model parts that could be treated in parallel in the parameter estimation:
$$\{V_1^{part}, V_2^{part}, V_3^{part}, V_9^{part}\} \quad \rightarrow \quad \{V_4^{part}, V_6^{part}, V_7^{part}, V_8^{part}\} \quad \rightarrow \quad \{V_5^{part}\}$$

Since the order for the parameter parts is more restrictive,

$$\left( (P_1^{part},\, P_2^{part},\, P_3^{part}),\, P_9^{part} \right) \rightarrow P_4^{part} \rightarrow P_6^{part} \rightarrow P_7^{part} \rightarrow P_8^{part} \rightarrow P_5^{part},$$

where $P_1^{part}$, $P_2^{part}$, $P_3^{part}$ and $P_9^{part}$ can be treated in parallel, the model parts must be determined in this order, too.

Box 4.

j    V_j^inp,exp        V_j^inp,app                                      P_j^inp
1    42, 43             —                                                —
2    1, 42, 43          —                                                7, 8
3    42, 43, 50         —                                                7, 8, 38
4    8                  12, 13, 14, 15, 16, 17, 18                       38, 70
5    —                  29, 32, 38, 39                                   38, 173, 174, 176, 177
6    —                  28, 31, 33, 34                                   38, 150, 151, 153, 154
7    —                  28, 31, 35                                       38, 160, 161
8    10                 11, 12, 13, 14, 15, 16, 17, 18, 28, 31, 36, 37   38, 163, 164, 166, 167
9    8, 10              —                                                —

Table 1. The results of the model decomposition for the mathematical model with 49 differential equations and an equation for inhibin (n = 50) as well as of the n_P = 208 parameters (Table 2)

V_j^part / V^rest            V^exp ∩ V_j^part   P_j^part / P^rest                                         V_j^inp,exp   V_j^inp,app                   P_j^inp                   |V_j^part|   |P_j^part|
1                            1                  1,...,8                                                   42,43         —                             —                         1            8
2,...,8                      8                  9,...,40                                                  1,42,43       —                             7,8                       7            32
9,10                         10                 41,...,54,59,60                                           42,43,50      —                             7,8,38                    2            16
24,...,28,30,31,33,34,42     42                 93,...,106,117,...,134,145,...,155,185,186,187            8             12,...,18                     38,70                     10           46
40,41,43,49                  43,49              179,...,184,188,189,190,206,207,208                       —             29,32,38,39                   38,173,174,176,177        4            12
35,36,44,45                  44,45              156,...,165,191,...,196                                   —             28,31,33,34                   38,150,151,153,154        4            16
37,46                        46                 166,167,168,197,198,199                                   —             28,31,35                      38,160,161                2            6
20,...,23,29,32,38,39,47,48  47,48              88,...,92,107,...,116,135,...,144,169,...,178,200,...,205 10            11,...,18,28,31,36,37         38,64,163,164,166,167     10           41
11,...,18,50                 50                 55,...,58,61,...,86                                       8,10          —                             —                         9            30
19                           —                  87                                                        —             —                             —                         1            1

Figure 5. Visualization of the subgraphs $G_j^{part}$, $j = 1, \dots, N'$, representing the model parts without the rest set $V^{rest}$. An arc from $G_i^{part}$ to $G_j^{part}$ is drawn if $V_i^{part} \cap V_j^{inp} \ne \emptyset$. If $V_i^{part} \cap V_j^{inp,exp} \ne \emptyset$ then the arc is dotted, otherwise it is solid. The subgraphs are represented by undirected graphs.

Conclusion
This chapter concentrates on the modeling and the first steps in the determination of parameter values. First, some concepts are presented that can be used for the modeling of physiological processes. Then these concepts are applied to the case of the human menstrual cycle. In the third part of the chapter it is shown how this constructed model can be decomposed. Again, this concept is illustrated for the case of the human menstrual cycle.
Although this topic concerns a considerable part of the population, comparatively little has been done in the modeling of the human menstrual cycle. The complex model that has been developed in Reinecke & Deuflhard (2007), and that is presented in the section entitled Development of a Complex Mathematical Model for the Human Menstrual Cycle, is only the beginning in the field of mathematical modeling of the human menstrual cycle. There are many possibilities to expand this model in order to make it more realistic and to offer a larger number of applications. For example, the action of oral contraceptives can be included in the modeling process, where more compartments such as the stomach, the intestine, and the liver could be incorporated into the compartmental model. If the model gives a realistic representation of the essential processes in the body, we have many possibilities to examine different ways of influencing the cycle in order to achieve the desired results. The existing models are smaller and include only the essential components of the system, which limits the possibilities of manipulation.

A significant problem in the development of such rather complex and large models is the rising number of parameters, especially for the human body, since in many cases experimental data are not available. Often not even coarse values are known, which makes it difficult to determine the remaining parameter values by parameter estimation methods.
One possibility to simplify parameter estimation is presented here. The problem is easier to cope with if the dimension of the system is small. By the method of model decomposition using available experimental data, we obtain several smaller models. Even parallel parameter estimation is possible under certain conditions.
In order to be able to apply this procedure, it is necessary that experimental data are provided for at least two of the elements. The more experimental data are available, the smaller the model parts become on average. The mathematical basis is, for example, a system of differential equations. With the help of these model parts, it is possible to perform parameter estimation in smaller dimensions. At the end, the model parts can be recomposed by replacing the input of approximations of experimental data by the corresponding elements of other model parts, resulting in a dynamic model.
The detailed procedure for the determination of the model parts can be summarized as follows:

1. Transformation of the mathematical model into the corresponding graph
2. Calculation of the adjacency matrix
3. Defining the set of exp-elements
4. Reducing the graph by removing the arcs starting in exp-elements
5. Determining the sets of predecessors, i.e. the elements on which the exp-elements depend
6. Merging sets of predecessors into groupings if they match except for the exp-elements
7. Subsequently finding the grouping with the smallest number of parameters (if unique) and removing the elements of this grouping from the other groupings in order to obtain disjoint model parts
8. Determining the direct predecessors of the elements of the model parts and distinguishing between exp-elements and elements that need to be approximated (non-exp-elements)

Finally, we obtain a decomposition of the model into disjoint model parts. This decomposition is not necessarily complete: it is possible that there are elements that do not belong to any of the model parts and form the rest set. The parameters of these elements cannot be estimated.
An example of the functionality is presented, showing that it is possible to reduce the problem of dimension n = 50 to N' = 10 problems with dimensions in the range of 1 to 10. The number of parameters of each parameter part is between 6 and 46 instead of n_p = 208. In addition, it is partly possible to perform parameter estimation in parallel, which reduces the total time for solving this optimization problem.
This is possible whenever a model part needs, if anything, only input from approximations of experimental data; in that case its parameter estimation can be performed independently of the other model parts. If there is more than one part satisfying these conditions, it is possible to start parameter estimation with all of these model parts simultaneously. This reduces the time needed to perform parameter estimation for all model parts and thus for the entire model.

Table 2. Model equations. The 49 delay differential equations, one equation for the exp-element y50, and auxiliary functions are listed. In the first column, the delay differential equations are consecutively numbered. The corresponding equations are given in the second column.

No   Equation

p3p4
y43 (t p8 ) p6
d

+
p2
y1 (t ) = p4
1

p6
p6
p4
dt
p3 + y42 (t p7 )
p5 + y43 (t p8 )
1
p1

y1 (T j ) y1 (T j 1 ) = ( ln(1 U j )) , U j ~U [0,1], j IN
2

p31p11
y43 (t p8 ) p13
d
+
p9
y2 (t ) = p11
p13
p13
p11
dt
p12 + y43 (t p8 )
p10 + y43 (t p8 )
y2 (T j ) y2 (T j 1 ) p18p19
p
d
y3 (t ) = 16 +

(t T j ) p19 1 exp( p18 (t T j ))


dt
p17
p17
( p19 )
p15 y3 (t ) y4 (t ) p14 y3 (t ), t T j , T j +1 ), j IN

p y3 (t )
d
y4 (t ) = p22 23
+ p21 y6 (t ) p20 y3 (t ) y4 (t )
dt
p23 + y3 (t )

d
y5 (t ) = p20 y3 (t ) y4 (t ) p24 y5 (t )
dt

d
y6 (t ) = p24 y5 (t ) p21 y6 (t )
dt
p26p27
y43 (t p8 ) p29
y42 (t p7 ) p36
d
y7 (t ) = p25 + p27
p

30
dt
p26 + y42 (t p7 ) p27 p28p29 + y43 (t p8 ) p29
p35p36 + y42 (t p7 ) p36

p31p32
y43 (t p8 ) p34
p32
+
p32
p33p34 + y43 (t p8 ) p34
p31 + y43 (t p8 )

p37 y5 (t ) y7 (t )

p
p31p32
y43 (t p8 p39 ) p34
d
y8 (t ) = 37 p32
+
p32
p34
p34
dt
p38 p31 + y43 (t p8 p39 )
p33 + y43 (t p8 p39 )
y (t p7 p39 ) p36
p36 42
y5 (t p39 ) y7 (t p39 ) p40 y8 (t )
p35 + y42 (t p7 p39 ) p36
y42 (t p7 ) p51
p52p53
p42p43
d
y9 (t ) = p41 + p43

44
dt
p42 + y50 (t p45 ) p43
p50p51 + y42 (t p7 ) p51 p52p53 + y50 (t p45 ) p53

p46p47
y43 (t p8 ) p49
p47
+
p47
p48p49 + y43 (t p8 ) p49
p46 + y43 (t p8 )

p54 y5 (t ) y9 (t )


10

p
p46p47
y43 (t p8 p59 ) p49
d
y10 (t ) = 54 p47
+
y5 (t p59 ) y9 (t p59 )
p47
p49
p49
dt
p38 p46 + y43 (t p8 p59 )
p48 + y43 (t p8 p59 )
y (t p7 p59 ) p51
p52p53
p51 42

p60 y10 (t )
p50 + y42 (t p7 p59 ) p51 p52p53 + y50 (t p45 p59 ) p53

11   $\frac{d}{dt} y_{11}(t) = p_{61} \frac{y_{10}(t-p_{64})^{p_{63}}}{p_{62}^{p_{63}} + y_{10}(t-p_{64})^{p_{63}}} + p_{65}\, y_{10}(t-p_{64})^{p_{67}} - p_{66}\, y_{10}(t-p_{64})^{p_{68}}\, y_{11}(t)$

12   $\frac{d}{dt} y_{12}(t) = p_{66}\, y_{10}(t-p_{64})^{p_{68}}\, y_{11}(t) + p_{69}\, y_{10}(t-p_{64})^{p_{72}}\, y_{8}(t-p_{70})^{p_{73}} - p_{71}\, y_{10}(t-p_{64})^{p_{74}}\, y_{8}(t-p_{70})^{p_{75}}\, y_{12}(t)$

13   $\frac{d}{dt} y_{13}(t) = p_{71}\, y_{10}(t-p_{64})^{p_{74}}\, y_{8}(t-p_{70})^{p_{75}}\, y_{12}(t) + p_{76}\, y_{8}(t-p_{70})^{p_{78}} - p_{77}\, y_{8}(t-p_{70})^{p_{79}}\, y_{13}(t)$

14   $\frac{d}{dt} y_{14}(t) = p_{77}\, y_{8}(t-p_{70})^{p_{79}}\, y_{13}(t) - p_{80}\, y_{14}(t)$

15   $\frac{d}{dt} y_{15}(t) = p_{80}\, y_{14}(t) - p_{81}\, y_{15}(t)$

16   $\frac{d}{dt} y_{16}(t) = p_{81}\, y_{15}(t) - p_{82}\, y_{8}(t-p_{70})^{p_{83}}\, y_{16}(t)$

17   $\frac{d}{dt} y_{17}(t) = p_{82}\, y_{8}(t-p_{70})^{p_{83}}\, y_{16}(t) - p_{84}\, y_{8}(t-p_{70})^{p_{85}}\, y_{17}(t)$

18   $\frac{d}{dt} y_{18}(t) = p_{84}\, y_{8}(t-p_{70})^{p_{85}}\, y_{17}(t) - p_{86}\, y_{18}(t)$

19   $\frac{d}{dt} y_{19}(t) = p_{86}\, y_{18}(t) - p_{87}\, y_{19}(t)$

20   $\frac{d}{dt} y_{20}(t) = p_{88}\, y_{21}(t) + p_{89}\, y_{23}(t) - p_{90}\, y_{10}(t-p_{64})\, y_{20}(t)$

21   $\frac{d}{dt} y_{21}(t) = p_{90}\, y_{10}(t-p_{64})\, y_{20}(t) - (p_{91} + p_{88})\, y_{21}(t)$

22   $\frac{d}{dt} y_{22}(t) = p_{91}\, y_{21}(t) - p_{92}\, y_{22}(t)$

23   $\frac{d}{dt} y_{23}(t) = p_{92}\, y_{22}(t) - p_{89}\, y_{23}(t)$

24   $\frac{d}{dt} y_{24}(t) = p_{93}\, y_{25}(t) + p_{94}\, y_{27}(t) - p_{95}\, y_{8}(t-p_{70})\, y_{24}(t)$

25   $\frac{d}{dt} y_{25}(t) = p_{95}\, y_{8}(t-p_{70})\, y_{24}(t) - (p_{96} + p_{93})\, y_{25}(t)$

26   $\frac{d}{dt} y_{26}(t) = p_{96}\, y_{25}(t) - p_{97}\, y_{26}(t)$

27   $\frac{d}{dt} y_{27}(t) = p_{97}\, y_{26}(t) - p_{94}\, y_{27}(t)$

28

d
7

y28 (t ) = p98 p99 + i y11+ i (t ) y25 (t ) p99 y28 (t )


dt
i =1

29

d
8

y29 (t ) = p107 p108+ i y10 + i (t ) y21 (t ) p108 y29 (t )


dt
i =1

30

d
7

y30 (t ) = p117 p118+ i y11+ i (t ) y25 (t ) p118 y30 (t )


dt
i =1

31

d
7

y31 (t ) = p126 p127 + i y11+ i (t ) y25 (t ) p127 y31 (t )


dt
i =1

32

d
8

y32 (t ) = p135 p136 + i y10 + i (t ) y21 (t ) p136 y32 (t )


dt
i =1

33

p y (t ) p147 y34 (t )
y33 (t )
d
y33 (t ) = p145 y30 (t ) y28 (t ) 146 33
p150 y31 (t )
dt
p148 + y33 (t ) + p149 y34 (t )
p151 + y33 (t )
p152 y33 (t )

34

35

p y (t ) p147 y34 (t )
y34 (t )
d
y34 (t ) = y28 (t ) 146 33
p153 y31 (t )
p155 y34 (t )
dt
p148 + y33 (t ) + p149 y34 (t )
p154 + y34 (t )

y33 (t )
p y (t ) p157 y36 (t )
d
y35 (t ) = p150 y31 (t )
y28 (t ) 156 35
dt
p151 + y33 (t )
p158 + y35 (t ) + p159 y36 (t )
p160 y31 (t )

36

y35 (t )
p162 y35 (t )
p161 + y35 (t )

y34 (t )
p y (t ) p157 y36 (t )
d
y36 (t ) = p153 y31 (t )
+ y28 (t ) 156 35
dt
p154 + y34 (t )
p158 + y35 (t ) + p159 y36 (t )
p163 y31 (t )

37

38

y36 (t )
p165 y36 (t )
p164 + y36 (t )

y35 (t )
y37 (t )
d
y37 (t ) = p160 y31 (t )
p166 y28 (t )
p168 y37 (t )
dt
p161 + y35 (t )
p167 + y37 (t )

y36 (t )
y37 (t )
d
y38 (t ) = p163 y31 (t )
+ p166 y28 (t )
dt
p164 + y36 (t )
p167 + y37 (t )
y29 (t )

p169 y38 (t ) p170 y39 (t )


y38 (t )
p173 y32 (t )
p175 y38 (t )
p171 + y38 (t ) + p172 y39 (t )
p174 + y38 (t )
39

p y (t ) p170 y39 (t )
y39 (t )
d
y39 (t ) = y29 (t ) 169 38
p176 y32 (t )
p178 y39 (t )
dt
p171 + y38 (t ) + p172 y39 (t )
p177 + y39 (t )

40

y38 (t )
p y (t ) p180 y41 (t )
d
y40 (t ) = p173 y32 (t )
y29 (t ) 179 40
p183 y40 (t )
dt
p174 + y38 (t )
p181 + y40 (t ) + p182 y41 (t )

41

y39 (t )
p y (t ) p180 y41 (t )
d
y41 (t ) = p176 y32 (t )
+ y29 (t ) 179 40
p184 y41 (t )
dt
p177 + y39 (t )
p181 + y40 (t ) + p182 y41 (t )

42   $\frac{d}{dt} y_{42}(t) = \frac{p_{185}}{p_{38}}\, y_{34}(t-p_{186}) - p_{187}\, y_{42}(t)$

43   $\frac{d}{dt} y_{43}(t) = \frac{p_{188}}{p_{38}}\, y_{41}(t-p_{189}) - p_{190}\, y_{43}(t)$

44   $\frac{d}{dt} y_{44}(t) = \frac{p_{191}}{p_{38}}\, y_{35}(t-p_{192}) - p_{193}\, y_{44}(t)$

45   $\frac{d}{dt} y_{45}(t) = \frac{p_{194}}{p_{38}}\, y_{36}(t-p_{195}) - p_{196}\, y_{45}(t)$

46   $\frac{d}{dt} y_{46}(t) = \frac{p_{197}}{p_{38}}\, y_{37}(t-p_{198}) - p_{199}\, y_{46}(t)$

47   $\frac{d}{dt} y_{47}(t) = \frac{p_{200}}{p_{38}}\, y_{38}(t-p_{201}) - p_{202}\, y_{47}(t)$

48   $\frac{d}{dt} y_{48}(t) = \frac{p_{203}}{p_{38}}\, y_{39}(t-p_{204}) - p_{205}\, y_{48}(t)$

49   $\frac{d}{dt} y_{49}(t) = \frac{p_{206}}{p_{38}}\, y_{40}(t-p_{207}) - p_{208}\, y_{49}(t)$

50   $y_{50}(t) = p_{55} + p_{56}\, y_{12}(t) + p_{57}\, y_{17}(t) + p_{58}\, y_{18}(t)$

References
Andersen, M. E. (1991). Physiological modelling of organic compounds. Annals of Occupational Hygiene, (3), 309-321.
Anderson, L. (1996). Intracellular mechanisms triggering gonadotrophin secretion. Reviews of Reproduction, 1, 193-202.
Beineke, L. W., Wilson, R. J., (Eds.) (1997). Graph connections. Oxford Science Publications.
Blum, J. J., Reed, M. C., Janovick, J. A., & Conn, P. M. (2000). A mathematical model quantifying GnRH-induced LH secretion from gonadotropes. Am. J. Physiol. Endocrinol. Metab., 278, E263-E272.


Chabbert-Buffet, N., & Bouchard, P. (2002). The normal human menstrual cycle. Reviews in Endocrine and Metabolic Disorders, 3, 173-183.
Chauvet, G. (2004). The mathematical nature of the living world. The Power of Integration. World
Scientific.
Chávez-Ross, A., Franks, S., Mason, H. D., Hardy, K., & Stark, J. (1997). Modelling the control of ovulation and polycystic ovary syndrome. Journal of Mathematical Biology, 36, 95-118.
Clément, F., Monniaux, D., Stark, J., Hardy, K., Thalabard, J. C., Franks, S., & Claude, D. (2001). Mathematical model of FSH-induced cAMP production in ovarian follicles. Am. J. Physiol. Endocrinol. Metab., 281, E35-E53.
Conley, A. J., & Bird, I. M. (1997). Minireview: The role of cytochrome P450 17α-hydroxylase and 3β-hydroxysteroid dehydrogenase in the integration of gonadal and adrenal steroidogenesis via the Δ5 and Δ4 pathways of steroidogenesis in mammals. Biology of Reproduction, 56, 789-799.
Diestel, R. (2005). Graph theory. (3rd edition). Electronic edition. Springer.
Godsil, C., & Royle, G. (2001). Algebraic graph theory. Springer.
Gordan, J. D., Attardi, B. J., & Pfaff, D. W. (1998). Mathematical exploration of pulsatility in cultured
gonadotropin-releasing hormone neurons. Neuroendocrinology, 67, 2-17.
Greenspan, F. S., & Strewler, G. J., (Eds.). (1997). Basic & clinical endocrinology. 5th edition. Appleton
& Lange.
Harris, L. A. (2001). Differential equation models for the hormonal regulation of the menstrual cycle.
PhD Thesis: North Carolina State University.
Heinze, K., Keener, R. W., & Midgley Jr., A. R. (1998). A mathematical model of Luteinizing hormone release from ovine pituitary cells in perifusion. Am. J. Physiol. Endocrinol. Metab., 275, E1061-E1071.
Herbison, A. E. (1997). Noradrenergic regulation of cyclic GnRH secretion. Reviews of Reproduction,
2, 1-6.
Keck, C., Neulen, J., Behre, H. M., & Breckwoldt, M. (2002). Endokrinologie, Reproduktionsmedizin,
Andrologie. 2nd edition. Georg Thieme Verlag.
Keenan, D. M., & Veldhuis, J. D. (1998). A biomathematical model of time-delayed feedback in the human
male hypothalamic-pituitary-leydig cell axis. Am. J. Physiol. Endocrinol. Metab., 275(1), E157-E176.
KEGG PATHWAY Database (2006). http://www.genome.jp/kegg/pathway.html
Lacker, H. M., & Akin, E. (1988). How do the ovaries count? Mathematical Biosciences, 90, 305-332.
Luecke, R. H., & Wosilait, W. D. (1979). Drug elimination interactions: Analysis using a mathematical
model. Journal of Pharmacokinetics and Biopharmaceutics, 7(6), 629-641.
Magoffin, D. A., & Jakimiuk, A. J. (1997). Inhibin A, Inhibin B and Activin A in the follicular fluid of
regularly cycling women. Human Reproduction, 12(8), 1714-1719.


Moenter, S. M., Brand, R. C., & Karsch, F. J. (1992). Dynamics of Gonadotropin-releasing hormone
(GnRH) secretion during the GnRH surge: Insights into the mechanism of GnRH surge induction.
Endocrinology, 130(5), 2978-2984.
Rasgon, N. L., Pumphrey, L., Prolo, P., Elman, S., Negrao, A., Licinio, J., & Garfinkel, A. (2003). Emergent oscillations in a mathematical model of the human menstrual cycle. CNS Spectrums, 8(11), 805-814.
Reinecke, I., & Deuflhard, P. (2007). A complex model of the human menstrual cycle. Journal of Theoretical Biology, 247(2), 303-330.
Skinner, D. C., Evans, N. P., Delaleu, B., Goodman, R. L., Bouchard, P., & Caraty, A. (1998). The negative feedback actions of progesterone on gonadotropin-releasing hormone secretion are transduced by
the classical progesterone receptor. Proc. Natl. Acad. Sci., 95, 10978-10983.
Strauss III, J. F. (1999). The synthesis and metabolism of steroid hormones. In S. S. C. Yen, R. B. Jaffe,
R. L. Barbieri (Eds.), Reproductive endocrinology. physiology, pathophysiology, and clinical management, 125-154, 4th edition. W.B. Saunders Company.
Strauss III, J. F., & Williams, C. J. (1999). The ovarian life cycle. In S. S. C. Yen, R. B. Jaffe, & R. L. Barbieri (Eds.), Reproductive endocrinology: Physiology, pathophysiology, and clinical management, 213-253, 4th edition. W.B. Saunders Company.
Swerdloff, R. S., Jacobs, H. S., & Odell, W. D. (1972). Synergistic role of progestogens in estrogen
induction of LH and FSH surge. Endo, 90(6), 1529-1536.
Takeuchi, Y., Iwasa, Y., & Sato, K., (Eds.) (2007). Mathematics for life science and medicine. Springer.
Washington, T. M., Blum, J. J., Reed, M. C., & Conn, P. M. (2004). A mathematical model for LH release in response to continuous and pulsatile exposure of gonadotrophs to GnRH. Theoretical Biology
and Medical Modelling, 1(9), 1-17.
Yen, S. S. C. (1991). The human menstrual cycle: Neuroendocrine regulation. In S. S. C. Yen, R. B. Jaffe
(Eds.), Reproductive endocrinology. physiology, pathophysiology, and clinical management, 191-217,
3rd edition. W.B. Saunders Company.

Key Terms
Biphasic Hill Function: A Hill-type function that is decreasing at low values and increasing at high values.
Compartmentalization: The human body is divided into compartments, i.e. interconnected open systems.
Delay Differential Equation: A differential equation in which the dynamics depend not only on the current time point t but also on time points that lie in the past.


Positive and Negative Feedback: If substrates are regulated by other substrates that stimulate or inhibit them, we speak of positive and negative feedback, respectively.
Michaelis-Menten Mechanism: If reactions are catalyzed by enzymes, then the simplest approach
for the mathematical modeling is the Michaelis-Menten mechanism.
Model Decomposition: That is the partition of the mathematical model into disjoint model parts in
order to simplify the parameter estimation.
Pulse Generator: Some substances are not released at a constant rate but in pulses. In the hypothalamus, the GnRH pulse generator is responsible for the pulsatile release of GnRH. It is appropriate to choose a stochastic approach for its mathematical model.
Receptor Recycling: Unbound receptors are not active; they become activated when reacting with their ligand. After having accomplished its task, a receptor is not degraded but returns first to a non-activatable state and then to the unbound, activatable state.

Chapter XLIII
A Pandemic Avian Influenza Mathematical Model

Mohamed Derouich
Faculté des Sciences, Oujda, Morocco
Abdesslam Boutayeb
Faculté des Sciences, Oujda, Morocco

abstract
Throughout the world, seasonal outbreaks of influenza affect millions of people, killing about 500,000 individuals every year. Human influenza viruses are classified into three serotypes: A, B, and C. Only influenza A viruses can infect and multiply in avian species. During the last decades, important avian influenza epidemics have occurred and, so far, the epidemics among birds have been transmitted to humans; the most feared problem, however, is the risk of a pandemic caused by person-to-person transmission. The present mathematical model deals with the dynamics of avian influenza infection both in birds and in humans. Stability analysis is carried out and the behaviour of the disease is illustrated by simulation with different parameter values.

INTRODUCTION
Worldwide, seasonal outbreaks of influenza (also known as flu) affect millions of people, killing about 500,000 individuals every year (WHO, 2005). Human influenza viruses are classified into three serotypes: A, B and C. Only influenza A viruses are known to infect and multiply in avian species. These viruses present 16 HA (haemagglutinin) and 9 NA (neuraminidase) subtypes (H1N1, H2N2, H3N2, H5N1, H7N7, ...) (Alexander, 2004).
In domestic poultry, infection by avian influenza viruses causes two main forms of illness, characterized by extremely weak and extremely high virulence, respectively. The first, weakly pathogenic form provokes only benign symptoms (ruffled feathers, less frequent egg laying) and can easily pass unobserved. The second, highly pathogenic form has far more serious consequences: it propagates very quickly in flocks, and the mortality rate can approach 100%, with death often occurring within 48 hours.
The wide spread of influenza in poultry and wild birds during the last decade and the occurrence of human influenza infections have raised the question of pandemics. For a pandemic to start, three conditions are required: a novel influenza virus subtype must emerge against which the general population has, in its majority, no immunity; the virus must infect humans and cause serious illness; and the new virus must have a high rate of person-to-person transmission (WHO, 2005; Ferguson, 2004). Three major pandemics occurred during the last century. In 1918, the Spanish flu killed an estimated 40-50 million people; in 1957 the Asian flu pandemic killed about 2 million people; and the Hong Kong flu killed an estimated one million people in 1968. Although influenza pandemics are considered inevitable, the avian epidemics that occurred during the last decade, starting in 1997 (Hong Kong), have not engendered pandemics. Studies have shown that direct contact with diseased poultry was the source of infection and found no evidence of person-to-person spread of the virus. However, due to the potential for cross-species transmission of avian and human influenza viruses and the possibility of virus reassortment, the high rates of mortality among the few cases observed recently (Table 1) could lead to devastating pandemics (Yuen, 2005; Kuiken, 2006; Smith, 2006). Consequently, the risk of pandemics and its corollaries remains on the agenda of national and international health bodies.

Table 1. Cumulative number of confirmed human cases of avian influenza A/(H5N1) reported to WHO
(last update: 6 June 2007). Total number of cases includes number of deaths. WHO reports only laboratory-confirmed cases. All dates refer to onset of illness. (WHO, 2007).
Country

2003

2004

2005

2006

2007

Total

cases

deaths

cases

deaths

cases

deaths

cases

deaths

cases

deaths

cases

deaths

Azerbaijan

Cambodia

China

13

25

16

Djibouti

Egypt

18

10

16

34

14

Indonesia

20

13

55

45

24

21

99

79

Iraq

Lao Peoples
Democratic
Republic

Nigeria

Thailand

17

12

25

17

Turkey

12

12

Viet Nam

29

20

61

19

93

42

Total

46

32

98

43

115

79

47

31

310

189


Mathematical models have been used for infectious diseases in general and for influenza in particular
(Alexander, 2004; Ferguson, 2004; Hethcote, 2000; Mena-Lorca, 1992; Derouich, 2006). In the case of
avian influenza, deterministic models were used for comparing interventions aimed at preventing and
controlling influenza pandemics (Ferguson, 2004; Carrot, 2006) and stochastic models were proposed
to model and predict the worldwide spread of pandemic influenza (Colizza, 2006; Colizza, 2007).
In the present chapter, we propose a mathematical model to study the dynamics of human infection
by avian influenza. The model deals with both infections (avian and human). Stability analysis is given
and simulation is carried out with different parameters values. The model illustrates in particular, the
importance of parameters such as the average number of adequate contacts of a human susceptible with
infected human and the average number of adequate contacts of a human susceptible with infected birds
in determining the incidence of the disease and consequent preventive strategies.

FORMULATION OF THE MODEL AND STABILITY ANALYSIS


Parameters of the Model
Let N and N0 denote the human and bird population sizes. In this model death is proportional to the population size with rate constant μ, and we suppose that N and N0 are constant.
The human population of size N is formed of susceptibles S, infectives I and removed R; the bird population of size N0 is formed of susceptibles S0 and infectives I0.
The human incidence, i.e. the rate at which susceptible individuals become infectious, is

$$\left( \beta \frac{I}{N} + \beta_0 \frac{I_0}{N_0} \right) S.$$

If the time unit is days, then the incidence is the number of new infections per day. The daily contact rate β is the average number of adequate contacts of a human susceptible with infected humans per day, the daily contact rate β0 is the average number of adequate contacts of a human susceptible with infected birds per day, I/N is the infectious fraction of the human population and I0/N0 is the infectious fraction of the bird population. Time units of weeks, months or years could also be used.
Similarly, $\beta_0' \frac{I_0}{N_0} S_0$ is the bird incidence, and β'0 is the average number of adequate contacts of a susceptible bird with other birds per day. The human life span is taken equal to 25,000 days (68.5 years), and that of the bird is about 2,500 days. The other parameters used in the model are the birth rate constant (μ); the contact rate, human to human (β); the effective contact rate, bird to human (β0); the effective contact rate, bird to bird (β'0); the human life span (1/μ); and the host infection duration (1/(α + γ)).
Equations of the Model

A schematic representation of the model is shown in Figure 1. For the human population we consider an SIRS compartmental model, that is, susceptible individuals become infectious, are then removed with temporary immunity after recovery from infection, and become susceptible again when immunity fades away; for the bird population we consider an SI compartmental model.
The model is governed by the following equations:

Figure 1. Schematic representation of the compartmental model: the flows between the human compartments S, I, R and the bird compartments S0, I0.

Human population:

$$\frac{dS}{dt} = \mu N - \left( \mu + \beta \frac{I}{N} + \beta_0 \frac{I_0}{N_0} \right) S + \delta R$$
$$\frac{dI}{dt} = \left( \beta \frac{I}{N} + \beta_0 \frac{I_0}{N_0} \right) S - (\mu + \alpha + \gamma) I$$
$$\frac{dR}{dt} = \gamma I - (\mu + \delta) R$$
$$\frac{dN}{dt} = \mu N - \mu N - \alpha I$$

Bird population:

$$\frac{dS_0}{dt} = \mu_0 N_0 - \left( \mu_0 + \beta_0' \frac{I_0}{N_0} \right) S_0$$
$$\frac{dI_0}{dt} = \beta_0' \frac{I_0}{N_0} S_0 - \mu_0 I_0$$
0 0

Introducing the proportions $s = \frac{S}{N}$, $i = \frac{I}{N}$, $r = \frac{R}{N}$, $s_0 = \frac{S_0}{N_0}$ and $i_0 = \frac{I_0}{N_0}$, and using the conditions $s + i + r = 1$ and $s_0 + i_0 = 1$, i.e. $r = 1 - (s + i)$ and $s_0 = 1 - i_0$, the two previous systems become:
801

A Pandemic Avian Influenza Mathematical Model

ds
dt

di
dt

dr
dt
di
0
dt

( + i+

=( i+

i )s + r

0 0

i )s (

+ +

0 0

)i

= i ( + )r
=

i (1 i0 )

'
0 0

0 0
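As an illustration of how such a system can be explored numerically, the following Python/SciPy sketch integrates the proportion system; the parameter values are illustrative assumptions only (they are not the values used for Figures 2-5), and the symbols follow the notation introduced above:

import numpy as np
from scipy.integrate import solve_ivp

# Illustrative parameter values (assumptions for demonstration only)
mu, mu0 = 1 / 25000, 1 / 2500            # human and bird mortality (per day)
beta, beta0, beta0p = 0.1, 0.02, 0.3     # human-human, bird-human, bird-bird contact rates
gamma, alpha, delta = 0.25, 0.002, 0.01  # recovery, disease-related death, loss of immunity

def rhs(t, x):
    s, i, r, i0 = x
    ds = mu - (mu + beta * i + beta0 * i0) * s + delta * r
    di = (beta * i + beta0 * i0) * s - (mu + alpha + gamma) * i
    dr = gamma * i - (mu + delta) * r
    di0 = beta0p * i0 * (1 - i0) - mu0 * i0
    return [ds, di, dr, di0]

sol = solve_ivp(rhs, (0, 2000), [0.999, 0.001, 0.0, 0.01])
print(sol.y[:, -1])   # long-term state; compare with the threshold R = beta0p / mu0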

Equilibrium Points
Theorem 1
Let $R = \frac{\beta_0'}{\mu_0}$ and $\sigma = \frac{\beta}{\mu + \alpha + \gamma}$. Then the previous system admits the following equilibrium points:

If $R \le 1$ there are two equilibrium points:

1. The trivial state $E_1(1, 0, 0, 0)$, which is the only equilibrium and is locally asymptotically stable for $\sigma \le 1$.
2. The endemic equilibrium
$$E_1'\left( \frac{\mu + \alpha + \gamma}{\beta},\ \frac{\mu(\mu + \delta)\left(\beta - (\mu + \alpha + \gamma)\right)}{\beta\left((\mu + \delta)(\mu + \alpha + \gamma) - \delta\gamma\right)},\ \frac{\gamma}{\mu + \delta}\,\bar{i},\ 0 \right),$$
which is locally asymptotically stable for $\sigma > 1$.

If $R > 1$ then an endemic equilibrium $E_2(\bar{s}, \bar{i}, \bar{r}, \bar{i}_0)$ will be the equilibrium point that is locally asymptotically stable.

Proof
Equilibrium Points
The equilibrium points satisfy the following relations:

$$\mu - (\mu + \beta i + \beta_0 i_0)\, s + \delta r = 0 \qquad (1)$$
$$(\beta i + \beta_0 i_0)\, s - (\mu + \alpha + \gamma)\, i = 0 \qquad (2)$$
$$\gamma i - (\mu + \delta)\, r = 0 \qquad (3)$$
$$\beta_0' i_0 (1 - i_0) - \mu_0 i_0 = 0 \qquad (4)$$

From equation (4) we have $i_0 = 0$ or $i_0 = 1 - \frac{\mu_0}{\beta_0'} = \frac{\mu_0}{\beta_0'}(R - 1)$, where $R = \frac{\beta_0'}{\mu_0}$.
From equation (3) we have $r = \frac{\gamma}{\mu + \delta}\, i$.

1. For $i_0 = 0$:
From equation (2) we have $i = 0$ or $s = \frac{\mu + \alpha + \gamma}{\beta}$.
If $i = 0$, then $r = 0$ and $s = 1$, and the equilibrium point is $E_1(1, 0, 0, 0)$.
If $s = \frac{\mu + \alpha + \gamma}{\beta}$, then from equation (1) we obtain

$$i = \frac{\mu(\mu + \delta)\left(\beta - (\mu + \alpha + \gamma)\right)}{\beta\left((\mu + \delta)(\mu + \alpha + \gamma) - \delta\gamma\right)},$$

so $i > 0$ for $\sigma = \frac{\beta}{\mu + \alpha + \gamma} > 1$. In this case the equilibrium point is $E_1'$ as given in the theorem.

2. For $i_0 = 1 - \frac{\mu_0}{\beta_0'}$:
Expressing $s$ from equation (1), with $r = \frac{\gamma}{\mu + \delta} i$, and substituting into equation (2) leads to a second-degree polynomial equation $Q(i) = 0$ for the infectious fraction $i$. Since $s \ge 0$ and $i \ge 0$, the admissible values of $i$ lie in the interval

$$\left[ 0,\ \frac{\mu(\mu + \delta)}{(\mu + \delta)(\mu + \alpha + \gamma) - \delta\gamma} \right].$$

When $R \le 1$ the value of the polynomial $Q(i)$ is negative at the end points of this interval, and therefore there is no root in this interval. If $R > 1$ then $Q(0) > 0$, so there exists a unique root in the interval, which implies the existence of a unique equilibrium point $E_2(\bar{s}, \bar{i}, \bar{r}, \bar{i}_0)$.


Stability Proof
The matrix of linearization (Jacobian matrix) of the reduced system in $(s, i, r, i_0)$ is given by:

$$J = \begin{pmatrix} -(\mu + \beta i + \beta_0 i_0) & -\beta s & \delta & -\beta_0 s \\ \beta i + \beta_0 i_0 & \beta s - (\mu + \alpha + \gamma) & 0 & \beta_0 s \\ 0 & \gamma & -(\mu + \delta) & 0 \\ 0 & 0 & 0 & \mu_0\left(R - 1 - 2\frac{\beta_0'}{\mu_0} i_0\right) \end{pmatrix}$$

1. For the point $E_1$ the matrix $J$ becomes:

$$J = \begin{pmatrix} -\mu & -\beta & \delta & -\beta_0 \\ 0 & \beta - (\mu + \alpha + \gamma) & 0 & \beta_0 \\ 0 & \gamma & -(\mu + \delta) & 0 \\ 0 & 0 & 0 & \mu_0(R - 1) \end{pmatrix}$$

Its eigenvalues are $-\mu$, $-(\mu + \delta)$, $\beta - (\mu + \alpha + \gamma)$ and $\mu_0(R - 1)$, so $E_1$ is stable if and only if $R \le 1$ and $\sigma \le 1$.

2. For the point $E_1'$ one eigenvalue of $J$ is $\lambda_1 = \mu_0(R - 1)$, which is negative for $R < 1$, and the remaining eigenvalues are the roots of a third-degree polynomial $P(\lambda) = \lambda^3 + A\lambda^2 + B\lambda + C$ with $A > 0$, $B > 0$, $C > 0$ and $AB > C$. Following the Routh-Hurwitz conditions for the polynomial $P$, the state $E_1'$ is locally asymptotically stable for $\sigma > 1$.

3. The local stability of $E_2$ is demonstrated in the same way as for $E_1'$.

RESULTS AND DISCUSSION


Stability analysis and values of the threshold were obtained. Simulation was carried out with different values of the parameters, and the results are illustrated in Figures 2-5.
Figure 2 illustrates the behaviour of the solutions in the case of the endemic equilibrium (R > 1).
Figures 3 and 4 compare the human infectious fraction for different values of the effective contact rate from birds to humans.
Figure 5 shows the typical behaviour of the solutions when R ≤ 1: the human susceptible, infectious and removed fractions, as well as the avian infectious fraction, approach the trivial equilibrium asymptotically.
The wide spread of avian influenza in birds poses two main risks for human and bird health. The first is the risk of direct infection of birds and humans when the virus passes from bird to bird and from birds to humans.

Figure 2. Behaviour of the solutions in the endemic case (parameter values: 0.0004, 0.035, 0.00004, 0.25, 0.1, 0.002)

Figure 3. Human infectious fraction for different values of the effective contact rate from birds to humans (parameter values: 0.0004, 0.035, 0.00004, 0.25, 0.1, 0.002)

Figure 4. Human infectious fraction for different values of the effective contact rate from birds to humans (parameter values: 0.04, 0.035, 0.00004, 0.25, 0.1, 0.002)

Figure 5. Approach to the trivial equilibrium (parameter values: 0.04, 0.035, 0.00004, 0.25, 0.1, 0.002, 0.075)

The second risk is that the virus may change into a form that is highly infectious for birds and/or for humans. However, as stressed in the introduction, the most crucial case is the eventual occurrence of a pandemic caused by person-to-person spread of the virus. As indicated by the simulation of different patterns, the dynamics of the disease is mainly determined by the average number of adequate contacts of a human susceptible with infected humans and the average number of adequate contacts of a human susceptible with infected birds. These two parameters are essential keys to preventive strategies against pandemics.

REFERENCES
Alexander, M. E., Bowman, C., Moghadas, S. M., Summers, R., Gumel, A. B. & Sahai, B. M. (2004).
A vaccination model for transmission dynamics of influenza. SIAM Journal of Applied Dynamical
Systems, 3(4), 503-524.
Bradt, D. A. & Drummond, C. M. (2006). Avian influenza pandemic threat and health systems response.
Emergency Medicine Australia, 18(5-6), 430-443.
Carrot, F., Luong, T., Lao, H., Sall, A. V. , Lajaunie, C. & H. Wadernagel. (2006). A small-world-like
model for comparing interventions aimed at preventing and controlling influenza pandemics. Biomedical Central Medicine, 4, 26-28
Colizza, V., Barrat, A., Barthelemy, M., Valleron, A. J., & Vespignani, A. (2006). The modelling of global
epidemics: stochastic dynamics and predictability. Bulletin of Mathematical Biology, 68, 1893-1921.
Colizza, V., Barrat, A., Barthelemy, M., Valleron, A. J., & Vespignani, A. (2007). Modelling the worldwide spread of pandemic influenza: Baseline case and containment interventions. PLoS Medicine, 4(1), e13.
Derouich, M., & Boutayeb, A. (2006). Dengue fever: Mathematical modelling and computer simulation.
Applied Mathematics and Computation, 177, 528-544.
Ferguson, N. M., Fraser, C., Donnelly, C. A., Ghani, A. C., & Anderson, R. M. (2004). Public health
risk from the avian H5N1 influenza epidemic. Science, 304, 968-969.
Hethcote, H. W. (2000). The mathematics of infectious diseases. SIAM Review, 42(4), 599-653.
Mena-Lorca, J., & Hethcote, H. W. (1992). Dynamic models of infectious diseases as regulators of population sizes. Journal of Mathematical Biology, 30, 693-716.
Kuiken, T., Holmes, E. C., McCauley, J., Rimmelzwaan, G. F., Williams, C. S., & Grenfell, B. T. (2006).
Host species barriers to influenza virus infections. Science, 312, 394-397.
Smith, D. J. (2006). Predictability and preparedness in influenza control. Science, 312, 392-394.
WHO. (2007). Avian influenza (bird flu) - Fact sheet. Retrieved June 06, 2007, from http://www.who.int/csr/disease/avian_influenza


WHO. (2007). Avian influenza: Assessing the pandemic threat. Retrieved June 06, 2007, from http://www.who.int/csr/disease/influenza/H5N1-9reduit.pdf
Yuen, K. Y. & Wong, S. S. M. (2005). Human infection by avian influenza A H5N1. Hong Kong Medical Journal, 11, 189-199.

Key Terms
Avian: Related to birds.
Immunity: Inherited, induced or acquired (e.g. by vaccine) resistance to infection by a specific pathogen.
Incidence: The number of new cases of a specific disease occurring during a given period (in general, a year).
Influenza: An acute contagious viral infection characterized by inflammation of the respiratory tract and by fever, chills and muscular pain.
Mathematical Model: An abstract model using equations to describe the behaviour of a system (biological, physical, ...).
Pandemic: An epidemic affecting a large proportion of a population over a wide geographical region.
Susceptible: Lacking immunity and resistance and consequently at risk of infection.
Stability: The condition of being resistant to changes and perturbations.

Chapter XLIV
Dengue Fever: A Mathematical Model with Immunization Program

Mohamed Derouich
Faculté des Sciences, Oujda, Morocco
Abdesslam Boutayeb
Faculté des Sciences, Oujda, Morocco

abstract
Dengue fever is a re-emergent disease affecting more than 100 countries. Its incidence rate has increased fourfold since 1970, with nearly half the world's population now at risk. In this chapter, a mathematical model with immunization is proposed to simulate the succession of two epidemics with variable human populations. Stability analysis of the equilibrium points is carried out and simulation is given for different parameter settings.

INTRODUCTION
At the dawn of the third millennium, the world population is facing a double burden of non-communicable diseases (NCDs) and infectious diseases (Boutayeb, 2006). NCDs, once known as the diseases of the rich, are now also affecting developing countries, where cardiovascular diseases (CVDs), cancer and diabetes are flourishing (WHO, 2003; Boutayeb, 2005; Parkin, 1999). In parallel, infectious diseases continue to be the major causes of mortality and morbidity in low- and middle-income countries, where well-known existing, emerging and re-emerging diseases like tuberculosis, cholera, meningitis, hepatitis, malaria, dengue, yellow fever, AIDS, Ebola, SARS and others are causing suffering and mortality in a wide population. Among the infectious diseases, dengue fever, especially known in Southeast Asia, is now endemic in more than 100 countries worldwide. Its incidence has increased fourfold since 1970 and nearly half the world population (2.5-3 billion) is now at risk. It is estimated that more than 50 million people are infected every year, of which half a million develop Dengue Haemorrhagic Fever (DHF) (DengueNet, 2007; Report, 2002; Teixeira, 2002). The two recognised species of the vector transmitting dengue are Aedes aegypti and Aedes albopictus. The first is highly anthropophilic, thriving in crowded cities and biting primarily during the day, while the latter is less anthropophilic and inhabits rural areas. Consequently, the importance of dengue is two-fold:

With increasing urbanisation, crowded cities, poor sanitation and lack of hygiene, environmental
conditions foster the spread of the disease which, even in the absence of fatal forms, breeds significant economic and social costs (absenteeism, immobilisation, debilitation, medication).
The potential risk of evolution towards the haemorrhagic form and the dengue shock syndrome
with high economic costs and which may lead to death.

Many authors have presented the disease as a major health problem either for the last decades of
the 20th century or for the third millennium (Gubler, 1997; Gubler, 2002). The need for research and
surveillance is often dealt with and many authors have stressed that DF/DHF is still perceived as unimportant and receives little attention despite its social and economic impact being similar to some of
the most visible infectious diseases (Meltzer, 1998; Coleman, 2004).
Different mathematical models have been proposed. In general, they use compartmental dynamics with Susceptible, Exposed, Infective and Removed classes for humans, and Susceptible and Infective classes for mosquitoes. SEIRS models were considered with an evaluation of the impact of ultra-low volume (ULV) insecticide applications on dengue epidemics (Newton, 1992). The values of basic parameters used in simulation by these authors constituted a data source (Table 1) for other authors. A general model with the populations of susceptible and infectious humans assumed constant and facing only one virus was considered by Esteva and Vargas (1998). These authors also proposed models where the human population was supposed to grow exponentially with a constant disease rate (Esteva, 1999), models with two serotypes of virus and a variable human population, and models of the impact of vertical transmission and interrupted feeding on the dynamics of the disease (Esteva, 2000; Esteva, 2003).
In previous papers, while pointing out that the idea of two viruses coexisting in the same epidemic is controversial, mathematical models with a constant human population (Nh) and two different viruses acting in separate intervals of time were considered by the authors (Derouich, 2003; Derouich, 2004). The case of a variable human population (Nh) was also considered (Derouich, 2006). Building on that, the present chapter introduces a compartment of vaccinated people and hence considers an SVIR model.
FORMULATION OF THE MODEL AND STABILITY ANALYSIS


Parameters of the Model
Let Nh and Nv denote the human and vector population sizes. In this model death is proportional to the population size with rate constant μh, and we assume a constant recruitment λh due to births and immigration, so that

$$\frac{dN_h}{dt} = \lambda_h - \mu_h N_h.$$

For the vector population we suppose that Nv is constant. The human population of size Nh is formed of susceptibles Sh, vaccinated Vh, infectives Ih and removed Rh; the mosquito population of size Nv is formed of Sv and Iv.
The model supposes a homogeneous mixing of the human and mosquito populations, so that each bite has an equal probability of being taken from any particular human. Denoting by bs the average biting rate of susceptible vectors and by phv the average transmission probability from an infectious human to a susceptible vector, the rate of exposure for vectors is given by (phv Ih bs)/Nh.
It is known that some infections increase the number of bites by infected mosquitoes relative to susceptible ones; therefore we assume that the biting rate of infected mosquitoes, bi, is greater than that of susceptible mosquitoes, bs.
Denoting by pvh the average transmission probability from an infectious vector to a human and by Iv the number of infectious vectors, the rate of exposure for humans is given by (pvh Iv bi)/Nh. So the adequate contact rate from humans to vectors is given by Chv = phv·bs, and the adequate contact rate from vectors to humans is given by Cvh = pvh·bi.
ρh is the rate at which susceptible individuals receive the vaccine; θh is the rate at which vaccine-based immunity wanes. The human life span is taken equal to 25,000 days (68.5 years), and that of the vector is 4 days. Other parameter values are given in Table 1.

Equations of the Model

A schematic representation of the model is shown in Figure 1.
We consider a compartmental model, that is to say that every population is divided into classes, and an individual of a population passes from one class to another with a suitable rate. Up to now there is no vaccine against dengue viruses, but research is ongoing and the eventuality of an immunization program is not excluded in the medium term. In this study we investigate the effect of such an immunization option.
Table 1. Definitions and values of the basic parameters used in simulations (Newton & Reiter, 1992)

Name of the parameter                          Notation        Base value
Transmission probability of vector to human    p_vh            0.75
Transmission probability of human to vector    p_hv            0.75
Bites per susceptible mosquito per day         b_s             0.5
Bites per infectious mosquito per day          b_i             1.0
Effective contact rate, human to vector        C_hv            0.375
Effective contact rate, vector to human        C_vh            0.75
Human life span                                1/μ_h           25,000 days
Vector life span                               1/μ_v           4 days
Host infection duration                        1/(γ_h + μ_h)   3 days

Figure 1. Schematic diagram: Compartments of human and vector populations

First Epidemic
The model is governed by the following equations:
Human population:
dS_h/dt = λ_h - (μ_h + p_h + C_vh I_v/N_h) S_h + θ_h V_h

dV_h/dt = p_h S_h - (μ_h + θ_h) V_h

dI_h/dt = (C_vh I_v/N_h) S_h - (μ_h + γ_h + α_h) I_h

dR_h/dt = γ_h I_h - μ_h R_h

dN_h/dt = λ_h - μ_h N_h - α_h I_h

Vector population:
dS_v/dt = μ_v N_v - (μ_v + C_hv I_h/N_h) S_v

dI_v/dt = (C_hv I_h/N_h) S_v - μ_v I_v

Introducing the proportions

s_h = S_h/(λ_h/μ_h),  v_h = V_h/(λ_h/μ_h),  i_h = I_h/(λ_h/μ_h),  r_h = R_h/(λ_h/μ_h),  n_h = N_h/(λ_h/μ_h),
s_v = S_v/N_v  and  i_v = I_v/N_v,

so that, with the conditions s_h + v_h + i_h + r_h = n_h and s_v + i_v = 1, i.e. r_h = n_h - (s_h + v_h + i_h) and s_v = 1 - i_v, the two previous systems become:

ds_h/dt = μ_h - (μ_h + p_h + C_vh m i_v/n_h) s_h + θ_h v_h

dv_h/dt = p_h s_h - (μ_h + θ_h) v_h

di_h/dt = (C_vh m/n_h) i_v s_h - (μ_h + γ_h + α_h) i_h

di_v/dt = (C_hv i_h/n_h)(1 - i_v) - μ_v i_v

dn_h/dt = μ_h - μ_h n_h - α_h i_h

with m = N_v/(λ_h/μ_h).
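As a concrete illustration of how this scaled system can be explored numerically, the following sketch integrates it with SciPy using the base values of Table 1. It is a minimal sketch rather than the authors' own simulation code; the chosen values of p_h (vaccination rate), θ_h (waning rate), α_h (disease-induced death rate), m and the initial proportions are illustrative assumptions.

```python
# Minimal sketch: numerical integration of the scaled first-epidemic SVIR-SI system.
# Parameter values follow Table 1 where available; p_h, theta_h, alpha_h, m and the
# initial conditions are illustrative assumptions, not values taken from the chapter.
import numpy as np
from scipy.integrate import solve_ivp

mu_h, mu_v = 1 / 25000.0, 1 / 4.0        # human and vector mortality rates (per day)
gamma_h = 1 / 3.0 - mu_h                 # host infection duration of 3 days
alpha_h = 0.0                            # disease-induced death rate (assumption)
C_vh, C_hv = 0.75, 0.375                 # effective contact rates (Table 1)
p_h, theta_h = 0.05, 1 / 365.0           # vaccination and waning rates (assumptions)
m = 2.0                                  # scaled vector density N_v/(lambda_h/mu_h) (assumption)

def rhs(t, y):
    s_h, v_h, i_h, i_v, n_h = y
    ds_h = mu_h - (mu_h + p_h + C_vh * m * i_v / n_h) * s_h + theta_h * v_h
    dv_h = p_h * s_h - (mu_h + theta_h) * v_h
    di_h = (C_vh * m / n_h) * i_v * s_h - (mu_h + gamma_h + alpha_h) * i_h
    di_v = (C_hv * i_h / n_h) * (1 - i_v) - mu_v * i_v
    dn_h = mu_h - mu_h * n_h - alpha_h * i_h
    return [ds_h, dv_h, di_h, di_v, dn_h]

y0 = [0.99, 0.0, 0.01, 0.01, 1.0]        # initial proportions (assumption)
sol = solve_ivp(rhs, (0, 1000), y0, max_step=1.0)
print("final infectious human proportion:", sol.y[2, -1])
```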

Equilibrium Points
Theorem 1
The previous system admits two equilibrium points:

If μ_h(R - 1) ≤ p, the trivial state E1 = (μ_h/(μ_h + p), p_h μ_h/((μ_h + p)(μ_h + θ_h)), 0, 0, 1) is the only equilibrium.

If μ_h(R - 1) > p, then an endemic equilibrium E2 = (s_h*, v_h*, i_h*, i_v*, n_h*) will be the equilibrium point, where

R = m C_hv C_vh/(μ_v(μ_h + γ_h + α_h))  and  p = p_h μ_h/(μ_h + θ_h).
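The threshold in Theorem 1 can be evaluated directly from the parameter values. The short sketch below computes R, p and the vaccination rate required to satisfy μ_h(R - 1) ≤ p. It is a hedged illustration only: the values of m, p_h and θ_h are assumptions rather than values prescribed by the chapter.

```python
# Sketch: evaluating the reproduction number R, the quantity p and the threshold
# condition mu_h*(R - 1) <= p of Theorem 1. m, p_h and theta_h are assumptions.
mu_h, mu_v = 1 / 25000.0, 1 / 4.0
gamma_h, alpha_h = 1 / 3.0 - mu_h, 0.0
C_vh, C_hv = 0.75, 0.375
m = 2.0                                   # assumed scaled vector density
p_h, theta_h = 0.05, 1 / 365.0            # assumed vaccination and waning rates

R = m * C_hv * C_vh / (mu_v * (mu_h + gamma_h + alpha_h))
p = p_h * mu_h / (mu_h + theta_h)
print(f"R = {R:.2f}, p = {p:.2e}")
print("disease dies out" if mu_h * (R - 1) <= p else "endemic equilibrium exists")

# Minimum vaccination rate p_h satisfying the threshold, obtained by solving
# mu_h*(R - 1) = p_h*mu_h/(mu_h + theta_h) for p_h:
p_h_min = (R - 1) * (mu_h + theta_h)
print(f"minimum vaccination rate p_h for control: {p_h_min:.4f} per day")
```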

Proof
The equilibrium points satisfy the following relations:

μ_h - (μ_h + p_h + C_vh m i_v/n_h) s_h + θ_h v_h = 0        (1)

p_h s_h - (μ_h + θ_h) v_h = 0        (2)

(C_vh m/n_h) i_v s_h - (μ_h + γ_h + α_h) i_h = 0        (3)

(C_hv i_h/n_h)(1 - i_v) - μ_v i_v = 0        (4)

μ_h - μ_h n_h - α_h i_h = 0        (5)

From equation (4) we have i_v = C_hv i_h/(C_hv i_h + μ_v n_h).

From equation (5) we have n_h = (μ_h - α_h i_h)/μ_h.

From equation (2) we have v_h = p_h s_h/(μ_h + θ_h).

From equation (1), using (2), we have (μ_h + p + C_vh m i_v/n_h) s_h = μ_h, so that s_h = μ_h/(μ_h + p + C_vh m i_v/n_h).

From equation (3) we have (C_vh m/n_h) i_v s_h = (μ_h + γ_h + α_h) i_h. Substituting the expressions for i_v, n_h and s_h into this relation shows that either i_h = 0 or i_h is a root of a quadratic polynomial

Q(i_h) = A i_h² + B i_h + C = 0.        (6)

Since s_h ≥ 0, n_h ≥ 0 and i_h ≥ 0, any admissible root must lie in the interval [0, μ_h/α_h]. Evaluating Q at the end points shows that Q(0) has the sign of (R - 1 - p/μ_h) and that Q(μ_h/α_h) < 0.

When μ_h(R - 1) ≤ p, the value of the polynomial Q(i_h) is negative at the end points of the interval, and therefore there is no root in this interval: E1 is the only equilibrium.

If μ_h(R - 1) > p, then Q(0) > 0, and therefore there exists a unique root in the interval, which implies the existence of a unique endemic equilibrium point E2 = (s_h*, v_h*, i_h*, i_v*, n_h*).

Stability

Theorem 2

For R ≤ (μ_h + p)/μ_h the state E1 is globally asymptotically stable (i.e. lim_{t→∞} i_h(t) = 0).

For R > (μ_h + p)/μ_h the state E2 is locally asymptotically stable.

Proof
1. The point E1:
For E1 the matrix of linearization (Jacobian matrix) is given by J_E1 = [A B; 0 D], with

A = [-(μ_h + p_h), θ_h; p_h, -(μ_h + θ_h)]

and

D = [-(μ_h + γ_h + α_h), C_vh m μ_h/(μ_h + p); C_hv, -μ_v].

The characteristic polynomial of A is given by

P_A(λ) = λ² + (2μ_h + p_h + θ_h)λ + μ_h(μ_h + p_h + θ_h),

whose roots have negative real parts, and the characteristic polynomial of D is given by

P_D(λ) = λ² + (μ_v + μ_h + γ_h + α_h)λ + μ_v(μ_h + γ_h + α_h)(1 - μ_h R/(μ_h + p)).

Thus the eigenvalues of the matrix D are the roots of P_D(λ). So E1 is stable if and only if the coefficients of this polynomial are positive, i.e. if and only if R ≤ (μ_h + p)/μ_h.

2. The point E2 for α_h = 0:

The local stability of E2 is governed by the matrix of linearization (Jacobian matrix) at E2. The equilibrium relations allow its entries to be simplified:

from equation (1), (C_vh m/n_h) i_v = μ_h/s_h - (μ_h + p);

from equation (3), μ_h + γ_h + α_h = (C_vh m i_v s_h)/(n_h i_h);

from equation (4), μ_v = (C_hv i_h)/(n_h i_v) - (C_hv i_h)/n_h.

With these substitutions, the eigenvalues of the matrix J_E2 are -μ_h and the roots of the polynomial q(λ) = λ³ + Aλ² + Bλ + C, whose coefficients satisfy A > 0, B > 0, C > 0 and AB > C. Then, following the Routh-Hurwitz conditions for the polynomial q, the state E2 is locally asymptotically stable for R > (μ_h + p)/μ_h.

Second Epidemic
In the same way as in the previous section, we suppose the onset of a second epidemic with another virus. In this case, we may assume that a proportion of the susceptible population is globally immunized against the four serotypes, or partially immunized against one, two or three viruses. In this model the human population is divided into two categories:

A subpopulation that is infected only once, by serotype 2.

A subpopulation SN_h that is infected twice, first by serotype 1 and then by serotype 2; this subpopulation is derived only from those removed from the first epidemic (serotype 1), who are the individuals exposed to DHF (Sn_h(t_0) = Ss_h(t_0) = r_h*).
Therefore the model is given by the following equations:
Human population:

dS_h/dt = λ_h - (μ_h + p_h + C_vh I_v/N_h) S_h + θ_h V_h

dV_h/dt = p_h S_h - (μ_h + θ_h) V_h

dI_h/dt = (C_vh I_v/N_h) S_h - (μ_h + γ_h + α_h) I_h

dR_h/dt = γ_h I_h - μ_h R_h

dN_h/dt = λ_h - μ_h N_h - α_h I_h

dSS_h/dt = -(μ_h + p_h + C_vh I_v/N_h) SS_h + θ_h SV_h

dSV_h/dt = p_h SS_h - (μ_h + θ_h) SV_h

dSI_h/dt = (C_vh I_v/N_h) SS_h - (μ_h + γ_h + α_h) SI_h

dSR_h/dt = γ_h SI_h - μ_h SR_h

Vector population:
dS_v/dt = μ_v N_v - (μ_v + C_hv(I_h + SI_h)/N_h) S_v

dI_v/dt = (C_hv(I_h + SI_h)/N_h) S_v - μ_v I_v

As in the first epidemic, introducing the proportions

s_h = S_h/(λ_h/μ_h),  v_h = V_h/(λ_h/μ_h),  i_h = I_h/(λ_h/μ_h),  r_h = R_h/(λ_h/μ_h),  n_h = N_h/(λ_h/μ_h),
Ss_h = SS_h/(λ_h/μ_h),  Sv_h = SV_h/(λ_h/μ_h),  Si_h = SI_h/(λ_h/μ_h),  Sr_h = SR_h/(λ_h/μ_h),
s_v = S_v/N_v  and  i_v = I_v/N_v,

and with the conditions r_h = n_h - (s_h + v_h + i_h + Sn_h), where Sn_h(t_0) = Ss_h(t_0) = r_h* and

dSn_h/dt = -μ_h Sn_h - α_h Si_h,

together with Sr_h = Sn_h - (Ss_h + Sv_h + Si_h) and s_v = 1 - i_v, the two previous systems become:

ds_h/dt = μ_h - (μ_h + p_h + C_vh m i_v/n_h) s_h + θ_h v_h

dv_h/dt = p_h s_h - (μ_h + θ_h) v_h

di_h/dt = (C_vh m/n_h) i_v s_h - (μ_h + γ_h + α_h) i_h

di_v/dt = (C_hv(i_h + Si_h)/n_h)(1 - i_v) - μ_v i_v

dn_h/dt = μ_h - μ_h n_h - α_h i_h

dSs_h/dt = -(μ_h + p_h + C_vh m i_v/n_h) Ss_h + θ_h Sv_h

dSv_h/dt = p_h Ss_h - (μ_h + θ_h) Sv_h

dSi_h/dt = (C_vh m/n_h) i_v Ss_h - (μ_h + γ_h + α_h) Si_h

dSn_h/dt = -μ_h Sn_h - α_h Si_h

Equilibrium Points
Theorem 3
The previous system admits two equilibrium points:

If μ_h(R - 1) ≤ p, the trivial state E1 = (μ_h/(μ_h + p), p_h μ_h/((μ_h + p)(μ_h + θ_h)), 0, 0, 1, 0, 0, 0) is the only equilibrium.

If μ_h(R - 1) > p, then an endemic equilibrium E2 = (s_h*, v_h*, i_h*, i_v*, n_h*, 0, 0, 0) will be the equilibrium point, where R = m C_hv C_vh/(μ_v(μ_h + γ_h + α_h)) and p = p_h μ_h/(μ_h + θ_h).

Proof
The proof follows as in Theorem 1.

Stability

Theorem 4

For R ≤ (μ_h + p)/μ_h the state E1 is globally asymptotically stable (i.e. lim_{t→∞} i_h(t) = 0 and lim_{t→∞} Si_h(t) = 0).

For R > (μ_h + p)/μ_h the state E2 is locally asymptotically stable.

Proof
1. The point E1:
For E1 the matrix of linearization (Jacobian matrix) is given by J_E1 = [J11 J12; 0 J22], with J11 = [A B; 0 D], where A and D are the matrices introduced in the proof of Theorem 2 and J22 is the block corresponding to the second-epidemic compartments. Thus the eigenvalues of J_E1 are given by the eigenvalues of J11 together with the eigenvalues of J22. The eigenvalues of J22, namely λ1 = -(μ_h + θ_h), λ2 = -μ_h and λ3 = -(μ_h + γ_h + α_h), are all negative, and according to Theorem 2 the eigenvalues of J11 have negative real parts if and only if R ≤ (μ_h + p)/μ_h. Then E1 is stable if and only if R ≤ (μ_h + p)/μ_h.

2. The point E2:
The local stability of E2 is governed by the matrix of linearization (Jacobian matrix) at E2, which has the same block form J_E2 = [J11 J12; 0 J22]. Its eigenvalues are those of J11 together with those of J22. The eigenvalues of J22, namely λ1 = -(μ_h + p_h + (m C_vh/n_h) i_v*), λ2 = -μ_h and λ3 = -(μ_h + γ_h + α_h), are all negative, and according to Theorem 2 the eigenvalues of J11 have negative real parts if and only if R > (μ_h + p)/μ_h. Then E2 is stable if and only if R > (μ_h + p)/μ_h.

Remark: The inequality R ≤ (μ_h + p)/μ_h represents the principle of herd immunity: the susceptible population may be protected from epidemics if enough people are immunized.

RESULTS AND DISCUSSION


Stability analysis was carried out for the two epidemics and values of the threshold were obtained. An illustration of the dynamics of each epidemic is given in Figures 2-4.

Figure 2 shows the typical behaviour of the solutions, indicating that the proportions of susceptible, infectious and removed individuals approach, asymptotically, the trivial state of the system (the ideal state), i.e. the case where all the population is and will remain healthy (in this case R ≤ (μ_h + p)/μ_h).

Figure 3 illustrates an oscillatory behaviour in the neighbourhood of the endemic equilibrium point. This behaviour can be explained by the fact that if R > (μ_h + p)/μ_h and the initial values of s_h, Ss_h and s_v satisfy the relation R(s_h + Ss_h + s_v) > 1, then (s_h + Ss_h + s_v) decreases and the proportion of infectious individuals (i_h + Si_h + i_v) grows to a maximum; it then decreases because there are not enough susceptibles left to be infected. However, after (i_h + Si_h + i_v) has reached a low value, the proportion of susceptibles starts growing again because of the birth of new susceptibles and, once it is large enough, a new epidemic breaks out, and so forth.

Figure 4 illustrates the benefits of vaccination in the control of the epidemic; a comparison is given for different values of the vaccination rate (p_h = 0, 0.5 and 0.95), although this eventuality remains subject to the advent of a vaccine.

CONCLUSION
By its nature, dengue is a complex disease resulting from the interaction of human, biological, environmental, geographical and socio-economic factors. The present chapter is devoted to understanding the dynamics of dengue and, essentially, its evolution towards the haemorrhagic form. The model considers a variable human population and the succession of two epidemics at different intervals of time.

Figure 2. Convergence: the numbers of infectious humans and infectious vectors in the first and second epidemics, plotted against time, approach the disease-free state

Figure 3. Oscillation: the numbers of infectious humans and infectious vectors in the first and second epidemics oscillate in the neighbourhood of the endemic equilibrium

Figure 4. The role of vaccination in the eradication of the disease in the first and second epidemics: infectious humans for different vaccination rates (p_h = 0, 0.5 and 0.95)

The reproductive number R, as a threshold for the control of the epidemics, is discussed through stability analysis. Simulations with different parameter settings illustrate the succession of the two epidemics and their amplitudes. The model shows that environmental management alone, as a means of vector control, is not sufficient; it can only delay the outbreak of the epidemics. The eventuality of a vaccine protecting simultaneously against the four serotypes remains a hope for the future. Meanwhile, partial vaccination could be part of a preventive strategy based on the control of environmental and socio-economic factors.

REFERENCES
Coleman, P. G., Fèvre, E. M., & Cleaveland, S. (2004). Estimating the public health impact of rabies. Emerging Infectious Diseases, 10(1), 140-142.

Derouich, M., Boutayeb, A., & Twizell, E. H. (2003). A model of dengue fever. BioMedical Engineering OnLine, 2, 4.

Derouich, M., & Boutayeb, A. (2004). Fièvre dengue: Modélisation et simulation. In Proceedings of the International Symposium on Health and Biomedical Research Interaction, Oujda, Morocco, 41-45. ISBN: 99954-0-0973-6.

Derouich, M., & Boutayeb, A. (2006). Dengue fever: Mathematical modelling and computer simulation. Applied Mathematics and Computation, 177, 528-544.

Esteva, L., & Vargas, C. (1998). Analysis of a dengue disease transmission model. Mathematical Biosciences, 150(2), 131-151.

Esteva, L., & Vargas, C. (1999). A model for dengue disease with variable human population. Journal of Mathematical Biology, 38, 220-240.

Esteva, L., & Vargas, C. (2000). Influence of vertical and mechanical transmission on the dynamics of dengue disease. Mathematical Biosciences, 167(1), 51-64.

Esteva, L., & Vargas, C. (2003). Coexistence of different serotypes of dengue virus. Journal of Mathematical Biology, 46, 31-47.

Gubler, D. J. (1997). Dengue and dengue haemorrhagic fever: Its history and resurgence as a global public health problem. In D. J. Gubler & G. Kuno (Eds.), Dengue and dengue haemorrhagic fever (pp. 1-22). New York: CAB International.

Gubler, D. J. (2002). Epidemic dengue/dengue haemorrhagic fever as a public health, social and economic problem in the 21st century. Trends in Microbiology, 10, 100-103.

Meltzer, M. I., Rigau-Perez, J. G., Clark, G. G., Reiter, P., & Gubler, D. J. (1998). Using DALYs to assess the economic impact of dengue in Puerto Rico: 1984-1994. American Journal of Tropical Medicine and Hygiene, 59, 265-271.

Newton, E. A., & Reiter, P. (1992). A model of the transmission of dengue fever with an evaluation of the impact of ultra-low volume (ULV) insecticide applications on dengue epidemics. American Journal of Tropical Medicine and Hygiene, 47, 709-720.

Parkin, D. M., Pisani, P., & Ferlay, J. (1999). Global cancer statistics. CA: A Cancer Journal for Clinicians, 49, 33-64.

Report of the Scientific Working Group on Insect Vectors and Human Health. (2002). Insect vectors and human health (TDR/SWG/VEC/03.1). Geneva: WHO.

Teixeira, M. G., Barreto, M. L., Costa, M. C. N., Ferreira, L. D. A., Vasconcelos, P. F. C., & Cairncross, S. (2002). Dynamics of dengue virus circulation: A silent epidemic in a complex urban area. Tropical Medicine & International Health, 7, 757-762.

WHO. (2003). The World Health Report 2003: Shaping the future. Geneva: WHO.

Key Terms
Adequate Contact Rate: The average number of contacts between a person and a mosquito.
Compartmental Model: A model subdivided into different classes.
Dengue: An acute, infectious tropical disease caused by an arbovirus transmitted by mosquitoes, and characterized by high fever, rash and headache.

Epidemic: A disease (infection) that spreads (in general by transmission) to affect a large part of the population over a wide area.

Immunization: Natural or acquired protection against infection.
Incidence: The frequency with which a disease appears in a particular population or area in a given
period of time.
Simulation: The use of a mathematical model to recreate a situation, or to imagine different scenarios
with various parameters settings.
Stability Analysis: Analysis indicating how a model reacts to perturbations and changes.


Section XIII

Data Processing in
Histopathology


Chapter XLV

Automated Image Analysis


Approaches in Histopathology
Ross Foley
University College Dublin (UCD), Ireland

Laoighse Mulrane
University College Dublin (UCD), Ireland

Matthew DiFranco
University College Dublin (UCD), Ireland

R. William Watson
University College Dublin (UCD), Ireland

Kenneth Bryan
University College Dublin (UCD), Ireland

Pdraig Cunningham
University College Dublin (UCD), Ireland

Elton Rexhepaj
University College Dublin (UCD), Ireland

William M. Gallagher
University College Dublin (UCD), Ireland

abstract
The field of histopathology has encountered a key transition point with the progressive move towards use
of digital slides and automated image analysis approaches. This chapter discusses the various methods
and techniques involved in the automation of image analysis in histopathology. Important concepts and
techniques are explained in the 5 main areas of workflow within image analysis in histopathology: data
acquisition, the digital image, image pre-processing, segmentation, and machine learning. Furthermore,
examples of the application of these concepts and techniques in histopathological research are then
given.

INTRODUCTION
As computing technology advances at a rapid pace worldwide, the impact is starting to be seen in
the field of histopathology. Digital imaging offers a wealth of advantages over traditional microscopy


procedures, including storage of image data for later use, automation of image analysis tasks, and the
application of novel image processing and machine learning techniques.
The development of large-scale image databases opens the door for collaboration amongst remote
laboratories by giving concurrent access to the same data. In clinical pathology, this type of collaboration
can lead to better consensus in image reading, as well as the potential to develop new training techniques.
In systems biology, access to remote image data allows researchers to engage in wide-ranging studies
far beyond the constraints of a single lab.
Automated image analysis approaches can serve as a valuable aid to clinical pathologists and systems biology researchers in the domain of histopathology. The objectives of automation are to analyse
data efficiently, accurately, and reproducibly in high-throughput environments. Appropriate image
processing techniques must be employed so as to extract, from images, as much relevant information
as possible in as short a time as possible.
This chapter is divided into five main sections which describe the key steps in a digital histopathology
environment. The first section deals with data acquisition, including tissue sample preparation, staining,
digital slide management, and a brief introduction to tissue microarrays (TMAs).
Section two provides an overview of digital image content, including grayscale and colour representations, as well as textural features of images. The third section provides a detailed overview of image
pre-processing techniques which are used to prepare image data for analysis. Techniques which are
covered include grayscale transformations, contrast enhancement, smoothing and edge detection.
Segmentation, the fourth section, deals with the critical process of automatically identifying important image regions. Segmentation topics discussed include thresholding, region-based segmentation,
watershed segmentation, template matching, and active contour models.
The fifth and final section, Machine Learning, introduces methods of image data analysis, including
dimensionality reduction and supervised learning techniques.

DATA ACQUISITION
There are a number of crucial steps that take place before a tissue sample can be digitally imaged and
subsequently analysed. Artifacts, i.e. anything that interferes with the examination of the tissue, can
be introduced during each of these steps that can greatly reduce the quality of the image and, therefore,
the accuracy of the image analysis performed later.

Tissue Processing
The tissue sample is first removed during surgery, biopsy or autopsy and placed in a fixative, typically
formalin, to prevent decay. It is then dehydrated by submerging it in ethanol. The sample is then permeated with paraffin and encased in a paraffin block. If the processes above are not carried out correctly,
the paraffin-embedded tissue can become brittle and difficult to work with, which can lead to degraded
image quality.


Tissue Sectioning
The paraffin-embedded tissue is now sectioned into slices of between 2 and 8 μm and placed on a
glass slide. Many problems found in the images produced later are related to this process. Often the
section can warp or fold as it is being cut. Parts of the tissue can also break away from the section. To
reduce the warping and folding effects, the tissue section is floated on warm water after it has been cut
in a process known as mounting. The thickness of the section dramatically affects the final image and
must be considered carefully. If the section is too thick, it will be difficult to automatically identify and
process individual cells due to overlaps. On the other hand, if the section is too thin then many of the
cells will be missing their nuclei, leading to inaccuracies in automated analysis algorithms which rely
on nucleus detection.

Staining/Counterstaining
The slide must be stained so that various interesting features of the tissue can be seen. The most commonly used staining in histopathology is Haematoxylin and Eosin (H&E). Haematoxylin pigments
acidic structures (nucleus) purplish-blue and Eosin pigments basic structures (cytoplasm) pinkish-red.
An H&E stained prostate cancer tissue section is shown in Figure 1 (a). Another important staining
method is immunohistochemical staining. In this method, a specific antibody is developed that binds
selectively to a specific protein in tissue. The site of this antibody binding is visualised using a coloured
marker (Stevens, Lowe & Young, 2002). Various other stainings exist, but a similar set of problems exists for all stains. The slides can be either under- or overstained, resulting in images with poor contrast
and definition. The strength of the staining can change within one slide and between slides, making it
difficult to develop an analysis algorithm that is accurate and transferable between slides. Another glass
slide, i.e. coverslip, is attached so that the tissue section lies between the two. Artifacts such as dust or
air-bubbles can get trapped between the two slides and greatly degrade the final image.

Digital Microscopy/Scanning
The stained slide must now be scanned and stored onto a computer using a digital microscope. The
quality of the digital image produced will depend heavily on the microscope itself. There must be
uniform lighting on the slide in order to produce a usable image. The microscope must also produce a
sharp image suitable for analysis.

Telepathology
The growth of digital microscopy means that there are now an enormous number of histopathological
images available as digital image files. Each of these files is typically 200 to 700 MB in size. These large
file sizes offer their own unique challenges with regard to storage and circulation. The increased bandwidth and reliability of the internet in recent years offers the opportunity to share these images between
pathologists in distant geographical locations quickly and at relatively low cost. Having a larger pool of
pathologists available to remotely examine digital images can mean improved treatment for patients and
also global standardisation of the analysis procedure. Moreover, this vast wealth of image data can be


Figure 1. Various steps in a watershed segmentation algorithm using (a) an H&E stained prostate tissue sample, including (b) a grayscale representation of the original image, (c) contrast enhancement
to reduce noise, (d) edge detection, (e) thresholding to create a binary region map, and (f) the final
watershed segmentation result.

(a) Original Image

(b) Grayscale Transformation

(c) Contrast Enhancement

(d) Edge Detection

(e) Thresholding

(f) Watershed Segmentation

For a color version of the image, please visit http://www.igi-global.com/daskalaki_ch45.


utilised to develop more accurate automated image analysis algorithms if it is stored in databases that
are easily accessible (Weinstein, Bloom & Rozek, 1987, Weinstein et al, 1997, Kayser et al, 2004).

Annotation
In a clinical setting, any algorithm that is created to perform an image analysis task in histopathology
must be trained using images annotated by a pathologist. This makes the task of annotation crucially
important to the process of automated image analysis. However, the annotation process is highly subjective and studies have shown that inter and intra-observer repeatability is a huge problem (Cross,
2001, Deolekar & Morris, 2003). One of the benefits of the development of automated image analysis
approaches is that they can offer objective and consistent results and can, therefore, contribute to further
standardisation of annotation practices.

Tissue Microarrays
Tissue microarrays (TMAs) allow multiple tissue sections to be analysed on the same slide (Kononen
et al, 1998, Brennan et al, 2007, Mulrane et al, 2007). Regions of interest (or sometimes arbitrary regions) are identified in different tissue samples. Circular cores of between 0.6 and 2mm diameter are
then removed from these regions of their original paraffin block and collected as an ordered array in a
new paraffin block, as shown in Figure 2 (a). Up to one thousand cores can be placed in a block, albeit the convention is to use several hundred specimens as a maximum. Such TMAs can be processed as

Figure 2. An example TMA. (a) TMA containing H&E stained rat liver tissue cores. (b) Close up of one
TMA core.


with standard full face sections (see above), and image analysis can be performed on a large number of
cores simultaneously. TMAs have received a lot of attention in recent years as they are well suited to a
high-throughput environment and become extremely powerful when combined with automated image
analysis techniques. Studies involving large numbers of tissue samples can be performed using relatively
few slides. For example, Brennan et al. (2008) used automated image analysis techniques on TMAs to
investigate survivin protein expression as a prognostic indicator of breast cancer.

THE DIGITAL IMAGE


An image is stored digitally as an array or matrix of pixels, each pixel representing light intensity at a
particular point in the image. The quality of the image is determined by the number of pixels used to
store the image, i.e. the length and width of the array, and also the number of quantization levels used
to measure the light intensity. All further analysis is performed on this array of pixels (Sonka, Hlavac
& Boyle, 1999).

Brightness/Grayscale
The simplest representation of an image is its grayscale representation. Each pixel has only one value
associated with it corresponding to the brightness of the light at that point. The image is displayed as
varying shades of gray from black, at the lowest brightness, to white, at the highest. Figure 1 (b) shows
a grayscale representation of an H&E stained prostate tissue section.

Colour
Colour is a property of enormous importance in the analysis of histopathological images. Image analysis
using grayscale alone requires high-intensity staining and good contrast between objects and background.
Different tissue structures often stain in different colours and, therefore, colour information is necessary
in order to analyse these slide stainings correctly. The interpretation of colour is highly subjective and
there are, therefore, numerous different ways to represent colour. Some are generic and open source
while others are commercially owned and controlled. This section focuses on three important, generic,
representations used in histopathology: RGB, CMYK and HSV.

RGB Colour Space


In a standard RGB colour image, each pixel is represented by a 3D vector value corresponding to the
intensity of red, green and blue light at that point. Black is represented by the vector (0, 0, 0) and white by the vector (n, n, n), where n is the maximum light intensity value (usually 255, for 256 quantization levels). The
RGB model is described as an additive model because the three primary colours are added together to
create white. All other colours are represented by a mixture of these three primary colours. The addition
of the primary colours in RGB is shown in Figure 3 (a).


CMYK Colour Space


A colour space that has proved useful in histopathology is the CMYK which represents each pixel in
terms of the secondary colours. The secondary colours are those which comprise two of the primary
colours, namely cyan, magenta and yellow. This colour space is described as a subtractive model because each of these secondary colours, mentioned above, can be used to subtract one of the primary
colours from white light. The K in the model title stands for key or black to signify that subtraction of
all three secondary colours yields black. The subtraction of the secondary colours in CMYK is shown
Figure 3. Different colour models. (a) Red Green Blue (RGB) is an additive colour model. The three
primary colours are added together to form other colours. (b) Cyan Magenta Yellow Key (CMYK) is a
subtractive colour model because the three secondary colours are subtracted from white to form other
colours. (c) Conical representation of the Hue Saturation Value (HSV) colour model. HSV models human perception of colour differences more accurately. As colours approach black on the value (light
intensity) scale, fewer saturation and hue levels are available.

(a) RGB Colour Model (additive)

(b) CMYK Colour Model (subtractive)

(c) HSV Colour Model

For a color version of the image, please visit http://www.igi-global.com/daskalaki_ch45.


in Figure 3 (b). Pham et al. (2007) showed that the yellow channel in the CMYK model offers a large
contrast between immunohistochemical staining and counterstaining intensities and, therefore, a greater
sensitivity when compared with other colour schemes.

HSV Colour Space


The RGB and CMYK colour spaces do not match the human perception of colour differences. Often in
the analysis of histopathology images, the colours are transformed to a different, more intuitive colour
space. The HSV colour space represents each pixel by its Hue its perceived colour, Saturation its
saturation with white light (dark or light colour) and Value the intensity of the light. The hue and
saturation together represent colour in a similar manner to human perception (Sonka, Hlavac & Boyle,
1999). Another advantage of the HSV colour space is that it is highly resistant to illumination changes
(Mete et al, 2007). A conical representation of the HSV colour space can be seen in Figure 3 (c).

Texture
A powerful property that can be used to describe and segment a region is its textural properties. There
is no precise definition of texture due to the abstract nature of the concept but a major characteristic
that is accepted is the repetition of a group or a number of groups of pixels throughout a region. These
groups of pixels are referred to as texture primitives or texels. The texture of a region depends on the
number and size of these texels (Haralick, 1979). Texture is generally described in three ways: statistically, structurally and spectrally (Gonzalez and Woods, 1992). We usually describe texture as fine, coarse, rough, smooth, etc. These are statistical descriptions but are too ambiguous and are, therefore,
not sufficient for the purpose of segmentation. Structural and spectral properties are used to improve
the textural description. Structural properties deal with the arrangement of and relationships between
texture elements. Spectral properties deal with intensity and colour properties of the texture elements.
The focus in histopathology has been on structural texture properties, which have been used to locate
diagnostically useful areas (Hamilton et al, 1997), distinguish between normal and cancerous tissue
(Esgiar et al, 1998), and classify various grades of tumour (Gilles et al, 1994).

IMAGE PRE-PROCESSING
Image pre-processing encompasses a range of techniques, performed at the pixel level, used to facilitate
more accurate further analysis of the image. It should be noted that these techniques do not increase
the amount of information in the image but reduce it. However, they are used to remove redundant information from the image and enhance image features that are interesting. Image pre-processing can,
therefore, be vital to the image analysis process.

Gray-Scale Transformations
Frequently in histopathology, the intensity of the staining is critical and the colour is supplementary.
In these cases, a grayscale transformation is performed which maps the colour image onto a grayscale
equivalent (Figure 1 (b)). This reduces the complexity of the analysis from three dimensions to one.


Different transformation functions can be used so that, for example, the negative grayscale image is
produced. One transformation function that is widely used in histopathology is histogram equalisation.
Histogram equalisation allows interesting ranges of intensities to be enhanced while other, less interesting, ranges are impaired. The use of histogram equalisation for contrast enhancement is shown in Figure
1 (c). Boudraa et al. (2000) used histogram equalization to increase the contrast between cerebrospinal
fluid and multiple sclerosis lesions. Grayscale transformations can reveal morphological details within
the tissue that may not be visible in the original image (Levenson & Hoyt, 2000).
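To make the grayscale transformation and histogram equalisation steps concrete, the sketch below converts an RGB image to grayscale and equalises its histogram with scikit-image. It is a minimal illustration rather than a prescribed pipeline; the use of the bundled skimage.data.immunohistochemistry() sample as a stand-in tissue image is an assumption.

```python
# Sketch: grayscale transformation followed by histogram equalisation (scikit-image).
# The bundled immunohistochemistry sample image is used as a stand-in for a
# digitised tissue slide; in practice the image would come from a slide scanner.
import numpy as np
from skimage import data, color, exposure

rgb = data.immunohistochemistry()          # RGB tissue image (H x W x 3)
gray = color.rgb2gray(rgb)                 # grayscale transformation, values in [0, 1]

equalised = exposure.equalize_hist(gray)   # global histogram equalisation
negative = 1.0 - gray                      # example of another grayscale transform

print("original contrast (std):", gray.std())
print("equalised contrast (std):", equalised.std())
```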

Smoothing Filters
Any image taken using a digital microscope will contain some level of noise. A smoothing filter is used
to reduce this noise. In its simplest form, this is achieved by assigning a new value to each pixel, the average of the brightness values of all the pixels within a defined distance of the original pixel. The original
pixel is sometimes given a higher weight in this averaging process so that the weighting better approximates a Gaussian kernel. One of the drawbacks of this kind of smoothing is that it causes a blurring
of edges and sharp objects. One way of avoiding this blurring is to apply a median filter which operates
in much the same way, except that the pixel is assigned the median value of the pixels in its neighbourhood (Tyan, 1981).
Jadhav et al. (2006) used median filtering as a pre-processing technique in the analysis of precancerous
lesions. Other, more intelligent smoothing algorithms exist which combine edge detection (discussed
later) with filtering to prevent edges from becoming blurred.
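A minimal sketch of the two smoothing approaches described above, using SciPy, is given below: a Gaussian-weighted averaging filter and a median filter applied to a synthetically noised image. The noise level, sigma and window size are illustrative assumptions, not recommended settings.

```python
# Sketch: noise reduction with a Gaussian-weighted averaging filter and a median
# filter. The noise variance, sigma and window size are illustrative choices.
import numpy as np
from scipy import ndimage
from skimage import data, color, util

gray = color.rgb2gray(data.immunohistochemistry())
noisy = util.random_noise(gray, mode='gaussian', var=0.01)

smoothed_gaussian = ndimage.gaussian_filter(noisy, sigma=1.5)  # blurs edges slightly
smoothed_median = ndimage.median_filter(noisy, size=3)         # better preserves edges

print("residual noise std before:", (noisy - gray).std())
print("after Gaussian filter:    ", (smoothed_gaussian - gray).std())
print("after median filter:      ", (smoothed_median - gray).std())
```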

Position-Dependent Brightness Correction


A degrading effect that is often seen in histopathology is uneven illumination, especially at the peripheries: this is called vignetting. Vignetting should be corrected before any analysis is performed on the
image. One solution is to image an empty space and use the illumination in this image as a correction
filter. However, a more desirable correction would not require the second white space image. Leong,
Brady and McGee (2003) proposed a correction technique that uses illumination features available in
the original images. A Gaussian smoothing filter is used to blur the image so that only the brightness
of local regions in the image remains. This filtered version of the image can then be used as the correction filter.
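The sketch below implements the idea just described in a hedged, simplified form: a heavily blurred copy of the image is taken as an estimate of the slowly varying illumination, and the image is divided by it. The sigma value is an illustrative assumption, not a value given by the cited work.

```python
# Sketch of the illumination (vignetting) correction idea: a heavily blurred copy
# estimates the slowly varying background brightness, and the original image is
# divided by this estimate. sigma=50 is an illustrative assumption.
import numpy as np
from scipy import ndimage
from skimage import data, color

gray = color.rgb2gray(data.immunohistochemistry())

illumination = ndimage.gaussian_filter(gray, sigma=50)   # keeps only local brightness
illumination = np.clip(illumination, 1e-6, None)         # avoid division by zero
corrected = gray / illumination                          # flat-field style correction
corrected = corrected / corrected.max()                  # rescale to [0, 1]
```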

Edge Detection
One of the major challenges in the analysis of histopathology images is the segmentation of nuclei.
Once the nuclei have been segmented properly, various attributes of the nuclei and cells can be quantified and utilized in a more in-depth analysis. An edge, from an image analysis point of view, is defined as
pixels where the brightness changes abruptly (Sonka, Hlavac & Boyle, 1999). This is generally the case
between nuclei and cytoplasm, where the staining will abruptly change from dark blue to pink. Changes
are described in calculus using derivatives. Using calculus, we can find the pixels where the most abrupt
changes occur and define these as the edges. At these pixels, the first derivative will be a maximum
(as these are the points of maximum change in brightness) and the second derivative will, therefore, be
zero. In practice, it is easier and more precise to measure a zero crossing than a maximum so the second
derivative is used (Marr & Hildreth, 1980). The use of a derivative-based edge detection algorithm to


discover nuclei borders is shown in Figure 1 (d). Keenan et al. (2000) used edge detection techniques in
the segmentation of nuclei for the purposes of grading cervical intraepithelial neoplasia (CIN).
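As a concrete, hedged illustration of the second-derivative approach described above, the following sketch computes a Laplacian of Gaussian response and marks zero crossings as edges, in the spirit of Marr and Hildreth. The sigma scale and the simple zero-crossing test are illustrative assumptions.

```python
# Sketch: Marr-Hildreth style edge detection, i.e. zero crossings of the second
# derivative (Laplacian of Gaussian). sigma controls the scale (assumption).
import numpy as np
from scipy import ndimage
from skimage import data, color

gray = color.rgb2gray(data.immunohistochemistry())

log = ndimage.gaussian_laplace(gray, sigma=2.0)   # second-derivative response

# A pixel is marked as an edge when the LoG response changes sign between
# horizontally or vertically adjacent pixels (a simple zero-crossing test).
sign = log > 0
zero_cross = np.zeros_like(sign)
zero_cross[:-1, :] |= sign[:-1, :] != sign[1:, :]
zero_cross[:, :-1] |= sign[:, :-1] != sign[:, 1:]
print("edge pixels found:", int(zero_cross.sum()))
```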

SEGMENTATION
In practice, treating an image as a large matrix of pixels is not very useful. Therefore, it is desirable to
employ segmentation, the name given to the process performed by a group of techniques which transforms
this pixel level representation of the image into larger objects which better reflect the actual content of
the image. In histopathology, a segmented image could contain cell cytoplasm objects, cell nuclei objects, white space objects, vacuole objects, etc. In such complicated scenes, complete segmentation into
the real-world objects contained in the image is generally not achievable by one segmentation process.
A series of partial segmentations and post-processing is used to achieve complete segmentation. The
choice of segmentation method(s) depends on the properties of the image.

Thresholding
A fundamental and extremely useful technique which is widely used in the segmentation of histopathological images is thresholding. The simplest form is grayscale thresholding. This assumes that an object can be distinguished from its background based on grayscale values. A threshold
is set on the grayscale and each pixel is assigned to one of two classes depending on whether its grayscale
value is above or below this threshold. Figure 1 (e) shows the results of grayscale thresholding on an
H&E stained prostate image for the purposes of nuclear segmentation. Grayscale thresholding was used
effectively by Goldlust et al. (1996), Beier and Fahimi (1992), and Francis et al. (2000). Thresholding
can also be performed using the different colour channels or using mathematical combinations of the
colour channels, for example, the red channel divided by the blue channel. Colour thresholding is useful in histopathology as the nature of H&E staining means that nuclei are heavily expressed in the blue
channel and cytoplasm in the red channel. The feature and the threshold value are chosen based on the
spectral (colour) properties of the regions being segmented. Colour thresholding techniques were used
by Shapiro et al. (1990), Lehr et al. (1997) and Lehr et al. (1999). Thresholding is easy to implement and
offers good results when two regions do not overlap. However, the threshold requires calibration which
can affect an analysis algorithm when transferred to other images. Moreover, there are usually at least
some overlapping nuclei in histopathological images. These cannot be segmented using thresholding
so more complicated post-processing techniques must be employed.
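The short sketch below illustrates both forms of thresholding mentioned above: an automatically chosen (Otsu) grayscale threshold, and a colour-channel ratio criterion motivated by the observation that H&E-stained nuclei are strongly expressed in the blue channel. The particular ratio cut-off of 1.05 is an assumption for illustration.

```python
# Sketch: grayscale thresholding with an automatically chosen (Otsu) threshold,
# and a simple blue/red channel ratio criterion. The 1.05 cut-off is an assumption.
import numpy as np
from skimage import data, color
from skimage.filters import threshold_otsu

rgb = data.immunohistochemistry().astype(float)
gray = color.rgb2gray(rgb)

t = threshold_otsu(gray)
dark_objects = gray < t                     # binary map: dark structures vs background

blue_over_red = rgb[..., 2] / (rgb[..., 0] + 1e-6)
nuclei_like = blue_over_red > 1.05          # pixels where blue dominates red
print("Otsu threshold:", round(t, 3), " candidate nuclei pixels:", int(nuclei_like.sum()))
```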

Region-Based Segmentation
Region-based segmentation methods are extremely important in image processing. They are used to
construct regions directly rather than by detecting edges and boundaries. Regions can be created with
very specific properties. Some of the most common region-based segmentation techniques are described
in this section.


Region Splitting
Quadtree segmentation is a partial segmentation technique suited to heterogeneous images where some
areas of the image contain much more detail and information than others. Many types of histopathological images have this property as large areas of the tissue sample can be white space while others can
contain minute sub-cellular details. Quadtree segmentation is a top down (or splitting) technique that
begins with large square regions, as few as is possible to cover the entire image, which are continuously split into their four constituent square regions of equal size until some condition of homogeneity
is satisfied for all regions. This has the effect of segmenting the image into large square regions in areas
of low detail and very small square regions in areas of high detail. The homogeneity condition can be
based on grayscale intensity, colour, texture, shape, some higher level semantic model, etc. (Haralick
and Shapiro, 1985, Zamperoni, 1986, Grimson and Lozano-Perez, 1987, Pal and Pal, 1987, Adams and
Bischof, 1994, Chang and Li, 1994, Chang and Li, 1995, Kurita, 1995, Baraldi and Parmiggiani, 1996).
The properties chosen for this homogeneity condition will affect the nature of the regions produced.
For example, an image that is segmented based on a shape condition such as circularity will contain

Figure 4. Quadtree segmentation of an H&E stained prostate tissue section. Homogeneous regions are
segmented into large squares whereas more detailed regions are broken up into smaller squares.


very different regions from the same image segmented based on some intensity condition. The result
of a quadtree segmentation on an H&E stained prostate cancer tissue sample is shown in Figure 4. A
relatively relaxed grayscale intensity homogeneity condition was used so quite large variance in intensity
is permitted within each region.
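A hedged sketch of this splitting scheme is given below: the image is recursively divided into four quadrants until each block satisfies a grayscale homogeneity condition based on its standard deviation. The tolerance and minimum block size are illustrative assumptions.

```python
# Sketch: quadtree (region-splitting) segmentation with a grayscale homogeneity
# condition based on the block standard deviation. Tolerance and minimum block
# size are illustrative assumptions.
import numpy as np
from skimage import data, color

def quadtree(image, x0, y0, h, w, tol=0.08, min_size=8, blocks=None):
    """Recursively split image[y0:y0+h, x0:x0+w] until each block is homogeneous."""
    if blocks is None:
        blocks = []
    block = image[y0:y0 + h, x0:x0 + w]
    if (block.std() <= tol) or (h <= min_size) or (w <= min_size):
        blocks.append((x0, y0, w, h))            # homogeneous (or minimal) block
    else:
        h2, w2 = h // 2, w // 2                  # split into four quadrants
        for dy, dx, hh, ww in [(0, 0, h2, w2), (0, w2, h2, w - w2),
                               (h2, 0, h - h2, w2), (h2, w2, h - h2, w - w2)]:
            quadtree(image, x0 + dx, y0 + dy, hh, ww, tol, min_size, blocks)
    return blocks

gray = color.rgb2gray(data.immunohistochemistry())
regions = quadtree(gray, 0, 0, gray.shape[0], gray.shape[1])
print("number of quadtree blocks:", len(regions))
```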

Region Merging
The simplest region merging (or growing) technique treats every pixel as an individual region to begin
with. Regions are then repeatedly merged together based, once again, on some homogeneity condition. This process yields very good results as the regions left when the process has completed are not
rigid squares but unbounded organic shapes that better represent the form of the objects within the
image. However, this technique is highly computationally expensive as often there are a large number
of competing regions that satisfy the merging condition with a large number of other regions. A huge
number of decisions must be made on which regions to merge. Another possible drawback of this merging technique is that it will yield different objects when it is carried out a number of times. The final
segmentation depends on where in the image the initial merging operation takes place. Ong, Giam and
Sinniah (1993) used a region growing technique in the detection of membrane structures in kidney electron micrographs. Seeded region growing is a directed region growing technique which is very effective
but usually requires some prior knowledge concerning the properties of the regions being segmented. It
relies on picking suitable seed pixels and growing regions from these seeds. tenKate et al. (1993) used
seeded region growing for counting mitoses in breast cancer tissue sections. It was known that mitoses
were dark regions, so dark pixels were selected as seed pixels for the region growing process.
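In the spirit of that mitosis-counting example, the sketch below grows a region from a dark seed pixel using a simple 4-connected flood with an intensity tolerance. The choice of seed (the darkest pixel) and the tolerance are assumptions made purely for illustration.

```python
# Sketch: seeded region growing. A dark pixel is chosen as the seed and the region
# is grown to 4-connected neighbours whose intensity stays close to the seed value.
import numpy as np
from collections import deque
from skimage import data, color

def grow_region(image, seed, tol=0.08):
    """Return a boolean mask of pixels reachable from `seed` within `tol` of its value."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_value = image[seed]
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx] \
                    and abs(image[ny, nx] - seed_value) <= tol:
                mask[ny, nx] = True
                queue.append((ny, nx))
    return mask

gray = color.rgb2gray(data.immunohistochemistry())
seed = tuple(np.unravel_index(np.argmin(gray), gray.shape))   # darkest pixel as seed
region = grow_region(gray, seed)
print("region grown from seed", seed, "contains", int(region.sum()), "pixels")
```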

Split-and-Merge Techniques
An approach that combines both merging and quadtree splitting can yield results with the advantages
of both techniques (Horowitz and Pavlidis, 1974, Pavlidis, 1977). A compromise initial segmentation
can be used that lies somewhere between complete segmentation to the pixel level in the merging
technique and very little segmentation in the splitting technique. The regions can then be continuously
split or merged until some homogeneity condition is maintained for all regions (Chen et al. 1991). This
combinational technique saves an enormous amount of processing power and delivers excellent segmentation results.

Watershed Segmentation
In watershed segmentation, the image is treated as a topographical image where each pixel's grayscale
value is analogous to altitude in a geographical image. If we consider the image in this way, the process
of watershed segmentation is the equivalent of flooding the image iteratively up the grayscale. New local
minima, or catchment basins (think of them as valleys on a geographical map), are marked when the rising
grayscale level reaches their intensity values. Each new catchment basin is deemed to be a new region.
As the grayscale level rises, these regions are gradually grown outwards from these catchment basins. As
the grayscale level continues to rise, lines along which two regions merge are called watersheds (think of
these as hills being submerged) and these mark the boundaries between different regions (Vincent and
Soille, 1991). This technique has the advantage of being able to separate between overlapping regions


with similar grayscale values. It can be used to successfully segment overlapping nuclei from each other
in histological images. Watershed segmentation requires a high level of computational power and has
become more popular in recent years with the advances in computer processing technology. Law et al.
(2003) successfully used this algorithm to segment immunohistochemically stained tissue sections. The
nuclear segmentation resulting from a watershed algorithm is shown in Figure 1 (f).
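One common way of realising the flooding idea described above for touching nuclei is a marker-based watershed on the distance transform of a thresholded image. The sketch below assumes a recent scikit-image (where watershed lives in skimage.segmentation); the Otsu threshold and the peak footprint size are illustrative choices rather than prescribed values.

```python
# Sketch: separating touching dark objects (e.g. nuclei) with a marker-based
# watershed on the distance transform of a thresholded image.
import numpy as np
from scipy import ndimage
from skimage import data, color
from skimage.filters import threshold_otsu
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

gray = color.rgb2gray(data.immunohistochemistry())
mask = gray < threshold_otsu(gray)                    # binary map of dark objects

distance = ndimage.distance_transform_edt(mask)       # "altitude" for the flooding
peaks = peak_local_max(distance, footprint=np.ones((15, 15)), labels=mask)
markers = np.zeros(distance.shape, dtype=int)
markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)  # one catchment basin per peak

labels = watershed(-distance, markers, mask=mask)     # flood outwards from the markers
print("objects found:", labels.max())
```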

Template Matching
Template matching is a segmentation technique that gives a measure of how similar two images are,
also called the registration of two images with each other. This is useful in cases where a large image
contains a smaller image or a repeating pattern of a smaller image. Histopathological images contain
many repeating cellular and sub-cellular structures which can be identified in this manner. The smaller
image is used as a template which is passed over the larger image and at each point a correlation function is performed based on some matching criteria. Regions where the result of this function is high are
deemed to match the template image. This method of matching has not been heavily used in histology
up to now. This may be due to the large variation in shape and size that cellular structures can have.
Choosing general templates is extremely difficult (Loukas and Linney, 2003).
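A minimal sketch of template matching by normalised cross-correlation is shown below. For illustration only, the template is simply a small patch cut from the image itself; in practice it would be a representative cellular structure chosen in advance, and the similarity cut-off of 0.8 is an assumption.

```python
# Sketch: template matching by normalised cross-correlation (scikit-image).
import numpy as np
from skimage import data, color
from skimage.feature import match_template

gray = color.rgb2gray(data.immunohistochemistry())
template = gray[100:140, 100:140]                        # assumed 40x40 patch

response = match_template(gray, template, pad_input=True)   # correlation map in [-1, 1]
matches = np.argwhere(response > 0.8)                        # assumed similarity cut-off
print("locations matching the template:", len(matches))
```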

Active Contour Models - Snakes


A group of template matching methods that have received attention in histopathology are the active
contour models or snakes (Kass, Witkin & Terzopoulos, 1987). An active contour (snake) can be considered as an adaptive spline that warps so that it matches regions within the image with predefined
desired attributes. It does this using an energy minimization technique. The snake's energy depends on its shape and location within the image. Each desired property has an energy function that contributes to the overall energy of the snake. Local minima of the snake's energy correspond to desired properties of the region to be matched. Fok et al. (1996) used active contour models to segment nerve cells. It
should be noted that snakes are highly affected by their initial position so care should be taken in their
initialization.
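The hedged sketch below relaxes a circular initial contour towards image edges with scikit-image's active_contour function. The initial circle, the smoothing sigma and the alpha/beta/gamma weights are illustrative assumptions, and, as noted above, the result is sensitive to this initialisation.

```python
# Sketch: an active contour (snake) initialised as a circle and relaxed towards
# image edges. Initial circle and energy weights are illustrative assumptions.
import numpy as np
from skimage import data, color, filters
from skimage.segmentation import active_contour

gray = color.rgb2gray(data.immunohistochemistry())
smoothed = filters.gaussian(gray, sigma=3)

theta = np.linspace(0, 2 * np.pi, 200)
init = np.column_stack([256 + 60 * np.sin(theta), 256 + 60 * np.cos(theta)])  # (row, col)

snake = active_contour(smoothed, init, alpha=0.015, beta=10, gamma=0.001)
print("snake fitted with", len(snake), "contour points")
```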

MACHINE LEARNING
Machine learning is an area of computer science concerned with the organization or classification of
data based on a number of features extracted from the data. Machine learning techniques are widely
used in histopathology. The goal of image analysis is often to gain some understanding of the image
as a whole. A major step towards achieving this image understanding is the recognition of individual
objects within the image. Machine learning approaches aim to classify different segmented regions
as different cellular and tissue objects. In order to classify or organize the data a measure of similarity between regions must be computable. Each region is represented by the features extracted from it.
These features can be colour, texture, shape, etc. The values of these features are stored in a vector for
each region. If we consider each feature as a dimension in a multi-dimensional graph, or feature space,
then each region's feature vector represents one point in this space. How close two points are to each
other in this space indicates how similar the corresponding regions are to each other. Machine learning


techniques are employed to either cluster the points in the most appropriate fashion or to discriminate
between different classes of points, based on their position in this space. It is clear that the features used
in the machine learning process are crucial and must be selected carefully. Figure 5 (a) & (b) show regions defined by two features, the average grayscale value of the region and the roundness of the region,
plotted as points in a 2D feature space. These regions have a feature vector of size two.

Dimension Reduction
The "curse of dimensionality" is a phrase that is commonly used in machine learning. Hundreds, if
not thousands of different colour, texture, shape and other higher level features can be extracted from
histopathological images using techniques described earlier. When data objects that are the subject of
analysis using machine learning techniques are described by a large number of features (i.e. the data is
high dimension) it is often beneficial to reduce the dimension of the data. Dimension reduction can be
beneficial not only for reasons of computational efficiency but also because it can improve the accuracy
of the analysis. One reason for this is the fact that objects appear more similar to each other in higher
dimensions. Another reason is that often, especially in image analysis, two or more features are highly
correlated with each other and there is no benefit in using all of these features in the machine learning
process. In fact, use of redundant features will usually reduce accuracy as they constitute noise.

Feature Selection
One of the main methods for achieving dimension reduction is to perform a feature selection operation,
for instance by ranking features using information gain, prior to the analysis. Entropy is a measure of
randomness or disorder in a system. Information gain calculates how predictive a feature is for an outcome, and therefore how useful a feature is, by measuring the reduction in the entropy of a data system
after sorting the data on that feature. If a feature is highly predictive of the outcome, then sorting on
that feature will cause a large reduction of the entropy in the system and therefore a large information
gain.
Finding the information gain for each feature separately does not solve the problem of correlated
features. A wrapped filter deals with this by calculating the information gain for a set of features. It
starts with the feature with the highest information gain. It then adds some of the features with the
next highest information gain separately and tests how they perform with the first feature. The best of
these features is chosen. Features are continually added to the filter until the most predictive feature set
is obtained (Witten & Frank, 2005). Jelonek and Stefanowski (1997) used the wrapped filter model to
choose an optimal feature subset for the classification of tumours in the central nervous system.
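To make the ranking step concrete, the sketch below computes an entropy-based information gain for each feature of a small synthetic data set, discretising each feature at its median. The toy data and the median-split discretisation are assumptions used purely for illustration of the idea described above.

```python
# Sketch: ranking features by information gain. Each feature is discretised at
# its median and the reduction in class entropy is computed. Synthetic data only.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    split = feature > np.median(feature)          # simple binary discretisation
    h_before = entropy(labels)
    h_after = sum((part.sum() / len(labels)) * entropy(labels[part])
                  for part in (split, ~split) if part.any())
    return h_before - h_after

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)                       # two classes (e.g. normal / diseased)
X = np.column_stack([y + rng.normal(0, 0.5, 200), # informative feature
                     rng.normal(0, 1, 200)])      # uninformative feature

for i in range(X.shape[1]):
    print(f"feature {i}: information gain = {information_gain(X[:, i], y):.3f}")
```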

Feature Transformation
Another type of dimension reduction is feature transformation. These techniques transform the feature
set to a completely different feature set of lower dimensionality. They do this through a series of matrix operations. Principal Component Analysis (PCA) transforms the feature set into a new feature set
whose dimensions map to variance in the data. The idea is to capture the majority of the variance in
as few of the dimensions as possible. Once this mapping is completed the new dimensions that capture
the least variance can be removed. PCA works well when input features are correlated with each other.


If features are correlated, most of the variance can be captured in fewer dimensions. The resulting features are not correlated. Performing PCA is equivalent to carrying out Singular Value Decomposition
(SVD) on a matrix (Jolliffe, 1986, Duda, Hart & Stork, 2001). However, PCA is not necessarily good
for discrimination in classification. Linear Discriminant Analysis (LDA) seeks to find a transformation
that maximises the between-class variance and minimises the within-class variance. This results in a
feature set of fewer dimensions that discriminates well between the classes (Klecka, 1980). Both LDA
and PCA were used by Decaestecker et al. (1997).
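The short sketch below applies both transformations with scikit-learn to a synthetic set of correlated features, showing PCA capturing most of the variance in two components and LDA producing a class-discriminating projection. The synthetic feature matrix is an assumption standing in for colour, texture and shape features of image regions.

```python
# Sketch: dimension reduction of correlated features with PCA, and class-aware
# reduction with LDA (scikit-learn). Synthetic data are used for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)
base = rng.normal(size=(300, 3)) + y[:, None]          # class-dependent signal
X = np.hstack([base, base * 1.7 + rng.normal(scale=0.1, size=(300, 3))])  # correlated copies

pca = PCA(n_components=2).fit(X)
print("variance captured by 2 components:", pca.explained_variance_ratio_.sum())

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
X_lda = lda.transform(X)                                # 1D projection separating the classes
print("LDA projection shape:", X_lda.shape)
```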

Supervision in Machine Learning


The most fundamental distinction in Machine Learning is that between supervised and unsupervised
techniques. The defining characteristic of supervised learning is the availability of annotated training
data. The name invokes the idea of a supervisor that instructs the learning system on the labels to
associate with training examples. Typically these labels are class labels in classification problems. Supervised Learning algorithms induce models from these training data and these models can be used to
classify other unlabelled data (Cord and Cunningham, 2008). In order to classify unlabelled samples,
the classes of the labeled samples must be separated by some boundary (boundaries) in the feature
space. Unlabelled samples that reside close to this (these) boundary (boundaries) are the most difficult
samples to label. Unsupervised learning lacks external guidance in the form of labels; thus the process
of building a model from the data is more difficult. Often all that can be done is to cluster or organise
the data in some way. Labelled training data is available in the vast majority of histopathological image
analysis problems due to the fact that the objective is usually to automate the classification of already
known tissue structures and pathologies rather than trying to discover novel patterns in the image data.
As a result, supervised techniques dominate in the area so only these are discussed here.

K - Nearest Neighbour
Perhaps the most straightforward classifier available in machine learning and one that is widely used
in image analysis is the K - nearest neighbour classifier (KNN). Classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of
the query. How near a neighbour is depends on its feature values. K denotes the number of nearest
neighbours that are used in the classification process. The neighbours are generally weighted such
that nearer neighbours have more influence on the classification than more distant ones. The distance
between samples is usually taken as the Euclidian distance but other measures such as Manhattan or
Mahalanobis distances can also be used. The nearest neighbour classifier is described as a lazy technique because it requires all of the training samples to be available at run-time. This limits the number
of training samples that can practically be used when classifying unlabelled data, due to processing constraints (Cover and Hart,
1967). Figure 5 (a) shows a 3-nearest neighbour classifying unlabelled regions of tissue as either normal
or diseased based on two features. Esgiar et al. (1998) used a KNN model to distinguish between and
classify normal and cancerous colonic mucosa.
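A hedged sketch of a distance-weighted 3-nearest-neighbour classifier is given below, using two synthetic region features that echo Figure 5 (a), namely a mean grayscale value and a roundness measure. The synthetic data, feature ranges and train/test split are assumptions for illustration only.

```python
# Sketch: distance-weighted 3-nearest-neighbour classification of image regions
# described by two features (mean grayscale value, roundness). Synthetic data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)                                      # 0 = normal, 1 = diseased
X = np.column_stack([0.4 + 0.2 * y + rng.normal(0, 0.07, 400),   # mean grayscale value
                     0.7 - 0.2 * y + rng.normal(0, 0.07, 400)])  # roundness

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_tr, y_tr)
print("held-out accuracy:", knn.score(X_te, y_te))
```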

Support Vector Machines


Support Vector Machines (SVMs) are an extremely powerful group of supervised learning techniques.
SVMs are described as eager techniques because they build a classification model during a training stage prior to run-time. Only the points that lie close to the neighbouring class are used in classification, and these so-called support vectors are selected to define a decision boundary of maximum margin. In other words, the support vectors define a boundary between the classes that achieves optimum separation of the classes (Figure 5 (b)). Only using a small number of points to define this boundary makes SVM algorithms very efficient and prevents the model from overfitting the training data. Another major advantage of SVMs is that they allow the separation of non-linearly separable classes (Figure 5 (b)) through the use of a process known as the kernel trick. The kernel trick projects the data into a higher dimensional space where the classes are linearly separable (Figure 5 (c)) (Burges, 1998; Vapnik, 1995).

Figure 5. Different classifiers employed to classify image regions in a 2D feature space as either normal or diseased. (a) K-nearest neighbour classifier using 3 nearest neighbours to classify unlabelled data. (b) Support Vector Machine (SVM) model. The support vectors in each class are shown as circled points on the dotted black circles and are used to define the maximum margin decision boundary (continuous black circle). The two classes are not linearly separable in this 2D feature space. (c) SVM, using the kernel trick, projects the data into a 3D feature space where the classes are linearly separable by a hyper-plane. This SVM uses a quadratic kernel. The resulting dimensions are a quadratic combination of the 2D ones. [Figure 5 panels: (a) K-Nearest Neighbour; (b) SVM: Maximum Margin; (c) SVM: Projection.]


Doyle et al. (2007) used SVMs to classify the different Gleason grades of prostate cancer while Glotsos
et al. (2005) used them in grade diagnosis of brain tumour astrocytomas.
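
The following minimal sketch (again hypothetical data, scikit-learn assumed) trains an SVM with a quadratic (degree-2 polynomial) kernel on two features whose classes are separated by a circular, i.e. non-linear, boundary, mirroring the situation depicted in Figure 5 (b) and (c).

```python
# Minimal sketch: an SVM with a quadratic kernel separating two classes that are
# not linearly separable in the original 2D feature space. Data are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(300, 2))             # two image features per region
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)   # circular (non-linear) class boundary

svm = SVC(kernel="poly", degree=2, C=1.0)   # the "kernel trick" with a quadratic kernel
svm.fit(X, y)                               # eager: the model is built before run-time

print(svm.n_support_)                         # number of support vectors per class
print(svm.predict([[0.0, 0.1], [0.9, 0.9]]))  # a point inside and one outside the circle
```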

Neural Networks
Artificial Neural Networks (ANNs) were originally introduced as models of biological neuronal networks in which nodes corresponded to neurons and connections between them to synapses (McCulloch
and Pitts, 1943). However they are rarely used for this purpose anymore. They are generally used as
an eager supervised learning technique. All neural networks are based on the fundamental idea of a
node. A node has a variable number of weighted inputs which it combines together with a bias term to
produce one binary output. A feed-forward neural network (FFNN) is made up of cascades of these
nodes that can be trained to perform complex non-linear classification. The weights on the inputs and the
bias are continuously updated for each node throughout the training process so that the overall system
classifies the training data correctly. Furthermore, with the correct architecture, neural networks can be
configured in such a way that they will select their own feature set for classification by self-organisation during the training process (Looney, 1997). It should be noted that ANNs are somewhat prone to
overfitting the training dataset. Karakitsos (1998) used ANNs to classify benign and malignant gastric
cells. Sjostrom, Frydel and Wahlberg (1999) used ANNs to automatically segment cell structures for
the purpose of cell counting.
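
As a minimal sketch of an eagerly trained feed-forward network (not the specific architectures used in the studies cited above), the following example fits a small multi-layer perceptron to hypothetical feature vectors using scikit-learn.

```python
# Minimal sketch: a small feed-forward neural network (multi-layer perceptron)
# trained as an eager, supervised classifier on hypothetical feature vectors.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))                     # 8 image features per cell
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)    # e.g. 0 = benign, 1 = malignant

ffnn = MLPClassifier(hidden_layer_sizes=(10, 5),  # two cascaded layers of nodes
                     activation="relu",
                     max_iter=2000,               # weights and biases updated iteratively
                     random_state=0)
ffnn.fit(X, y)

print(ffnn.score(X, y))   # training accuracy; a held-out set is needed to check overfitting
```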

Ensembles
Condorcet's jury theorem (1785) states that the probability of a group of people making a correct decision is greater than that of an individual. It also states that as the size of the group increases this
probability also increases. However, this theorem is only true if the new members add diversity to the
group. Diversity implies that each member of the group is specialised in some way regarding the subject being deliberated over. This notion of diversity is incorporated into ensemble methods. Ensembles
use the results of a number of different classifiers to classify data. Each classifier is chosen or designed
so that it specialises in correctly classifying one section of the data. The ensemble selects a class for
the unlabelled data based on a majority voting scheme by a group of classifiers. Ensembles exploit the
specialities of individual classifiers while also minimising the effect of the limitations of individual
classifiers on the classification (Sharkey et al, 2000).
Two different methods that are used to achieve diversity in an ensemble are bagging and boosting.
Bagging ensemble techniques use a different random subset of the training examples to train each classifier. As a result, the classifiers specialise in classifying their subset of the training data. Bagging can
be considered as a parallel ensemble technique because all ensemble members can be trained in parallel
(Breiman, 1996). Boosting, on the other hand, can be considered as a serial ensemble technique because
each ensemble member uses the output of the previous member to create the new overall classifier.
Training data samples that were incorrectly classified in the previous ensemble member are weighted
so that they have a greater influence in the new member. In this way, the ensemble members specialise
on the data that was classified incorrectly by the previous members (Freund & Schapire, 1995). Doyle
et al. (2006) used a boosting ensemble in the detection of prostate cancer. Daskalakis et al. (2007) used
a majority voting scheme ensemble to discriminate between benign and malignant thyroid nodules.
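
The following minimal sketch contrasts the two strategies on hypothetical feature data using scikit-learn's default tree-based ensemble members; the data, the number of ensemble members and the base classifiers are illustrative assumptions, not those of the cited studies.

```python
# Minimal sketch: bagging (parallel, random subsets) versus boosting (serial,
# re-weighting misclassified examples). Data and settings are hypothetical.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))              # five features per image region
y = (X[:, 0] * X[:, 1] > 0).astype(int)    # e.g. 0 = benign, 1 = malignant

bagging = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)
boosting = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)

# each ensemble classifies a sample by combining the votes of its members
print(bagging.score(X, y), boosting.score(X, y))
```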


Conclusion
The future of image analysis in histopathology lies firmly in the digital domain. This chapter has discussed some of the most important techniques and given examples of their use in the area up to now.
Many of the earlier stages of analysis such as pre-processing and segmentation techniques have reached
maturity. Although some progress will undoubtedly be made in these areas over the next five years, the
vast majority of research will be focused on the more complicated task of using machine learning to
achieve higher level image understanding. Machine learning is being applied to histopathological image
analysis problems of ever increasing complexity. The high levels of processing power and the advanced
set of processing and analysis techniques available to researchers today make complete and automated
image analysis achievable.

Acknowledgment
WG, ER and LM would like to acknowledge the support of the Marie Curie Transfer of Knowledge
Industry-Academia Partnership Programme, Target-Breast (www.targetbreast.com), the Health Research Board under the auspices of the Programme Grant Breast Cancer Metastasis: Biomarkers and
Functional Mediators, and the EU Integrated Project, InnoMed, under the PredTox component of this
programme (www.innomed-predtox.com). PC, RF, MdiF and KB would like to acknowledge the support of IRCSET and SFI.

References
Adams, R., Bischof, L. (1994), Seeded region growing. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16, 641-647.
Baraldi, A., & Parmiggiani, F. (1996). Single linkage region growing algorithms based on the vector
degree of match. IEEE Transactions on Geoscience and Remote Sensing, 34, 137-148.
Beier, K., & Fahimi, H. D., (1992). Application of automated image analysis for quantitative morphological
studies of peroxisomes in rat liver in conjunction with cytochemical staining with 3,3'-diaminobenzidine
and immunocytochemistry. Microscopy Research and Technique, 21, 271-282.
Boudraa, A., Dehak, S.M.R., Zhu, Y., Pachai, C., Boa, Y., & Grimaud, J. (2000). Automated segmentation of multiple sclerosis lesions in multispectral MR imaging using fuzzy clustering. Computers in
Biology and Medicine, 30(1), 23-40.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
Brennan, D. J., Kelly, C., Rexhepaj, E., Dervan, P. A., Duffy, M. J., & Gallagher, W. M. (2007). Contribution of DNA and tissue microarray technology to the identification and validation of biomarkers and
personalised medicine in breast cancer. Cancer, Genomics and Proteomics, 4, 121-134.


Brennan, D. J., Rexhepaj, E., O'Brien, S. L., Mc Sherry, E., O'Connor, D. P., Fagan, A., Culhane, A. C., Higgins, D. G., Jirström, K., Millikan, R. C., Landberg, G., Duffy, M. J., Hewitt, S. M., & Gallagher, W. M. (2008). Altered cytoplasmic-to-nuclear ratio of survivin is a prognostic indicator in breast cancer,
[in press] Clinical Cancer Research, 14(9).
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and
Knowledge Discovery, 2(2), 121-167.
Chang, Y. L., & Li, X. (1994). Adaptive image region growing. IEEE Transactions on Image Processing, 3, 868-872.
Chang, Y. L., & Li, X. (1995). Fast image region growing. Image and Vision Computing, 13, 559-571.
Chen, S.Y., Lin, W. C., & Chen C. T. (1991). Split-and-merge image segmentation based on localized
feature analysis and statistical tests. Computer Vision, Graphics, and Image Processing: Graphical Models and Image Processing, 22(5), 457-475.
Condorcet, M., (1785). Essay on the application of analysis to the probability of majority decisions.
Cord, M., & Cunningham, P. (2008). Machine learning techniques for multimedia. Berlin Heidelberg: Springer-Verlag.
Cover, T., & Hart, P. (1967). Nearest neighbour pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27.
Cross, S. S., (2001). Observer accuracy in estimating proportions in images: implications for the semiquantitative assessment of staining reactions and a proposal for a new system. Journal of Clinical
Pathology, 54(5), 385-390.
Daskalakis, A., Kostopoulos, S., Spyridonos, P., Glotsos, D., Ravazoula, P., Kardari, M., Kalatzis, I.,
Cavouras, D., & Nikiforidis, G. (2007). Design of a multi-classifier system for discriminating benign
from malignant thyroid nodules using routinely H&E-stained cytological images. Computers in Biology
and Medicine, 38(2), 196-203.
Decaestecker, C., Lopes, B. S., Gardower, L., Camby, I., Cras, P., Martin, J. J., Kiss, R., VandenBerg, S. R., & Salmon, I. (1997). Quantitative chromatin description in Feulgen-stained nuclei as a diagnostic tool to characterize the oligodendroglial and astroglial components in mixed oligo-astrocytomas. Journal
of Neuropathology and Experimental Neurology, 56(4), 391-402.
Deolekar, M., & Morris, J. A. (2003). How accurate are subjective judgements of a continuum? Histopathology, 42(3), 227-232.
Doyle, S., Hwang, M., Shah, K., Madabhushi, A., Feldman, M., & Tomaszeweski, J. (2007). Automated
grading of prostate cancer using architectural and textural image features. 4th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. (pp 1284-1287).
Doyle, S., Madabhushi, A., Feldman, M., & Tomaszeski, J. (2006). A boosting cascade for automated
detection of prostate cancer from digitized histology. Medical Imaging Computing and Computer-Assisted Intervention, 4191, 504-511.


Esgiar, A. N., Naguib, R. N. G., Sharif, B. S., Bennett, M. K., & Murray, A. (1998). Microscopic image analysis for quantitative measurement and feature identification of normal and cancerous colonic
mucosa. IEEE Transactions on Information Technology in Biomedicine, 2, 197-203.
Fok, Y. L., Chan, J. C. K., & Chin, R. T. (1996). Automated analysis of nerve-cell using active contour
models. IEEE Transactions on Medical Imaging, 15, 353-368.
Francis, I. M., Adeyanju, M. O., George, S. S., Junaid, T. A., & Luthra, U. K. (2000). Manual versus
image analysis estimation of PCNA in breast carcinoma. Analytical and Quantitative Cytology and
Histology, 22, 11-16.
Freund, Y., & Schapire, R. E., (1995). A decision-theoretic generalization of on-line learning and an
application to boosting. The Proceedings of the Second European Conference on Computational Learning Theory, Barcelona, ESP.
Gilles, F., Gentile, A., Le Doussal, V., Bertrand, F., & Kahn, E. (1994). Use of texture parameters in the
classification of soft tissue tumours. Analytical and Quantitative Cytology and Histology, 16, 95-100.
Glotsos, D., Spyridonos, P., Cavouras, D., Ravazoula, P., Dadioti, P., Arapantoni, P., & Nikiforidis, G.
(2005). An image-analysis system based on support vector machines for automatic grade diagnosis of
brain-tumour astrocytomas in clinical routine. Medical Informatics and the Internet in Medicine, 30(3), 179-193.
Goldlust, E. J., Paczynski, R. P., He, Y. Y., Hsu, C. Y., & Goldberg, M. P. (1996). Automated measurement of infarct size with scanned images of triphenyltetrazolium chloride-stained rat brains. Stroke, 27,
1657-1662.
Gonzalez, R. C. A., & Woods, R. E. (1992). Digital image processing. Reading, MA: Addison-Wesley.
Grimson, W. E. L., & Lozano-Perez, T. (1987). Localizing overlapping parts by searching the interpretation tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(4), 469-482.
Hamilton, P. W., Bartels, P. H., Thompson, D., Anderson, N. H., Montironi, R., & Sloan, J. M. (1997).
Automated detection of dysplastic fields in colorectal histology using image texture analysis. Journal
of Pathology, 182, 68-75.
Haralick, R. M. (1979). Statistical and structural approaches to texture. Proceedings of the IEEE, 67(5), 786-804.
Haralick, R. M., & Shapiro, L. G. (1985). Image segmentation techniques. Computer Vision, Graphics,
and Image Processing, 29, 100-132.
Horowitz, S. L., & Pavlidis, T. (1974). Picture segmentation by a directed split-and-merge procedure.
Proceedings of the 2nd International Joint Conference on Pattern Recognition, Copenhagen, Denmark.
(pp. 424-433).
Jadhav, A. S., Banerjee, S., Dutta, P. K., Paul, R. R., Pal, M., Banerjee, P., Chaudhuri, K., & Chatterjee,
J. (2006). Quantitative analysis of histopathological features of precancerous lesion and condition using
image processing technique. Proceedings of the 19th IEEE Symposium on Computer-Based Medical
Systems. (pp. 231-236).

Jelonek, J., & Stefanowski, J. (1997). Feature subset selection for classification of histological images.
Artificial Intelligence in Medicine, 9, 227-239.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.
Kass, M., Witkin, A., & Terzopoulos, D. (1987). Snakes: Active contour models. Proceedings of 1st
International Conference on Computer Vision, London, UK. (pp. 259-268).
Kayser, K., Kayser, G., Radziszowski, D., & Oehmann, A. (2004). New developments in digital pathology: From telepathology to virtual pathology lab. Studies in Health Technology and Informatics, 105,
61-69.
Keenan, S. J., Diamond, J., McCluggage, W. G., Bharucha, H., Thompson, D., Bartels, P. H., & Hamilton,
P. W. (2000). An automated machine vision system for the histological grading of cervical intraepithelial
neoplasia (CIN). Journal of Pathology, 192(3), 351-362.
Klecka, W. R. (1980). Discriminant analysis. CA, USA: Sage Publications.
Kurita, T. (1995). An efficient clustering algorithm for region merging. The Institute of Electronics, Information and Communication Engineers Transactions on Information and Systems, E78-D, 1546-1551.
Law, A. K. W., Lam, K. Y., Lam, F. K., Wong, T. K. W., Poon, J. L. S., & Chan, F. H. Y. (2003). Image
analysis system for assessment of immunohistochemically stained proliferative marker (MIB-1) in oesophageal squamous cell carcinoma. Computer Methods and Programs in Biomedicine, 70, 37-45.
Lehr, H. A., Mankoff, D. A., Corwin, D., Santeusanio, G., & Gown, A. M. (1997). Application of Photoshop-based image analysis to quantification of hormone receptor expression in breast cancer. Journal of
Histochemistry and Cytochemistry, 45, 1559-1565.
Lehr, H. A., Van der Loos, C. M., Teeling, P., & Gown, A. M. (1999). Complete chromogen separation and analysis in double immunohistochemical stains using Photoshop-based image analysis. Journal of
Histochemistry and Cytochemistry, 47, 199-225.
Leong, F. J. W., Brady, M., & McGee, J. O. (2003). Correction of uneven illumination (vignetting) in
digital microscopy images. Journal of Clinical Pathology, 56, 619-621.
Levenson, R. M., & Hoyt, C. C. (2000). Spectral imaging and microscopy. American Laboratory,
32(22), 26-33.
Looney, C. G. (1997). Pattern recognition using neural networks. Oxford, UK: Oxford University
Press.
Loukas, C. G., & Linney, A. (2003). A survey of histological image analysis-based assessment of three
major biological factors influencing radiotherapy: Proliferation, hypoxia and vasculature. Computer
Methods and Programs in Biomedicine, 74, 183-199.
Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of The Royal Society, B 207,
187-217.
McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin
of Mathematical Biophysics, 5, 115-133.


Mete, M., Xu, X., Fan, C., & Shafirstein, G. (2007). Automatic delineation of malignancy in histopathological head and neck slides. Biomed Central Bioinformatics, 8, 7.
Mulrane, L., Rexhepaj, E., Smart, V., Callanan, J. J., Orhan, D., Eldem, T., Mally, A., Schroeder, S., Meyer, K., Wendt, M., O'Shea, D., & Gallagher, W. M. (2008). Creation of a digital slide and tissue
microarray resource from a multi-institutional predictive toxicology study in the rat: an initial report
from the PredTox group. Experimental and Toxicologic Pathology, in press.
Ong, S. H., Giam, S. T., & Sinniah, R. (1993). Adaptive window-based tracking for the detection of
membrane structures in kidney electron micrographs. Machine Vision and Applications, 6, 215-223.
Pal, N. R., & Pal, S .K. (1987). Segmentation based on contrast homogeneity measure and region size.
IEEE Transactions on Systems, Man and Cybernetics, 17(5), 857-868.
Pavlidis, T. (1977). Structural pattern recognition. Berlin: Springer Verlag.
Pham, N., Morrison, A., Schwock, J., Aviel-Ronen, S., Iakovlev, V., Tsao, M., Ho, J., & Hedley, D.W.
(2007). Quantitative image analysis of immunohistochemical stains using a CMYK color model. Diagnostic Pathology, 2, 8.
Shapiro, E., Hartanto, V., & Lepor, H. (1990). Quantifying the smooth-muscle content of the prostate using double-immunoenzymatic staining and colour assisted image analysis. Journal of Urology, 147,
1167-1170.
Sharkey, A. J. C., Sharkey, N., Gerecke, U., & Chandroth, G. O. (2000) The test and select approach to ensemble combination. Proceedings of the First International Workshop on Multiple Classifier Systems.
Sjostrom, P. J., Frydel, B. R., & Wahlberg, L. U. (1999). Artificial neural network-aided image analysis
system for cell counting. Cytometry, 36, 18-26.
Sonka, M., Hlavac, V., & Boyle, R. (1999). Image processing, analysis and machine vision (2nd ed.). London, UK: International Thomson Publishing Europe.
Stevens, A., Lowe, J. S., & Young, B. (2002). Wheater's basic histopathology (4th ed.). London, UK:
Churchill Livingstone.
ten Kate, T. K., Belien, J. A. M., Smeulders, A. W. M., & Baak, J. P. A. (1993). Method for counting
mitoses by image processing in Feulgen stained breast cancer sections. Cytometry, 14, 241-250.
Tyan, S. G. (1981). Median filtering: Deterministic properties. In Huang, T. S. (Ed.), Two-Dimensional Digital Signal Processing, 2. Berlin, Germany: Springer Verlag.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vincent, L., & Soille, P. (1991). Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6), 583-598.
Weinstein, R. S., Bhattacharyya, A. K., Graham, A. R., & Davis, J. R. (1997). Telepathology: A ten-year
progress report. Human Pathology, 28(1), 1-7.
Weinstein, R. S., Bloom, K. J., & Rozek, L. S. (1987). Telepathology and the networking of pathology diagnostic services. Archives of Pathology and Laboratory Medicine, 111(7), 646-652.


Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). CA, USA: Morgan Kaufmann Publishers.
Zamperoni, P. (1986). Analysis of some region growing operators for image segmentation. Advances in
Image Processing and Pattern Recognition. North Holland, Amsterdam. (pp 204-208).

Key Terms
ANN: Artificial Neural Network. Any network that is made up of nodes that simulate the function
of neurons, and connections between them that represent synapses, in the brain.
Automated Image Analysis: Encompasses the automation of every step in the work flow of image
analysis in histopathology, from the production of the image, to the high level understanding of the
different objects present in the digital image produced.
CMYK: Cyan, Magenta, Yellow, Key colour space. A subtractive representation of colour based on the three secondary colours, cyan, magenta and yellow, together with a key (black) component. Other colours are formed by subtracting different combinations of these from white.
Data Acquisition: The group of techniques performed in order to achieve a digital image of a tissue
sample, suitable for analysis.
FFNN: Feed Forward Neural Network. A classifier made up of a cascading ANN.
H&E: Haematoxylin and Eosin. A staining technique used in Histopathology that pigments nucleic
structures blue and cytoplasmic structures pink.
Histopathology: The study of disease by microscopically analysing tissue samples.
HSV: Hue, Saturation, Value colour space. A representation of colour that better matches human perception of differences in colour. The intensity (value) is separated from the two terms that we use to define colour: hue (the perceived colour) and saturation (the purity of the colour).
Image Pre-Processing: The group of techniques carried out globally on an image in order to allow
more accurate analysis of the image.
KNN: K - Nearest Neighbour. A classifier that decides on the class of an unlabelled sample based on its k nearest labelled neighbouring samples according to some distance measure (usually Euclidean).
LDA: Linear Discriminant Analysis. A dimension reduction technique used to transform a feature
set into a smaller set of features that best discriminates between the different classes in the data.
Machine Learning: The automation of the classification of different regions in the image as objects.
PCA: Principal Component Analysis. A dimension reduction technique used to transform a feature
set into a new feature set whose features map to the variance in the system. The new features that provide the least amount of variance can subsequently be removed.


RGB: Red, Green, Blue colour space. A representation of colour using the three primary colours,
red, green and blue. All other colours are formed by different combinations of these three.
Segmentation: The process of building regions within an image which better represent the real
world objects present in the image.
SVD: Singular Value Decomposition. The series of matrix operations carried out in order to perform
PCA on a set of features.
SVM: Support Vector Machine. A classifier that builds an optimum decision boundary between
classes based on a subset of labeled samples closest to the boundary. These samples are known as support vectors.
TMA: Tissue Microarray. An array of tissue samples from different sources, collected on one paraffin block.


About the Contributors

Andriani Daskalaki received her PhD degree in dentistry at the Free University of Berlin. She
received an MS degree in bioinformatics. Her research areas are systems biology, PDT and laser applications in dentistry. She is the editor of the Handbook of Research on Systems Biology Applications in
Medicine.
***
Ines Abdeljaoued-Tej is a mathematician whose special field is computer algebra. She did her PhD studies in LIP6 at the Université Pierre et Marie Curie (France) with Annick Valibouze as supervisor. She taught general informatics and mathematics at the Université de Versailles and Université Paris 6 (from 1996 to 2000), has been Assistant Professor at the Université de Carthage (Tunisia) since 2000 and is currently in charge of the Optimization and Operational Research courses at the Ecole Supérieure de la Statistique et de l'Analyse de l'Information. She is part of the research Unit Algorithmic and Structures,
author of Package PrimitiveInvariant in Gap and is co-author of 2 publications.
Julia Adolphs studied physics at the Free University in Berlin. She is currently a PhD student in the
Emmy-Noether group of Dr. Thomas Renger.
Jessica Ahmed was born in Berlin, Germany. She obtained her Master of science in bioinformatics at
the Free University, Berlin in 2006. Since then she has been a PhD student at the Institute of Molecular
Biology and Bioinformatics, Charité-University Medicine Berlin, in the group of Robert Preissner. Her
main fields of research are cancer research and drug development.
Daniela Albrecht was born in 1983. She achieved her diploma in 2006 in bioinformatics at the
Friedrich Schiller University (FSU), Jena. Presently she is working on her PhD at the Leibniz-Institute
for Natural Product Research and Infection Biology Hans-Knoell-Institute (HKI), Jena. Her research
areas are databases and data warehouses, proteomics (data processing and analysis) and integrated
analysis of data from transcriptomic and proteomic level of human-pathogenic fungi.
Roberta Alfieri received her Master of Science in bioinformatics from the University of Milano-Bicocca, Milan, Italy. She is a PhD student in Computer Science and Complex Systems at the School
of Advanced Studies, University of Camerino, Macerata, Italy. Her research activities in the field of
systems biology concern the development of databases and tools for the mining and the integration
of cell cycle related information. She also works on the mathematical modelling of biological systems,
with particular interest in the mammalian cell cycle process, using the high performance computing
techniques on distributed platforms in GRID at Italian National Research Council.
Christos Argyropoulos received his medical degree from the University of Patras, Greece in
1997, and his MSc and PhD degrees in biomedical sciences from the Graduate Program in Medical
Sciences of Patras Medical School in 2003 and 2005 respectively. He is currently a fellow in the Renal
and Electrolyte Division in the University of Pittsburgh. His main research interests lie in the field of
Bioinformatics and especially in the application of Bayesian and maximum entropy methods in the
analysis of microarray datasets.
Pantelis G. Bagos received a BSc in biology in 1997, an MSc in biostatistics in 2002, and a PhD in
bioinformatics in 2005 from the University of Athens, Athens, Greece. Since then, he has been working
as a post-doctoral research fellow at the Department of Cell Biology and Biophysics, Faculty of Biology,
University of Athens, and as visiting assistant professor at the University of Central Greece, Lamia,
Greece, where he is teaching biology and bioinformatics. He was recently elected assistant professor at
the Department of Informatics with Applications in Biomedicine at the University of Central Greece.
His research interests include computational analysis of protein sequences, biological databases, machine
learning algorithms in bioinformatics and genetic epidemiology.
Marc Baumann studied medicine and biochemistry at the University of Zürich, Switzerland,
where he graduated in 1984. Currently, he is the director of the Protein Chemistry/Proteomics Unit at
the University of Helsinki, Faculty of Medicine, Finland. His current research interests are focused on
three main subjects, i) the studies on protein misfolding disorders, e.g. Alzheimer's disease, CADASIL
disease, Prion disorders and other amyloidoses, ii) searching for new protein biomarkers for various
clinical conditions, e.g sepsis with and without organ failure and episodic/long term hazardous alcohol
misuse, iii) development of nano-coated micro-chip based analytical devices for proteome studies and
medical diagnosis.
Slimane Ben Miled is a biomathematician. He has a Master thesis in theoretical physics and a
PhD in mathematics at the Non-Linear Institute of Nice. He was an associate professor at the Federal University of Rio de Janeiro and a research associate at the Third World Academy of Sciences. Now he is an assistant professor at the University of Tunis. He is co-organizer and co-founder of the Tunis Winter School of Biomathematics, open to biologists who want to learn mathematics, and co-responsible for a Master's diploma in Biomathematics for mathematicians who want to learn biology. His research field
is dynamical systems, ecology/evolution.
Alia Benkahla is a bioinformaticist. She did her PhD studies at the IGS-CNRS in Marseille with
Jean-Michel Claverie as PhD supervisor, did her Post-Doc at the Max-Planck Institute in Berlin, and
moved to LIVGM at Institut Pasteur de Tunis 36 months ago. She invested her initial period at IPT
in capacity building in the field of Bioinformatics by: training students; starting a research group; coorganizing international events in Africa; and looking for funds in Bioinformatics for pathogen and
disease vectors. She is co-author of 5 publications, 3 of which are in Nature.


David Benovoy is pursuing his doctoral degree in the field of Human Genetics at McGill University
under the supervision of Dr. Jacek Majewski. He received a BS degree in biochemistry and a Master's
degree in biology, both from the University of Ottawa. His current research interests are in comparative transcriptomics where he is studying variation of alternative splicing at a tissue, population and
species level.
Juergen Beuthan researches in experimental use of laser-induced fluorescences in biomedical optics
(specifically metabolic monitoring on cells). His research also covers diaphanoscopy, optical tomography and tissue optics in optical medical diagnostics. He studied electronics at the Humboldt University
of Berlin and physics at the University of Greifswald. Between 1983 and 1986 he received his doctor's degree and habilitated in experimental physics. As a postdoc he stayed twice at the Moscow Academy of
Sciences to work in the team of Nobel Laureate Professor Basov. He received scientific honours like
the Stauffer Award (USA) and two innovation awards. His scientific achievements have been presented
and manifested in some 150 papers and over 40 patents.
Abdesslam Boutayeb received a doctorate in data analysis from Pau University (France) in 1983,
an MSc and a PhD in numerical analysis from Brunel University (GB). He is currently a professor of applied mathematics at the University Mohamed Ier (Morocco). In 1997, he obtained a Fulbright grant for a three-month visit to the Colorado School of Mines (USA). He has supervised many PhDs and MScs and led research projects. His main area of research is numerical analysis and mathematical modelling
with applications in medicine and biomedical sciences. During the last decade, he published 4 books
and more than 30 papers.
Axel Artur Brakhage was born in 1959. He achieved his diploma in biology in 1985 and his PhD in 1989 in microbiology, molecular biology and biochemistry at the University of Münster. Presently he is
full professor (C4) of microbiology / molecular biology at the FSU, Jena. He is head of the HKI in Jena.
His research areas are molecular microbiology, molecular biotechnology of fungi, virulence of fungi as
well as the interaction of immune effector cells and Aspergillus fumigatus (Cellular Microbiology).
Alexey R. Brazhe graduated in 2003 from the Moscow State University (Russia) and got a PhD in
biology and biophysics in 2006. Since 2006 he is a researcher at Moscow State University and postdoc
at Technical University of Denmark. His research interests are cell biophysics and neurophysiology,
self-organization and mathematical modelling, fractals in biology, data series analysis.
Nadezda A. Brazhe graduated in 2003 from the Moscow State University (Russia) and got a PhD in
biology and biophysics in 2006. Since 2006 she is a researcher at Moscow State University and postdoc
at Technical University of Denmark. Her research interests are cell biophysics and neurophysiology,
including intercellular communications in the nervous system and modulation of erythrocyte properties
under external stimuli. At present time she specializes in the application of interference microscopy to
the cell studies and investigation of nitric oxide role in the nervous and cardiovascular systems.
Kenneth Bryan received an honours degree in microbiology from The University of Dublin, Trinity College, in 2001. After attaining a Graduate diploma in I.T. from Dublin City University in 2002
he returned to Trinity where he completed a PhD in machine learning/bioinformatics, which focused
on the development of novel unsupervised methods for analysing microarray gene expression data. He
is currently employed as a research fellow in the Machine Learning Group at the Complex and Adaptive Systems Laboratory (CASL) in University College Dublin. Current research topics include the 27
developing semi-supervised methods for gene expression data analysis and supervised feature selection
methods for metabolomics data analysis.
Kwang-Hyun Cho received the BS (summa cum laude), MS, and PhD degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 1993, 1995, and 1998,
respectively. He is currently associate professor at the College of Medicine, Seoul National University,
Korea and holds a joint position at the Bio-MAX Institute, Seoul National University, Korea as a Director
of Systems Biology Laboratory. His research interests cover the areas of systems science with bio-medical applications including systems biology, nonlinear dynamics, and discrete event systems. His focus
is on biotechnology and bio-medical science applications, from genetic to cellular systems.
Pádraig Cunningham is professor of knowledge and data engineering in the School of Computer
Science and Informatics at University College Dublin. His current research focus is on the use of machine
learning techniques in processing high-dimension data, particularly bioinformatics and multimedia
data. He has published extensively on the applications of supervised and unsupervised machine learning
techniques. His current research focus is on the application of machine learning techniques in image
analysis and on using dimension reduction and feature selection techniques for biomarker discovery.
Antonis Daskalakis received his diploma in physics from the University of Patras, Greece in 2003
and his MSc in medical physics from the Medical School of the University of Patras in 2005. Since
2005, he is a PhD candidate in medical physics at the University of Patras, Greece. His main research
interests lie in the field of bioinformatics and especially in the image processing and analysis of microarray images.
Ana Katerine De Carvalho Lima Lobato has a BSc and a MSc in chemical engineering from
the Federal University of Rio Grande do Norte, Brazil. Soon after her MSc she started a chemical engineering PhD program researching metabolic engineering, part of which was carried out at the
University of Manchester, School of Chemical Engineering and Analytical Science, UK. Currently, she
is a professor at the Potiguar University which is part of the Laureate International Universities group.
She has experience in the area of chemical engineering focusing on biotechnology, working mainly in
the following fields: biofuel, biosurfactants, antibiotic production and metabolic flux analysis.
Bernard de Bono studied medicine at the University of Malta and has experience working as a medical and surgical intern at St. Luke's Hospital in Malta. He then read for an MPhil in protein engineering at the University of Malta followed by a PhD in biology at the Medical Research Council Laboratory
of Molecular Biology, University of Cambridge. His main research focus is the evolution, signalling
and disease association of immunoglobulin superfamily proteins. He is also directly involved in the
development of software and training methods in integrative physiology, with particular emphasis on
the combination of gene expression with pathway data.


Koussay Dellagi is professor of haematology at the Faculty of Medicine of Tunis. His main research
interests are in immunology of host parasite interactions and genetics of haematological diseases. He
was director general of Institut Pasteur de Tunis (Tunisia) from 1988 until 2005 and head of the Research
Laboratory of Immunology, Vaccinology and Molecular Genetics (1985-2007). Since 2007 he is director
of the Centre de Recherche et de Veille sur les Maladies Emergentes dans l'Océan Indien (CRVOI) at La Réunion (France). Member of the WHO Regional Advisory Committee (EMRO, WHO, Cairo, Egypt)
since 2005. Member of the Scientific and Technical Advisory Committee (STAC) of TDR (2000-2003).
Member of the scientific advisory committee of the WHO-Kobe Centre.
Prerak Desai is currently pursuing his PhD at Utah State University (Logan, UT) with an emphasis
on microbial physiology and cellular microbiology. He has a Master's in food science (Utah State University, Logan, UT) and a BTech in dairy technology (Gujarat Agricultural University, Gujarat, India).
He aspires to have a research based career answering questions related to infectious diseases and food
safety.
Mohamed Derouich is a researcher in biomathematics with special interest in mathematical models applied to diabetes and communicable diseases transmitted by vectors. His MSc dissertation was
presented in 1997 on epidemiological and ecological models in discrete time then he received a PhD in
modelling and simulation in 2001 from the University Mohamed Ier (Oujda-Morocco). He is currently
teaching mathematics at a secondary school and computer sciences as a part time lecturer at the faculty
of Sciences (Errachidia Morocco).
Peter Deuflhard. 1994, Gerhard Damköhler Medal. 2002, Co-initiator DFG Research Center
MATHEON Berlin. 2007, ICIAM Maxwell prize (for fundamental contributions to applied mathematics).
Matthew DiFranco is a PhD candidate in the Machine Learning Group at the School of Computer
Science and Informatics, University College Dublin. He is currently working as an IRCSET Scholar under
Prof. Pádraig Cunningham and Dr. William Watson on a supervised learning approach to automated
Gleason grading of prostatic carcinoma in prostate immunohistochemistry slides. Prior to his arrival
in Dublin, he worked as a research assistant under prof. Peter Hammond at the Biomedical Informatics
Unit of the Eastman Dental Institute, University College London, employing 3D statistical shape modeling techniques to investigate facial dysmorphology of cleft lip and/or palate as part of an $8 million
US National Institutes of Health grant. He graduated in 2004 from University College London with a
Master of Research (MRes) in computer vision, image processing, graphics and simulation.
Cathrin Dressler is a biologist and works at the research company Laser-Medizin-Technologie Berlin,
Germany, since 1996. Her research interests are settled in the field of optical diagnostics/bioanalytics
and are focused on laser tissue interactions, micropreparation of cells and subcellular structures as well
as stress reactions in various animal tissues. She is also involved in investigating and developing nanotechnological tools and devices. In this respect special emphasis deals with bioanalytical applications
of near-field optics and luminescent nanosensors.


Federico Esposti has a Master degree in biomedical engineering with full marks from the Politecnico di Milano in 2005 and at present is PhD student in bioengineering from January 2006. His activity
concerns the analysis of neuronal cultivation signals in cooperation with the Università degli Studi di
Milano and the non linear analysis of fetal heart rate variability signal. From 2006 he is co-lecturer for
the course of electronics bioengineering and bioimages of the Politecnico di Milano. He is author of
several peer reviewed international papers in the field of non linear signal processing (whether biomedical or not) and biomechanics.
Chris T.A. Evelo is head of the BiGCaT Bioinformatics group of the Maastricht University in the
Netherlands. Dr. Evelo has a PhD in molecular toxicology. His research focus is on integration of critical evaluation of data and database quality with powerful analytical methods and creative ways to look
at data and to understand it in the context of what we already know. He applies this systems biology
approach in the fields of nutrigenomics, cardiovascular genomics, toxicogenomics and cancer.
Lloyd Flack is a research assistant at the Institute for Molecular Bioscience. He gained his BSc,
majoring in biology in 1973 and his MStats in 1995. Most of his work has been on the application of
statistics to biological, agricultural and environmental problems. He has been an author on papers and
technical reports. His main interests are in the statistical modelling of biological systems, in classification and clustering methods, and in smoothers.
Ross Foley is a PhD student in the Machine Learning Group at the School of Computer Science
and Informatics, University College Dublin. He is working under Prof. Pádraig Cunningham and Prof.
William Gallagher as part of an IRCSET group scholarship in bioinformatics. He is working as part
of the multi-national PredTox project on machine learning in automated image analysis approaches to
histopathology regarding predictive toxicology. He graduated from UCD with an honours degree in
electronic engineering in 2004. He subsequently worked as a telecommunications engineer for Ericsson in Dublin.
William M. Gallagher is an associate professor of cancer biology within the UCD School of Biomolecular and Biomedical Science, University College Dublin, with his laboratory being located at
the UCD Conway Institute. A major focus of Prof. Gallagher's research work is the identification
and validation of candidate biomarkers of breast cancer and melanoma, with particular emphasis on
translation of transcriptomic and proteomic datasets into clinically relevant assays via the use of tissue
microarrays and associated image analysis approaches. Prof. Gallagher currently co-ordinates an FP6
Marie Curie Transfer of Knowledge Industry-Academia Partnership Programme, Target-Breast (www.targetbreast.com), which involves 3 academic and 2 industrial partners across 3 EU countries (running
from 2006-2010). Prof. Gallagher is also a central participant of the PredTox component of the FP6
Integrated Project, InnoMed (2005- 2009), which aims to harness post-genomic approaches to better
monitor and understand drug-related toxicological effects. He has received a number of awards based on
his research work to date, including the BACR/AstraZeneca Young Scientist Frank Rose Award in 2004
and the St. Luke's Silver Medal Award in 2008. Prof. Gallagher is also on the Scientific Advisory Board
of SlidePath Ltd. Prof. Gallagher originally graduated from the Department of Biochemistry, UCD in
1993 with a 1st Class Joint Honours degree in molecular genetics and biochemistry. Subsequently, he
obtained a PhD in molecular and cellular biology from the Cancer Research UK Beatson Laboratories
in Glasgow. In 1997, he moved to Paris to undertake a Marie Curie Individual Fellowship at Rhône-Poulenc Rorer. Afterwards, he returned to Ireland upon receipt of an Enterprise Ireland Post-Doctoral Fellowship (1999-2000) and, subsequently, a Marie Curie Return Fellowship (2000-2001). In 2001, he
was employed in a permanent capacity as college lecturer at UCD within the former Department of
Pharmacology. In January 2005, he was appointed senior lecturer within the UCD School of Biomolecular and Biomedical Science and was promoted to associate professor of cancer biology in July 2006.
Prof. Gallagher is also a principal investigator at the Conway Institute. In May 2007, he co-founded
OncoMark Ltd., which is a private company centred on the development and application of biomarker
panels and associated technologies, on both tissues and biological fluids. A major focus of the company
is to provide the link from omic level discovery to validation via the use of multiplex antibody-based
assays and high-throughput tissue microarray screening.
Georgi Georgiev has studied physics at Sofia University since 2000. Since the beginning of 2005, he has worked as a young researcher at the Institute of Mechanics and Biomechanics (IMBM) of the Bulgarian
Academy of Sciences. His scientific interests and publication activities are in the field of computer
programming, bioinformatics and computational systems biology.
Peter Ghazal, chair of molecular genetics and biomedicine, head of Division of Pathway Medicine.
Associate director, Centre for Systems Biology in Edinburgh. Professor, The Scripps Research Institute.
Founding director, Scottish Centre for Genomic Technology and Informatics.
Duncan Gillies graduated from Cambridge University with a degree in engineering science in 1971.
He obtained his PhD in the area of artificial intelligence from Queen Mary College. Currently, he is the
professor of biomedical data analysis at Imperial College, London, where his research work has mostly
concerned the application of computers in medicine and biology. In particular he has worked on interactive graphics for simulation of endoscopic procedures, geometric and physical modelling of the upper
human airway, the use of Bayesian inference in visual diagnosis, and statistical analysis of microarray
data. To date he has published over 120 papers, been granted five patents, awarded 24 research grants
and supervised twenty successful research students.
Vanathi Gopalakrishnan is assistant professor of biomedical informatics, intelligent systems and
computational biology at the University of Pittsburgh. She received her doctorate in computer science from
the University of Pittsburgh in 1999. She has been involved in bioinformatics training and curriculum
development since 2000. Dr. Gopalakrishnan is the recipient of a five-year K25 quantitative research
career award from the NIGMS at the National Institutes of Health, USA. In 2006, she was awarded a
Pitt Innovator Award for having been involved in successful licensing of technology developed in her
lab to a commercial biotech company doing biomarker validation research.
Lamia Guizani-Tabbane has a PhD in immunology from the Université Pierre et Marie Curie (France). She is working at the Institut Pasteur de Tunis. Her field of interest is host-pathogen interaction. She
founded a research group studying the alteration of macrophage-signal transduction in response to
pathogens and especially to Leishmania infection.


Reinhard Guthke was born in 1950. He achieved his diploma in 1973 in physics, his PhD in biophysics in 1978 and his PhD (Dr. sc. nat.) in biotechnology in 1988 at the FSU Jena. Presently he is vice
head of the Department Molecular and Applied Microbiology and head of the Systems Biology and
Bioinformatics Group at the HKI Jena. His research areas are bioprocess data analysis (data mining),
design of knowledge based systems, mathematical modeling and process simulation, mathematical
model-based experimental design as well as process optimization and control.
Hendrik Hache is a PhD candidate at the Max Planck Institute of Molecular Genetics. His research
interests are focused on the development and validation of reverse engineering methods of gene regulatory networks. He graduated with a diploma in the faculty of physics at the Humboldt University in
Berlin, Germany.
Michael R Hamblin is associate professor of dermatology at the Wellman Center for Photomedicine.
His research areas include photodynamic inactivation of pathogens, PDT-induced anti-tumor immunity,
low level light therapy for healing and biostimulation. He is author of book chapters and international
peer-reviewed papers in the field of low-light therapy.
Stavros J. Hamodrakas received his BSc from the Physics Department of the University of Athens, Athens, Greece in 1970 and his PhD from the Astbury Department of Biophysics of the University
of Leeds, Leeds, U.K. in 1974. He is currently a full professor at the Department of Cell Biology and
Biophysics, Faculty of Biology, University of Athens, Athens, Greece. He also is the general director
of a post-graduate program in bioinformatics. Research interests include structural and functional
studies of insect chorion (eggshell) and insect cuticle, prediction of protein structure, function and
interactions, relational and object-oriented protein databases, automatic analyses of genomes, study of
fibrous and globular protein structure, function and interactions and studies of structure and self-assembly mechanisms of amyloids.
Julia Hossbach was born in Jena, Germany. She obtained her diploma in biology from the Friedrich-Schiller University Jena in 2003. At the University of Leipzig she was working as a PhD student
until 2006. She joined the Institute of Molecular Biology and Bioinformatics, Charité-University Medicine Berlin in 2006, where she is currently working in the group of Robert Preissner. Her main field of research interest comprises the investigation of bioactive compounds from natural and synthetic
compound libraries.
Jrgen Kleffe received the PhD degree in 1976, from Humboldt-University in Berlin in the field of
mathematical statistics. He received the DSc for outstanding research work on analysis of variance and
variance component estimation. His research in the area of mathematical statistics has been recognized
with support by the American Academy of Sciences and inclusion in the 8th edition of the International
Who's Who of Intellectuals. He is head of the group for Automatic Gene Annotation at the Charité Berlin, which is part of the Berlin Center for Genome Based Bioinformatics. He was a visiting professor at US universities. He worked on applications in various fields including gene prediction and fast
sequence matching, biostatistics, econometric models, growth curve analysis, actuarial mathematics
and missing data analysis. Bioinformatics, improvement of methodology and algorithms for statistical
analysis of DNA sequences and the reliability of genomic sequence assembly constitute his main areas
of interest.



Sophia Kossida obtained her DPhil in 1998 from Oxford University, Merton College in the UK. She
carried out a post-doc at Harvard University, USA at the Molecular & Cellular Biology Department.
She joined Biomedical Research Foundation of the Academy of Athens in July 2004 as tenure track
research bioinformatician, Center of Basic Research II, Biotechnology Division. Her current research
interests are focused on bioinformatics and medical informatics.
Vladimir Kotev is a PhD student at the Institute of Mechanics of the Bulgarian Academy of Sciences. He graduated from the Technical University Sofia, Bulgaria, having completed a Master's
degree in general mechanical engineering. He investigates the qualitative and quantitative behavior of
the nonlinear models describing signaling pathways and gene-regulation systems. His research interests
are in the nonlinear dynamics, system biology and bioinformatics.
Axel Kowald (born 1962) holds a PhD in mathematical biology from the National Institute for Medical
Research, London. He has worked at the University of Manchester, the Institute for Advanced Studies
in Budapest, the Humboldt University Berlin and the Max Planck Institute for Molecular Genetics in
Berlin. His current research interests focus on the mathematical description of biochemical pathways
involved in the aging process and systems biology.
Andrew Kuznetsov obtained his degrees in biochemistry (MSc), microbiology (PhD) and biotechnology (DrSci) in the former Soviet Union, where he was active in gene engineering and transgenesis. Andrew was a postdoc fellow from 2002 to 2005 in molecular biotechnology at Freiburg University, Germany. He has now joined the Institute for Microsystems Engineering at the same university as a scientist. Andrew developed the Synbiology Database while working for the NEST Project (http://www.synthetic-biology.info/).
He was a leader of iGEM2006 Freiburg team. He is the author of 25 papers and he is interested in the
unconventional computation, synthetic biology and evolution by communication.
Tony Kwan obtained his PhD in biochemistry from McGill University in Montreal, Canada under
the supervision of Dr. Philippe Gros on the structural and functional analysis of the mouse Mdr3 P-glycoprotein. He then completed an industrial post-doctoral fellowship in bioinformatics from 2001
to 2003 at Targanta Therapeutics in Montreal, a biotechnology company developing antibacterial
therapies. He remained at Targanta Therapeutics until 2005, before returning to McGill for a 2nd postdoctoral fellowship in the Department of Human Genetics under the direction of Dr. Jacek Majewski.
Dr. Kwan's research is currently focused on the use of microarray technologies for the study of human
genetic variation.
Philippe Lambin is a professor in the Department of Radiation Oncology (MAASTRO) at the
University of Maastricht and division leader at the Research Institute GROW. His main areas of interest
are directed towards translational research in radiation biology with a specific focus on tumor hypoxia,
functional imaging and lung cancer. His most recent research investigates the use of systems biology approaches combined with large databases of biological (geno-proteomic), clinical, imaging and
treatment data coupled with an artificial intelligence system. This approach will allow computer-aided
individualized treatment.


Andrea Maffezzoli obtained a Master's degree in biomedical engineering from the Politecnico di
Milano in 2003 and at present he is a PhD student in Bioengineering from March 2005. His study and
research activity concerns mainly the analysis of data elaboration methods and the implementation of
related software applied on neuronal cultivation signals, with the collaboration of the Università degli Studi of Milan. He is author of some proceedings, book chapters and international peer-reviewed
papers in the field of data elaboration, data mining and informatics applied in medicine.
Jacek Majewski is an assistant professor and Canada research chair at the Department of Human
Genetics, McGill University, Montreal, Canada. His background in physics and biology promoted his
interest in bioinformatics, genomics and large-scale biological data analysis. His current research interests include genome evolution, regulation of gene expression, and alternative splicing.
Georgy V. Maksimov is a professor of biology at the Moscow State University (Russia) and holds
a PhD and Russian DrSci degree in biophysics and human physiology. His main interests lie in the
area of cell biophysics including studies of the membrane and cytoplasm processes in excitable cells
and erythrocytes under normal conditions, laser irradiation, magnetic field and various pathologies,
axo-glial interactions during demyelinization and Lyme disease. He published about 150 papers and 4
books on the cell biophysics and basis of rhythmic excitation.
Evgenia Makrantonaki was born in 1978. She studied at the Aristoteles University Thessaloniki,
and Medical School, Charité University Medicine Berlin. Since 2006 she has been a resident at the Departments of Dermatology and Immunology, Dessau Medical Center. She is an IFMSA, Erasmus and Verein zur Förderung der Dermatologie e.V. scholarship holder, and prize winner of the Hermal prize (2006) and the William J. Cunliffe Scientific prize (2006).
Ferda Mavituna holds a BSc (with distinction) in chemical engineering (Middle East Technical University, Turkey), an MSc in advanced chemical engineering and a PhD in biochemical engineering, both from the Victoria University of Manchester, UK. She is currently a professor of chemical and biochemical engineering in the School of Chemical Engineering and Analytical Science, The University of Manchester, UK. She is the co-author of the Biochemical Engineering and Biotechnology Handbook and co-editor of two books. Her research activities have been in the following areas: plant biotechnology for pharmaceuticals production and somatic embryogenesis, immobilised microbial and plant cell cultures, modelling, metabolic engineering and systems biology.
Elisabeth Maschke-Dutz received her diploma in mathematics and computer science from the Technical University Berlin. She has worked in the bioinformatics department of the German Resource Center for Genomics Research (RZPD). Currently her work at the Max Planck Institute for Molecular Genetics in Berlin focuses on the development of computational methods and tools for systems biology. Her research interests include mathematical modeling and analysis of biological and biomedical systems.
Thomas Meinel studied chemistry at the Technical University in Berlin with a focus on physical chemistry and biochemistry. After several years of laboratory work on protein complexes in the interdisciplinary field of photosynthesis, he joined Martin Vingron's Department of Computational Molecular Biology at the Max Planck Institute for Molecular Genetics in Berlin-Dahlem. He incorporated taxonomic and multiple-alignment features into the SYSTERS web interface, analyzed taxonomic specificities of protein families, and developed the PhyloMatrix phylogenetic profiling tool. He has since joined Hans Lehrach's Department of Vertebrate Genomics at the MPIMG.
Geoff McLachlan is professor of statistics in the Department of Mathematics and a professorial research fellow in the Institute for Molecular Bioscience. He is also a chief investigator in the Australian Research Council (ARC) Centre of Excellence in Bioinformatics. His research has been recognized with various awards, including a DSc in 1994 and an ARC Professorial Fellowship in 2006. His numerous publications include five monographs, the last four as volumes in the Wiley series in statistics. His research in statistics has concentrated on the related fields of classification, machine learning and pattern recognition, and on statistical inference. The focus in the latter field has been on the theory and applications of finite mixture models and on estimation via the EM algorithm. More recently, he has become actively involved in the field of bioinformatics, with a focus on the statistical analysis of microarray gene expression data, on which he has coauthored a Wiley monograph.
Luis Mendoza holds a PhD in biomedical sciences from the Universidad Nacional Autónoma de México (UNAM). He continued his training with a postdoctoral stay at the Karolinska Institute in Stockholm, and later worked as an associate scientist for the biotechnology company Serono in Geneva. Since 2006 he has been a group leader at the Instituto de Investigaciones Biomédicas, UNAM, where he leads the ComBioLab, developing models of regulatory networks in biological systems of interest.
Luciano Milanesi received his PhD in health physics from the University of Milan in 1986. He is currently a researcher at the Institute of Biomedical Technologies of the Italian National Research Council (CNR-ITB). He is the coordinator of the CNR Bioinformatics project, CNR-Bioinformatics. He has been principal investigator for the European projects TRADAT (TRAnscription Database and Analysis Tools), ORIEL (an Online Research Information Environment for the Life Sciences) and EGEE. He is the coordinator of the European BIOINFOGRID project (Bioinformatics Grid Applications for life science) and of LITBIO (Laboratory of Bioinformatics Technologies). He is an editorial board member of the IEEE Transactions on Nanobioscience and Briefings in Bioinformatics. He has published more than 150 refereed publications in journals, books and conference proceedings in the areas of bioinformatics, systems biology and medical informatics.
Olaf Minet's research is in tissue optics, spectroscopy, mathematical modelling and image processing. After a first degree in theoretical physics from the Technical University in Darmstadt, Germany in 1986, he joined the Institute for Medical Physics and Optical Diagnosis at the Charité, Berlin. He conducted doctoral research in the optical diagnosis of rheumatoid arthritis, including advanced image processing.
Alok Mishra is a PhD candidate at Imperial College London, where he is researching techniques to integrate various biological datasets using kernel-based methods. He completed an MSc in computing science at Imperial College, another MSc in artificial intelligence at the University of Edinburgh, and his undergraduate engineering degree at the Indian Institute of Technology, Kharagpur (India). His research is funded by the Imperial College Deputy Rector's Scholarship.

Erik Mosekilde is a professor of physics at the Technical University of Denmark, with complex systems theory and the modelling of biological systems as his main interests. He is also coordinator of BioSim, a European Network of Excellence in "Biosimulation: A New Tool in Drug Development". Erik Mosekilde holds a PhD and a Danish DrSci degree in experimental and theoretical physics. He started to work in systems biology and the mathematical modelling of cellular and physiological systems in 1977 and has published about 220 scientific papers and a number of books on the application of nonlinear dynamics to physical, technical and biological systems.
Charalampos Moschopoulos studied computer science at the University of Patras in Greece, where he obtained his Master's degree in 2006. Currently, he is a PhD student at the Biomedical Research Foundation of the Academy of Athens in cooperation with the University of the Aegean in Greece. His research interests are focused on artificial intelligence techniques in bioinformatics, including machine learning and data mining.
Laoighse Mulrane received a first-class joint honours degree (BSc) in genetics and pharmacology from University College Dublin in 2007. She is currently studying for an MSc in pharmacology in Prof. William Gallagher's lab in the UCD School of Biomolecular and Biomedical Science in Dublin, Ireland. The subject of her thesis is the study of toxicity biomarkers and their validation using tissue microarrays.

Raul Munoz-Hernandez is a chemical engineering PhD student at The University of Manchester; his research is in metabolic engineering and systems biology. Raul started his career in biotechnology after finishing his BSc (chemistry) at ITESM, Mexico. He completed an MSc (food science) at CIAD, in collaboration with Arizona University (USA), and also holds an MBA (honours). His industrial experience includes the pulp and paper industry (quality engineer) and the poultry industry (new products and special projects manager). Currently he is interested in biopharmaceutical innovation management and is working to spin out his venture, InLife Technologies (www.inlifetech.com), from the University.
George Nikiforidis received his Laurea in physics and his MSc in atomic and nuclear physics from the University of Milan, Italy, in 1973 and 1980 respectively, and his PhD in medical physics from the University of Patras, Greece, in 1981. He is currently a professor of medical physics and the director of the Department of Medical Physics, University of Patras, Greece, as well as of the postgraduate course in medical physics at the same institution. He has been the principal investigator of, or involved in, a variety of national and European research and development projects.
Svetoslav Nikolov's research and educational interests are in the fields of mathematical modeling, nonlinear (chaotic) dynamics and bifurcation analysis of systems in cell biology. He received his MS in mechanical engineering from the Technical University of Sofia, Bulgaria, in 1994 and his PhD from the Institute of Mechanics and Biomechanics (IMech), Bulgarian Academy of Sciences (BAS), in 1999. Since 2005 he has been an associate professor at IMech, and since 2004 he has also held a joint position as a lecturer at the Faculty of Biology, University of Sofia, Bulgaria.

Matej Orešič holds a PhD in biophysics from Cornell University. Since 2003 he has led research in the domains of quantitative biology and bioinformatics (http://sysbio.vtt.fi/) at VTT Technical Research Centre of Finland (Espoo, Finland), with the main research areas being metabolomics applications in pharmacology, biomedical research and integrative bioinformatics. Recent investigations include studies of statin-induced myopathy, longitudinal metabolic profiles of children who progressed to type 1 diabetes (DIPP study), and investigations of lipidomic profiles associated with lipotoxicity-induced insulin resistance. Prior to joining VTT, Dr. Orešič was head of computational biology and statistics at BG Medicine, Inc., based in Waltham, Massachusetts, and a bioinformatician at LION Bioscience Research in Cambridge, MA.
Alexey N. Pavlov graduated in 1995 from Saratov State University (Russia). In 1998 he received a PhD in physics and mathematics. Since 2002 he has been an associate professor at Saratov State University. His research interests are the dynamics of living systems and time-series analysis. He is co-author of about 50 papers in peer-reviewed journals.
Valko Petrov is a physicist by training and began work as a young researcher in 1973 at the Institute of Mechanics and Biomechanics (IMBM) of the Bulgarian Academy of Sciences. In 1987 he received his PhD in biomechanics, and he works at IMBM in the field of mathematical modeling of the nonlinear dynamics of biological systems. As head of the section of biodynamics and biorheology at IMBM, he gives regular lectures on this discipline.
Tuan Pham received his PhD in 1995 from the University of New South Wales with a thesis entitled "Fuzzy Finite Element Analysis of Engineering Problems", which was pioneering work in the field and has attracted the attention of researchers in engineering computations. He is an associate professor and has been appointed director of the Bioinformatics Applications Research Centre at James Cook University. His research interests include image processing, molecular and medical image analysis, pattern recognition, bioinformatics, biomedical informatics, fuzzy-set algorithms, genetic algorithms, neural networks, geostatistics, signal processing, fractals and chaos. Tuan Pham is a senior member of the Institute of Electrical and Electronics Engineers (IEEE) and an editorial board member of several journals and book series, including Pattern Recognition, Current Bioinformatics, Recent Patents on Computer Science, and the book series on Bioinformatics and Computational BioImaging.
Robert Preissner was born in Berlin, Germany. He obtained his diploma in biophysics in 1988 from the Humboldt University, Berlin. Until 1990 he worked as a research associate in the Department of Biomathematics at the Academy of Sciences. When the Berlin Wall came down, he joined the Institute of Protein Crystallography at the Free University, Berlin. He received his PhD with a thesis on the relations between sequences and structures of proteins. Since 2007 he has been an assistant professor at the Institute of Molecular Biology and Bioinformatics, Charité University Medicine Berlin. He favours multidisciplinary approaches and his main field is structural bioinformatics.
Axel Rasche studied mathematics at ETH Zurich and the Humboldt University in Berlin. He is currently a PhD student at the Max Planck Institute for Molecular Genetics in Berlin-Dahlem. His research topics are the preprocessing of Affymetrix GeneChip arrays, the statistical analysis of alternative splicing, and data integration with a focus on type 2 diabetes mellitus.

Isabel Reinecke studied mathematics at the University of Hamburg and at the Institut National des Sciences Appliquées (INSA) in Toulouse. From 2004 until 2006 she was a member of the research group Computational Medicine. Since 2006 she has been a member of the research group Computational Drug Design at the Zuse Institute Berlin.
Elton Rexhepaj received an MSc in computer science from the University of Lausanne in 2005. He is presently a PhD student in the group of Prof. William Gallagher at the UCD School of Biomolecular and Biomedical Science, UCD Conway Institute, University College Dublin, funded under the Health Research Programme "Breast Cancer Metastasis: Biomarkers and Functional Mediators". His current research is focused on the development of novel algorithms for the analysis of immunohistochemical data.
George Sakellaropoulos received his diploma in physics from the National and Kapodistrian University of Athens, Greece in 1993, and his MSc and PhD in medical physics from the University of Patras, Greece in 1995 and 2001 respectively. He is currently a lecturer in medical physics at the Department of Medical Physics, University of Patras, Greece. His main research interests lie in the fields of biomedical informatics, statistical learning and decision support systems.
Elizabeth Santiago-Cortés holds a BSc in experimental biology from the Universidad Autónoma Metropolitana, México. She is currently a PhD student working on the development of computational models of the molecular and cellular mechanisms of differentiation in the root of Arabidopsis thaliana. She works at the Instituto de Investigaciones Biomédicas in the group of Luis Mendoza.
Renaud Seigneuric is a post-doctoral fellow in the Department of Radiation Oncology (MAASTRO) at the University of Maastricht in the Netherlands. His background is in physics, and he received a French PhD in health sciences from the University of Grenoble as well as a Canadian PhD in biomedical engineering from the University of Montreal. Renaud Seigneuric investigates the dynamics of complex systems in biology and medicine at different scales, combining mathematical modelling and experimental work.
Nikolaos G. Sgourakis received a BSc degree in biology (2004) and an MSc degree in bioinformatics (2006) from the Faculty of Biology of the University of Athens. Since 2003 he has been involved in the development of bioinformatics tools focusing on the automated prediction of properties of G-protein coupled receptors, under the supervision of Professor Stavros J. Hamodrakas. He is currently a doctoral candidate at Rensselaer Polytechnic Institute (group of Dr. Angel Garcia). His research interests cover the areas of protein folding and dynamics, including the implementation of novel methods for the interpretation of NMR observables based on molecular dynamics simulations.
Maria G. Signorini has been an associate professor at the Biomedical Engineering Department, Politecnico di Milano, since December 2003. From the same university she received a Master's degree in electrical engineering and, in 1995, a PhD in biomedical engineering on the nonlinear analysis and modelling of cardiovascular time series. Her teaching activities are in the fields of electronic bioengineering and biomedical signal and image processing. Since 2004 she has been the coordinator of the PhD degree program in bioengineering, School of Doctoral Programs, Politecnico di Milano. Her main research interests are the nonlinear analysis and modelling of biological signals.

Olga V. Sosnovtseva graduated in 1989 from Saratov State University (Russia). In 1996 she received a PhD in physics and mathematics. Since 2005 she has been an associate professor in biophysics at the Technical University of Denmark. Her research interests are systems biology, nonlinear dynamics and the modelling of biological systems. Most of her work in systems biology is devoted to studies of pressure and flow regulation in the kidney and of neuro-glial and axo-glial interactions. She is co-author of about 60 papers in peer-reviewed journals.
Sree N. Sreenath, Ph.D., is a faculty member in the Electrical Engineering and Computer Science Department at Case Western Reserve University, Cleveland, Ohio, in the United States. His research and educational interests are in complexity research (modeling, structural issues and simulation) focused on systems biology. He applies multilevel hierarchical systems approaches to understand problems in cell signalling implicated in acute myelogenous leukemia (AML), inflammation and prostate cancer, and in the coordination of the heart-brain interaction. He is also director of the Case Complex Systems Biology Center and a recipient of a US NIH Research Career Award for 2004-2009.
Maud Starmans, MSc, is a PhD student in the Department of Radiation Oncology (MAASTRO) at the University of Maastricht. She is working on a project that investigates the use of in vitro and in vivo microarray-derived gene signatures in patient outcome prediction. These gene signatures can provide valuable information on tumor status, prognosis and prediction. This should help individualize treatment and result in better tumor control, and in more rapid and cost-effective research and development.
Heike Stier studied biology at the University of Hohenheim, Stuttgart, where she completed her diploma on the influence of gangliosides on regenerating peripheral nerves in the department of Prof. Dr. H. Rösner. Her PhD thesis, on the influence of Müller glia cells on the axonal pathfinding of retinal ganglion cells during development, was carried out at the NMI Reutlingen with Prof. Dr. B. Schloßhauer, and was followed by postdoctoral research at the NMI Reutlingen and at the University of Utah School of Medicine, USA, in the laboratory of Prof. Dr. S.B. Kater, on calcium imaging of neuronal growth cones (Feodor Lynen fellowship of the Alexander von Humboldt Foundation). She then completed postgraduate studies in bioinformatics at the University of Applied Sciences, Berlin (department of Prof. Dr. I. Koch), with a Master's thesis on the alternative splicing of membrane proteins at the Institut für Molekularbiologie und Bioinformatik (Priv.-Doz. Dr. J. Kleffe). Since April 2007 she has been a scientific assistant at Analyze&Realize AG (a&r), Berlin.
Athina Theodosiou studied biology at the Aristotle University of Thessaloniki in Greece, where she obtained her first degree in 2005. She obtained her Master's degree in bioinformatics from the University of Manchester, UK, in 2006. Currently she is a PhD student at the Biomedical Research Foundation of the Academy of Athens in cooperation with the University of Patras. Her research interests are focused on evolution, proteomics, and bioinformatics.
Arie van Erk has a background in medicine (MD) and informatics (MSc). He is now working as a PhD student in the BiGCaT Bioinformatics group of the University of Maastricht, where he focuses on large-scale data analysis and the exploration of regulatory elements in mRNA transcription.

Natal van Riel is an assistant professor of biomodeling and systems biology in the Department of Biomedical Engineering at Eindhoven University of Technology and principal investigator of Metabolic Systems Biology Eindhoven. His research interests include the mathematical modelling and identification of biological systems.
Paolo Vicini, Ph.D., is an associate professor of bioengineering at the University of Washington, Seattle, and director of the Resource Facility for Population Kinetics, an NIH/NIBIB research resource. He holds a PhD from the Polytechnic of Milan, Italy (1996) and a Laurea degree in electrical engineering from the University of Padova (1992). His research interests focus on mathematical and statistical models of biological systems, in particular regarding the inverse problem, the estimation of biologically meaningful parameters from noisy data, modeling and simulation software development, and issues of model selection. He received the EMBS Early Career Award in 2003.
Antonio Vidal-Puig, MD, PhD, is a reader in human metabolism at Cambridge University, deputy director of the MRC Center for Obesity and Associated Diseases (CORD), and honorary consultant in metabolic medicine at Addenbrooke's Hospital. He received his MD from Valencia University and his PhD from Granada Medical School. Dr. Vidal-Puig did his postdoctoral training and became a junior faculty member at Harvard University before relocating to Cambridge University in 1999. His areas of interest relate to the role of lipids in insulin resistance and diabetes, and more specifically to the concept of lipotoxicity as the pathogenic mechanism linking obesity to insulin resistance.
R. William Watson is a senior lecturer in the School of Medicine and Medical Science, University College Dublin (UCD) and a principal investigator at the UCD Conway Institute and Dublin Molecular Medicine Centre. He is also lead co-ordinator of the Cancer Biology Group in the UCD Conway Institute, which consists of 29 independent investigators. He has used transcriptomic and proteomic approaches to investigate the cellular and molecular mechanisms by which prostate cancer epithelial cells die, leading to new diagnostic tools and therapeutic targets. He is a founding member of the Prostate Cancer Research Consortium and chair of the consortium's Bio-Resource Management and Implementation group, where he has established standard operating procedures for the appropriate collection of tissue, blood and urine from men undergoing radical prostatectomy.
Bart Weimer directs the Center for Integrated BioSystems (CIB) at Utah State University. He is a recognized expert in microbial physiology and functional genomics. As director he is focused on using his expertise to lead the CIB in research, core services, and biotechnology training. Prior to joining USU, he obtained degrees in microbiology from the University of Arizona (BS, honors) and Utah State University (PhD). He trained at the University of Melbourne (Australia) as a postdoctoral fellow in genetics and biochemistry. He has worked in microbiology for over 17 years.
Peter Wellstead's research is in mathematical modelling and data analysis motivated by problems in biology, physiology and medicine. After an apprenticeship with Marconi Instruments and a first degree in electrical engineering from the Hatfield College of Technology, UK, in 1967, he conducted doctoral research in random signal processing at Warwick University, UK, graduating in 1970, with a DSc awarded in 1988. A period with CERN was followed by a career at the Control Systems Centre (UMIST), where he became professor of control engineering. Since 2004 he has been Science Foundation Ireland (SFI) research professor of systems biology at the Hamilton Institute.
Olaf Wolkenhauer's research is in mathematical modelling and data analysis, focusing on nonlinear dynamic systems in cell biology. After first degrees in control engineering from the University of Applied Sciences in Hamburg, Germany, and the University of Portsmouth, UK, in 1994, he conducted doctoral research in possibility theory for data analysis at UMIST, Manchester, graduating in 1997. A research lectureship at the Control Systems Centre (UMIST) led to a joint senior lectureship between Biomolecular Sciences and Electrical Engineering and Electronics at UMIST. Since 2003 he has held the Chair in Systems Biology and Bioinformatics at the University of Rostock in Germany.
Bradly G. Wouters is professor and head of the laboratory of radiation oncology at Maastricht University in the Netherlands. Professor Wouters received his PhD from the Medical Biophysics Department at the University of British Columbia and then conducted post-doctoral research at Stanford University. He is an expert in the field of molecular radiation oncology with a primary interest in understanding the cellular and molecular responses to hypoxia and their influence on the biological behavior of tumors. Dr. Wouters also has active research interests in DNA repair and is actively involved in translational radiation oncology research studies.
Paul Wrede studied biology at the Freie Universität Berlin. He worked on molecular evolution topics at the Max-Planck-Institut für Molekulare Genetik, Berlin, obtaining his diploma and PhD in the department of Prof. Wittmann. He did his postdoctoral research on the evolution of tRNAs at the Massachusetts Institute of Technology, Cambridge, USA, in the laboratory of Prof. Alexander Rich. He was an assistant at the University of Heidelberg in the laboratory of Prof. Hermann Bujard and at the Freie Universität Berlin, working on membrane proteins, protein integration and secretion, and founding the bioinformatics group in the biophysics laboratory of Prof. Georg Büldt. Currently he is professor and bioinformatics group leader at the Charité in the Institut für Molekularbiologie und Bioinformatik, headed by Prof. Burghardt Wittig.
Wasco Wruck received his diploma in computer science from the Technical University Berlin in 1990. After working on a research project on parallel computer architectures and developing software in the telecommunications industry, he joined the Max Planck Institute for Molecular Genetics in 1999. His interests are in image analysis and in microarray evaluation methodology.
Urszula Zabarylo obtained her Bachelor's degree in medical physics in 1999 from the Nicolaus Copernicus University in Poland and her Master of Science degree in experimental physics from the Free University of Berlin in 2004. In 2005 she joined the Institute for Medical Physics and Optical Diagnosis at the Charité, Berlin. Her research interests include image processing, tissue optics and optical diagnostics in medicine.
Christos C. Zouboulis was born in 1960. He received his MD from the University of Athens and his Dr. med. from the Medical School of The Free University of Berlin, and holds diplomas in dermatology, venerology, allergology and proctology (Berlin Medical Association). Since 2000 he has been professor of dermatology at the Medical School of The Free University of Berlin. Since 2005 he has been chair of the Departments of Dermatology and Immunology, Dessau Medical Center, and head of the Laboratory for Biogerontology, Dermato-Pharmacology and Dermato-Endocrinology, Institute of Clinical Pharmacology and Toxicology, Charité Universitaetsmedizin Berlin. He is president of the European Society of Anti-Aging Medicine (ESAAM), spokesperson of the research group Dermato-Endocrinology of the DDG, and chair of the executive committee of the William J. Cunliffe Scientific Awards and of the German register Morbus Adamantiades-Behçet e.V. He is a winner of the Oskar Gans, Felix Wankel, Geroulanos, Paul Gerson Unna and Springer-Verlag prizes, and an honorary member of the Lithuanian Dermatology Society.

Index

Symbols
2D-PAGE 408, 409, 410, 411, 412, 413, 414, 417
5-aminolevulinic acid (ALA) 599, 600, 601, 602, 604, 608, 618, 619, 620, 621, 630, 636, 641, 642, 643, 644, 646, 649, 650, 653, 654
β-barrel 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196

A
adaptive immunity 6, 379, 382
additive model 256, 411, 502, 504, 830
adenosine diphosphate (ADP) 573, 574, 587, 614, 615, 646
adenosine triphosphate (ATP) 307, 334, 364, 374, 459, 573, 574, 587, 614, 685, 686
adipose tissue 355, 356, 358, 363, 368, 467
Affymetrix 249, 250, 251, 252, 253, 254, 255, 256, 257, 259, 255, 260, 251, 260, 255, 261, 262, 263, 264, 265, 266, 269, 276, 342, 345, 346, 387, 388, 406, 721
agglomerative clustering 209, 211
allelic association 263, 266
allostasis 24, 358
alternative splicing 144, 148, 160, 241, 259, 262, 263, 264, 265, 266, 267, 269, 270, 271, 272, 273, 274, 275, 276, 277, 291, 293, 295, 298, 300, 302, 304, 305, 306, 307, 308, 309
amplitude modulation 667, 668, 669
analysis of variance (ANOVA) 270, 347, 412
Andronov-Hopf bifurcation 36, 37, 40, 41, 43, 55, 57, 59, 60, 61, 72
apoptosis 22, 74, 75, 79, 80-85, 93, 94, 298, 309, 327, 329, 351, 352, 359, 368, 382, 383, 387, 398, 427, 435, 487, 539, 588, 589, 598-609, 612-628, 631-639, 641, 675, 686, 690, 691, 703, 704, 707, 721
avian influenza 797, 798, 799, 803, 807

B
background correction 222, 231, 232, 237, 242, 243, 251, 254, 411
backpropagation through time 504, 514
bacterial robot 106
balance equation 76, 82
basic local alignment search tool (BLAST) 135, 146, 149, 150, 152, 161, 162, 166, 245, 249, 399, 485, 494
Bayesian information criterion (BIC) 212, 213, 215, 217, 507
Bayesian networks 8, 10, 100, 112, 113, 349, 397, 499, 506, 507, 508, 512, 514, 520, 526, 528
bead-level-analysis 242, 243
BeadArray 239, 249
BeadStudio 240, 241, 242, 243, 244
beta-thalassemia 291, 308
biobricks 104, 113
biochemical network 8, 327, 389, 390, 493, 569
bioconductor 240, 241, 242, 243, 244, 247, 248, 260, 286, 289, 290, 341, 345

biomarker 10, 119, 120, 121, 126, 130, 132, 133, 134, 137, 138, 140, 352
Boolean logic 4, 75
Boolean models 75, 391, 392, 393, 394, 500
bootstrapping 150, 219
bottom-up approach 112, 391, 392

C
candidate gene approach 366, 367
cell division cycle (CDC) 31
ceramides 356, 360, 637
chromatin immunoprecipitation 384, 401, 518, 519, 696
cluster analysis 100, 119, 210, 214, 217, 218, 219
combinatorial complexity 1
conformational entropy 734, 736, 738, 741, 742, 744, 746, 749, 750, 751, 756, 757, 758
controllability 16
cross-hybridization 253, 254, 255
crosstalk 16, 17, 18, 23, 24, 378, 635, 674, 717

D
data integration 340, 403, 414, 416, 418, 476, 477, 478, 480, 481, 482, 483, 484, 485, 486, 488, 489, 490, 491, 492, 516, 517, 523
dehydroepiandrosterone (DHEA) 467, 468, 772, 774, 778
delay differential equation (DDE) 33, 34, 35, 36, 66, 67
dengue fever 808, 809, 810, 820, 822, 823
depolarization 638, 664, 668, 680, 684, 685, 686, 688, 689, 690
deterministic kinetic modeling 74
deterministic models 75, 324, 738, 799
differential expression 119, 242, 248, 334, 412
dilatative cardiomyopathy (DCM) 438, 446, 447, 449, 456
dimension reduction 118, 213, 214, 413, 838, 847
discrete dynamical system 531, 533, 534, 535, 537
discriminant analysis 118, 210
disulphide bridges 746
divisive clustering 211
dynamical Bayesian models 391, 393
dynamical Boolean models 392, 393
dynamical diseases 30
dynamical system 14, 16, 25, 34, 54, 500, 501, 502, 515, 531, 532, 534, 535, 537, 540

E
eigenvalue 41, 43, 57, 82, 84, 95, 578, 803, 815, 819
electrostatic solvation 742, 747, 748, 749
enthalpy 731, 732, 734, 742, 743, 744, 745, 747, 748, 750, 751, 758
entropy 139, 225, 227, 231, 232, 234, 236, 237, 238, 500, 511, 545, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 838
enzyme-linked immunosorbent assay (ELISA) 447, 448, 449, 456
equilibrium constant 77, 78, 735, 737, 741
erythrocytes 379, 662, 663, 664, 665, 667, 669
exciton 573, 576, 578, 579, 583, 584, 586
Exon Array 263, 264, 270, 272, 274, 276
expressed sequence tag (EST) 263, 267, 269, 270, 274, 275, 276, 305, 487

F
FARMS 256, 261
feature extraction 119, 120, 121, 439, 441, 442
feedback control 12, 15, 24, 71, 554, 644
feedback loops 15, 17, 23, 30, 31, 72, 80, 91, 349, 372, 533, 534, 535, 536, 540, 717
fluorescence intensity 680, 684, 685, 686, 688
fluorescence microscopy 120, 597, 599, 639, 685
free energy 731, 732, 734, 735, 736, 737, 738, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 755, 757, 758
frequency modulation 667, 668

G
G-protein coupled receptors (GPCRs) 167, 168, 170, 171, 169, 171, 172, 173, 174, 175, 176, 177, 179, 180, 181
Gene Expression Profile 236
gene ontology 348, 397, 416, 493, 519, 521, 522, 523, 524, 526, 724
gene regulatory network 9, 74, 503, 512, 516
gene shaving 215, 219
gene signature 346, 348
genetic algorithm 124, 433, 435, 502, 503, 504, 513, 580, 581
genetic regulatory network 531, 532, 535, 539
genome-wide scanning 366
gonadotropin 768, 794, 795
gonadotropin releasing hormone (GnRH) 764, 765, 766, 767, 768, 774, 793, 794, 795

H
heat shock proteins 588, 589, 622, 631, 674, 691
heat stress response 678, 691
heme 130, 575, 588, 589, 591, 599, 600, 620, 621, 622, 625, 631, 637, 643, 644, 645, 646, 648, 649, 650, 652, 653, 654
hierarchical clustering 149, 152, 153, 163, 164, 210, 211, 213, 218, 249
high-throughput methods 362, 369
Hill function 760, 761, 762, 766, 767, 768, 795
histopathology 825, 826, 827, 829, 830, 831, 832, 833, 834, 837, 842, 846, 847
homeostasis 12, 13, 15, 32, 33, 329, 331, 334, 335, 354, 355, 356, 359, 368, 389, 460, 530, 533, 538, 616
Hopf's theorem 27
host-pathogen interaction 381, 390
human-pathogenic fungi 403, 404, 406, 408, 409, 410, 411, 412, 413, 414, 416, 417, 418
hybridization 118, 220, 222, 226, 240, 242, 249, 252, 253, 254, 255, 257, 259, 261, 264, 265, 266, 267, 268, 280, 289, 342, 400, 411, 527
hydrogen bonds 183, 186, 746, 747, 748
hydrophobicity 131, 174, 186, 187, 442, 444, 448, 597, 749
hyperthermia 673, 674, 675, 685, 686, 688, 689

I
identifiability 15
image pre-processing 825, 826
immune response 377, 379, 451, 536, 539, 610
immunocompromised 404, 405
in-paralogs 145, 147, 151, 164
in-silico simulation 12, 14, 21
infectious systems biology 378, 390
interaction graph 388, 389, 700
interference microscopy 656, 657, 659, 661, 662, 663, 664, 665, 666, 669, 670, 671, 672
interferometers 658, 659, 672
in vitro 52, 107, 112, 114, 195, 300, 308, 331, 334, 335, 340, 344, 346, 347, 350, 356, 368, 425, 429, 445-447, 467, 470, 542, 543, 546, 554, 597, 598, 601, 603, 614-618, 621, 622, 628, 630, 632, 635-637, 641, 645, 697, 701, 702, 706, 710
in vivo 31, 52, 107, 110, 175, 192, 195, 327, 332, 340, 344, 355-358, 368, 384, 395, 429, 435, 445, 470, 485, 542, 546, 554, 591, 592, 603, 608, 610, 614, 615, 618, 622, 625, 631, 635, 637, 640, 697, 698, 701, 706, 709, 710, 712, 713
isoform 19, 262, 263, 264, 270, 273, 274, 275, 276, 277, 293, 300, 303, 309, 358, 621, 645

J
J-aggregates 675, 676, 677, 680, 681, 684, 686, 688
Jacobian matrix 82, 83, 84, 87, 802, 814, 815, 818, 819
Jaynes 237
JC-1 monomers 676, 684, 686
joint probability distribution 505, 506

K
k-means clustering 209, 523
Kalman filtering 17
kinetic laws 76, 77, 89, 489, 648
knowledge discovery 126, 127, 128, 133, 134, 137, 477, 486

L
laboratory evolution 105
Law of Mass Action 77, 78
least square fit 503
lipidomics 354, 355, 356, 357, 359, 408
lipotoxicity 356, 359, 360
Lyapunov-Andronov theory 27, 28, 44

M
Mach-Zehnder interferometer 659, 672
Markov Chain Monte Carlo (MCMC) 508, 515
mass spectrometry 100, 102, 117, 119, 120, 121, 122, 123, 124, 131, 132, 138, 282, 284, 384, 408, 619, 698
mast cell 664, 666, 667, 669, 670, 671
MATLAB 14, 18, 24, 25, 47, 61, 99, 113, 401, 411, 468
maximum likelihood method 212, 215, 220, 228, 504, 557, 559, 560, 561, 562, 570
metabolic control analysis (MCA) 74, 75, 85, 88, 91, 95
metabolic engineering 278, 458, 459, 465, 471
metabolic flux balancing 458, 459, 460, 462, 465, 467, 468, 469, 470, 471
metabolic pathways 2, 4, 7, 14, 30, 74, 103, 134, 158, 283, 284, 288, 368, 374, 401, 452, 459, 462, 465, 470, 523, 647
metabolomics 8, 278, 279, 284, 286, 353, 354, 355, 357, 360, 408, 413
MicroArray quality control (MAQC) 246, 247, 248, 249, 259, 260
microarrays 5, 117, 118, 123, 124, 209, 219, 226, 229, 232, 237, 239, 246, 250, 251, 252, 258, 259, 263, 266, 274, 275, 289, 290, 305, 339, 340-346, 349, 351, 353, 361, 362, 366-369, 375, 383, 384, 400, 406, 409, 410, 414, 420, 467, 487, 497, 512, 517, 518, 526-529, 619, 621, 696, 698, 711, 712, 826, 829
micro electrode array (MEA) 541, 542, 543, 544, 545, 546, 548, 549, 550, 551, 553, 554
minimal genome 106, 110
mitochondrial transmembrane potential 675, 677, 679, 690
mixed effects models 558, 559, 561, 564, 566, 570, 571
mixed models 215, 218
mixture models 212, 213, 219, 221, 229
model-based clustering 212, 527
model decomposition 776, 779, 787
multiple myeloma 423, 427, 433
multiple sequence alignments 151, 152, 175, 180
mutual information 153, 154, 500, 501, 505, 513, 514

N
nanorobot 106
natural language processing (NLP) 171, 174
neuronal network 541, 542, 545, 550, 551, 554

O
obesity 354, 355, 358, 359, 362, 363, 365, 366, 368, 375, 467, 724
observability 16
ODE system 75, 77, 82, 83, 89, 489
optimal control 13, 17, 22
orthogonal life 110

orthology 145, 146, 147, 148, 149, 150, 151, 152, 159, 160, 161, 162, 390, 391, 719, 721, 729
OrthoMCL 151, 158, 162, 164, 400, 719, 721, 723
overfitting 347, 840, 841
oxygenic photosynthesis 573, 574

P
P-POD 151, 163
paralogy 147, 148, 150, 161, 165
parameter scanning 74, 75, 82, 85, 89
pathway biology 1, 2, 4, 6, 10
Petri Net 9, 508, 512, 513, 537
Pfam 145, 151, 152, 155, 158, 163, 164, 177, 180, 199
phase contrast microscopy 657, 658
phase image 658, 660, 664, 665, 672
PhIGs 152, 162
photodynamic therapy (PDT) 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 643, 646, 652, 653
phylogenetic profiling 143, 144, 154, 158, 160, 161, 164, 165
Poisson process 546, 766
population kinetic analysis 556, 557, 568
porin 189, 190, 191, 196, 197, 198, 199, 201, 202, 203, 204, 205
porphyrin metabolism 643, 647, 648, 652, 654
Power Series Distribution 237
preprocessing 251, 253, 254, 256, 257, 259, 414, 509
principal component analysis 187, 346, 736, 749, 838, 847
probe set 252, 255, 256, 257, 258, 384
profile hidden Markov models 170
protein crystallization 127, 128, 129, 130, 138
protein interaction graph 700
protein interaction network 9, 10, 399, 695, 699, 705, 708, 711, 712
proteome 9, 10, 119, 132, 146, 147, 148, 203, 282, 351, 383, 385, 386, 388, 403, 407, 409, 412, 417, 418, 495, 526, 699, 710, 711
proteomics 113, 119, 120, 121, 127, 131, 137, 138, 139, 143, 278, 279, 283, 286, 349, 363, 388, 408, 410, 413, 416, 418, 419, 477, 479, 488, 495, 694, 707, 710
protoporphyrin IX (PPIX) 596, 597, 599, 600, 608, 609, 615, 620, 641, 642, 646, 650
PyBioS 82, 85, 90, 91, 370, 372, 375, 643, 647, 648, 652, 654

Q
QFAST algorithm 171, 172, 173
quantitative sequence activity relation (QSAR) 179, 451, 453, 457

R
rate equation 76, 77
rational design 105, 446, 455
Reactome 2, 8, 92, 146, 165, 401, 487, 489, 490, 495, 647, 653, 701, 709, 714, 716, 717, 718, 720, 721-729
refactoring 103
refractive index 657, 658, 661-672
regulatory modules 516, 517, 522, 527
resampling 212, 397
retinitis pigmentosa 296, 298
rhodopsin 168, 173, 175, 176, 179
RNA silencing 27, 28, 51, 52, 53, 61, 66, 67, 68, 69, 71
robustness 16, 23, 66, 70, 71, 263, 266, 372, 374, 375, 386, 700
Routh-Hurwitz 36, 41, 45, 47, 54, 55, 57, 65, 803, 816

S
sebaceous gland 331, 332, 333, 334, 336, 337
secretase 423, 424, 425, 426, 427, 428, 429, 430, 432, 433, 434, 435, 436

segmentation 121, 122, 123, 124, 135, 136, 139, 210, 212, 214, 228, 231, 237, 825, 826, 828, 832, 833, 834, 835, 836, 837, 842, 843, 844, 847
sequence similarity 143, 144, 145, 146, 147, 148, 149, 151, 158, 159, 160, 166, 174, 185, 189, 191, 194, 196, 386, 388, 390, 391, 444, 485
signaling pathways 23, 75, 102, 103, 168, 169, 175, 334, 335, 349, 494, 603, 605, 607, 613, 614, 625, 664, 675
single nucleotide polymorphisms (SNP) 259, 263, 267, 268, 270, 272, 276, 277, 362, 367
Smith-Waterman algorithm 146, 149
spike sorting 544, 553
spinal muscular atrophy 291, 302, 304, 305, 306
steady states 30, 55, 85, 531, 533, 534, 536, 540
stochastic control 17
stochastic models 75, 101, 393, 499, 509, 799
stoichiometry 76, 96, 462, 463
structure prediction 126, 128, 134, 135, 137, 138, 139, 140, 177, 428, 429, 724
supervised learning 513, 524, 826, 839, 841
support vector machine (SVM) 173, 174, 202, 347, 412, 840, 848
SYS biology 97, 98, 100, 110, 115
systematization 127, 128, 130, 134
System for Population Kinetics (SPK) 556, 557, 558, 560, 561, 562, 563, 564, 565, 566, 568
systems biology 2, 5, 6, 8, 9, 11-18, 23, 25, 70, 74, 90, 94, 95, 112, 113, 126, 128, 137, 143-145, 228, 251, 278-280, 284, 287, 288, 314, 322, 325, 327, 329, 354, 355, 358, 359, 361, 363, 367, 369, 370, 374, 375, 378, 380, 384, 386, 390, 392, 395-397, 399-404, 418, 438, 452, 458, 459, 471, 476-480, 485, 486, 488-493, 513, 514, 539, 569, 588, 589, 624, 653, 674, 687, 710, 731, 732, 826
Systems Biology Workbench (SBW) 90, 99, 324, 330
systems theory 12, 13, 14, 15, 16, 28, 393

T
Tanimoto coefficient 430, 431
taxonomy 152, 153, 161, 484, 708
thermodynamics 731, 732, 733, 734, 736, 737, 738, 739, 751, 752, 753, 754, 755, 756, 757, 758
time-course simulation 81, 84
transcriptional rate 233, 531
transcriptional regulatory network 362, 516
transcription factors 91, 366, 368, 370, 389, 401, 487, 498, 516, 518-523, 528, 530, 531, 535, 588, 589, 600, 602, 605, 606, 622, 625, 687, 691, 703, 711
transcriptomics 353, 408, 413, 416, 419, 421, 479
transmembrane 167-197, 199, 200-202, 206, 207, 301, 304, 425, 426, 428, 430, 573, 673, 675, 677, 679, 684, 690, 704, 708
TreeFam 152, 164, 165

V
von Willebrand disease 296, 305

W
wavelet analysis 656, 657, 665, 667, 668, 669, 671
