2D Gel Electrophoresis

ANALYSIS OF TWO-DIMENSIONAL ELECTROPHORESIS GEL IMAGES
Lars Pedersen
Informatics and Mathematical Modelling Ph.D. Thesis No. 96 Kgs. Lyngby 2002
IMM
© Copyright 2002 by Lars Pedersen
Printed by IMM/Technical University of Denmark
Preface
This thesis has been prepared at the Image Analysis and Computer Graphics section at Informatics and Mathematical Modelling (IMM), Technical University of Denmark, in partial fulfilment of the requirements for the degree of Ph.D. in engineering. The general framework for this thesis is pattern analysis, digital image processing and computer vision with application in the field of proteomics. The subject is Analysis of Two-dimensional Electrophoresis Gel Images. This work was carried out in close collaboration with Centre for Proteome Analysis in Life Sciences (CPA), University of Southern Denmark, Odense. Part of this thesis is confidential and therefore Chapter 5 is omitted from this edition.
Kgs. Lyngby, February 2002
Lars Pedersen
Acknowledgements
I would like to thank the many people who have contributed their time to helping me with this thesis, for fruitful discussions and critical review. First, my thanks to my supervisors Associate Professor Bjarne Ersbøll, Associate Professor Stephen J. Fey, and Professor Knut Conradsen for their invaluable suggestions and constructive criticism.
I am grateful to Centre for Proteome Analysis in Life Science (CPA) for financial support and for supplying data material, and in particular to Associate Professor Stephen J. Fey and Associate Professor Peter Mose Larsen for making me realise the importance of proteomics and for teaching me some of the peculiarities of cell biology. Also at CPA, I wish to thank Arkadiusz Nawrocki and Adelina Rogowska for providing and preparing most of the data material used in this work.
Informatics and Mathematical Modelling at the Technical University of Denmark has provided me with an environment in which to carry out this work, and for that I am thankful. At IMM I wish to thank previous and present members of the Section for Image Analysis and Computer Graphics, in particular my office-mates Klaus Baggesen Hilger, Mikkel B. Stegmann, and Rune Fisker, who have provided a pleasant atmosphere throughout the years while also being inspiring in many ways, and Associate Professor Jens Michael Carstensen for many inspiring discussions. Many thanks to our secretary staff Helle Welling, Mette Larsen, and Eina Boeck.
I am most thankful to Professor James Duncan for giving me the chance to work with his group, the Image Processing and Analysis Group at the Department of Diagnostic Radiology, Yale University School of Medicine, during my external research stay. Here, I wish in particular to thank Dr. Haili Chui and Associate Professor Anand Rangarajan, who have been of great inspiration, and Reshma Munbodh and Larry Win for providing an always joyful environment. Thanks to Carolyn Meloling for secretarial help. I owe a great debt to Peter Chapman for his unique hospitality and to Camilla Hampton, Susan D. Greenberg, Robert Rocke and Matt Feiner for taking care of me and for introducing me to many aspects of New Haven and the American East Coast culture.
Finally, I wish to thank Stephen J. Fey, Bjarne Ersbøll, Mikkel B. Stegmann, Klaus Baggesen Hilger, Rasmus R. Paulsen, and Michael Grunkin for careful and patient review of the manuscript.
Abstract
This thesis describes and proposes solutions to some of the currently most important problems in pattern recognition and image analysis of two-dimensional gel electrophoresis (2DGE) images. 2DGE is the leading technique for separating individual proteins in biological samples, with many biological and pharmaceutical applications, e.g., drug development. The technique results in an image in which the proteins appear as dark spots on a bright background. However, the analysis of these images is very time consuming and requires a large amount of manual work, so there is a great need for fast, objective, and robust methods based on image analysis techniques in order to significantly accelerate this key technology.
The methods described and developed fall into three categories: image segmentation, point pattern matching, and a unified approach that simultaneously segments the image and matches the spots.
The main challenges in the segmentation of 2DGE images are to separate overlapping protein spots correctly and to find the abundance of weak protein spots. Issues in the segmentation are demonstrated using morphology based methods, scale space blob detection and parametric spot modelling. A mixture model for parametric modelling of several spots that may also be overlapping is proposed.
To enable comparison of protein patterns between different samples, it is necessary to match the patterns so that homologous spots are identified. Protein spot patterns, represented by the spot centre coordinates, can be regarded as two-dimensional point sets, and methods for point pattern matching can be applied. This thesis presents a range of state-of-the-art methods for this purpose and also suggests a regionalised scheme. Among the general point pattern matching methods, the focus is on the Robust Point Matching methods, and among the methods developed in the literature specifically for matching protein spot patterns, the focus is on a method based on neighbourhood relations. These methods are applied to a range of 2DGE protein spot data in a comparative study.
Point pattern matching requires segmentation of the gel images, and since correct image segmentation can be difficult, a new alternative approach is proposed that exploits prior knowledge about the protein locations from a reference gel to segment an incoming gel image.
Resumé
This thesis describes and proposes solutions to some of the most important existing problems in pattern recognition and image analysis of two-dimensional gel electrophoresis (2DGE) images. 2DGE is the leading technique for separating the individual proteins in biological samples from each other, and the technique has numerous applications in biotechnology and pharmacology, e.g., in the development of new drugs. The technique results in an image in which the proteins appear as dark spots on a light background. However, the analysis of these images is very time consuming and requires a good deal of manual work, so there is a pronounced need for fast, objective, and robust methods based on image analysis techniques, with the aim of giving 2DGE technology a substantial push forward.
The methods described and developed here fall into three categories: image segmentation (i.e., separation of the image into protein spots and background), point pattern matching, and a unified approach that segments the image and simultaneously matches corresponding protein spots. The most important challenges in the segmentation of 2DGE images are to separate closely spaced, overlapping protein spots and to detect the large number of small, weak protein spots. Issues in the segmentation are illustrated using methods from mathematical morphology, scale space based blob detection, and parametric modelling of protein spots. In addition, a new model based on weighted superposition of parametric spot models is proposed. This mixture model can model several, possibly overlapping, spots.
To be able to compare protein patterns from different biological samples, it is necessary to match the patterns so that homologous protein spots can be identified. If the protein patterns are represented by the centre positions of the spots, they can be regarded as point sets in two dimensions, and methods for matching point sets can then be applied. This thesis presents a number of leading general methods for this purpose and also proposes a regionalised approach. Among the general point matching methods, the focus is on the family of methods called Robust Point Matching, and among the methods in the literature developed specifically for matching protein spot patterns, the focus is on a method based on neighbourhood relations. The methods are applied to a range of 2DGE protein spot data in a comparative study.
Matching of point sets presupposes a segmentation of the gel image, and such a segmentation can be difficult to carry out correctly. Therefore, an alternative approach has been developed here which, in segmenting a new gel image, makes use of prior knowledge about the positions of the proteins from a reference gel.
Contents
Preface
Acknowledgements
Abstract
Resumé
Contents
List of Tables
List of Figures
List of Algorithms
1 Introduction
  1.1 Thesis Overview
    1.1.1 Notation
  1.2 Thesis Contributions
2 Motivation
  2.1 Biological Background
    2.1.1 Proteome analysis
    2.1.2 Two-dimensional gel electrophoresis
  2.2 Issues in Image Segmentation
  2.3 Issues in Spot Matching
    2.3.1 Properties of 2DGE spot patterns
  2.4 Unified Approach
  2.5 Protein Pattern Databases
3 Gel Segmentation
  3.1 Mathematical Morphology Based Methods
    3.1.1 Watersheds
    3.1.2 H-domes
  3.2 Scale Space Blob Detection
  3.3 Parametric Spot Models
    3.3.1 Gaussian spot model
    3.3.2 Diffusion spot model
    3.3.3 Mixture model
  3.4 Experiments and Results
    3.4.1 Scale space blob detection
    3.4.2 Marker based watershed segmentation
    3.4.3 H-dome transformation
    3.4.4 Parametric spot models
  3.5 Summary
4 Point Pattern Matching
  4.1 Notation
    4.1.1 Correspondence and match matrices
    4.1.2 Motion estimation
    4.1.3 The classical chicken and egg problem
    4.1.4 Graph based methods
  4.2 General Point Pattern Matching Methods
    4.2.1 Iterative closest point
    4.2.2 Dual step EM
    4.2.3 Bipartite graph matching of shape context
    4.2.4 Robust point matching
  4.3 Point Matching of Protein Spot Patterns
    4.3.1 Neighbourhood based matching
    4.3.2 Graph based matching
    4.3.3 Successive point matching
    4.3.4 Regionalised robust point matching
  4.4 Match Evaluation
  4.5 Experiments and Results
    4.5.1 The trade-off resulting from binarization
    4.5.2 Method comparison
    4.5.3 Error locations
  4.6 Summary
5 Elastic Graph Matching
6 Conclusion
A Thin-Plate Spline Transformation
B Algorithms
  B.1 Binarization of fuzzy match matrix
C Data material
  C.1 Gel Images
    C.1.1 Data set
  C.2 Protein Spot Attribute Information
    C.2.1 Match information
  C.3 Disparity Analysis
D Grey level based warping
  D.1 Experiments
    D.1.1 No regularisation
    D.1.2 Gaussian smoothing
  D.2 Application in Point Matching
IMM Image Analysis and Computer Graphics
List of Tables
3.1 Parameters corresponding to plots in Fig. 3.12.
3.2 Number of detected blobs at different scales.
4.1 Overview of general point pattern matching methods.
4.2 Point pattern matching methods applied to 2D electrophoresis protein spot patterns.
4.3 Results of protein spot set discretisation.
4.4 Experiment specification.
4.5 Evaluation of match result. Gel pair 1A vs. 2A.
4.6 Evaluation of match result. Gel pair 1A vs. 2A.
4.7 Average scores across 15 experiment pairs.
List of Figures
1.1 Two-dimensional electrophoresis gel image of baker's yeast Saccharomyces cerevisiae strain Fy1679-28C EC [pRS315]. Detail of 150 × 150 pixels region.
2.1 Division of sequenced genes into known, homologous and unknown categories for three different organisms.
2.2 Schematic two-dimensional electrophoresis gel.
2.3 Two-dimensional electrophoresis gel images. Two different gels of baker's yeast Saccharomyces cerevisiae strain Fy1679-28C EC [pRS315].
2.4 Diagram of the 2DGE process. By courtesy of CPA.
2.5 Comparison of four different protein visualisation methods.
2.6 Histograms of %II and log(%II) for a gel image with 1919 spots.
2.7 Example images of gel regions with low signal to noise ratio, low intensity spots.
2.8 Example image of gel regions with overlapping spots.
2.9 Example image of gel with typical varying background.
2.10 Intensity profiles along the horizontal and vertical lines, respectively, in Fig. 2.9.
2.11 Principal sketch of (partial) correspondences between protein spots in two gel images.
2.12 Gel images with known spot centres overlaid as points.
2.13 Known correspondence between spots in gel A and gel B.
2.14 Deterministic and stochastic point patterns.
2.15 Construction of a 2D gel image database. By courtesy of CPA.
3.1 Original ferrite nanoparticle image.
3.2 Example of watershed with and without markers.
3.3 Principal sketch of h-dome extraction.
3.4 H-dome extraction of two-dimensional electrophoresis gel image.
3.5 Example of scale space blob detection on nanoparticle image.
3.6 Scale space blob detection at 4 different scales.
3.7 Selected spots in 2D gel image.
3.8 Gallery of 6 different protein spots.
3.9 Spot 11 from different view points.
3.10 2D Gaussian spot model.
3.11 2D diffusion spot models with increasing t.
3.12 2D diffusion spot model at different parameter configurations.
3.13 Sub region of electrophoresis gel. Spot centres of 61 known protein spots are marked.
3.14 Scale space blob detection at different scales, t = 1 and t = 2.
3.15 Scale space blob detection at different scales, t = 3 and t = 4.
3.16 Scale space blob detection at different scales, t = 5 and t = 6.
3.17 Scale space blob detection at different scales, t = 7 and t = 8.
3.18 Marker based watershed segmentation.
3.19 H-dome transformation of 2D gel image. From the top left: h = 0.05, 0.15, 0.25, 0.35, 0.45, and 0.50.
3.20 Parametric fit of Gaussian spot model to spot 11.
3.21 Parametric fit of diffusion spot model to spot 11.
3.22 Parametric fit of Gaussian spot model to spot 18.
3.23 Parametric fit of diffusion spot model to spot 18.
3.24 Parametric fit of Gaussian spot model to spot 20.
3.25 Parametric fit of diffusion spot model to spot 20.
3.26 Parametric fit of Gaussian spot model to spot 36.
3.27 Parametric fit of diffusion spot model to spot 36.
3.28 Parametric fit of Gaussian spot model to spot 53.
3.29 Parametric fit of diffusion spot model to spot 53.
3.30 Parametric fit of Gaussian spot model to spot 54.
3.31 Parametric fit of diffusion spot model to spot 54.
4.1 Point correspondence.
4.2 Shape context and matching. From Belongie et al. [6].
4.3 Known correspondence between spots in gel A and gel B.
4.4 Principal sketch of (partial) correspondences between protein spots in two gel images.
4.5 Panek and Vohradsky neighbourhood segment description. From [64].
4.6 Regionalised point/spot matching.
4.7 Neighbouring regions.
4.8 Overlapping regions.
4.9 Regionalisation grid, o = 50%.
4.10 Gel images with known spot centres overlaid as points.
4.11 Known correspondence between P and Q (for the gels shown in Fig. 4.10).
4.12 Binarization effect. M2 scores for point set P at different levels of the binarization threshold.
4.13 Test scores for M1, M2, and M3.
4.14 Error locations. Method M1.
4.15 Error locations. Method M2. = 0.5.
4.16 Error locations. Method M3. = 0.5.
4.17 Spatial location of errors in all experiments, M1.
4.18 Spatial location of errors in all experiments, M2 ( = 3.5).
4.19 Spatial location of errors in all experiments, M3 ( = 2.7).
4.20 Spatial location of errors in all experiments, M1-M3.
C.1 Overview of images in Data Set 1.
C.2 Group 1A vs. Group 1B.
C.3 Disparity field for gel 1A vs. gel 2A with average disparity histograms.
D.1 Original 512 × 512 region of two gel images. Left: Reference image (A). Centre: Match image (B). Right: Pixel-wise difference A-B.
D.2 No regularisation of disparity maps. Disparity maps and warped grid. Left: Horizontal disparity map h. Centre: Vertical disparity map v. Right: Regular grid warped according to disparity maps.
D.3 No regularisation of disparity maps. Warped versions of original 512 × 512 images. Left: Reference image warped according to h (Aw). Centre: Match image warped according to v (Bw). Right: Pixel-wise difference Aw-Bw.
D.4 Gaussian smoothing of disparity maps. Disparity maps and warped grid. Left: Horizontal disparity map h. Centre: Vertical disparity map v. Right: Regular grid warped according to disparity maps.
D.5 Gaussian smoothing of disparity maps, = 3. Warped versions of original 512 × 512 images. Left: Reference image warped according to h (Aw). Centre: Match image warped according to v (Bw). Right: Pixel-wise difference Aw-Bw.
D.6 Difference images from Figs. D.1, D.3 and D.5 displayed in common grey level range. Left: Difference image before warp. Centre: Difference image after warp without regularisation of the disparity maps. Right: Difference image after warp with regularisation of the disparity maps.
D.7 Pseudo-colour display of image pairs. Left: Original images A (green) and B (magenta). Centre: Aw (green) and Bw (magenta) after warp without regularisation of the disparity maps. Right: Aw (green) and Bw (magenta) after warp with regularisation of the disparity maps.
List of Algorithms
1 Robust-Point-Matching(P, Q, T0)
2 Sinkhorn(m)
3 Extended-Sinkhorn(m)
4 RPM-Affine(P, Q, T0)
5 RPM-TPS(P, Q, T0)
6 Successive Spot Matching(P c, Q c)
7 Regionalised RPM(p, q, P, Q, T0, sp, sq, o, R, C)
8 Binarization(m, )
9 Select-best-match-in-row(m, r m, j, c)
10 Select-best-match-in-column(m, c m, k, r)
Chapter 1
Introduction
The field of proteomics, or proteome analysis, has become an increasingly important part of the life sciences, especially after the completion of the sequencing of the human genome. Proteome analysis is the science of separation, identification, and quantitation of proteins from biological samples with the purpose of revealing the function of living cells. Applications range from prognosis of virtually all types of cancer, through drug development, to monitoring of environmental pollution.
Currently, the leading technique for protein separation is two-dimensional gel electrophoresis (2DGE), which results in grey level images showing the separated proteins as dark spots on a bright background (see Fig. 1.1, where a small region of 150 × 150 pixels is shown in detail). Such an image can represent thousands of proteins. Pattern analysis and recognition can help to identify the protein diversity and to quantitate the protein amount in a biological sample. It also seems natural to apply pattern analysis to the task of comparing this information with similar information from other samples or from a database.
Pattern analysis methods are currently applied in order to automate and ease the task of analysing gel images and comparing images from different biological samples. With the current methods, however, this part of the process requires large amounts of human-assisted work, and it can be identified as the major bottleneck in the total process from biological sample to protein identification and quantitation. The most important breakthroughs in proteomics have been the introduction of immobilised pH gradients (1988) and the introduction of mass spectrometry in the 1990s. An equally important breakthrough would be improved pattern recognition methods for analysis of the gel images, reducing the large amount of resources spent on human-assisted analysis of the gels. In other words, there is a great need for effective, reliable, and objective methods to analyse the enormous amounts of data coming from proteomics research.
The pattern analysis of 2DGE data is traditionally divided into two parts: the segmentation of the 2DGE images into protein spots and background, and the matching of protein spot patterns across two or more gels. Correct segmentation results in a quantitation of the spots that accurately reflects the amount of protein present. The matching makes it possible to detect changes in protein expression across samples, or even to identify new proteins.
With the main focus on protein spot matching, this thesis provides an overview of the pattern analysis related issues and open problems in the analysis of 2DGE images. State-of-the-art methods for point matching are presented, extended, and tested on 2DGE data, and a new method combining segmentation and matching is proposed. These contributions are intended to relieve the above-mentioned bottleneck and to reduce the resources currently spent on the analysis of 2DGE images, and hence to take proteomics a step further.
1.1 Thesis Overview
The thesis is structured in the following manner:
Chapter 2 provides the motivation for this work. An introduction to the biological background is given, and the interesting issues in the two pattern analysis related areas, 2DGE image segmentation and spot pattern matching, are described. The nature of the spot patterns is discussed and a set of requirements for spot matching methods is defined.
Chapter 3 deals with issues such as low signal-to-noise ratio and spot overlap in the non-trivial task of segmenting the protein spots from the background. The subjects discussed are classical mathematical morphology, scale space based blob detection, and parametric spot modelling, all for the purpose of 2DGE image segmentation.
Chapter 4 presents a variety of methods for general point pattern matching, followed by a range of methods designed for the matching of protein spot patterns. A number of experiments on real 2DGE data are used for method comparison.
Chapter 5 proposes an alternative to the classical two-step procedure (segmentation followed by matching), namely a unified approach that simultaneously estimates the spot correspondence and segments the gel image.
Figure 1.1: Two-dimensional electrophoresis gel image of baker's yeast Saccharomyces cerevisiae strain Fy1679-28C EC [pRS315]. Detail of 150 × 150 pixels region.
1.1.1 Notation
When matching data from two 2DGE images, the task is usually to match an incoming new gel, q, to a well-known gel, p; hence the names reference gel (p) and match gel (q). The grey level intensity image of the reference gel is referred to as Ip and the set of points representing the protein spot centres is denoted P. Similarly, for the match gel the intensity image is Iq and the point set is Q. The correspondence between homologous points in the two point sets can be thought of as a field of disparity, describing the deformation from one set to the other. This disparity field is denoted .
The abbreviations used most often in this thesis are:
2DGE: two-dimensional gel electrophoresis.
RPM: robust point matching.
TPS: thin-plate spline.
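
The notation above can be made concrete with a small numerical sketch. The following Python snippet is only an illustration and is not code from the thesis: it assumes that P and Q are stored as arrays of (x, y) spot centre coordinates and that the known correspondence is given as a list of index pairs. The function name disparity_field and the toy coordinates are hypothetical.

```python
import numpy as np

def disparity_field(P, Q, matches):
    """Disparity vectors for matched spot centres (illustrative sketch).

    P, Q    : arrays of shape (n, 2) and (k, 2) with spot centre coordinates
              of the reference gel and the match gel, respectively.
    matches : sequence of (i, j) pairs meaning that P[i] corresponds to Q[j].

    Returns the origins P[i] and the vectors Q[j] - P[i], one per match.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    idx = np.asarray(matches, dtype=int).reshape(-1, 2)
    origins = P[idx[:, 0]]
    vectors = Q[idx[:, 1]] - origins
    return origins, vectors

# Toy data (assumed, not taken from any gel): three spots in each gel.
P = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
Q = np.array([[12.0, 21.0], [55.0, 58.0], [29.0, 43.0]])
matches = [(0, 0), (1, 2), (2, 1)]

origins, vectors = disparity_field(P, Q, matches)
for o, v in zip(origins, vectors):
    print(f"spot at {o} moves by {v}")
```

Unmatched spots simply do not appear in the list of pairs, so the field is defined only where a correspondence is known.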
1.2 Thesis Contributions
The main contributions of this thesis can be summarised in order of appearance as follows:
extension of Sinkhorn's matrix normalisation method (Section 4.2.4),
extension of RPM-TPS to include attribute information in the energy function (Section 4.2.4),
regionalised point matching (Section 4.3.4),
application and comparison of state-of-the-art point matching methods to real 2DGE data (Section 4.5), and
elastic graph matching, EGM (Chapter 5).
Sinkhorn's matrix normalisation method, used in the Robust Point Matching (RPM) methods, has been extended so that outlier rows and columns in the match matrix are not normalised. The method has also been extended to handle non-square matrices robustly.
In point matching, the spatial locations of the points are used to determine the matches. However, if information other than the spatial location is available about each point, it can be used to ease the matching process, provided that the attribute information of corresponding points is correlated, i.e., corresponding points should have similar attribute information. The Robust Point Matching (RPM) method with thin-plate spline (TPS) has been extended to include attribute information in the energy function. This makes it possible to exploit the extra information available for each point.
A regionalisation scheme has been proposed that breaks down a large, complex matching problem into several smaller problems and finally combines the sub-results. The regionalised point matching serves two purposes: 1) to simplify a dense and locally varying disparity field relating corresponding points into several simpler disparity fields, and 2) to reduce the number of points in the matching process and thereby reduce the computational cost. This is based on the assumption that a matching point should be found in the neighbourhood, so that it is not necessary to attempt to match all points to all other points. The regionalised approach is suitable for point pattern matching methods that are robust to a large number of outliers.
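
The idea of leaving outlier rows and columns out of the normalisation can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of the thesis (which is presented in Section 4.2.4): the match matrix is assumed to carry one extra outlier row and one extra outlier column as its last row and column, and the function name, iteration limit and tolerance are hypothetical.

```python
import numpy as np

def sinkhorn_with_outliers(m, n_iter=500, tol=1e-12):
    """Alternating row/column normalisation of a positive match matrix.

    m is assumed to have shape (J + 1, K + 1): the last row and the last
    column are outlier slots. Ordinary rows are scaled to sum to one over
    all columns (including the outlier column), and ordinary columns are
    scaled to sum to one over all rows (including the outlier row). The
    outlier row and the outlier column themselves are never normalised,
    so the matrix does not need to be square.
    """
    m = np.array(m, dtype=float)  # work on a copy
    for _ in range(n_iter):
        prev = m.copy()
        # Normalise every ordinary row (the outlier row is skipped).
        m[:-1, :] /= m[:-1, :].sum(axis=1, keepdims=True)
        # Normalise every ordinary column (the outlier column is skipped).
        m[:, :-1] /= m[:, :-1].sum(axis=0, keepdims=True)
        if np.abs(m - prev).max() < tol:
            break
    return m

# Toy example (assumed data): 3 reference points, 5 match points.
rng = np.random.default_rng(0)
m0 = rng.uniform(0.1, 1.0, size=(4, 6))
m = sinkhorn_with_outliers(m0)
print(m[:-1, :].sum(axis=1))  # ordinary row sums, close to 1
print(m[:, :-1].sum(axis=0))  # ordinary column sums, close to 1
```

Because the slack entries absorb the surplus, a point set with more points than its counterpart can still satisfy both sets of constraints, which is why a non-square matrix poses no problem in this sketch.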
A comparative study of three methods for protein spot matching, two of which have been proposed here, has been conducted on spot data from a number of real 2DGE images.
A new method based on simultaneous segmentation and matching of protein spots has been proposed and is also the main contribution of this thesis. The method, Elastic Graph Matching, exploits the a priori knowledge of the spot locations and neighbourhood interrelations from the reference gel together with the grey level image information of the new 2DGE image. Most importantly, the prior image segmentation that is necessary for the point matching methods is not needed here.
Chapter 2
Motivation
Proteomics is an increasingly important part of cell biology and of the efforts to understand the basic principles of life, that is, how the living cell works. This chapter gives some basic introductory knowledge of proteomics and of the process of two-dimensional gel electrophoresis for protein separation, and it further explains the motivation for applying image analysis in the field of proteomics.
In proteome analysis, gel electrophoresis is a technique for separating the proteins in a biological sample on a gel. The resulting gel is captured as a digital image. This image is then analysed in order to quantitate the relative amount of each of the proteins in the sample in question, or to compare the sample with other samples or a database. The task of analysing the images can be tedious and is subjective (dependent on the human operator) if performed manually. The use of digital image analysis in the field of proteomics is primarily motivated by the need to improve speed and consistency in the analysis of two-dimensional electrophoresis gel (2DGE) images.
The most important issues and challenges related to digital image analysis of the gel images will be addressed, namely the segmentation of the images and the matching of corresponding protein spots.
2.1 Biological Background
Knowledge of the basic principles in proteome analysis and gel electrophoresis provides a good background for understanding the issues related to the image analysis part of the process, which is the main focus of this thesis. Readers familiar with the biological concepts and techniques may safely skip this part. Sections 2.2-2.4, where problems faced in image analysis are addressed, should still be of interest.
2.1.1 Proteome analysis
A short definition of proteome analysis is: identification, separation and quantitation of proteins. The first publication of the word proteome was in 1995 by Wasinger et al. [85], and Wilkins [89] defines the concept of proteome analysis:
Proteome Analysis: the analysis of the entire PROTEin complement expressed by a genOME, or by a cell or tissue type.
In other words, the proteome is the complete set of proteins that is expressed, and modified following expression, by the genome at a given time point and under given conditions in the cell. The proteome provides us with much more information about the working of the living cell than the genome does. The genome is static and essentially identical in all somatic cells of an organism [32], whereas the proteome is constantly changing, reflecting the cell environment and responding to both internal and external stimuli. The complete sequencing of the genome cannot tell much about the function of the cell, but analysis of the proteome can. The techniques focused on here are two-dimensional gel electrophoresis (2DGE) combined with mass spectrometry (MS); a general description of the methods for 2DGE and MS is given in Fey et al. [33]. A general introduction to the science of proteomics can be found in [1].
In the past years the extensive DNA sequencing efforts have provided hundreds of thousands of open reading frames in international databases. Unfortunately, a large proportion of this information has no or very little homology to any known protein. As one goes up the evolutionary tree this proportion increases (see Fig. 2.1), and even for one of the most extensively studied organisms, the relatively simple and humble baker's yeast (Saccharomyces cerevisiae), 63% of the genes have either no or only limited homology to known proteins.
[Figure 2.1 panels: Haemophilus influenzae (1.8 MB, 1,743 genes); Saccharomyces cerevisiae (12.5 MB, 6,482 genes); Homo sapiens (3,000 MB, ca. 45,000 genes). Legend: Known, Homologous, Unknown.]
Figure 2.1: Division of sequenced genes into known, homologous and unknown categories for three different organisms. Data from CPA.
Furthermore, even when the open reading frame has some homology and the protein's function can be guessed at, many questions remain unanswered. These are questions such as: Under what conditions is the protein expressed? Where is it expressed in the organism? Where in the cell is it used? How is its expression regulated? Is the protein's expression affected in diseases (e.g. cancer, cardiovascular, auto-immunity or inflammatory diseases)? Finding answers to these questions is basically the motivation for studying the function of the gene products, namely the function of the proteins. By analysing expression patterns of the proteins under different conditions, the function of particular genes can be determined and some of the questions posed above may be answered. In proteome analysis, the technique of two-dimensional gel electrophoresis (2DGE) enables biotechnologists to generate protein expression patterns that can be digitised into images and analysed. Proteome analysis can provide a shortcut to the identification of certain genes or groups of genes involved in, e.g., the development of severe illnesses. This is because the differences, quantitative and qualitative, in protein spot patterns between gels are related to the disease or treatment investigated.
Biological applications
Proteome analysis has a number of biological applications. Examples include: understanding of the basic principles of life; relating the genome and the environment to the organism's phenotype; drug development/evaluation (including toxicology and mechanism of action); disease prognosis, diagnosis, screening and monitoring of, e.g., diabetes, all types of cancer, cardiovascular diseases and many more; identification of new drug or vaccine targets; improvement of food quality; monitoring of environmental pollution; and prevention of micro-organism/parasite infections. For instance, in drug development, pharmaceutical companies spend large amounts¹ of resources on studying the drug effect in animal experiments. Some of these effects can be assessed by measuring changes in protein levels across different tissue samples.
2.1.2 Two-dimensional gel electrophoresis
Two-dimensional gel electrophoresis (2DGE) enables separation of mixtures of proteins according to differences in their isoelectric points (pI) in the first dimension, and subsequently by their molecular weight (MWt) in the second dimension, as sketched in Fig. 2.2. Other techniques for protein separation exist, but currently 2DGE provides the highest resolution, allowing thousands of proteins to be separated. For a review of the latest developments in the proteomics field, please refer to Fey and Mose Larsen [32], where 2DGE and some of the candidate technologies that could potentially replace it are presented along with their advantages and drawbacks. The great advantage of this technique is that it enables, from very small amounts of material, the investigation of the expression of thousands of proteins simultaneously. After protein separation an image of the protein spot pattern is captured. Proper detection and quantitation of the protein spots in the images and subsequent correct matching of the protein spot patterns not only allow for the comparison of two or more samples but also make the creation of an image database possible.
¹ The cost of developing one new drug compound amounts to 300 million USD [30].
[Figure 2.2 axes: horizontal pI (pH), example range 4 to 7; vertical MWt (kD), example range 10 to 250.]
Figure 2.2: Schematic two-dimensional electrophoresis gel. Proteins are separated in two dimensions; horizontally by iso-electric point (pI) and vertically by molecular weight (MWt). No proteins shown. The pI and MWt ranges are example values.
The changes in protein expression, for example in the development of cancer, are subtle: a change in the expression level of a protein by a factor of 10 is rare, and a factor of 5 is uncommon. Furthermore, few proteins change: usually fewer than 200 proteins out of 15,000 would be expected to change by more than a factor of 2.5. Multiple samples need to be analysed because of the natural variation, for example between individuals, and it is therefore necessary to be able to rely on perfect matching of the patterns in the new images. Even though promising attempts have been made [13] to make the technique as reproducible as possible, there are still differences in protein spot patterns from run to run. Also, due to improvements in the composition of the chemicals used to extract as many proteins as possible, the patterns become so dense (crowded) that locating the individual protein spots is a non-trivial task.
The laboratory process
The laboratory process as it is practised at CPA is roughly sketched in the following. Some steps have been left out; please refer to [33] for a detailed description. Given a biological sample of living cells, e.g., a biopsy or a blood sample, the process from the living cells to separated proteins on a gel will be explained. The procedure described here uses radioactive labelling, IPG for the first dimension, SDS polyacrylamide gels for the second dimension, and phosphor imaging to capture digital images of the protein patterns. Alternative visualisation methods will be described later in Section 2.1.2. The 1st dimension, incubation and 2nd dimension steps are illustrated in Fig. 2.4.
Figure 2.3: Two-dimensional electrophoresis gel images. (a) Gel 1. (b) Gel 2. Two different gels of baker's yeast Saccharomyces cerevisiae strain Fy1679-28C EC [pRS315].
Labelling. A radioactive amino acid is fed to the living cells, and all the proteins synthesised de novo may then contain the radioactive amino acid ([35S]-methionine) in place of the non-radioactive one. The radioisotope used for the labelling is typically [35S], but other radioisotopes, e.g., [32P] or [14C], can also be used. The radioactive labelling enables detection of the proteins later on. Duration: 20 hrs is the usual labelling interval, but this can be changed for specific purposes or situations.
Solubilisation. The cells' structures are broken down (killing the cells) and the proteins are dissolved in a detergent lysis buffer. The lysis buffer contains urea, thiourea, detergent (NP40 or CHAPS), ampholytes and dithiothreitol, all with the purpose of dissolving the proteins, unfolding them and preventing proteolysis. The actual procedure used depends on the sample itself and can take from less than 1 minute to 2 days.
1st dimension isoelectric focusing. On an immobilised pH gradient (IPG) gel, in a glass tube or on a plastic strip, the proteins are separated according to their isoelectric point (pI). An electric field is applied across the gel and the charged proteins start to migrate into the gel. The proteins are differently charged, and the electric field will pull each of them to the point where the pH cast into the IPG gel equals the pI of the protein, i.e., the pH value at which the numbers of positive and negative charges on the protein are the same. At this point no net electrical force is pulling the protein. See Fig. 2.4. Eventually all proteins will have migrated to their pI, their state of equilibrium. Duration: from 8 to 48 hrs depending on the pH range of the IPG gel, e.g., 17.5 hrs for IPG pH range 4-7.
Incubation. In the incubation step the 1st dimension gel is washed in a detergent, ensuring (virtually) the same charge per unit length on all proteins. Proteins are linear chains of amino acids. These fold up and can be cross-linked by disulphide bridges. The solutions used at CPA contain urea, thiourea and detergents, which cause the proteins to unfold into long random-coil chains. Duration: 2 × 15 min.
2nd dimension MWt separation. The incubated 1st dimension gel strip is positioned on the upper edge of a polyacrylamide gel slab. See Fig. 2.4. The second dimension acts like a molecular sieve, so that small molecules can pass more quickly than large ones. Again, an electric field is applied, this time in the perpendicular direction, and the proteins migrate into the gel. As all proteins now have the same charge per unit length, the same electrical force is pulling them. However, small (light) proteins meet less obstruction in the gel and will migrate through it with higher velocity. The larger proteins meet more resistance and migrate more slowly. Proteins with the same pI will migrate in the same column but will now be separated by molecular weight (MWt). As opposed to the 1st dimension process, the 2nd dimension has no equilibrium state, because the proteins keep moving as long as the electric field is applied. The small proteins reach the bottom of the gel first, and the process has to be halted before they migrate out of the bottom of the gel. Duration: approx. 16 hrs.
Drying etc. The gel is dried on a paper support, which requires some manual handling. Duration: 20 min.
Image generation. The dry gel is put in contact with a phosphor plate, which is sensitive to the emissions. The radiation from the labelled proteins excites the electrons of rare earth atoms in the plate at positions where protein is present in the gel. The larger the amount of protein present at a specific location in the gel, the more electrons in the plate will be excited at that location. The amount of radioactive protein in the samples can be quite small (at the picogram level); hence the level of radiation is also small and the time required to expose the phosphor plate is long. After exposure, the phosphor plates are read using phosphor imaging technology, where a laser beam excites the (already excited) electrons to an even higher energy state. The electrons return to their normal state while emitting electromagnetic radiation (light). A CCD chip captures the light and a digital image is generated. Exposure time: usually 5 days. Image capture: 1 minute.
Alternative image generation
The radioactive [35S]-methionine labelling described above is not the only technique for capturing images of the separated proteins, although it is the most sensitive. Older methods using X-ray film to capture the image are still used. Staining with Coomassie blue dye, silver or fluorescent dye can also visualise the proteins using spectroscopic techniques. Fig. 2.5 shows a comparison of four different visualisation methods on the same cell sample. Note how the [35S]-methionine labelling technique (top left) results in an image with much more detail than the other techniques; many more weak proteins are revealed. The most important methods for image generation are:
[Figure 2.4 panels: 1st DIMENSION (pH gradient from 8 to 3), INCUBATION, 2nd DIMENSION.]
Figure 2.4: Diagram of the 2DGE process. By courtesy of CPA.
X-ray autoradiography. Direct capture of irradiation on X-ray film by contact print (gel and film in contact).
X-ray fluorography. The gel is impregnated with PPO (2,5-diphenyloxazole) to amplify the signal. Contact print as above, but the gel and film have to be placed at -70 °C to speed up exposure.
Phosphorimager autoradiography. Contact print where the irradiation energy is captured by a rare earth complex (the irradiation lifts an electron into a higher, meta-stable orbital), and the plates are then discharged pixel by pixel in the phosphorimager by a laser. The laser lifts the electron to an even higher, unstable state. The electron falls back to its normal orbital and the combined energy (irradiation plus laser) is read.
Fluorescence. Monobromobimane binds covalently to cysteine (and in doing so becomes more strongly fluorescent) and is used to stain the proteins before electrophoresis. SyproRuby, which contains some rare earth elements, binds to proteins in the gel after electrophoresis.
Silver staining. Gels are chemically treated in a fashion similar to photography in order to bind silver atoms to the proteins.
Image analysis
The protein pattern differences between gel images can be very subtle and tedious to detect by eye, and digital image analysis is therefore a natural part of this process. By means of digital image analysis, speed and objectivity can be greatly improved. Still, most existing commercial software for analysis of two-dimensional electrophoresis gels requires a large amount of manual editing and correction of the spot segmentation and matching results. There is a need for development of better image segmentation and protein spot matching methods [83],[32], and this is the main motivation for this work. Sections 2.2, 2.3 and 2.4 present some of the issues and challenges that will be discussed in the remainder of the thesis.
2.2 Issues in Image Segmentation
Figure 2.5: Comparison of four different protein visualisation methods. The same sample (HeLa cells) is used for all four visualisation methods. Top left: [35S]-methionine labelled (2 mio. cpm). Top right: SyproRuby stained (100 µg protein). Bottom left: Monobromobimane labelled (100 µg protein). Bottom right: Silver stained. Only a small part of the gels is shown. By courtesy of CPA.
The segmentation of an electrophoresis gel image basically consists of distinguishing the protein spots from the background. There are, however, several issues that make the segmentation process non-trivial. In a typical gel image with 1900+ protein spots, the strongest third of the spots account for more than 75% of the total amount of protein in the sample and the weakest third of the spots account for less than 6% of the total protein amount. The distribution of protein is, in other words, very skewed, and Fig. 2.6 shows the histogram of the so-called percentage integrated intensity (%II, see Section C.2) for the protein spots in a gel image. For [35S]-methionine labelled proteins, %II is proportional to the amount of protein present, the rate of turnover of the protein and the number of methionine residues in the protein. A protein without methionine will not be detected irrespective of its amount. Similarly, a protein with a high rate of turnover will appear to be more abundant if the labelling interval is short (e.g., less than 2 hours). For bimane, %II is proportional to the number of cysteines. About 2% of human proteins do not have methionine and 8% do not have cysteine. For SyproRuby and for silver staining, the relation is not well defined. The integrated intensity (II) is calculated as the sum of pixel values inside the spot borders in an inverted gel image, i.e., when spots appear bright on a dark background. The %II for a spot is the II normalised by the sum of IIs inside all spots in the gel image.
The relatively large number of weak spots combined with a high spatial density of spots is one of the main challenges in the image segmentation. Accurate quantitation is very important because, as mentioned earlier, it is often subtle changes that are seen when comparing two samples from, for example, normal and cancerous tissue. To demonstrate how weak spots appear, Fig. 2.7 shows three small example gel regions from the same gel image. The top row shows the image regions, and in the second row the same regions are shown with spot centres overlaid. Note the high level of noise compared to the weak spots.
A second challenge in segmenting the image into (separate) spots and background is that overlapping spots are not a rare phenomenon. At CPA, mass spectrometry has shown that in standard gels covering the pH range from 4 to 7, more than 60% of the spots represent more than one protein. For this reason CPA is moving towards running gels covering single pH regions, e.g., 5.0-6.0. For this type of gel, it is known that only 5% of the spots have more than one protein present (with current sensitivities for the mass spectrometry). Fig. 2.8 displays three example regions with typical cases of neighbouring spots that are located so close that they overlap each other. Overlapping spots are naturally harder to detect (and separate) than isolated ones.
The intensity of the image background can vary across the image. A typical gel image with varying background is shown in Fig. 2.9. Intensity profiles are picked up along the horizontal (y = 250) and vertical (x = 200) lines and shown in Fig. 2.10. The trends in these lines generally show a larger background variation in the vertical direction and a higher background intensity at the edges of the gel than in the gel centre. The latter is probably due to a larger spot intensity in the centre of the gel.
Thus, the main challenges in segmentation of electrophoresis images are: noise / very weak (low intensity) spots, overlapping spots, and varying background.
[Figure 2.6 panels: histogram over %II and histogram over log(%II).]
Figure 2.6: Histograms of %II and log(%II) for a gel image with 1919 spots.
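The definition of II and %II given in this section can be sketched in a few lines. The sketch assumes that the segmentation is available as an integer label image (0 for background, 1..n for the spots) and that %II is expressed in percent; both are assumptions made for illustration, as are the function and variable names.

```python
import numpy as np


def percent_integrated_intensity(inverted_image, labels):
    """Return (II, %II) per spot.

    inverted_image: 2-D array in which spots are bright on a dark background.
    labels: integer array of the same shape, 0 = background, k > 0 = spot k
            (a hypothetical representation of the spot borders).
    """
    image = np.asarray(inverted_image, dtype=float)
    labels = np.asarray(labels)
    n_spots = int(labels.max())
    # Sum of pixel values inside each spot; index 0 (background) is dropped.
    ii = np.bincount(labels.ravel(), weights=image.ravel(),
                     minlength=n_spots + 1)[1:]
    # Normalise by the total II of all spots and express as a percentage.
    return ii, 100.0 * ii / ii.sum()


# Toy example: spot 1 covers four pixels, spot 2 a single bright pixel.
toy_image = np.array([[1, 2, 0],
                      [3, 4, 0],
                      [0, 0, 30]])
toy_labels = np.array([[1, 1, 0],
                       [1, 1, 0],
                       [0, 0, 2]])
ii, pct = percent_integrated_intensity(toy_image, toy_labels)
print(ii, pct)  # II = [10, 30], %II = [25, 75]
```

Note that background pixels do not enter the normalisation, so the %II values of all spots in one gel image always sum to 100.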
Chapter 3 will present some general segmentation techniques based on mathematical morphology and scale space blob detection. Furthermore, some parametric spot models are investigated.
Figure 2.7: Example images of gel regions with low signal-to-noise ratio, i.e., low intensity spots. Top row: regions 1, 2, and 3 from the same gel image. Bottom row: same regions with known spot centres overlaid. The regions are 100 × 100 pixels and the grey level range in each region has been scaled appropriately to improve visual inspection. Regions 1, 2, and 3 contain 14, 19, and 17 spots, respectively.
Figure 2.8: Example images of gel regions with overlapping spots. Top row: regions 1, 2, and 3 from the same gel image. Bottom row: same regions with known spot centres overlaid. The regions are 100 × 100 pixels and the grey level range in each region has been scaled appropriately to improve visual inspection. Regions 1, 2, and 3 contain 11, 8, and 19 spots, respectively.
Figure 2.9: Example image of a gel with typical varying background. Intensity profiles are picked up along the horizontal (y = 250) and vertical (x = 200) lines and shown in Fig. 2.10.
[Figure 2.10 panels: intensity I(y = 250) along the horizontal line and I(x = 200) along the vertical line.]
Figure 2.10: Intensity profiles along the horizontal and vertical lines, respectively, in Fig. 2.9.
2.3 Issues in Spot Matching
Spot matching is a central issue in electrophoresis [83] and is also the main focus of this work. The goal is to establish protein spot correspondence between gel images in order to detect changes in protein expression levels or discover new proteins that are only detected in one of the images. For comparison of protein levels across several gels a correct match or correspondence between the protein spots is necessary. In matching up the protein spot patterns from two gels it is necessary to solve the correspondence problem. The task is to determine the exact (correct) correspondence between known spots in a reference gel image and the spots in an incoming gel image with protein spots. The new incoming gel is referred to as the match gel. Fig. 2.11 shows a sketch of the correspondence concept.
Figure 2.11: Principal sketch of (partial) correspondences between protein spots in two gel images. In order to compare protein expression levels between two gels the correct correspondence between matching spots is necessary.
For the purpose of spot matching, the problem most often considered is that of matching points instead of spots, i.e., matching the spot centres instead of the entire spots. Also, the main focus will be on matching two sets of spots from two different gel images. In Fig. 2.12 two electrophoresis gel images are shown with known protein spot centres overlaid as point sets. These two point sets (or point patterns) are shown together in Fig. 2.13, where corresponding points are connected with small arrows. The arrows can be interpreted as a disparity field. This will be the standard way of displaying the point correspondence throughout the thesis.
Figure 2.12: Gel images with known spot centres overlaid as points. Left: gel A (1919 spots), right: gel B (1918 spots). 1918 spots in common, which means that one spot present in gel A is missing in gel B.
To the left in Fig. 2.13 the correspondence is shown when the point sets are simply plotted together. Clearly, a large contribution to the disparity field stems from a rotation and a translation, probably due to the handling of the gels. To the right, 20 landmarks have been hand-picked and the parameters of a first order polynomial transformation have been computed (see Chapter 4). The entire point set from gel B has been transformed according to the parameters found from the landmarks. The plot to the right shows the residual after this transformation (mainly a translation and a rotation) has been removed. The residual disparity field exhibits local, highly non-linear behaviour. Together with the high density of the points, this constitutes the main challenge in the point matching task at hand.
The gels in Fig. 2.12 contain different numbers of spots (1919 and 1918, respectively). One spot present in gel A is missing in gel B. If protein expression is very low, this can cause the spot not to show up in the gel. This situation of extra or missing spots is not unusual and is, in fact, very interesting from a biological point of view. Points occurring in only one of the gels will be referred to as outliers or singles, and a pair of matching spots is referred to as a spot pair.
As seen from Figs. 2.12 and 2.13, the point patterns to be matched possess no easily recognisable shape structure. Often, in other point pattern matching applications, the exact match of certain points is not important; instead, a good match of shape structures is sufficient.
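The initial alignment described above can be sketched as a least-squares fit. The sketch assumes that a first order polynomial transformation means x' = a0 + a1 x + a2 y (and likewise for y'), fitted to the landmark pairs. The synthetic point sets, the rotation angle, the choice of landmarks and the function names are hypothetical, and the estimation procedure used in the thesis is the one referred to in Chapter 4.

```python
import numpy as np


def fit_first_order(src, dst):
    """Least-squares first order polynomial map taking src points to dst points."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    design = np.column_stack([np.ones(len(src)), src])   # columns: 1, x, y
    coeffs, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return coeffs                                         # shape (3, 2)


def apply_first_order(coeffs, points):
    """Apply the fitted map to an (n, 2) array of points."""
    points = np.asarray(points, dtype=float)
    return np.column_stack([np.ones(len(points)), points]) @ coeffs


# Synthetic stand-in for two gels: gel B is gel A rotated and translated.
rng = np.random.default_rng(1)
gel_a = rng.uniform(0.0, 1000.0, size=(200, 2))
angle = np.deg2rad(3.0)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
gel_b = gel_a @ rotation.T + np.array([15.0, -8.0])

landmarks = np.arange(0, 200, 10)                 # 20 "hand-picked" pairs
coeffs = fit_first_order(gel_b[landmarks], gel_a[landmarks])

# Disparity field before alignment and residual field after alignment.
disparity = gel_a - gel_b
residual = gel_a - apply_first_order(coeffs, gel_b)
print(np.abs(disparity).max(), np.abs(residual).max())
```

In this synthetic case the residual vanishes up to rounding, because no local distortion was added; for real gels the residual field is what remains for the non-linear matching methods to explain.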
Figure 2.13: Known correspondence between spots in gel A and gel B. Corresponding spots are connected with small arrows. Left: Before initial alignment. Right: Residual after correction for 1st order polynomial transformation using landmarks. Landmarks are manually defined and marked with stars.
2.3.1 Properties of 2DGE spot patterns
It seems that some point patterns have more structure or shape than others. For example, the pattern of the letter A shown in Fig. 2.14(a) exhibits far more structure than the sub-pattern of a 2DGE spot pattern in Fig. 2.14(b). In texture analysis [18], the notion of more or less stochastic textures is used to describe how well-ordered the texture is. This terminology is adopted here for point patterns: point patterns can similarly be described as more or less stochastic, ranging from purely stochastic to purely deterministic. No quantitative measure for the degree of stochasticness is defined, but one could imagine an entropy-based measure to be suitable. In Fig. 2.14(a), the A is said to have a deterministic nature, and the spot pattern (Fig. 2.14(b)) is more stochastic or amorphous. The neighbour relations may be useful in describing the degree of stochasticness. In the A, each point has two or three natural neighbours, because the points form a shape. In the 2DGE case there is no such shape and all neighbours are equally important. It is hard to turn the idea into a quantitative measure, but the purpose of these remarks is to underline the differences in point patterns that make different point matching methods more or less suitable.
(a) Deterministic pattern. Disregard the circle and the triangle markers. From Belongie et al. [7].
(b) Stochastic pattern. 2D electrophoresis gel spot centres.
Figure 2.14: Deterministic and stochastic point patterns. Both patterns contain 100 points.
It seems reasonable that the more stochastic the pattern is (the less shape structure it has), the more difficult it is to obtain correct matches. Methods for protein spot matching should not specifically rely on the patterns being deterministic. Furthermore, when matching shape structures, it is usually an acceptable result when the shapes have a reasonable overlap. In the case of matching 2DGE spot patterns, a correct match of each point is a requirement. The above considerations lead to five main requirements of a spot matching method. It must be able to: exactly and robustly match protein pairs; allow for non-linear distortions/transformations; robustly handle outliers in both sets; handle point sets of stochastic/amorphous nature; and robustly match dense point sets.
In Chapter 4 a number of point matching methods are presented for general point matching purposes, as well as for matching protein spot patterns.
2.4 Unified Approach
As the previous sections have implied, there might be good reasons to combine the segmentation and the matching into one method, i.e., to find (locate) the spots in a new gel while simultaneously matching them up with spots already known in a reference gel. Chapter 5 will discuss such an approach.
2.5 Protein Pattern Databases
The large amount of gel data can be organised in image databases, and Fig. 2.15 shows an example of the construction of such a database. From a set of normal gels a normal composite gel is generated, and similarly for a set of gels representing diseased subjects. The composite gels are formed as the union of the contributing gels. From the normal and diseased composite gels the database gel is formed and marker proteins can be identified. One of the more important kinds of data often missing in most image databases is protein expression data (under given environmental growth conditions) [32], i.e., only gel images are presented. There also exist many other databases of biological data. These may include gene and protein sequence data, protein identification (unique protein code), protein function, theoretical values for the isoelectric point (pI) and the molecular weight (MWt), biochemical pathways, chemical data and the scientific literature, all of which can be very useful in interpreting the data from the 2D gel image databases.
[Figure 2.15 diagram: gels Normal (1), Normal (2), ..., Normal (N) form the NORMAL COMPOSITE and gels Disease (1), Disease (2), ..., Disease (M) form the DISEASE COMPOSITE. The two composites, the marker proteins and ADDITIONAL INFORMATION (e.g. protein identification, sequence, function, theoretical pI & MWt) make up the DATABASE. M and N are estimated from the variability coefficient V, built from the ratio Std_i / Av_i of IOD% over the spots i = 1, ..., Z and evaluated for 2...M;N gels, where Z is the number of spots per gel. M and N are chosen when V becomes constant.]
Figure 2.15: Construction of a 2D gel image database. By courtesy of CPA.
Chapter 3
Gel Segmentation
In a recognition system, a preprocessing step that segments the pattern of interest from the background, noise etc. usually precedes the actual recognition process [44], and the current task is no exception. The two-dimensional electrophoresis gel images show the expression levels of several hundreds of proteins, where each protein is represented as a blob-shaped spot of grey level values. In order to apply point pattern matching methods to the problem of matching spots from different images, each spot must be reduced to a pattern (e.g., a point, the spot centre). It is of crucial importance that the segmentation is correct in order to obtain correct quantitation of protein expression and a successful matching result. The matching becomes meaningless if the input is an erroneous segmentation. The segmentation task at hand consists of separating the image into background and spots, and the challenging parts are overlapping spots, varying background and a high level of noise in the images. Please refer to Section 2.2 for examples. Although the segmentation is an extremely important step, it is not the main focus of this thesis. Therefore, this chapter will only touch upon a few approaches to segmentation of gel images. These include methods based on mathematical morphology, parametric spot models and a Gaussian scale space blob detector.
Figure 3.1: Original ferrite nanoparticle image. Microscope image of nanoparticles.
To illustrate the methods, an image of nanoparticles with a number of distinct dark blobs will be used (Fig. 3.1). The particles are of relatively uniform size, intensity and shape. The nanoparticle image is overly simple compared to the 2DGE images and is used here for illustration purposes only. The almost constant background and the blobs of almost identical shape and intensity facilitate the illustration of ideas in the segmentation process.
3.1 Mathematical Morphology Based Methods
Mathematical morphology in image analysis is a vast field of research, and even though a full treatment is beyond the scope of this thesis, some selected topics will be discussed. Some of the earliest work on segmentation of electrophoresis gels using mathematical morphology is by Beucher et al. [11, 12], who proposed a watershed-based method for the segmentation of the images. These ideas are now well known and commonly used to segment images of electrophoresis gels. Other, more recent approaches [74, 81] deal with the problem of over-segmentation by using marker-controlled watersheds. Another technique from mathematical morphology is the so-called h-domes, which is a grey-scale reconstruction method. After a brief introduction to the method, some examples of gel image segmentation will be shown.
3.1.1 Watersheds
In geoscience terminology, a watershed line is the outline of a catchment basin, which in turn is an area of land that drains to a common point. When it rains on an area, all drops landing within the same watershed lines will eventually drain to the same point, the minimum of the catchment basin. Viewing grey-scale images as landscapes, i.e., as topographic reliefs where the pixel values represent the surface height, notions such as valleys, tops, ridges, catchment basins and watershed lines can be introduced. In the fields of image analysis and mathematical morphology, various methods using watersheds as a segmentation tool have emerged. One of the fastest techniques, developed by Vincent and Soille [82], is based on the immersion principle.
The immersion principle. Imagine a grey-scale image as a landscape with tops, valleys etc., where the pixel grey value corresponds to the terrain height, i.e., dark areas of the image (low pixel values) correspond to low altitude areas in the landscape and vice versa. Now pierce a hole in every local minimum of the surface and slowly immerse the landscape model in water. The water will enter through the pierced minima, starting with the global minimum. At some point, as the water level rises, the water from two neighbouring basins will meet. At the pixels where the water from the two neighbouring basins meets, a dam is raised. Continuing like this until the entire landscape is immersed in water results in a partitioning of the grey-scale image into a large number of catchment basins (as many as the number of local minima in the image). Each catchment basin associated with a local minimum is now bounded by dams, and these dams constitute the watershed lines.
Over-segmentation. A major disadvantage of watershed segmentation is its tendency to over-segment. A noisy image with many local minima will be segmented into a large number of regions. Ways to overcome this include marker controlled watersheds [81] and Gaussian scale space based multi-scale techniques [63]. The marker controlled watershed transform is a restricted form of the watershed method where holes are pierced in selected minima only (the markers). This way the number of catchment basins is controlled and over-segmentation is avoided. How to automatically choose a good set of markers is, however, seldom trivial. § 3.2 describes a blob detector that is suitable for this purpose in the case of segmentation of 2D electrophoresis gel images. § 3.4 contains experiments on watershed segmentation of electrophoresis images with and without markers.
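The effect of restricting the flooding to a set of markers can be illustrated with a minimal sketch in Python. It is not the algorithm of Vincent and Soille [82]; it is a simplified priority-queue flooding, under the assumptions that every pixel is assigned to a basin (no explicit dam pixels are kept) and that 4-connectivity is used. The function name `marker_watershed` and the toy profile are illustrative only.

```python
import heapq

def marker_watershed(image, markers):
    """Flood a grey-level image from labelled seed pixels.

    image   : list of rows of grey values (low value = basin floor)
    markers : dict {(row, col): label}, one seed pixel per wanted basin
    Returns a label image; each pixel carries the label of the marker
    whose flood reached it first.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    heap = []
    order = 0  # tie-breaker: pixels of equal height are handled first-in first-out
    for (r, c), lab in markers.items():
        labels[r][c] = lab
        heapq.heappush(heap, (image[r][c], order, r, c))
        order += 1
    while heap:
        # Always continue the flooding from the lowest pixel reached so far.
        _, _, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] == 0:
                labels[nr][nc] = labels[r][c]
                heapq.heappush(heap, (image[nr][nc], order, nr, nc))
                order += 1
    return labels

# Toy profile: two dark particles (columns 1 and 7) and a shallow
# noise minimum (column 3) that has no marker of its own.
profile = [[5, 1, 4, 3, 4, 8, 4, 1, 5]]
result = marker_watershed(profile, {(0, 1): 1, (0, 7): 2})
```

In the toy profile the shallow minimum at column 3 carries no marker, so it is absorbed into basin 1 instead of forming a basin of its own. This is how markers suppress over-segmentation.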
Figure 3.2: Example of watershed with and without markers. The goal is to find the dark circular nanoparticles. Top left: original image. Top right: watersheds of original image. This result is a severe over-segmentation. Bottom left: original image with markers obtained using the method described in § 3.2. Bottom right: result of marker controlled watershed segmentation. Each particle has its own catchment basin.
3.1.2
H-domes
Another morphology based method, the so-called h-dome transformation, is a technique to determine maximal structures in grey-scale images. Vincent [81] defines the h-dome image $D_h(I)$ of a grey-scale image I as
$$D_h(I) = I - \rho_I(I - h), \tag{3.1}$$
where $\rho_X(Y)$ is the grey level reconstruction of X from Y. The grey level reconstruction is obtained by iterative geodesic dilations of Y under X until stability is reached. Please refer to [73, 76, 81] for a general introduction to the subject.
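The definition in (3.1) can be made concrete with a small one-dimensional sketch in Python. It assumes a 3-sample structuring element for the elementary geodesic dilation and integer grey levels; the function names `reconstruct` and `h_domes` and the toy profile are illustrative and not taken from [81].

```python
def reconstruct(mask, marker):
    """Grey-level reconstruction of `mask` from `marker` (marker <= mask).

    Elementary geodesic dilations (3-sample dilation followed by a
    point-wise minimum with the mask) are iterated until stability.
    """
    current = list(marker)
    while True:
        updated = []
        for i in range(len(current)):
            lo, hi = max(i - 1, 0), min(i + 1, len(current) - 1)
            dilated = max(current[lo:hi + 1])
            updated.append(min(dilated, mask[i]))   # never exceed the mask
        if updated == current:
            return current
        current = updated

def h_domes(signal, h):
    """h-dome image D_h(I) = I - rho_I(I - h), cf. Eq. (3.1)."""
    reconstruction = reconstruct(signal, [v - h for v in signal])
    return [v - r for v, r in zip(signal, reconstruction)]

# Profile with a strong peak (value 10), a shoulder (plateau at 6)
# and a separate weak peak (value 4).
profile = [0, 2, 8, 10, 8, 6, 6, 3, 0, 4, 0]
domes = h_domes(profile, 3)
```

With h = 3 the strong peak and the isolated weak peak both give domes of height 3, while the shoulder gives no dome because it is not a regional maximum.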
The principle of h-dome extraction is demonstrated in Fig. 3.3, where the gel image has been inverted for illustrational purposes. Note that not all blobs are detected as h-domes. From the profile lines it is seen that the most intense spot on the horizontal line has a right shoulder stemming from a neighbouring, overlapping spot. This overlapping spot is not a regional maximum and therefore it is not an h-dome. Fig. 3.4 shows the h-dome extraction on the inverted gel image, and § 3.4 presents more experiments with different values of h on a two-dimensional electrophoresis gel image. The electrophoresis images are traditionally viewed as dark protein spots on a bright background. If preferred, the extraction of minimal structures (spots) can be done using the h-basins (the reverse of h-domes), as demonstrated by Horgan and Glasbey [43] on non-inverted electrophoresis gel images.
3.2
Scale Space Blob Detection
Protein spots in electrophoresis images may be detected using a blob detector because the spots possess blob characteristics. They do, however, vary in size, and therefore scale space theory is a natural framework for this task. This theory provides several feature detectors [56], such as corner, junction, ridge, and blob detectors. A thorough review of scale space theory is outside the scope of this thesis, so the reader is referred to [55, 77]. Blob detection can be formulated in terms of local extrema in the grey-level landscape (chapters 7 and 10 in [55]), e.g., as spatial minima in the Gaussian smoothed image, or as extrema in the Laplacian operator (chapter 13 in [55] and [14]). The present implementation uses the latter formulation, as described in [56] (p. 9, Eq. (18)). The presentation is formulated for continuous signals in two dimensions (2D). The concepts used here do, however, translate to discrete signals [54]. Given an image I and an isotropic 2D Gaussian $g(\mathbf{x}; t)$,
$$g(\mathbf{x}; t) = \frac{1}{2\pi t}\, e^{-\frac{x^2 + y^2}{2t}},$$
t is referred to as the scale parameter and L is defined as the convolution of I with $g(\mathbf{x}; t)$:
$$L(\mathbf{x}; t) = I * g(\mathbf{x}; t).$$
Figure 3.3: Principal sketch of h-dome extraction, h = 0.25. Top left: small region of two-dimensional electrophoresis gel image with horizontal profile line. Top right: intensity profile (thin line) and intensity profile minus h (bold line). Bottom left: grey level reconstruction (bold) and original intensity profile (dotted). Bottom right: h-domes result after subtraction. Note that not all blobs are detected as h-domes. It is seen that the most intense spot on the horizontal line has a right shoulder stemming from a neighbouring, overlapping spot. This overlapping spot is not a regional maximum and therefore it is not an h-dome.
Figure 3.4: H-dome extraction of two-dimensional electrophoresis gel image, h = 0.25. Top left: small region of two-dimensional electrophoresis gel image I with operator-specified ground truth spot centres (as specified in § C.2) overlaid. Top right: grey level reconstruction $\rho_I(I - h)$. Bottom left: h-domes of I, $D_h(I)$. Bottom right: $D_h(I) > 0.05$ with ground truth spot centres overlaid. Note how only major isolated spots are found, and overlapping spots are either not found or detected as a single spot.
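Scale-space derivatives in 2D at scale t are defined as the result of convolution with differentiated Gaussians:
$$L_{x^{\alpha}}(\cdot; t) = \partial_{x^{\alpha_x} y^{\alpha_y}} L(\cdot; t) = \left(\partial_{x^{\alpha_x} y^{\alpha_y}}\, g(\cdot; t)\right) * I,$$
where $\alpha = (\alpha_x, \alpha_y)$ is the order of differentiation. With the Laplacian of L, F:
$$F = \nabla^2 L = \Delta L = L_{xx} + L_{yy},$$
and the Hessian H of F:
$$H(F) = \begin{pmatrix} F_{xx} & F_{xy} \\ F_{yx} & F_{yy} \end{pmatrix} = \begin{pmatrix} L_{xxxx} + L_{xxyy} & L_{xxxy} + L_{xyyy} \\ L_{xxxy} + L_{xyyy} & L_{xxyy} + L_{yyyy} \end{pmatrix},$$
the determinant and the trace of H can be written as
$$\det H(F) = (L_{xxxx} + L_{xxyy})(L_{xxyy} + L_{yyyy}) - (L_{xxxy} + L_{xyyy})^2$$
and
$$\operatorname{tr} H(F) = L_{xxxx} + L_{yyyy} + 2 L_{xxyy}.$$
Critical points in the Laplacian of L, i.e., in F, are found as zero crossings of $F_x$ and $F_y$:
$$F_x = 0 \quad \text{and} \quad F_y = 0. \tag{3.2}$$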
To detect blobs it is furthermore required that the critical points are extrema:
$$\det H(F) > 0, \tag{3.3}$$
and, to select the extrema that correspond to dark blobs (intensity minima, which appear as maxima of F),
$$\operatorname{tr} H(F) < 0 \tag{3.4}$$
must be fulfilled. For this blob detector, the normalised strength measure for scale selection is [56] $t\,\nabla^2 L$. Large blobs have maximum response at large scales and vice versa. Therefore, the scale can be used to characterise the size of a blob. Lindeberg [55] provides an automatic scale selection based on maxima over scale in the normalised measure of blob strength.

Fig. 3.5 shows the steps of scale space blob detection of particles in the nanoparticle image. The top row shows, from the left, (a) L, the Gaussian blurred version of the original image, followed by the Laplacian of Gaussian, F, the determinant of the Hessian of F, and the trace of the Hessian of F. The second row displays (a) F with contour lines where $F_x = 0$ (blue) and $F_y = 0$ (red); (b) is a binary image of the critical points of F (where (3.2) is satisfied), i.e., the crossings of the contours in (a). The last two images in the second row show binary images of the conditions (3.3) and (3.4), respectively. The last row shows (a) the sum of the three binary condition images ((b), (c), (d) from the second row). The brightest values indicate positions where all three conditions hold, i.e., where blobs are detected. Image (c) in the last row shows a simple threshold of L used to remove unwanted blobs. The last image is the original image with the resulting blob markers overlaid.

This example is, however, only for a single fixed scale (t = 3), and the number of detected blobs depends greatly on the choice of scale. In this case of relatively uniformly sized particles, all particles seem to be detected correctly at the same scale. Fig. 3.6 shows how the number of detected blobs decreases as the scale increases. For electrophoresis gel images, however, the protein spots exhibit large variations in size, and automatic scale selection [57] is one approach that may be suitable. Linking of segments across multiple scales is another. Olsen et al. [63] study multi-scale watersheds and the structural changes as scale increases. Their scheme to link watershed segments across scales seems promising for segmentation of electrophoresis gel images. Scale space linking and automatic scale selection will, however, not be pursued further here. § 3.4 shows some examples of the described methods applied to electrophoresis images.
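The blob conditions and the role of the scale parameter can be checked on an idealised case. The sketch below assumes a single dark, unit-mass Gaussian blob of variance t0, which is an assumption made here for illustration and not a model used in this thesis. Because Gaussians form a semigroup under convolution, smoothing such a blob with g(·; t) gives a Gaussian of variance t0 + t, so F is available in closed form. The Hessian is estimated with central finite differences, and all function names are hypothetical.

```python
import math

def laplacian_of_blob(x, y, s):
    """F = Laplacian of L for L = -g(x; s), an ideal dark blob of variance s."""
    r2 = x * x + y * y
    g = math.exp(-r2 / (2.0 * s)) / (2.0 * math.pi * s)
    return g * (2.0 / s - r2 / (s * s))

def hessian_at(f, x, y, h=1e-3):
    """Central finite-difference estimates of f_xx, f_xy and f_yy at (x, y)."""
    fxx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / (h * h)
    fyy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / (h * h)
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h * h)
    return fxx, fxy, fyy

def normalised_strength(t, t0):
    """t times the Laplacian of L at the centre of a blob of variance t0."""
    return t * laplacian_of_blob(0.0, 0.0, t0 + t)

t0 = 4.0   # assumed blob variance, illustrative value

# Conditions (3.3) and (3.4) at the blob centre, observed at scale t = 3.
fxx, fxy, fyy = hessian_at(lambda x, y: laplacian_of_blob(x, y, t0 + 3.0), 0.0, 0.0)
det_h = fxx * fyy - fxy * fxy   # positive: the critical point is an extremum
tr_h = fxx + fyy                # negative: the extremum is a maximum of F

# Scale selection: the normalised response is largest at t = t0.
scales = [0.5 * k for k in range(1, 41)]   # t = 0.5, 1.0, ..., 20.0
best_scale = max(scales, key=lambda t: normalised_strength(t, t0))
```

Under these assumptions the centre of the dark blob satisfies both (3.3) and (3.4), and the normalised measure peaks at t = t0, so the selected scale reflects the blob size.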
Figure 3.5: Example of scale space blob detection on the nanoparticle image. Top row: (a) Original image blurred with a Gaussian filter (t = 3), L. (b) Laplacian of Gaussian, F. (c) det(H(F)). (d) tr(H(F)). Centre row: (a) Critical contours (3.2) of F, green: Fx = 0, red: Fy = 0. (b) Critical points of F. (c) Extrema condition (3.3), det(H(F)) > 0. (d) Minima condition (3.4), tr(H(F)) < 0. Bottom row: (a) All three necessary blob conditions superimposed. (b) Blob markers detected at this scale. (c) Simple threshold to remove non-particles. (d) Markers of the particles found, overlaid on the original image.
Figure 3.6: Scale space blob detection at 4 different scales. Scale t increases from top to bottom: t = 1, 3, 5, and 7. Left column: L(·; t). Centre column: blob markers at this scale. Right column: blob markers inside the threshold mask (see Fig. 3.5).
3.3
Parametric Spot Models
Parametric spot models rely on the assumption that the protein spots have some common characteristics that can be captured by a model. The idea is to define a suitable spot model $C(\mathbf{x}, \theta)$ and adjust the parameters $\theta$ of the model so that the model fits the data (image $I(\mathbf{x})$) by some measure. $I(\mathbf{x})$ is defined on a square region, w, around the protein spot in question, and the range of $\mathbf{x}$ is denoted the support of the model. More formally: given an image $I: \mathbb{R}^2 \to \mathbb{R}$ and a model $C(\mathbf{x}, \theta): \mathbb{R}^2 \times \mathbb{R}^p \to \mathbb{R}$, with $\mathbf{x} = (x, y)$ and $\theta \in \mathbb{R}^p$, the goal is to minimise the sum of squared residuals within the support region, w:
$$\hat{\theta} = \arg\min_{\theta} \sum_{\mathbf{x} \in w} \left( I(\mathbf{x}) - C(\mathbf{x}, \theta) \right)^2.$$
Attempts to model the protein spots parametrically are most commonly based on a two-dimensional Gaussian spot model [10]. This model has some deficiencies, which more advanced models, such as the diffusion based model proposed by Bettens [10], seek to overcome. A description of both models will be given, and in § 3.4 the models are fitted to a gallery of protein spots of various sizes and shapes.

In the optimisation of the parametric spot models, the sum of squared residuals between the image data and the spot model is minimised within the support region, w. Ideally, the support region should be large enough to contain the entire (or almost all of the) spot. It should not be too large either, since neighbouring spots will then be included in the support region. As the models can only model a single spot within the region, neighbouring spots should be avoided. The spots vary greatly in size, and without prior knowledge about the spot size it is difficult to determine the size of the support region for the spot in question. A weighting function (e.g., a Tukey window) could solve some of the difficulties with neighbouring spots within the support region, but the size of such a window still remains unknown. Also, overlapping spots are not an uncommon phenomenon, and in such cases both models must be expected to perform less well. This can possibly be solved by additive mixture models as suggested in § 3.3.3.

In the following, the (inverted) sub image (Fig. 3.7) of an electrophoresis gel image will be the source of real data examples. The image is shown with inverse grey levels so that spots appear bright on a dark background. The image contains a total of 61 protein spots, of which several are large, bright spots and quite a few are smaller and less distinct. Six spots have been selected to cover the range of strong, intermediate and weak spots, relatively isolated spots, spots with a dense neighbourhood and overlapping spots. The six spots will serve as examples in the following presentation of the two parametric spot models. Fig. 3.8 displays the spot gallery as mesh plots. Each mesh plot is centred around the spot in question, and the centre of this spot is marked with a small flag (a star on top of a vertical bar). The spot centre locations stem from the protein spot attribute information (§ C.2). It appears that the spot centres are not located exactly at the local maximum of the spot. This is due to the attribute data definition of the spot centre, which is not necessarily at the maximum grey level intensity.
Figure 3.7: Selected spots in 2D gel image. The ground truth attribute information (as defined in § C.2) about spot centres and spot areas is marked on the image with stars and circles. The neighbourhoods of the selected spots (#11, 18, 20, 36, 53 and 54) are marked with square regions. The selected spots cover the range of strong, intermediate and weak spots, relatively isolated spots, spots with a dense neighbourhood and overlapping spots. Note that spot #54 is the non-h-dome from Figs. 3.3 and 3.4.
3.3.1
Gaussian spot model
The two-dimensional anisotropic Gaussian spot model with background level has the form
$$C_G(\mathbf{x}, \theta) = B + c\, e^{-\frac{(x - x_0)^2}{2\sigma_x^2}}\, e^{-\frac{(y - y_0)^2}{2\sigma_y^2}} \tag{3.5}$$
with $\theta = (B, c, x_0, y_0, \sigma_x, \sigma_y)$. B is the background intensity level, i.e., the intensity level in the image around the spot. c is the spot height, $(x_0, y_0)$ are offset constants for the centre coordinates, $\sigma_x$ is the standard deviation in the x-direction and $\sigma_y$ is the standard deviation in the y-direction. Fig. 3.10 shows the Gaussian parametric model at four different configurations of the parameters. Bettens et al. [10] showed that this model is far too simple to model protein spots and proposed instead a diffusion model that attempts to model the diffusion process of protein migration in the 2D gel. The main drawback of the Gaussian model is its inability to model saturation of the spots. In order to be able to see very low intensity spots, the phosphor plate is sometimes exposed for a long period of time, resulting in saturation of the most intense spots.
Figure 3.8: Gallery of 6 different protein spots. Top row: spots #11, 18, and 20. Bottom row: spots #36, 53, and 54. Each mesh plot is centred around the spot in question, and the centre of this spot is marked with a small flag (a star on top of a vertical bar). The spots are viewed a little from above and from the bottom left. Furthermore, if other spots are present in the region, their centres are also marked with a flag (a vertical bar). Please compare to the corresponding square regions around each spot in Fig. 3.7. All meshes are scaled equally so that spot heights and sizes are comparable across plots. Note, e.g., how spot #18 is very weak and spot #11 is strong, again in agreement with Fig. 3.7.
Figure 3.9: Spot 11 from different viewpoints.
Figure 3.10: 2D anisotropic Gaussian parametric spot model shown at four different configurations of the parameters $\theta = (B, c, x_0, y_0, \sigma_x, \sigma_y)$. Top left: $\theta$ = (0, 1, 0, 0, 3, 3). Top right: $\theta$ = (0, 1, 5, 0, 3, 3). Bottom left: $\theta$ = (0, 1, 0, 0, 4, 2). Bottom right: $\theta$ = (0, 1, 0, 5, 2, 3).
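The Gaussian model (3.5) and the sum of squared residuals that is minimised in the fit can be sketched in a few lines of Python. The sketch only evaluates the model and the objective; it does not include an optimiser. The noise-free synthetic image, the support region centred on the origin, and the function names are assumptions made for illustration.

```python
import math

def gaussian_spot(x, y, theta):
    """Anisotropic Gaussian spot model of Eq. (3.5)."""
    B, c, x0, y0, sx, sy = theta
    return B + c * (math.exp(-(x - x0) ** 2 / (2.0 * sx ** 2))
                    * math.exp(-(y - y0) ** 2 / (2.0 * sy ** 2)))

def sample(theta, half_width):
    """Evaluate the model on a square support region centred on the origin."""
    coords = range(-half_width, half_width + 1)
    return [[gaussian_spot(x, y, theta) for x in coords] for y in coords]

def sum_squared_residuals(image, theta):
    """Objective of the fit: sum over the support region of (I - C)^2."""
    half_width = len(image) // 2
    model = sample(theta, half_width)
    return sum((i - m) ** 2
               for image_row, model_row in zip(image, model)
               for i, m in zip(image_row, model_row))

# Noise-free synthetic spot using the bottom-left parameters of Fig. 3.10.
true_theta = (0.0, 1.0, 0.0, 0.0, 4.0, 2.0)
image = sample(true_theta, 30)   # 61 x 61 support region

ssr_true = sum_squared_residuals(image, true_theta)
ssr_isotropic = sum_squared_residuals(image, (0.0, 1.0, 0.0, 0.0, 3.0, 3.0))
```

The objective is zero at the generating parameters and strictly positive for the isotropic parameter set, which is the property a least squares optimiser exploits when adjusting the parameters.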
3.3.2
Diffusion spot model
The diffusion protein spot model proposed by Bettens et al. [10] is designed to model the actual diffusion process in the gel. The model rests on the following assumptions about the process: 1) the medium of diffusion is two-dimensional and anisotropic, i.e., there are two main directions of diffusion (x and y) with different diffusion constants ($D_x$ and $D_y$), 2) the diffusing substance is initially distributed uniformly on a disc with radius a. Bettens gives the solution to the corresponding diffusion equation as
$$C_D(\mathbf{x}, \theta) = B + \frac{1}{2} C_0 \left[ \operatorname{erf}\!\left( \frac{a + r}{2\sqrt{Dt}} \right) + \operatorname{erf}\!\left( \frac{a - r}{2\sqrt{Dt}} \right) \right] - \frac{C_0}{r} \sqrt{\frac{Dt}{\pi}} \left[ e^{-\frac{(a - r)^2}{4Dt}} - e^{-\frac{(a + r)^2}{4Dt}} \right]$$
with
$$r = \sqrt{\frac{(x - x_0)^2}{D_x} + \frac{(y - y_0)^2}{D_y}} \tag{3.6}$$
where $\theta = (B, C_0, x_0, y_0, a, D, D_x, D_y, t)$. B is the background intensity level and $C_0$ is the initial concentration in the circle. $x_0$ and $y_0$ are offset constants for the centre coordinates. erf is the error function defined as
$$\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-u^2}\, du. \tag{3.7}$$
By elimination of symmetric parameters the final model reduces to
$$C_D(\mathbf{x}, \theta') = B + \frac{1}{2} C_0 \left[ \operatorname{erf}\!\left( \frac{a' + r'}{2} \right) + \operatorname{erf}\!\left( \frac{a' - r'}{2} \right) \right] - \frac{C_0}{r'} \frac{1}{\sqrt{\pi}} \left[ e^{-\frac{(a' - r')^2}{4}} - e^{-\frac{(a' + r')^2}{4}} \right]$$
with
$$r' = \sqrt{\frac{(x - x_0)^2}{D'_x} + \frac{(y - y_0)^2}{D'_y}}. \tag{3.8}$$
Following [10], the 7 parameters in (3.8) to be estimated are $\theta' = (B, C_0, x_0, y_0, a' = a/\sqrt{Dt}, D'_x = D_x t, D'_y = D_y t)$.
To give an impression of the model's capabilities, Figs. 3.11 and 3.12 show the model for a number of different parameter settings. In § 3.4 the two parametric models are tested on the six spots in the spot gallery (Fig. 3.8).
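The nine-parameter form (3.6) can be evaluated directly with the standard error function. The following Python sketch is an illustration only: the function name `diffusion_spot` and the handling of r close to zero by the analytic limit of the exponential term, $C_0\, a\, e^{-a^2/(4Dt)} / \sqrt{\pi D t}$, are choices made here and are not taken from [10].

```python
import math

def diffusion_spot(x, y, theta):
    """Diffusion spot model of Eq. (3.6)."""
    B, C0, x0, y0, a, D, Dx, Dy, t = theta
    s = math.sqrt(D * t)                                   # sqrt(D t)
    r = math.sqrt((x - x0) ** 2 / Dx + (y - y0) ** 2 / Dy)
    erf_term = 0.5 * C0 * (math.erf((a + r) / (2.0 * s))
                           + math.erf((a - r) / (2.0 * s)))
    if r < 1e-9:
        # Analytic limit of the exponential term for r -> 0 (avoids 0/0).
        exp_term = (C0 * a / (s * math.sqrt(math.pi))
                    * math.exp(-a * a / (4.0 * s * s)))
    else:
        exp_term = (C0 / r) * (s / math.sqrt(math.pi)) * (
            math.exp(-(a - r) ** 2 / (4.0 * s * s))
            - math.exp(-(a + r) ** 2 / (4.0 * s * s)))
    return B + erf_term - exp_term

# Parameters of Fig. 3.11, (B, C0, a, D, Dx, Dy, x0, y0) = (0, 1, 5, 1, 1, 1, 0, 0),
# at the smallest and the largest value of t shown there.
early = (0.0, 1.0, 0.0, 0.0, 5.0, 1.0, 1.0, 1.0, 0.1)
late = (0.0, 1.0, 0.0, 0.0, 5.0, 1.0, 1.0, 1.0, 1.5)

centre_early = diffusion_spot(0.0, 0.0, early)
centre_late = diffusion_spot(0.0, 0.0, late)
```

For small t the profile stays at the level C0 well inside the disc of radius a, which gives the flat top of a saturated spot. As t grows the centre value drops and the edges soften.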
Figure 3.11: 2D diffusion spot model. Increasing t from top left (t = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5), other parameters constant: (B, C0, a, D, Dx, Dy, x0, y0) = (0, 1, 5, 1, 1, 1, 0, 0).
Figure 3.12: 2D diffusion spot model at different parameter configurations. The spot centre offset is constant, (x0, y0) = (0, 0), as is the time, t = 1, for all plots. The parameters for each plot are listed in Tab. 3.1. Note how this model is able to model saturated spots.
Plot   B     C0    a     D     Dx    Dy
1      0.0   1.0   3.0   1.0   1.0   1.0
2      0.6   1.0   3.0   1.0   1.0   1.0
3      0.0   1.0   3.0   2.0   1.0   1.0
4      0.0   1.5   5.0   1.0   1.0   1.0
5      0.0   1.0   3.0   3.0   2.5   1.0
6      0.0   1.0   5.0   1.0   2.5   2.5
7      0.0   1.0   5.0   1.1   0.5   2.5
8      0.0   1.5   5.5   1.1   2.5   2.5
9      0.0   1.0   5.5   1.5   2.5   2.5
Table 3.1: Parameters corresponding to the plots in Fig. 3.12. Rows 1-3 in the table correspond to the 3 plots in the top row, rows 4-6 to the 3 plots in the centre row, and rows 7-9 to the 3 plots in the bottom row.
3.3.3
Mixture model
The parametric spot models described so far do not handle overlapping spots in a well-defined manner. In general, the presence of other spots within the support region of a spot is not modelled. This may be overcome by simple superposition of diffusion spot models, resulting in a mixture model of the form
$$C_m(\mathbf{x}, \Theta) = k_1 C_1(\mathbf{x}, \theta_1) + k_2 C_2(\mathbf{x}, \theta_2) + \ldots + k_n C_n(\mathbf{x}, \theta_n), \tag{3.9}$$
where n is the number of diffusion spot models that constitute the mixture model and $\Theta$ is an n × 7 parameter matrix (7 parameters for each spot). The sum of squared residuals inside the support region, w,
$$\sum_{\mathbf{x} \in w} \left( I(\mathbf{x}) - C_m(\mathbf{x}, \Theta) \right)^2 \tag{3.10}$$
is to be minimised, and by including enough spot models, i.e., for sufficiently large n, (3.9) can model the data perfectly. Therefore, it is necessary to penalise more complex models (large n) in the optimisation of the model. The goal is to estimate n (the number of spots inside the support region, w) and the n × 7 parameters in $\Theta$ so that (3.10) is small while the model is not too complex. The following optimisation problem is proposed:
$$\hat{\Theta} = \arg\min_{\Theta} \sum_{\mathbf{x} \in w} \left( I(\mathbf{x}) - C_m(\mathbf{x}, \Theta) \right)^2 + \lambda_1 e^{\lambda_2 n}, \tag{3.11}$$
where the last term serves to penalise large models. $\lambda_1$ and $\lambda_2$ are constants that must be chosen according to the problem at hand.
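The trade-off expressed by (3.11) can be illustrated with a small Python sketch. Several simplifications are assumed: the data are one-dimensional, Gaussian components stand in for the diffusion spot models, all weights k_i are set to 1, the candidate parameter sets are picked by hand rather than found by optimisation, and the values of the two penalty constants are hypothetical.

```python
import math

def component(x, c, x0, sigma):
    """One spot component; a 1D Gaussian stands in for the diffusion model."""
    return c * math.exp(-(x - x0) ** 2 / (2.0 * sigma ** 2))

def mixture(x, params):
    """Superposition of components, cf. Eq. (3.9) with all k_i = 1."""
    return sum(component(x, *p) for p in params)

def penalised_cost(data, params, lam1, lam2):
    """Eq. (3.11): squared residuals plus the penalty lam1 * exp(lam2 * n)."""
    ssr = sum((d - mixture(x, params)) ** 2 for x, d in enumerate(data))
    return ssr + lam1 * math.exp(lam2 * len(params))

# Synthetic profile made of two overlapping spots (height, centre, width).
truth = [(1.0, 12.0, 3.0), (0.6, 20.0, 3.0)]
data = [mixture(x, truth) for x in range(33)]

# Hand-picked candidate models with n = 1, 2 and 3 components.
candidates = {
    1: [(1.0, 14.0, 5.0)],                 # one broad spot
    2: truth,                              # the two generating spots
    3: truth + [(0.0, 16.0, 3.0)],         # superfluous third spot of height 0
}
lam1, lam2 = 0.01, 1.0                     # hypothetical penalty constants
costs = {n: penalised_cost(data, p, lam1, lam2) for n, p in candidates.items()}
best_n = min(costs, key=costs.get)
```

The candidates with n = 2 and n = 3 both reproduce the data, so only the penalty separates them and the smaller model is preferred. The single broad spot has the smallest penalty but a residual large enough to rule it out.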
3.4
Experiments and Results
This section presents some experiments and results from applying the methods described in the previous sections to a sub image (Fig. 3.7) and the selected spots (Fig. 3.8). First, some results of scale space blob detection on the sub image are presented, and these are then used in marker based watershed segmentation. The same sub image is used in experiments with the h-dome transformation at different values of h. Finally, the parametric spot models (2D anisotropic Gaussian and 2D anisotropic diffusion) are fitted to the selected spots in the spot gallery.
3.4.1
Scale space blob detection
For the purpose of scale space blob detection the view of the spots is inverted so that spots appear dark on a bright background. Fig. 3.13 shows the same region as Fig. 3.7 with the 61 known protein spot centres overlaid. Figs. 3.14-3.17 show the results of scale space blob detection at scales t = 1, 2, 3, 4, 5, 6, 7, and 8. The number of detected distinct blobs decreases as the scale increases (Tab. 3.2). At low scales the image noise dominates the blobs found. With respect to the number of blobs detected, the best result is in the top row of Fig. 3.17 (at t = 7), where 57 blobs are found. Compared to Fig. 3.13, most of the spot centres are found, even many of the very weak ones. Note, however, that even though the number of detected blobs is close to the correct number of proteins, quite a few spots are still not detected. Conversely, the method detects some blobs that are not proteins. Some of the very weak spots, and spots located close to other spots, are detected only at a very low scale where far too many local minima are found. In other words, there exist minima stronger than the weakest spots, and this fact makes the spot detection non-trivial.
scale, t            1     2     3     4     5     6     7     8
Number of blobs  4313   972   261   103    81    72    57    36
Table 3.2: Number of detected blobs at different scales.
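The comparison with the 61 known spots can be reproduced from the counts in Tab. 3.2 with a few lines of Python. The variable names are illustrative. Note that this only compares the number of blobs, not whether the detected blobs coincide with the known spot centres.

```python
# Number of detected blobs per scale t, copied from Tab. 3.2.
blobs_by_scale = {1: 4313, 2: 972, 3: 261, 4: 103, 5: 81, 6: 72, 7: 57, 8: 36}
known_spots = 61

# Scale whose blob count is closest to the known number of spots.
best_scale = min(blobs_by_scale, key=lambda t: abs(blobs_by_scale[t] - known_spots))
count_error = abs(blobs_by_scale[best_scale] - known_spots)
```

The closest count is obtained at t = 7, which differs from the known number of spots by 4.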
Figure 3.13: Sub region of electrophoresis gel. Spot centres of 61 known protein spots are marked.
3.4.2
Marker based watershed segmentation
In Fig. 3.18 (left), the known spot centres from the attribute data are used as markers for the segmentation. The right panel shows the segmentation using the scale space blob detection markers (t = 6) from the experiments in § 3.4.1.
3.4.3
H-dome transformation
This section shows experiments with the h-dome transformation (§ 3.1.2) on an electrophoresis gel image. Again the (inverted) gel image is used, and Fig. 3.19 shows the h-domes for different values of h.
3.4.4
Parametric spot models
The two parametric protein spot models described in § 3.3 have both been optimised to fit the 6 spots in the spot gallery (see Fig. 3.8). The optimal fits were obtained using the non-linear least squares method lsqnonlin in the Matlab Optimization Toolbox (The MathWorks, Inc.). Figs. 3.20-3.31 show the results of fitting the Gaussian model and the diffusion model to the six spots in the gallery. The figures alternate: the even-numbered figures show the Gaussian model and the odd-numbered figures the diffusion model.
Figure 3.14: Scale space blob detection at different scales. Left column: Gaussian filtered versions of the original image (Fig. 3.13). Right column: original image with blob markers. Top: t = 1, 4313 distinct blobs were found. Bottom: t = 2, 972 blobs. The correct number of protein spots in this part of the gel is 61.
Figure 3.17: Scale space blob detection at different scales. Left column: Gaussian filtered versions of the original image (Fig. 3.13). Right column: original image with blob markers. Top: t = 7, 57 distinct blobs. Bottom: t = 8, 36 blobs. The correct number of protein spots in this part of the gel is 61.
Figure 3.18: Watershed segmentation using different markers. Left: markers are the spot centres from the ground truth attribute information (see § C.2). Right: markers are the results of scale space blob detection at t = 6.
Figure 3.19: H-dome transformation of 2D gel image. From the top left: h = 0.05, 0.15, 0.25, 0.35, 0.45, and 0.50.

The gallery figures (Figs. 3.7 and 3.8) show the spot centres of the neighbouring spots within the square support region. The support region size has been chosen to be the same (61 × 61 pixels) for all spots in the gallery. The fit of a model to the spot data can be evaluated in different ways. One is by some quantitative measure, such as the sum of squared differences inside the support region. Another is by qualitative visual inspection. Defining a good quantitative measure is, however, not trivial. The reason is that an ideal (single spot) model fits nicely to the spot and ignores any surrounding neighbours inside the support region, and therefore the residual may (correctly) be large in areas of the support region where neighbours are present. This problem is closely connected to the size of the support region. Clearly there is a need for such a quantitative evaluation, but the following evaluation will be confined to qualitative visual inspection.

Each of the figures consists of five components. At the top is the square region around the known spot centre, displayed as a grey level image. To the left in the second row the same information is shown as a wire-frame surface. To the right in the second row the resulting optimised model is shown, also as a wire-frame surface. The bottom row displays the residual (pixel-wise difference between data and model) in different ways. To the left, the residual is shown as a wire-frame surface together with the zero-plane as reference. To the right, the spot data is shown as a wire-frame surface. Together with the data, the model is displayed as a coloured solid surface. The colouring represents the (signed) residuals. Note the colour at intersections between the data and the model (zero error).

Spot 11 (Figs. 3.20 and 3.21). This spot is relatively large and intense. Within the shown region it has 4 neighbouring spots that are all weaker than spot 11. Some of the neighbours are overlapping. From the wire-frame surface of the data (second row, left) it is seen that the spot appears to be flat on the top, which indicates that the spot is saturated. Note how the diffusion model is much better at capturing this behaviour.

Spot 18 (Figs. 3.22 and 3.23). The spot is so small and weak that it almost vanishes. Furthermore, it has 4 (all stronger) neighbouring spots. Neither model is able to detect the spot, and both suggest a flat surface. The restrictions on x0 and y0 prevent the models from moving to one of the stronger neighbours.

Spot 20 (Figs. 3.24 and 3.25). This spot is of medium size and intensity. It has four neighbours, two strong and two weak. The spot is oblong in the x-direction, which may be due to overlapping neighbours. The Gaussian model seems to be more influenced by this than the diffusion model.

Spot 36 (Figs. 3.26 and 3.27). This is a relatively large spot of medium intensity. The spot is relatively isolated, i.e., neighbours are few, small and far away. Both models seem to do well in this case.

Spot 53 (Figs. 3.28 and 3.29). This spot of medium strength is small and well separated from neighbouring spots. Both models fit this spot well.

Spot 54 (Figs. 3.30 and 3.31). This spot is of medium size and intensity, but it has a relatively large and intense neighbour spot to the left. The two spots overlap severely, and both models experience difficulty modelling the spot correctly. Another strong neighbour to the right also overlaps, although it is not in the support region. Both models try to model all three spots as one big spot.
Figure 3.20: Parametric fit of the Gaussian spot model to spot 11. Top: spot in question shown as grey-scale image. Second row: left, spot data shown as wire-frame; right, fitted model. Bottom row: left, residual shown with zero-level plane; right, spot data shown as wire-frame and the fitted model shown as a solid surface, coloured according to the residual.
Figure 3.21: Parametric fit of the diffusion spot model to spot 11. Top: spot in question shown as grey-scale image. Second row: left, spot data shown as wire-frame; right, fitted model. Bottom row: left, residual shown with zero-level plane; right, spot data shown as wire-frame and the fitted model shown as a solid surface, coloured according to the residual.
3.5
Summary
Although not the primary focus of this work, various approaches to segmentation of two-dimensional electrophoresis gel images into protein spots and background have been presented. This is motivated by the fact that a good segmentation is fundamental to the point matching methods described in Chapter 4. Summarising the methods described:

Mathematical morphology based methods. The watershed by immersion method and marker based watersheds are presented and applied to a nanoparticle image. The h-domes method is explained and applied to a two-dimensional electrophoresis image.

Scale space based blob detection. Lindeberg's scale space blob detector is described and applied to the nanoparticle image.

Parametric protein spot models. Two models, a Gaussian and a diffusion based model, are presented and their properties explored.

The methods presented are all tested on real gel image data. The watershed method is known to result in over-segmentation if not restricted by markers. Together with the scale space blob detector, blobs can be segmented, but the choice of scale remains non-trivial. It has been demonstrated that even if the scale is chosen so that the number of detected blobs is close to the correct number of proteins, quite a few spots are still not detected, and conversely other blobs that are not proteins are detected as blobs. This is probably due to the noisy nature of the images. Some of the very weak spots, and spots located close to other spots, are detected only at a very low scale where far too many local minima are found. In other words, there exist minima in the image that are stronger than the weakest (known) spot. This fact makes spot detection based on the image data alone challenging.

The parametric spot models seem to have most success at modelling relatively isolated spots. As expected, the diffusion model is better at modelling highly intense (saturated) protein spots, whereas for smaller spots the two models perform almost equally well. The main deficiency of the spot models is that they are not able to model multiple and overlapping spots within a small area. The choice of support region size also proved difficult without prior knowledge of the spot size. A mixture model was proposed as a solution to the problems of the parametric models, but it was not implemented.
Chapter 4
Point Pattern Matching
This chapter describes a wide range of point matching methods for general point pattern matching as well as a number of methods applied to spot matching in electrophoresis gel images. The motivation for investigating point pattern matching methods is its application in proteome analysis. Since the process of matching the protein spots is a problem of solving a correspondence problem, the eld of point pattern matching methods seems appropriate to investigate. This chapter is divided into three main parts. First, general methods for point pattern matching are described, later, methods developed especially for the matching of point patterns from electrophoresis gel images are investigated, and nally a comparison of selected methods tested on real point patterns from electrophoresis gel data is presented. Point pattern matching is a fundamental step in many tasks in the elds of image registration and warping [34, 38, 45], motion estimation, matching of shapes [6], shape analysis [15, 26, 28], structure from motion estimation [39, 80], and many more. Other terms to describe the process of point pattern matching are to solve the correspondence or solving the optimal assignment problem. The area of applications for point matching includes: biology, medicine, astrophysics, robotics, and computer vision, to mention a few.
In the application of point pattern matching to protein spot centres there are a number of requirements for the matching method used (as identified in Section 2.3). The method must: exactly and robustly match protein pairs, allow for non-linear distortions/transformations, robustly handle outliers in both sets, be able to handle point sets of stochastic/amorphous nature, and robustly match dense point sets.
4.1 Notation

Notions of point patterns, point correspondence, match matrices and motion are introduced.

Definition 1 (Point pattern) Define a two-dimensional point pattern $P = \{p_i \mid i = 1, 2, \ldots, J\}$, where $p_i = (x_i, y_i)$ are the coordinates of the $i$th point in the x-y plane. The pattern, also referred to as a point set, is said to have cardinality $J$. □
4.1.1 Correspondence and match matrices

The comparison of two point patterns necessitates a way to define the correspondence between homologous points in the two patterns:

Definition 2 (Point correspondence) Given two point sets $P$ and $Q$, two homologous points $p_j$ and $q_k$ are said to correspond:

\[ p_j \sim q_k, \tag{4.1} \]

i.e., the $j$th point in $P$ corresponds to the $k$th point in $Q$. The symbol $\sim$ is normally used to describe directly similar figures, i.e., when all corresponding angles are equal and described in the same rotational sense. □

For the entire patterns $P$ and $Q$ the correspondence between their points can be described in the binary match matrix:
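Definition 3 (Binary match matrix) Consider two point patterns $P$ and $Q$ of equal cardinality $J$. The $J \times J$ binary match matrix $m$ defines the correspondence between homologous points in the two point patterns:

\[ m_{jk} = \begin{cases} 1 & \text{if point } p_j \sim q_k \\ 0 & \text{otherwise} \end{cases} \qquad \forall j \le J,\ k \le J \tag{4.2} \]

subject to the constraint that there is a one-to-one mapping from $P$ to $Q$:

\[ \forall j \le J: \quad \sum_{k=1}^{J} m_{jk} = 1 \tag{4.3} \]

and

\[ \forall k \le J: \quad \sum_{j=1}^{J} m_{jk} = 1. \tag{4.4} \]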
In other words, each point in either set corresponds to exactly one point in the other set, i.e., all rows and all columns of $m$ must sum to one. In this definition, it is not necessary to allow for missing points or for multiple-to-one correspondences. □

In many applications, points present in one set may be missing in the other and vice versa. The cardinalities may be intrinsically different or points may be missing. In such cases the one-to-one correspondence does not hold and other ways of describing the correspondence between homologous points must be defined. Points that do not have a corresponding partner point in the other pattern are denoted singles. Singles may be due to noise or to the two point sets being only partially overlapping.

Definition 4 (Single point) Consider two point patterns $P$ and $Q$. A single point is a point in a point pattern that does not have a homologous point in the other pattern. Points of this type are often also referred to as singles, outliers, contaminating points or dropouts. □

In order to represent correspondences and singles in the same data structure the augmented binary match matrix, $\hat{m}$, is defined.

Definition 5 (Augmented binary match matrix) Consider now two point patterns $P$ and $Q$ of possibly different cardinalities $J$ and $K$. The augmented match matrix $\hat{m}$ is a $(J + 1) \times (K + 1)$ matrix defining the state of each point in the two point patterns $P$ and $Q$. $\hat{m}$ is similar to $m$ in Def. 3 except that $\hat{m}$ has an extra row and an extra column to hold the information on single points. Again,

\[ \hat{m}_{jk} = \begin{cases} 1 & \text{if point } p_j \sim q_k \\ 0 & \text{otherwise} \end{cases} \qquad \forall j \le J,\ k \le K \tag{4.5} \]

and for the last column (singles in $P$)

\[ \hat{m}_{j(K+1)} = \begin{cases} 1 & \text{if point } p_j \text{ is a single} \\ 0 & \text{otherwise} \end{cases} \qquad \forall j \le J \tag{4.6} \]

and similarly for the last row (singles in $Q$)

\[ \hat{m}_{(J+1)k} = \begin{cases} 1 & \text{if point } q_k \text{ is a single} \\ 0 & \text{otherwise} \end{cases} \qquad \forall k \le K \tag{4.7} \]

subject to the constraint that either a point has a corresponding point in the other set or the point is a single:

\[ \forall j \le J: \quad \sum_{k=1}^{K+1} \hat{m}_{jk} = 1 \tag{4.8} \]

and

\[ \forall k \le K: \quad \sum_{j=1}^{J+1} \hat{m}_{jk} = 1. \tag{4.9} \]

Note that there are no constraints on the $(J + 1)$th row or the $(K + 1)$th column because there is no restriction on the number of singles in either point set. Also, $\hat{m}_{(J+1)(K+1)}$ is insignificant (it is not used) and can take any value. □
Example 1 (Two small point patterns) Fig. 4.1 shows two small point patterns $P$ and $Q$ with 5 and 7 points respectively. The two sets have 4 points in common, $P$ has 1 single and $Q$ has 3 singles. Eq. (4.10) shows the corresponding augmented match matrix:

\[ \hat{m} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 & -
\end{pmatrix} \tag{4.10} \]

□
Figure 4.1: Two small point patterns $P$ and $Q$ (marked +) with 5 and 7 points respectively. The two sets have 4 points in common, $P$ has 1 single and $Q$ has 3 singles. The correspondence is defined by $\hat{m}$ in eq. (4.10).
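The augmented binary match matrix of Def. 5 can be illustrated with a small sketch. The helper names `augmented_match_matrix` and `satisfies_constraints` are hypothetical and introduced only for this illustration; the pair list is read off eq. (4.10), and the unused corner entry is simply left at zero.

```python
def augmented_match_matrix(J, K, pairs):
    """Build the (J+1) x (K+1) augmented binary match matrix.

    pairs holds 1-based (j, k) tuples meaning p_j corresponds to q_k.
    Points that appear in no pair are flagged as singles.
    """
    m = [[0] * (K + 1) for _ in range(J + 1)]
    for j, k in pairs:
        m[j - 1][k - 1] = 1
    matched_p = {j for j, _ in pairs}
    matched_q = {k for _, k in pairs}
    for j in range(1, J + 1):
        if j not in matched_p:
            m[j - 1][K] = 1      # last column: singles in P
    for k in range(1, K + 1):
        if k not in matched_q:
            m[J][k - 1] = 1      # last row: singles in Q
    return m


def satisfies_constraints(m):
    """Check (4.8) and (4.9): the J top rows and K leftmost columns sum to one."""
    J, K = len(m) - 1, len(m[0]) - 1
    rows_ok = all(sum(m[j]) == 1 for j in range(J))
    cols_ok = all(sum(m[j][k] for j in range(J + 1)) == 1 for k in range(K))
    return rows_ok and cols_ok


# Example 1: four correspondences; p3, q2, q5 and q7 end up as singles.
m_hat = augmented_match_matrix(5, 7, [(1, 1), (2, 3), (4, 4), (5, 6)])
for row in m_hat:
    print(row)
print(satisfies_constraints(m_hat))
```

Storing singles in the extra row and column keeps the row and column constraints uniform, which is what the later normalisation steps rely on.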
Solving the correspondence problem is a matter of establishing the correct correspondence between points in two point sets, i.e., of estimating the augmented match matrix. This process is referred to as matching. When computing correspondences between point patterns, it may be beneficial to assign non-binary values to the correspondence estimates, i.e., to define a fuzzy correspondence matrix. In order to be able to handle partial correspondences the augmented fuzzy match matrix, in which entries are real numbers between 0 and 1, is defined.

Definition 6 (Augmented fuzzy match matrix) Given two point patterns $P$ and $Q$ of possibly different cardinalities $J$ and $K$, the elements of the augmented fuzzy match matrix $\hat{m}$ equal the probabilities¹ of correspondence or single status. For the inner part of $\hat{m}$,

\[ \hat{m}_{jk} = P\{p_j \sim q_k\}, \qquad \forall j \le J,\ k \le K, \]

for the last column (singles in $P$),

\[ \hat{m}_{jk} = P\{p_j \text{ is single}\}, \qquad \forall j \le J,\ k = K + 1, \]

and for the last row (singles in $Q$),

\[ \hat{m}_{jk} = P\{q_k \text{ is single}\}, \qquad \forall k \le K,\ j = J + 1, \]

where $\hat{m}_{jk} \in [0, 1]$, $\forall j \le J + 1$, $k \le K + 1$. $P\{p_j \sim q_k\}$ means the probability that the $j$th point in $P$ corresponds to the $k$th point in $Q$. Furthermore, the entries in the last row and last column of $\hat{m}$ describe the probability for a point to be single. Naturally the probabilities for each point must sum to one, so the row and column constraints still apply:

\[ \forall j \le J: \quad \sum_{k=1}^{K+1} \hat{m}_{jk} = 1 \qquad \text{and} \qquad \forall k \le K: \quad \sum_{j=1}^{J+1} \hat{m}_{jk} = 1. \tag{4.11} \]

As for the binary match matrix, there are no constraints on the $(J + 1)$th row or the $(K + 1)$th column because there is no restriction on the number of singles in either point set. □

¹ The entries in the augmented fuzzy match matrix are termed probabilities here because of the close analogy to probabilities although, strictly speaking, they are not. Other authors use the term goodness instead.
To estimate the binary correspondence, the fuzzy matrix $\hat{m}$ can be converted to a binary augmented match matrix. A heuristic method for the binarization, which takes the fuzzy match matrix and a threshold value as input, is given in Binarization (Alg. 8). It is ensured that no point is matched to several points in the other set.

In general, for a matrix $m$ the $j$th row and the $k$th column are referred to as $m_{j\cdot}$ (the $j$th row in $m$) and $m_{\cdot k}$ (the $k$th column in $m$).
4.1.2 Motion estimation

The points in two point patterns to be matched can be related by a (possibly non-linear) transformation $f(\cdot, \theta)$ that brings corresponding points into register.

Definition 7 (Motion) Given two point sets $P$ and $Q$ of equal cardinality $J$, assume the correspondence to be one-to-one with $\forall j$, $p_j \sim q_j$. Find $f(\cdot, \theta)$ such that

\[ \sum_{j=1}^{J} \| p_j - f(q_j, \theta) \|^2 \tag{4.12} \]

is minimal. For this example, the sum of squared Euclidean distances has been chosen as the measure of similarity. The transformation $f(\cdot, \theta)$ is defined as the motion, and for a given type of transformation, the parameters to be estimated are $\theta$. □

Depending on the application, $f$ could be a simple rigid transformation (translation and rotation), an affine transformation (translation, rotation, scale and shear), a polynomial or even a non-rigid thin-plate spline transformation. For Euclidean and affine transformations the term pose [36] is also used for the motion.

If the cardinalities of the point sets are not equal, but the correspondence ($\hat{m}$) is known, the optimal transformation can still be found. The optimisation problem (4.12) can then be written

\[ \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{m}_{jk} \, \| p_j - f(q_k, \theta) \|^2, \tag{4.13} \]

i.e., the known correspondences in $\hat{m}$ are used to find the optimal motion, $f(\cdot, \theta)$.
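As a minimal sketch of (4.13), consider the simplest possible motion, a pure translation $f(q, \theta) = q + t$. This choice is an assumption made only for illustration and is not one of the transformations treated in detail in this chapter. Setting the derivative of (4.13) with respect to $t$ to zero gives the weighted mean displacement as the optimal translation. The function names are hypothetical.

```python
def weighted_sse(P, Q, m, f):
    """Objective (4.13): sum over j, k of m[j][k] * ||p_j - f(q_k)||^2.

    m is the inner J x K block of the match matrix (no outlier row/column).
    """
    total = 0.0
    for j, (px, py) in enumerate(P):
        for k, q in enumerate(Q):
            fx, fy = f(q)
            total += m[j][k] * ((px - fx) ** 2 + (py - fy) ** 2)
    return total


def estimate_translation(P, Q, m):
    """Closed-form minimiser of (4.13) for f(q) = q + t (translation only)."""
    w = sx = sy = 0.0
    for j, (px, py) in enumerate(P):
        for k, (qx, qy) in enumerate(Q):
            w += m[j][k]
            sx += m[j][k] * (px - qx)
            sy += m[j][k] * (py - qy)
    return (sx / w, sy / w)


# Three matched points plus one single in Q (the last column of m is all zero).
Q = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
P = [(2.0, 1.0), (3.0, 1.0), (2.0, 3.0)]
m = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]

t = estimate_translation(P, Q, m)
residual = weighted_sse(P, Q, m, lambda q: (q[0] + t[0], q[1] + t[1]))
print(t, residual)
```

The single point in $Q$ has zero weight in every term and therefore does not influence the estimate, which is the point of weighting the distances by the match matrix.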

4.1.3 The classical chicken and egg problem

There is a cyclic dilemma inherent in the point pattern matching problem. Given the correspondence, it would be easy to determine a transformation $f$; on the other hand, given an optimal transformation $f$, it would be easy to establish the correspondence (e.g., by nearest neighbour points).
4.1.4 Graph based methods

The point pattern matching problem is often cast as a graph matching problem. In the field of graph theory, matching is an important subject and the problem of matching point patterns can be formulated in various terms related to graph theory. Examples are minimum weighted bipartite graph matching [50] and relational graph matching [27]. The notation and description of the graph representation will not be discussed in detail.
4.2 General Point Pattern Matching Methods

This section describes a range of general methods for point pattern matching. Tab. 4.1 gives an overview of the most important methods in the literature. Some of these methods are described in detail in the following sections. The field of point matching is vast and there definitely exist methods not mentioned here. For a number of ways to attack the point matching problem, the table shows references to relevant work, the method used for correspondence calculation, which transformations (motion) are considered, and in what type of application the method has been used (if any).

Of the methods in Tab. 4.1 the following are described in further detail: iterative closest point (ICP), dual step expectation maximisation (EM), bipartite graph matching of shape context, and robust point matching (RPM), with emphasis on the RPM methods (and extensions) due to their suitability for matching 2DGE spot patterns.
Reference Gold and Rangarajan [35].
Transformation (motion) -
Applications Synthetic data.
Rangarajan et al. [69]. Similarity transform. Ane. Ane. 3D Ane. Ane. Ane. Ane, perspective. Ane. 2D elastic, 3D rigid. Non-rigid, spline. Non-rigid. Non-rigid. thin-plate Warping of 2D serial histological sections of mouse embryo. synthetic data, anatomical sulcal brain MRI data Hand tools, airplane silhouettes. Features in mammogram images, hurricane images. Thin-plate spline. Non-rigid. Recognition of trademarks, handwritten digits. Star constellations. autoradio-
Correspondence Robust Point Matching (RPM): deterministic annealing, softassign. Robust Point Matching. Similarity transform. autoradio-
Rangarajan et al. [67].
Chang et al. [20].
Primate cortex graphs. Primate cortex graphs. Finger prints.
Gold et al. [36]. Iterative Closest Point.
Softassign Procrustes matching. Cluster based matching, maximum pairs support. Robust Point Matching Hand-written characters, autoradiograph cortex slices. Synthetic data. Static indoor scene from two views. Star constellations. Synthetic data. Road networks, aerial imagery.
Besl and McKay [9], Zhang [91]. Ogawa [62].
Cheng [21]

Consistency graphs, Delaunay triangulation, maximal cliques. Relaxation, Select-match-pair process. Dual step EM algorithm, relational graph matching. Spectral graph theory, point proximity matrix. Correspondence by sensitivity to movement, modied ICP. RPM. Modal matching. Bipartite weighted graph matching. Hybrid of greedy and Hungarian algorithm. Bipartite graph matching of shape contexts. World view vector similarity.
Cross and Hancock [27]. Carcassoni and Hancock [17]. Guest et al. [38].
Chui and Rangarajan [23], Rangarajan et al. [68]. Sclaro and Pentland [72]. Kumar et al. [50], Kumar et al. [49].
Belongie and Malik [5], Belongie et al. [7]. Murtagh [61].
Table 4.1: Overview of general point pattern matching methods.
4.2.1 Iterative closest point

Iterative closest point (ICP), Besl and McKay [9] and Zhang [91], is one of the more well-known methods for point pattern matching. The general idea is to iteratively match points in one set to the closest points in another set. [9] proposes the ICP algorithm as a 3D rigid shape registration algorithm for point sets, curves and surfaces. Given two point sets $P$ and $Q$, for each point in $P$ the closest point in $Q$ is computed and from these pairs a transformation (translation and rotation) is calculated. This transformation is applied to the points in $P$. These steps are repeated in an iterative manner until the change in mean square error is below a given threshold. The algorithm is guaranteed to converge to a local minimum. One of the major disadvantages of this algorithm is its inability to handle a large proportion of outliers (singles) [9], [22].
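The loop described above can be sketched in a few lines. This is a simplified 2D illustration under stated assumptions, not the 3D algorithm of [9]: the closest point search is brute force, the rigid fit uses the standard closed-form least-squares rotation angle for 2D point pairs, and all function names are hypothetical.

```python
import math


def closest_index(p, Q):
    """Index of the point in Q nearest to p (squared Euclidean distance)."""
    return min(range(len(Q)),
               key=lambda k: (p[0] - Q[k][0]) ** 2 + (p[1] - Q[k][1]) ** 2)


def rigid_fit(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst."""
    n = len(src)
    sx = sum(x for x, _ in src) / n
    sy = sum(y for _, y in src) / n
    dx = sum(x for x, _ in dst) / n
    dy = sum(y for _, y in dst) / n
    num = den = 0.0
    for (ax, ay), (bx, by) in zip(src, dst):
        ax, ay, bx, by = ax - sx, ay - sy, bx - dx, by - dy
        num += ax * by - ay * bx
        den += ax * bx + ay * by
    angle = math.atan2(num, den)
    c, s = math.cos(angle), math.sin(angle)
    return angle, (dx - (c * sx - s * sy), dy - (s * sx + c * sy))


def apply_rigid(points, angle, t):
    """Rotate by angle about the origin, then translate by t."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]


def icp(P, Q, tol=1e-12, max_iter=50):
    """Move P towards Q until the change in mean square error is below tol."""
    P = list(P)
    prev = float("inf")
    for _ in range(max_iter):
        matches = [Q[closest_index(p, Q)] for p in P]
        angle, t = rigid_fit(P, matches)
        P = apply_rigid(P, angle, t)
        err = sum((px - qx) ** 2 + (py - qy) ** 2
                  for (px, py), (qx, qy) in zip(P, matches)) / len(P)
        if abs(prev - err) < tol:
            break
        prev = err
    return P, err


Q = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 3.0)]
P = apply_rigid(Q, 0.05, (0.3, -0.2))   # slightly rotated and shifted copy
aligned, err = icp(P, Q)
print(err)
```

Because every point in $P$ is forced to take a partner in $Q$, a single in $P$ pulls the rigid fit towards a wrong partner, which is one way to see the sensitivity to outliers noted above.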
4.2.2 Dual step EM

Cross and Hancock [27] and Carcassoni and Hancock [17] describe an approach very similar to the basic idea in RPM, namely alternating estimation of transformation parameters and correspondence matches, referred to as the dual step expectation maximisation algorithm (dual step EM). The transformation is confined to affine or perspective geometry, and relational graph matching of separate triangulations of the point sets constitutes the matching.
4.2.3 Bipartite graph matching of shape context

For the purpose of matching shapes Belongie and Malik [5], [6, 7] have developed a point matching method based on the matching of shape context descriptors. For each point, a shape context descriptor is calculated and the correspondence problem is solved by using bipartite graph matching of the descriptors. Fig. 4.2 is from [6]. The shape context descriptor of a point is a two-dimensional log-polar histogram. For the point in question, the mask in Fig. 4.2(c) is placed with its centre on the point and the count of points in each bin is assigned to the corresponding cell in the log-polar histogram. The angular coordinate is divided into 12 bins and $\log r$ is divided into 5 bins. The histograms are used to calculate a cost matrix $C$, where $C_{ij} = C(p_i, q_j)$ is the cost of matching point $p_i$ with $q_j$, and the matching is solved as a square assignment problem using an efficient implementation of the Hungarian method.
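The histogram construction can be sketched as follows. The 12 angular and 5 radial bins follow the description above; the normalisation of radii by the mean pairwise distance and the bin limits `r_min` and `r_max` are assumptions made for this illustration, and `shape_context` is a hypothetical name.

```python
import math


def shape_context(points, index, n_theta=12, n_r=5, r_min=0.125, r_max=2.0):
    """Log-polar histogram of the other points as seen from points[index].

    Radii are divided by the mean pairwise distance; r_min and r_max are
    illustrative bin limits on that normalised scale (an assumption).
    """
    dists = [math.dist(a, b)
             for i, a in enumerate(points) for b in points[i + 1:]]
    scale = sum(dists) / len(dists)
    log_lo, log_hi = math.log(r_min), math.log(r_max)
    hist = [[0] * n_theta for _ in range(n_r)]
    px, py = points[index]
    for i, (x, y) in enumerate(points):
        if i == index:
            continue
        r = math.hypot(x - px, y - py) / scale
        # Radii outside [r_min, r_max] are clamped into the first or last ring.
        frac = (math.log(max(r, r_min)) - log_lo) / (log_hi - log_lo)
        r_bin = min(int(frac * n_r), n_r - 1)
        theta = math.atan2(y - py, x - px) % (2 * math.pi)
        t_bin = min(int(theta / (2 * math.pi) * n_theta), n_theta - 1)
        hist[r_bin][t_bin] += 1
    return hist


pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
hist = shape_context(pts, 0)
for ring in hist:
    print(ring)
```

A cost matrix could then be filled by comparing the histograms of every pair $(p_i, q_j)$; the comparison measure is not specified in the text above and is therefore left out of the sketch.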
Figure 4.2: Shape context and matching. From Belongie et al. [6]. Panels: (a) Pattern 1. (b) Pattern 2. (c) Log-polar histogram bins. (d) Shape context descriptor for a marked point in (a). (e) Shape context descriptor for a marked point in (b). (f) Shape context descriptor for a second marked point in (a). The descriptor panels are plotted against $\log r$.
If the two sets do not contain the same number of points, extra dummy points (with a sufficiently large cost) can be added to either of the point sets. This ensures a square cost matrix $C_{ij}$. The same trick is used to make the method more robust to singles. This does, however, require an estimate of the maximum number of singles in either set. Heavily contaminated point sets can be expected to cause problems, and [7] provides no solution for this case. Ideas very similar to the shape context descriptor have previously been suggested by [64] (see Section 4.3.1). The complete matching method alternates between estimation of the correspondence and of a thin-plate spline transformation, inspired by the approach in [23] (see also Section 4.2.4).
4.2.4 Robust point matching

Robust point matching (RPM) is a method based on the estimation of both the point correspondence and the transformation (motion, pose) in a deterministic annealing setting. The correspondence and the transformation are estimated alternately in an iterative manner, and by use of the augmented fuzzy match matrix (Def. 6) and Sinkhorn's method for matrix normalisation, RPM handles singles in a robust manner. The original method proposed by Gold and Rangarajan [35] has been extended to handle various types of transformations: similarity transform [69], affine transformation [36], Procrustes alignment [67], [70], and the non-rigid thin-plate spline transformation [68], [23].

In the following, the RPM method in general and some extensions to Sinkhorn's method will be described. Two types of transformations, the affine and the thin-plate spline, will be discussed in detail. The level of detail is greater in the discussion of the RPM methods because their ability to match protein spot patterns from electrophoresis gels has been investigated thoroughly (Section 4.5).

Given two point sets $P$ and $Q$ of possibly different cardinality, and assuming that some knowledge about the type of transformation (motion) that relates the two point sets is available (e.g., that the motion is affine), it remains to estimate the optimal parameters $\theta$ in the transformation $f(\cdot, \theta)$ and the correspondence $\hat{m}$.
The assignment problem and the estimation of the transformation parameters are formulated as an energy minimisation problem of the form:

\[ \min_{\hat{m}, \theta} E = \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{m}_{jk} \, \| p_j - f(q_k, \theta) \|^2 - \alpha \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{m}_{jk} + T \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{m}_{jk} (\log \hat{m}_{jk} - 1) + \lambda J(f(\theta)). \tag{4.14} \]

The first term is the sum of squared distances between the points in the two sets. The weighting of the distances by $\hat{m}_{jk}$ penalises distances between points with high probability to match. The second term provides a regularisation on the match matrix $\hat{m}$, ensuring that not all points are classified as singles, which would cancel out the first term of the energy function. The third term is an entropy barrier function to ensure positivity of $\hat{m}$. To ensure reasonable behaviour of the transformation, the fourth term serves as a regularisation on the transformation $f(\cdot, \theta)$. $\alpha$ and $\lambda$ serve as regularisation parameters. This is a general form of the energy function; more specific forms (for $f$ being an affine and a thin-plate spline transformation) are presented later.

Algorithm 1 Robust-Point-Matching($P$, $Q$, $T_0$)
 1: Initialise $\theta = 0$.   ▷ All parameters are initialised to zero
 2: Initialise $T = T_0$.
 3: Initialise $\hat{m}_{jk} = 1 + \epsilon$.
 4: repeat   ▷ Deterministic annealing
 5:    repeat   ▷ at each temperature level
 6:       $Q_{jk} \leftarrow -\partial E / \partial \hat{m}_{jk}$
 7:       $\hat{m}_{jk} \leftarrow \exp(Q_{jk}/T)$   ▷ Ensure positivity of $\hat{m}$
 8:       $\hat{m} \leftarrow$ Sinkhorn($\hat{m}$)   ▷ Sinkhorn's normalisation
 9:       Estimate transformation parameters $\theta$
10:    until # of iterations > $I_0$
11:    Decrease $T$, decrease $\lambda$
12: until $T < T_{end}$
13: return $\hat{m}$, $f(\cdot, \theta)$
The RPM algorithm outlined in Alg. 1 is now explained in further detail. The outer loop is the deterministic annealing (DA). It basically consists of slowly decreasing the temperature parameter $T$. $T$ plays an important role in the estimation of the point correspondences and is used to weight the importance of the inter-point distances. Note how the current estimate of the transformation, $f(\cdot, \theta)$, is used to calculate the next estimate of the match matrix, $\hat{m}$. In the first iterations (at large $T$) the distance between points is less important for the correspondence, but as the temperature decreases, small distances are rewarded more and large distances are penalised more.

In the inner loop the matrix $Q_{jk} \leftarrow -\partial E / \partial \hat{m}_{jk}$ is calculated. For fixed parameters $\theta$ this is an assignment problem [35], with the sign reversed because this is a minimisation instead of a maximisation as in the canonical form of the assignment problem. The transformation $\hat{m}_{jk} \leftarrow \exp(Q_{jk}/T)$ (line 7) ensures that a small Euclidean distance between a point pair corresponds to a large probability of match for this pair.

For the elements in $\hat{m}$ to be probabilities the column and row constraints (4.11) must hold. This is ensured (line 8) by Sinkhorn's matrix normalisation (Alg. 2)¹. The convergence criterion of $\hat{m}$ in Alg. 2 (line 6) is a limit on the sum of absolute differences of matrix elements between two consecutive iterations of the loop:

\[ \sum_{j=1}^{J+1} \sum_{k=1}^{K+1} | \hat{m}_{jk} - \hat{m}^0_{jk} | < \epsilon_2, \]

where $\hat{m}^0$ denotes the matrix from the preceding iteration.

Algorithm 2 Sinkhorn($\hat{m}$)
1: repeat
2:    Normalise all rows: $\forall j \le J + 1$, $k \le K + 1$
3:       $\hat{m}^1_{jk} \leftarrow \hat{m}_{jk} / \sum_{k'=1}^{K+1} \hat{m}_{jk'}$
4:    Normalise all columns: $\forall j \le J + 1$, $k \le K + 1$
5:       $\hat{m}_{jk} \leftarrow \hat{m}^1_{jk} / \sum_{j'=1}^{J+1} \hat{m}^1_{j'k}$
6: until $\hat{m}$ converges or # of iterations > $I_1$
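The alternating normalisation can be sketched for the square, strictly positive case, which is the case covered by Sinkhorn's result. The function name, the default tolerance and the iteration limit are assumptions for illustration; the stopping rule is the sum of absolute differences between consecutive iterations described above.

```python
def sinkhorn(m, eps=1e-9, max_iter=1000):
    """Alternately normalise rows and columns of a square, positive matrix."""
    m = [row[:] for row in m]
    n = len(m)
    for _ in range(max_iter):
        previous = [row[:] for row in m]
        # Row step: every row is divided by its sum.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: every column is divided by its sum.
        col_sums = [sum(m[j][k] for j in range(n)) for k in range(n)]
        m = [[m[j][k] / col_sums[k] for k in range(n)] for j in range(n)]
        change = sum(abs(m[j][k] - previous[j][k])
                     for j in range(n) for k in range(n))
        if change < eps:
            break
    return m


balanced = sinkhorn([[1.0, 2.0], [3.0, 4.0]])
print([sum(row) for row in balanced])
print([sum(balanced[j][k] for j in range(2)) for k in range(2)])
```

Each pass ends with the column step, so the columns sum to one up to rounding after every iteration, while the row sums approach one only as the iteration converges.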
After this normalisation the newly updated match matrix $\hat{m}$ is used to update the transformation parameters (line 9). Knowledge of the fuzzy correspondences makes it possible to put more emphasis on high probability matches in the parameter estimation. Furthermore, there will often be some regularisation of the parameters to avoid inappropriate behaviour of the transformation. This will be demonstrated in the specific examples of the transformation being affine and a thin-plate spline.

¹ Sinkhorn($\hat{m}$) is described in Alg. 2; however, it is suggested to use Extended-Sinkhorn($\hat{m}$), Alg. 3, instead.
After a fixed number of iterations $I_0$ at the current temperature level the inner loop is done and the temperature is decreased (line 11), e.g., by an exponential temperature schedule. Simultaneously, the weight $\lambda$ of the regularisation of the transformation parameters is relaxed (decreased). This is because the transformation parameters need only be bounded in the beginning of the iterations, when corresponding points may be far apart. As the algorithm progresses the estimates of the transformation parameters become better and there is less need to bound them. In the first iterations (large $T$) all pairs will be assigned a small probability of match, but as $T$ decreases point pairs close to each other will be rewarded with relatively larger probability of match and vice versa.

Lines 7 and 8 in Alg. 1 are termed softassign [35] because the inter-point distances are converted to fuzzy (soft) correspondences (assignment probabilities). For point sets $P$ and $Q$ of equal size, the energy function in (4.14) is a linear assignment problem [23] or a bipartite matching problem with respect to the correspondences. The RPM algorithm minimises the energy function because softassign used within deterministic annealing has been shown to find the optimal solution to linear assignment [47]. Although the case of unequal point sets is not strictly an assignment problem, Gold et al. [36] imply that this result remains valid there by introducing slack variables. However, no proof is given. Furthermore, [23] state that while softassign and the transformation estimation are independently optimal, the combination is not necessarily so. This approach is only guaranteed to find a local minimum for both the mapping and the correspondence. Application of the RPM algorithm to 2DGE data has proven to work well in practice (Section 4.5).
RPM algorithm initialisation

All parameters of the transformation are initialised to $\theta = 0$, or such that the point set undergoes no transformation in the beginning, because no knowledge of this transformation is available. $T_0$ is the starting temperature and should be chosen suitably large. [23] proposes to set $T_0$ slightly larger than the greatest squared distance of all point pairs to allow for all possible matchings at first. In cases with almost aligned point sets and small deformations (e.g., for 2DGE spot patterns) $T_0$ can be lowered to save computational effort. The parameter $\alpha$ serves as a threshold error distance and indicates how far apart two points can be before being treated as outliers. This parameter needs to be adjusted depending on the problem at hand.

The parameter $\lambda$ can be adjusted to control the penalty on the transformation parameters and must be chosen dependent upon the problem at hand. Every element in the match matrix $\hat{m}_{jk}$ is initialised to $1 + \epsilon$, where $\epsilon$ is a small number, e.g. $10^{-6}$. This only has an effect on the last row and the last column (for the outliers) because the inner part of $\hat{m}$ is immediately overwritten. Alternatively, [23] suggests initialising $\hat{m}$ such that entries in the $(J \times K)$ inner part of the matrix are all set to $\frac{1}{J}$ (which seems strange, because these elements are again immediately overwritten) and the elements in the outlier row and the outlier column are initialised to $\frac{1}{100 J}$. The initialisation first suggested has been used in the experiments conducted later (Section 4.5).
Extended Sinkhorn's method

There are problems using the Sinkhorn method to normalise $\hat{m}$. First, $\hat{m}$ is often a non-square matrix, but the Sinkhorn result is valid for square matrices only. Second, in the definition of the fuzzy match matrix (Def. 6) there are no constraints on the last row and the last column of $\hat{m}$, so these should not be normalised as happens in Alg. 2. In other words, the elements in the last column should not be normalised with respect to the sum of that column, but its elements should still enter in the normalisation of the corresponding matrix rows, and vice versa for the last row.

This idea has been implemented in Alg. 3. The changes from Alg. 2 (as defined in Gold et al. [36]) inside the convergence loop merely have to do with not normalising the last row and the last column of $\hat{m}$, for the reasons given above. One may easily think of a pathological case where Alg. 3 (and also Alg. 2) will fail.

Example 2 (pathological case) Consider the augmented match matrix $\hat{m}$ representing the correspondence between two points in one set ($P$) and three points in the other set ($Q$). There is a tie between $q_1$ and $q_2$ to match $p_1$.

\[ \hat{m} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \]

The row and column normalisations will result in the unfortunate result:

\[ \hat{m} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & .38 & .62 \\ 0 & 0 & .62 & 0 \end{pmatrix} \]

and the last two lines in Alg. 3, adjusting the outlier row and column, result in:

\[ \hat{m} = \begin{pmatrix} 1 & 1 & 0 & -1 \\ 0 & 0 & .38 & .62 \\ 0 & 0 & .62 & 0 \end{pmatrix} \]

which is not an acceptable result. □

On the other hand, $\hat{m}$ was not valid from the beginning because all elements must be strictly positive.

Example 3 By adding a small constant, e.g. $10^{-5}$, to all elements in $\hat{m}$,

\[ \hat{m} \leftarrow \hat{m} + 10^{-5}, \]

a valid input matrix to Extended-Sinkhorn is obtained, and the result, valid to two decimal places after 79 iterations of Alg. 3, becomes:

\[ \hat{m} = \begin{pmatrix} 0.50 & 0.50 & 0.00 & 0.00 \\ 0.15 & 0.15 & 0.29 & 0.41 \\ 0.35 & 0.35 & 0.71 & 0 \end{pmatrix} \]

which is a valid match matrix with all rows and columns summing to one, save the outlier row and column. The tie has not been resolved (of course), and binarization of the result as proposed in Appendix B, Alg. 8, with a threshold of 0.5 will result in:

\[ \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix} \]

i.e., all points are declared outliers, which seems reasonable considering the input. A less conservative binarization threshold of 0.2 results in

\[ \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}. \]

□

The examples above illustrate the reason for the initialisation of $\hat{m}$ as shown in Alg. 1. It is important that the outlier row and column do not contain zero elements.

Sinkhorn proves that any square matrix with all positive elements will converge to a doubly stochastic matrix by alternately normalising rows and columns in
an iterative manner. A doubly stochastic matrix is a square matrix in which elements are all positive and sum to one both along rows and along columns. The problem of $\hat{m}$ often being non-square is solved by a simple heuristic. At the end of the iterations the following rule has been added: if the inner columns and rows do not sum to one, this is forced to happen by appropriately adjusting the outlier values (lines 8-9, Alg. 3).

Algorithm 3 Extended-Sinkhorn($\hat{m}$)
1: Initialise $\hat{m}$ with the input matrix
2: repeat
3:    Normalise the $J$ top rows: $\forall j \le J$, $k \le K + 1$
4:       $\hat{m}^1_{jk} \leftarrow \hat{m}_{jk} / \sum_{k'=1}^{K+1} \hat{m}_{jk'}$
5:    Normalise the $K$ leftmost columns: $\forall j \le J + 1$, $k \le K$
6:       $\hat{m}_{jk} \leftarrow \hat{m}^1_{jk} / \sum_{j'=1}^{J+1} \hat{m}^1_{j'k}$
7: until $\hat{m}$ converges or # of iterations > $I_1$
8: $\forall j \le J$: $\hat{m}_{j(K+1)} \leftarrow 1 - \sum_{k=1}^{K} \hat{m}_{jk}$   ▷ force top $J$ rows to sum to 1
9: $\forall k \le K$: $\hat{m}_{(J+1)k} \leftarrow 1 - \sum_{j=1}^{J} \hat{m}_{jk}$   ▷ force $K$ leftmost columns to sum to 1
Although no proof of convergence is given, these changes have proven very useful in making the point matching algorithm more robust. A simulation with 1000 matrices was carried out in order to study the convergence properties of Alg. 3. The dimensions of the matrices, $J \times K$, were random integers drawn from {2, 3, ..., 100}, i.e., the matrices were non-square in most cases, but square matrices could also occur. The matrix entries were random real numbers in ]0, 1]. The stopping criterion was chosen so that the limit for change in $\hat{m}$ was $\epsilon_2 = 10^{-6}$ and the maximum number of iterations was $I_1 = 1000$. All simulations converged to match matrices fulfilling the constraints in (4.11) in less than 105 iterations. In this thesis the Extended-Sinkhorn($\hat{m}$) method substitutes the Sinkhorn($\hat{m}$) method in all experiments.
RPM with affine transformation

One of the important transformations implemented in the Robust Point Matching scheme is the affine pose [36]. Here, the affine transformation of a point in $Q$ is

\[ f(q_k, \theta) = t + A q_k, \]

where $t$ is a translation vector and $A$ is a $2 \times 2$ matrix defined as

\[ A = s(a) \, R(\Theta) \, Sh_1(b) \, Sh_2(c), \]

where

\[ s(a) = \begin{pmatrix} e^a & 0 \\ 0 & e^a \end{pmatrix}, \quad Sh_1(b) = \begin{pmatrix} e^b & 0 \\ 0 & e^{-b} \end{pmatrix}, \quad Sh_2(c) = \begin{pmatrix} \cosh(c) & \sinh(c) \\ \sinh(c) & \cosh(c) \end{pmatrix}, \]

and $R(\Theta)$ is the well known rotation matrix:

\[ R(\Theta) = \begin{pmatrix} \cos(\Theta) & -\sin(\Theta) \\ \sin(\Theta) & \cos(\Theta) \end{pmatrix}. \]

The pose parameters can be summarised in $\theta = (t, \Theta, a, b, c)$. In the case of affine transformation the energy function in Eq. (4.14) becomes:
\[ E = \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{m}_{jk} \, \| p_j - t - A q_k \|^2 - \alpha \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{m}_{jk} + T \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{m}_{jk} (\log \hat{m}_{jk} - 1) + g(A), \tag{4.15} \]

composed of four terms. The first term is the sum of squared weighted distances between points in $P$ and the transformed points in $Q$. The second term ensures that not all points are classified as singles. The parameter $\alpha$ is a threshold error distance that determines how distant two points can be before their match is unlikely. The third term is, as in (4.14), the entropy barrier function. The fourth term $g(A)$ ($= \lambda J(f(\theta))$ in Eq. (4.14)) provides a regularisation of the affine transformation by penalising large values of the scale and shear parameters:

\[ g(A) = \gamma (a^2 + b^2 + c^2), \]

where $\gamma$ allows for adjustment of the penalty. Differentiating the energy function in (4.15) with respect to the match matrix and setting the result equal to zero gives [36]:

\[ \frac{\partial E}{\partial \hat{m}_{jk}} = \| p_j - t - A q_k \|^2 - \alpha + T (\log \hat{m}_{jk} - 1) + T = \| p_j - t - A q_k \|^2 - \alpha + T \log \hat{m}_{jk} = 0 \]

\[ \Rightarrow \quad \hat{m}_{jk} = \exp\left( - \frac{\| p_j - t - A q_k \|^2 - \alpha}{T} \right). \]

Gold et al. [36] also provide analytical solutions for updating $\Theta$ and $t$ and suggest using Newton's method to update $a$, $b$, and $c$. In the case of affine transformation Alg. 1 specialises to Alg. 4.

Algorithm 4 RPM-Affine($P$, $Q$, $T_0$)
 1: Initialise $\theta = (t, \Theta, a, b, c) = 0$.   ▷ All parameters are initialised to zero
 2: Initialise $T = T_0$.
 3: Initialise $\hat{m}_{jk} = 1 + \epsilon$.
 4: repeat   ▷ Deterministic annealing
 5:    repeat   ▷ at each temperature level
 6:       $Q_{jk} \leftarrow -(\| p_j - t - A q_k \|^2 - \alpha)$
 7:       $\hat{m}_{jk} \leftarrow \exp(Q_{jk}/T)$   ▷ Ensure positivity of $\hat{m}$
 8:       $\hat{m} \leftarrow$ Sinkhorn($\hat{m}$)   ▷ Sinkhorn's normalisation
 9:       Update $t$ and $\Theta$ using analytic solutions   ▷ Update pose parameters
10:       Update $a$, $b$, and $c$ using Newton's method
11:    until # of iterations > $I_0$
12:    Decrease $T$ and $\gamma$
13: until $T < T_{end}$
14: return $\hat{m}$, $f(\cdot, \theta)$
Pedersen and Ersbøll [65] suggested gradually reducing $\alpha$, so that in the beginning, when the match matrix and the affine pose parameters are uncertain, larger distances between points are allowed. Later (after more iterations), when the correspondence and the pose are more certain, a smaller distance between points is required for them to be likely matches. Section 4.3.4 embeds this method in a regionalised setting, and results of application to matching of 2D electrophoresis spot patterns are given in Section 4.5.
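The role of the temperature in the softassign update can be illustrated with a short sketch. It composes $A$ from the four factor matrices given above and evaluates the inner block $\hat{m}_{jk} = \exp(-(\| p_j - t - A q_k \|^2 - \alpha)/T)$ before any normalisation. The function names and the example values for the points, $\alpha$ and $T$ are assumptions chosen only for illustration.

```python
import math


def compose_affine(rot, a, b, c):
    """A = s(a) R(rot) Sh1(b) Sh2(c) as a 2 x 2 nested list."""
    def matmul(X, Y):
        return [[sum(X[i][l] * Y[l][j] for l in range(2)) for j in range(2)]
                for i in range(2)]
    s = [[math.exp(a), 0.0], [0.0, math.exp(a)]]
    R = [[math.cos(rot), -math.sin(rot)], [math.sin(rot), math.cos(rot)]]
    sh1 = [[math.exp(b), 0.0], [0.0, math.exp(-b)]]
    sh2 = [[math.cosh(c), math.sinh(c)], [math.sinh(c), math.cosh(c)]]
    return matmul(matmul(matmul(s, R), sh1), sh2)


def softassign_inner(P, Q, t, A, alpha, T):
    """Inner J x K block: m_jk = exp(-(||p_j - t - A q_k||^2 - alpha) / T)."""
    m = []
    for px, py in P:
        row = []
        for qx, qy in Q:
            fx = t[0] + A[0][0] * qx + A[0][1] * qy
            fy = t[1] + A[1][0] * qx + A[1][1] * qy
            d2 = (px - fx) ** 2 + (py - fy) ** 2
            row.append(math.exp(-(d2 - alpha) / T))
        m.append(row)
    return m


P = [(0.0, 0.0), (10.0, 0.0)]
Q = [(1.0, 0.0), (9.0, 1.0)]
A = compose_affine(0.0, 0.0, 0.0, 0.0)     # all pose parameters zero: identity
hot = softassign_inner(P, Q, (0.0, 0.0), A, alpha=0.0, T=100.0)
cold = softassign_inner(P, Q, (0.0, 0.0), A, alpha=0.0, T=1.0)
# Preference of p1 for its near neighbour q1 over the distant q2.
print(hot[0][0] / hot[0][1], cold[0][0] / cold[0][1])
```

At the high temperature the near and the distant candidate receive comparable values, whereas at the low temperature the near candidate dominates, which matches the annealing behaviour described for Alg. 1.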
For the experiments in Section 4.5 the following values were used: $T_0 = 110$, $\alpha = 100$, $\gamma = 1$, $\epsilon = 10^{-5}$, $\epsilon_2 = 0.5$, $I_0 = 20$, $I_1 = 5$. The annealing rate (the rate at which $T$ and $\gamma$ were decreased) was 0.93 for both parameters.
RPM with thin-plate spline transformation

Another transformation, the non-rigid thin-plate spline, is extremely flexible and well suited for modelling the complex disparity field between 2DGE point sets. Chui and Rangarajan [23] and Rangarajan et al. [68] have developed a powerful method for using the thin-plate spline transformation in conjunction with the Robust Point Matching framework. The thin-plate spline transformation is described in Appendix A. The thin-plate spline function of the form (A.2),

\[ f(q_k, \theta) = q_k d + \phi(q_k) c, \tag{4.16} \]

is the sum of an affine part and a non-affine part. The parameters are $\theta = (d, c)$. $d$ is a $(3 \times 3)$ affine parameter matrix (using homogeneous coordinates, refer to Appendix A) and the non-affine parameter matrix $c$ is a $(K \times 3)$ matrix. The energy function minimised by this algorithm is

\[ E(\hat{m}, d, c) = \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{m}_{jk} \, \| p_j - q_k d - \phi(q_k) c \|^2 - \alpha \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{m}_{jk} + T \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{m}_{jk} (\log \hat{m}_{jk} - 1) + \lambda_1 \, \mathrm{tr}(c^T c) + \lambda_2 \, \mathrm{tr}[(d - I)^T (d - I)]. \tag{4.17} \]

The first term is again the sum of squared distances between the points in $P$ and the transformed points in $Q$. The second term again ensures that not all points are classified as singles. The third term is an entropy barrier function to ensure positivity of $\hat{m}$. The last two terms are regularisation terms on the non-affine ($c$) and affine ($d$) parameters respectively. The penalty on the behaviour of these parameters is split into two terms, conveniently providing separate control ($\lambda_1$ and $\lambda_2$) over the non-affine and affine behaviour of the transformation. The parameters $\alpha$, $\lambda_1$, and $\lambda_2$ must be chosen according to the problem at hand.
Differentiation of (4.17) with respect to $m_{jk}$ and setting the result to zero leads to [23]:
$$Q_{jk} = -\left(\lVert p_j - q_k d - \phi(q_k) c \rVert^2 - \alpha\right) \quad\text{and}\quad m_{jk} = \exp(Q_{jk}/T),$$
i.e., an estimate of the correspondence $m$. After normalisation of the fuzzy correspondence estimates the thin-plate spline parameters can be estimated. Appendix A explains how, using QR decomposition, a least squares estimate of the thin-plate spline parameters $(d, c)$ can be obtained when correspondences are known (or at least assumed known). Alg. 5 shows the RPM-TPS scheme.
Algorithm 5 RPM-TPS(P, Q, $T_0$)
1: Initialise $\theta = 0$. (All parameters are initialised to zero)
2: Initialise $T = T_0$.
3: Initialise $m_{jk} = 1 + \epsilon$.
4: repeat
5:   repeat (Deterministic annealing at each temperature level)
6:     $Q_{jk} \leftarrow -(\lVert p_j - q_k d - \phi(q_k) c \rVert^2 - \alpha)$
7:     $m_{jk} \leftarrow \exp(Q_{jk}/T)$ (Ensure positivity of $m$)
8:     $m \leftarrow$ Sinkhorn($m$) (Sinkhorn's normalisation)
9:     Update $\theta = (d, c)$ (Update TPS parameters)
10:   until # of iterations > $I_0$
11:   Decrease $T$
12:   Decrease $\lambda_1$ and $\lambda_2$
13: until $T < T_{end}$
14: return $m$, $f(\cdot, \theta)$
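As a rough illustration of lines 6 to 8 of Alg. 5, the sketch below computes the benefit, exponentiates it, and applies an alternating row and column normalisation in the manner of Sinkhorn. It is plain Python and is not the implementation used for the experiments. The function name `softassign`, the iteration count `n_sinkhorn`, and the slack row and column initialised to 1.0 for outliers are assumptions made for this sketch; the squared residuals are taken as already computed for the current transformation.

```python
import math


def softassign(residual_sq, T, alpha=0.0, n_sinkhorn=50):
    """One correspondence update: Q_jk -> exp(Q_jk / T) -> Sinkhorn.

    residual_sq[j][k] holds the squared distance between p_j and the
    transformed q_k. Returns a (J+1) x (K+1) matrix whose last row and
    last column are outlier slacks (an assumed convention).
    """
    J, K = len(residual_sq), len(residual_sq[0])
    # Benefit Q_jk = -(residual - alpha); exponentiation keeps entries positive.
    m = [[math.exp(-(residual_sq[j][k] - alpha) / T) for k in range(K)] + [1.0]
         for j in range(J)]
    m.append([1.0] * (K + 1))  # slack row, assumed initial value 1.0
    for _ in range(n_sinkhorn):
        # Normalise every real row (the slack column is part of the sum).
        for j in range(J):
            s = sum(m[j])
            m[j] = [v / s for v in m[j]]
        # Normalise every real column (the slack row is part of the sum).
        for k in range(K):
            s = sum(m[j][k] for j in range(J + 1))
            for j in range(J + 1):
                m[j][k] /= s
    return m
```

Lowering `T` between calls sharpens the entries towards 0 or 1, which is the role of the outer annealing loop in Alg. 5.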
Note, that the regularisation parameters are relaxed as the iterations proceed. This way the transformations are allowed more freedom when the points are brought closer. In the beginning of the iterations, the correspondence is uncertain and the points may be far apart, but later when the match matrix is more certain, the transformation is allowed to behave more freely. For the experiments in 4.5 the following values were used: T0 = 300, = 0, 1 = 106 , 2 = 106 , = 105 , 2 = 0.5, I0 = 20, I1 = 5. The annealing rate (the rate at which T was decreased) was 0.93. 1 and 2 were kept constants.
Extended RPM TPS
The RPM-TPS method uses the point locations in the estimation of correspondence and transformation. Often additional point information other than the spatial location of the points is available. The RPM-TPS method can easily be extended to include the point attribute information in the point matching. Provided that there is some correlation between the attribute information of corresponding points, the attribute information can reduce the number of possible matches for each point, resulting in a faster and more reliable algorithm. Assume that $D$ scalar attributes are given for each point in P and Q. The $D$ attributes of point $p_j$ in P are arranged in the $(D \times 1)$ attribute information vector $a_j^p$, and similarly $a_k^q$ for point $q_k$ in Q. Starting from the energy function (4.17), a sixth term is added to include the attribute information in the matching process:
$$\gamma \sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk}\, \delta(a_j^p, a_k^q),$$
where $\delta(\cdot, \cdot)$ is some distance measure and $\gamma$ is a parameter to control the influence of the attribute information. The final energy function becomes
$$
\begin{aligned}
E(m, d, c) ={}& \sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk}\,\lVert p_j - q_k d - \phi(q_k) c \rVert^2
 - \alpha \sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk}
 + T \sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk}(\log m_{jk} - 1) \\
 &+ \lambda_1 \operatorname{tr}(c^T c) + \lambda_2 \operatorname{tr}[(d - I)^T (d - I)]
 + \gamma \sum_{j=1}^{J}\sum_{k=1}^{K} m_{jk}\, \delta(a_j^p, a_k^q).
\end{aligned}
\tag{4.18}
$$
Setting $\partial E / \partial m_{jk} = 0$ results in
$$Q_{jk} = -\left(\lVert p_j - q_k d - \phi(q_k) c \rVert^2 + \gamma\, \delta(a_j^p, a_k^q) - \alpha\right),$$
which will require a corresponding change of Alg. 5, line 6.
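The change to line 6 of Alg. 5 can be sketched as a single function that adds a weighted attribute distance to the squared spatial residual. This is only an illustration: the name `benefit`, the weight argument `gamma`, and the use of the squared Euclidean distance between attribute vectors are assumptions, since the text leaves the distance measure open.

```python
def benefit(residual_sq, attr_p, attr_q, alpha=0.0, gamma=1.0):
    """Benefit Q_jk for one candidate pair, including attribute information.

    residual_sq : squared distance between p_j and the transformed q_k
    attr_p/attr_q : attribute vectors of the two points (equal length)
    """
    # Squared Euclidean distance stands in for the unspecified distance measure.
    attr_dist = sum((a - b) ** 2 for a, b in zip(attr_p, attr_q))
    return -(residual_sq + gamma * attr_dist - alpha)
```

With `gamma=0.0` the function reduces to the location-only benefit, so the weight controls how strongly a mismatch in attributes (for instance in %II) lowers the match value.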

4.3
Point Matching of Protein Spot Patterns
As a reminder of the nature of protein spot patterns, Fig. 4.3 repeats Fig. 2.13 from Chapter 2. The known correspondences in Fig. 4.3 show the very complex nature of the protein spot patterns. The plot of two point sets with corresponding points connected with arrows or lines is referred to as the disparity field for the two point sets. The left plot in Fig. 4.3 shows the disparity field of two point sets P and Q. The dominant rotation and translation suggest that a large part of the disparity field can be recovered by a global alignment of the point patterns. Therefore, a disparity field model of the following form is suggested:
$$\Delta = \Delta_g + \Delta_l, \tag{4.19}$$
where the total disparity is a sum of two contributions, a global field and a local field. The first term, $\Delta_g$, is the global disparity field and can be described by a fairly simple parametric transformation that accounts for the different location of the gels during the image capture process. The second term, $\Delta_l$, models the local non-linear deformations of the gels due to non-uniform diffusion constants in the gel, non-homogeneity of the chemical solutions, and the electrical potential fields involved in the protein separation process.
Figure 4.3: Known correspondence between spots in gel A and gel B. Corresponding spots are connected with small arrows. Left: Before initial alignment. Right: Residual after correction for 1st order polynomial transformation using landmarks. Manually defined landmarks are marked with stars.
The methods presented in the rest of this section are developed to estimate point correspondences between two sets of protein spot centres. The inputs to these methods have all been corrected for the global disparity field. The matching of points instead of protein spots is based on successful segmentation of images into spots and non-spots. Please refer to Chapter 3 for details on the gel segmentations. Comparison of matching results in the literature is difficult because of the variety of visualisation methods used to produce the gel images. Some methods generate images with only a small number of spots, while more sensitive methods (such as [35S]-methionine labelling detected with a phosphor imager) are able to detect far more proteins from the same sample (see Fig. 2.5). The sensitive methods detect more proteins in the same area, hence denser point sets. Intuitively, the risk of making a mistake in the matching is higher when matching dense point sets. Another factor impeding the comparison of match results within the literature is that, even using the same visualisation method, inter-laboratory differences (procedure, detergents, equipment, etc.) influence the gel images and therefore also the density of the spots and the reproducibility of the gels in general.
Figure 4.4: Principal sketch of (partial) correspondences between protein spots in two gel images. In order to compare protein expression levels between two gels the correct correspondence between matching spots is necessary. Repetition of Fig. 2.11.
A literature overview of important attempts to match protein spots is given in Tab. 4.2. Selected methods in the areas of: neighbourhood based matching,
Reference Watanabe et al. [88], Watanabe and Takahashi [87], Watanabe and Takahashi [86]. Pánek and Vohradský [64]. Akutsu et al. [2].
Correspondence Graph matching of Delaunay and relative neighbourhood graphs. Breadth first search. Maximal similarity of neighbourhood descriptors and relative position. Neighbourhood based, dynamic programming, practical heuristic algorithm. Heuristic cluster matching. Incremental Delaunay triangulation, hashing. Uses spot intensities. Visual comparison by colour super-imposition.
Transformation (motion) -
Outlier handling Yes.
Commercial software DNAinsight.
Yes.
Appel et al. [3], Miller et al. [60]. Hoffmann et al. [40], Hoffmann et al. [41], Pleissner et al. [66]. Horgan et al. [42].
Yes. Yes.
Melanie II. CAROL
Menard and Soulie [59]
Graph matching of topological maps
Affine, thin-plate spline warp of entire gel image using manually selected landmarks. -
No.
Table 4.2: Point pattern matching methods applied to 2D electrophoresis protein spot patterns.
Yes.
graph based matching, successive point matching, and regionalised robust point matching (regionalised RPM) will be discussed with emphasis on the regionalised RPM methods.
4.3.1
Neighbourhood based matching
Comparison of spot neighbourhoods [64], [2], [3] is a popular approach in matching of electrophoretic protein spot patterns. Consider the usual matching problem: given two gels, a reference gel p and a match gel q with point patterns P and Q, find the best match matrix m. The common idea behind these methods is as follows. For a given point $p_j$ in P, find its homologous point $q_k$ in Q. It is likely that the neighbourhood of $q_k$ looks like the neighbourhood of $p_j$. A neighbourhood descriptor is defined for each point in both sets and these neighbourhood descriptors are then compared in order to find the best match for each point. This idea attempts to mimic the method of an investigator performing the matching of spot patterns by eye.
Pánek and Vohradský
One of the more promising of the existing approaches to matching protein spot patterns is the one by Pánek and Vohradský [64]. They attempt to match the spot patterns by comparing spot neighbourhoods. A syntactic descriptor of the neighbourhood is defined and, together with a metric definition of position similarity, a simple comparison is done with five possible candidate points. For a point $p_j$ the candidate points in Q are the five points in Q closest to the position of $p_j$. The syntactic neighbourhood descriptor is related to the shape context descriptor in [5]. Pánek and Vohradský [64] also divide the neighbourhood around the spot into segments (concentric rings and wedges), each with a binary code. See Fig. 4.5. The segment codes of neighbouring segments differ by only one binary digit. Note the similarity to the binary Gray code. Furthermore, codes for opposite segments have no digits in common.
Spots in the neighbourhood of a spot are assigned the binary code corresponding to their relative position in the neighbourhood. These segment codes are arranged in a descriptor matrix. The similarity between two descriptor matrices (one from P and one from Q) can then be used to find the best match for a spot. The paper also describes a measure of goodness of match by comparing so-called directional vectors. Given a number of landmarks in both P and Q, the difference in length and absolute angle of the directional vectors from the point to the nearest landmark is used to determine the goodness of match. In other words, for a match $(p_j, q_k)$ to be good, $p_j$ and $q_k$ must be located in a similar way relative to the same landmarks.
Figure 4.5: Pánek and Vohradský neighbourhood segment description. (a) Wedges. (b) Rings. Segment labels and codes legible in the figure: LNOM (1000), LNEM (1100), LDEM (1110), LDEQ (1111), PDEQ (0111), PDOQ (0011), PNOQ (0001), PNOM (0000); 110, 100, 111, 000. From [64].
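The two stated properties of the segment codes (neighbouring codes differ in one digit, opposite codes share no digit) can be checked on the eight four-bit codes that appear in Fig. 4.5. The sketch below does this in plain Python. The cyclic ordering of the codes is an assumption made here, as the figure itself is not reproduced, and the helper names are hypothetical.

```python
# Four-bit segment codes as printed in Fig. 4.5; the cyclic order is assumed.
SEGMENT_CODES = ["1000", "1100", "1110", "1111", "0111", "0011", "0001", "0000"]


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def neighbour_distances(codes):
    """Hamming distance between each code and its cyclic successor."""
    n = len(codes)
    return [hamming(codes[i], codes[(i + 1) % n]) for i in range(n)]


def opposite_distances(codes):
    """Hamming distance between each code and the code half a turn away."""
    n = len(codes)
    return [hamming(codes[i], codes[(i + n // 2) % n]) for i in range(n)]
```

Under the assumed ordering every neighbour distance is 1 and every opposite distance is 4, so a spot that drifts across a segment border changes its descriptor by a single digit only.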
The method has been implemented and tested in Section 4.5.
Miller, Appel et al.
The work by Miller et al. [60] and Appel et al. [3] is another example of neighbourhood based matching. Spots are matched one by one and the spot in question is called the primary spot. The neighbour spots within a certain distance from the primary spot are termed the secondary spots or cluster of the primary spot. A probabilistic criterion is used to determine the correct match. The probability of two clusters mismatching (matching by chance) is approximated and if this probability is less than 0.01, the two clusters are considered a match. To further reduce the risk of mismatch, the secondary clusters (i.e., the clusters of the secondary spots) are compared.
4.3.2
Graph based matching
Among the approaches using graph matching, Watanabe et al. [88] have reported good results on matching spots in restriction landmark genome scanning (RLGS) electrophoresis gels for the purpose of high-speed genome scanning. Although these are not protein spots, the images have some similarities to the 2DGE images discussed here. They use an elliptic operator to detect the spots in X-ray images. For the matching, a relative neighbourhood graph (RNG) of the reference pattern is matched to a Delaunay net (DN) of the match pattern using a breadth first search progressing from initially given corresponding points.
4.3.3
Successive point matching
In situations where more information than the spot centre location (x, y) is available, it seems natural to exploit that extra information in the matching process. It seems likely that other spot attributes, such as percent integrated intensity (%II) or spot area, can be used to reduce the matching effort and increase the robustness of the matching. This assumes, of course, that corresponding spots, besides having similar (x, y)-coordinates ((pI, MWt) properties), also have similar intensity or area values. Although this should be true for the majority of spot pairs, it is not always the case, in particular not for the interesting marker proteins, and therefore caution must be taken when attribute information is used. The above considerations also apply to the approach in Section 4.2.4. Hoffmann et al. [40] present the idea of ranking spots from both sets according to their integrated intensity and then matching spots with similar rank only. The idea is to match the most important (intense) spots first and then, in an iterative manner, proceed with the matching of less important spots. Please refer to Tab. 4.2 for more references to work by this group.
Discretisation
[40] propose a heuristic method to assign a discrete intensity level to each spot based on its continuous integrated intensity value, i.e., a binning of the continuous integrated intensities into a number ($L_{max}$) of discrete intensity levels, $[1, 2, \ldots, L_{max}]$. Given a set of J spots $\tilde{P} = (P, a^p)$, P is the well-known set of spot centre coordinates and $a^p$ is a vector of J real spot intensity values, one for each spot. Assign a discrete intensity level to each spot so that:
- half of the spots [1] (J/2) are assigned one of the top $L_{max}/2$ discrete levels,
- the top $L_{max}/2$ discrete levels contain equally many spots, and
- the sum of spot intensities is the same in each of the bottom $L_{max}/2$ discrete levels.
[1] Hoffmann et al. [40] set this number to 500 but do not mention the total number of spots. Assigning a fraction, e.g., half, of the spots to the top levels is a more general rule.

The algorithm
The idea behind successive spot matching is to match, in a successive way, spots from $\tilde{P}$ to spots in $\tilde{Q} = (Q, a^q)$ that differ by no more than two discrete intensity levels from the spots in $\tilde{P}$. Let $\tilde{P}_c = \tilde{P}(l = l_c)$ denote the spots in $\tilde{P}$ with discrete intensity level $l = l_c$ (centre level) and $\tilde{Q}_c = \tilde{Q}(l_{min} \le l \le l_{max})$ denote the spots in $\tilde{Q}$ with discrete intensity level $l$ in the range $l_{min}$ to $l_{max}$. The successive matching as proposed in [40] is outlined in Alg. 6.
Algorithm 6 Successive Spot Matching($\tilde{P}_c$, $\tilde{Q}_c$)
1: Initialise $L_{min} = 1$, $L_{max} = 10$.
2: for $l_c = L_{max}$ downto $L_{min}$ do
3:   $l_{min} = \max(L_{min}, l_c - 2)$ and $l_{max} = \min(L_{max}, l_c + 2)$
4:   $\tilde{P}_c = \tilde{P}(l = l_c)$ and $\tilde{Q}_c = \tilde{Q}(l_{min} \le l \le l_{max})$
5:   match $\tilde{P}_c$ to $\tilde{Q}_c$
6:   update global match matrix
7: end for
Alg. 6 is explained in further detail here. Line 1 is the initialisation of the maximum and minimum discrete levels. Lines 2 to 7 form the main loop where the centre level $l_c$ is decreased from $L_{max}$ to $L_{min}$. For the current centre level $l_c$ the lower and upper limits for the discrete levels (used for selection from $\tilde{Q}$) are calculated in line 3. Line 4: $\tilde{P}_c$ is the subset of $\tilde{P}$ with discrete intensity level $l = l_c$ and $\tilde{Q}_c$ is the subset of $\tilde{Q}$ with discrete intensity levels in the range $l_{min}$ to $l_{max}$. The cardinality of $\tilde{Q}_c$ is usually much larger than the cardinality of $\tilde{P}_c$ because $\tilde{P}_c$ contains spots from 1 level only and $\tilde{Q}_c$ contains spots from 3 up to 5 levels. In line 5 the two subsets $\tilde{P}_c$ and $\tilde{Q}_c$ are matched using some matching method robust to a large number of outliers in at least one of the sets, e.g., from the family of Robust Point Matching methods (Section 4.2.4). Instead of an RPM method, [40] use incremental Delaunay triangulation and a 2-step hashing technique to match the spots. In line 6 the global match matrix is updated with the pairs found from the matching of $\tilde{P}_c$ and $\tilde{Q}_c$.
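The level bookkeeping of Alg. 6 (lines 2 to 4) can be sketched in a few lines of Python. The function names `level_windows` and `select_subsets` are hypothetical, and the discrete levels are assumed to be available as one integer per spot; the actual matching in line 5 is left out.

```python
def level_windows(l_min_global=1, l_max_global=10, width=2):
    """Yield (l_c, l_min, l_max) from the top level down (Alg. 6, lines 2-3)."""
    for l_c in range(l_max_global, l_min_global - 1, -1):
        yield (l_c,
               max(l_min_global, l_c - width),
               min(l_max_global, l_c + width))


def select_subsets(levels_p, levels_q, l_c, l_min, l_max):
    """Index lists of the two subsets chosen in Alg. 6, line 4.

    levels_p / levels_q hold the discrete intensity level of every spot.
    """
    idx_p = [j for j, level in enumerate(levels_p) if level == l_c]
    idx_q = [k for k, level in enumerate(levels_q) if l_min <= level <= l_max]
    return idx_p, idx_q
```

For ten levels the windows start at (10, 8, 10) and end at (1, 1, 3), so the subset drawn from the second gel spans three to five levels while the subset from the first gel always spans one.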
Results of discretisation
The discretisation of the point sets has an undesired side effect, namely a number of irrecoverable singles that will never be matched. As mentioned, for each centre level $l_c$, the two point subsets $\tilde{P}_c$ and $\tilde{Q}_c$ are of quite different size. $\tilde{Q}_c$ is chosen to contain points with centre level $l_c \pm 2$ levels in order to ensure that all points in $\tilde{P}_c$ have a partner in $\tilde{Q}_c$. This is, however, not always the case. $\tilde{P}_c$ can have quite a few singles despite the efforts to avoid this. These singles, introduced by the discretisation, can inherently never be matched to their corresponding points in $\tilde{Q}$ by Alg. 6, at least not without some post-processing matching step. A small experiment using point and attribute data from gel 1A and gel 2A as $\tilde{P}$ and $\tilde{Q}$ respectively (see Appendix C.1.1) shows the number of irrecoverable singles in $\tilde{P}_c$ for each centre level $l_c$. Tab. 4.3 shows the discretisation effects of two spot sets $\tilde{P}$ and $\tilde{Q}$ with 1919 and 1918 spots respectively. Note that the number of singles in $\tilde{P}_c$ exceeds 25 at several discretisation levels. The total number of singles in $\tilde{P}_c$ for all levels is 183 or almost 10% of the total number of spots. The results above show that caution should be exercised when matching discretised point sets. The spots representing the extra singles, introduced by the discretisation, are often particularly interesting from a biological point of view because they vary in %II. Such variations in protein levels across different samples can be used for monitoring of disease development (Section 2.1.1) and should not be overlooked. Extensions to the successive point matching approach could include
$l_c$                            10    9    8    7    6    5    4    3    2    1
$l_{min}$                         8    7    6    5    4    3    2    1    1    1
$l_{max}$                        10   10   10    9    8    7    6    5    4    3
# spots in $\tilde{P}_c$        192  192  192  192  192   98  120  145  189  407
# spots in $\tilde{Q}_c$        576  768  960  864  785  731  724  958  862  749
# pairs                         190  188  186  169  161   80   95  128  161  378
# singles in $\tilde{P}_c$        2    4    6   23   31   18   25   17   28   29
# singles in $\tilde{Q}_c$      386  580  774  695  624  651  629  830  701  371
Table 4.3: Results of protein spot set discretisation. Top row: centre discrete intensity level $l_c$. 2nd and 3rd row: lower and upper limit for the discrete intensity level range. 4th row: number of spots in $\tilde{P}_c = \tilde{P}(l_c)$. 5th row: number of spots in $\tilde{Q}_c = \tilde{Q}(l_{min} : l_{max})$. 6th row: number of pairs at discrete intensity level $l_c$. 7th row: number of singles (outliers) in $\tilde{P}_c$. Bottom row: number of singles (outliers) in $\tilde{Q}_c$.
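The figures in Tab. 4.3 are internally consistent, which the short Python check below makes explicit: the singles in each subset equal the subset size minus the number of pairs, and the singles in the first subset sum to the quoted total. The variable names are chosen for this sketch only; the numbers are copied from the table.

```python
# Columns of Tab. 4.3, ordered from l_c = 10 down to l_c = 1.
SPOTS_IN_PC = [192, 192, 192, 192, 192, 98, 120, 145, 189, 407]
SPOTS_IN_QC = [576, 768, 960, 864, 785, 731, 724, 958, 862, 749]
PAIRS = [190, 188, 186, 169, 161, 80, 95, 128, 161, 378]

# A single is a spot in the subset that found no partner at this level.
singles_p = [n - m for n, m in zip(SPOTS_IN_PC, PAIRS)]
singles_q = [n - m for n, m in zip(SPOTS_IN_QC, PAIRS)]

total_singles_p = sum(singles_p)                   # singles over all levels
fraction = total_singles_p / sum(SPOTS_IN_PC)      # share of all spots in P
```

The subset sizes of the first gel add up to 1919, the full spot count, because every spot belongs to exactly one centre level.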
- a post-processing step to catch up on singles introduced by the discretisation,
- exclusion of already matched spots from $\tilde{Q}$. If a spot in $\tilde{Q}$ has been matched to a spot in $\tilde{P}$ at an earlier discretisation level, there is no need to attempt matching of this spot again, and
- use of even more attribute information. It would be desirable to be able to make the attributes discrete in a manner that would leave as few outliers as possible in $\tilde{P}_c$ at every level. Often more attributes than the integrated intensity (actually %II, as used here) are available. Discretisation of a (linear) combination of the attributes available could perhaps reduce the number of singles in the $\tilde{P}_c$'s. Attributes available could be spot area, spot peak intensity, background intensity, percentage integrated intensity, and percentage Gaussian integrated intensity.
4.3.4
Regionalised robust point matching
The transformation to be recovered when matching two point sets can be complex, especially if the point sets are large. Most methods for general point pattern matching described earlier have difficulties matching large point sets with complex distortion. A regionalised approach, as proposed here, divides a large matching problem into several smaller (local) problems and then successively solves the sub-problems. Finally, the individual sub-results are merged to form the result of the large problem.
The regionalisation is a general approach that can be used with any point matching method that is robust to a large number of outliers. Here, the RPM methods from Section 4.2.4 will be applied. The regionalised point matching with RPM-affine was proposed in Pedersen and Ersbøll [65].
Gel regions
Definition 8 (Region) Given a gel p containing protein spots represented by their spatial locations P. A region $\Omega_p$ is a square area inside the borders of gel p. The size of $\Omega_p$ is $(s_p \times s_p)$ and its centre coordinates are $(x^r, y^r)$ relative to the origin of gel p. The $J^r$ points inside the borders of $\Omega_p$ are denoted $P^r$ so that $P^r \subseteq P$. □
Definition 9 (Region Pair) Given two gels p and q containing point sets P and Q respectively. Define a pair of square regions $(\Omega_p, \Omega_q)$, both with centre coordinates $(x^r, y^r)$ relative to their gel origin. $\Omega_p$ is of size $(s_p \times s_p)$ and $\Omega_q$ is of size $(s_q \times s_q)$. The two sub point sets $P^r$ and $Q^r$ are of size $J^r$ and $K^r$ respectively. For a region pair $(\Omega_p, \Omega_q)$, furthermore denote the regional $(J^r + 1) \times (K^r + 1)$ match matrix $m^r$ and the regional transformation $f^r$. □
Note that $\Omega_p$ and $\Omega_q$, from two different gels, can be of different sizes although they are still centred around the same coordinates.
Region sizes
The following contains some general considerations on the choice of region sizes $s_p$ and $s_q$. The size of the square regions should be sufficiently large to ensure a suitable number of points inside the region (enough points to avoid ill-posed problems in the computation of $f^r$). On the other hand, if the region becomes too large, the regional transformation $f^r$ may not be able to capture all the motion in the area covered by the region. For all point matching algorithms the regional computation time depends on the number of points to match, so this is another reason to avoid too large regions.
As mentioned earlier, the regions in the region pair $(\Omega_p^{rc}, \Omega_q^{rc})$ are centred around the same spatial coordinates. Provided that the deformations are not too large, this ensures that the two subsets $P^r$ and $Q^r$ contain a large number of homologous points. With equal region sizes ($s_p = s_q$) both $P^r$ and $Q^r$ can have singles due to the local deformations. These false singles are not real singles in P or Q; they have been introduced by the regionalisation. By choosing different region sizes for $\Omega_p$ and $\Omega_q$ so that $\Omega_q$ is always larger than $\Omega_p$ ($s_q > s_p$), it can be guaranteed that the smaller point set $P^r$ contains no false singles. This will on the other hand introduce a number of extra false singles in $Q^r$. In other words, let
$$s_q = s_p + s_b, \tag{4.20}$$
where the buffer size $s_b > 0$. The choice of letting $\Omega_q$ be larger than $\Omega_p$ is arbitrary. If $s_b$ is selected larger than the largest distance between any two homologous points, it is guaranteed that there are no so-called false singles in $P^r$. The largest distance between any two homologous points cannot be known since the correspondence is unknown, so it must be estimated to select a suitable value of $s_b$. The number of false singles in $Q^r$ can be large depending on the buffer size $s_b$, but a point matching method that is robust to a large number of singles can handle this. Fig. 4.6 shows the principle of the regional point matching. To the top left is gel p with P and a small square marking $\Omega_p$ centred around $(x^r, y^r)$ and similarly, to the top right is gel q with Q and a slightly larger square marking the corresponding region $\Omega_q$, also centred at $(x^r, y^r)$. A zoom of the region in question is shown at the bottom with spots from both gels together. The solid line shows the outline of the small region $\Omega_p$ with $P^r$ (the grey spots) inside, and the dashed line shows the border of the large region $\Omega_q$ with $Q^r$ (the white spots) inside. Homologous spots in $P^r$ and $Q^r$ are connected with lines.
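The effect of the buffer in (4.20) can be illustrated with a small Python sketch that extracts the two sub point sets of a region pair. The helper names `points_in_square` and `region_pair` are hypothetical, points are assumed to be (x, y) tuples, and the region border is treated as inclusive.

```python
def points_in_square(points, centre, size):
    """Indices of the points inside an axis-aligned square of side `size`."""
    cx, cy = centre
    half = size / 2.0
    return [i for i, (x, y) in enumerate(points)
            if abs(x - cx) <= half and abs(y - cy) <= half]


def region_pair(points_p, points_q, centre, s_p, s_b):
    """Index lists of the sub point sets for a region pair, s_q = s_p + s_b."""
    return (points_in_square(points_p, centre, s_p),
            points_in_square(points_q, centre, s_p + s_b))
```

In this sketch the buffer is split evenly around the region, so each side of the larger square lies `s_b / 2` outside the smaller one. A spot whose partner has drifted just across the border of the small region is then still present in the larger region instead of becoming a false single.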
Overlapping regions
Having specified how to define a region pair $(\Omega_p, \Omega_q)$, one remaining question is how to apply the proposed regional point matching to the entire gel areas of p and q. The ideal, but computationally unsuitable, way would be to slide the region pair around the gel area and then solve the matching problem for all possible positions of the region centres. This would reduce possible side effects of dividing the problem into sub-problems, but it would also result in an immense and often redundant number of regional matches.
Figure 4.6: Principle of the regional point/spot matching. To the top left is gel p with P and a $(s_p \times s_p)$ square marking $\Omega_p$ centred at $(x^r, y^r)$. To the top right is gel q with Q and a $((s_p + s_b) \times (s_p + s_b))$ square marking the corresponding region $\Omega_q$, also centred at $(x^r, y^r)$. At the bottom a zoom of the region in question is shown. The solid line shows the small region $\Omega_p$ with $P^r$ (the grey spots) inside, and the dashed line shows the border of the large region $\Omega_q$ with $Q^r$ (the white spots) inside. The lines connect homologous spots.
It is therefore proposed to position the region centres on a regular equidistant grid so that neighbour regions overlap with a certain ratio. Now, for a moment, consider a single gel.
Definition 10 (Regionalisation) The division of an area into a set of square (possibly overlapping) regions $\{\Omega^{rc}\}$ with the horizontal and vertical distance between neighbour region centres $d_c$ is a regionalisation of the area. $\Omega^{rc}$ is the specific region at the grid position (r, c), i.e., in row r and column c of the grid (see also Fig. 4.9). □
Specification of the region overlap
For $d_c \ge s_p$, where $s_p$ is the region size, there is no overlap between neighbouring regions. This is of course not valid if all points (the entire gel area) need to be processed. For $d_c \to \epsilon$, $\epsilon > 0$, the situation with a sliding region is approached, but this extreme is computationally unsuitable. Fig. 4.7 defines the distances.
Figure 4.7: Neighbouring regions. Main region (bold square) with right nearest neighbour region (dashed square). Region size $s_p$, neighbour region distance $d_c$, and overlap distance $d_o$. The region overlap is shaded.
With the region overlap distance
$$d_o = \begin{cases} s_p - d_c & \text{if } d_c \in [0; s_p] \\ 0 & \text{otherwise} \end{cases} \tag{4.21}$$
the neighbour region overlap is defined:
Definition 11 (Region overlap ratio) Given a square region of region size $s_p$. The region overlap ratio is the fraction of the region area overlapped by each of the four nearest neighbour regions. □
The region overlap ratio can be expressed in simple terms of the region overlap distance as:
$$o = \frac{d_o\, s_p}{s_p\, s_p} \cdot 100\% = \frac{d_o}{s_p} \cdot 100\%. \tag{4.22}$$
Fig. 4.8 shows the degree of overlap for different values of o. The grey areas are the parts of the centre region overlapped by its four nearest neighbour regions. To the right is the case of o = 50%. For this particular choice of overlap each point will be processed 4 times.
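Equations (4.21) and (4.22) translate directly into two small Python functions. The function names are chosen for this sketch; the arguments are the region size and the distance between neighbouring region centres, in the same units.

```python
def overlap_distance(s_p, d_c):
    """Region overlap distance d_o of (4.21)."""
    return s_p - d_c if 0 <= d_c <= s_p else 0


def overlap_ratio(s_p, d_c):
    """Region overlap ratio o of (4.22), in percent."""
    return 100.0 * overlap_distance(s_p, d_c) / s_p
```

For example, a region size of 128 with centres 64 apart gives an overlap ratio of 50%, the setting in which every point is covered by four regions.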
Figure 4.8: Overlapping neighbour regions for region overlap ratios of (a) 10%, (b) 30%, and (c) 50%. Centre region (bold line square) shown with eight nearest neighbour regions (dashed lines). For (a), (b), and (c) the region size is the same. Areas inside the centre region are colour-coded. White area: no neighbour regions overlap, so only the centre region covers this area. Light grey: one neighbour region overlaps this area, i.e., a total of two regions cover this area. Dark grey: three neighbour regions overlap this area. Including the centre region, four regions cover this area. For the rightmost example where o = 50% the entire centre region is covered by four regions.
The algorithm
The regionalisation is a general approach that can be used with any point matching method that is robust to a large number of outliers. The robust point matching (RPM) methods presented earlier (Section 4.2.4) fulfil this requirement. The basic idea is that, equipped with the RPM methods, the matching problem can be solved for each of the region pairs $(\Omega_p^{rc}, \Omega_q^{rc})$ in the grid, after which the regional correspondence information (the match matrices $m^{rc}$) is gathered into a global match matrix $m$. In the following a regular grid with equidistant ($d_c$) nodes (region centres) and overlap ratio o = 50% is assumed. Let R be the number of rows in the grid and let C denote the number of columns. Fig. 4.9 shows the regionalisation grid for this configuration. $(\Omega_p^{rc}, \Omega_q^{rc})$ denotes the region pair at position (r, c) in the grid. Furthermore, the region size for $\Omega_p$ is $s_p$. The region size for $\Omega_q$ is somewhat larger: $s_q = s_p + s_b$. The Regionalised RPM(p, q, P, Q, $T_0$, $s_p$, $s_q$, o, R, C) algorithm is outlined in Alg. 7. The global match matrix $m$ is initialised to the zero matrix and the two gels p and q are divided into regions. For all positions on the grid the points in the region pairs $(\Omega_p^{rc}, \Omega_q^{rc})$ are matched using a Robust Point Matching method. A simple voting strategy is used to transfer the regional matches ($m^{rc}$) to the global match matrix $m$. This is done by adding the entries in $m^{rc}$ to the corresponding entries in $m$. Because of the chosen overlap ratio o = 50% each point enters the matching process four times in total. $m^{rc}$ has real entries in [0, 1], so the global match matrix $m$ will, after matching of all region pairs, contain entries in [0, 4].
Algorithm 7 Regionalised RPM(p, q, P, Q, $T_0$, $s_p$, $s_q$, o, R, C)
1: Initialise $m \leftarrow 0$
2: Regionalise p $\rightarrow$ $(R \times C)$ grid of $(s_p \times s_p)$ square regions
3: Regionalise q $\rightarrow$ $(R \times C)$ grid of $(s_q \times s_q)$ square regions
4: for all r such that $1 \le r \le R$ do
5:   for all c such that $1 \le c \le C$ do (match region pair $(\Omega_p^{rc}, \Omega_q^{rc})$)
6:     Extract $P^{rc}$ and $Q^{rc}$
7:     $[m^{rc}, f^{rc}] \leftarrow$ Robust-Point-Matching($P^{rc}$, $Q^{rc}$, $T_0$)
8:     update $m$ with $m^{rc}$
9:   end for
10: end for
11: return $m$
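The voting step of Alg. 7 can be sketched as follows, with the regional matcher passed in as a callback so that any robust method can be plugged in. This is an illustrative outline rather than the implementation behind the reported results: `regionalised_matching`, `grid_centres`, the `match_region` callback, and the grid origin are all assumptions, and the outlier row and column of the regional match matrices are omitted for brevity.

```python
def grid_centres(rows, cols, d_c, origin=(0.0, 0.0)):
    """Region centres on a regular grid with spacing d_c (origin is assumed)."""
    return [(origin[0] + c * d_c, origin[1] + r * d_c)
            for r in range(rows) for c in range(cols)]


def regionalised_matching(points_p, points_q, centres, s_p, s_b, match_region):
    """Accumulate regional match matrices into a global one by voting.

    match_region(sub_p, sub_q) is any robust point matcher (hypothetical
    callback) returning a len(sub_p) x len(sub_q) matrix with entries in [0, 1].
    """
    def inside(points, centre, size):
        cx, cy = centre
        return [i for i, (x, y) in enumerate(points)
                if abs(x - cx) <= size / 2 and abs(y - cy) <= size / 2]

    m = [[0.0] * len(points_q) for _ in points_p]    # global match matrix
    for centre in centres:
        idx_p = inside(points_p, centre, s_p)         # points of the small region
        idx_q = inside(points_q, centre, s_p + s_b)   # points of the buffered region
        if not idx_p or not idx_q:
            continue                                  # nothing to match here
        m_rc = match_region([points_p[j] for j in idx_p],
                            [points_q[k] for k in idx_q])
        for a, j in enumerate(idx_p):                 # vote: add regional entries
            for b, k in enumerate(idx_q):
                m[j][k] += m_rc[a][b]
    return m
```

Because the votes are simply summed, an entry of the global matrix states in how many regions, weighted by the regional match values, a pair was supported; with 50% overlap the maximum is four.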
Choice of transformation and region size
The choice of transformation in the RPM method and the choice of region size are related. In general, the affine method is faster, but of course not as flexible as the thin-plate spline based method. The affine transformation (Section 4.2.4) cannot capture complex local deformations and therefore the region size should not be too large. On a 2048 × 2048 gel pair with approximately 1900 points in each point set, experiments with real spot location data (Appendix C.2) have shown good results setting $s_p = 128$. The thin-plate spline transformation (Section 4.2.4) can handle almost arbitrarily complex local deformations, so the region size does not matter in this case. However, the computational cost can become very high for large point sets and this limits the region size. Matching results with the same data as mentioned above have indicated that $s_p = 256$ is reasonable for this method. The regional robust point matching approach has been applied to matching of 2D gel electrophoresis spot location centres using the two different transformations for the motion, namely the affine transformation and the thin-plate spline transformation. The results of these experiments can be found in Section 4.5.
Figure 4.9: Regionalisation grid, o = 50%. The region in the 6th row and the 4th column, $\Omega^{64}$, is marked with bold lines. The gel p has been regionalised into $R \times C$ regions. For most regions, only region centres are marked, and for the region $\Omega^{64}$ the region borders are shown.
4.4
Match Evaluation
The matching result, i.e., the correspondence estimate of two point sets P and Q, is defined by the match matrix m. This section describes an evaluation scheme of the match matrix, given the ground truth correspondence. The ground truth correspondence is defined by the matrix M, usually established by an operator. For the match methods returning a fuzzy match matrix, an initial binarization of this matrix is necessary. Given the fuzzy match matrix and a threshold value, Alg. 8 provides a method to do the binarization so that draws are resolved. Values in the fuzzy matrix equal to or above the threshold will result in a 1 in the binary matrix and values below will result in a 0. Clearly, the threshold value is important. High threshold values result in a conservative binarization where, in cases of doubt, it is preferred to mark a point as an outlier instead of risking a false match. For low threshold values a number of draws can occur in the binary matrix. If more than one element in a row or a column is 1, a draw has occurred. Alg. 8 resolves the draws in a conservative manner to avoid false matches, i.e., if in doubt, it is preferred to classify a point as an outlier. The evaluation of the match results is a comparison of the two binary match matrices, the binarized m and M. For a point, 4 natural types (or classes) of match result can be identified:
T1 A pair point correctly matched to another pair point.
T2 A single point correctly detected as single.
T3 A pair point or a single incorrectly matched with another pair point or with a single.
T4 A pair point incorrectly detected as single.
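A conservative binarization of the kind described above could look like the Python sketch below. Alg. 8 itself is not reproduced in this section, so this is not that algorithm: the function name `binarise` and the rule of discarding every candidate in a row or column that holds a draw are assumptions chosen to match the stated preference for outliers over false matches.

```python
def binarise(m, threshold):
    """Threshold a fuzzy match matrix, then resolve draws conservatively.

    m is a list of rows with entries in [0, 1]; the result holds 0/1 entries
    with at most one 1 per row and per column.
    """
    b = [[1 if v >= threshold else 0 for v in row] for row in m]
    rows, cols = len(b), len(b[0])
    # A draw is a row or a column holding more than one 1.
    draw_rows = [j for j in range(rows) if sum(b[j]) > 1]
    draw_cols = [k for k in range(cols)
                 if sum(b[j][k] for j in range(rows)) > 1]
    for j in draw_rows:          # drop every candidate in a drawn row
        b[j] = [0] * cols
    for k in draw_cols:          # and every candidate in a drawn column
        for j in range(rows):
            b[j][k] = 0
    return b
```

Raising the threshold reduces the number of draws in the first place, at the cost of turning more weak but correct matches into singles.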
A point belongs to exactly one of these types. T1 and T2 are successful results and T3 and T4 are faults. For both point sets P and Q, the number of points classified as each type is counted, $T_1$, $T_2$, etc., and these counts are used to define 4 scores to evaluate the results:

\[ S_1 = \frac{T_1 + T_2}{T_1 + T_2 + T_3 + T_4} \tag{4.23} \]
\[ S_2 = \frac{T_1 + T_2 - T_3}{T_1 + T_2 + T_3 + T_4} \tag{4.24} \]
\[ S_3 = \frac{T_3}{T_1 + T_2 + T_3 + T_4} \tag{4.25} \]
\[ S_4 = \frac{T_4}{T_1 + T_2 + T_3 + T_4} \tag{4.26} \]

$S_1$ is the fraction of correct matches and singles, and $S_2$ is the same but adjusted by subtracting the number of serious faults, $T_3$, in the numerator. $S_3$ is the fraction of incorrect pairs and $S_4$ is the fraction of incorrect singles.
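As a minimal sketch of how the type counts and the scores (4.23)-(4.26) could be computed for one point set, the Python below classifies each point and evaluates the four scores. The encoding of a correspondence as a list of partner indices (with None for a single) and the function names `classify` and `scores` are assumptions made for illustration; they are not taken from the thesis.

```python
from collections import Counter

def classify(pred, truth):
    """Return the type (1..4) of every point in one point set.

    pred[i] and truth[i] hold the index of the partner of point i in the
    other set, or None if the point is a single (assumed encoding).
    """
    types = []
    for est, ref in zip(pred, truth):
        if est is None:
            # T2: correctly detected single, T4: pair point declared single
            types.append(2 if ref is None else 4)
        else:
            # T1: correct match, T3: matched with the wrong point
            types.append(1 if est == ref else 3)
    return types

def scores(t1, t2, t3, t4):
    """Scores S1..S4 of Eqs. (4.23)-(4.26) from the four type counts."""
    total = t1 + t2 + t3 + t4
    return ((t1 + t2) / total,
            (t1 + t2 - t3) / total,
            t3 / total,
            t4 / total)

# Toy example: one point of each type.
counts = Counter(classify([0, 2, None, None], [0, 1, 2, None]))
print(counts[1], counts[2], counts[3], counts[4])

# Type counts of method M1, point set P, as listed in Tab. 4.5.
print([round(s, 4) for s in scores(1827, 1, 4, 87)])
```

With the M1 counts for point set P from Tab. 4.5, the last line gives 0.9526, 0.9505, 0.0021 and 0.0453, the values listed in that table.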
For a matching to be successful, S1 and S2 should be as high as possible, and S3 and S4 should be as small as possible.
4.5
This section presents experiments and results of different point pattern matching methods applied to a number of electrophoresis gel data sets. The methods are:

M1  Pánek and Vohradský's method [64],
M2  regionalised RPM (affine transformation). Region size $s_p$ = 128 and buffer size $s_b$ = 32 (for RPM parameters, see §4.2.4, p. 92), and
M3  regionalised RPM (thin-plate spline transformation). Region size $s_p$ = 256 and buffer size $s_b$ = 32 (for RPM parameters, see §4.2.4, p. 94),

all methods previously described in this chapter. The data set is described in Appendix C and the 15 experiment combinations (gel pairs) specified in Tab. 4.4 have been used. The correspondence has been established by a skilled operator for all combinations of gels. This correspondence is denoted the ground truth correspondence, although there might be a few errors due to human mistakes.
Table 4.4: Experiment specification. Rows and columns: gels A, B, and C of Group 1, 2, and 3. The 15 matching experiments are marked with x, e.g., gel 1A vs. gel 1B, gel 1A vs. gel 2A, etc.
The gel pair gel 1A vs. gel 2A will be used as an example in the following, and Fig. 4.10 shows the gel images with known spot centres (P and Q) overlaid. The known correspondence (the expert established correspondence) is shown in Fig. 4.11. The aim is to recover this correspondence using the point pattern matching methods M1, M2, and M3.
Figure 4.10: Gel images with known spot centres overlaid as points. Left: gel A (1919 spots), right: gel B (1918 spots). 1918 spots in common, which means that one spot present in gel A is missing in gel B.
4.5.1
The trade-off resulting from binarization
The match matrix binarization introduces an important trade-off in the evaluation, which will be studied in the following. For M1, the match matrix is already binary and no binarization is needed. For the regionalised matching methods, M2 and M3, an overlap ratio o = 50% has been used in the following experiments. In §4.3.4 it is shown that with this choice of overlap ratio M2 and M3 return match matrices with real elements in [0, 4]. For these methods, it is necessary to choose a binarization threshold. Fig. 4.12 shows how the M2 match scores vary for different values of the binarization threshold. These are the results of matching gel 1A with gel 2A using the regionalised robust point matching with affine transformation, M2. The higher the threshold, the more trustworthy the matches of the point pairs. As a consequence, the number of correct matches decreases as the threshold is increased, but so does the number of false (incorrect) matches. In other words, there is a trade-off between having many correct matches and having few false matches. If no (or very few) incorrect matches can be tolerated, the threshold has to be chosen large. This behaviour is common for matching methods returning fuzzy match matrices.
Figure 4.11: Known correspondence between P and Q (for the gels shown in Fig. 4.10).
Figure 4.12: Binarization effect. M2 scores for point set P at different levels of the binarization threshold. Left: Solid line: S1, dotted line: S2. Right: Solid line: S3, dotted line: S4. The scores S1 and S2 decline as the threshold increases. S3 (false pair points) becomes very small as the threshold increases, at the expense of more false singles (S4).
Tab. 4.5 shows, together with the ground truth (GT), the type counts and the scores for matching gel 1A with gel 2A using M1, M2, and M3. The binarization threshold has been set very liberally to 0.5. The similar Tab. 4.6 shows a more conservative choice of binarization thresholds: 3.5 for M2 and 2.5 for M3. The counts and scores for M1 are the same in both tables because this method needs no binarization of the match matrix.
Point set P
      T1    T2  T3  T4   check  S1      S2      S3      S4
GT    1918  1   -   -    1919   1       1       0       0
M1    1827  1   4   87   1919   0.9526  0.9505  0.0021  0.0453
M2    1853  1   18  47   1919   0.9661  0.9567  0.0094  0.0245
M3    1892  1   11  15   1919   0.9865  0.9807  0.0057  0.0078

Point set Q
      T1    T2  T3  T4   check  S1      S2      S3      S4
GT    1918  0   -   -    1918   1       1       0       0
M1    1827  0   4   87   1918   0.9526  0.9505  0.0021  0.0454
M2    1853  0   18  47   1918   0.9661  0.9567  0.0094  0.0245
M3    1892  0   11  15   1918   0.9864  0.9807  0.0057  0.0078

Table 4.5: Evaluation of match result. Gel pair 1A vs. 2A. For M2 and M3, the binarization threshold is 0.5. GT is the ground truth.
Point set P
      T1    T2  T3  T4   check  S1      S2      S3      S4
GT    1918  1   -   -    1919   1       1       0       0
M1    1827  1   4   87   1919   0.9526  0.9505  0.0021  0.0453
M2    1782  1   8   128  1919   0.9291  0.9250  0.0044  0.0667
M3    1859  1   5   54   1919   0.9693  0.9666  0.0026  0.0281

Point set Q
      T1    T2  T3  T4   check  S1      S2      S3      S4
GT    1918  0   -   -    1918   1       1       0       0
M1    1827  0   4   87   1918   0.9526  0.9505  0.0021  0.0454
M2    1782  0   8   128  1918   0.9291  0.9249  0.0042  0.0667
M3    1859  0   5   54   1918   0.9692  0.9666  0.0026  0.0282

Table 4.6: Evaluation of match result. Gel pair 1A vs. 2A. The binarization threshold is 3.5 for M2 and 2.5 for M3. GT is the ground truth.
4.5.2
Method comparison
To compare the performance of the three point matching methods, Fig. 4.13 presents the matching scores for each method in the matching of all 15 experiment pairs. The binarization threshold has been chosen individually for the two RPM methods: 3.5 for M2 and 2.7 for M3. Tab. 4.7 shows the average scores across all 15 experiment pairs.
      M1      M2      M3
S1    0.9502  0.9197  0.9610
S2    0.9470  0.9134  0.9589
S3    0.0032  0.0063  0.0021
S4    0.0467  0.0740  0.0369

Table 4.7: Average scores across 15 experiment pairs.
In general, M3 seems to outperform both other methods. It has the highest S1 score in 11 of 15 experiments and the lowest S3 score in 12 of 15 experiments. Considering the false matches (S3), M1 is lowest in 3 of 15 experiments, and the method results in a large number of wrong singles, yielding weaker S1 and S4 scores. M2 exhibits the poorest performance in all of these experiments. Note, however, that except for M2 in two cases, none of the methods has more than 1% false matches (S3), and in the main part of the experiments M1 and M3 have less than 0.4% false matches.
4.5.3
Error locations
Figs. 4.14-4.16 show the spatial location of the errors corresponding to the example results in Tab. 4.5. Correct matches (T1) are marked with cyan (grey) lines and missing matches (T4) are shown with blue (black) lines. Wrongly matched points (T3) are drawn with black (black) lines.
Accumulated errors location
The spatial locations of the errors across all 15 experiments have been recorded for each method. The plot in Fig. 4.17 shows the location of the false matches from all 15 experiments for M1 in the same plot. Figs. 4.18-4.19 show the similar plots for the M2 and M3 methods, respectively. The general trend is that the methods seem to fail more often in the border areas of the gels, where geometrical distortions are largest. The three plots have been plotted together in Fig. 4.20. All methods seem to have problems with the same five groups of spots near the right edge of the gels.
Figure 4.13: Test scores for M1, M2 (binarization threshold 3.5), and M3 (binarization threshold 2.7) on all 15 experiment pairs. Panels: S1, S2, S3, and S4, each plotted against the experiment pair number.
Figure 4.14: Error locations. Method M1 . The cyan (grey) lines with no end markers show the correct matches (T1 ). The blue (black) lines with no end markers are the missing matches (T4 ), i.e., where the algorithm should have found a correspondence. The black (black) lines with markers (stars) at the ends show the wrong matches (T3 ).
Figure 4.15: Error locations. Method M2, binarization threshold 0.5. The cyan (grey) lines with no end markers show the correct matches (T1). The blue (black) lines with no end markers are the missing matches (T4), i.e., where the algorithm should have found a correspondence. The black (black) lines with markers (stars) at the ends show the wrong matches (T3).
Figure 4.16: Error locations. Method M3, binarization threshold 0.5. The cyan (grey) lines with no end markers show the correct matches (T1). The blue (black) lines with no end markers are the missing matches (T4), i.e., where the algorithm should have found a correspondence. The black (black) lines with markers (stars) at the ends show the wrong matches (T3).
Figure 4.17: Spatial location of errors in all experiments, M1 .
Figure 4.18: Spatial location of errors in all experiments, M2 (binarization threshold 3.5).
Figure 4.19: Spatial location of errors in all experiments, M3 (binarization threshold 2.7).
Figure 4.20: Spatial location of errors in all experiments. M1 (+), M2 (o), and M3 (x).
4.6
Summary
After having provided the notation necessary to describe correspondence, match matrices, and motion estimation, a range of general point pattern matching methods was described. A group of these, namely the family of robust point matching (RPM) methods, was explained in detail. The general RPM method was extended by a modification of Sinkhorn's matrix normalisation algorithm to improve the robustness of the method. After the detailed description of the special cases of affine and thin-plate spline transformations, the energy function of the latter was extended to take extra point attribute information into account. The nature of protein spot patterns requires special properties of the point matching algorithm (§2.3), but not many of the methods in the literature fulfil these requirements. The robust point matching method with thin-plate spline transformation has been identified as capable of fulfilling all requirements. The most important existing methods for point matching of 2DGE spot patterns were presented and a new regionalised approach was proposed. This new approach was based on the RPM matching methods, but any point matching method that is robust to a large number of outliers/singles could be used. It was shown that, although the idea is appealing, care should be taken if using successive matching of the spots, where the most intense spots are matched together first and so on. This is due to the quite dynamic behaviour of the protein expression levels between samples. The variability of a protein's expression level can be so large that too much weight should not be put on the spot intensities when matching the spots. A method proposed by Pánek and Vohradský [64] and two variants of the regionalised RPM methods were applied to 15 combinations of real 2DGE protein spot data. The matching results of the experiments demonstrated the superiority of the regional RPM method using the thin-plate spline (TPS) transformation.
The average S1 score (correct matches) for this method was 0.9610 (96.1%) with an average S3 score (false matches) of only 0.0021 (0.21%). An average of 3.7% of the spots were declared too difficult to match and falsely classified as singles. The second method tested, M2, was the poorest performing of the three methods. This method used regionalised RPM with affine transformation and, since the same method using TPS transformation had quite good performance, the results indicate that the affine transformation was not able to properly capture the motion (deformation) within each region, even though the region size $s_p$ was only half the region size for the M3 method. By further lowering the region size, the local deformation inside each region should approach the affine transformation, which might improve the performance of this method. Dependent on an appropriate choice of binarization threshold for the fuzzy match matrix, the regionalised RPM-TPS (M3) was able to match 98.7% of the points correctly while making only 0.6% false matches in the matching of two typical sets of points with 1900+ points. The remaining fraction was false singles. By lowering the binarization threshold for M3, the fraction of correctly matched pairs could be increased at the expense of more false matches, or by raising the threshold false matches could be avoided at the expense of fewer correct matches and more false singles. From a practical viewpoint it is better to have a few extra false singles, which are easier to find by eye, than a few incorrect matches, which are hidden by all the other matches. The spatial location of errors was investigated and it was shown that the methods failed more often in the edge areas of the gel, where distortions are known to be large and more complex. For the regional methods, this indicates that the buffer size $s_b$ = 32 pixels may have been chosen too small.
Chapter 5
Elastic Graph Matching
This chapter is confidential and is omitted from this edition of the thesis.
Chapter 6
Conclusion
This thesis addresses the main issues in pattern analysis of two-dimensional gel electrophoresis data. The two-dimensional electrophoresis (2DGE) technique, used for protein separation in proteome analysis, results in grey level intensity images showing thousands of proteins more or less focused in spots. In the analysis of the 2DGE images, a number of issues have been identified and they divide naturally into two groups: issues in image segmentation and issues in protein spot matching. Segmentation of the 2DGE images was shown to be a non-trivial task because of the relatively large number of weak spots, the high spatial density of spots, the large number of overlapping spots, and the varying background across the images. A number of different approaches to the segmentation of 2DGE images were demonstrated. H-domes and marker based watershed segmentation, both methods from mathematical morphology, were applied to the segmentation of the gel images. The h-domes technique experienced difficulties segmenting overlapping spots.

When a scale space blob detector was used to define the markers for the marker based watershed segmentation, reasonable results were obtained without the notorious over-segmentation from the watershed segmentation. Regarding the noisy nature of the 2DGE images, an experiment on a small gel region with known locations of protein spots yielded a remarkable result: the image contained non-spot blobs stronger than the weakest known spots, and, at the (known) location of some very weak spots, no local minima could be detected. Parametric modelling of protein spots using a diffusion based spot model gave promising results for relatively isolated protein spots. However, for the large number of spots with close neighbours, there is a need for a mixture spot model taking several spots into account. The outlines of such a model were proposed.

The main focus of the thesis was the protein spot matching. For this purpose, a number of point matching methods for general point matching and methods designed specifically for matching patterns of protein spots were investigated. A successful matching of two point patterns consists of establishing the correct point correspondence between all point pairs in the sets and identifying points that have no homologous point in the other set. Based on the complex nature of the protein spot patterns, five requirements for point matching methods to solve the correspondence problem were identified. An ideal point matching method must: exactly and robustly match protein pairs, allow for non-linear distortions/transformations, robustly handle outliers in both sets, be able to handle point sets of stochastic/amorphous nature, and robustly match dense point sets. The only methods found in the literature on general point pattern matching that are capable of satisfying all requirements are the family of robust point matching (RPM) methods. These methods were extended and embedded in a regionalised setting to improve robustness, speed and flexibility.
Two variants of regionalised robust point matching (affine and thin-plate spline) were tested on real 2DGE spot data and, among the matching methods for 2DGE protein spot patterns, the most promising method [64] was implemented to provide a level of comparison. In 15 experiments with matching different pairs of 2DGE spot sets, the regionalised RPM-TPS method proposed here proved superior to the other methods tested, with average correct matching of 96.1% of the protein pairs and 0.2% wrong matches. By lowering the binarization threshold, the fraction of correctly matched pairs could be increased at the expense of more false matches, or by raising the threshold false matches could be avoided at the expense of fewer correct matches and more false singles. By examination of the spatial location of the errors it was shown that the methods tested failed more often in the edge areas of the gels. This could be due to the larger distortion typically found near the gel edges.

As an alternative to the traditional sequential approach to analysis of the 2DGE gels, namely image segmentation followed by spot matching, a new unified approach was suggested. The elastic graph matching (EGM) exploits prior knowledge of the spot configuration (from the reference image) and the image information available in the incoming gel (the match image). This method has not been extended to its full potential, but the experiments conducted showed promising results and a wide range of suggestions to improve on the method was given.

The main conclusion from this work is that the task of analysing 2DGE images in a robust and objective way is far from trivial, but the methods developed here will most likely provide more robust and objective means in the analysis. Also, the sequential approach based on image segmentation succeeded by spot matching is probably not the most suitable approach, although it is the most common in commercially available software packages, mainly because both tasks are performed without using all information available. A unified approach, on the other hand, such as the one suggested here (EGM), has the potential to significantly improve the image analysis / pattern recognition part in proteomics and thereby reduce time and labour costs in this part of the process.
Appendix A
Thin-Plate Spline Transformation
The flexibility of the thin-plate spline (TPS) transformation makes it suitable to model the complex disparity patterns of protein spots in two-dimensional electrophoresis gels (2DGE). The literature on the thin-plate spline transformation and its applications in image analysis is vast and the central references include [15, 84] with additional and alternative descriptions in [29, 37, 45, 71, 78]. We will show some properties of the thin-plate spline in two dimensions, but references will be given to extensions to higher dimensions. The thin-plate spline functional is an extension of the cubic spline in 1 dimension. Consider two point sets $P = \{p_i\}$, $i = 1, \ldots, n$, $p_i = ({}^{p}x_1(i), {}^{p}x_2(i))$ and $Q = \{q_j\}$, $j = 1, \ldots, n$, $q_j = ({}^{q}x_1(j), {}^{q}x_2(j))$ of n corresponding points in the two-dimensional space $\mathbb{R}^2$ (D = 2). By correspondence we mean that the point sets are pairwise homologous, i.e. $p_k \leftrightarrow q_k$, $\forall k \in 1 \ldots n$. In the literature these points have multiple names: landmarks, fiducial markers etc., dependent on the application. In medical imaging these points often correspond to some anatomical feature in the image, and in aerial imagery geographical landmarks are common landmark points. The centres of distinct and significant protein spots can be used as landmarks in images of electrophoresis gels.
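In homogeneous coordinates we denote

\[
p = \begin{bmatrix}
1 & {}^{p}x_1(1) & {}^{p}x_2(1) \\
1 & {}^{p}x_1(2) & {}^{p}x_2(2) \\
\vdots & \vdots & \vdots \\
1 & {}^{p}x_1(n) & {}^{p}x_2(n)
\end{bmatrix}
\quad \text{and} \quad
q = \begin{bmatrix}
1 & {}^{q}x_1(1) & {}^{q}x_2(1) \\
1 & {}^{q}x_1(2) & {}^{q}x_2(2) \\
\vdots & \vdots & \vdots \\
1 & {}^{q}x_1(n) & {}^{q}x_2(n)
\end{bmatrix}.
\tag{A.1}
\]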
In mapping of landmarks describing biological shapes the thin-plate spline (TPS) transformation has proven useful [15]. The mapping can be formulated as a variational problem where we wish to find a function $f : \mathbb{R}^2 \to \mathbb{R}^2$ that minimises:

\[ E_{TPS} = \frac{1}{n} \sum_{i=1}^{n} \| p_i - f(q_i) \|^2 + \lambda J_m^D(f), \tag{A.2} \]

where $f(q_i)$ is the thin-plate spline transformation (or warp) of the points in Q. $J_m^D$ is the general form of the thin-plate penalty functional as defined in [84]. In two dimensions (D = 2) and using derivatives of degree m = 2,

\[ J_2^2 = J = \iint_{\mathbb{R}^2} \left[ \left( \frac{\partial^2 f}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f}{\partial x_2^2} \right)^2 \right] dx_1 \, dx_2. \tag{A.3} \]

By minimising the first term in (A.2) the points in Q are mapped as closely as possible to their corresponding points in P. Including the second term in the minimisation imposes a limit on the second partial derivatives of f, i.e., a smoothness constraint on the bending energy of the spline. $\lambda \in \mathbb{R}_+$ is introduced to regularise the problem. For $\lambda \to 0$ there is no penalty on the bending energy of the spline and f can behave freely, i.e., in the interpolating form. For increasing values of $\lambda$, f becomes gradually more smooth.
Writing

\[ t = (x_1, x_2), \qquad t_i = (x_1(i), x_2(i)) \tag{A.4} \]

and, in homogeneous coordinates,

\[ \phi_1(t) = 1, \qquad \phi_2(t) = x_1, \qquad \phi_3(t) = x_2, \tag{A.5} \]

it can be shown [84, 29] that if $t_1, \ldots, t_n$ are such that least squares regression on $\phi_1, \ldots, \phi_3$ is unique, then the unique minimiser f of (A.2) is

\[ f(t) = \sum_{\nu=1}^{3} d_\nu \phi_\nu(t) + \sum_{i=1}^{n} c_i E(t, t_i), \tag{A.6} \]
where $E(t, t_i) = E(\tau)$ is the Green's function or the thin-plate spline kernel. $\tau = |t - t_i|$ is the Euclidean distance between a point t and the landmark point $t_i$. In two dimensions

\[ E(t_i, t_j) = E(\tau) = \frac{1}{16\pi} |\tau|^2 \ln|\tau|. \tag{A.7} \]

In (A.6) the unknowns are $d = (d_1, \ldots, d_3)$ and $c = (c_1, \ldots, c_n)$. Substitution of (A.6) into (A.2) results in minimisation of

\[ E_{TPS}(c, d) = \frac{1}{n} \| p - q d - K c \|^2 + \lambda c^T K c \tag{A.8} \]

subject to $q^T c = 0$. Here, the points in P and Q have been arranged in n × 3 matrices p and q (homogeneous coordinates), d is a 3 × 3 matrix holding the affine parameters of the transformation, and c is an n × 3 matrix with the parameters specifying the non-affine part of the transformation. K is an n × n matrix of thin-plate spline kernel values, i.e. the ij-th entry is $K(i, j) = E(t_i, t_j)$. $\hat{c}$ and $\hat{d}$ are the minimisers of (A.8) and in order to find them a QR-decomposition of q is used:

\[ q = [q_1 : q_2] \begin{bmatrix} r \\ 0 \end{bmatrix}. \tag{A.9} \]

The QR-decomposition results in the two matrices $[q_1 : q_2]$ and r. The first is orthogonal and r is a 3 × 3 upper triangular matrix. $q_1$ is n × 3 and $q_2$ is n × (n − 3). Setting $q_2 \gamma = c$, substituting into (A.8) and using $q_2^T q = 0$ yields:

\begin{align}
E_{TPS} &= \frac{1}{n} \left\| p - [q_1 : q_2] \begin{bmatrix} r \\ 0 \end{bmatrix} d - K q_2 \gamma \right\|^2 \tag{A.10} \\
&\quad + \lambda \gamma^T q_2^T K q_2 \gamma \tag{A.11} \\
&= \frac{1}{n} \left\| q_1^T p - r d - q_1^T K q_2 \gamma \right\|^2 \tag{A.12} \\
&\quad + \frac{1}{n} \left\| q_2^T p - q_2^T K q_2 \gamma \right\|^2 \tag{A.13} \\
&\quad + \lambda \gamma^T q_2^T K q_2 \gamma. \tag{A.14}
\end{align}

It can then be shown that

\begin{align}
r d &= q_1^T p - q_1^T K q_2 \gamma = q_1^T (p - K q_2 \gamma), \tag{A.15} \\
q_2^T p &= q_2^T K q_2 \gamma + n \lambda I \gamma = (q_2^T K q_2 + n \lambda I) \gamma, \tag{A.16}
\end{align}

leading to expressions for the minimisers $\hat{d}$ and $\hat{\gamma}$:

\begin{align}
\hat{\gamma} &= \left( q_2^T K q_2 + n \lambda I \right)^{-1} q_2^T p, \tag{A.17} \\
\hat{d} &= r^{-1} \left( q_1^T p - q_1^T K q_2 \hat{\gamma} \right). \tag{A.18}
\end{align}
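The QR-based solution can be sketched numerically as follows. This is an illustrative NumPy sketch under stated assumptions, not the implementation used in the thesis: the function names `tps_kernel`, `fit_tps` and `warp` are hypothetical, the kernel uses the scaling $\frac{1}{16\pi}|\tau|^2 \ln|\tau|$ with E(0) = 0, and the linear system solved for $\gamma$ is $(q_2^T K q_2 + n \lambda I)\gamma = q_2^T p$, which is the stationarity condition of the energy when the data term carries the factor 1/n.

```python
import numpy as np

def tps_kernel(a, b):
    """Kernel matrix with entries E(|a_i - b_j|), E(tau) = tau^2 ln(tau) / (16 pi)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)   # squared distances
    with np.errstate(divide="ignore", invalid="ignore"):
        k = d2 * 0.5 * np.log(d2) / (16.0 * np.pi)             # ln(tau) = 0.5 ln(tau^2)
    return np.where(d2 > 0.0, k, 0.0)                          # define E(0) = 0

def fit_tps(q_pts, p_pts, lam):
    """Fit a TPS mapping the points q_pts (n x 2) onto p_pts (n x 2)."""
    n = len(q_pts)
    q = np.hstack([np.ones((n, 1)), q_pts])                    # homogeneous, n x 3
    p = np.hstack([np.ones((n, 1)), p_pts])
    K = tps_kernel(q_pts, q_pts)
    Q, R = np.linalg.qr(q, mode="complete")                    # q = [q1 : q2] [r; 0]
    q1, q2, r = Q[:, :3], Q[:, 3:], R[:3, :]
    A = q2.T @ K @ q2
    gamma = np.linalg.solve(A + n * lam * np.eye(n - 3), q2.T @ p)
    c = q2 @ gamma                                             # satisfies q^T c = 0
    d = np.linalg.solve(r, q1.T @ (p - K @ c))                 # affine part
    return c, d

def warp(t_pts, q_pts, c, d):
    """Evaluate the fitted spline at t_pts; returns homogeneous coordinates."""
    t = np.hstack([np.ones((len(t_pts), 1)), t_pts])
    return t @ d + tps_kernel(t_pts, q_pts) @ c

# Usage: 20 synthetic landmarks in the unit square with a smooth displacement.
rng = np.random.default_rng(0)
q_pts = rng.random((20, 2))
p_pts = q_pts + 0.05 * np.sin(3.0 * q_pts[:, ::-1])
c, d = fit_tps(q_pts, p_pts, lam=0.0)
max_err = np.abs(warp(q_pts, q_pts, c, d)[:, 1:] - p_pts).max()
print(max_err)   # interpolation error at the landmarks for lam = 0
```

With lam = 0 the spline interpolates the landmarks, so the printed error should be at the level of rounding errors; larger values of lam trade landmark fit for a smoother warp.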
Appendix B
Algorithms
B.1
Binarization of fuzzy match matrix
This section provides the structure of three algorithms: Binarization(m, threshold), Select-best-match-in-row(m, rm, j, c), and Select-best-match-in-column(m, cm, k, r). The dual functions Alg. 9 and 10 are called by Binarization(m, threshold). This algorithm provides a heuristic method for the binarization of the fuzzy augmented match matrix m given a threshold value. The output of the algorithm is a binary augmented match matrix, $\hat{m}$. Definitions of fuzzy and binary match matrices are given in §4.1.1. A simple threshold of the input matrix m, so that entries above or equal to the threshold value are set to 1 and entries below are set to 0, could, for low threshold values (0.5 or below), result in ambiguities, i.e., in one-to-many or many-to-one correspondences.
It is ensured by heuristics in the algorithm that the output matrix conforms to the requirements in §4.1.1, so that ambiguities are avoided. The result of simple thresholding of m is processed row-wise and column-wise separately using Alg. 9 and 10, respectively, and the two sub-results, rm and cm, are combined using elementwise multiplication. The two sub-functions, Alg. 9 and 10, are dual and work on rows and columns separately. In cases of more than one element in a row (or column) above or equal to the threshold value, these functions are used to select the best match in the row (column). The best match is the largest element in the original input matrix m. Other candidates above the threshold are set to zero. In cases of doubt (equal fuzzy match entries), the entire row (column) is set to zero and the point is marked as a single.

Algorithm 8 Binarization(m, threshold)
  rm ← (m ≥ threshold)                          ▷ threshold each matrix element
  for j = 1 to J do                             ▷ for all rows, resolve draws if any
    c ← candidate columns                       ▷ columns that are candidates for the match
    if card c > 1 then
      Select-best-match-in-row(m, rm, j, c)     ▷ see Alg. 9
    else if card c == 0 then                    ▷ none over the threshold
      rm(j, ·) ← 0                              ▷ zero j-th row
      rm(j, K+1) ← 1                            ▷ mark as single
    end if
  end for
  cm ← (m ≥ threshold)                          ▷ threshold each matrix element
  for k = 1 to K do                             ▷ for all columns, resolve draws if any
    r ← candidate rows                          ▷ rows that are candidates for the match
    if card r > 1 then
      Select-best-match-in-column(m, cm, k, r)  ▷ see Alg. 10
    else if card r == 0 then                    ▷ none over the threshold
      cm(·, k) ← 0                              ▷ zero k-th column
      cm(J+1, k) ← 1                            ▷ mark as single
    end if
  end for
  $\hat{m}$ ← rm ⊙ cm                           ▷ elementwise multiplication
  for all j ≤ J: $\hat{m}$(j, K+1) ← 1 − Σ_{k=1..K} $\hat{m}$(j, k)   ▷ force top J rows to sum to 1
  for all k ≤ K: $\hat{m}$(J+1, k) ← 1 − Σ_{j=1..J} $\hat{m}$(j, k)   ▷ force K leftmost columns to sum to 1
Algorithm 9 Select-best-match-in-row(m, rm, j, c)
  max ← max m(j, ·)                             ▷ max value in row j
  maxc ← elements in row j that equal max       ▷ candidates
  card maxc ← Σ_{k=1..K} (m(j, k) == max)       ▷ number of elements in row j that equal max
  if card maxc > 1 then                         ▷ more than one maximum
    rm(j, ·) ← 0                                ▷ zero all elements in the j-th row
    rm(j, K+1) ← 1                              ▷ mark this row as single
  else                                          ▷ exactly one maximum in the row
    rm(j, ·) ← 0                                ▷ zero all elements in the j-th row
    rm(j, maxc) ← 1                             ▷ except for the maximum in column maxc
  end if
Algorithm 10 Select-best-match-in-column(m, cm, k, r)
  max ← max m(·, k)                             ▷ max value in column k
  maxr ← elements in column k that equal max    ▷ candidates
  card maxr ← Σ_{j=1..J} (m(j, k) == max)       ▷ number of elements in column k that equal max
  if card maxr > 1 then                         ▷ more than one maximum
    cm(·, k) ← 0                                ▷ zero all elements in the k-th column
    cm(J+1, k) ← 1                              ▷ mark this column as single
  else                                          ▷ exactly one maximum in the column
    cm(·, k) ← 0                                ▷ zero all elements in the k-th column
    cm(maxr, k) ← 1                             ▷ except for the maximum in row maxr
  end if
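The three algorithms above can be condensed into a short NumPy sketch. It is an illustration under assumptions rather than the thesis implementation: the names `binarize` and `_unique_best` are hypothetical, the fuzzy matrix is assumed to be stored as a (J+1) × (K+1) array whose last row and last column are the outlier (single) slots, and only the inner J × K block is thresholded, with the outlier slots filled in at the end.

```python
import numpy as np

def _unique_best(values, threshold):
    """Index of the single largest entry >= threshold.

    Returns None if there is no candidate or if the maximum is a draw.
    """
    cand = np.flatnonzero(values >= threshold)
    if cand.size == 0:
        return None
    best = values[cand].max()
    winners = cand[values[cand] == best]
    return int(winners[0]) if winners.size == 1 else None

def binarize(m, threshold):
    """Binarize a fuzzy augmented match matrix, resolving draws conservatively."""
    m = np.asarray(m, dtype=float)
    J, K = m.shape[0] - 1, m.shape[1] - 1
    inner = m[:J, :K]
    rm = np.zeros((J, K), dtype=int)       # row-wise result
    cm = np.zeros((J, K), dtype=int)       # column-wise result
    for j in range(J):
        k = _unique_best(inner[j, :], threshold)
        if k is not None:
            rm[j, k] = 1
    for k in range(K):
        j = _unique_best(inner[:, k], threshold)
        if j is not None:
            cm[j, k] = 1
    out = np.zeros(m.shape, dtype=int)
    out[:J, :K] = rm * cm                            # keep matches both passes agree on
    out[:J, K] = 1 - out[:J, :K].sum(axis=1)         # unmatched rows become singles
    out[J, :K] = 1 - out[:J, :K].sum(axis=0)         # unmatched columns become singles
    return out

fuzzy = np.array([[0.9, 0.6, 0.0, 0.1],
                  [0.7, 0.7, 0.0, 0.0],
                  [0.0, 0.2, 0.3, 0.5],
                  [0.0, 0.0, 0.7, 0.0]])
print(binarize(fuzzy, 0.5))
```

In the example, the first point keeps its match, the second row contains a draw (0.7 twice) and is marked as a single, and the third row has no entry above the threshold. Because the row-wise result has at most one 1 per row and the column-wise result at most one 1 per column, their elementwise product cannot contain one-to-many or many-to-one correspondences.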
Appendix C
Data material
Centre for Proteome Analysis in Life Sciences (CPA) has kindly provided the data material: a data set of 15 16-bit grey level images with various attribute information available for every protein spot in every image. This information was extracted by a skilled technician from a manually guided segmentation and matching process using commercially available state-of-the-art software.
C.1
Gel Images
The radioactively marked proteins were separated in the first dimension on commercially available immobilised pH gradient gels spanning the pH interval from 4 to 7 using in total 50 kVhrs. Subsequently the proteins were separated in the second dimension by their molecular weight on 12.5% polyacrylamide gels. After electrophoresis the gels were vacuum dried on a paper support and exposed to a phosphor plate for 5 days. The images were captured using a phosphor imager (Agfa) in a 12-bit format. See also §2.1.2.
C.1.1
Data set
The dimensions of the 15 images are approximately 2000 by 1800 pixels. The protein samples used to produce the images are prepared from baker's yeast Saccharomyces cerevisiae strain Fy1679-28C EC [pRS315] and strain Fy1679-28C EC [pRS315::PDR1]. The 15 images of data set 1 are divided into 5 groups (1 to 5) of 3 images each (A, B, C). Images from groups 1, 2, and 3 are comparable, i.e., in a biological context it makes sense to compare the samples. Images from groups 4 and 5 are comparable, but cannot be compared to images from groups 1, 2 or 3 and vice versa. See Fig. C.1.
Figure C.1: Overview of images in Data Set 1.
C.2
Protein Spot Attribute Information
For each spot a range of attribute information is available:

ID
area
x centre coordinate
y centre coordinate
integrated intensity (II)
peak intensity
background intensity
% integrated intensity (%II)
Gaussian integrated intensity

The attribute information has been generated and extracted by BioImage™, a commercial software product for analysis of 2D electrophoresis images.
C.2.1
Match information
The correspondence between homologous protein spots is known across gel images within the same data set. Spots with the same spot ID correspond. This information can be represented in a match matrix m (see §4.1.1).
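A small sketch of how spot IDs could be turned into a binary match matrix is given below. The layout is an assumption made for illustration: the matrix is augmented to (J+1) × (K+1), with the extra column and row marking spots that have no partner in the other gel, and spot IDs are assumed to be unique within a gel. The function name `match_matrix_from_ids` is hypothetical.

```python
import numpy as np

def match_matrix_from_ids(ids_p, ids_q):
    """Binary augmented match matrix from two lists of spot IDs.

    Entry (j, k) is 1 if spot j of the first gel and spot k of the second
    gel share the same ID. The last column / row flags singles.
    """
    J, K = len(ids_p), len(ids_q)
    M = np.zeros((J + 1, K + 1), dtype=int)
    column_of = {spot_id: k for k, spot_id in enumerate(ids_q)}
    for j, spot_id in enumerate(ids_p):
        k = column_of.get(spot_id)
        if k is None:
            M[j, K] = 1                      # no partner: single in the first gel
        else:
            M[j, k] = 1                      # homologous pair
    M[J, :K] = 1 - M[:J, :K].sum(axis=0)     # singles in the second gel
    return M

print(match_matrix_from_ids([10, 11, 12], [11, 10, 13]))
```

Here spots 10 and 11 are present in both gels, while spot 12 only occurs in the first gel and spot 13 only in the second, so both are flagged as singles.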
C.3
Disparity Analysis
Fig. C.2 shows the spots stemming from two different gels plotted together as points. Homologous (corresponding) points are connected with arrows. Long arrows mean that corresponding points are far apart and vice versa. The arrows form a so-called disparity field. The top plot displays the initial or raw disparity field, obtained by simply superimposing the two point sets and drawing lines between corresponding points. The bottom plot shows the residual disparity field after correction for an initial alignment (1st order polynomial transformation) of the two point sets. Fig. C.3 shows the corrected disparity field of another gel pair, and the axes of the plot show histograms of the disparity lengths. It gives an impression of the average disparity in different areas of the gel.
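The correction for a global 1st order polynomial transformation can be sketched as an ordinary least squares fit. The sketch below is an assumption about one reasonable way to do it, not a description of how the plots were produced: the second point set is mapped onto the first, and the function name `residual_disparity` is hypothetical.

```python
import numpy as np

def residual_disparity(p, q):
    """Disparity field left after a least squares 1st order polynomial alignment.

    p, q: (n, 2) arrays of corresponding spot centres (row i of p matches
    row i of q). Returns the raw and the residual disparity vectors.
    """
    design = np.hstack([np.ones((len(q), 1)), q])        # columns: 1, x, y
    coef, *_ = np.linalg.lstsq(design, p, rcond=None)    # 3 x 2 coefficients
    raw = p - q
    residual = p - design @ coef
    return raw, residual

# Usage with synthetic data: a purely 1st order distortion leaves no residual.
rng = np.random.default_rng(0)
q = rng.random((50, 2)) * 2000.0
p = q @ np.array([[1.02, 0.01], [-0.01, 0.98]]) + np.array([5.0, -3.0])
raw, residual = residual_disparity(p, q)
print(np.linalg.norm(raw, axis=1).mean(), np.abs(residual).max())
```

For real gel pairs the residual field is what remains for the local, non-linear part of the matching, which is why its length histograms are informative.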
(a) Initial disparity field.
(b) Disparity field after correction for global 1st order polynomial transformation.
Figure C.2: Group 1A vs. Group 1B.

Figure C.3: Disparity field for gel 1A vs. gel 2A with average disparity histograms.
Appendix D
Grey level based warping
The method proposed by Conradsen and Pedersen [25] does not provide means for regularisation of the disparity maps and therefore produces quite rough mappings. Warping of the gel images according to the rough disparity maps results in transformed gel images which may be aligned well but also are deformed so that the spots do not resemble protein spots any longer. At increasing levels of resolution, disparity maps are calculated in the line (horizontal) direction, h and in the column (vertical) direction, v . Starting at low resolution, the disparity maps are calculated by minimising sums of squared dierences and this disparity information is propagated to the next (higher) resolution level. A simple regularisation could consist of smoothing the disparity maps using Gaussian convolution. Given an image I and an isotropic 2D Gaussian g (x; ), g ( x; ) =
y2 1 x2 +2 e 2 . 2 2
is referred to as the standard deviation of the 2D Gaussian and L is dened as the convolution of I with g (x; ): L(x; ) = I g (x; ).
By Gaussian smoothing of the disparity maps v and h at each level of resolution before the warping step (see [25], Fig. 4), a regularisation is introduced. The standard deviation $\sigma$ of the Gaussian convolution kernel serves as the regularisation parameter: the larger $\sigma$, the more the disparity maps are smoothed and the less rough the warping becomes. In the following, the effect of regularisation is demonstrated by an example of grey level registration of two 2DGE gel regions.
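The smoothing step can be illustrated with a small NumPy sketch. It is an assumption-laden example rather than the implementation used in the experiments: the function names, the kernel truncation at four standard deviations and the edge-replicating border handling are choices made here for illustration only.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=4.0):
    """Sampled, normalised 1D Gaussian; the truncation radius is an assumption."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2.0 * sigma**2))
    return kernel / kernel.sum()

def smooth_disparity(disparity, sigma):
    """Convolve a 2D disparity map with an isotropic Gaussian of std sigma.

    The 2D Gaussian is separable, so rows and columns are filtered in turn.
    Borders are handled by edge replication (an illustrative choice).
    """
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    padded = np.pad(np.asarray(disparity, dtype=float), radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


# Hypothetical rough disparity map standing in for h or v at one resolution level.
rng = np.random.default_rng(1)
rough_h = rng.normal(scale=5.0, size=(64, 64))
smooth_h = smooth_disparity(rough_h, sigma=3.0)
print(rough_h.std(), smooth_h.std())  # the smoothed map varies much less
```

In a multi-resolution scheme the same function would be applied to both maps at every level before warping, with $\sigma$ as the single regularisation parameter.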
D.1 Experiments
Fig. D.1 shows corresponding 512 × 512 sub-regions of two gel images. These regions are the subject of the following examples, which show the results of the original method [25] without regularisation of the disparity maps as well as the results of regularisation using Gaussian smoothing of the disparity maps.
Figure D.1: Original 512 × 512 region of two gel images. Left: Reference image (A). Centre: Match image (B). Right: Pixel-wise difference A-B.
Following [25], the reference image (A) is warped according to the horizontal disparity map h and the match image (B) is warped according to the vertical disparity map v.
D.1.1 No regularisation
Fig. D.2 shows the disparity maps and a 512 × 512 pixel regular grid warped according to the disparity maps.
Figure D.2: No regularisation of disparity maps. Disparity maps and warped grid. Left: Horizontal disparity map h. Centre: Vertical disparity map v. Right: Regular grid warped according to disparity maps.
Fig. D.3 shows the result of the warping. Note by comparison with the original images in Fig. D.1 how the rough disparity maps have deformed the protein spots.
Figure D.3: No regularisation of disparity maps. Warped versions of original 512 × 512 images. Left: Reference image warped according to h (Aw). Centre: Match image warped according to v (Bw). Right: Pixel-wise difference Aw-Bw.
D.1.2 Gaussian smoothing
Fig. D.4 shows the disparity maps and a 512 × 512 pixel regular grid warped according to the smoothed disparity maps. In this experiment, the standard deviation of the 2D Gaussian was chosen as $\sigma = 3$.
Figure D.4: Gaussian smoothing of disparity maps. Disparity maps and warped grid. Left: Horizontal disparity map h. Centre: Vertical disparity map v. Right: Regular grid warped according to disparity maps.
Fig. D.5 shows the result of the warping. Note by comparison with the original images in Fig. D.1 how the smooth disparity maps have preserved the shapes of the spots better than without regularisation.
Figure D.5: Gaussian smoothing of disparity maps, $\sigma = 3$. Warped versions of original 512 × 512 images. Left: Reference image warped according to h (Aw). Centre: Match image warped according to v (Bw). Right: Pixel-wise difference Aw-Bw.
Fig. D.6 shows the difference images from Figs. D.1, D.3 and D.5 displayed in a common grey level range. Note how the regularisation of the disparity maps also results in the smallest residuals after the warp. Fig. D.7 displays the image pairs before warp and after both types of warp using a pseudo-colour display [75] in which the reference gel is displayed in green and the match gel in magenta. Good alignment will appear as white, grey or black because of the overlay of magenta and green. Wherever there is differential expression or the images are not well aligned, the image will appear in green or magenta tones.
Figure D.6: Difference images from Figs. D.1, D.3 and D.5 displayed in a common grey level range. Left: Difference image before warp. Centre: Difference image after warp without regularisation of the disparity maps. Right: Difference image after warp with regularisation of the disparity maps.
Figure D.7: Pseudo-colour display of image pairs. Left: Original images A (green) and B (magenta). Centre: Aw (green) and Bw (magenta) after warp without regularisation of the disparity maps. Right: Aw (green) and Bw (magenta) after warp with regularisation of the disparity maps.
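One simple way to build such a green/magenta overlay is sketched below. The channel assignment (reference to green, match to red and blue) is an assumption consistent with the description above, not necessarily the exact construction used in [75]; the function name and the toy images are hypothetical.

```python
import numpy as np

def green_magenta_overlay(reference, match):
    """Combine two grey level images (values in [0, 1]) into an RGB overlay.

    The reference image drives the green channel and the match image drives
    red and blue, which together appear as magenta.
    """
    reference = np.clip(np.asarray(reference, dtype=float), 0.0, 1.0)
    match = np.clip(np.asarray(match, dtype=float), 0.0, 1.0)
    return np.stack([match, reference, match], axis=-1)


# Toy 2 x 2 images: they agree in the top row and differ in the bottom row.
a = np.array([[0.0, 0.5], [1.0, 0.2]])
b = np.array([[0.0, 0.5], [0.0, 0.9]])
rgb = green_magenta_overlay(a, b)
print(rgb[0, 1])  # equal intensities give equal channels, i.e. a neutral grey
print(rgb[1, 0])  # intensity in the reference only shows up as green
```

Pixels where both images have the same grey level receive identical red, green and blue values, which is why well-aligned regions look white, grey or black.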
D.2 Application in Point Matching
In matching of point sets (see Chapter 4), the grey level matching technique described here may serve as a preprocessing step. By warping the point sets according to the Gaussian smoothed disparity maps, corresponding points are positioned closer to each other and the correspondence is expected to be easier to resolve.
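A possible form of this preprocessing step is sketched below: each point is shifted by the disparity sampled at its position. The conventions are assumptions made for the example, namely that points are given as (x, y) = (column, row), that the maps store displacements in pixels to be added to the coordinates, and that bilinear interpolation is used between pixels. The function names are hypothetical. Passing an all-zero map for one direction moves a point set in the other direction only, as is done for the images A and B above.

```python
import numpy as np

def bilinear_sample(field, x, y):
    """Sample a 2D array at fractional (x = column, y = row) positions."""
    h, w = field.shape
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0 = np.minimum(np.floor(x).astype(int), w - 2)
    y0 = np.minimum(np.floor(y).astype(int), h - 2)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * field[y0, x0] + fx * field[y0, x0 + 1]
    bottom = (1 - fx) * field[y0 + 1, x0] + fx * field[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom

def warp_points(points, disp_h, disp_v):
    """Shift (x, y) points by the disparity sampled at their positions."""
    points = np.asarray(points, dtype=float)
    x, y = points[:, 0], points[:, 1]
    dx = bilinear_sample(disp_h, x, y)
    dy = bilinear_sample(disp_v, x, y)
    return np.column_stack([x + dx, y + dy])


# Hypothetical smoothed maps: constant horizontal shift, vertical shift growing with the row.
disp_h = np.full((8, 8), 2.0)
disp_v = 0.5 * np.arange(8.0)[:, None] * np.ones((1, 8))
pts = np.array([[1.0, 2.0], [3.5, 4.5]])
warped = warp_points(pts, disp_h, disp_v)
print(warped)  # [[3.0, 3.0], [5.5, 6.75]]
```

The warped coordinates would then be handed to the point matching method in place of the original spot centres.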
Experiments with this approach using the regionalised RPM-affine method (Sections 4.2.4 and 4.3.4) did not, however, result in significantly better matching results. This may be explained by the fact that, although the corresponding points are brought closer together prior to the point matching, the local deformation is no longer affine. Similar experiments have not been conducted with the regionalised RPM-TPS method, but it is expected that the preprocessing step will be beneficial for this method.
Bibliography
[1] A Abbott. A post-genomic challenge: learning to read patterns of protein synthesis. Nature, 402:715-720, 1999.
[2] T Akutsu, K Kanaya, A Ohyama, and A Fujiyama. Matching of spots in 2D electrophoresis images. Point matching under non-uniform distortions. In Combinatorial Pattern Matching. 10th Annual Symposium, CPM99, pages 212-222, 1999.
[3] R D Appel, J R Vargas, P M Palagi, D Walther, and D F Hochstrasser. Melanie II - a third-generation software package for analysis of two-dimensional electrophoresis images: II. Algorithms. Electrophoresis, 18:2735-2748, 1997.
[4] F Aurenhammer. Voronoi diagram - a survey of a fundamental geometric data structure. ACM Computing Surveys, 1991.
[5] S Belongie and J Malik. Matching with shape contexts. In IEEE Workshop on Content-Based Access of Image and Video Libraries, 2000.
[6] S Belongie, J Malik, and J Puzicha. Matching shapes. In Eighth IEEE International Conference on Computer Vision, 2001.
[7] S Belongie, J Malik, and J Puzicha. Shape matching and object recognition using shape contexts. Technical report, U.C. Berkeley, 2001.
[8] J E Besag. On the statistical analysis of dirty pictures. J. Royal Statistical Society, 48(B):259-302, 1986.
[9] P J Besl and H D McKay. A method for registration of 3-d shapes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 14(2):239-256, 1992. ISSN 01628828.
[10] E Bettens, P Scheunders, J Sijbers, D Van Dyck, and L Moens. Automatic segmentation and modelling of two-dimensional electrophoresis gels. In International Conference on Image Processing, volume 2, pages 665-668, 1996.
[11] S Beucher. Extrema of grey-tone functions and mathematical morphology. In Proc. of the Colloquium on Math. Morp., Stereol. and Image Analysis, Prague, Tchecoslovaquia, pages 59-70, 1982.
[12] S Beucher and C Lantuejoul. Use of watersheds in contour detection. In International Workshop on image processing, real-time edge and motion detection/estimation, Rennes, France, 1979.
[13] A Blomberg, L Blomberg, J Norbeck, S J Fey, P M Larsen, M Larsen, P Roepstorff, H Degand, M Boutry, A Posch, and A Görg. Interlaboratory reproducibility of yeast protein patterns analyzed by immobilized pH gradient two-dimensional gel electrophoresis. Electrophoresis, 16(10):1935-1945, 1995.
[14] D Blostein and N Ahuja. A multiscale region detector. Computer Vision, Graphics, and Image Processing, 45(1):22-41, 1989. ISSN 0734189x.
[15] F L Bookstein. Principal Warps: Thin-Plate Splines and the Decomposition of Deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6):567-585, 1989.
[16] L G Brown. A Survey of Image Registration Techniques. ACM Computing Surveys, 24(4), 1992.
[17] M Carcassoni and E R Hancock. Point pattern matching with robust spectral correspondence. Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662), pages 649-55 vol.1, 2000.
[18] J M Carstensen. Description and simulation of visual texture. PhD thesis, Institute of Mathematical Statistics and Operational Research, Technical University of Denmark, 1992.
[19] J M Carstensen. An active lattice model in a bayesian framework. Computer Vision and Image Understanding, 63(2):380-387, 1996.
[20] S-H Chang, F-H Cheng, W-H Hsu, and G-Z Wu. Fast algorithm for point pattern matching: invariant to translations, rotations and scale changes. Pattern Recognition, 30(2):311-320, 1997.
[21] F-H Cheng. Point pattern matching algorithm invariant to geometrical transformation and distortion. Pattern Recognition Letters, 17:1429-1435, 1996.
[22] H Chui. Non-Rigid Point Matching: Algorithms, Extensions and Applications. PhD thesis, Yale University, May 2001.
[23] H Chui and A Rangarajan. A new algorithm for non-rigid point matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 44-51, 2000.
[24] H Cohn and M Fielding. Simulated Annealing: Searching for an Optimal Temperature Schedule. SIAM Journal on Optimization, 9(3):779-802, 1999.
[25] K Conradsen and J Pedersen. Analysis of two-dimensional electrophoresis gels. Biometrics, 42:1273-1287, 1992.
[26] T F Cootes, C J Taylor, D H Cooper, and J Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38-59, 1995. ISSN 10773142.
[27] D J Cross and E R Hancock. Graph matching with a dual-step EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1236-1253, 1998.
[28] I L Dryden and K V Mardia. Statistical Shape Analysis. Wiley, 1998.
[29] L Duchon. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, pages 85-100. Springer Verlag, Berlin, 1977.
[30] S J Fey. Personal communication, 2002.
[31] S J Fey and P Mose-Larsen. Image analysis method for e.g. related or similar gel electrophoresis images - using master composite image, generated by averaging selected images, to analyse and interpret existing or new images. Patent numbers WO9811508-A1, AU9741325-A, EP925554-A1, JP2001500614-W, KR2000036155-A, 1998.
[32] S J Fey and P Mose Larsen. 2D or not 2D. Two-dimensional gel electrophoresis. Current Opinion in Chemical Biology, 5(1):26-33, 2001.
[33] S J Fey, A Nawrocki, M R Larsen, A Görg, P Roepstorff, G N Skews, R Williams, and P Mose Larsen. Proteome analysis of Saccharomyces cerevisiae: a methodological outline. Electrophoresis, 18(8):1361-72, 1997.
[34] C A Glasbey and K V Mardia. A review of image warping methods. Journal of Applied Statistics, 25:155-171, 1998.
[35] S Gold and A Rangarajan. A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):377-388, 1996.
[36] S Gold, A Rangarajan, C-P Lu, S Pappu, and E Mjolsness. New algorithms for 2D and 3D point matching: pose estimation and correspondence. Pattern Recognition, 31(8):1019-1031, 1998.
[37] P J Green and B W Silverman. Nonparametric Regression and Generalized Linear Models - a roughness penalty approach. Chapman & Hall, 1994.
[38] E Guest, E Berry, R A Baldock, M Fidrich, and M A Smith. Robust point correspondence applied to two- and three-dimensional image registration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):165-179, 2001.
[39] R Hartley and A Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
[40] F Hoffmann, K Kriegel, and C Wenk. Matching 2D patterns of protein spots. In Proceedings of the fourteenth annual Symposium on Computational Geometry, pages 231-239, 1998.
[41] F Hoffmann, K Kriegel, and C Wenk. An applied point pattern matching problem: comparing 2D patterns of protein spots. Discrete Applied Mathematics, 93:75-88, 1999.
[42] G W Horgan, A Creasey, and B Fenton. Superimposing two-dimensional gels to study genetic variation in malaria parasites. Electrophoresis, 13:871-875, 1992.
[43] G W Horgan and C A Glasbey. Uses of digital image analysis in electrophoresis. Electrophoresis, 16:298-305, 1995.
[44] A K Jain, R P W Duin, and Jianchang Mao. Statistical pattern recognition: a review. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(1):4-37, 2000.
[45] H J Johnson and G E Christensen. Deformable Registration - Landmark and intensity-based, consistent thin-plate spline image registration. Lecture Notes in Computer Science, 2082:329-343, 2001.
[46] M Kass, A Witkin, and D Terzopoulos. Snakes: active contour models. International Journal of Computer Vision, 1(4):321-31, 1987. ISSN 09205691.
[47] J J Kosowsky and A L Yuille. The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7(3):477-490, 1994.
[48] C Kotropoulos, A Tefas, and I Pitas. Morphological elastic graph matching applied to frontal face authentication under well-controlled and real conditions. Pattern Recognition, 33(12):1935-1947, 2000. ISSN 00313203.
[49] S Kumar, M Sallam, and D Goldgof. Matching point features under small nonrigid motion. Pattern Recognition, 34(12):2353-2365, 2001. ISSN 00313203.
[50] S Kumar, M Sallam, Dmitry Goldgof, and Kevin Bowyer. Point Correspondence in Unstructured Nonrigid Motion. In Proceedings International Symposium on Computer Vision, pages 289-94, 1995.
[51] M Lades, J C Vorbruggen, J Buhmann, J Lange, C von der Malsburg, R P Wurtz, and W Konen. Distortion invariant object recognition in the dynamic link architecture. Computers, IEEE Transactions on, 42(3):300-311, 1993. ISSN 00189340.
[52] Hyung-Ji Lee, Wan-Su Lee, and Jae-Ho Chung. Face recognition using fisherface algorithm and elastic graph matching. Image Processing, 2001. Proceedings. 2001 International Conference on, 1:998-1001, 2001.
[53] P F Lemkin. Comparing two-dimensional electrophoretic gel images across the internet. Electrophoresis, 18(3-4):461-470, 1997. ISSN 01730835.
[54] T Lindeberg. Discrete derivative approximations with scale-space properties: A basis for low-level feature extraction. Journal of Mathematical Imaging and Vision, JMIV, 3:349-376, 1993.
[55] T Lindeberg. Scale-space theory in computer vision. Kluwer Academic Publishers, Netherlands, 1994.
[56] T Lindeberg. Scale-space: A framework for handling image structures at multiple scales. In CERN School of Computing, 1996.
[57] T Lindeberg. Feature Detection with Automatic Scale Selection. International Journal of Computer Vision, 1998.
[58] J B A Maintz and M A Viergever. A survey of medical image registration. Medical Image Analysis, 2(1):1-36, 1998.
[59] F Menard and F Fogelman Soulie. Application of the topological maps algorithm to the recognition of bidimensional electrophoresis images. INNC 90 Paris. International Neural Network Conference, 1:99-102, 1990.
[60] J M Miller, A D Olson, and S S Thorgeirsson. Computer analysis of two-dimensional gels: Automatic matching. Electrophoresis, 1984.
[61] F Murtagh. A new approach to point-pattern matching. Publications of the Astronomical Society of the Pacific, 104(674):301-7, 1992. ISSN 00046280.
[62] H Ogawa. Labeled Point Pattern Matching by Delaunay Triangulation and Maximal Cliques. Pattern Recognition, 1986.
[63] O F Olsen and M Nielsen. Multi-scale gradient magnitude watershed segmentation. pages 6-13 vol.1. Springer-Verlag, 1997. ISBN 3540635076.
[64] J Pánek and J Vohradský. Point pattern matching in the analysis of two-dimensional gel electropherograms. Electrophoresis, 20:3483-3491, 1999.
[65] L Pedersen and B Ersbøll. Protein spot correspondence in two-dimensional electrophoresis gels. In Scandinavian Conference on Image Analysis, pages 118-125, 2001.
[66] K-P Pleissner, F Hoffmann, K Kriegel, C Wenk, S Wegner, A Sahlström, H Oswald, H Alt, and E Fleck. Proteome data analysis and management - new algorithmic approaches to protein spot detection and pattern matching in two-dimensional electrophoresis gel databases. Electrophoresis, 20(4-5):755-765, 1999. ISSN 01730835.
[67] A Rangarajan, H Chui, and F L Bookstein. The softassign procrustes matching algorithm. Lecture Notes in Computer Science, pages 29-36, 1997.
[68] A Rangarajan, H Chui, and E Mjolsness. A new distance measure for nonrigid image matching. Energy Minimization Methods in Computer Vision and Pattern Recognition. Second International Workshop, EMMCVPR99. Proceedings (Lecture Notes in Computer Science Vol.1654), pages 237-52, 1999.
[69] A Rangarajan, H Chui, E Mjolsness, S Pappu, L Davachi, P Goldman-Rakic, and J Duncan. A robust point-matching algorithm for autoradiograph alignment. Medical Image Analysis, 1(4):379-398, 1997. ISSN 13618415.
[70] A Rangarajan and J Duncan. Matching point features using mutual information. In Workshop on Biomedical Image Analysis, pages 172-181, 1998.
[71] K Rohr, H S Stiehl, R Sprengel, and W Beil. Point-based elastic registration of medical image data using approximating thin-plate splines. Lecture Notes in Computer Vision, 1131:297-306, 1996.
[72] S Sclaroff and A P Pentland. Modal matching for correspondence and recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(6):545-561, 1995. ISSN 01628828.
[73] J Serra and P Soille, editors. Mathematical morphology and its applications to image processing. Kluwer Academic Publishers, 1994.
[74] M Skolnick. Application of morphological transformations to the analysis of two-dimensional electrophoretic gels of biological materials. Computer Vision, Graphics, and Image Processing, 35:306-332, 1986.
[75] Z Smilansky. Automatic registration for images of two-dimensional protein gels. Electrophoresis, pages 1616-1626, 2001.
[76] M Sonka, V Hlavac, and R Boyle. Image processing, analysis, and machine vision. Brooks/Cole Publ., 1999.
[77] J Sporring, M Nielsen, L Florack, and P Johansen, editors. Gaussian scale-space theory. Kluwer Academic Publishers, 1997.
[78] R Sprengel, K Rohr, and H S Stiehl. Thin-plate spline approximation for image registration. In 18th Internat. Conf. of the IEEE Engineering in Medicine and Biology Society, pages 1190-1191, 1996.
[79] A Tefas, C Kotropoulos, and I Pitas. Using support vector machines to enhance the performance of elastic graph matching for frontal face authentication. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(7):735-746, 2001. ISSN 01628828.
[80] B Triggs, A Zisserman, and R Szeliski, editors. Vision Algorithms: Theory and Practice, International Workshop on Vision Algorithms, held during ICCV 99, Corfu, Greece, September 21-22, 1999, Proceedings, volume 1883 of Lecture Notes in Computer Science, 2000. Springer. ISBN 3-540-67973-1.
[81] L Vincent. Morphological grayscale reconstruction in image analysis: applications and efficient algorithms. Image Processing, IEEE Transactions on, 2(2):176-201, 1993. ISSN 10577149.
[82] L Vincent and P Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 13(6):583-598, 1991. ISSN 01628828.
[83] T Voss and P Haberl. Observations on the reproducibility and matching efficiency of two-dimensional electrophoresis gels: Consequences for comprehensive data analysis. Electrophoresis, 2000.
[84] G Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.
[85] V C Wasinger, S J Cordwell, A Cerpa-Poljak, J X Yan, A A Gooley, M R Wilkins, M W Duncan, R Harris, K L Williams, and I Humphery-Smith. Progress with gene-product mapping of the mollicutes: Mycoplasma genitalium. Electrophoresis, 16(7):1090-4, Jul 1995.
[86] Y Watanabe and K Takahashi. A fast structural matching and its application to pattern analysis of 2-D electrophoresis images. In Proceedings 1998 International Conference on Image Processing, pages 804-8. IEEE Comput. Soc, 1998.
[87] Y Watanabe and K Takahashi. An analysis system for two-dimensional gel electrophoresis images of genomic DNA. In Proceedings. Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170), volume 1, pages 368-371. IEEE Comput. Soc, 1998.
[88] Y Watanabe, K Takahashi, and M Nakazawa. Automated detection and matching of spots in autoradiogram images of two-dimensional electrophoresis for high-speed genome scanning. In Proceedings. International Conference on Image Processing, volume 3, pages 496-499. IEEE Comput. Soc, 1997.
[89] M R Wilkins. From Proteins to Proteomes: Large-Scale Protein Identification by Two-Dimensional Electrophoresis and Amino Acid Analysis. Bio/Technology, 14:62-65, 1996.
[90] L Wiskott, J-M Fellous, N Krüger, and C von der Malsburg. Face recognition by elastic bunch graph matching. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 19(7):775-779, 1997. ISSN 01628828.
[91] Z Zhang. Iterative point matching for registration of free-form curves. Technical Report 1658, Institut National de Recherche en Informatique et en Automatique, INRIA, 1992.
Ph.D. theses from IMM

1. Larsen, Rasmus. (1994). Estimation of visual motion in image sequences. xiv + 143 pp.
2. Rygard, Jens Moberg. (1994). Design and optimization of flexible manufacturing systems. xiii + 232 pp.
3. Lassen, Niels Christian Krieger. (1994). Automated determination of crystal orientations from electron backscattering patterns. xv + 136 pp.
4. Melgaard, Henrik. (1994). Identification of physical models. xvii + 246 pp.
5. Wang, Chunyan. (1994). Stochastic differential equations and a biological system. xxii + 153 pp.
6. Nielsen, Allan Aasbjerg. (1994). Analysis of regularly and irregularly sampled spatial, multivariate, and multi-temporal data. xxiv + 213 pp.
7. Ersbøll, Annette Kjær. (1994). On the spatial and temporal correlations in experimentation with agricultural applications. xviii + 345 pp.
8. Møller, Dorte. (1994). Methods for analysis and design of heterogeneous telecommunication networks. Volume 1-2, xxxviii + 282 pp., 283-569 pp.
9. Jensen, Jens Christian. (1995). Teoretiske og eksperimentelle dynamiske undersøgelser af jernbanekøretøjer. viii + 174 pp.
10. Kuhlmann, Lionel. (1995). On automatic visual inspection of reflective surfaces. Volume 1, xviii + 220 pp., (Volume 2, vi + 54 pp., fortrolig).
11. Lazarides, Nikolaos. (1995). Nonlinearity in superconductivity and Josephson Junctions. iv + 154 pp.
12. Rostgaard, Morten. (1995). Modelling, estimation and control of fast sampled dynamical systems. xiv + 348 pp.
13. Schultz, Nette. (1995). Segmentation and classification of biological objects. xiv + 194 pp.
14. Jørgensen, Michael Finn. (1995). Nonlinear Hamiltonian systems. xiv + 120 pp.
15. Balle, Susanne M. (1995). Distributed-memory matrix computations. iii + 101 pp.
16. Kohl, Niklas. (1995). Exact methods for time constrained routing and related scheduling problems. xviii + 234 pp.
17. Rogon, Thomas. (1995). Porous media: Analysis, reconstruction and percolation. xiv + 165 pp.
18. Andersen, Allan Theodor. (1995). Modelling of packet traffic with matrix analytic methods. xvi + 242 pp.
19. Hesthaven, Jan. (1995). Numerical studies of unsteady coherent structures and transport in two-dimensional flows. Risø-R-835(EN) 203 pp.
20. Slivsgaard, Eva Charlotte. (1995). On the interaction between wheels and rails in railway dynamics. viii + 196 pp.
21. Hartelius, Karsten. (1996). Analysis of irregularly distributed points. xvi + 260 pp.
22. Hansen, Anca Daniela. (1996). Predictive control and identification - Applications to steering dynamics. xviii + 307 pp.
23. Sadegh, Payman. (1996). Experiment design and optimization in complex systems. xiv + 162 pp.
24. Skands, Ulrik. (1996). Quantitative methods for the analysis of electron microscope images. xvi + 198 pp.
25. Bro-Nielsen, Morten. (1996). Medical image registration and surgery simulation. xxvii + 274 pp.
26. Bendtsen, Claus. (1996). Parallel numerical algorithms for the solution of systems of ordinary differential equations. viii + 79 pp.
27. Lauritsen, Morten Bach. (1997). Delta-domain predictive control and identification for control. xxii + 292 pp.
28. Bischoff, Svend. (1997). Modelling colliding-pulse mode-locked semiconductor lasers. xxii + 217 pp.
29. Arnbjerg-Nielsen, Karsten. (1997). Statistical analysis of urban hydrology with special emphasis on rainfall modelling. Institut for Miljøteknik, DTU. xiv + 161 pp.
30. Jacobsen, Judith L. (1997). Dynamic modelling of processes in rivers affected by precipitation runoff. xix + 213 pp.
31. Sommer, Helle Mølgaard. (1997). Variability in microbiological degradation experiments - Analysis and case study. xiv + 211 pp.
32. Ma, Xin. (1997). Adaptive extremum control and wind turbine control. xix + 293 pp.
33. Rasmussen, Kim Ørskov. (1997). Nonlinear and stochastic dynamics of coherent structures. x + 215 pp.
34. Hansen, Lars Henrik. (1997). Stochastic modelling of central heating systems. xxii + 301 pp.
35. Jørgensen, Claus. (1997). Driftsoptimering på kraftvarmesystemer. 290 pp.
36. Stauning, Ole. (1997). Automatic validation of numerical solutions. viii + 116 pp.
37. Pedersen, Morten With. (1997). Optimization of recurrent neural networks for time series modeling. x + 322 pp.
38. Thorsen, Rune. (1997). Restoration of hand function in tetraplegics using myoelectrically controlled functional electrical stimulation of the controlling muscle. x + 154 pp. + Appendix.
39. Rosholm, Anders. (1997). Statistical methods for segmentation and classification of images. xvi + 183 pp.
40. Petersen, Kim Tilgaard. (1997). Estimation of speech quality in telecommunication systems. x + 259 pp.
41. Jensen, Carsten Nordstrøm. (1997). Nonlinear systems with discrete and continuous elements. 195 pp.
42. Hansen, Peter S.K. (1997). Signal subspace methods for speech enhancement. x + 226 pp.
43. Nielsen, Ole Møller. (1998). Wavelets in scientific computing. xiv + 232 pp.
44. Kjems, Ulrik. (1998). Bayesian signal processing and interpretation of brain scans. iv + 129 pp.
45. Hansen, Michael Pilegaard. (1998). Metaheuristics for multiple objective combinatorial optimization. x + 163 pp.
46. Riis, Søren Kamaric. (1998). Hidden markov models and neural networks for speech recognition. x + 223 pp.
47. Mørch, Niels Jacob Sand. (1998). A multivariate approach to functional neuro modeling. xvi + 147 pp.
48. Frydendal, Ib. (1998). Quality inspection of sugar beets using vision. iv + 97 pp. + app.
49. Lundin, Lars Kristian. (1998). Parallel computation of rotating flows. viii + 106 pp.
50. Borges, Pedro. (1998). Multicriteria planning and optimization. Heuristic approaches. xiv + 219 pp.
51. Nielsen, Jakob Birkedal. (1998). New developments in the theory of wheel/rail contact mechanics. xviii + 223 pp.
52. Fog, Torben. (1998). Condition monitoring and fault diagnosis in marine diesel engines. xii + 178 pp.
53. Knudsen, Ole. (1998). Industrial vision. xii + 129 pp.
54. Andersen, Jens Strodl. (1998). Statistical analysis of biotests. - Applied to complex polluted samples. xx + 207 pp.
55. Philipsen, Peter Alshede. (1998). Reconstruction and restoration of PET images. vi + 132 pp.
56. Thygesen, Uffe Høgsbro. (1998). Robust performance and dissipation of stochastic control systems. 185 pp.
57. Hintz-Madsen, Mads. (1998). A probabilistic framework for classification of dermatoscopic images. xi + 153 pp.
58. Schramm-Nielsen, Karina. (1998). Environmental reference materials - methods and case studies. xxvi + 261 pp.
59. Skyggebjerg, Ole. (1999). Acquisition and analysis of complex dynamic intra- and intercellular signaling events. 83 pp.
60. Jensen, Kåre Jean. (1999). Signal processing for distribution network monitoring. xv + 199 pp.
61. Folm-Hansen, Jørgen. (1999). On chromatic and geometrical calibration. xiv + 238 pp.
62. Larsen, Jesper. (1999). Parallelization of the vehicle routing problem with time windows. xx + 266 pp.
63. Clausen, Carl Balslev. (1999). Spatial solitons in quasi-phase matched structures. vi + (flere pag.)
64. Kvist, Trine. (1999). Statistical modelling of fish stocks. xiv + 173 pp.
65. Andresen, Per Rønsholt. (1999). Surface-bounded growth modeling applied to human mandibles. xxii + 125 pp.
66. Sørensen, Per Settergren. (1999). Spatial distribution maps for benthic communities.
67. Andersen, Helle. (1999). Statistical models for standardized toxicity studies. viii + (flere pag.)
68. Andersen, Lars Nonboe. (1999). Signal processing in the dolphin sonar system. xii + 214 pp.
69. Bechmann, Henrik. (1999). Modelling of wastewater systems. xviii + 161 pp.
70. Nielsen, Henrik Aalborg. (1999). Parametric and non-parametric system modelling. xviii + 209 pp.
71. Gramkow, Claus. (1999). 2D and 3D object measurement for control and quality assurance in the industry. xxvi + 236 pp.
72. Nielsen, Jan Nygaard. (1999). Stochastic modelling of dynamic systems. xvi + 225 pp.
73. Larsen, Allan. (2000). The dynamic vehicle routing problem. xvi + 185 pp.
74. Halkjær, Søren. (2000). Elastic wave propagation in anisotropic inhomogeneous materials. xiv + 133 pp.
75. Larsen, Theis Leth. (2000). Phosphorus diffusion in float zone silicon crystal growth. viii + 119 pp.
76. Dirscherl, Kai. (2000). Online correction of scanning probe microscopes with pixel accuracy. 146 pp.
77. Fisker, Rune. (2000). Making deformable template models operational. xx + 217 pp.
78. Hultberg, Tim Helge. (2000). Topics in computational linear optimization. xiv + 194 pp.
79. Andersen, Klaus Kaae. (2001). Stochastic modelling of energy systems. xiv + 191 pp.
80. Thyregod, Peter. (2001). Modelling and monitoring in injection molding. xvi + 132 pp.
81. Schjødt-Eriksen, Jens. (2001). Arresting of collapse in inhomogeneous and ultrafast Kerr media.
82. Bennetsen, Jens Christian. (2000). Numerical simulation of turbulent airflow in livestock buildings. xi + 205 pp + Appendix.
83. Højen-Sørensen, Pedro A.d.F.R. (2001). Approximating methods for intractable probabilistic models: - Applications in neuroscience. xi + 104 pp + Appendix.
84. Nielsen, Torben Skov. (2001). On-line prediction and control in nonlinear stochastic systems. xviii + 242 pp.
85. Ojelund, Henrik. (2001). Multivariate calibration of chemical sensors. xviii + 182 pp.
86. Adeler, Pernille Thorup. (2001). Hemodynamic simulation of the heart using a 2D model and MR data. xv + 180 pp.
87. Nielsen, Finn Årup. (2001). Neuroinformatics in functional neuroimaging. 330 pp.
88. Kidmose, Preben. (2001). Blind separation of heavy tail signals. viii + 136 pp.
89. Hilger, Klaus Baggesen. (2001). Exploratory analysis of multivariate data. xxiv + 186 pp.
90. Antonov, Anton. (2001). Object-oriented framework for large scale air pollution models. 156 pp. + (flere pag).
91. Poulsen, Mikael Zebbelin. (2001). Practical analysis of DAEs. 130 pp.
92. Keijzer, Maarten. (2001). Scientific discovery using genetic programming.
93. Sidaros, Karam. (2002). Slice profile effects in MR perfusion imaging using pulsed arterial spin labelling. xi + 191 pp.
94. Kolenda, Thomas. (2002). Adaptive tools in virtual environments. Independent component analysis for multimedia. xiii + 112 pp.
95. Berglund, Ann-Charlotte. (2002). Nonlinear regularization with applications in geophysics. xiii + 116 pp.
96. Pedersen, Lars. (2002). Analysis of two-dimensional electrophoresis gel images. xxii + 162 pp.
