Thesis presented to the Faculté des études supérieures in partial fulfillment of the requirements for the degree of Philosophiæ Doctor (Ph.D.) in computer science

Université de Montréal
Faculté des études supérieures

Yoshua Bengio, research advisor
Jean-Jules Brault, jury member
Sam Roweis, external examiner
Lael Parrott, representative of the dean of the FES
Résumé

Most concrete problems to which one wishes to apply learning algorithms arise in high dimension, where the curse of dimensionality poses a challenge to obtaining good performance. The success of kernelized Support Vector Machines (SVMs), particularly on high dimensional problems, has thus generated renewed interest in kernel methods. This thesis proposes three kernel algorithms, essentially extensions of classical algorithms, that greatly improve their performance in high dimension and outperform SVMs. In doing so, we also improve our understanding of the characteristics of high dimensional problems. The first part of the work is an introduction to the field of statistical learning. The second part presents the best-known kernel algorithms. In the third part we present our research, through three articles. Finally, the fourth part offers a synthesis and suggests avenues for further work. The first article, Kernel Matching Pursuit, defines a constructive algorithm yielding a solution of the same form as SVMs, but allowing strict control over the number of support points and yielding sparser solutions than SVMs. The second article, K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms, proposes a variant of the nearest neighbors algorithm, redefining the distance from a point to a class as the distance to the linear manifold supported by the neighbors of that point. This extension makes it possible to outperform SVMs on concrete high dimensional problems. The third article, Manifold Parzen Windows, studies a variant of the classical Parzen density estimator. By using flattened Gaussians oriented along the principal directions appearing in the neighborhood data, one can better represent a density concentrated along a lower dimensional non-linear manifold, which proves beneficial in high dimension. The main conclusion of this work is twofold. On the one hand, it shows that algorithms of classical non-parametric inspiration, which make no use whatsoever of the kernel trick, are capable of performance as good as, or better than, that of SVMs. On the other hand, it shows that in high dimension there is much to be gained by developing algorithms that can exploit the hypothesis that the data are concentrated along lower dimensional non-linear manifolds. This constitutes a hope for beating the curse of dimensionality.

Keywords: kernel methods, non-parametric statistics, curse of dimensionality, Support Vector Machines, sparse solutions, k nearest neighbors, Parzen windows.
Abstract

Most real-world problems to which one wishes to apply machine learning techniques arise in high dimensional spaces, where the curse of dimensionality poses a serious challenge. The success of kernelized Support Vector Machines (SVMs) on a number of high dimensional tasks has thus prompted a renewed interest in kernel methods. In this thesis, we propose three alternative kernel methods, mostly extensions of classical algorithms, with greatly improved performance in high dimension, that are able to outperform SVMs. In the process, we also deepen our understanding of important characteristics of high dimensional problems. Part one of the document is a general introduction to machine learning. Part two describes the most common kernel methods. In part three, we present our research through three published articles. Finally, part four provides a synthesis of our contribution and hints at possible future developments. The first article, Kernel Matching Pursuit, defines a constructive algorithm that leads to solutions of the same form as SVMs, but allows strict control over the number of support vectors, leading to much sparser solutions. The second article, K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms, presents a variation of the nearest neighbors algorithm in which we define the distance between a point and a class as the distance to the linear sub-manifold supported by the neighbors of that point. This extension makes it possible to outperform SVMs on a number of high dimensional problems. The third article, Manifold Parzen Windows, studies an extension of the classical Parzen density estimator, in which we use flattened Gaussian pancakes oriented along the principal directions appearing in the neighborhood of a point. This better captures a density concentrated along a lower dimensional non-linear manifold, which proves useful in high dimension. The important contribution of our research is twofold. First, it shows that algorithms of classical non-parametric inspiration, which do not call upon the kernel trick in any way, are able to perform as well as or better than SVMs. Second, our results indicate that in high dimension, much is to be gained by designing algorithms that take into account the manifold hypothesis, i.e. that data might be concentrated along lower dimensional non-linear sub-manifolds. This is so far our best hope of beating the curse of dimensionality.

Keywords: kernel methods, non-parametric statistics, curse of dimensionality, Support Vector Machines, sparsity, k nearest neighbors, Parzen windows.
Contents

Résumé
Abstract
Remerciements
I  Introduction

1  Presentation of the field of machine learning
   1.1  What is machine learning
   1.2  Historical and multi-disciplinary situation
        1.2.1  Machine learning compared with classical statistics
   1.3  The tasks of learning
        1.3.1  Supervised learning
        1.3.2  Unsupervised learning
        1.3.3  Reinforcement learning
        1.3.4  Interrelations between the techniques

2  Generalization: the great challenge of learning
   2.1  Memorizing is not generalizing
   2.2  Notation and formalization of the problem
   2.3  Measuring generalization performance
   2.4  Some notions of learning theory
   2.5  Common capacity-control practices

3  An attempted classification of learning algorithms
   3.1  Generative models
   3.2  Direct modeling of the decision surface
   3.3  Progressive feature extraction
   3.4  Models based on distances to prototypes
   3.5  Relative value of this taxonomy

4  The challenges of high dimensionality
   4.1  The curse of dimensionality
   4.2  Geometric intuitions in high dimension
II  Kernel models

5  Classical and modern kernel methods
   5.1  Kernels and distances
   5.2  Classical kernel methods: non-parametric
        5.2.1  The k nearest neighbors algorithm (KNN)
        5.2.2  Kernel regression: the Nadaraya-Watson estimator
        5.2.3  Parzen windows for density estimation
   5.3  Modern kernel methods
        5.3.1  Linear Support Vector Machines (SVMs)
        5.3.2  From linear to non-linear
        5.3.3  The kernel trick
        5.3.4  Using the kernel trick in algorithms

6  The form of the kernel
   6.1  Classical or modern interpretation?
   6.2  Importance of the form of the kernel or metric
8  Presentation of the first article
   8.1  Context and objectives of this research
   8.2  Motivations for more precise control of the number of support vectors
   8.3  Discovery of similar algorithms
   8.4  Contributions to the field

9  Kernel Matching Pursuit
   9.1  Introduction
   9.2  Three flavors of Matching Pursuit
        9.2.1  Basic Matching Pursuit
        9.2.2  Matching Pursuit with back-fitting
        9.2.3  Matching Pursuit with pre-fitting
        9.2.4  Summary of the three variations of MP
   9.3  Extension to non-squared error loss
        9.3.1  Gradient descent in function space
        9.3.2  Margin loss functions versus traditional loss functions for classification
   9.4  Kernel Matching Pursuit and links with other paradigms
        9.4.1  Matching pursuit with a kernel-based dictionary
        9.4.2  Similarities and differences with SVMs
        9.4.3  Link with Radial Basis Functions
        9.4.4  Boosting with kernels
        9.4.5  Matching pursuit versus Basis pursuit
        9.4.6  Kernel Matching pursuit versus Kernel Perceptron
   9.5  Experimental results on binary classification
        9.5.1  2D experiments
        9.5.2  US Postal Service Database
        9.5.3  Benchmark datasets
   9.6
11  K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
    11.1  Motivation
    11.2  Fixing a broken Nearest Neighbor algorithm
          11.2.1  Setting and definitions
          11.2.2  The intuition
          11.2.3  The basic algorithm
          11.2.4  Links with other paradigms
    11.3  Fixing the basic HKNN algorithm
          11.3.1  Problem arising for large K
          11.3.2  The convex hull solution
          11.3.3  The weight decay penalty solution
    11.4  Experimental results
    11.5  Conclusion

12  Presentation of the third article
    12.1  Context and objectives of this research
    12.2  Remark on the choice of the spiral
    12.3  Contribution to the field

13  Manifold Parzen Windows
    13.1  Introduction
    13.2  The Manifold Parzen Windows algorithm
    13.3  Related work
    13.4  Experimental results
          13.4.1  Experiment on 2D artificial data
          13.4.2  Density estimation on OCR data
          13.4.3  Classification performance
    13.5  Conclusion
IV  Synthesis

14  Discussion and synthesis
    14.1  Synthesis of the proposed algorithms
    14.2  On sparsity
    14.3  A probabilistic counterpart of HKNN for density estimation
    14.4  Conclusion

Bibliography
A  Co-author authorizations
List of Tables

11.1  HKNN: results on USPS and MNIST
11.2  HKNN: performance with a reduced point set
13.1  Manifold Parzen: log-likelihood for the spiral
13.2  Manifold Parzen: density estimation of the digit 2 from MNIST
13.3  Manifold Parzen: classification error obtained on USPS
13.4  Manifold Parzen: comparison of conditional log-likelihood obtained on USPS

List of Figures

11.1  HKNN: the intuition
11.2  HKNN: 2D example
11.3  HKNN: importance of weight decay
13.1  Illustration of the zig-zag problem
13.2  Qualitative illustration of the difference between ordinary Parzen and Manifold Parzen
SVM : Support Vector Machine
KMP : Kernel Matching Pursuit
HKNN : the K-Local Hyperplane Distance Nearest Neighbor algorithm presented in the second article
CKNN : the K-Local Convex Distance Nearest Neighbor algorithm presented in the second article
Mathematical notation

ℝ : the set of real numbers.
ℝ⁺ : the set of non-negative real numbers.
N(µ, σ²) : the normal (Gaussian) distribution centered at µ with variance σ².
N(µ, Σ) : the Gaussian distribution centered at µ with covariance matrix Σ.
|x| : the absolute value of x.
‖x‖ : the norm of x.
Acknowledgements

I wish to thank my research advisor Yoshua Bengio, for having welcomed and guided me through all these years, through the joys, difficulties and doubts of a researcher's career, for his confidence, and above all for having shared his passion and communicated his enthusiasm for this marvelous field of research. I also thank all the students and researchers who passed through the Lisa laboratory, especially those among them who were compelled to live with my imperfect software creations, for their patience and their friendship.
General introduction

Our faculty for learning is what allows us to constantly adapt to our changing environment, to grow and to progress. It is largely to this faculty that humanity owes its survival and its greatest successes. Learning has always been the spearhead of the Connectionist branch of Artificial Intelligence, and the algorithms and techniques it has developed over the years now find practical applications in a growing number of fields: from speech and handwriting recognition, to finance and the human genome project. But even though its recent successes are impressive, compared with human faculties we are only taking our first steps. Many mysteries remain, and learning algorithms are still a very active research area. In this thesis we focus on kernel learning algorithms and present, through three articles, our contribution to this field, which has seen renewed interest in recent years, with the hope of achieving algorithms that perform well in high dimension.

This work is divided into four parts. The first part is essentially a presentation of the learning paradigm and a survey of the field. Its primary goal is to familiarize the reader who is not already so with the essential concepts of the field and with its terminology. The second part gives an overview of the best-known kernel algorithms and briefly reviews a number of prior works. The third part presents and includes three articles that we have contributed to research on kernel models. Finally, the fourth part attempts a synthesis, discusses the merits and limits of what we have proposed, and suggests avenues for going further.
…of examples (the data corresponding to past experience), and to assimilate their nature so as to be able to apply what has thus been learned to future cases.

…mathematical and theoretical disciplines such as information theory and signal processing, non-linear optimization, but above all, and predominantly in recent years, with the statistical point of view.

…sub-symbolic artificial intelligence took a more empirical path, quite content to produce mathematical monsters such as neural networks, as long as they worked and gave good results! Insofar as the models used were more complex, the questions of model selection and capacity control naturally imposed themselves with force. But one sees that, far more than a fundamental difference between the two fields, what separates them is a difference of culture and emphasis: classical statistical studies often restricted themselves to models lending themselves well to mathematical analysis (fairly simple models, in low dimension). By comparison, artificial intelligence research was resolutely engaged on the path of complexity, with the capacity of computer hardware as its only limit, and driven by the need to build systems answering the concrete problems of the moment. Nevertheless, with time, the field of machine learning matured, formalized and theorized itself, and thus inexorably drew closer to statistics, to the point of being rechristened statistical learning. Even so, though considerably narrowed, the cultural gap is not entirely bridged, notably as regards conventions of procedure and terminology. We therefore hope that the reader more familiar with statistical formalism will forgive the author's admittedly less rigorous approach.
Classification. In classification problems, the input corresponds to an instance of a class, and the associated output indicates the class. For example, in a face recognition problem, the input would be the bitmap image of a person as supplied by a camera, and the output would indicate which person it is (among the set of persons the system is meant to recognize).

Regression. In regression problems, the input is associated not with a class but, in the general case, with one or more real values (a vector). For example, in a biochemistry experiment, one might want to predict the reaction rate of an organism as a function of the levels of the various substances administered to it.

Time series. In time series problems, the task is typically to predict the future values of some quantity knowing its past values as well as other information; for example, the return of a stock on the market. An important difference with regression or classification problems is that the data typically follow a non-stationary distribution.
Density estimation. In a density estimation problem, one seeks to model the distribution of the data adequately. The resulting estimator should give a good estimate of the probability density at a test point drawn from the same (unknown) distribution as the training data.
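A classical answer to this task, presented later in this thesis, is the Parzen windows estimator: the average of kernel bumps centered on the training points. A minimal 1-D sketch (the toy sample and bandwidth are invented for the example):

```python
import math

def parzen_density(x, train_points, h):
    """Parzen windows estimate: the average of Gaussian bumps of bandwidth h
    centered on each training point; it integrates to 1 over the real line."""
    norm = h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-(x - xi) ** 2 / (2.0 * h * h)) / norm
               for xi in train_points) / len(train_points)

# Toy 1-D sample concentrated around 0: the estimate is high near the data
# and essentially zero far from it.
sample = [-0.3, -0.1, 0.0, 0.1, 0.2, 0.4]
p_near = parzen_density(0.0, sample, h=0.3)
p_far = parzen_density(5.0, sample, h=0.3)
```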
Clustering. The clustering problem is the unsupervised counterpart of classification. A clustering algorithm attempts to partition the input space into a certain number of classes, based on a finite training set containing no explicit class information. The criteria used to decide whether two points should belong to the same class or to different classes are specific to each algorithm, but are very often tied to a measure of distance between points. The most classical example of a clustering algorithm is the K-Means algorithm.
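To make the K-Means example concrete, here is a minimal 1-D sketch of Lloyd's algorithm (the standard K-Means procedure); the toy points are invented for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm for 1-D K-Means: alternately assign each point to its
    nearest center, then move each center to the mean of its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        centers = [sum(cl) / len(cl) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups: K-Means recovers one center per group.
points = [0.0, 0.1, -0.1, 0.2, 10.0, 10.1, 9.9, 10.2]
centers = kmeans(points, 2)
```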
Dimensionality reduction. The goal of a dimensionality reduction algorithm is to summarize the information present in the n coordinates of a high-dimensional point (n large) by a smaller number m < n of features, in the hope of uncovering an underlying structure that would not be immediately apparent in the original high-dimensional data. The most classical example of a dimensionality reduction algorithm is Principal Component Analysis (PCA) [52].
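As an illustration of PCA (a generic sketch, not code from the thesis; the toy data are invented), the following projects 3-D points onto their single principal direction:

```python
import numpy as np

def pca(X, m):
    """Project n-dimensional points onto their m principal directions:
    the top eigenvectors of the (centered) data covariance matrix."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :m]    # keep the top-m directions
    return Xc @ components

# Toy data: 3-D points that in fact vary almost only along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2.0 * t, -t]) + 0.01 * rng.normal(size=(100, 3))
Z = pca(X, 1)  # a 1-D summary retaining almost all of the variance
```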
…on the other hand, density estimation is often a harder problem in practice, given a finite number of training data. Let us make clear that, in the remainder of this exposition, we restrict our attention to classification, regression and density estimation problems in ℝⁿ (the most commonly used data representation). We will pay particular attention to the cases where n is large.
…although the way they access this memory is not qualitatively very different from the way a human being accesses the encyclopedia in his library (his extended memory), just a little faster…
The second type of learning is fundamentally distinguished from the first in that it makes extensive use of our faculty of generalization. Thus, to learn to read, one must be able to identify a word written in a way one has never seen before. Although it is trivial to memorize a large quantity of examples, generalizing from them to the resolution of new cases, even ones differing only slightly, is far from an obvious problem. What is meant by learning in learning algorithms, and what makes all their interest and all their difficulty, is indeed the ability to generalize, and not merely rote memorization.
…decision. But in what follows, by a slight abuse of notation, we will use an uppercase P to identify probabilities as well as densities, whether the variable is continuous or discrete, lowercase letters being reserved for functions. (We recall that we are not considering here the case of temporal data, whose nature differs.)
Typically, input and output are each represented as a real vector: x ∈ ℝⁿ and y ∈ ℝᵐ. The problem of supervised learning is then, starting from the training set (and possibly from a priori knowledge one has of the domain), to find a way to compute the y associated with a given x while making as few errors as possible. The most commonly used approach consists in first finding, within a chosen set of functions F, the best function f, and then using the function thus modeled to associate an output with a test point. One generally has a cost function L(f(x), y) measuring the discrepancy between the computed output f(x) and the actual y; for example, for classification, if y represents the class number, one could count the number of misclassifications. The quality of a function f is then measured by its average error under the true data distribution, or expected risk

    R(f) = E[L(f(X), Y)],

possibly augmented with a penalty term, and learning consists in seeking the function f ∈ F which minimizes this criterion.
This approach, which consists in first finding a function from the training data and then applying that function to the new test data, is the inductive approach. A slightly different approach, which consists in finding a function that also depends on the test point or points considered, is called transductive. The function thus obtained constitutes a model, in the sense that it models the relation between input and output. The optimization of the parameters is called the training or learning phase of the model, and yields a trained model.
Since the true distribution of the data is unknown, one must in practice settle for the error measured on the training examples.
This estimate is obviously biased. To obtain an unbiased estimate of the generalization error, it is crucial to measure the error on examples that did not serve to train the model. To this end, the available data are divided into two parts: a training subset, whose data serve for learning (or training) the model; and a test subset, whose data are used only to evaluate the performance of the trained model — these are the out-of-sample data. One thus obtains the test error, a noisy but unbiased estimate of the generalization error. Moreover, the training set itself is often split between a subset used to learn the parameters of the model proper, and another subset, called the validation set, used for model selection. Model selection is understood here in the broad sense: it may be a matter of choosing the best among several very different families of models, or among very similar variants of a single model, differing only in the values of a hyper-parameter controlling the model's definition. When too little data is available, a cross-validation technique [92] can be used to generate several training/test pairs of subsets. This technique is computationally expensive, but yields a good estimate of the test error while preserving enough examples for training. When, in this document, we speak of the performance of a model, we mean test performance, as measured on a test set or by cross-validation.
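The k-fold cross-validation procedure can be sketched as follows; this is a generic illustration (the toy data and the two candidate models are invented), not code from the thesis:

```python
import random

def k_fold_splits(data, k):
    """Partition data into k (train, test) pairs; each example is held out exactly once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cv_error(data, k, fit, loss):
    """Average held-out loss over the k folds: a nearly unbiased estimate of
    generalization error that still lets every example serve for training."""
    errs = []
    for train, test in k_fold_splits(data, k):
        model = fit(train)
        errs.append(sum(loss(model, xy) for xy in test) / len(test))
    return sum(errs) / k

# Model selection by 5-fold CV between two toy models: a constant predictor
# and a no-intercept linear predictor, on data that is genuinely linear.
random.seed(0)
data = [(x, 2.0 * x + random.gauss(0.0, 0.1)) for x in [i / 20 for i in range(40)]]

def fit_const(train):
    return ('const', sum(y for _, y in train) / len(train))

def fit_linear(train):
    return ('linear', sum(x * y for x, y in train) / sum(x * x for x, _ in train))

def squared_loss(model, example):
    kind, w = model
    x, y = example
    pred = w if kind == 'const' else w * x
    return (pred - y) ** 2

err_const = cv_error(data, 5, fit_const, squared_loss)
err_linear = cv_error(data, 5, fit_linear, squared_loss)
# The linear model has the lower cross-validated error, so it is selected.
```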
[Figure 2.1: the bias-variance decomposition, showing the true function f_true, the best function f* in the chosen set F, and the function found from the finite training sample.]

…of the true function at a certain number of points: our training set. A learning algorithm allows us, in principle, to find within the set F the function closest to the true one. Two typically irreconcilable problems arise:
- The true ideal function may not lie in the set of functions F that we chose; consequently one would tend to choose a larger set of functions to limit this problem.
- The limited number of training examples at our disposal is not sufficient to locate precisely the best function within our set of functions; this problem can logically be reduced by choosing a smaller set.
The error term due to the first problem is called the bias, and the one due to the second is called the variance (since it is due to the variability of the finite sample that is the training set we are given). The resulting dilemma is called the bias-variance dilemma; see figure 2.1. One sees that the size or complexity of the set of functions F plays a fundamental role. What we call capacity is a measure of this complexity.
…is called the regularization term: one thus favors simpler functions (low capacity) and penalizes the more complex ones (high capacity). Here are a few methods commonly used to put this principle into practice; the reader interested in the concrete application of these techniques, particularly for training neural networks, is referred to [70].
- Weight decay: introduces a penalty on the magnitudes of the parameters, favoring parameter values close to 0.
- Early stopping: a technique for deciding to stop an iterative algorithm before it reaches the function minimizing the training error; it thus limits the effective size or capacity of the explored function space.
To summarize, a learning algorithm that performs well on a given task is obtained above all by combining a sufficiently large number of training examples with good a priori knowledge of the problem, provided such knowledge is available.
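To illustrate, here is a minimal sketch of weight decay and early stopping on a one-parameter least-squares problem (the toy data, learning rate and penalty strength are invented for the example):

```python
import random

def train(data, lam, steps, lr=0.01):
    """Gradient descent on mean squared error plus a weight-decay penalty lam * w**2.
    The penalty term pulls the parameter toward 0, limiting effective capacity."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data) + 2.0 * lam * w
        w -= lr * grad
    return w

random.seed(0)
data = [(i / 10, 3.0 * (i / 10) + random.gauss(0.0, 0.2)) for i in range(21)]

w_free = train(data, lam=0.0, steps=2000)     # converges near the unpenalized fit
w_decayed = train(data, lam=5.0, steps=2000)  # weight decay shrinks the parameter
w_early = train(data, lam=0.0, steps=50)      # early stopping: fewer steps, smaller weight
```

Both regularizers end up with a parameter of smaller magnitude than the fully converged, unpenalized fit, which is exactly the capacity-limiting effect described above.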
…a density estimator P(X | C = c) for each class c from the training data. Then, when it comes time to decide on the class of a new point x presented to us, Bayes' rule is used to obtain the probability of each class:

    P(C = c | X = x) = P(X = x | C = c) P(C = c) / P(X = x),

where P(X = x) is a simple normalization factor.
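The generative recipe — fit a density per class, then apply Bayes' rule — can be sketched as follows; this 1-D Gaussian example is an invented illustration, not code from the thesis:

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_class_density(points):
    """Per-class generative model: a Gaussian fit to that class's training points."""
    mu = sum(points) / len(points)
    var = sum((p - mu) ** 2 for p in points) / len(points)
    return mu, var

def posterior(x, models, priors):
    """Bayes' rule: P(c | x) = P(x | c) P(c) / P(x), with P(x) the normalization factor."""
    joint = {c: gaussian_pdf(x, *models[c]) * priors[c] for c in models}
    z = sum(joint.values())  # P(x): the joint summed over all classes
    return {c: j / z for c, j in joint.items()}

# Toy 1-D data: class 'a' clustered near 0, class 'b' near 4.
train = {'a': [-0.5, 0.0, 0.3, 0.2, -0.1],
         'b': [3.8, 4.1, 4.0, 4.3, 3.9]}
models = {c: fit_class_density(pts) for c, pts in train.items()}
priors = {c: len(pts) / sum(len(p) for p in train.values()) for c, pts in train.items()}
```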
An alternative to adopting the maximum likelihood solution is the pure Bayesian approach, which considers the integral over all possible values of the model's parameters, taking into account a prior probability over those values (a prior on the model). Although very attractive from a theoretical point of view, the Bayesian approach often presents constraining practical difficulties, requiring recourse to approximations and limiting its applicability. Among the approaches inspired by generative models that have met with great success, one may cite Hidden Markov Models (used notably in speech recognition systems) and Gaussian Mixture models.
…the zones of the two classes, which is called the decision surface. It is a surface of dimension n−1 in the input space of dimension n. All classification algorithms induce such a decision surface, but only a few make it their starting point: they start from a hypothesis about the form of this decision surface¹ — linear, polynomial, etc. — and then search, within that class of functions, for the parameters that will minimize the desired cost criterion (ideally the expected classification error). Thus the ancestor of neural networks, the Perceptron algorithm [77], seeks a linear decision surface, represented by a decision function of the form f(x) = sign(⟨w, x⟩ + b). The same goes for the more recent and very popular SVM algorithm [11, 102], to which we return in detail in chapter 5. Although the optimized criterion, which rests on firmer theoretical foundations, is somewhat different from the Perceptron's, and although a trick² makes it easy to extend the algorithm to model non-linear decision surfaces (polynomial, etc.), the starting point of SVMs remains a way of finding a suitable simple linear decision surface.

¹ More specifically, they start from a hypothesis about the form of the decision function, whose sign indicates the decision and which induces the decision surface.
² The kernel trick, which can incidentally also be applied to the Perceptron [37].
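The Perceptron's mistake-driven search for such a linear decision function can be sketched as follows (the toy data are invented for the example):

```python
def perceptron(data, epochs=100):
    """Perceptron learning rule: whenever sign(<w, x> + b) misclassifies an
    example, nudge (w, b) toward it. Converges if the data are linearly separable."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:  # labels y are in {-1, +1}
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:
                w[0] += y * x[0]
                w[1] += y * x[1]
                b += y
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side: stop
            break
    return w, b

def decide(w, b, x):
    """The linear decision function f(x) = sign(<w, x> + b)."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# Toy linearly separable data.
data = [((2.0, 1.0), 1), ((1.5, 2.0), 1), ((3.0, 2.5), 1),
        ((-1.0, -0.5), -1), ((-2.0, -1.5), -1), ((-0.5, -2.0), -1)]
w, b = perceptron(data)
```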
…the inverse process, which starts from the inputs and, through successive transformations and computations, produces as output the decision about the corresponding class, without assuming that these transformations have anything to do with a process that might have generated the data. In this category we find multi-layer neural networks, which drew largely on our knowledge of the nervous system (in particular the layered architectures of the first stages of visual processing). In this view, the input layer represents the raw sensory data, and each subsequent layer computes features of increasingly higher level, progressively up to the last layer, which computes a score for each class. The finest success story of this type of approach is without doubt the LeNet architecture for character recognition [57, 12, 58].
…or does one instead try to invent prototypes that would summarize the training set (and how?).
- The choice of the distance used to measure similarity between points. By far the most common choice, for problems in ℝⁿ, is the Euclidean distance.
- How the information of distances to the prototypes is used to make the decision.
Among the algorithms that keep the totality of the training points as prototypes, one may count the classical non-parametric statistical methods: k nearest neighbors, kernel regression, and the Parzen windows density estimator. We present all these algorithms in more detail in section 5.2. These methods are sometimes grouped under the English terms memory based, template matching, or lazy learning. One can see that the outlook is very different from that of the generative models and the parametric statistical approach presented in section 3.1. Among the algorithms that construct their prototypes, one may mention the centroid algorithm, which summarizes each class by the centroid of the training points belonging to that class (see section 5.3.4). RBF-type neural networks [75] can also be seen as learning a small number of prototypes, producing a decision function that is a linear combination of Gaussian kernels (a function of the distances to the prototypes).
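As an illustration of the prototype view, here is a minimal sketch of the centroid algorithm mentioned above (the toy data are invented):

```python
import math

def centroid(points):
    """Componentwise mean of a class's training points: its prototype."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def fit_centroids(train):
    """One prototype per class: the centroid of that class's points."""
    return {label: centroid(pts) for label, pts in train.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

train = {'a': [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)],
         'b': [(3.0, 3.1), (2.8, 2.9), (3.2, 3.0)]}
centroids = fit_centroids(train)
```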
4.1 The curse of dimensionality
The curse of dimensionality [8] refers to the exponential growth of the explorable space with the number of dimensions. Consider, for example, the number of corners of a hypercube in dimension d: in dimension 2 (a square) there are 2^2 = 4; in dimension 3 (a cube), 2^3 = 8; in dimension 100, 2^100 ≈ 1.27 × 10^30.

Already in dimension 100, this number vastly exceeds the size of the training databases one will typically deal with. Yet this in no way prevents such databases from being 100-dimensional or more! Thus the number of available examples is ridiculously small compared to the size of the space in which they live. Note that the number of corners of a hypercube in dimension d, namely 2^d, is the number of possible combinations of values that d binary variables can take, i.e. variables that can each take only two possible values. If the variables can take more values, the effect is even more dramatic, and for continuous variables it becomes hard to even conceive!
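The exponential growth above is immediate to check numerically; a quick plain-Python sketch, for illustration only:

```python
# Number of corners of a d-dimensional hypercube: 2**d, i.e. the number
# of joint values that d binary variables can take. The growth is exponential.
for d in (2, 3, 100):
    print(d, 2**d)
```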
One particularity is that the Euclidean distance between two points is much more informative in low dimension than in high dimension. One way to understand this is that the distance is, in the end, a scalar statistic summarizing the differences between the coordinates of the two points. In high dimension, only a distance value close to 0 is truly informative, for it indicates that all the coordinates are close; large distance values merely indicate that some variables differ, without carrying any information as to their identity (which ones differ a lot and which ones little or not at all), and there are far more possibilities as to their identity in high dimension. We call this phenomenon the myopia of the Euclidean distance, since it does not allow one to see clearly very far, being informative only very locally. This myopia increases with dimensionality: the higher the dimension, the more overwhelming the proportion of far-away points relative to nearby points becomes (assuming, for instance, data drawn from a uniform distribution in a hypercube).

Another point worth mentioning is the importance of extrapolation relative to interpolation. In low dimension, we tend to think in terms of interpolation (between a small number of points of a curve, for example). But in high dimension, the probability that a test point belongs to the convex hull of the training data becomes very small. Most of the time, we are therefore in a situation of extrapolation.
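This concentration effect is easy to observe empirically. A minimal sketch, assuming data drawn uniformly in the unit hypercube (the function name and sampling choices are illustrative, not from the thesis):

```python
import random, math

# As dimension grows, pairwise Euclidean distances concentrate, so far
# and near points become hard to tell apart ("myopia" of the distance).
def spread(d, n=200, seed=0):
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n)]
    q = [0.5] * d  # query point at the center of the cube
    dists = sorted(math.dist(q, p) for p in pts)
    return dists[-1] / dists[0]  # ratio of farthest to nearest distance

for d in (2, 10, 100, 1000):
    print(d, round(spread(d), 2))  # the ratio shrinks toward 1 as d grows
```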
are observed. If we further assume that small continuous variations of these factors should translate into small variations of the observed variables, then these data lie concentrated along a non-linear manifold of lower dimension. This notion of manifold is also what underlies the tangent distance technique [87], where the tangent sub-space at each point is deduced from prior knowledge about transformations that are invariant for the task at hand (small rotations or translations of all the pixels of the image of a handwritten digit, which do not change its class).
FIG. 4.1 – Illustration of the concept of a manifold: a non-linear manifold of dimension 2 in a space of dimension 3. Two points are shown, each with two vectors defining the tangent sub-space at that point.
ference when we use the term similarity (or dissimilarity) measure rather than a formal definition of distance. A kernel is usually considered as a two-argument function $K(\cdot, \cdot)$ whose first argument can be fixed, thus making it a real-valued function; we then speak of a kernel centered at that point.

The most commonly used kernel is the Gaussian kernel, which corresponds to the density of a multivariate Normal distribution. In high dimension, parameterizations of varying richness are possible, notably:

The spherical, or isotropic, Gaussian of variance $\sigma^2$:
$$K(x, y) = \frac{1}{\left( 2 \pi \sigma^2 \right)^{d/2}} \exp\left( -\frac{\| x - y \|^2}{2 \sigma^2} \right)$$

The full-covariance Gaussian:
$$K(x, y) = \frac{1}{\left( 2 \pi \right)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - y)^\top \Sigma^{-1} (x - y) \right)$$
where $|\Sigma|$ is the determinant, and $\Sigma^{-1}$ the inverse, of the covariance matrix $\Sigma$.

For a spherical Gaussian, one will often speak of its width $\sigma$. Note that the formulations used in practice for the spherical Gaussian kernel often differ somewhat from the one stated above. In particular, many algorithms do not require the kernel to be normalized (i.e. to integrate to one) and omit the normalization factor, or they use a slightly different expression for the width. So do not be surprised to encounter a simpler definition of the Gaussian kernel, such as for example
$$K(x, y) = \exp\left( -\frac{\| x - y \|^2}{\sigma^2} \right).$$
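Both forms of the spherical Gaussian kernel discussed above can be sketched as follows (illustrative helper names; not code from the thesis):

```python
import math

# Spherical Gaussian kernel: normalized (a proper density) and the
# simpler unnormalized variant often used in practice.
def gaussian_kernel(x, y, sigma, normalized=True):
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    if normalized:
        return math.exp(-sq / (2 * sigma**2)) / (2 * math.pi * sigma**2) ** (d / 2)
    return math.exp(-sq / sigma**2)  # simpler form, common in practice

print(gaussian_kernel([0.0, 0.0], [0.0, 0.0], 1.0))  # peak of the 2-D density
```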
Given a training set, a distance function $d$, and an integer $K$, the K nearest neighbors algorithm assigns to a test point, for which it must make a decision, the class that is most represented among its $K$ nearest training points in the sense of the distance $d$. Considering more than the single nearest neighbor provides a certain robustness to labeling errors. A variant of this algorithm can be used for regression: the test point is then assigned the average of the values associated with its $K$ nearest neighbors.
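The decision rule just described can be sketched in a few lines (a plain-Python illustration with the Euclidean distance; names are hypothetical):

```python
import math
from collections import Counter

# K nearest neighbors classification: majority vote among the K training
# points closest to the query.
def knn_predict(train, x, k):
    # train: list of (point, label) pairs
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([0.0, 0.0], "a"), ([0.1, 0.2], "a"), ([5.0, 5.0], "b"), ([5.1, 4.9], "b")]
print(knn_predict(train, [0.2, 0.1], 3))  # → a
```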
Once again, we have a training set in the form of (input, output) pairs. We are also given a kernel $K$, which must correspond to a density and thus integrate to 1 (typically a spherical Gaussian kernel of fixed width). The Parzen density estimator, which can be seen as a smoothing of the empirical density arising from the training data, is:
$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} K(x, x_i).$$
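A minimal sketch of the Parzen estimator above, with a normalized spherical Gaussian kernel (illustrative code, not from the thesis):

```python
import math

# Parzen windows density estimate: average of normalized Gaussian bumps
# of fixed width sigma centered on each training point.
def parzen(x, data, sigma):
    d = len(x)
    norm = (2 * math.pi * sigma**2) ** (d / 2)
    return sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2 * sigma**2)) / norm
        for xi in data
    ) / len(data)

data = [[0.0], [1.0], [2.0]]
print(parzen([1.0], data, 0.5))  # density is highest near the middle point
```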
linear separation that maximizes the margin. The margin can be defined as the minimal Euclidean distance between the separating surface (a hyperplane) and the closest point of the training set (there is another meaning of margin, the so-called functional margin, on which we will dwell in the section on margin loss functions of our first article). This is the so-called hard-margin version [11] (see figure 5.1). A so-called soft-margin extension [22] makes it possible to handle the non-separable cases, where some points are accepted to lie within the margin, or even on the wrong side of the decision surface, at the price of a linear penalty controlled by a capacity parameter $C$.

The linear solution found by this algorithm can be expressed, just like that of the Perceptron, in the following form:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle x, x_i \rangle + b$$
where the coefficients $\alpha_i$ and the constant $b$ define the hyperplane.

We will not dwell here on the details of the algorithm that finds this maximal-margin solution. Let us simply point out that it amounts to solving a constrained quadratic programming problem, which is spelled out in our first article, in Chapter 9.
a linear decision surface in this transformed space will correspond to a polynomial decision surface in the original input space.

A linear classifier alone may be unable to separate a given data set. But if we build a linear classifier, not on the space of the inputs $x = (x_1, \ldots, x_d)$, but on an extended space of higher dimension, for example the space of all monomials of degree at most 2 in the components of $x$,
$$\phi(x) = \left( x_1, \ldots, x_d,\ x_1^2, \ldots, x_d^2,\ x_1 x_2, \ldots, x_{d-1} x_d \right), \tag{5.1}$$
then the sign of the linear decision function $f(x) = \langle w, \phi(x) \rangle + b$, which indicates the side of the hyperplane and is used to represent the class,
corresponds, as we can see, to a polynomial decision surface of order 2. This way of proceeding is not new, but it often raised numerous practical problems: the dimension of the extended space grows in general exponentially with the number of dimensions of the input space, and it quickly becomes impossible in practice to compute and store the extended vectors.

However, for a number of linear algorithms, the model produced by the algorithm is in the end nothing but a weighted sum of training examples:
$$w = \sum_{i=1}^{n} \alpha_i x_i \tag{5.2}$$
This is the case in particular of SVMs as well as of the Perceptron. Applying the model to a test point $x$ then involves the training points only through dot products $\langle x_i, x \rangle$.

The kernel trick (first introduced by [1]) then consists in running such an algorithm not on explicitly extended vectors, but by replacing every dot product with a kernel $K(x_i, x)$ corresponding to a dot product $\langle \phi(x_i), \phi(x) \rangle$ in an extended space. The principle applies to any kernel satisfying the Mercer conditions [23]. For example, the Gaussian kernel previously mentioned corresponds to a dot product in a transformed space of infinite dimension [74], and is commonly used in SVMs with success.
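The legitimacy of such a substitution can be checked numerically on a small, standard example (not taken from the thesis): the degree-2 polynomial kernel $(\langle x, y \rangle + 1)^2$ equals an explicit dot product in a space of degree-at-most-2 monomials, with suitable scaling factors.

```python
import itertools, math

# Explicit degree-<=2 feature map whose dot product reproduces the
# polynomial kernel (<x,y> + 1)^2.
def phi(x):
    # features: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i < j)
    feats = [1.0]
    feats += [math.sqrt(2) * v for v in x]
    feats += [v * v for v in x]
    feats += [math.sqrt(2) * x[i] * x[j]
              for i, j in itertools.combinations(range(len(x)), 2)]
    return feats

def poly_kernel(x, y):
    return (sum(a * b for a, b in zip(x, y)) + 1) ** 2

x, y = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
lhs = poly_kernel(x, y)
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))
print(lhs, rhs)  # the two numbers agree
```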
To illustrate how the kernel trick can concretely be applied to an existing algorithm, let us take the example of the centroid algorithm for classification.

Given a finite training set, one can easily compute the centroid of the points of each class. The centroid algorithm for classification then assigns a test point to the class of the nearest centroid (in the sense of the Euclidean distance). When there are only two classes, hence two centroids, this produces a linear decision surface: the hyperplane bisecting the segment joining the two centroids. The squared distance between a test point $x$ and the centroid $c = \frac{1}{m} \sum_{i=1}^{m} x_i$ of a class of $m$ points is:
$$\| x - c \|^2 = \langle x, x \rangle - \frac{2}{m} \sum_{i=1}^{m} \langle x, x_i \rangle + \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \langle x_i, x_j \rangle$$
an expression that involves the data only through dot products.
We can see then that the kernel trick applies: replacing each dot product by a kernel evaluation computes the distance to the centroid in the corresponding feature space $\phi$. If we use this $\phi$-distance, with a Gaussian kernel for example, we obtain in the two-class case a decision surface that is non-linear in the input space of the $x$ (although it is linear in the $\phi$-space).
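The kernelized centroid classifier can be sketched directly from the expansion of the squared distance, written purely in terms of kernel evaluations (helper names are illustrative, not from the thesis):

```python
import math

def rbf(x, y, sigma=1.0):
    # unnormalized Gaussian kernel
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma**2)

def sq_dist_to_centroid(x, cls, k):
    # ||phi(x) - centroid_phi||^2 expressed with kernel evaluations only
    m = len(cls)
    return (k(x, x)
            - 2.0 / m * sum(k(x, xi) for xi in cls)
            + 1.0 / m**2 * sum(k(xi, xj) for xi in cls for xj in cls))

def centroid_predict(x, class_a, class_b, k=rbf):
    da = sq_dist_to_centroid(x, class_a, k)
    db = sq_dist_to_centroid(x, class_b, k)
    return "a" if da < db else "b"

a = [[0.0, 0.0], [0.5, 0.1]]
b = [[4.0, 4.0], [4.2, 3.9]]
print(centroid_predict([0.3, 0.2], a, b))  # → a
```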
Kernel SVMs. The kernel SVM algorithm is built from the linear SVM in the same way we just applied the kernel trick to the centroid case. The usual formulation of the kernel SVM decision function is:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x, x_i) + b$$
with $K$ a given kernel (a Gaussian kernel, for instance), $b$ a constant, and the $\alpha_i$ coefficients that are non-zero only for the support vectors.
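Evaluating this decision function is straightforward once the $\alpha_i$, the support vectors, and $b$ are known; the values below are made up for illustration (obtaining them in practice requires solving the quadratic program mentioned earlier):

```python
import math

def rbf(x, y, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma**2)

# Kernel SVM decision function f(x) = sum_i alpha_i * y_i * K(x, x_i) + b
def svm_decision(x, support_vectors, labels, alphas, b, k=rbf):
    return sum(a * yi * k(x, xi)
               for a, yi, xi in zip(alphas, labels, support_vectors)) + b

sv = [[0.0, 0.0], [2.0, 2.0]]   # hypothetical support vectors
y = [1, -1]                     # their class labels
alpha = [1.0, 1.0]              # hypothetical learned coefficients
print(svm_decision([0.1, 0.0], sv, y, alpha, 0.0))  # positive: class +1 side
```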
Other algorithms. Many classical, essentially linear, algorithms can similarly be extended thanks to the kernel trick. Let us mention in particular the kernel extensions of the Perceptron [37, 45] and of principal component analysis [80, 81]. To conclude our quick overview of kernel algorithms, we must at least mention Gaussian Processes [107], very elegant from a theoretical standpoint, but whose computational complexity has greatly limited their applicability in practice. Recent developments [56, 83], however, may change the situation.
FIG. 5.1 – The maximal-margin decision surface of SVMs, showing the margin on either side of the decision surface. The support vectors are marked with a circle.
Moreover, the kernel-centroid example clearly shows that one can sometimes exhibit links between the two approaches (in this case, with the Parzen density estimator). One of the most fundamental differences is perhaps that the use of the kernel trick imposes a number of constraints on the form of the kernel (which absolutely must be positive definite) and on the form of the solution (what $\phi$-space would correspond to a Gaussian kernel with a different width at each point?) in order to justify the applicability of the trick. None of the algorithms we propose and study in our research is based on the kernel trick, and this results, in our opinion, in increased flexibility.
In our discussion, we use the terms distance and metric interchangeably, in the loose sense.
Thus, [18] studied the influence of different types of SVM kernels on an image classification problem (the Corel database). Their research began after they observed that a simple KNN with a particular distance gave significantly better results than an SVM with a Gaussian kernel. Another very fine example of the development of a similarity measure that incorporates prior knowledge specific to the problem at hand (in this case character recognition) is the tangent distance [86, 87]. This metric is invariant with respect to small translations, rotations, and scalings of the characters, and was employed with great success in a KNN on the NIST database of handwritten digits.

Given the importance that the form of the metric or kernel has on the performance of these algorithms, several works have studied the possibility of learning them automatically. The task is then to learn the parameters of a parameterized kernel or of a parameterized metric (which can serve as the basis for defining a Gaussian kernel). One can essentially distinguish two approaches, according to whether one tries to learn a global metric, valid over the whole space, or to locally adapt the parameters of a distance or kernel in the neighborhood of a point. A classic example of a global metric that can be learned is the Mahalanobis distance:
$$d(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}$$
where $\Sigma$ is the covariance matrix of the data and constitutes the learned parameters. Note that this amounts to a Euclidean distance on preprocessed data: $\tilde{x} = \Sigma^{-1/2} x$. We find, in an even simpler form, the same principle in another common preprocessing that
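The equivalence between the Mahalanobis distance and a Euclidean distance on preprocessed ("whitened") data can be checked numerically; the sketch below assumes a diagonal covariance for simplicity (the general case uses $\Sigma^{-1/2}$):

```python
import math

# Mahalanobis distance for a diagonal covariance: per-coordinate
# differences scaled by the inverse variances.
def mahalanobis_diag(x, y, variances):
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

# Whitening: divide each coordinate by its standard deviation.
def whiten_diag(x, variances):
    return [a / math.sqrt(v) for a, v in zip(x, variances)]

x, y, var = [1.0, 2.0], [3.0, -1.0], [4.0, 0.25]
d1 = mahalanobis_diag(x, y, var)
d2 = math.dist(whiten_diag(x, var), whiten_diag(y, var))
print(d1, d2)  # the two values coincide
```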
is generally called data normalization, and which consists in dividing each coordinate by its standard deviation. The learning of this kind of global metric has notably been studied with the aim of improving the classification performance of KNN-type algorithms [6, 7, 63], just as the learning of kernel parameters has been explored with the aim of improving the performance of SVM-type algorithms [101, 3] or of other kernel algorithms [69]. Parametric forms more complex than Mahalanobis have also been studied. Let us mention in particular that metrics and kernels can be defined from generative models [109, 25, 51], or incorporated into neural network architectures so as to be trained discriminatively [15, 5, 96].

In [104], we had ourselves begun to explore the avenue of learning the parameters of a global kernel, by proposing the particular neural network architecture illustrated in Figure 6.1. It is an architecture trainable globally by gradient descent, allowing the simultaneous optimization of the kernel parameters and of the weights of the linear combination of support points. But we did not persevere along this research path, partly because of the complexity of the model, which made it difficult to train in reasonable time. We therefore content ourselves here with mentioning this work, inviting the interested reader to consult the article in question, but above all the work of [55] for recent developments along this other research avenue. Our research subsequently concentrated more on the use, within classical algorithms, of local metrics or of parameterized kernels learned locally in the neighborhood of a point of interest. It is this research,
closer to the precursor work of [38], that we present in the second and third articles of this thesis. Let us point out that this approach of local metrics, or kernels with local structure, fits well with the viewpoint of the classical non-parametric kernel methods, but that it is difficult to conceive, from this viewpoint, what the kernel trick would mean (what is the corresponding $\phi$-space when there is no single kernel?). This is why we regard the kernel trick above all as an elegant trick, but one that constitutes an intrinsic limitation of SVM-style algorithms. The algorithms that we will present here make no use of it whatsoever.

Furthermore, we wish to draw the reader's attention to the important theoretical result of [110]. It is demonstrated there that any smooth curve separating two sets of points can correspond to a maximal-margin separating hyperplane in some $\phi$-space induced by a kernel satisfying the Mercer conditions. In other words, any smooth separating decision surface, however absurd, can be declared maximal-margin: it suffices to choose the right kernel! This is a striking result, for it greatly relativizes the importance of margin maximization when the kernel trick is used, and shows instead the crucial role of the choice of kernel.
FIG. 6.1 – Neural network architecture for learning the parameters of a global kernel together with the weights of the linear combination. (Figure labels: output f(x); support points; input vector x; true class y of x.)
(i) The theoretical appeal. Unlike neural networks, whose theoretical analysis runs up against their complexity, SVMs lend themselves more readily to theoretical analysis.
(ii) The practical appeal. SVMs have acquired a reputation as a turnkey solution, being far less delicate to train than neural networks for example, and often delivering very good performance, notably on high-dimensional problems. This popularity has given rise to a great deal of research in recent years: theoretical justifications, extensions, improvements, and practical applications of SVMs, not to mention the rehabilitation of many old algorithms through the simple application of the kernel trick.

Our own research was motivated by the primary objective of developing new generic learning algorithms (i.e. using no prior knowledge of the problem) capable of surpassing the performance of SVMs in practice. In doing so, we also wanted to try to better pin down the characteristics that make a generic algorithm perform well in high dimension. Our approach in designing these algorithms was essentially empirical, guided by intuitions more or less inspired by certain characteristics of SVMs, but without directly taking SVMs as a starting point, and while avoiding resort to the kernel trick. Rather than leading to the discovery of radically new algorithms, this path led us to rediscover old algorithms in a new light. This notably allowed us to propose extensions making them competitive with SVMs, and to increase our understanding of the problem of high dimensionality.
of kernel algorithms with local structure, they were not included in the present thesis. The included articles reproduce almost identically the original text of the publications. We have occasionally added a few clarifications in the form of footnotes, or corrected a typo. The formatting has, moreover, been modified to conform to the thesis presentation standards of the Université de Montréal. The agreements of the coauthors and publishers to include these texts in this thesis are attached in the appendix. In the pages preceding each article, we attempt to highlight the context, the questions, the objectives, and the intellectual path that motivated this research. We also summarize there the article's contributions to the field.
USPS is in dimension 256 (16 × 16 pixel images) and MNIST in dimension 784 (28 × 28 pixel images).

The popularity of SVMs in practice is historically due in part to their good performance on USPS and MNIST, as reported in [82, 58]. Among generic algorithms (i.e. incorporating no prior knowledge
of the problem), kernel SVMs long remained the one giving the best performance on MNIST. Yet we know that it is possible to do better than SVMs on MNIST (LeNet [57, 12, 58] beats SVMs by a wide margin, but it is not a generic algorithm like SVMs). Our ambition was therefore to do better than SVMs, or at least as well, with new generic algorithms. Another important point: these databases are large enough for the measured differences between algorithms to be statistically significant.
of SVMs were the object of studies or were supported by theoretical considerations likely to justify their good performance:

The notion of margin maximization (as defined in section 5.3.1): [102] gives bounds on the generalization error tied to the width of the margin.

The sparsity of the solution, i.e. the fact that there is a small number of support vectors. Indeed, results relating the expected generalization error to the sparsity of the solution exist for SVMs [102, 103], as well as for other similar models [61, 34, 45].

The form of the SVM solution, namely a linear combination of kernels centered on a subset of the training points. Indeed, a theorem (the representer theorem [54]) shows that this is precisely the form of the optimal function that minimizes a particularly interesting regularized cost, one whose regularization term corresponds to the norm of an RKHS (reproducing kernel Hilbert space). In SVMs, this form results from the use of the kernel trick.
The number of support vectors found depends on the problem, on the shape and width of the kernel, and on the parameter $C$ of the soft-margin SVM, but it is in practice difficult to control. Techniques aiming to reduce the number of support vectors a posteriori have been developed to improve response times when their number is initially high [16]. These techniques were developed above all to accelerate the algorithms in their phase of use, on test data. But there is also an obvious direct link between the number of centers used and the size of the set of functions the model can represent, and hence its capacity. It is in this light that we began to explore algorithms allowing a strict control of the number of centers, with a view to controlling capacity and thereby hoping to improve generalization error. Let us point out that choosing the number of centers is commonly used to control the capacity of algorithms such as RBF networks. But contrary to RBFs, our studies concentrate on solutions in which the centers are, just as for SVMs, a subset of the training points, so as to prevent the number of free parameters of the model, and hence its capacity, from growing with the dimensionality of the input space.

Ignoring margin considerations at first, we therefore sought to develop an algorithm that would retain the form of the SVM solution, while allowing a strict control of the number of support vectors. We also wanted to avoid explicit resort to the kernel trick. This is how we designed a simple constructive greedy algorithm, which adds the support points one by one, used with an early-stopping technique.
Finally, it turned out that one of the variants of our algorithm was almost identical to the Orthogonal Least Squares RBF algorithm [21].
even Gaussian Process regression [107] suffered, at the time our article was published, from a problem of algorithmic complexity far more severe than that of SVMs, which made it a theoretically seductive algorithm, but one rarely applicable in practice. Interestingly, the most recent developments [56, 83] adopt, just like KMP, a greedy strategy to greatly accelerate this family of algorithms.
9.1 Introduction
Recently, there has been a renewed interest for kernel-based methods, due in great part to the success of the Support Vector Machine approach [11, 102]. Kernel-based learning algorithms represent the function to be learned as a linear combination of terms of the form $K(x, x_i)$, where $x_i$ is the input vector associated to one of the training examples, and where $K$ is a symmetric positive definite kernel function.
Support Vector Machines (SVMs) are kernel-based learning algorithms in which only a fraction of the training examples are used in the solution (these are called the Support Vectors), and where the objective of learning is to maximize a margin around the decision surface (in the case of classification).

Matching Pursuit was originally introduced in the signal-processing community as an algorithm that decomposes any signal into a linear expansion of waveforms selected from a redundant dictionary of functions [64]. It is a general, greedy, sparse function approximation scheme with the squared error loss, which iteratively adds new functions (i.e. basis functions) to the linear expansion. If we use a dictionary of functions of the form $K(\cdot, x_i)$, where $x_i$ is the input part of a training example, then the linear expansion has essentially the same form as a Support Vector Machine. Matching Pursuit and its variants were developed primarily in the signal-processing and wavelets community, but there are many interesting links with the research on kernel-based learning algorithms developed in the machine learning community. Connections between a related algorithm (basis pursuit [20]) and SVMs had already been reported in [73]. More recently, [89] shows connections between Matching Pursuit, Kernel-PCA, Sparse
Kernel Feature Analysis, and how this kind of greedy algorithm can be used to compress the design matrix in SVMs to allow the handling of huge data sets. Another recent work, very much related to ours, that also uses a Matching-Pursuit-like algorithm is [90].

Sparsity of representation is an important issue, both for the computational efficiency of the resulting representation and for its influence on generalization performance (see [45] and [34]). However, the sparsity of the solutions found by the SVM algorithm is hardly controllable, and often these solutions are not very sparse. Our research started as a search for a flexible alternative framework that would allow us to directly control the sparsity (in terms of the number of support vectors) of the solution, and to remove the requirement of positive definiteness of the kernel. It led us to uncover connections between greedy Matching Pursuit algorithms, Radial Basis Function training procedures, and boosting algorithms (section 9.4). We will discuss these together with a description of the proposed algorithm and extensions thereof to use margin loss functions.

We first (section 9.2) give an overview of the Matching Pursuit family of algorithms (the basic version and two refinements thereof), as a general framework, taking a machine learning viewpoint. We also give a detailed description of our particular implementation, which yields a choice of the next basis function to add to the expansion by minimizing simultaneously across the expansion weights and the choice of the basis function, in a computationally efficient manner.
We then show (section 9.3) how this framework can be extended to allow the use of differentiable loss functions other than the squared error, to which the original algorithms are limited. This might be more appropriate for some classification problems (although, in our experiments, we have used the squared loss for many classification problems, always with successful results). This is followed by a discussion of margin loss functions, underlining their similarity with more traditional loss functions that are commonly used for neural networks. In section 9.4 we explain how the matching pursuit family of algorithms can be used to build kernel-based solutions to machine learning problems, and how this relates to other machine learning algorithms, namely SVMs, boosting algorithms, and Radial Basis Function training procedures. Finally, in section 9.5, we provide an experimental comparison between SVMs and different variants of Matching Pursuit, performed on artificial data, USPS digit classification, and UCI machine learning database benchmarks. The main experimental result is that Kernel Matching Pursuit algorithms can yield generalization performance as good as Support Vector Machines, while often using significantly fewer support vectors.
We are given $l$ noisy observations $\{y_1, \ldots, y_l\}$ of a target function $f$ at points $\{x_1, \ldots, x_l\}$, i.e. $y_i = f(x_i) + \epsilon_i$, together with a finite dictionary $\mathcal{D} = \{d_1, \ldots, d_M\}$ of functions. We consider approximations of $f$ of the form
$$\hat{f}_N = \sum_{n=1}^{N} \alpha_n g_n \tag{9.1}$$
where $N$ is the number of basis functions in the expansion, $\{g_1, \ldots, g_N\} \subset \mathcal{D}$ shall be called the basis of the expansion, and $(\alpha_1, \ldots, \alpha_N) \in \mathbb{R}^N$ is the set of corresponding coefficients of the expansion. The dictionary functions are ordered as they appear in the dictionary, and there is a correspondence between the particular dictionary functions $g_n$ and dictionary indices $\gamma_n$, such that $g_n = d_{\gamma_n}$; choosing the basis thus amounts to choosing a set $\{\gamma_1, \ldots, \gamma_N\}$ of indices.
For any function $g$, we will use $\bar{g}$ to represent the $l$-dimensional vector of its values on the training points: $\bar{g} = \left( g(x_1), \ldots, g(x_l) \right)$. Similarly, $\bar{y} = (y_1, \ldots, y_l)$, and $\bar{R}_N = \bar{y} - \bar{\hat{f}}_N$ is the residue. $\langle \cdot, \cdot \rangle$ will be used to represent the usual dot product between two vectors, and $\| \cdot \|$ the corresponding norm.

The algorithms described below use the dictionary functions as actual functions only when applying the learned approximation on new test data. During training, only their values at the training points are relevant, so that they can be understood as working entirely in an $l$-dimensional vector space. The basis and the corresponding coefficients are to be chosen such that they minimize the squared norm of the residue:
$$\left\| \bar{R}_N \right\|^2 = \left\| \bar{y} - \sum_{n=1}^{N} \alpha_n \bar{g}_n \right\|^2.$$
This corresponds to reducing the usual squared reconstruction error. Later we will see how to extend these algorithms to other kinds of loss functions (section 9.3), but for now, we shall consider only least-squares approximations. In the general case, when it is not possible to use particular properties of the family of functions that constitute the dictionary, finding the optimal basis for a given number
$N$ of functions would require an exhaustive search over all $\binom{M}{N}$ possibilities. As it would be computationally prohibitive to try all these combinations, the matching pursuit algorithm proceeds in a greedy, constructive fashion: starting at stage 0 with $\hat{f}_0 = 0$ and an empty basis, at each stage $n$ it tries to reduce the norm of the residue $\bar{R}_n$ by searching for the best function to append to the expansion:
$$\hat{f}_{n+1} = \hat{f}_n + \alpha_{n+1} g_{n+1} \tag{9.2}$$
where
$$\left( g_{n+1}, \alpha_{n+1} \right) = \mathop{\arg\min}_{g \in \mathcal{D},\ \alpha \in \mathbb{R}} \left\| \bar{R}_n - \alpha \bar{g} \right\|^2. \tag{9.3}$$
For any $g \in \mathcal{D}$, the $\alpha$ that minimizes $\left\| \bar{R}_n - \alpha \bar{g} \right\|^2$ is given by
$$\alpha = \frac{\langle \bar{R}_n, \bar{g} \rangle}{\| \bar{g} \|^2}. \tag{9.4}$$
Substituting this value of $\alpha$ into the expression to be minimized, we find that the chosen function is
$$g_{n+1} = \mathop{\arg\max}_{g \in \mathcal{D}} \frac{\left| \langle \bar{R}_n, \bar{g} \rangle \right|}{\| \bar{g} \|}, \tag{9.5}$$
i.e. the function of the dictionary whose corresponding vector is most collinear with the current residue. In summary, the algorithm appends at each step the dictionary function that maximizes $\left| \langle \bar{R}_n, \bar{g} \rangle \right| / \| \bar{g} \|$, with the coefficient given by (9.4). Since we search through all the dictionary values to find the best one, we merely have to choose an appropriate criterion to decide when to stop adding new functions to the expansion. In the signal-processing literature, the algorithm is usually stopped when the reconstruction error goes below a predefined threshold; we shall rather use the error estimated on an independent validation set to decide when to stop. In any case, the number $N$ of functions in the expansion (as determined by the early-stopping criterion) can be seen as the primary capacity-control parameter of the algorithm.
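The greedy loop described above (pick the dictionary vector most collinear with the current residue, set its coefficient by projection, update the residue) can be sketched as follows; this is an illustration on raw vectors, not the thesis's implementation:

```python
# Basic matching pursuit on the vectors of function values at the
# training points.
def matching_pursuit(y, dictionary, n_steps):
    # y: list of targets; dictionary: list of candidate vectors
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    residue = list(y)
    expansion = []  # list of (dictionary index, coefficient)
    for _ in range(n_steps):
        # choose g maximizing |<R, g>| / ||g||
        idx = max(range(len(dictionary)),
                  key=lambda j: abs(dot(residue, dictionary[j]))
                                / dot(dictionary[j], dictionary[j]) ** 0.5)
        g = dictionary[idx]
        alpha = dot(residue, g) / dot(g, g)  # optimal coefficient (projection)
        residue = [r - alpha * gi for r, gi in zip(residue, g)]
        expansion.append((idx, alpha))
    return expansion, residue

y = [1.0, 2.0, 3.0]
D = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
expansion, residue = matching_pursuit(y, D, 2)
print(expansion, residue)
```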
The pseudo-code for the corresponding algorithm is given in figure 9.1. There are slight differences in the notation: in particular, a single vector holds the current residue and is updated in place, as we do not need to store all the intermediate residues; and we only work with vectors and matrices, seen as one- and two-dimensional arrays, so there is no possible ambiguity with the corresponding functions.
In the basic version of the algorithm, the coefficients of the previously chosen basis functions are never revisited. They can instead be corrected at every step, in a step often called back-fitting or back-projection, and the resulting algorithm is known as Orthogonal Matching Pursuit (OMP) [72, 27]: after appending $g_{n+1}$, recompute the whole set of coefficients to be optimal for the current basis:
$$\left( \alpha_1, \ldots, \alpha_{n+1} \right) = \mathop{\arg\min}_{(\alpha_1, \ldots, \alpha_{n+1})} \left\| \bar{y} - \sum_{k=1}^{n+1} \alpha_k \bar{g}_k \right\|^2 \tag{9.6}$$
Note that this is just like a linear regression with parameters $(\alpha_1, \ldots, \alpha_{n+1})$. This back-projection step also has a geometrical interpretation: let $\mathcal{B}_n$ be the sub-space of $\mathbb{R}^l$ spanned by the basis vectors $(\bar{g}_1, \ldots, \bar{g}_n)$, and let $\mathcal{B}_n^{\perp}$ be its orthogonal complement. Any vector can be decomposed as the sum of its projections on these two sub-spaces, and the residue obtained with optimal coefficients is precisely the projection of $\bar{y}$ on $\mathcal{B}_n^{\perp}$.
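The back-fitting step amounts to a small least-squares problem at each iteration. A dependency-free sketch via the normal equations (illustrative only; the thesis's implementation instead exploits orthogonality for efficiency):

```python
# Re-optimize ALL expansion coefficients by least squares on the chosen
# basis: solve (G^T G) a = G^T y, where the columns of G are the basis
# vectors, by Gaussian elimination with partial pivoting.
def lstsq_coeffs(basis, y):
    n = len(basis)
    A = [[sum(gi * gj for gi, gj in zip(basis[i], basis[j])) for j in range(n)]
         for i in range(n)]
    b = [sum(gi * yi for gi, yi in zip(basis[i], y)) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        a[r] = (b[r] - sum(A[r][c] * a[c] for c in range(r + 1, n))) / A[r][r]
    return a

basis = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1.0, 2.0, 3.0]
print(lstsq_coeffs(basis, y))
```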
FIG. 9.1 – Pseudo-code for the basic matching pursuit algorithm.

Appending a new function to the expansion always contributes to reducing the norm of the residue. But it can also make the previously chosen coefficients sub-optimal for the enlarged basis, which increases the norm of the residue relative to what that basis could achieve; correcting the previous coefficients of the expansion is what the back-projection does. We want the residue
to be as small as possible given the basis at every step; this is what (9.6) insures. But if we first append the $g_{n+1}$ found by (9.3) to the expansion, and only then optimize (9.6), we might be picking a dictionary function other than the one that would give the best fit. Instead, it is possible to directly optimize, jointly over the choice of the next function and all the coefficients,
$$\left( g_{n+1}, \alpha_1, \ldots, \alpha_{n+1} \right) = \mathop{\arg\min}_{g_{n+1} \in \mathcal{D},\ (\alpha_1, \ldots, \alpha_{n+1})} \left\| \bar{y} - \sum_{k=1}^{n+1} \alpha_k \bar{g}_k \right\|^2 \tag{9.7}$$
We shall call this procedure pre-fitting, to distinguish it from the former back-fitting (as back-fitting is done only after the choice of $g_{n+1}$).

Pre-fitting can be achieved almost as efficiently as back-fitting. Our implementation (in which we used a slightly modified version of this approach) maintains a representation of both the target and all the dictionary vectors as a decomposition over the sub-space $\mathcal{B}_n$ spanned by the current basis and its orthogonal complement $\mathcal{B}_n^{\perp}$, to facilitate the computations: we choose as the next basis function the dictionary element whose component lying in $\mathcal{B}_n^{\perp}$ (an $l$-dimensional vector) is most collinear with the corresponding component of the residue. We also maintain the same representation for the target $\bar{y}$, namely its decomposition over these two sub-spaces. This procedure requires, at every step, only two passes through the dictionary (searching for the best function, then updating the representation) where basic matching pursuit requires one.
[Figure 9.3: pseudo-code of the pre-fitting matching pursuit algorithm.
INPUT: data set, dictionary of functions, number of basis functions desired in the expansion (or, alternatively, a validation set to decide when to stop).
The matrix of basis coefficients is initially empty, and gets appended an additional row at each step (thus, during the first iteration, ignore the expressions that involve it).
FOR each step (or until performance on the validation set stops improving): choose the next basis function and its coefficients; then update the dictionary representation to take into account the new basis function.
RESULT: the final expansion.]
back-fitting version: we first choose the next basis function with the previous coefficients kept fixed (equation 9.3); then we find the optimal set of coefficients for the new basis (equation 9.6).
pre-fitting version: we find at the same time the optimal set of coefficients and the next basis function (equation 9.7).
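The three variants above can be summarized in a short sketch. This is our own illustrative NumPy implementation, not the thesis code; function and variable names are assumptions, and the pre-fitting variant is written naively (without the orthogonal-decomposition speed-up described in the text):

```python
import numpy as np

def matching_pursuit(y, D, n_steps, variant="prefit"):
    """Greedy matching pursuit on a finite dictionary.

    y : (l,) target vector; D : (l, M) matrix whose columns are the
    dictionary functions evaluated at the l training points.
    variant: "basic"   -> coefficients are never revisited,
             "backfit" -> after each greedy choice, re-fit all coefficients,
             "prefit"  -> choose the next column by the fit obtained after
                          jointly re-optimizing all coefficients.
    Returns (chosen column indices, coefficients).
    """
    chosen, residue = [], y.copy()
    for _ in range(n_steps):
        if variant == "prefit":
            # pick the column giving the smallest residue after a joint re-fit
            best, best_err = None, np.inf
            for j in range(D.shape[1]):
                if j in chosen:
                    continue
                A = D[:, chosen + [j]]
                a, *_ = np.linalg.lstsq(A, y, rcond=None)
                err = np.sum((y - A @ a) ** 2)
                if err < best_err:
                    best, best_err, best_a = j, err, a
            chosen.append(best)
            alpha = best_a
        else:
            # greedy choice: column most collinear with the current residue
            corr = D.T @ residue / np.linalg.norm(D, axis=0)
            j = int(np.argmax(np.abs(corr)))
            chosen.append(j)
            if variant == "backfit":
                # re-optimize all coefficients given the enlarged basis (9.6)
                alpha, *_ = np.linalg.lstsq(D[:, chosen], y, rcond=None)
            else:
                # basic: only the new coefficient is set, previous ones kept
                g = D[:, j]
                step = (residue @ g) / (g @ g)
                alpha = np.append(alpha, step) if chosen[:-1] else np.array([step])
        residue = y - D[:, chosen] @ alpha
    return chosen, alpha
```

Note how the basic variant may cycle back to an already chosen column (adding a further increment to its weight), which is exactly the behaviour observed in the 2D experiments of section 9.5.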
When making use of orthogonality properties for efficient implementations of the back-fitting and pre-fitting versions (as in our previously described implementation of the pre-fitting algorithm), all three algorithms have a computational complexity of the same order as basic matching pursuit, essentially linear in the number of steps, the size of the dictionary and the number of training points, rather than the usual much higher cost of naively re-solving the full least-squares problem at every step.
The extension to an arbitrary differentiable loss function L follows the same greedy principle. At every step, we choose the next basis function as the one most collinear with the direction of steepest descent of the cost, i.e. with the negative gradient of the total loss with respect to the values of the current expansion f_n at the training points:

    g_{n+1} = argmax_{g ∈ D} | ⟨ −∇C(f_n), g ⟩ |,  where ∇C(f_n) = ( ∂L(f_n(x_i), y_i) / ∂f_n(x_i) )_{i=1…l},   (9.8)

its coefficient is then found by a line minimization of the cost along that direction:

    α_{n+1} = argmin_{α} Σ_{i=1}^{l} L( f_n(x_i) + α g_{n+1}(x_i), y_i ),   (9.9)

and the expansion is updated as

    f_{n+1} = f_n + α_{n+1} g_{n+1}.   (9.10)

This would correspond to basic matching pursuit (notice how the original squared-error criterion appears as the particular case L(f(x), y) = (y − f(x))²). Similarly, instead of the back-projection, we can periodically re-optimize all the coefficients of the current expansion (with a conjugate gradient optimizer for instance) to minimize the target cost:

    (α_1, …, α_n) = argmin_{α} Σ_{i=1}^{l} L( Σ_{k=1}^{n} α_k g_k(x_i), y_i ).   (9.11)

As this can be quite time-consuming (we cannot use any orthogonality property in this general case), it may be desirable to do it every few steps instead of every single step. The corresponding algorithm is described in more detail in the pseudo-code of figure 9.4 (as previously, there are slight differences in the notation, in particular the current approximation vector is updated at every step, instead of only when a basis function is appended).
Finally, let us mention that it should in theory also be possible to do pre-fitting with an arbitrary loss function, but the general case (when we cannot use any orthogonal decomposition) would involve solving equation 9.11 in turn for each dictionary function in order to choose the next one to append to the basis, which is computationally prohibitive.
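As an illustration of the gradient-based greedy choice of equations (9.8)–(9.9), here is a hedged sketch (our own simplified code: the gradient is estimated by finite differences and the line minimization by a crude grid search, standing in for the conjugate gradient optimizer mentioned above):

```python
import numpy as np

def greedy_step(y, f, D, loss, alphas=np.linspace(-3, 3, 601)):
    """One step of matching pursuit with an arbitrary loss function.

    y, f : targets and current expansion values at the l training points.
    D    : (l, M) matrix of dictionary functions evaluated at those points.
    loss(f, y) -> array of per-point losses.
    Returns the index of the chosen dictionary column and its coefficient.
    """
    eps = 1e-6
    # finite-difference estimate of dC/df_i, the gradient of the total cost
    grad = (loss(f + eps, y) - loss(f - eps, y)) / (2 * eps)
    # pick the column most collinear with the steepest-descent direction
    scores = np.abs(D.T @ (-grad)) / np.linalg.norm(D, axis=0)
    j = int(np.argmax(scores))
    # crude grid line-minimization of the cost along the chosen direction
    costs = [loss(f + a * D[:, j], y).sum() for a in alphas]
    return j, float(alphas[int(np.argmin(costs))])
```

With the squared loss L(f, y) = (y − f)², this reduces to the usual matching pursuit step, as the text points out.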
9.3.2 Margin loss functions versus traditional loss functions for classification
Now that we have seen how the matching pursuit family of algorithms can be extended to use arbitrary loss functions, let us discuss the merits of various loss functions. In particular, the relationship between loss functions and the notion of margin is of primary interest here, as we wanted to build an alternative to SVMs⁴. While the original notion of margin in classification problems comes from the geometrically inspired hard-margin of linear SVMs (the smallest Euclidean distance between the decision surface and the training points), a slightly different perspective has emerged in the boosting community along with the notion of margin loss function⁵: for a classification problem with targets y ∈ {−1, +1}, the margin of a training example is the quantity m = y·f(x), and a margin loss function is a function of this single quantity.

4. whose good generalization abilities are believed to be due to margin-maximization.
5. Also called the functional margin. Strictly speaking, in boosting the margin has to be normalized by the norm of the vector of expansion weights, and as it uses a different norm its geometry differs from the margin of SVMs. But the purpose of our discussion is qualitative, so let us ignore these specific details for now.
[Figure 9.4: pseudo-code of back-fitting matching pursuit with a non-squared loss.
INPUT: data set, dictionary of functions, number of basis functions desired in the expansion (or, alternatively, a validation set to decide when to stop), how often to do a full back-fitting (every `update` steps), a loss function.
INITIALIZE: the current approximation and the dictionary matrix. Notice that here the current approximation vector changes at every step n, so that it corresponds to f_n in the accompanying text.
FOR each step: choose the next basis function and update the expansion; if the step number is a multiple of `update`, do a full back-fitting (for ex. with gradient descent) and recompute the current approximation.
RESULT: the final expansion.]
It is possible to formulate SVM training such as to show the SVM margin loss function. Soft-margin SVM training minimizes ‖w‖² + C Σ_i ξ_i subject to the constraints y_i f(x_i) ≥ 1 − ξ_i and ξ_i ≥ 0, where C is the box-constraint parameter of SVMs, trading off margin with training errors. The two constraints for each training example can be collapsed into ξ_i = [1 − y_i f(x_i)]₊, where [z]₊ = z when z > 0 and 0 otherwise. Substituting this back yields an alternative formulation of the SVM problem, where there are no more explicit constraints (they are implicit in the criterion optimized): Minimize

    C Σ_i [1 − y_i f(x_i)]₊ + ‖w‖²,   (9.13)

i.e. the sum of a margin loss function and a regularization term. It is interesting to compare this margin loss function to those used in boosting algorithms and to the more traditional cost functions. The margin loss functions used in boosting include exp(−m) (AdaBoost), log(1 + exp(−m)) (LogitBoost) and 1 − tanh(m) (Doom II), while the SVM margin loss above is the hinge [1 − m]₊.
As can be seen in Figure 9.5 (left), all these functions encourage large positive margins, and differ mainly in how they penalize large negative ones. In particular, the exponential loss of AdaBoost penalizes large negative margins very strongly, while the 1 − tanh(m) loss used in Doom II saturates, making it less sensitive to outliers.

It is enlightening to compare these with the more traditional loss functions that have been used for neural networks in classification tasks (with targets y ∈ {−1, +1}): the squared loss (f(x) − y)², which as a function of the margin m = y·f(x) can be written (1 − m)², and the squared loss after a tanh output non-linearity with targets of ±0.65, (tanh(f(x)) − 0.65 y)², which likewise becomes (tanh(m) − 0.65)².
Both are illustrated on figure 9.5 (right). Notice how the squared loss after tanh appears similar to the margin loss function used in Doom II, except that it slightly increases for large positive margins, which is why it behaves well and does not saturate even with unconstrained weights (boosting algorithms impose further constraints on the weights, here denoted α's).

Figure 9.5: Boosting and SVM margin loss functions (left) vs. traditional loss functions (right), viewed as functions of the margin m = y·f(x). Left panel: exp(−m) (AdaBoost), log(1+exp(−m)) (LogitBoost), 1−tanh(m) (Doom II) and (1−m)₊ (SVM). Right panel: squared error as a margin cost function, and squared error after tanh with 0.65 target. Interestingly, the last-born of the margin-motivated loss functions (used in Doom II) is similar to the traditional squared error after tanh.

6. The target value 0.65 was advocated by [59] for neural networks, to avoid saturating the output units while taking advantage of the non-linearity for improving the discrimination of neural networks.
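The curves of figure 9.5 are easy to reproduce numerically. The following snippet (our own illustrative code, with dictionary keys of our choosing) also checks the margin identities used above:

```python
import numpy as np

# Margin loss functions of figure 9.5, as functions of the margin m = y*f(x).
margin_losses = {
    "adaboost":     lambda m: np.exp(-m),                 # exp(-m)
    "logitboost":   lambda m: np.log1p(np.exp(-m)),       # log(1+exp(-m))
    "doom2":        lambda m: 1 - np.tanh(m),             # 1-tanh(m)
    "svm_hinge":    lambda m: np.maximum(0.0, 1 - m),     # (1-m)+
    "squared":      lambda m: (1 - m) ** 2,               # squared error as margin cost
    "squared_tanh": lambda m: (np.tanh(m) - 0.65) ** 2,   # after tanh, 0.65 target
}

# Sanity check of the identities used in the text: for y in {-1, +1},
# (f-y)^2 == (1-m)^2 and (tanh(f)-0.65*y)^2 == (tanh(m)-0.65)^2 with m = y*f.
f, y = 0.3, -1.0
m = y * f
assert abs((f - y) ** 2 - (1 - m) ** 2) < 1e-12
assert abs((np.tanh(f) - 0.65 * y) ** 2 - (np.tanh(m) - 0.65) ** 2) < 1e-12
```

These identities hold because y² = 1 and tanh is odd, so y·tanh(f) = tanh(y·f).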
A constant function can also be included in the dictionary, which accounts for a bias term b: the functional form of the approximation then becomes

    f(x) = b + Σ_k α_k K(x, x_k),   (9.14)

where the x_k are support points chosen among the training data. In practice the algorithm only has to consider the values of the dictionary functions at the training points, so that it amounts to doing Matching Pursuit in a vector-space whose dimension is the number of training points. When using a squared error loss⁷, the complexity of all three variations of KMP is essentially that of the corresponding matching pursuit algorithm, using all the training data as candidate support points. But it is also possible to use a random subset of the training points as support candidates (which yields a smaller dictionary).

7. The algorithms generalized to arbitrary loss functions can be much more computationally expensive.
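A minimal sketch of least-squares KMP with the functional form (9.14) follows (our own illustrative NumPy code, using back-fitting-style re-estimation of the coefficients; the Gaussian kernel and the width `sigma` are just one possible dictionary choice):

```python
import numpy as np

def kmp_fit(X, y, n_support, sigma=1.0):
    """Least-squares Kernel Matching Pursuit sketch (equation 9.14).

    The dictionary contains a constant function (the bias term b) plus one
    Gaussian kernel K(., x_i) centered on each training point; support
    centers are selected greedily, with all coefficients re-fit each step.
    Returns the chosen dictionary column indices and their weights.
    """
    l = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))            # (l, l) kernel matrix
    D = np.hstack([np.ones((l, 1)), K])           # column 0 is the bias term
    chosen = [0]                                  # always keep the bias
    for _ in range(n_support):
        w, *_ = np.linalg.lstsq(D[:, chosen], y, rcond=None)
        residue = y - D[:, chosen] @ w
        scores = np.abs(D.T @ residue) / np.linalg.norm(D, axis=0)
        scores[chosen] = -np.inf                  # do not re-pick a center
        chosen.append(int(np.argmax(scores)))
    w, *_ = np.linalg.lstsq(D[:, chosen], y, rcond=None)
    return chosen, w
```

Early stopping on a validation set, as used in the experiments, would simply truncate the greedy loop when validation error stops improving.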
We would also like to emphasize the fact that the use of a dictionary gives a lot of additional flexibility to this framework, as it is possible to include any kind of function into it, in particular:

There is no restriction on the shape of the kernel (no positive-definiteness constraint, could be asymmetrical, etc.).

The dictionary could include more than a single fixed kernel shape: it could mix different kernel types to choose from at each point, allowing for instance the algorithm to choose among several widths of a Gaussian for each support point (a similar extension has been proposed for SVMs by [106]). Similarly, the dictionary could easily be used to constrain the algorithm to use a kernel shape specific to each class, based on prior-knowledge.

The dictionary can incorporate non-kernel based functions (we already mentioned the constant function to recover the bias term b, but this could also be used to incorporate prior knowledge). In this latter aspect, the work of [91] on semi-parametric SVMs offers an interesting advance in that direction, within the SVM framework. This remains a very interesting avenue for further research.

For huge data sets, a reduced subset can be used as the dictionary to speed up the training.

However in this study, we restrict ourselves to using a single fixed kernel, so that the resulting functional form is the same as the one obtained with standard SVMs.
However the quantity optimized by the SVM algorithm is quite different from the KMP greedy optimization, especially when using a squared error loss. Consequently the support vectors and coefficients found by the two types of algorithms are usually different (see our experimental results in section 9.5). Another important difference, and one that was a motivation for this research, is that in KMP, capacity control is achieved by directly controlling the sparsity of the solution, i.e. the number of support vectors, whereas the capacity of SVMs is controlled through the margin, which has an indirect and hardly controllable influence on sparsity. See [45] for a discussion on the merits of sparsity and margin, and ways to combine them.
RBF and SVMs, although their resulting functional forms are very much alike. Such an empirical comparison is one of the contributions of this paper. Basically our results (section 9.5) show OLS-RBF (i.e. squared-error KMP) to perform as well as Gaussian SVMs, while allowing a tighter control of the number of support vectors used in the solution.
times believed to be a superior approach⁸ because, contrary to Matching Pursuit, which is a greedy algorithm, Basis Pursuit uses Linear Programming techniques to find the exact solution to the following problem:
    min_{α} ‖ y − Σ_k α_k g_k ‖² + λ ‖α‖₁,   (9.15)

where ‖α‖₁ = Σ_k |α_k| is the l₁ norm of the coefficient vector. The added l₁ penalty favors solutions with few non-zero coefficients, and thus leads to a sparse solution, whose sparsity can be controlled by appropriately tuning the hyper-parameter λ. However we would like to point out that, as far as the primary goal is good sparsity, i.e. using the smallest number of basis functions in the expansion, both algorithms are approximate: Matching Pursuit is greedy, while Basis Pursuit finds an exact solution, but to an approximate problem (the exact problem would directly count the number of non-zero coefficients, i.e. use an l₀ penalty rather than the l₁ norm).

In addition Matching Pursuit had a number of advantages over Basis Pursuit in our particular setting:

It is very simple and computationally efficient, while Basis Pursuit requires the use of sophisticated Linear Programming techniques to tackle large problems.

8. It is possible to find artificial pathological cases where Matching Pursuit breaks down, but this doesn't seem to be a problem for real-world problems, especially when using the back-fitting or pre-fitting improvements of the original algorithm.
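As an aside, the convex problem (9.15) need not be attacked with heavy Linear Programming machinery in small cases; a proximal-gradient (iterative soft-thresholding) sketch suffices to see the l₁-induced sparsity at work. This is our own illustrative code, not part of the thesis:

```python
import numpy as np

def basis_pursuit_ista(y, D, lam, n_iter=5000):
    """Solve min_a ||y - D a||^2 + lam * ||a||_1 by soft-thresholding.

    Iterative shrinkage-thresholding (ISTA): gradient step on the smooth
    part, then the proximal operator of the l1 penalty. The l1 term drives
    many coefficients exactly to zero; the amount of sparsity is controlled
    by the hyper-parameter lam, as described in the text.
    """
    L = np.linalg.norm(D, 2) ** 2          # spectral norm squared: step-size scale
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2 * D.T @ (D @ a - y)       # gradient of ||y - D a||^2
        z = a - grad / (2 * L)             # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * L), 0.0)  # shrink
    return a
```

With an orthonormal dictionary the solution is simply the soft-thresholded least-squares fit, which makes the sparsity mechanism explicit.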
It is constructive, adding the basis functions one by one to the expansion, which allows us to use a simple early-stopping procedure to control optimal sparsity. In contrast, a Basis Pursuit approach implies having to tune the hyper-parameter λ, running the optimization several times with different values to find the best possible choice. But other than that, we might as well have used Basis Pursuit and would probably have achieved very similar experimental results.

We should also mention the works of [73] which draws an interesting parallel between Basis Pursuit and SVMs, as well as [46] who use Basis Pursuit with ANOVA kernels to obtain sparse models with improved interpretability.
algorithms for finding sparse kernel-based solutions to classification problems. However there are major differences between the two approaches:
Kernel Matching Pursuit does not use the kernel trick to implicitly work in a higher-dimensional mapped feature-space: it works directly in input-space. Thus it is possible to use specific kernels that don't necessarily satisfy Mercer's conditions, as a kind of similarity measure between input patterns, that could be engineered to include prior knowledge, or even learned, as it is not always easy, nor desirable, to enforce positive-definiteness in this perspective.

The perceptron algorithm is initially a classification algorithm, while Matching Pursuit is originally more of a regression algorithm (approximation in the least-squares sense), although the proposed extension to non-squared loss and the discussion on margin-loss functions (see section 9.3) further blurs this distinction. The main reason why we use this algorithm for binary classification tasks rather than regression, although the latter would seem more natural, is that our primary purpose was to compare its performance to classification-SVMs⁹.

Similar to SVMs, the solution found by the Kernel Perceptron algorithm depends only on the retained support vectors, while the coefficients learned by Kernel Matching Pursuit depend on all training data, not only on the set of support vectors chosen by the algorithm. This implies that current theoretical results on generalization bounds that are derived for sparse SVM or Perceptron solutions [102, 103, 61, 34, 45] cannot be readily applied to KMP. On the other hand, KMP solutions may require fewer support vectors than Kernel Perceptron for precisely this same reason: the information on all data points is used, without the need that they appear as support vectors in the solution.

9. A comparison with regression-SVMs should also prove very interesting, but the question of how to compare two regression algorithms that do not optimize the same loss (squared loss for KMP, versus ε-insensitive loss for SVMs) first needs to be addressed.
In the following, any mention of KMP without further specification of the loss function means least-squares KMP (also sometimes written KMP-mse); KMP-tanh refers to KMP using squared error after a hyperbolic tangent with modified targets (which behaves more like a typical margin loss function, as we discussed earlier in section 9.3.2). Unless otherwise specified, we used the pre-fitting matching pursuit algorithm of figure 9.3 to train least-squares KMP. To train KMP-tanh, we always used the back-fitting matching pursuit with non-squared loss algorithm of figure 9.4, with a conjugate gradient optimizer to optimize the coefficients¹⁰.

10. We tried several frequencies at which to do full back-fitting, but it did not seem to have a significant influence on the results.
9.5.1 2D experiments
Figure 9.6 shows a simple 2D binary classification problem with the decision surface found by the three versions of squared-error KMP (basic, back-fitting and pre-fitting) and a hard-margin SVM, when using the same Gaussian kernel. We fixed the number of support points in the KMP expansions to be the same as the number of support points found by the SVM algorithm. The aim of this experiment was to illustrate the following points:
Basic KMP, after 100 iterations, during which it mostly cycled back to previously chosen support points to improve their weights, is still unable to separate the data points. This shows that the back-fitting and pre-fitting versions are a useful improvement, while the basic algorithm appears to be a bad choice if we want sparse solutions.

The back-fitting and pre-fitting KMP algorithms are able to find a reasonable solution (the solution found by pre-fitting looks slightly better in terms of margin), but choose different support vectors than SVM, that are not necessarily close to the decision surface (as they are in SVMs). It should be noted that the Relevance Vector Machine [98] similarly produces solutions in which the relevance vectors do not lie close to the border.

Figure 9.6: From left to right: 100 iterations of basic KMP, 7 iterations of KMP back-fitting and pre-fitting, and the SVM solution. Support vectors are circled. Pre-fitting KMP and SVM appear to find equally reasonable solutions, though using different support vectors. Only SVM chooses its support vectors close to the decision surface. Back-fitting chooses yet another support set, and its decision surface appears to have a slightly worse margin. As for basic KMP, after 100 iterations during which it mostly cycled back to previously chosen support points to improve their weights, it appears to use more support vectors than the others while still being unable to separate the data points, and is thus a bad choice if we want sparse solutions.

Figure 9.7, where we used a simple dot-product kernel (i.e. linear decision surfaces), illustrates a problem that can arise when using a least-squares fit: since the squared error penalizes large positive margins, the decision surface is drawn towards the cluster on the lower right, at the expense of a few misclassified points.
Figure 9.7: Problem with least squares fit that leads KMP-mse (center) to misclassify points, but does not affect SVMs (left), and is successfully treated by KMP-tanh (right).
In [82] the RBF centers were chosen by unsupervised k-means clustering, in what they referred to as Classical RBF, and a gradient descent optimization procedure was used to train the kernel weights. We repeated the experiment using KMP-mse (equivalent to OLS-RBF) to find the support centers, with the same Gaussian Kernel and the same training set (7300 patterns) and independent test set (2007 patterns) of preprocessed handwritten digits. Table 9.1 gives the number of errors obtained by the various algorithms on the tasks consisting of discriminating each digit versus all the others (see [82] for more details). No validation data was used to choose the number of bases (support vectors) for the KMP. Instead, we trained with the same number of support vectors as found by the SVM, and also with half that number, to see whether a sparser KMP model would still yield good results. As can be seen, results obtained with KMP are comparable to those obtained for SVMs, contrarily to the results obtained with k-means RBFs, and there is only a slight loss of performance when using as few as half the number of support vectors.

Table 9.1: USPS Results: number of errors on the test set (2007 patterns), when using the same number of support vectors as found by SVM (except last row which uses half #sv). Squared error KMP (same as OLS-RBF) appears to perform as well as SVM.
Digit class #sv SVM k-means RBF KMP (same #sv) KMP (half #sv)
0 274 16 20 15 16
1 104 8 16 15 15
2 377 25 43 26 29
3 361 19 38 17 27
4 334 29 46 30 29
5 388 23 31 23 24
6 236 14 15 14 17
7 235 12 18 14 16
8 342 25 37 25 28
9 263 16 26 13 18
A first series of experiments used the machinery of the Delve [76] system to assess performance on the Mushrooms dataset. Hyper-parameters (the width of the kernel, the box-constraint parameter of the SVM, and the number of support points for KMP) were chosen automatically for each run using 10-fold cross-validation. The results for varying sizes of the training set are summarized in table 9.2. The p-values reported in the table are those computed automatically by the Delve system¹².

Table 9.2: Results obtained on the mushrooms data set with the Delve system.
KMP requires fewer support vectors, while none of the differences in error rates are significant.
size of train      64      128     256     512     1024
KMP error          6.28%   2.51%   1.09%   0.20%   0.05%
SVM error          4.54%   2.61%   1.14%   0.30%   0.07%
p-value (t-test)   0.24    0.82    0.81    0.35    0.39
KMP #s.v.          17      28      41      70      127
SVM #s.v.          63      105     244     443     483
For Wisconsin Breast Cancer, Sonar, Pima Indians Diabetes and Ionosphere, we used a slightly different procedure. The width of the Kernel was first fixed to a reasonable value for the given data set¹³.

12. For each size, the delve system did its estimations based on 8 disjoint training sets of the given size and 8 disjoint test sets of size 503, except for 1024, in which case it used 4 disjoint training sets of size 1024 and 4 test sets of size 1007.
13. These were chosen by trial and error using SVMs with a validation set and several values of the width, keeping what seemed the best, thus this choice was made at the advantage of SVMs
Then we used the following procedure: the dataset was randomly split into three equal-sized subsets for training, validation and testing. SVM, KMP-mse and KMP-tanh were then trained on the training set while the validation set was used to choose the optimal box-constraint parameter¹⁴ for the SVMs, and for early stopping (decide on the number of support vectors) for the KMPs. The trained models were then tested on the independent test set. This procedure was repeated 50 times over 50 different random splits of the dataset into train/validation/test to estimate confidence measures (p-values were computed using the resampled t-test studied in [68]). Table 9.3 reports the average error rate measured on the test sets, and the rounded average number of support vectors found by each algorithm. As can be seen from these experiments, the error rates obtained are comparable, but the KMP versions appear to require much fewer support vectors than SVMs. On these datasets, however (contrary to what we saw previously on 2D artificial data), KMP-tanh did not seem to give any significant improvement over KMP-mse. Even in other experiments where we added label noise, KMP-tanh didn't seem to improve generalization performance¹⁵.
(although they did not seem too sensitive to it) rather than KMP. The values used were: 4.0 for Wisconsin Breast Cancer, 6.0 for Pima Indians Diabetes, 2.0 for Ionosphere and Sonar.
14. Values of 0.02, 0.05, 0.07, 0.1, 0.5, 1, 2, 3, 5, 10, 20, 100 were tried for the box-constraint parameter.
15. We do not give a detailed account of these experiments here, as their primary intent was to see whether the tanh error function could have an advantage over squared error in presence of label noise.
Table 9.3: Results on 4 UCI datasets. Again, error rates are not significantly different (values in parentheses are the p-values for the difference with SVMs), but KMPs require much fewer support vectors.
Dataset        SVM error   KMP-mse error   KMP-tanh error   SVM #s.v.   KMP-mse #s.v.   KMP-tanh #s.v.
Wisc. Cancer   3.41%       3.40% (0.49)    3.49% (0.45)     42          7               21
Sonar          20.6%       21.0% (0.45)    26.6% (0.16)     46          39              14
Pima Indians   24.1%       23.9% (0.44)    24.0% (0.49)     146         7               27
Ionosphere     6.51%       6.87% (0.41)    6.85% (0.40)     68          50              41
9.6 Conclusion
We have shown how Matching Pursuit provides an interesting and flexible framework to build and study alternative kernel-based methods, how it can be extended to use arbitrary differentiable loss functions, and how it relates to SVMs, RBF training procedures, and boosting methods. We have also provided experimental evidence that such greedy constructive algorithms can perform as well as SVMs, while allowing a better control of the sparsity of the solution, and thus often lead to solutions with far fewer support vectors. It should also be mentioned that the use of a dictionary gives a lot of flexibility, as it can be extended in a direct and straightforward manner, allowing for instance, to mix several kernel shapes to choose from (similar to the SVM extension proposed by [106]), or to include other non-kernel functions based on prior knowledge
(similar to the work of [91] on semi-parametric SVMs). This is a promising avenue for further research. In addition to the computational advantages brought by the sparsity of the models obtained with the kernel matching pursuit algorithms, one might suspect that generalization error also depends (monotonically) on the number of support vectors (other things being equal). This was observed empirically in our experiments, but future work should attempt to obtain generalization error bounds in terms of the number of support vectors. Note that leave-one-out SVM bounds [102, 103] cannot be used here because the KMP solution does not depend solely on a subset (the support vectors) of the training examples. Sparsity has been successfully exploited to obtain bounds for other SVM-like models [61, 34, 45], in particular the kernel perceptron, again taking advantage of the dependence on a subset of the examples. A related direction to pursue might be to take advantage of the data-dependent structural risk minimization results of [84].
Empirical results on real-world problems often indicate much better performance with SVMs than with KNN. We wanted to understand a little better why, and to see whether KNN could not in some way be "repaired", so that its performance in high dimension would equal or surpass that of SVMs. The qualitative difference between the smooth decision surfaces generally produced by kernel SVMs and the zig-zag surface produced by KNN, as illustrated in Figure 11.1, provided the starting intuition, launching a reflection on the notion of local margin maximization in input space that we evoke in the introduction of the article, and leading us to define the distance from a point to a class as the distance to the linear variety spanned by the neighbors of that point belonging to that class.
11.1 Motivation
The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [11, 102] approach. The margin of SVMs has a nice geometric interpretation¹: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the kernel trick to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The trick is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space is meaningful in general (see [110]).
1. For the purpose of this discussion, we consider the original hard-margin SVM algorithm for linearly separable data.
A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [100]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [24] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation is that it has too much capacity, compared to SVM, that the class of functions it can produce is too rich. But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding.
In the previous and following discussion, we often refer to the concept of decision surface, also known as decision boundary. A classifier associates a class to every point of the input space, and thus defines a partition of that space into regions; the decision surface for a class is the boundary between the region associated to that class and the rest of the space, i.e. the contour of that region. When we mention the decision surface in our discussion we consider only the case of two-class discrimination, in which there is a single decision surface. When we mention a test point, we mean a point that does not belong to the training set.

The K-neighborhood of a test point x is the set of the K points of the training set whose distance to x is smallest.

By Nearest Neighbor algorithm (1NN) we mean the following algorithm: the class of a test point x is the class of the training point whose distance to x is smallest.

By K-Nearest Neighbor algorithm (KNN) we mean the following algorithm: the class of a test point x is the class appearing most frequently within its K-neighborhood.

2. By mostly separable we mean that the Bayes error is almost zero, and the optimal decision surface has not too many disconnected components.
3. i.e. each class has a probability density with a support that is a lower-dimensional manifold, and with the probability quickly fading away from this support.
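The definitions above translate directly into code. This is a minimal sketch of our own, not the thesis implementation:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K=1):
    """1NN / KNN exactly as defined above: the class of a test point x is
    decided by a majority vote within its K-neighborhood (K=1 gives 1NN)."""
    d = np.linalg.norm(X_train - x, axis=1)     # Euclidean distances to x
    neighborhood = np.argsort(d)[:K]            # the K-neighborhood of x
    return Counter(y_train[neighborhood].tolist()).most_common(1)[0][0]
```

The modified algorithms of the following sections keep this overall structure, but replace the point-to-point distance with a point-to-class distance.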
Figure 11.1: A local view of the decision surface produced by the Nearest Neighbor (left) and SVM (center) algorithms, and how the Nearest Neighbor solution gets closer to the SVM solution in the limit, if the support for the density of each class is a manifold which can be considered locally linear (right).
Thus the decision surface, while having the largest possible local margin with regard to the training points, is likely to have a poor (small) local margin with respect to yet unseen samples falling close to the locally linear manifold, and will thus result in poor generalization performance. This problem fundamentally remains with the K Nearest Neighbor (KNN) variant of the algorithm, but, as can be seen in the figure, it does not seem to affect the decision surface produced by SVMs (as the surface is constrained to a particular smooth form, a straight line or hyperplane in the case of linear SVMs). It is interesting to notice, if the assumption of locally linear class manifolds holds, how the 1NN solution approaches the SVM solution in the limit as we increase the number of samples.
To fix this problem, the idea is to somehow fantasize the missing points, based on a local linear approximation of the manifold of each class. This leads to the modified Nearest Neighbor algorithms described in the next sections.⁴

We consider, around each test point x, the affine subspace that would contain all the fantasized missing points of the manifold of each class, locally approximated by an affine subspace. We shall thus consider, for each class, the local affine subspace that passes through the K points of the neighborhood of x belonging to that class. This affine subspace is typically K − 1 dimensional or less, and we will somewhat abusively call it the local hyperplane.⁵ Formally, denoting by n_1, …, n_K the K neighbors of x within the considered class, the local hyperplane can be defined as

    LH(x) = { p | p = Σ_{k=1}^{K} α_k n_k , with Σ_{k=1}^{K} α_k = 1 }.   (11.1)

Another way to define this hyperplane, that gets rid of the constraint Σ_k α_k = 1, is to take a reference point within the hyperplane as an origin, for instance the

4. Note that although we never generate the fantasy points explicitly, the proposed algorithms are really equivalent to classical 1NN with fantasized points.
5. Strictly speaking, a hyperplane in an n-dimensional input space is an (n − 1)-dimensional affine subspace, while here the dimension is at most K − 1.
centroid⁶ of the neighbors, n̄ = (1/K) Σ_{k=1}^{K} n_k. The local hyperplane can then be expressed as

    LH(x) = { p | p = n̄ + Σ_{k=1}^{K} α_k V_k },  where V_k = n_k − n̄.   (11.2)

(In this formulation the α_k are unconstrained, but the V_k are linearly dependent: remember that the initial formulation had an equality constraint and thus only K − 1 degrees of freedom.)

Our modified nearest neighbor algorithm then associates a test point x to the class whose local hyperplane is closest to x, hence the name K-local Hyperplane Distance Nearest Neighbor algorithm (HKNN in short). Formally, the distance from x to the local hyperplane of a class is

    d(x, LH(x)) = min_{α} ‖ x − n̄ − Σ_{k=1}^{K} α_k V_k ‖.   (11.3)

Computing this distance, for each class, amounts to solving the linear least-squares system in the α_k

    (VᵀV) α = Vᵀ (x − n̄),   (11.4)

where V is the matrix whose columns are the V_k; this is a K × K system, regardless of the dimensionality of the input space.

6. Any reference point within the hyperplane could serve as the origin (for instance one of the K neighbors); the centroid is a natural choice.
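Concretely, computing d(x, LH(x)) via the centroid formulation (11.2)–(11.4) is a small least-squares solve. The following sketch (our own illustrative code; names are ours) makes HKNN explicit:

```python
import numpy as np

def hyperplane_distance(x, neighbors):
    """Distance from x to the local hyperplane of equations (11.1)-(11.4):
    the affine subspace through the K class neighbors of x.

    neighbors : (K, n) array of the K nearest neighbors of x in one class.
    Uses the centroid as reference point, so the alpha are unconstrained.
    """
    nbar = neighbors.mean(axis=0)                  # centroid reference point
    V = (neighbors - nbar).T                       # (n, K) direction matrix
    alpha, *_ = np.linalg.lstsq(V, x - nbar, rcond=None)
    return np.linalg.norm(x - nbar - V @ alpha)

def hknn_predict(x, class_neighbors):
    """HKNN: associate x to the class whose local hyperplane is closest.
    class_neighbors maps each class label to its (K, n) neighbor array."""
    return min(class_neighbors,
               key=lambda c: hyperplane_distance(x, class_neighbors[c]))
```

Using `lstsq` rather than explicitly forming (VᵀV) sidesteps the singularity that occurs when the V_k are linearly dependent.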
that do not affect the class identity. These are invariances. The main difference is that in HKNN these invariances are inferred directly from the local neighborhood in the training set, whereas in Tangent Distance [87], they are based on prior knowledge. It should be interesting (and relatively easy) to combine both approaches for improved performance when prior knowledge is available. Previous work on nearest-neighbor variations based on other locally-defined metrics can be found in [85, 66, 38, 47], and is very much related to the more general paradigm of Local Learning Algorithms [13, 4, 69]. We should also mention close similarities between our approach and the recently proposed Local Linear Embedding [78] method for dimensionality reduction. The idea of fantasizing points around the training points in order to define the decision surface is also very close to methods based on estimating the class-conditional input density [100, 19]. Besides, it is interesting to look at HKNN from a different, less geometrical angle: it can be understood as choosing the class that achieves the best reconstruction (the smallest reconstruction error) of the test pattern through a linear combination of particular prototypes of that class (the K neighbors). From this point of view, the algorithm is very similar to the Nearest Feature Line (NFL) [60] method. They differ in the fact that NFL considers all pairs of points of a class for its search rather than the local neighbors, thus looking at many lines (i.e. affine subspaces defined by two points), rather than a single K-point one.
hood of the test point. In a typical high dimensional setting, exact colinearities between input patterns are rare, which means that as soon as tern of of the
(including nonsensical ones) can be produced by a linear combination neighbors. The actual dimensionality of the manifold may be much
less than
the algorithm to mistake those noise directions for invariances, and may hurt its performance even for smaller values of
approximation of the class manifold by a hyperplane is valid only locally, so we might want to restrict the fantasizing of class members to a smaller region of the hyperplane. We considered two ways of dealing with these problems. 8
8
A third interesting avenue, which we did not have time to explore, would be to keep only the
) 3
, any pat-
115
The first way is to consider the distance to the convex hull of the $K$ neighbors, rather than the distance to their hyperplane. Unlike the distance to the hyperplane, the distance to the convex hull cannot be found by solving a simple linear system, but typically requires solving a quadratic programming problem (very similar to the one of SVMs). While this is more complex to implement, it should be mentioned that the problems to be solved are of a relatively small dimension of order $K$, one for each class. This algorithm will be referred to as the K-local Convex Distance Nearest Neighbor Algorithm (CKNN in short).

The second way is to add to the minimized criterion a weight-decay penalty $\lambda$ on the norm of $\alpha$ (this penalizes moving away from the centroid, especially along the less essential directions):

$$d_\lambda(x, LH_c(x))^2 = \min_{\alpha} \left\| x - \bar{N} - \sum_{k=1}^{K} \alpha_k V_k \right\|^2 + \lambda \sum_{k=1}^{K} \alpha_k^2 . \qquad (11.5)$$

The solution is again found by solving a small linear system, the same as before but with an additional diagonal term: $(V^\top V + \lambda I)\,\alpha = V^\top (x - \bar{N})$. The resulting algorithm is a generalization of HKNN (basic HKNN corresponds to $\lambda = 0$). Note that in both variants the $K$ nearest neighbors are searched within each class, so the time of the whole algorithm is dominated by these nearest neighbor searches.
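As a rough illustration of the convex variant, the small quadratic program can be approximated by projected gradient descent with the standard Euclidean projection onto the probability simplex. This is our own illustrative solver under those assumptions, not the QP solver used in the thesis:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def convex_hull_distance(x, neighbors, n_steps=500):
    """Distance from x to the convex hull of the K neighbors:
    min_alpha ||x - alpha @ neighbors||  s.t.  alpha >= 0, sum(alpha) = 1."""
    N = neighbors                              # (K, n)
    alpha = np.full(N.shape[0], 1.0 / N.shape[0])
    # step size from the Lipschitz constant of the quadratic's gradient
    step = 1.0 / (2 * max(np.linalg.norm(N @ N.T, 2), 1e-12))
    for _ in range(n_steps):
        grad = 2 * N @ (N.T @ alpha - x)       # gradient of ||N'alpha - x||^2
        alpha = project_simplex(alpha - step * grad)
    return np.linalg.norm(x - N.T @ alpha)
```

When the test point already projects onto the interior of a hull facet, this converges in very few iterations; a proper QP solver would of course be used in practice.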
A first 2D toy example (see Figure 11.2) graphically illustrates the qualitative differences in the decision surfaces produced by KNN, linear SVM and CKNN. Table 11.1 gives quantitative results on two real-world digit OCR tasks, allowing to compare the performance of the old and new algorithms. Figure 11.3 illustrates the problem arising for large values of $K$, mentioned in Section 11.3, and shows that the two proposed solutions, CKNN and HKNN with an added weight decay $\lambda$, allow to overcome it.

In our final experiment, we wanted to see if the good performance of the new algorithms absolutely depended on having all the training points at hand, as this has a direct impact on speed. So we checked what performance we could get out of HKNN and CKNN when using only a small but representative subset of the training points, namely the set of support vectors found by a Gaussian kernel SVM. The results obtained for MNIST are given in Table 11.2, and look very encouraging. HKNN appears to be able to perform as well as or better than SVMs without requiring more data points than SVMs.
Table 11.1: Test error obtained on the USPS and MNIST digit classification tasks by KNN, SVM (using a Gaussian kernel), HKNN and CKNN. Hyper-parameters were tuned on a separate validation set. Both HKNN and CKNN appear to perform much better than original KNN, and even compare favorably to SVMs.

  Data Set                                               Algorithm   Test Error
  USPS (6291 train, 1000 valid., 2007 test points)       KNN         4.98%
                                                         SVM         4.33%
                                                         HKNN        3.93%
                                                         CKNN        3.98%
  MNIST (50000 train, 10000 valid., 10000 test points)   KNN         2.95%
                                                         SVM         1.30%
                                                         HKNN        1.26%
                                                         CKNN        1.46%
Figure 11.2: 2D illustration of the decision surfaces produced by KNN (left, K=1),
linear SVM (middle), and CKNN (right, K=2). The holes are again visible in KNN. CKNN doesn't suffer from this, but keeps the objective of maximizing the
margin locally.
11.5 Conclusion
From a few geometrical intuitions, we have derived two modified versions of the KNN algorithm that look very promising. HKNN is especially attractive: it is very simple to implement on top of a KNN system, as it only requires the additional step of solving a small and simple linear system, and appears to greatly boost the performance of standard KNN, even above the level of SVMs. The proposed algorithms share the advantages of KNN (no training required, ideal for fast adaptation, natural handling of the multi-class case) and its drawbacks (large memory requirement, slow testing). However our latest results also indicate the possibility of substantially reducing the reference set in memory without losing accuracy. This suggests that the algorithm indeed captures essential information in the data, and that our initial intuition on the nature of the flaw of KNN may well be at least partially correct.
Figure 11.3: Test error rate as a function of $K$ for CKNN, basic HKNN, and HKNN with weight decay $\lambda = 1$ and $\lambda = 10$. As can be seen, the basic HKNN algorithm performs poorly for large values of $K$. CKNN is relatively unaffected by this problem, and HKNN can be made robust through the added weight decay penalty controlled by $\lambda$.
Table 11.2: Test error obtained on MNIST with HKNN and CKNN when using a reduced training set made of the 16712 support vectors retained by the best Gaussian kernel SVM. This corresponds to 28% of the initial 60000 training patterns. Performance is even better than when using the whole dataset. But here, hyper-parameters $K$ and $\lambda$ were chosen with the test set, as we did not have a separate validation set in this setting. It is nevertheless remarkable that comparable performances can be achieved with far fewer points.

  Data Set                                      Algorithm   Test Error
  MNIST (16712 train s.v., 10000 test points)   HKNN        1.23%
                                                CKNN        1.36%
The idea that, in high dimension, the support of the data could be a manifold of lower dimension was widely popularized by the publication of two dimensionality reduction algorithms: Local Linear Embedding [78] and IsoMap [95]. Our research can likewise be understood as taking this beautiful geometric intuition into account in order to repair classical non-parametric algorithms (KNN and Parzen). That said, the orientation of our research was not born at the outset from such clear considerations about manifolds. Its starting point was a simple attempt to suppress the zig-zag artifact of the decision surface produced by KNN. The notion of a lower dimensional manifold only became obvious to us afterwards, greatly clarifying our initial intuition.

…the data need only be more concentrated in certain directions than in others for the benefits of these algorithms to be felt.
13.1 Introduction
In [105], while attempting to better understand and bridge the gap between the good performance of the popular Support Vector Machines and the more traditional K-NN (K Nearest Neighbors) for classification problems, we had suggested a modified Nearest-Neighbor algorithm. This algorithm, which was able to slightly outperform SVMs on several real-world problems, was based on the geometric intuition that the classes actually lived close to a lower dimensional non-linear manifold in the high dimensional input space. When this was not properly taken into account, as with traditional K-NN, the sparsity of the data points due to having a finite number of training samples would cause holes or zig-zag artifacts in the resulting decision surface, as illustrated in Figure 13.1.
Figure 13.1: A local view of the decision surface, with holes, produced by the Nearest Neighbor algorithm when the data have a local structure (horizontal direction).

The present work is based on the same underlying geometric intuition, but applied to the well known Parzen windows [71] non-parametric method for density estimation, using Gaussian kernels. Most of the time, Parzen windows estimates are built using a spherical Gaussian with a single scalar variance (or width) parameter $\sigma^2$. It is also possible to use a diagonal Gaussian, i.e. with a diagonal covariance matrix, or even a full Gaussian with a full covariance matrix, usually set to be proportional to the global empirical covariance of the training data. However these are equivalent to using a spherical Gaussian on preprocessed, normalized data (i.e. normalized by subtracting the empirical sample mean, and multiplying by the inverse sample covariance). Whatever the shape of the kernel, if, as is customary, a fixed shape is used, merely centered on every training point, the shape can only compensate for the global structure (such as global covariance) of the data. Now if the true density that we want to model is indeed close to a non-linear lower dimensional manifold embedded in the higher dimensional input space, in the sense that most of the probability density is concentrated around such a manifold (with a small noise component away from it), then using Parzen windows with a spherical or fixed-shape Gaussian is probably not the most appropriate method, for the following reason.
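For reference, the ordinary fixed-width spherical-Gaussian Parzen windows estimate discussed above can be sketched as follows (a minimal numpy version; variable names are ours):

```python
import numpy as np

def parzen_density(x, train, sigma2):
    """Spherical-Gaussian Parzen windows estimate of p(x).

    train: (l, n) matrix of training points; sigma2: scalar variance."""
    l, n = train.shape
    sq_dists = ((train - x) ** 2).sum(axis=1)
    # each kernel is a spherical Gaussian of covariance sigma2 * I
    norm = (2 * np.pi * sigma2) ** (n / 2)
    return np.exp(-0.5 * sq_dists / sigma2).sum() / (l * norm)
```

The fixed, isotropic kernel shape here is exactly what the Manifold Parzen variant below replaces with a locally adapted covariance.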
While the true density mass, in the vicinity of a particular training point $x_i$, will be mostly concentrated in a few local directions along the manifold, a spherical Gaussian centered on that point will spread its density mass equally along all input space directions, thus giving too much probability to irrelevant regions of space and too little along the manifold. This is likely to result in an excessive bumpyness of the thus modeled density, much like the holes and zig-zag artifacts observed in KNN (see Fig. 13.1 and Fig. 13.2).

If the true density in the vicinity of $x_i$ is concentrated along a lower dimensional manifold, then it should be possible to infer the local direction of that manifold from the data in the neighborhood of $x_i$, and to use, instead of a spherical Gaussian, a Gaussian that is parameterized in such a way that it spreads mostly along the directions of the manifold, and is almost flat along the other directions. The resulting model is a mixture of Gaussian pancakes, similar to [48], mixtures of probabilistic PCAs [99] or mixtures of factor analyzers [44, 43], in the same way that the most traditional Parzen windows is a mixture of spherical Gaussians. But it remains a memory-based method, with a Gaussian kernel centered on each training point, yet with a differently shaped kernel for each point.
Let $X$ be an $n$-dimensional random variable with unknown probability density $p(x)$. Our training set contains $l$ samples of that random variable, collected in a $l \times n$ matrix $X$ whose row $x_i$ is the $i$-th sample. Our goal is to estimate the density $p$. Our estimator $\hat{p}_{mp}$ has the form of a mixture of Gaussians, but unlike the Parzen density estimator, its covariances $C_i$ are not necessarily spherical and not necessarily identical everywhere:

$$\hat{p}_{mp}(x) = \frac{1}{l} \sum_{i=1}^{l} N_{x_i, C_i}(x), \qquad (13.1)$$

where $N_{\mu, C}$ is the multivariate Gaussian density with mean vector $\mu$ and covariance matrix $C$:

$$N_{\mu, C}(x) = \frac{1}{\sqrt{(2\pi)^n |C|}} \, e^{-\frac{1}{2}(x - \mu)^\top C^{-1} (x - \mu)}, \qquad (13.2)$$

where $|C|$ is the determinant of $C$. How should the individual covariances $C_i$ be selected? If the data were concentrated near a lower dimensional non-linear principal manifold, those Gaussians would be pancakes aligned with the plane locally tangent to this underlying manifold. The only available information (in the absence of further prior knowledge) about this tangent plane is that given by the training samples in the neighborhood of $x_i$. In other words, we are interested in the principal directions of the samples in the neighborhood of $x_i$.

For generality, one can define a soft neighborhood of $x_i$ with a neighborhood kernel $K(x; x_i)$, and compute the weighted covariance

$$C_{K_i} = \frac{\sum_{j \neq i} K(x_j; x_i)\,(x_j - x_i)(x_j - x_i)^\top}{\sum_{j \neq i} K(x_j; x_i)} .$$

Notice that if $K$ is a constant, $C_{K_i}$ is the global training sample covariance. As an important special case, we can define a hard k-neighborhood for training point $x_i$ by assigning a weight of 1 to any point no further than the $k$-th nearest neighbor of $x_i$ among the training set, according to some metric such as the Euclidean distance in input space, and by assigning a weight of 0 to points further than the $k$-th neighbor. In that case, $C_{K_i}$ is simply the empirical covariance of the $k$ nearest neighbors of $x_i$.

Notice what is happening here: we start with a possibly rough prior notion of neighborhood, such as one based on the ordinary Euclidean distance in input space, and use this to compute a local covariance matrix, which implicitly defines a refined local notion of neighborhood, taking into account the local directions observed in the training samples.
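The hard k-neighborhood covariance just described can be sketched as follows (a minimal numpy illustration with our own names; note that, as in the definition, differences are taken around $x_i$ itself, not around the neighbors' mean):

```python
import numpy as np

def local_covariance(i, X, k):
    """Empirical covariance of the k nearest neighbors of training point x_i,
    computed from differences (x_j - x_i): the hard k-neighborhood case."""
    xi = X[i]
    d = np.linalg.norm(X - xi, axis=1)
    d[i] = np.inf                         # exclude the point itself
    nn = X[np.argsort(d)[:k]]             # the k nearest neighbors
    diffs = nn - xi
    return diffs.T @ diffs / k            # (n, n) local covariance C_i
```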
Now that we have a way of computing a local covariance matrix for each training point, we might be tempted to use this directly in equations 13.2 and 13.1. But a number of problems must first be addressed:
(1) The covariance matrix computed from a hard k-neighborhood with $k < n$ is necessarily rank deficient: it is flat outside of the affine subspace spanned by $x_i$ and its $k$ neighbors, and it does not constitute a proper density in $\mathbb{R}^n$. A common way to deal with this problem is to add a small isotropic (spherical) Gaussian noise of variance $\sigma^2$ in all directions, which is done by simply adding $\sigma^2$ to the diagonal of the covariance matrix: $C_i = C_{K_i} + \sigma^2 I$.

(2) Even with this regularization, in high dimensional spaces it would be prohibitive in computation time and storage to keep and use the full inverse covariance matrix as expressed in 13.2. This would in effect multiply both the time and storage requirement of the already expensive ordinary Parzen windows by $n$. Instead, we store only the eigenvectors associated with the few largest eigenvalues of $C_{K_i}$. The first $d$ eigenvectors correspond to the principal directions of the local neighborhood, i.e. the high variance local directions of the supposed underlying $d$-dimensional manifold (but the true underlying dimension is unknown and may actually vary across space). The last few eigenvalues and eigenvectors are but noise directions with a small variance, so we may, without too much risk, force those last few components to the same low noise level $\sigma^2$. We have done this by zeroing the last $n - d$ eigenvalues and adding $\sigma^2$ to all eigenvalues. Thus both the storage requirement and the computational cost are only about $d$ times that of ordinary Parzen, and with such an approximation of the covariance matrix, both its determinant and its inverse follow directly from the $d$ stored eigenvalues and eigenvectors.
In the case of the hard k-neighborhood, the training algorithm pre-computes, for each training point, the $d$ leading principal directions of its $k$ nearest neighbors (a partial singular value decomposition suffices, since only the first $d$ components are kept). The training and testing procedures are summarized below.

Algorithm MParzen::Train($X$, $d$, $k$, $\sigma^2$)
Input: an $l \times n$ training matrix $X$ with rows $x_i$, number of kept principal directions $d$, number of neighbors $k$, regularization parameter $\sigma^2$.
(1) For $i \in \{1, \ldots, l\}$:
(2)   Collect the $k$ nearest neighbors of $x_i$, and put the differences $x_j - x_i$ in the rows of a $k \times n$ matrix $M$.
(3)   Perform a partial singular value decomposition of $M$, to obtain the $d$ largest singular values $s_j$ and the corresponding singular vectors $v_j$ (the principal directions of the neighborhood), with associated eigenvalues $\lambda_{ij} = s_j^2 / k$.
(4)   Store the directions $v_j$ in the rows of a $d \times n$ matrix $V_i$, together with the eigenvalues $\lambda_{ij}$.
Output: the trained model $\mathcal{M} = \{X, (V_i), (\lambda_i), d, \sigma^2\}$, where $(V_i)$ is an $l \times d \times n$ tensor.

Algorithm MParzen::Test($x$, $\mathcal{M}$)
(1) $s \leftarrow 0$
(2) For $i \in \{1, \ldots, l\}$:
(3)   $s \leftarrow s + \mathrm{LocalGaussian}(x, x_i, V_i, \lambda_i, d, \sigma^2)$, where LocalGaussian evaluates the Gaussian density $N_{x_i, C_i}(x)$ with $C_i$ represented by the $d$ leading eigen-directions $V_i$ with variances $\lambda_{ij} + \sigma^2$, and variance $\sigma^2$ in all remaining directions (so that both the determinant and the inverse of $C_i$ follow cheaply from the stored decomposition).
(4) Output the density estimate $\hat{p}_{mp}(x) = s / l$.

Note that with $d = 0$ this procedure reduces to the traditional Parzen windows estimator.
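A compact numpy sketch of the two procedures above (illustrative only: implementation details and names beyond the pseudocode are ours, and it assumes $d \le \min(k, n)$):

```python
import numpy as np

def mparzen_train(X, d, k, sigma2):
    """Pre-compute, for each training point, the d leading local
    eigen-directions and (regularized) eigenvalues."""
    model = []
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf
        M = X[np.argsort(dist)[:k]] - X[i]        # k x n difference matrix
        # SVD of M: local eigenvalues are s^2 / k, regularized by sigma2
        _, s, Vt = np.linalg.svd(M, full_matrices=False)
        lams = s[:d] ** 2 / k + sigma2            # kept eigenvalues
        model.append((X[i], Vt[:d], lams))        # point, directions, lams
    return model, sigma2

def mparzen_density(x, model, sigma2):
    """Evaluate the Manifold Parzen density estimate at x."""
    n = len(x)
    total = 0.0
    for xi, V, lams in model:
        r = x - xi
        proj = V @ r                              # components along kept dirs
        # Mahalanobis term: kept directions use lams, the rest use sigma2
        maha = (r @ r - proj @ proj) / sigma2 + (proj ** 2 / lams).sum()
        logdet = np.log(lams).sum() + (n - len(lams)) * np.log(sigma2)
        total += np.exp(-0.5 * (n * np.log(2 * np.pi) + logdet + maha))
    return total / len(model)
```

Only the $d$ kept directions and eigenvalues per point are stored, so memory and evaluation cost grow by a factor of about $d$ over ordinary Parzen, as stated above; with `d=0` the sketch falls back to the spherical Parzen estimate.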
We should also mention similarities between our approach and the Local Linear Embedding and related recent dimensionality reduction methods [78, 94, 28, 14]. There are also links with previous work on locally-defined metrics for nearest neighbors [85, 66, 38, 47]. Lastly, it can also be seen as an extension along the line of traditional variable and adaptive kernel estimators that adapt the kernel width locally (see [50] for a survey).
For quantitative evaluation of the density estimators we use the average negative log-likelihood, $\mathrm{ANLL} = -\frac{1}{m}\sum_{i=1}^{m} \log \hat{p}(x_i)$, computed with the $m$ examples $x_i$ of a test set.

Experiment on 2D artificial data. A training set of 300 points, a validation set and a test set were generated from a two-dimensional distribution concentrated along a one-dimensional non-linear curve, with added isotropic Gaussian noise of variance 0.01 (the curve parameter being uniform in an interval and the noise a normal density). We trained an ordinary Parzen estimator, as well as MParzen with $d = 1$ and $d = 2$, on the training set, tuning the hyper-parameters to achieve best performance on the validation set. Figure 13.2 shows the training set and gives a good idea of the densities produced by both kinds of algorithms (the visual representation shows the case $d = 1$ for MParzen). It clearly shows the excessive bumpyness of ordinary Parzen, and shows that MParzen is indeed able to better concentrate the probability density along the manifold, even when the training data is scarce. Quantitative comparative results of the two models are reported in Table 13.1.

Table 13.1: Comparative results on the artificial data (standard errors are in parentheses).

  Algorithm          ANLL on test-set
  Parzen             -1.183 (0.016)
  MParzen ($d = 1$)  -1.466 (0.009)
  MParzen ($d = 2$)  -1.419 (0.009)
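The ANLL figures reported in the table are simply the mean negative log density over held-out points; as a sketch (our own helper, taking any density function):

```python
import numpy as np

def anll(density_fn, test_points):
    """Average negative log-likelihood of a density estimator on a test set."""
    return -np.mean([np.log(density_fn(x)) for x in test_points])
```

Lower ANLL is better: it corresponds to assigning higher probability to the held-out data.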
Both MParzen models achieve a lower ANLL than ordinary Parzen (even though the underlying manifold really has dimension 1), and with more consistency over the test sets (lower standard error). The optimal width of ordinary Parzen does not match the noise parameter of the true generating model (0.01), probably because of the finite sample size. The optimal regularization parameter $\sigma^2$ for MParzen with $d = 1$ (i.e. supposing a one-dimensional underlying manifold) is very close to the actual noise parameter of the true generating model. This suggests that it was able to capture the underlying structure quite well. Also it is the best of the three models, which is not surprising, since the true model is indeed a one-dimensional manifold with an added isotropic Gaussian noise. The optimal additional noise parameter for MParzen with $d = 2$ (i.e. supposing a two-dimensional underlying manifold) is close to 0, which suggests that the model was able to capture all the noise in the second principal direction.
The two models were compared using the performance measures introduced above (average negative log-likelihood). Note that the improvement with respect to Parzen windows is extremely large and of course statistically significant.

Table 13.2: Density estimation of class '2' in the MNIST data set (standard errors in parentheses).

  Algorithm   validation ANLL   test ANLL
  Parzen      -197.27 (4.18)    -197.19 (3.55)
  MParzen     -696.42 (5.94)    -695.15 (5.21)
This method was applied to both the Parzen and the Manifold Parzen density estimators, which were compared with state-of-the-art Gaussian SVMs on the full USPS data set. The original training set (7291) was split into a training set (first 6291) and a validation set (last 1000), used to tune hyper-parameters. The classification errors for all three methods are compared in Table 13.3, where the hyper-parameters are chosen based on validation classification error. The log-likelihoods are compared in Table 13.4, where the hyper-parameters are chosen based on validation ANCLL. The hyper-parameters for the SVMs are the box constraint $C$ and the Gaussian kernel
width $\sigma$. MParzen has the lowest classification error and ANCLL of the three algorithms.

Table 13.3: Classification error obtained on USPS with SVM, Parzen windows and Manifold Parzen windows classifiers.

  Algorithm   validation error   test error
  SVM         1.2%               4.68%
  Parzen      1.8%               5.08%
  MParzen     0.9%               4.08%

Table 13.4: Average negative conditional log-likelihood (ANCLL) obtained on USPS with the Parzen windows and Manifold Parzen windows classifiers.

  Algorithm   valid ANCLL   test ANCLL
  Parzen      0.1022        0.3478
  MParzen     0.0658        0.3384

13.5 Conclusion

The rapid increase in computational power now allows to experiment with sophisticated non-parametric models such as those presented here. They have allowed to show the usefulness of learning the local structure of the data through a regularized covariance matrix estimated for each data point. By taking advantage of local structure, the new kernel density estimation method outperforms the Parzen windows estimator. Classifiers built from this density estimator yield state-of-the-art knowledge-free performance, which is remarkable for a not discriminatively
trained classifier. Besides, in some applications, the accurate estimation of probabilities can be crucial, e.g. when the classes are highly imbalanced. Future work should consider other alternative methods of estimating the local covariance matrix, for example using a weighted estimator as suggested here, or taking advantage of prior knowledge (e.g. the Tangent Distance directions).
Figure 13.2: Illustration of the density estimated by ordinary Parzen windows (left) and Manifold Parzen windows (right). The two images on the bottom are a zoomed area of the corresponding image at the top. The 300 training points are represented as black dots, and the area where the estimated density is above 1.0 is painted in gray. The excessive bumpyness and holes produced by the ordinary Parzen windows model can clearly be seen, whereas the Manifold Parzen density is better aligned with the underlying manifold, allowing it to even successfully extrapolate in regions with few data points but high true density.
…this did not prevent us from obtaining very good results on classification problems (our goal being to beat the results of an SVM classifier). The K-Local Hyperplane and Convex Distance Nearest Neighbor algorithms are truly designed for classification. As for Manifold Parzen Windows, it is obviously a matter of density estimation. The objectives pursued and the techniques employed also differ. For Kernel Matching Pursuit, we sought to precisely control the number of support points, while constructing a solution having exactly the same analytical form as a Gaussian SVM, i.e. a weighted sum of isotropic Gaussians centered on a small subset of the training points. For our variants of KNN and Parzen, we wanted above all to see whether we could improve the performance of the classical non-parametric methods ("repair" them) by locally taking into account the principal directions of the data in the neighborhood of a point. The resulting functions do not have the same analytical form as a Gaussian SVM, and are clearly not sparse. But the important fact is that all these algorithms obtained very good performance on high dimensional problems, able to equal or surpass SVMs. Each in its own way constitutes an alternative that does not resort to the kernel trick. Moreover, the last two articles suggest that in high dimension it is important to take into consideration the local structure appearing in the data in the neighborhood of a point.
From a practical point of view, these algorithms each have their advantages and drawbacks. Kernel Matching Pursuit can offer an important practical advantage over SVMs, since a reduced number of support points means correspondingly reduced memory usage and computation times. Moreover, the fact that we optimize a quadratic cost can prove useful in cases where the $\epsilon$-insensitive cost (see [102]) of regression SVMs is not appropriate. Everything depends on the objective pursued: for instance, for the particular distribution of insurance claim data, the cost minimized by regression SVMs is not appropriate [29]. Finally, the simplicity and flexibility of the dictionary principle make it easy to go beyond the SVM form, leaving free rein to experimentation, for example to include functions suggested by prior knowledge about the problem. K-Local Hyperplane Nearest Neighbor inherits from KNN its advantages (no training required, which makes incremental adaptation easy) and its drawbacks (large memory usage tied to the need to keep the entire training set, and slowness at test time, since this whole set must be scanned for each test example). But it can sometimes dramatically improve performance (test error rate) without excessively increasing the required resources. Manifold Parzen Windows is, for its part, very resource-hungry, since it requires a substantial training phase and, above all, a memory footprint of $d$ times the size of the training set (where $d$ is the number of principal directions that one wishes to keep). This can prevent its use with large training sets in high dimension.
Let $x$ be the point at which we wish to estimate the density, and let $K$ be a weighting kernel centered at $x$ (for instance a Gaussian of fixed width, or of width a multiple of the distance to the $k$-th neighbor). We define the product density of two densities $p$ and $q$ as their normalized pointwise product. Starting from this expression, and using approximations of the empirical density, we construct a local density estimator: the training points in the vicinity of $x$, weighted by the kernel $K$, form a weighted sample, and we fit to this weighted sample the parameters of a simple model that is valid near $x$. In high dimension, a suitable simple model is a potentially flattened Gaussian (of the same kind as those used in Manifold Parzen), rather than a spherical one. It then suffices to evaluate this local model at $x$ to obtain the density estimate there.

The resulting estimator shares some similarities with the locally weighted likelihood models proposed in [49, 62] for density estimation. The approach is nevertheless sufficiently different (the two cited likelihood models use a rather particular form of likelihood weighting) to deserve closer study, and it corresponds, more than Manifold Parzen does, to the idea we have of a probabilistic counterpart to the HKNN algorithm. Moreover, its memory and computation time requirements are comparable to those of HKNN, and much less constraining than those of Manifold Parzen.
14.4 Conclusion
Through the three articles presented, we have shown that it is possible to develop kernel methods, without resorting to the kernel trick, that are
capable of equaling or surpassing the performance of SVMs. From this point of view, one can now say that kernel SVMs have much in common with prototype-based, or memory-based, methods, not only through the similar analytical form of their solution, but also through the performance attained. For the most part, the algorithms we have developed are simple variants of classical algorithms; nevertheless the proposed improvements have allowed a considerable gain in performance on certain high dimensional problems. Thus it is no longer possible to claim as easily as before that SVMs are far superior to algorithms of the KNN type, for example. From this point of view, this thesis is an attempt at rehabilitating classical non-parametric methods, and an invitation to the machine learning research community to examine them more closely. But the most important point to retain is that a very large performance improvement could be obtained over KNN and Parzen, simply by modifying them to take into account the local directions appearing in the data of a neighborhood, as suggested by the manifold hypothesis, according to which the data would be mostly concentrated along manifolds of lower dimension. Deepening the implications of this manifold hypothesis seems to be the best hope we have of beating the curse of dimensionality. We have shown two examples of the success of this approach, and have suggested some avenues for going further, but there are certainly many other ways to incorporate this notion into statistical learning algorithms, old and new.
Bibliography
[1] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821837, 1964. [2] J. Aldrich. R.A. Fisher and the making of maximum likelihood 1912-22. Technical Report 9504, University of Southampton, Department of Economics, 1995. [3] S. Amari and S. Wu. Improving support vector machine classiers by modifying kernel functions. Neural Networks, 1999. to appear. [4] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning for control. Articial Intelligence Review, 11:75113, 1997. [5] P. Baldi and Y. Chauvin. Neural networks for ngerprint recognition. Neural Computation, 5(3):402418, 1993. [6] J. Baxter. The canonical distortion measure for vector quantization and function approximation. In Proc. 14th International Conference on Machine Learning, pages 3947. Morgan Kaufmann, 1997.
150 [7] J. Baxter and P. Bartlett. The canonical distortion measure in feature space and 1-NN classication. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems, volume 10. MIT Press, 1998. [8] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, New Jersey, 1961. [9] Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 932938. MIT Press, 2001. [10] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:11371155, 2003. [11] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144152, Pittsburgh, 1992. [12] L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L. Jackel, Y. LeCun, U. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of classier methods: a case study in handwritten digit recognition. In International Conference on Pattern Recognition, Jerusalem, Israel, 1994. [13] L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4(6):888900, 1992.
151 [14] M. Brand. Charting a manifold. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. The MIT Press, 2003. [15] J. Bromley, J. Benz, L. Bottou, I. Guyon, L. Jackel, Y. LeCun, C. Moore, E. Sackinger, and R. Shah. Signature verication using a siamese time delay neural network. In Advances in Pattern Recognition Systems using Neural Network Technologies, pages 669687. World Scientic, Singapore, 1993. [16] C. J. C. Burges and B. Schlkopf. Improving the accuracy and speed of support vector machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 375. MIT Press, 1997. [17] N. Chapados, Y. Bengio, P. Vincent, J. Ghosn, C. Dugas, I. Takeuchi, and L. Meng. Estimating car insurance premia: a case study in highdimensional data inference. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, Cambridge, MA, 2002. The MIT Press. [18] O. Chapelle, P. Haffner, and V. Vapnik. Svms for histogram-based image classication. IEEE Transactions on Neural Networks, 1999. accepted, special issue on Support Vectors. [19] O. Chapelle, J. Weston, L. Bottou, and V. Vapnik. Vicinal risk minimization. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 416422, 2001.
152 [20] S. Chen. Basis Pursuit. PhD thesis, Department of Statistics, Stanford University, 1995. [21] S. Chen, F. Cowan, and P. Grant. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on Neural Networks, 2(2):302309, 1991. [22] C. Cortes and V. Vapnik. Soft margin classiers. Machine Learning, 20:273297, 1995. [23] A. Courant and D. Hilbert. Methods of Mathematical Physics. Wiley Interscience, New York, 1951. [24] T. Cover and P. Hart. Nearest neighbor pattern classication. IEEE Transactions on Information Theory, 13(1):2127, 1967. [25] I. J. Cox, J. Ghosn, and P. N. Yianilos. Feature-based face recognition using mixture-distance. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 209216, 1996. [26] J. Dahmen, D. Keysers, M. Pitz, and H. Ney. Structured covariance matrices for statistical image object recognition. In 22nd Symposium of the German Association for Pattern Recognition, Kiel, Germany, 2000. [27] G. Davis, S. Mallat, and Z. Zhang. Adaptive time-frequency decompositions. Optical Engineering, 33(7):21832191, July 1994. [28] V. de Silva and J. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, and K. Obermayer,
153 editors, Advances in Neural Information Processing Systems, volume 15, pages 705712, Cambridge, MA, 2003. The MIT Press. [29] C. Dugas, Y. Bengio, N. Chapados, P. Vincent, G. Denoncourt, and C. Fournier. Intelligent Techniques for the Insurance Industry, chapter Statistical Learning Algorithms Applied to Automobile Insurance Ratemaking. World Scientic, 2003. [30] R. A. Fisher. On an absolute citerion for frequency curves. Messenger of Mathematics, 41:155160, 1912. [31] R. A. Fisher. Frequency distribution of the values of the correlation coefcient in samples from an indenitely large population. Biometrika, 10:507 521, 1915. [32] R. A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, A222:309368, 1922. [33] R. A. Fisher. Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22:700725, 1925. [34] S. Floyd and M. Warmuth. Sample compression, learnability, and the vapnik-chervonenkis dimension. Machine Learning, 21(3):269304, 1995. [35] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of Thirteenth International Conference, pages 148156, 1996.
154 [36] Y. Freund and R. E. Schapire. A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Science, 55(1):119139, 1997. [37] Y. Freund and R. E. Schapire. Large margin classication using the perceptron algorithm. In Proc. 11th Annu. Conf. on Comput. Learning Theory, pages 209217. ACM Press, New York, NY, 1998. [38] J. Friedman. Flexible metric nearest neighbor classication. Technical Report 113, Stanford University Statistics Department, 1994. [39] J. Friedman. Greedy function approximation: a gradient boosting machine, 1999. IMS 1999 Reitz Lecture, February 24, 1999, Dept. of Statistics, Stanford University. [40] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Technical report, August 1998, Department of Statistics, Stanford University, 1998. [41] J. H. Friedman and W. Stuetzle. Projection pursuit regression. J. American Statistical Association, 76(376):817823, Dec. 1981. [42] S. Gallant. Optimal linear discriminants. In Eighth International Conference on Pattern Recognition, pages 849852, Paris 1986, 1986. IEEE, New York. [43] Z. Ghahramani and M. J. Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. MIT Press.
[44] Z. Ghahramani and G. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Dept. of Comp. Sci., Univ. of Toronto, 1996.
[45] T. Graepel, R. Herbrich, and J. Shawe-Taylor. Generalization error bounds for sparse linear classifiers. In Thirteenth Annual Conference on Computational Learning Theory, 2000. Morgan Kaufmann.
[46] S. Gunn and J. Kandola. Structural modelling with sparse kernels. Machine Learning, special issue on New Methods for Model Combination and Model Selection, to appear, 2001.
[47] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification and regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 409–415. The MIT Press, 1996.
[48] G. Hinton, M. Revow, and P. Dayan. Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pages 1015–1022. MIT Press, Cambridge, MA, 1995.
[49] N. L. Hjort and M. C. Jones. Locally parametric nonparametric density estimation. Annals of Statistics, 24(4):1619–1647, 1996.
[50] A. Izenman. Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86(413):205–224, 1991.
[51] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers, 1998.
[52] I. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[53] D. Keysers, J. Dahmen, and H. Ney. A probabilistic view on tangent distance. In 22nd Symposium of the German Association for Pattern Recognition, Kiel, Germany, 2000.
[54] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:82–95, 1971.
[55] G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[56] N. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15, pages 609–616. The MIT Press, 2003.
[57] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Handwritten digit recognition with a back-propagation network. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 396–404, Denver, CO, 1990. Morgan Kaufmann, San Mateo.
[58] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[59] Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller. Efficient backprop. In G. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.
[60] S. Li and J. Lu. Face recognition using the nearest feature line method. IEEE Transactions on Neural Networks, 10(2):439–443, 1999.
[61] N. Littlestone and M. Warmuth. Relating data compression and learnability, 1986. Unpublished manuscript, University of California Santa Cruz. An extended version can be found in (Floyd and Warmuth 95).
[62] C. R. Loader. Local likelihood density estimation. Annals of Statistics, 24(4):1602–1618, 1996.
[63] D. Lowe. Similarity metric learning for a variable-kernel classifier. Neural Computation, 7(1):72–85, 1995.
[64] S. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Trans. Signal Proc., 41(12):3397–3415, Dec. 1993.
[65] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 512–518. MIT Press, 2000.
[66] J. Myles and D. Hand. The multi-class measure problem in nearest neighbour discrimination rules. Pattern Recognition, 23:1291–1297, 1990.
[67] E. A. Nadaraya. On nonparametric estimates of density functions and regression curves. Theory of Applied Probability, 10:186–190, 1965.
[68] C. Nadeau and Y. Bengio. Inference for the generalization error. In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 307–313. MIT Press, 2000.
[69] D. Ormoneit and T. Hastie. Optimal kernel shapes for local linear regression. In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000.
[70] G. Orr and K.-R. Müller, editors. Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science. Springer-Verlag Inc., New York, NY, USA, 1998.
[71] E. Parzen. On the estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1064–1076, 1962.
[72] Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pages 40–44, Nov. 1993.
[73] T. Poggio and F. Girosi. A sparse representation for function approximation. Neural Computation, 10(6):1445–1454, 1998.
[74] M. Pontil and A. Verri. Properties of support vector machines. Technical Report AI Memo 1612, MIT, 1998.
[75] M. Powell. Radial basis functions for multivariable interpolation: A review, 1987.
[76] C. Rasmussen, R. Neal, G. Hinton, D. van Camp, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996. DELVE can be found at http://www.cs.toronto.edu/~delve.
[77] F. Rosenblatt. The perceptron: a perceiving and recognizing automaton. Technical Report 85-460-1, Cornell Aeronautical Laboratory, Ithaca, N.Y., 1957.
[78] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, Dec. 2000.
[79] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
[80] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Technical Report 44, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 1996.
[81] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[82] B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45:2758–2765, 1997.
[83] M. Seeger, C. Williams, and N. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Workshop on AI and Statistics, volume 9, 2003.
[84] J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
[85] R. D. Short and K. Fukunaga. The optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, 27:622–627, 1981.
[86] P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 50–58, Denver, CO, 1993. Morgan Kaufmann, San Mateo.
[87] P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition: tangent distance and tangent propagation. Lecture Notes in Computer Science, 1524, 1998.
[88] Y. Singer. Leveraged vector machines. In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 610–616. MIT Press, 2000.
[89] A. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In P. Langley, editor, International Conference on Machine Learning, pages 911–918, San Francisco, 2000. Morgan Kaufmann.
[90] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, 2001. To appear.
[91] A. J. Smola, T. Friess, and B. Schölkopf. Semiparametric support vector and linear programming machines. In M. Kearns, S. Solla, and D. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 585–591. MIT Press, 1999.
[92] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36:111–147, 1974.
[93] R. Sutton and A. Barto. An Introduction to Reinforcement Learning. MIT Press, 1998.
[94] Y. W. Teh and S. Roweis. Automatic alignment of local representations. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. The MIT Press, 2003.
[95] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, Dec. 2000.
[96] S. Thrun and T. Mitchell. Learning one more thing. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), San Mateo, CA, Aug. 1995. Morgan Kaufmann.
[97] A. Tikhonov and V. Arsenin. Solutions of Ill-posed Problems. W.H. Winston, Washington D.C., 1977.
[98] M. Tipping. The relevance vector machine. In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 652–658. MIT Press, 2000.
[99] M. Tipping and C. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443–482, 1999.
[100] S. Tong and D. Koller. Restricted Bayes optimal classifiers. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI), pages 658–664, Austin, Texas, 2000.
[101] K. Tsuda. Optimal hyperplane classifier based on entropy number bound. In ICANN'99, pages 419–424, 1999.
[102] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[103] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[104] P. Vincent and Y. Bengio. A neural support vector network architecture with adaptive kernels. In Proceedings of the International Joint Conference on Neural Networks, IJCNN 2000, volume 5, pages 5187–5192, 2000.
[105] P. Vincent and Y. Bengio. K-local hyperplane and convex distance nearest neighbor algorithms. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, Cambridge, MA, 2002. The MIT Press.
[106] J. Weston, A. Gammerman, M. Stitson, V. Vapnik, V. Vovk, and C. Watkins. Density estimation using support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 293–306, Cambridge, MA, 1999. MIT Press.
163 [107] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8. The MIT Press, 1995. [108] D. R. Wilson and T. R. Martinez. Instance pruning techniques. In Proc. 14th International Conference on Machine Learning, pages 403411. Morgan Kaufmann, 1997. [109] P. N. Yianilos. Metric learning via normal mixtures. Technical report, NEC Research Institute, Princeton, NJ, October 1995. [110] B. Zhang. Is the maximal margin hyperplane special in a feature space? Technical Report HPL-2001-89, Hewlett-Packards Labs, 2001.