Efficient Sequential Learning in Structured and Constrained Environments

Daniele Calandriello

Thesis. Year: 2017

Apprentissage séquentiel efficace dans des environnements structurés avec contraintes

Daniele Calandriello

Role: Author
PersonId: 1033194

Sequential Learning

Abstract

The main advantage of non-parametric models is that the accuracy of the model (degrees of freedom) adapts to the number of samples. The main drawback is the so-called "curse of kernelization": to learn the model we must first compute a similarity matrix among all samples, which requires quadratic space and time and is infeasible for large datasets. Nonetheless, the underlying effective dimension (effective d.o.f.) of the dataset is often much smaller than its size, and we can replace the dataset with a subset (dictionary) of highly informative samples. Unfortunately, fast data-oblivious selection methods (e.g., uniform sampling) almost always discard useful information, while data-adaptive methods that provably construct an accurate dictionary, such as ridge leverage score (RLS) sampling, have a quadratic time/space cost. In this thesis we introduce a new single-pass streaming RLS sampling approach that sequentially constructs the dictionary, where each step compares a new sample only with the current intermediate dictionary and not with all past samples. We prove that the size of all intermediate dictionaries scales only with the effective dimension of the dataset, and therefore guarantee a per-step time and space complexity independent of the number of samples. This reduces the overall time required to construct provably accurate dictionaries from quadratic to near-linear, or even logarithmic when parallelized. Finally, for many non-parametric learning problems (e.g., K-PCA, graph SSL, online kernel learning) we show that we can use the generated dictionaries to compute approximate solutions in near-linear time that are both provably accurate and empirically competitive.

The main advantage of non-parametric learning methods is that the number of degrees of freedom of the learned model adapts automatically to the number of samples. These methods are, however, limited by the "curse of kernelization": learning the model first requires building a similarity matrix among all samples. The complexity is then quadratic in time and space, which quickly becomes too costly for large datasets. However, the "effective" dimension of a dataset is often much smaller than the number of samples itself. It is then possible to replace the actual dataset with a reduced-size dataset (called a "dictionary") composed exclusively of informative samples. Unfortunately, dictionary-based methods with theoretical guarantees, such as "Ridge Leverage Score" (RLS) sampling, also have quadratic complexity. In this thesis we present a new RLS sampling method that updates the dictionary sequentially, comparing each new sample only with the current dictionary and not with the set of all past samples. We show that the size of all dictionaries constructed in this way is on the order of the effective dimension of the final dataset, thereby guaranteeing a per-step time and space complexity independent of the total number of samples. This method has the advantage that it can be parallelized. Finally, we show that many non-parametric learning problems can be solved approximately using our method.
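To make the quantities named in the abstract concrete, here is a minimal sketch in Python (NumPy), not taken from the thesis. It computes exact ridge leverage scores under the standard definition, the diagonal of K(K + γI)^{-1} for a kernel matrix K and regularization γ, together with the effective dimension, which is the sum of those scores. The Gaussian kernel, the parameter names `gamma` and `oversampling`, the helper names, and the independent-coin sampling rule are all illustrative assumptions. This is the quadratic-cost batch baseline the abstract refers to, not the streaming method the thesis introduces.

```python
import numpy as np


def rbf_kernel(X, sigma=1.0):
    """Gaussian (RBF) similarity matrix among all rows of X (assumed kernel)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative distances caused by floating-point round-off.
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))


def ridge_leverage_scores(K, gamma):
    """Exact RLS: diagonal of (K + gamma*I)^{-1} K.

    K and (K + gamma*I)^{-1} commute, so this equals diag(K (K + gamma*I)^{-1}).
    Needs the full n x n matrix, hence quadratic space and (at least) quadratic time.
    """
    n = K.shape[0]
    Z = np.linalg.solve(K + gamma * np.eye(n), K)
    return np.diag(Z).copy()


def sample_dictionary(scores, oversampling, rng):
    """Keep sample i independently with probability min(1, oversampling * score_i).

    This simple rule is an illustrative assumption, not the thesis's procedure.
    """
    p = np.minimum(1.0, oversampling * scores)
    keep = rng.random(scores.shape[0]) < p
    return np.flatnonzero(keep), p


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # synthetic data, for illustration only
K = rbf_kernel(X)
tau = ridge_leverage_scores(K, gamma=1.0)
d_eff = tau.sum()                    # effective dimension of the dataset
idx, p = sample_dictionary(tau, oversampling=2.0, rng=rng)
print(f"n = {K.shape[0]}, effective dimension = {d_eff:.2f}, dictionary size = {idx.size}")
```

Each score lies between 0 and 1, so the effective dimension never exceeds the number of samples. The expected dictionary size under this sampling rule is at most `oversampling` times the effective dimension.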

Keywords

Nystrom approximation; Nystrom-type algorithm; Sequential learning; Stochastic gradient method; Newton and quasi-Newton methods; Graph Laplacian spectrum; Semi-supervised learning; Kernel learning; Gaussian process; Low-rank matrix approximation; Distributed learning; Dictionary learning; Principal component analysis method; Online learning; Learning

Domains

Machine Learning [cs.LG]; Machine Learning [stat.ML]; Numerical Analysis [cs.NA]

Main file

main.pdf (2.12 MB)

Origin: Files produced by the author(s)

https://theses.hal.science/tel-01816904

Submitted on: Friday, June 15, 2018, 19:42:45

Last modified on: Wednesday, January 24, 2024, 09:54:23

Long-term archiving on: Monday, September 17, 2018, 10:41:24

Dates and versions

tel-01816904, version 1 (15-06-2018)

Identifiers

HAL Id: tel-01816904, version 1

Cite

Daniele Calandriello. Efficient Sequential Learning in Structured and Constrained Environments. Machine Learning [cs.LG]. Inria Lille Nord Europe - Laboratoire CRIStAL - Université de Lille, 2017. English. ⟨NNT : ⟩. ⟨tel-01816904⟩

Collections

CNRS INRIA CRISTAL INRIA2 CRISTAL-SEQUEL UNIV-LILLE

409 views

594 downloads
