Crowdsourcing Thousands of Specialized Labels: A Bayesian Active Training Approach

Maximilien Servajean; Alexis Joly; Dennis Shasha; Julien Champ; Esther Pacitti

doi:10.1109/TMM.2017.2653763

Journal article, IEEE Transactions on Multimedia, Year: 2017

Maximilien Servajean

Role: Author
PersonId: 169562
IdHAL: maximilien-servajean
ORCID: 0000-0002-9426-2583
IdRef: 196392160

Institut de Biologie Computationnelle

Alexis Joly

Role: Author
PersonId: 12088
IdHAL: alexis-joly
ORCID: 0000-0002-2161-9940
IdRef: 107969394

Scientific Data Management

Dennis Shasha

Role: Author
PersonId: 833427

Scientific Data Management

Julien Champ

Role: Author
PersonId: 738313
IdHAL: julien-champ
ORCID: 0000-0002-2042-0411

Scientific Data Management

Esther Pacitti

Role: Author
PersonId: 3271
IdHAL: esther-pacitti
ORCID: 0000-0003-1370-9943
IdRef: 117946451

Scientific Data Management

Abstract

Large-scale annotated corpora have yielded impressive performance improvements in computer vision and multimedia content analysis. However, such datasets depend on an enormous amount of human labeling effort. When the labels correspond to well-known concepts, it is straightforward to train the annotators by giving a few examples with known answers. It is also straightforward to judge the quality of their labels. Neither is true when there are thousands of complex domain-specific labels. Training on all labels is infeasible and the quality of an annotator's judgements may be vastly different for some subsets of labels than for others. This paper proposes a set of data-driven algorithms to 1) train image annotators on how to disambiguate among automatically generated candidate labels, 2) evaluate the quality of annotators' label suggestions, and 3) weigh predictions. The algorithms adapt to the skills of each annotator both in the questions asked and the weights given to their answers. The underlying judgements are Bayesian, based on adaptive priors. We measure the benefits of these algorithms on a live user experiment related to image-based plant identification involving around 1000 people. The proposed methods are shown to enable huge gains in annotation accuracy. A standard user can correctly label around 2% of our data. This goes up to 80% with machine learning assisted training and assignment and up to almost 90% when doing a weighted combination of several annotators' labels.
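The weighting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a generic naive-Bayes vote in which each annotator's accuracy is the posterior mean of a Beta prior updated on known-answer questions, and in which a wrong answer falls uniformly on the other candidate labels. The function names, the Beta(1, 1) prior, and the example annotators and labels are all hypothetical.

```python
import math


def posterior_accuracy(correct, total, prior_a=1.0, prior_b=1.0):
    """Posterior mean accuracy under a Beta(prior_a, prior_b) prior (assumed model)."""
    return (correct + prior_a) / (total + prior_a + prior_b)


def combine_votes(votes, accuracy, candidates):
    """Return a posterior distribution over candidate labels.

    votes:      dict mapping annotator -> label the annotator chose
    accuracy:   dict mapping annotator -> estimated accuracy, strictly in (0, 1)
    candidates: list of at least two candidate labels (uniform prior assumed)
    """
    k = len(candidates)
    log_scores = {}
    for label in candidates:
        score = 0.0
        for annotator, voted in votes.items():
            p = accuracy[annotator]
            # Correct with probability p; otherwise uniform over the k - 1 others.
            score += math.log(p) if voted == label else math.log((1.0 - p) / (k - 1))
        log_scores[label] = score
    # Normalize in a numerically stable way (subtract the max before exponentiating).
    top = max(log_scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {label: v / total for label, v in unnormalized.items()}


# Hypothetical example: one strong annotator disagrees with two weak ones.
accuracy = {
    "expert": posterior_accuracy(9, 10),
    "novice_1": posterior_accuracy(2, 10),
    "novice_2": posterior_accuracy(2, 10),
}
votes = {"expert": "A", "novice_1": "B", "novice_2": "B"}
posterior = combine_votes(votes, accuracy, ["A", "B", "C"])
best = max(posterior, key=posterior.get)
```

In this toy setting the single high-accuracy annotator outweighs the two-vote majority, which is the qualitative effect a skill-weighted combination is meant to produce. The adaptive priors and question selection described in the abstract are not modeled here.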

Keywords

Bayes methods; Taylor series; Parameter estimation; Crowdsourcing

Domains

Computer Science [cs]; Computing Environments for Human Learning; Machine Learning [cs.LG]

Main file

main_single.pdf (1.14 MB)

Origin: Files produced by the author(s)

Contributor: Alexis Joly

https://hal.science/hal-01629149

Submitted on: Thursday, November 9, 2017, 14:44:07

Last modified on: Friday, March 24, 2023, 14:53:05

Long-term archiving on: Saturday, February 10, 2018, 12:38:29

Dates and versions

hal-01629149, version 1 (09-11-2017)

Identifiers

HAL Id: hal-01629149, version 1
DOI: 10.1109/TMM.2017.2653763

Cite

Maximilien Servajean, Alexis Joly, Dennis Shasha, Julien Champ, Esther Pacitti. Crowdsourcing Thousands of Specialized Labels: A Bayesian Active Training Approach. IEEE Transactions on Multimedia, 2017, 19 (6), pp.1376-1391. ⟨10.1109/TMM.2017.2653763⟩. ⟨hal-01629149⟩

Collections

TICE CNRS INRIA INRA ZENITH LIRMM INRIA2 MIPS UNIV-MONTPELLIER INRAE

513 views

393 downloads