Visual Reasoning with Multi-hop Feature Modulation

Florian Strub; Mathieu Seurin; Ethan Perez; Harm de Vries; Jérémie Mary; Philippe Preux; Aaron Courville; Olivier Pietquin

Conference paper. Year: 2018

Florian Strub

Role: Author
PersonId: 18649
IdHAL: florian-strub
ORCID: 0000-0001-7271-5345

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Université de Lille

Sequential Learning

Mathieu Seurin

Role: Author
PersonId: 1039295

Université de Lille

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Sequential Learning

Ethan Perez

Role: Author
PersonId: 1023763

Rice University [Houston]

Harm de Vries

Role: Author

Department of Computer Science and Operations Research [Montreal]

Montreal Institute for Learning Algorithms [Montréal]

Jérémie Mary

Role: Author
PersonId: 740984
IdHAL: jeremie-mary

Criteo [Paris]

Philippe Preux

Role: Author
PersonId: 5488
IdHAL: preux-philippe
IdRef: 059896353

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Université de Lille

Sequential Learning

Aaron Courville

Role: Author
PersonId: 1011047

Montreal Institute for Learning Algorithms [Montréal]

CIFAR

Olivier Pietquin

Role: Author
PersonId: 4024
IdHAL: olivier-pietquin
ORCID: 0000-0002-5386-465X
IdRef: 142821861

Google Inc

Abstract

Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question-answering and visual dialogue. For such tasks, one successful approach is to condition image-based convolutional network computation on language via Feature-wise Linear Modulation (FiLM) layers, i.e., per-channel scaling and shifting. We propose to generate the parameters of FiLM layers going up the hierarchy of a convolutional network in a multi-hop fashion rather than all at once, as in prior work. By alternating between attending to the language input and generating FiLM layer parameters, this approach is better able to scale to settings with longer input sequences such as dialogue. We demonstrate that multi-hop FiLM generation achieves state-of-the-art results on the short-input-sequence task ReferIt, on par with single-hop FiLM generation, while also significantly outperforming both the prior state of the art and single-hop FiLM generation on the GuessWhat?! visual dialogue task.
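As a rough illustration of the two ideas named in the abstract, the sketch below shows per-channel scaling and shifting (FiLM) and a multi-hop loop that re-attends to the language input before producing each layer's parameters. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`film`, `attend`, `multi_hop_film_params`), the dot-product attention, the mean-pooled initial context, and the fixed index-based projection standing in for learned linear layers are all hypothetical choices made for illustration.

```python
import math


def film(feature_map, gamma, beta):
    """Apply FiLM to a C x H x W feature map: out[c] = gamma[c] * x[c] + beta[c]."""
    return [
        [[gamma[c] * value + beta[c] for value in row] for row in channel]
        for c, channel in enumerate(feature_map)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(context, word_states):
    """Dot-product attention (an assumption): weighted sum of word states."""
    scores = [sum(c * w for c, w in zip(context, state)) for state in word_states]
    weights = softmax(scores)
    dim = len(context)
    return [
        sum(a * state[d] for a, state in zip(weights, word_states))
        for d in range(dim)
    ]


def multi_hop_film_params(word_states, num_layers, num_channels):
    """Produce one (gamma, beta) pair per layer, attending to the language
    input again before each hop instead of emitting all parameters at once."""
    dim = len(word_states[0])
    # Assumption: start from the mean of the word states.
    context = [sum(s[d] for s in word_states) / len(word_states) for d in range(dim)]
    params = []
    for _ in range(num_layers):
        context = attend(context, word_states)
        # Hypothetical fixed projection standing in for learned linear layers.
        gamma = [1.0 + context[c % dim] for c in range(num_channels)]
        beta = [context[(c + 1) % dim] for c in range(num_channels)]
        params.append((gamma, beta))
    return params


# Two channels of size 2 x 2; channel 0 is doubled and shifted by 1,
# channel 1 is zeroed out and shifted by -1.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
modulated = film(features, gamma=[2.0, 0.0], beta=[1.0, -1.0])
# modulated == [[[3.0, 5.0], [7.0, 9.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
```

In a single-hop variant, the loop would be replaced by one projection of a fixed language embedding into all layers' parameters; the multi-hop form lets each layer's parameters depend on a freshly attended summary of the input.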

Keywords

Deep Learning; Computer Vision; Natural Language Understanding; Multi-modal Learning

Domains

Artificial intelligence [cs.AI]; Computer vision and pattern recognition [cs.CV]; Machine learning [cs.LG]; Neural networks [cs.NE]

Main file

1808.04446.pdf (5.26 MB)

Origin: Files produced by the author(s)

Contributor: Mathieu Seurin

https://hal.science/hal-01927811

Submitted on: Tuesday, November 20, 2018, 10:30:57

Last modified on: Wednesday, January 24, 2024, 09:54:22

Dates and versions

hal-01927811, version 1 (20-11-2018)

Identifiers

HAL Id: hal-01927811, version 1
arXiv: 1808.04446

Cite

Florian Strub, Mathieu Seurin, Ethan Perez, Harm de Vries, Jérémie Mary, et al. Visual Reasoning with Multi-hop Feature Modulation. ECCV 2018 - 15th European Conference on Computer Vision, Sep 2018, Munich, Germany. pp.808-831. ⟨hal-01927811⟩

Collections

CNRS INRIA CRISTAL INRIA2 CRISTAL-SEQUEL UNIV-LILLE
