Conference paper. Year: 2021

TREMoLo-Tweets: a Multi-Label Corpus of French Tweets for Language Register Characterization

Abstract

The casual, neutral, and formal language registers are highly perceptible in discourse productions. However, they are still poorly studied in Natural Language Processing (NLP), especially outside English and for new textual types like tweets. To stimulate research, this paper introduces a large corpus of 228,505 French tweets (6M words) annotated with language registers. Labels are provided by a multi-label CamemBERT classifier trained and checked on a manually annotated subset of the corpus, while the tweets are selected to avoid undesired biases. Based on the corpus, an initial analysis of linguistic traits, obtained from either human annotators or automatic extraction, is provided to describe the corpus and pave the way for various NLP tasks. The corpus, annotation guide, and classifier are available at http://tremolo.irisa.fr.
Main file
TREMoLo_Tweets__a_Multi_Label_Corpus_of_French_Tweets_for_Language_Register_Characterization-1.pdf (620.65 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03331738, version 1 (2 September 2021)

Identifiers

  • HAL Id: hal-03331738, version 1

Cite

Jade Mekki, Gwénolé Lecorvé, Delphine Battistelli, Nicolas Béchet. TREMoLo-Tweets: a Multi-Label Corpus of French Tweets for Language Register Characterization. RANLP 2021 - Recent Advances in Natural Language Processing, Sep 2021, Varna, Bulgaria. ⟨hal-03331738⟩
