TREMoLo-Tweets: a Multi-Label Corpus of French Tweets for Language Register Characterization

Jade Mekki; Gwénolé Lecorvé; Delphine Battistelli; Nicolas Béchet

Conference paper. Year: 2021

Jade Mekki

Role: Author
PersonId: 1236982
IdHAL: jade-mekki
ORCID: 0009-0009-1725-1133

Institut de Recherche en Informatique et Systèmes Aléatoires

Modèles, Dynamiques, Corpus

Gwénolé Lecorvé

Role: Author
PersonId: 20677
IdHAL: gwenole-lecorve
ORCID: 0000-0002-4271-2087
IdRef: 150245254

Institut de Recherche en Informatique et Systèmes Aléatoires

Orange Labs [Lannion]

Delphine Battistelli

Role: Author
PersonId: 89
IdHAL: delphine-battistelli
IdRef: 060895217

Modèles, Dynamiques, Corpus

Nicolas Béchet

Role: Author
PersonId: 181774
IdHAL: nicolas-bechet
ORCID: 0000-0001-9425-5570
IdRef: 142928879

Institut de Recherche en Informatique et Systèmes Aléatoires

Abstract

The casual, neutral, and formal language registers are highly perceptible in discourse productions. However, they are still poorly studied in Natural Language Processing (NLP), especially outside English and for new textual types such as tweets. To stimulate research, this paper introduces a large corpus of 228,505 French tweets (6M words) annotated with language registers. Labels are provided by a multi-label CamemBERT classifier trained and checked on a manually annotated subset of the corpus, while the tweets are selected to avoid undesired biases. Based on the corpus, an initial analysis of linguistic traits from either human annotators or automatic extractions is provided to describe the corpus and pave the way for various NLP tasks. The corpus, annotation guide, and classifier are available at http://tremolo.irisa.fr.

Domains

Document and Text Processing; Linguistics; Machine Learning [stat.ML]

Main file

TREMoLo_Tweets__a_Multi_Label_Corpus_of_French_Tweets_for_Language_Register_Characterization-1.pdf (620.65 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-03331738

Submitted on: Thursday, September 2, 2021, 09:48:09

Last modified on: Thursday, December 21, 2023, 17:18:03

Long-term archiving on: Friday, December 3, 2021, 19:35:50

Dates and versions

hal-03331738, version 1 (02-09-2021)

Identifiers

HAL Id: hal-03331738, version 1

Cite

Jade Mekki, Gwénolé Lecorvé, Delphine Battistelli, Nicolas Béchet. TREMoLo-Tweets: a Multi-Label Corpus of French Tweets for Language Register Characterization. RANLP 2021 - Recent Advances in Natural Language Processing, Sep 2021, Varna, Bulgaria. ⟨hal-03331738⟩

Collections

INSTITUT-TELECOM UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA MODYCO CENTRALESUPELEC UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE UNIV-PARIS-LUMIERES IRISA_UBS_2 UR1-MATH-NUM UNIV-PARIS-NANTERRE

201 views

233 downloads
