ISSumSet: a tweet summarization dataset hidden in a TREC track - Université Toulouse III - Paul Sabatier - Toulouse INP
Conference paper, Year: 2021

ISSumSet: a tweet summarization dataset hidden in a TREC track

Abstract

A key issue for Twitter users is summarizing the continuous and overwhelming stream of information. Many approaches to tweet summarization have been proposed in the literature, but they are difficult to compare given the lack of a standard, accessible test collection. This absence may be due to the effort needed to construct such a (large) dataset. In this paper, we propose to capitalize on the dataset built for the TREC Incident Streams (IS) track, which was not intended for evaluating automatic summarization. We show why and how this dataset can be used for this purpose, focusing on extractive summarization. Specifically, when additional annotations are produced on a subset of the TREC IS dataset that carries particular initial assessors' annotations, this subset appears to satisfy the criteria identified in the literature for automatic summarization. To this end, we studied the original TREC IS dataset and then derived a subset summarizing each event, based on the initial assessors' annotations. We evaluate this subset against the aforementioned criteria. Finally, we tested several widely used state-of-the-art models for automatic text summarization on the proposed dataset, some designed for tweets and others adapted to tweet summarization. For easy reproducibility, the code used to build the dataset, our additional annotations, and the experiments run on the dataset are available on our GitHub.
File not deposited

Dates and versions

hal-03244354, version 1 (01-06-2021)

Cite

Alexis Dusart, Karen Pinel-Sauvagnat, Gilles Hubert. ISSumSet: a tweet summarization dataset hidden in a TREC track. 36th ACM/SIGAPP Symposium on Applied Computing (SAC 2021), Association for Computing Machinery - Special Interest Group on Applied Computing (SIGAPP), Mar 2021, Republic of Korea (virtual event), South Korea. pp.665-671, ⟨10.1145/3412841.3441946⟩. ⟨hal-03244354⟩