Keyword Search in Heterogeneous Data Sources - INRIA - Institut National de Recherche en Informatique et en Automatique
Preprint, Working Paper. Year: 2020

Keyword Search in Heterogeneous Data Sources

Abstract

Data journalism is the field of investigative journalism that relies first and foremost on digital data. As more and more human activity leaves strong digital traces, data journalism is becoming increasingly important. Major journalism projects increasingly involve diverse data sources with heterogeneous data models, different structures, or no structure at all; the Offshore Leaks is a prime example. Inspired by our collaboration with Le Monde, a leading French newspaper, we designed a novel content management architecture, together with an algorithm for exploiting such heterogeneous corpora through keyword search: given a set of search terms, find links between them within and across the different datasets, which we interconnect in a graph. Our work recalls keyword search in structured and unstructured data, but data heterogeneity makes it computationally harder. We analyze the performance of our algorithm on real-life datasets.
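To make the search task concrete, here is a minimal Python sketch of keyword search over a graph that interconnects several datasets. It is not the algorithm of the paper: it swaps in a simple breadth-first heuristic that picks a root node with the smallest total hop distance to one match per keyword, then collects the connecting paths. The function names (`matches`, `bfs_from`, `connect_keywords`), the node identifiers, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import deque


def matches(labels, keyword):
    """Return the nodes whose text label contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return {node for node, text in labels.items() if kw in text.lower()}


def bfs_from(sources, adjacency):
    """Multi-source BFS: hop distance and parent pointer toward the nearest source."""
    dist = {s: 0 for s in sources}
    parent = {s: None for s in sources}
    queue = deque(sorted(sources))
    while queue:
        u = queue.popleft()
        for v in sorted(adjacency.get(u, ())):
            if v not in dist:
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent


def connect_keywords(labels, edges, keywords):
    """Return (root, edges) of a small tree linking one match per keyword, or None."""
    adjacency = {}
    for a, b in edges:  # treat every link as undirected
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    searches = [bfs_from(matches(labels, k), adjacency) for k in keywords]
    # Candidate roots are nodes reachable from some match of every keyword.
    candidates = [n for n in labels if all(n in dist for dist, _ in searches)]
    if not candidates:
        return None
    root = min(candidates, key=lambda n: (sum(dist[n] for dist, _ in searches), n))
    tree = set()
    for _, parent in searches:
        node = root
        while parent[node] is not None:  # walk back to the matching node
            tree.add(frozenset((node, parent[node])))
            node = parent[node]
    return root, tree


# Hypothetical toy corpus: nodes from a CSV file, a JSON document and a text note.
labels = {
    "csv:person": "Alice Martin",
    "csv:company": "Acme Holdings",
    "json:entity": "Acme Holdings Ltd",
    "json:address": "Panama City",
    "text:note": "unrelated note",
}
edges = [
    ("csv:person", "csv:company"),
    ("csv:company", "json:entity"),  # assumed cross-dataset link
    ("json:entity", "json:address"),
]
result = connect_keywords(labels, edges, ["alice", "panama"])
```

In this toy corpus, `result` holds a root node and the three edges that connect the node matching "alice" in the CSV data to the node matching "panama" in the JSON data. A search that includes a keyword with no match returns `None`. The heuristic only approximates a smallest connecting tree, which hints at why the problem becomes costly on large, heterogeneous graphs.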
Main file
submitted-CL.pdf (608.85 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02559688, version 1 (30-04-2020)

Identifiers

  • HAL Id: hal-02559688, version 1

Cite

Felipe Cordeiro, Helena Galhardas, Julien Leblay, Ioana Manolescu, Tayeb Merabti. Keyword Search in Heterogeneous Data Sources. 2020. ⟨hal-02559688⟩
107 views
283 downloads
