Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web - INRIA - Institut National de Recherche en Informatique et en Automatique
Conference paper, Year: 2015

Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web

Abstract

In the Web of data, entities are described by inter-linked data rather than documents on the Web. In this work, we focus on entity resolution in the Web of data, i.e., identifying descriptions that refer to the same real-world entity. To reduce the required number of pairwise comparisons, methods for entity resolution perform blocking as a pre-processing step. A blocking technique places similar entity descriptions into blocks and executes comparisons only between descriptions within the same block. We experimentally evaluate blocking techniques proposed for the Web of data and present dataset characteristics that determine the effectiveness and efficiency of such methods. Furthermore, we analyze the characteristics of the missed matching entity descriptions and examine different types of links that blocking techniques can potentially identify.

I. INTRODUCTION

Nowadays, knowledge bases (KBs) offer comprehensive, machine-readable descriptions of a large variety of real-world entities (e.g., persons, places) published on the Web as Linked Data (LD). Although KBs (e.g., DBpedia, Freebase) may be derived from the same data source (e.g., Wikipedia), they may provide multiple descriptions of the same entities. This is mainly due to the different information extraction tools and curation policies [3] employed by KBs, resulting in complementary and sometimes conflicting descriptions. Entity resolution (ER) aims to identify descriptions that refer to the same entity within or across KBs [2], [4]. Compared to data warehouses, the new ER challenges stem from the openness of the Web of data in describing entities by an unbounded number of KBs, the semantic and structural diversity of the descriptions provided across domains even for the same entities, and the autonomy of KBs in terms of adopted processes for creating and curating descriptions.

In general, comparing two descriptions effectively, and deciding efficiently whether they refer to the same entity, is challenged by the scale, diversity, and graph structure of the descriptions in the Web. This requires an understanding of the relationships among somehow similar descriptions that goes beyond duplicate detection. Also, the huge volume of entity collections that we need to resolve in the Web makes comparing all descriptions pairwise prohibitive. In this context of big Web data, blocking is typically used as a pre-processing step for ER to reduce the number of required comparisons. After blocking, each description can be compared only to others placed within the same block. The desiderata of blocking are to place (i) similar
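
To make the idea of blocking concrete, here is a minimal Python sketch of token blocking, one common schema-agnostic approach in which every token appearing in a description's values defines a block. It is an illustration under stated assumptions, not necessarily the configuration evaluated in the paper: the function names `token_blocking` and `candidate_pairs`, the toy descriptions, and the `kb1:`/`kb2:` identifiers are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import re


def token_blocking(descriptions):
    """Map each token to the set of description ids whose values contain it."""
    blocks = defaultdict(set)
    for entity_id, attributes in descriptions.items():
        for value in attributes.values():
            # Attribute names are ignored: only tokens in the values matter.
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    # A block holding a single description yields no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct pairs of descriptions that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy descriptions from two knowledge bases with different schemas.
descriptions = {
    "kb1:Athens": {"label": "Athens", "country": "Greece"},
    "kb2:Athina": {"name": "Athens", "locatedIn": "Hellenic Republic"},
    "kb1:Paris": {"label": "Paris", "country": "France"},
    "kb2:Paris_FR": {"name": "Paris", "locatedIn": "French Republic"},
}

blocks = token_blocking(descriptions)
pairs = candidate_pairs(blocks)
all_pairs = len(descriptions) * (len(descriptions) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

In this toy input, the two matching pairs end up as candidates, and so does one non-matching pair that merely shares the token "republic". This is the trade-off blocking has to manage: fewer comparisons overall, at the cost of some superfluous comparisons and, on less similar descriptions, possibly missed matches.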
Main file
Big Data Entity Resolution.pdf (416.02 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01199399, version 1 (15-09-2015)

Cite

Vasilis Efthymiou, Kostas Stefanidis, Vassilis Christophides. Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web. 2015 IEEE International Conference on Big Data (IEEE BigData 2015), Oct 2015, Santa Clara, CA, United States. ⟨10.1109/BigData.2015.7363781⟩. ⟨hal-01199399⟩

Collections

INRIA INRIA2
270 views
718 downloads