Word Embedding and Statistical Based Methods for Rapid Induction of Multiple Taxonomies - INRIA - Institut National de Recherche en Informatique et en Automatique
Preprint, Working Paper Year: 2016

Word Embedding and Statistical Based Methods for Rapid Induction of Multiple Taxonomies

Abstract

In this paper we present two methodologies for rapidly inducing multiple subject-specific taxonomies from crawled data. The first method builds the taxonomy from sentence-level word co-occurrence frequencies, while the second bootstraps a Word2Vec-based algorithm with a directed crawler. We exploit DMOZ, the multilingual open-content directory of the World Wide Web, to seed the crawl, and the domain name to direct the crawl. The resulting domain corpus is then input to our algorithm, which automatically induces taxonomies. The induced taxonomies provide hierarchical semantic dimensions for the purposes of faceted browsing. As part of an ongoing personal semantics project, we applied the resulting taxonomies to personal social media data (Twitter, Gmail, Facebook, Instagram, Flickr) with the objective of enhancing an individual's exploration of their personal information through faceted searching. We also perform a comprehensive corpus-based evaluation of the algorithms on many datasets drawn from the fields of medicine (diseases) and leisure (hobbies) and show that the induced taxonomies are of high quality.
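To make the first idea concrete, the following is a minimal sketch of how sentence-level co-occurrence counts could be turned into parent/child links. It is not the authors' algorithm, which the abstract does not specify. It assumes a simple subsumption heuristic: a term is attached to a parent when the parent appears in most sentences that contain the term, but not the other way round. The function names, the `vocabulary` filter, and the `threshold` value are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations
import re


def sentence_cooccurrence(sentences, vocabulary):
    """Count sentences per term and sentences per (sorted) term pair.

    `vocabulary` is an assumed, pre-selected set of lowercase domain terms.
    """
    term_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        terms = sorted(set(words) & vocabulary)
        term_freq.update(terms)
        pair_freq.update(combinations(terms, 2))
    return term_freq, pair_freq


def induce_parents(term_freq, pair_freq, threshold=0.8):
    """Pick one parent per term using a subsumption heuristic (an assumption).

    `parent` subsumes `child` when P(parent | child) >= threshold and
    P(child | parent) < P(parent | child). Among candidates, the parent
    with the highest co-occurrence count is kept.
    """
    best = {}
    for (a, b), both in pair_freq.items():
        for child, parent in ((a, b), (b, a)):
            p_parent_given_child = both / term_freq[child]
            p_child_given_parent = both / term_freq[parent]
            if (p_parent_given_child >= threshold
                    and p_child_given_parent < p_parent_given_child):
                current = best.get(child)
                if current is None or both > current[1]:
                    best[child] = (parent, both)
    return {child: parent for child, (parent, _) in best.items()}


if __name__ == "__main__":
    corpus = [
        "Influenza is a viral disease",
        "Measles is a disease",
        "A disease can spread",
        "Influenza is a disease that spreads",
    ]
    tf, pf = sentence_cooccurrence(corpus, {"disease", "influenza", "measles"})
    print(induce_parents(tf, pf))
```

On this toy corpus, "disease" occurs in every sentence that mentions "influenza" or "measles", so both terms are attached under it, while "disease" itself receives no parent. A real crawled corpus would need term selection and noise filtering that this sketch leaves out.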
Main file
IJAI2016LM_GG_4June2016.pdf (3.02 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-01334236, version 1 (20-06-2016)

License

Copyright (All rights reserved)

Identifiers

  • HAL Id: hal-01334236, version 1

Cite

Lawrence Muchemi, Gregory Grefenstette. Word Embedding and Statistical Based Methods for Rapid Induction of Multiple Taxonomies. 2016. ⟨hal-01334236⟩
301 views
344 downloads
