Conference paper, Year: 2021

Deep Reinforcement Learning for Web Crawling

Abstract

A search engine uses a web crawler to fetch pages from the World Wide Web (WWW) and aims to keep its local cache as fresh as possible. Unfortunately, the rates at which different pages change on the WWW are highly nonuniform and, in many real-life scenarios, unknown. In addition, the finite available bandwidth and possible server restrictions on crawling frequency make it very difficult for the crawler to find the optimal scheduling policy, that is, the one that maximises the freshness of the local cache. We model this problem in a restless multi-armed bandit framework, where each arm represents a web page or an aggregate of statistically identical web pages. The objective is to find the scheduling policy that gives the exact indices of the pages to be crawled at a particular instant. We provide an online learning scheme based on a deep reinforcement learning (DRL) framework, which learns the unknown page change dynamics on the fly along with the optimal crawling policy. Finally, we run numerical simulations to compare our approach with state-of-the-art algorithms such as static optimisation and Thompson sampling. In these simulations, the DRL approach performs better.
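To make the scheduling problem concrete, here is a minimal sketch of the setting the abstract describes: pages change at different rates, the crawler can refresh only a limited number of pages per step, and a policy returns the exact indices to crawl. This is not the paper's DRL scheme. The function names, the discrete-time change model with per-step change probabilities, and the two toy policies are all assumptions made for illustration. In particular, `staleness_policy` assumes the change probabilities are known, whereas the paper treats them as unknown and learns them online.

```python
import random


def simulate(change_probs, budget, policy, steps=2000, seed=0):
    """Return the average freshness: the mean fraction of cached pages
    that match the live page, measured after each crawl round.

    Assumed model (illustrative only): in every step, page i changes
    independently with probability change_probs[i].
    """
    rng = random.Random(seed)
    n = len(change_probs)
    fresh = [True] * n   # does the cached copy match the live page?
    age = [0] * n        # steps since each page was last crawled
    total = 0.0
    for _ in range(steps):
        # Each arm evolves whether or not it is crawled (the "restless" part).
        for i in range(n):
            age[i] += 1
            if rng.random() < change_probs[i]:
                fresh[i] = False
        # The policy returns the indices to crawl, at most `budget` of them.
        for i in policy(age, change_probs, budget, rng):
            fresh[i] = True
            age[i] = 0
        total += sum(fresh) / n
    return total / steps


def uniform_policy(age, change_probs, budget, rng):
    """Baseline: crawl `budget` pages chosen uniformly at random."""
    return rng.sample(range(len(age)), budget)


def staleness_policy(age, change_probs, budget, rng):
    """Greedy heuristic: crawl the pages most likely to have changed.

    Uses P(changed since last crawl) = 1 - (1 - p) ** age, which requires
    knowing p. This is a simplifying assumption, not the paper's setting.
    """
    score = [1 - (1 - p) ** a for p, a in zip(change_probs, age)]
    ranked = sorted(range(len(age)), key=lambda i: score[i], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.1, 0.05, 0.01, 0.01]  # hypothetical change rates
    for name, pol in [("uniform", uniform_policy),
                      ("staleness", staleness_policy)]:
        print(name, round(simulate(probs, budget=2, policy=pol), 3))
```

Running the script prints the average freshness of each toy policy under the same random seed. A learning-based scheduler such as the one in the paper would have to estimate the change dynamics from crawl outcomes instead of reading them from `change_probs`.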
Main file
Deep_RL_Crawling_ICC21_Author.pdf (2.86 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03461189, version 1 (01-12-2021)

Identifiers

  • HAL Id: hal-03461189, version 1

Cite

Konstantin Avrachenkov, Vivek Borkar, Kishor Patil. Deep Reinforcement Learning for Web Crawling. ICC 2021 - 7th Indian Control Conference, Dec 2021, Mumbai, India. ⟨hal-03461189⟩