Reinforcement Learning for Markovian Bandits: Is Posterior Sampling more Scalable than Optimism?
Preprint, working paper. Year: 2022

Reinforcement Learning for Markovian Bandits: Is Posterior Sampling more Scalable than Optimism?

Abstract

We study learning algorithms for the classical Markovian bandit problem with discount. We explain how to adapt PSRL [24] and UCRL2 [2] to exploit the problem structure. These variants are called MB-PSRL and MB-UCRL2. While the regret bound and runtime of vanilla implementations of PSRL and UCRL2 are exponential in the number of bandits, we show that the episodic regret of MB-PSRL and MB-UCRL2 is Õ(S√(nK)), where K is the number of episodes, n is the number of bandits and S is the number of states of each bandit (the exact dependence on S, n and K is given in the paper). Up to a factor of √S, this matches the lower bound of Ω(√(SnK)) that we also derive in the paper. MB-PSRL is also computationally efficient: its runtime is linear in the number of bandits. We further show that this linear runtime cannot be achieved by adapting classical non-Bayesian algorithms such as UCRL2 or UCBVI to Markovian bandit problems. Finally, we perform numerical experiments that confirm that MB-PSRL outperforms other existing algorithms in practice, both in terms of regret and of computation time.
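The abstract attributes MB-PSRL's linear runtime to the fact that the problem factorizes per bandit. The sketch below is a minimal illustration of that idea, not the authors' exact algorithm: it assumes Dirichlet posteriors over each bandit's transition rows and a simplified reward posterior, and omits the planning step. Its only purpose is to show that the posterior-sampling step touches each bandit independently, so its cost grows linearly in the number of bandits n.

```python
# Hedged sketch of per-bandit posterior sampling (assumed Dirichlet priors on
# transitions and a simplified reward posterior; not the paper's exact model).
import numpy as np

rng = np.random.default_rng(0)

def sample_bandit_model(visit_counts, reward_sums):
    """Sample one bandit's transition matrix and mean rewards from its posterior.

    visit_counts: (S, S) array of observed transition counts for this bandit.
    reward_sums:  (S,) array of cumulated rewards observed in each state.
    """
    S = visit_counts.shape[0]
    # Dirichlet(1 + counts) posterior sample for each row of the transition matrix.
    P = np.vstack([rng.dirichlet(1.0 + visit_counts[s]) for s in range(S)])
    # Crude posterior sample of the mean reward per state (illustrative only).
    n_visits = visit_counts.sum(axis=1)
    r = (reward_sums + rng.normal(size=S)) / (n_visits + 1.0)
    return P, r

def sample_all_bandits(counts_list, rewards_list):
    """One posterior sample per bandit: the loop runs once per bandit,
    so the sampling cost is linear in n = len(counts_list)."""
    return [sample_bandit_model(c, r) for c, r in zip(counts_list, rewards_list)]
```

Per the abstract, the planning step of MB-PSRL also avoids the exponential blow-up in n; the details of how the sampled per-bandit models are turned into a policy are given in the paper.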
Main file: mbpsrl.pdf (1.28 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03262006 , version 1 (16-06-2021)
hal-03262006 , version 2 (02-05-2022)
hal-03262006 , version 3 (09-02-2023)

Identifiers

Cite

Nicolas Gast, Bruno Gaujal, Kimang Khun. Reinforcement Learning for Markovian Bandits: Is Posterior Sampling more Scalable than Optimism?. 2022. ⟨hal-03262006v2⟩
226 views
209 downloads

