Preprint / Working paper, Year: 2018

Upper Confidence Reinforcement Learning exploiting state-action equivalence

Odalric-Ambrym Maillard
Mahsa Asadi

Abstract

Leveraging an equivalence property on the set of states or state-action pairs of a Markov Decision Process (MDP) has been suggested by many authors. We take the study of equivalence classes to the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known, in a discrete MDP with average-reward criterion and no reset. We study powerful notions of similarity between state-action pairs related to optimal transport. We first analyze a variant of the UCRL2 algorithm called C-UCRL2, which highlights the clear benefit of leveraging this equivalence structure when it is known ahead of time: its regret bound scales as Õ(D√(KCT)), where C is the number of classes of equivalent state-action pairs and K bounds the size of the support of the transitions. A non-trivial question is whether this benefit can still be observed when the structure is unknown and must be learned while minimizing the regret. We propose a sound clustering technique that provably learns the unknown classes, but show that its natural combination with UCRL2 empirically fails. Our findings suggest this is due to the ad-hoc criterion for stopping episodes in UCRL2. We replace it with hypothesis testing, which in turn considerably improves all strategies. We then validate empirically that learning the structure can be beneficial in a full-blown RL problem.
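The core mechanism behind C-UCRL2, as the abstract describes it, is that transition statistics can be pooled across equivalent state-action pairs, so confidence sets shrink with the class visit count rather than the per-pair count. Below is a minimal Python sketch of this pooling idea for a known partition; the class name PooledTransitionModel, its method names, and the constant in the confidence radius are illustrative assumptions in the UCRL2 style, not taken from the paper.

    # Minimal sketch (not the authors' code): pool transition counts over a
    # known equivalence partition of state-action pairs, so C classes play
    # the role that S*A individual pairs play in plain UCRL2.
    from collections import defaultdict
    import math

    class PooledTransitionModel:
        def __init__(self, n_states, class_of):
            """class_of maps each (state, action) pair to its class id."""
            self.n_states = n_states
            self.class_of = class_of
            self.counts = defaultdict(lambda: defaultdict(int))  # class -> next state -> count
            self.totals = defaultdict(int)                       # class -> total visits

        def update(self, s, a, s_next):
            # One observed transition updates the statistics of the whole class.
            c = self.class_of[(s, a)]
            self.counts[c][s_next] += 1
            self.totals[c] += 1

        def estimate(self, s, a):
            """Empirical transition distribution, pooled over the class of (s, a)."""
            c = self.class_of[(s, a)]
            n = max(self.totals[c], 1)
            return {s2: k / n for s2, k in self.counts[c].items()}

        def confidence_radius(self, s, a, t, delta=0.05):
            # UCRL2-style L1 radius, but with the class count in the denominator;
            # the constant 14 is the one used in UCRL2 and is only illustrative here.
            c = self.class_of[(s, a)]
            n = max(self.totals[c], 1)
            return math.sqrt(14 * self.n_states * math.log(2 * t / delta) / n)

For example, with class_of = {(0, 0): 0, (1, 0): 0, (2, 0): 1}, an observation from (0, 0) and one from (1, 0) both tighten the confidence set of class 0, which is the source of the √C (rather than √(SA)) dependence in the stated regret bound.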
Main file

UCRL_Classes_HAL.pdf (609.03 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01945034, version 1 (05-12-2018)

Identifiers

  • HAL Id: hal-01945034, version 1

Cite

Odalric-Ambrym Maillard, Mahsa Asadi. Upper Confidence Reinforcement Learning exploiting state-action equivalence. 2018. ⟨hal-01945034⟩
144 Views
623 Downloads
