Preprint / Working paper, Year: 2018

Upper Confidence Reinforcement Learning exploiting state-action equivalence

Odalric-Ambrym Maillard
Mahsa Asadi

Abstract

Leveraging an equivalence property on the set of states or state-action pairs of a Markov Decision Process (MDP) has been suggested by many authors. We take the study of equivalence classes to the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known, in a discrete MDP with average-reward criterion and no reset. We study powerful notions of similarity between state-action pairs related to optimal transport. We first analyze a variant of the UCRL2 algorithm called C-UCRL2, which highlights the clear benefit of leveraging this equivalence structure when it is known ahead of time: its regret bound scales as Õ(D√(KCT)), where C is the number of classes of equivalent state-action pairs and K bounds the size of the support of the transitions. A non-trivial question is whether this benefit can still be observed when the structure is unknown and must be learned while minimizing the regret. We propose a sound clustering technique that provably learns the unknown classes, but show that its natural combination with UCRL2 empirically fails. Our findings suggest this is due to the ad-hoc criterion for stopping episodes in UCRL2. We replace it with hypothesis testing, which in turn considerably improves all strategies. We then validate empirically that learning the structure can be beneficial in a full-blown RL problem.
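The core mechanism behind C-UCRL2, as the abstract describes it, is that transition statistics can be pooled across equivalent state-action pairs, so confidence sets shrink with the class visit count rather than the per-pair count. Below is a minimal Python sketch of this pooling idea for a known partition; the class name PooledTransitionModel, its method names, and the constant in the confidence radius are illustrative assumptions in the UCRL2 style, not taken from the paper.

    # Minimal sketch (not the authors' code): pool transition counts over a
    # known equivalence partition of state-action pairs, so C classes play
    # the role that S*A individual pairs play in plain UCRL2.
    from collections import defaultdict
    import math

    class PooledTransitionModel:
        def __init__(self, n_states, class_of):
            """class_of maps each (state, action) pair to its class id."""
            self.n_states = n_states
            self.class_of = class_of
            self.counts = defaultdict(lambda: defaultdict(int))  # class -> next state -> count
            self.totals = defaultdict(int)                       # class -> total visits

        def update(self, s, a, s_next):
            # One observed transition updates the statistics of the whole class.
            c = self.class_of[(s, a)]
            self.counts[c][s_next] += 1
            self.totals[c] += 1

        def estimate(self, s, a):
            """Empirical transition distribution, pooled over the class of (s, a)."""
            c = self.class_of[(s, a)]
            n = max(self.totals[c], 1)
            return {s2: k / n for s2, k in self.counts[c].items()}

        def confidence_radius(self, s, a, t, delta=0.05):
            # UCRL2-style L1 radius, but with the class count in the denominator;
            # the constant 14 is the one used in UCRL2 and is only illustrative here.
            c = self.class_of[(s, a)]
            n = max(self.totals[c], 1)
            return math.sqrt(14 * self.n_states * math.log(2 * t / delta) / n)

For example, with class_of = {(0, 0): 0, (1, 0): 0, (2, 0): 1}, an observation from (0, 0) and one from (1, 0) both tighten the confidence set of class 0, which is the source of the √C (rather than √(SA)) dependence in the stated regret bound.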
Main file

UCRL_Classes_HAL.pdf (609.03 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01945034, version 1 (05-12-2018)

Identifiers

  • HAL Id: hal-01945034, version 1

Cite

Odalric-Ambrym Maillard, Mahsa Asadi. Upper Confidence Reinforcement Learning exploiting state-action equivalence. 2018. ⟨hal-01945034⟩
144 Views
623 Downloads
