Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Bruno Scherrer

Conference paper, Year: 2013

Bruno Scherrer

Role: Author
PersonId : 1406
IdHAL : bruno-scherrer
IdRef : 073360708

Autonomous intelligent machine

Abstract

Given a Markov Decision Process (MDP) with $n$ states and $m$ actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $ O \left( \frac{ n m}{1-\gamma} \log \left( \frac{1}{1-\gamma} \right)\right) $ iterations, improving by a factor of $O(\log n)$ on a result by Hansen et al. (2013), while Simplex-PI terminates after at most $ O \left( \frac{n^2 m}{1-\gamma} \log \left( \frac{1}{1-\gamma} \right)\right) $ iterations, improving by a factor of $O(\log n)$ on a result by Ye (2011). Under some structural assumptions on the MDP, we then consider bounds that are independent of the discount factor $\gamma$: given a measure of the maximal transient time $\tau_t$ and the maximal time $\tau_r$ to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most $ \tilde O \left( n^3 m^2 \tau_t \tau_r \right) $ iterations. This generalizes a recent result for deterministic MDPs by Post & Ye (2012), in which $\tau_t \le n$ and $\tau_r \le n$. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned into two sets, namely states that are transient and states that are recurrent for all policies, we show that Simplex-PI and Howard's PI terminate after at most $ \tilde O(nm (\tau_t+\tau_r))$ iterations.
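To make the difference between the two update rules concrete, the following is a minimal sketch in Python with NumPy. It is an illustration only, not code from the paper. The array layout (`P[a, s, t]` for transition probabilities, `r[s, a]` for rewards), the function names, the tolerance `tol`, the choice of the advantage-maximizing action when switching, and the random test MDP are all assumptions made for this example.

```python
import numpy as np


def evaluate(P, r, gamma, policy):
    """Value of a deterministic policy: solve (I - gamma * P_pi) v = r_pi."""
    n = P.shape[1]
    idx = np.arange(n)
    P_pi = P[policy, idx]   # (n, n): row s is P[policy[s], s, :]
    r_pi = r[idx, policy]   # (n,): reward of the chosen action in each state
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)


def advantages(P, r, gamma, v):
    """adv[s, a] = r[s, a] + gamma * sum_t P[a, s, t] * v[t] - v[s]."""
    q = r + gamma * np.einsum("ast,t->sa", P, v)
    return q - v[:, None]


def policy_iteration(P, r, gamma, variant="howard", tol=1e-10, max_iter=10_000):
    """Return (policy, value, number of policy switches performed).

    variant="howard": switch every state that has a positive advantage.
    variant="simplex": switch only the state with the maximal advantage.
    """
    n, _ = r.shape
    policy = np.zeros(n, dtype=int)  # arbitrary (assumed) initial policy
    for it in range(max_iter):
        v = evaluate(P, r, gamma, policy)
        adv = advantages(P, r, gamma, v)
        best_adv = adv.max(axis=1)
        if best_adv.max() <= tol:    # no state can be improved: stop
            return policy, v, it
        greedy = adv.argmax(axis=1)  # assumed tie-breaking: first maximizer
        if variant == "howard":
            policy = np.where(best_adv > tol, greedy, policy)
        elif variant == "simplex":
            s = int(best_adv.argmax())
            policy = policy.copy()
            policy[s] = greedy[s]
        else:
            raise ValueError(f"unknown variant: {variant!r}")
    raise RuntimeError("policy iteration did not terminate within max_iter")


def random_mdp(n, m, seed=0):
    """Hypothetical dense random MDP, used only to exercise the sketch."""
    rng = np.random.default_rng(seed)
    P = rng.random((m, n, n))
    P /= P.sum(axis=2, keepdims=True)  # each P[a, s, :] sums to 1
    r = rng.random((n, m))
    return P, r


if __name__ == "__main__":
    P, r = random_mdp(n=5, m=3)
    for variant in ("howard", "simplex"):
        _, _, iterations = policy_iteration(P, r, 0.9, variant)
        print(variant, "switching rounds:", iterations)
```

Both variants share the same policy evaluation step and differ only in how many states are switched per iteration, which is the distinction behind the different iteration bounds stated in the abstract.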

Domains

Computational Complexity [cs.CC]; Optimization and Control [math.OC]

Main file

nips2013.pdf (272.03 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-00921261

Submitted on: Friday, December 20, 2013, 10:24:32

Last modified on: Tuesday, April 16, 2024, 10:24:38

Long-term archiving on: Friday, March 21, 2014, 00:40:10

Dates and versions

hal-00921261, version 1 (20-12-2013)

Identifiers

HAL Id: hal-00921261, version 1

Cite

Bruno Scherrer. Improved and Generalized Upper Bounds on the Complexity of Policy Iteration. Neural Information Processing Systems (NIPS) 2013, Dec 2013, South Lake Tahoe, United States. ⟨hal-00921261⟩

Collections

CNRS INRIA UNIV-LORRAINE INRIA2 TDS-MACS LORIA LORIA-AIS

244 views

309 downloads
