Agent-State Based Policies in POMDPs: Beyond Belief State MDPs
Amit Sinha – McGill University, Canada
Aditya Mahajan – Professor, Department of Electrical and Computer Engineering, McGill University, Canada
Hybrid seminar at McGill University or via Zoom.
The traditional approach to POMDPs is to convert them into fully observed MDPs by considering the belief state as an information state. However, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting, where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. We present a unified treatment of some of these approaches by viewing them as models where the agent maintains a local, recursively updateable "agent state" and chooses actions based on the agent state. We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class. These include the designer's approach to find optimal non-stationary agent-state based policies, policy search approaches to find locally optimal stationary agent-state based policies, and the approximate information state approach to find approximately optimal stationary agent-state based policies. We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs. These include agent-state based Q-learning (ASQL) and agent-state based actor-critic (ASAC). We show how to analyze the convergence of these algorithms and characterize the suboptimality of the converged solution.
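To make the agent-state idea concrete, the following is a minimal sketch (not the speakers' code) of an agent whose state is updated recursively as z' = f(z, a, o) and whose actions depend only on z. The names `update`, `policy`, and `run_episode`, and the toy choices of f and the policy, are illustrative assumptions.

```python
# Illustrative sketch of an agent-state based policy loop.
# `update`, `policy`, and `run_episode` are hypothetical stand-ins.

def update(z, a, o):
    # Recursive agent-state update z' = f(z, a, o).
    # Toy choice of f: a bounded window of the last 3 (action, observation) pairs.
    return (z + ((a, o),))[-3:]

def policy(z):
    # Stationary agent-state based policy: the action depends only on z,
    # never on a belief state. Toy rule: echo the last observation.
    return z[-1][1] if z else 0

def run_episode(step, o0, horizon=10):
    # `step` is an unknown environment: action -> next observation.
    # The agent needs no model of `step`; it only maintains z.
    z, trajectory = ((None, o0),), []
    for _ in range(horizon):
        a = policy(z)
        o = step(a)
        z = update(z, a, o)
        trajectory.append((a, o))
    return trajectory
```

The point of the sketch is that everything the agent computes is a function of the locally maintained z, which is what makes such policies usable when the system model is unknown.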
We illustrate that these suboptimality bounds can be used to add a novel "AIS block" to the standard Q-learning and actor-critic pipelines, where the AIS block learns a model that minimizes a proxy for the suboptimality bounds. Experiments on large-scale POMDPs demonstrate that adding such an AIS block improves the performance of RL algorithms. We conclude with a discussion of various practical considerations, including the choice of predicting the state vs. predicting observations, the choice of metric, the relationship with representation learning, and others.
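As a rough illustration of what such a proxy objective can look like, the sketch below combines a reward-prediction error with a distance between the predicted and observed next-observation distributions. The function name `ais_proxy_loss`, the squared-distance choice, and the weight `lam` are assumptions for illustration, not the tutorial's actual loss.

```python
# Hypothetical proxy loss for an AIS block: the block is trained so that the
# agent state predicts (i) the immediate reward and (ii) the distribution of
# the next observation. Both terms appear in AIS-style suboptimality bounds.

def ais_proxy_loss(pred_reward, reward, pred_next_dist, next_dist, lam=1.0):
    # Squared reward-prediction error.
    reward_err = (pred_reward - reward) ** 2
    # Squared L2 distance between predicted and empirical next-observation
    # distributions (one of several possible metrics; the choice of metric
    # is itself a practical consideration discussed in the talk).
    obs_err = sum((p - q) ** 2 for p, q in zip(pred_next_dist, next_dist))
    return reward_err + lam * obs_err
```

Minimizing such a loss drives both terms of the bound toward zero, which is the sense in which the AIS block minimizes a proxy for the suboptimality bound.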
This is a dry run of a tutorial presentation to be given at CDC 2024.
Bio: Amit Sinha is a senior PhD student in the Department of Electrical and Computer Engineering at McGill University.
Aditya Mahajan is Professor of Electrical and Computer Engineering at McGill University.
Location
CIM
Pavillon McConnell
Université McGill
Montréal QC H3A 0E9
Canada