文章基本信息

标题：Accelerating Reinforcement Learning with Suboptimal Guidance ⁎
本地全文：下载
作者：Eivind Bøhn ; Signe Moe ; Tor Arne Johansen 等
期刊名称：IFAC PapersOnLine
印刷版ISSN：2405-8963
出版年度：2020
卷号：53
期号：2
页码：8090-8096
DOI：10.1016/j.ifacol.2020.12.2278
语种：English
出版社：Elsevier
摘要：AbstractReinforcement learning in domains with sparse rewards is a difficult problem, and a large part of the training process is often spent searching the state space in a more or less random fashion for learning signals. For control problems, we often have some controller readily available which might be suboptimal but nevertheless solves the problem to some degree. This controller can be used to guide the initial exploration phase of the learning controller towards reward yielding states, reducing the time before refinement of a viable policy can be initiated. To achieve such an exploration guidance while also allowing the learning controller to outperform the demonstrations provided to it, Nair et al. (2017) proposes to use a "Q-filter" to select states where the agent should clone the behaviour of the demonstrations. The Q-filter selects states where the critic deems the demonstrations to be superior to the agent, providing a natural way to adjust the guidance in a manner that is adaptive to the proficiency of the demonstrator. The contribution of this paper lies in adapting the Q-filter concept from pre-recorded demonstrations to an online guiding controller, and further in identifying shortcomings in the formulation of the Q-filter and suggesting some ways these issues can be mitigated — notably by replacing the value comparison baseline with the guiding controller’s own value function — reducing the effects of stochasticity in the neural network value estimator. These modifications are tested on the OpenAI Gym Fetch environments, showing clear improvements in adaptivity and yielding increased performance in all robotics environments tested.
关键词：KeywordsDeep Reinforcement LearningNon-Linear Control SystemsRobotics