Abstract: Reinforcement learning in domains with sparse rewards is a difficult problem, and a large part of the training process is often spent searching the state space in a more or less random fashion for learning signals. For control problems, we often have some controller readily available which might be suboptimal but nevertheless solves the problem to some degree. This controller can be used to guide the initial exploration phase of the learning controller towards reward-yielding states, reducing the time before refinement of a viable policy can begin. To achieve such exploration guidance while also allowing the learning controller to outperform the demonstrations provided to it, Nair et al. (2017) propose a "Q-filter" that selects the states in which the agent should clone the behaviour of the demonstrations. The Q-filter selects states where the critic deems the demonstrations to be superior to the agent, providing a natural way to adjust the guidance adaptively to the proficiency of the demonstrator. The contribution of this paper lies in adapting the Q-filter concept from pre-recorded demonstrations to an online guiding controller, and further in identifying shortcomings in the formulation of the Q-filter and suggesting ways these issues can be mitigated, notably by replacing the value-comparison baseline with the guiding controller's own value function, which reduces the effects of stochasticity in the neural network value estimator. These modifications are tested on the OpenAI Gym Fetch environments, showing clear improvements in adaptivity and yielding increased performance in all robotics environments tested.
Keywords: Deep Reinforcement Learning, Non-Linear Control Systems, Robotics
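To make the Q-filter idea concrete, the sketch below is a minimal illustration (not code from the paper; all names and the random placeholder values are hypothetical) of a behaviour-cloning term that is masked so it only applies in states where the guiding controller's value estimate exceeds the critic's estimate for the agent's own action, mirroring the modified baseline described in the abstract.

```python
import numpy as np

def q_filtered_bc_loss(agent_actions, guide_actions, q_agent, v_guide):
    """Hypothetical sketch of a Q-filtered behaviour-cloning term.

    agent_actions : (N, action_dim) actions proposed by the learning policy
    guide_actions : (N, action_dim) actions from the online guiding controller
    q_agent       : (N,) critic estimate Q(s, pi(s)) for the agent's actions
    v_guide       : (N,) value estimate of the guiding controller in the same
                    states (the modified baseline; the original Q-filter of
                    Nair et al. (2017) instead compares against the learned
                    critic evaluated on the demonstrated actions)

    Returns the mean squared action difference over states where the guide
    is estimated to outperform the agent.
    """
    # Q-filter: clone only in states where the guide looks better than the agent.
    mask = (v_guide > q_agent).astype(np.float64)
    per_state = np.sum((agent_actions - guide_actions) ** 2, axis=1)
    return np.sum(mask * per_state) / max(mask.sum(), 1.0)

# Toy usage with random placeholders standing in for real rollout data.
rng = np.random.default_rng(0)
N, act_dim = 8, 4
loss = q_filtered_bc_loss(
    agent_actions=rng.normal(size=(N, act_dim)),
    guide_actions=rng.normal(size=(N, act_dim)),
    q_agent=rng.normal(size=N),
    v_guide=rng.normal(size=N),
)
print(f"filtered BC loss: {loss:.4f}")
```

In a full training loop, this term would be added to the usual actor loss so that cloning pressure fades automatically in states where the agent has already surpassed the guiding controller.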