A new approach for solving positioning tasks of robotic systems based on reinforcement learning.
Albers, Albert; Sommer, Hermann; Frietsch, Markus et al.
1. INTRODUCTION
One of the biggest challenges in current robotics research is that
robots "leave" their well-structured environments and are confronted
with new tasks in more complex surroundings. An example of this new
setting is the striving for autonomy and versatility in the field of
humanoid robotics, as explained in (Peters, Vijayakumar & Schaal,
2003). In such a setting, a robot can only be successful and useful if
it is able to adapt itself and to learn from its experiences.
Reinforcement Learning (RL), a branch of machine learning (Mitchell,
1997), is one possible approach to this problem. However, the
application of this learning process is limited by its complexity. RL
is a learning process that uses reward and punishment signals from the
interaction with the agent's environment in order to learn a distinct
policy for achieving tasks. Various RL methods, e.g. Q-learning
(Watkins, 1989) or the SARSA algorithm, have been studied in (Sutton
& Barto, 1998), where it is shown that two problems must be
considered (Park & Choi, 2002). The first is the high computational
effort: RL suffers from the "curse of dimensionality" (Bellman, 1957),
which refers to the tendency of a state space to grow exponentially in
its dimension, that is, in the number of state variables.
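As an illustration of the tabular Q-learning update underlying the
methods discussed here, the following minimal Python sketch shows the
standard update rule; the discretization sizes, learning rate, discount
factor and exploration rate are illustrative assumptions, not the
values used in the experiments.

import numpy as np

n_states, n_actions = 100, 5            # assumed discretization, not the paper's values
alpha, gamma, epsilon = 0.1, 0.95, 0.1  # assumed learning rate, discount, exploration rate
Q = np.zeros((n_states, n_actions))     # Q-table: evaluation of every action in every state

def select_action(s):
    # epsilon-greedy policy: explore with probability epsilon, otherwise exploit
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(s, a, reward, s_next):
    # standard Q-learning update (Watkins, 1989)
    td_target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])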
Secondly, the information for every different task is stored in a
separate Q-table, thus requiring a huge amount of storage space when
performing larger numbers of tasks. In addition, there is no formal
connection between the Q-tables corresponding to two different tasks;
therefore, the agent has to learn every task without any kind of prior
knowledge. This also reduces the usefulness of RL for practical
applications and poses the question of how already acquired knowledge
can be reused. In (Martin H. & De Lope, 2007), an approach is
presented in which a distributed RL architecture serves as a pragmatic
solution to the first problem for some common robotic manipulators with
different degrees of freedom (DOF). A global high-dimensional Q-table,
which contains the evaluations of actions for all states, is replaced
by several small, low-dimensional Q-tables. RL-based approaches have
been applied to various robotic systems in the past, although mostly to
the learning of elemental tasks meant to serve as "building blocks of
movement generation", as in (Peters, 2008). Nevertheless, new
computations and additional storage space are required for performing
new tasks. In this paper, a novel relative approach for positioning
tasks is implemented, and its practical implications, which result in a
considerable reduction of the computational effort, are presented.
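To make the storage argument concrete, the following sketch compares
the number of entries in a single global Q-table with the per-joint
decomposition; the discretization counts are purely illustrative
assumptions.

n_theta, n_thetadot, n_actions = 50, 20, 5   # assumed bins per angle, per velocity, actions per joint

# global table over (theta1, theta2, thetadot1, thetadot2, a1, a2)
global_entries = (n_theta ** 2) * (n_thetadot ** 2) * (n_actions ** 2)

# two per-joint tables over (theta_i, thetadot_i, a_i)
per_joint_entries = 2 * n_theta * n_thetadot * n_actions

print(global_entries)     # 25000000 entries
print(per_joint_entries)  # 10000 entries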
1.2 Two-Link Planar Robot
To show the feasibility of reinforcement learning methods for
controlling manipulator systems, a 2 DOF manipulator system is used as
a first simplified platform. Fig. 1 shows the schematic drawing of the
manipulator. The central goal of this paper is to move the robot
manipulator from a start position to a target position. The aim is to
create a trajectory using as few control commands as possible. The
state of the manipulator is thus described by:

s = [\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2]

The system is described in more detail in (Denzinger & Laureyns,
2008).
[Fig. 1. Schematic drawing of the two-link planar manipulator (figure omitted)]
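A minimal sketch of how the continuous state s could be mapped onto
discrete Q-table indices is given below; the joint ranges and bin edges
are assumptions for illustration, not the discretization actually used
in the experiments.

import numpy as np

# assumed ranges and bin edges for the state s = [theta1, thetadot1, theta2, thetadot2]
theta_edges = np.linspace(-np.pi, np.pi, 51)    # 51 edges -> 50 angle bins per joint
thetadot_edges = np.linspace(-2.0, 2.0, 21)     # 21 edges -> 20 velocity bins per joint

def discretize(s):
    # map the continuous state onto discrete Q-table indices
    theta1, thetadot1, theta2, thetadot2 = s
    return (int(np.digitize(theta1, theta_edges)),
            int(np.digitize(thetadot1, thetadot_edges)),
            int(np.digitize(theta2, theta_edges)),
            int(np.digitize(thetadot2, thetadot_edges)))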
2. THE RELATIVE APPROACH
2.1 Overview
The system is dynamic and nonlinear because the inertia of the upper
arm changes with a variable angle (Denzinger & Laureyns, 2008). The
state space of the robot arm consists of its joint angles \theta_1,
\theta_2 and angular velocities \dot{\theta}_1, \dot{\theta}_2. A
Q-table in this case is a 6-dimensional hyperspace consisting of
angles, velocities and actions:

Q = \{\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2, a_1, a_2\} (1)

Applying the approach of (Martin H. & De Lope, 2007) delivers two
low-dimensional Q-tables, Q_1 = \{\theta_1, \dot{\theta}_1, a_1\} and
Q_2 = \{\theta_2, \dot{\theta}_2, a_2\}. The relative approach
developed in (Yan et al., 2009) enables each position to be calculated
as a constant value plus a difference, as shown in the following
equation:
\theta_i = \theta_{i,const} + \Delta\theta_i, i = 1, 2 (2)
At this point it becomes visible that every possible positioning task
can be reduced to a simple offset compensation, reducing the set of
possible tasks to a single one, which contains all necessary
information about the system's dynamics.
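The following minimal sketch illustrates how, under this relative
approach, two different absolute positioning tasks reduce to the same
offset-compensation task; the joint values are arbitrary illustrative
numbers.

import numpy as np

def to_relative_task(theta_current, theta_target):
    # the task is fully described by the angular offsets between current and
    # target joint angles, independent of the absolute pose
    return np.asarray(theta_target) - np.asarray(theta_current)

# two different absolute tasks that reduce to the same offset-compensation task
task_a = to_relative_task([0.2, -0.5], [0.7, 0.0])
task_b = to_relative_task([1.0,  0.3], [1.5, 0.8])
assert np.allclose(task_a, task_b)   # both are the offset [0.5, 0.5]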
2.2 Computing appropriate Q-tables
As a result of the implemented relative approach, every positioning
task can be seen as a compensation of an angular offset towards the
target joint angle. Theoretically, the employed Q-Learning algorithm
converges towards an optimal Q-table for the case that every state in
the state space is visited an infinite amount of times. As a pragmatical
solution to this problem the employment of an averaged Q-table for a
finite amount of repetitions of the same task is proposed. Additionally,
a proper reduction of the state space e.g. using an unevenly distributed
state space makes it possible to visit all or nearly all states during a
learning task.
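A minimal sketch of the proposed averaging over a finite number of
repetitions of the same learning task, assuming a hypothetical routine
learn_task() that runs one complete learning task and returns its
Q-table:

import numpy as np

def averaged_q_table(learn_task, n_repetitions, table_shape):
    # learn_task() is assumed to run one complete learning task (600 episodes)
    # and to return the resulting Q-table as an array of shape table_shape
    acc = np.zeros(table_shape)
    for _ in range(n_repetitions):
        acc += learn_task()
    return acc / n_repetitions   # element-wise average over the repetitions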
3. RESULTS
The experimental focus lies in reusing the information of a
learning task for the completion of new tasks. For this purpose a set of
averaged Q-tables was computed. Each set corresponds to the average over
a different amount of repetitions of one and the same learning task,
consisting of 600 episodes with a maximum of 250 steps per episode. Here
one completely performed movement starting from an initial position
either ending after 250 steps or reaching the target position with less
than 250 steps is termed as an episode. These Q-tables were computed for
different amounts of repetitions using a Q-Learning algorithm. The
computed Q-tables were then used to complete 10,000 random positioning
tasks without further learning. The results are shown in Fig. 3. Using
the averaged Q-tables for 10 and 100 repetitions ensured a perfect
completion ratio. Furthermore, the quality of the solution increased
with the amount of repetitions of the underlying learning task. The main
difference between the employed policies could be detected at the amount
of initialization values remaining in the Q-tables underlying each
solution. While the Q-tables that corresponded to the solution computed
for a single repetition of the learning task contained several
initialization values, the ones corresponding to the average of one
hundred repetitions presented only a single one left. These
initialization entries represent situations in which the agent lacks
information about its interaction with the environment, thus nearly
always resulting in an inappropriate action selection and a failure of
the exploitation task at hand. This pragmatical approach to reproduce
the theoretical computation of an optimal policy failed to outperform a
regular RL-algorithm in approximately one half of the tested tasks. The
outcome of the comparison and the results are not surprising since an
optimal policy cannot be computed as an average of non-optimal ones. The
fact that the general quality of the solutions still improves with the
amount of repetitions is a current topic of investigation.
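The evaluation procedure described above can be sketched as follows;
the environment interface (reset_random_task, step) is a hypothetical
assumption, while the greedy, learning-free action selection, the
10,000 tasks and the 250-step limit follow the description in the text.

import numpy as np

def evaluate(Q, env, n_tasks=10000, max_steps=250):
    # pure exploitation: actions are chosen greedily from the frozen Q-table,
    # no learning update is performed during evaluation
    completed, steps_to_target = 0, []
    for _ in range(n_tasks):
        s = env.reset_random_task()            # hypothetical: random start/target offset
        for step in range(max_steps):
            a = int(np.argmax(Q[s]))           # greedy action selection
            s, done = env.step(a)              # hypothetical environment step
            if done:                           # target position reached
                completed += 1
                steps_to_target.append(step + 1)
                break
    return completed / n_tasks, float(np.mean(steps_to_target))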
4. CONCLUSIONS
An implementation of a novel relative approach for positioning tasks
based on reinforcement learning was presented. With this new approach,
any positioning task of a 2 DOF robot manipulator system can be modeled
as an offset compensation task. This makes it possible to employ a
single Q-table per joint for the completion of any positioning task
within the agent's workspace without additional learning. At this
point, no further computation is necessary to find a solution for any
new positioning task, which represents a vast reduction of the
computing costs for the control of the manipulator. The validation of
the simulation results on a real manipulator system will be the subject
of future work. In general, this approach can be seen as a
self-learning regulator framework for the closed-loop control of
diverse nonlinear systems, which will also be the subject of future
research.
5. REFERENCES
Bellman, R. (1954). The theory of dynamic programming, Bulletin of
the American Mathematical Society, 60, 503-515
Denzinger, J.; Laureyns, I. et al. (2008). A study of reward
functions in reinforcement learning on a dynamic model of a two-link
planar robot, The 2nd European DAAAM International Young
Researchers' and Scientists' Conference
Martin H., J. A. & De Lope, J. (2007). A Distributed Reinforcement
Learning Architecture for Multi-Link Robots, 4th International
Conference on Informatics in Control, Automation and Robotics (ICINCO),
pp. 192-197, Angers, France
Mitchell, T. M. (1997). Machine Learning, McGraw-Hill, New York,
ISBN 0-07-042807-7
Peters, J.; Vijayakumar, S. & Schaal, S. (2003). Reinforcement
learning for humanoid robotics, Third IEEE-RAS International Conference
on Humanoid Robots, Karlsruhe, Germany
Peters, J. (2008). Machine Learning for Robotics, VDM Verlag Dr.
Müller, Saarbrücken, ISBN 978-3-639-02110-3
Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An
Introduction, The MIT Press, Cambridge, MA
Yan, W. et al. (2009). Application of reinforcement learning
to a two DOF robot arm control, Annals of DAAAM for 2009 &
Proceedings of the 20th DAAAM International Symposium
Fig. 3. Experimental results

Repetitions    Completion ratio [%]    Average steps to target
1              99.2                    65.55
10             100                     31.28
100            100                     26.99