Multiple Q-update approach for reinforcement learning using similarities of adjacent states.
Albers, Albert; Schillo, Sebastian Mathias; Frietsch, Markus et al.
1. INTRODUCTION
Instead of using a conventional control system, the aim of this
paper is to develop an advanced control system for a simplified humanoid
robot manipulator using Reinforcement Learning algorithms. Machine
learning techniques are able to find sophisticated solutions for
problems where conventional control algorithms are faced with serious
difficulties.
1.1 RL Overview
RL is based on the trial-and-error method: the system learns from
gathered experience. To determine how good an action is, the agent has to choose it and observe the reward afterwards. Therefore, a value is needed to specify how desirable it is for the agent to take a certain action $a$ in a state $s$. The value under a fixed policy $\pi$ can be computed as:

$Q^{\pi}(s,a) = E_{\pi}\{\, r_{t+1} + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t = s,\ a_t = a \,\}$ (1)

$Q^{\pi}$ is called the action value function for policy $\pi$. $s_t$ denotes the state and $a_t$ stands for the action at time step $t$ (Watkins, 1989). $E_{\pi}$ denotes the expectation under policy $\pi$ and $\gamma$ is the discount factor. This type of update rule evolved from the idea
of dynamic programming, which was introduced by (Bellman, 1954) and (Bertsekas and Tsitsiklis, 1996). In order to overcome local extremes, an optimal compromise between exploring and exploiting actions has to be found. This is achieved by picking random actions with a certain probability. For example, an ε-greedy policy can be used to choose a random action depending on a parameter ε and a random number. Learning is achieved by iteratively updating the Q-Table, e.g. by using Generalized Policy Iteration (GPI) as presented in (Sutton and Barto, 1998). A good survey of RL can be found in (Kaelbling et al., 1996) or (Smart and Kaelbling, 2002).
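For illustration, a minimal Python sketch of the tabular setting described above is given below: an ε-greedy action selection combined with a one-step on-policy (SARSA-style) update of the Q-Table as a single GPI step. The table size and the parameters alpha, gamma and epsilon are assumed values chosen for the sketch, not parameters of the evaluated system.

```python
import numpy as np

# Illustrative tabular RL setting with an epsilon-greedy policy.
# num_states, num_actions, alpha, gamma and epsilon are assumed values.
num_states, num_actions = 500, 9
alpha, gamma, epsilon = 0.1, 0.95, 0.1

Q = np.zeros((num_states, num_actions))   # tabular Q-Values
rng = np.random.default_rng(0)

def epsilon_greedy(state):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))   # exploring action
    return int(np.argmax(Q[state]))             # exploiting action

def sarsa_update(s, a, r, s_next, a_next):
    """One-step on-policy (SARSA-style) update of the Q-Table."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```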
1.2 Two Link Planar Robot
In this approach, a humanoid robot manipulator is to be controlled by a reinforcement learning agent. To show the feasibility of reinforcement learning methods for controlling manipulator systems, a 2-DOF manipulator system is used as a first simplified platform. A
schematic drawing of the robot can be seen in Figure 1. The central goal
of this paper is to move the robot manipulator from a start position to
a target position. The aim is to create a smooth trajectory using as few
control commands as possible.
Thus, the state of the manipulator is described by its two position angles and its two corresponding angular speeds, which are discretized:

$s = (\varphi_1, \varphi_2, \dot{\varphi}_1, \dot{\varphi}_2)$ (2)

The system is described in more detail in (Denzinger and Laureyns, 2008).
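The state definition of equation (2) can be made concrete by mapping the continuous angles and angular speeds to table indices. The following sketch assumes illustrative bin counts and value ranges; the actual discretization used for the manipulator is described in (Denzinger and Laureyns, 2008).

```python
import numpy as np

# Assumed discretization of the 2-DOF manipulator state
# (phi1, phi2, dphi1, dphi2); bin counts and ranges are illustrative only.
ANGLE_BINS, VEL_BINS = 36, 11
ANGLE_RANGE = (-np.pi, np.pi)        # rad
VEL_RANGE = (-2.0, 2.0)              # rad/s

def discretize(value, value_range, bins):
    """Map a continuous value to an integer bin index."""
    lo, hi = value_range
    idx = int((value - lo) / (hi - lo) * bins)
    return min(max(idx, 0), bins - 1)

def state_index(phi1, phi2, dphi1, dphi2):
    """Return the discretized state s = (phi1, phi2, dphi1, dphi2) as indices."""
    return (discretize(phi1, ANGLE_RANGE, ANGLE_BINS),
            discretize(phi2, ANGLE_RANGE, ANGLE_BINS),
            discretize(dphi1, VEL_RANGE, VEL_BINS),
            discretize(dphi2, VEL_RANGE, VEL_BINS))
```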
2. Q-TABLE
Several problems occur when a look-up table is used (Martin and De Lope, 2007). One major drawback is the large number of states. If
an exploring action is chosen, the agent leaves the well-known path with
the already well-estimated Q-Values. Hence, without the possibility of
making a decision based on Q-Values, all actions are random and the
agent does not have any orientation, even if it is very close to the
target area. This increases the duration of an episode and hinders a fast and intensive learning process. The basic idea behind the novel Q-update is to store more Q-Values per step and thus enable the agent to make more adequate choices.
If the distance from a specific query point is small enough, the
action chosen in the query point is also well suited for a slightly
varied state as a first approximation. Hence, the state variables are
varied by a small amount $\Delta$ as indicated in equation (3):

$\tilde{s}_i = s_i \pm \Delta_i, \qquad \tilde{s}_j = s_j \ \ \text{for } j \neq i$ (3)
[FIGURE 1 OMITTED]
[FIGURE 2 OMITTED]
Only one state variable is changed while the three others (indicated by index j) remain constant. The indices i and j denote the position of the state variable within the state vector. The variation $\Delta_i$ is equal to the discretization step size of the corresponding state entry. For the varied state, a different Q-Value is stored. This new
Q-Value has a smaller value than the original one because the confidence
is smaller. The variation of the state by increments of the
discretization step size is necessary to ensure that the new value is
stored next to the old one instead of overwriting it. Fig. 2 shows a
visualization of the basic idea of this method: in addition to the
visited state in the middle, adjacent states are updated using a
discount factor in order to represent a less confident update. Note that the Q-Table is a 5-dimensional matrix. In Fig. 2, the table's velocity and action entries are held constant in order to display the Q-Values over the two angles $\varphi_1$ and $\varphi_2$. Furthermore, the Q-Values are scalars in the general case; for better readability an additional dimension for the Q-Value is introduced in Fig. 2.
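A minimal sketch of the multiple Q-update described in this section is given below, assuming a Q-Table indexed by the four discretized state components and the action: after the regular update of the visited entry, every adjacent state (one component shifted by one discretization step, the other three held constant) receives the same update scaled by a confidence discount. The array layout, the parameter values and the name neighbour_discount are assumptions made for the sketch.

```python
import numpy as np

# Q-Table over the discretized state (phi1, phi2, dphi1, dphi2) and the action.
# Shapes and the neighbour discount are assumed values for illustration.
Q = np.zeros((36, 36, 11, 11, 9))
alpha, gamma = 0.1, 0.95
neighbour_discount = 0.5          # confidence discount for adjacent states

def multiple_q_update(s, a, r, s_next, a_next):
    """Update the visited entry and its adjacent states with reduced confidence."""
    td_error = r + gamma * Q[s_next + (a_next,)] - Q[s + (a,)]
    Q[s + (a,)] += alpha * td_error                     # regular update

    # Vary one state component at a time by one discretization step
    # (the other three components stay constant) and apply a damped update.
    for i in range(4):
        for step in (-1, +1):
            varied = list(s)
            varied[i] += step
            if 0 <= varied[i] < Q.shape[i]:             # stay inside the table
                idx = tuple(varied) + (a,)
                Q[idx] += neighbour_discount * alpha * td_error
```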
3. EVALUATION
Different points, evenly distributed over the workspace, are chosen
to compare the novel approach to two classic Reinforcement Learning
agents. All agents use the eligibility traces presented in (Sutton and Barto, 1998). Apart from a geometrically balanced distribution, the distance between start and target position was taken into account, as well as the type of movement, such as straight or diagonal. This was done to obtain a sound and clear basis for the comparison.
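For reference, a compact sketch of a SARSA(λ) update with replacing eligibility traces in the sense of (Sutton and Barto, 1998) is given below; the flattened table shape and the parameter values are assumed for illustration and do not correspond to the evaluated agents.

```python
import numpy as np

# SARSA(lambda) with replacing eligibility traces (cf. Sutton and Barto, 1998).
# Table shape and parameters are illustrative assumptions.
Q = np.zeros((500, 9))
E = np.zeros_like(Q)              # eligibility traces
alpha, gamma, lambda_ = 0.1, 0.95, 0.9

def sarsa_lambda_update(s, a, r, s_next, a_next):
    """Propagate the TD error to all recently visited state-action pairs."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    E[s, a] = 1.0                 # replacing trace for the visited pair
    Q[:] += alpha * td_error * E  # update all traced entries
    E[:] *= gamma * lambda_       # decay the traces
```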
[FIGURE 3 OMITTED]
[FIGURE 4 OMITTED]
The quality of the solution found by the novel_SARSA(λ) and SARSA(λ) agents is very high at the start of the RL task. That is, the agents need a low number of actions to reach the target position. SARSA's number of actions decreases constantly, although the learning process is less intensive compared to the SARSA(λ) versions. The performance of the SARSA(λ) version is a trade-off between good quality of the solution and time. If the number of actions is too high, the agent is reset to the starting position. In addition, the higher computational effort prolongs the learning process. To sum up, the novel_SARSA(λ) is the most successful agent concerning quality as well as elapsed time.
4. CONCLUSION
A novel update algorithm for Reinforcement Learning agents based on
exploiting similarities of adjacent states and actions was presented. An
evaluation of the new approach was conducted using point-to-point
movements of a 2-DOF manipulator model. The comparison of the novel
update algorithm to two classic RL agents clearly showed better results
regarding elapsed time and quality. Influences of the new approach on the exploration characteristics of the learning process are the subject of future research.
5. REFERENCES
Bellman R. (1954). The theory of dynamic programming, Bulletin of
the American Mathematical Society, 60, 503-515
Bertsekas D. P. & Tsitsiklis J. N. (1996). Neuro-Dynamic
Programming, Athena Scientific, Belmont, MA
Denzinger J.; Laureyns I. et al. (2008). A study of reward
functions in reinforcement learning on a dynamic model of a two-link
planar robot, The 2nd European DAAAM International Young
Researchers' and Scientists' Conference
Kaelbling L. P.; Littman M. L. & Moore A. (1996). Reinforcement
Learning: A Survey, Journal of Artificial Intelligence Research 4,
237-285
Martin J. A. & De Lope J. (2007). A Distributed Reinforcement
Learning Architecture for Multi-Link Robots, 4th International
Conference on Informatics in Control, Automation and Robotics (ICINCO),
(pp. 192-197). Angers, France
Smart W. D. & Kaelbling L. (2002). Making Reinforcement
Learning Work on Real Robots, Ph.D. Thesis, Brown University
Sutton R. S. & Barto A. G. (1998). Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA
Watkins C. (1989). Learning from Delayed Rewards, PhD Thesis, University of Cambridge, England