On the Convergence of Stochastic Iterative Dynamic Programming Algorithms
Massachusetts Institute of Technology, Cambridge, Artificial Intelligence Laboratory
Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.

Keywords: reinforcement learning, stochastic approximation, convergence, dynamic programming.
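For concreteness, a sketch of the kind of iteration the abstract refers to (the notation \(s_t, a_t, r_t, \alpha_t, \gamma, \Delta_t, F_t\) is standard but not taken from the abstract itself). The tabular Q-learning rule of Watkins (1989) takes the form

\[
Q_{t+1}(s_t, a_t) \;=\; \bigl(1 - \alpha_t\bigr)\, Q_t(s_t, a_t) \;+\; \alpha_t \Bigl[\, r_t + \gamma \max_{b} Q_t(s_{t+1}, b) \Bigr],
\]

which is an instance of the general stochastic iterative process

\[
\Delta_{t+1}(x) \;=\; \bigl(1 - \alpha_t(x)\bigr)\, \Delta_t(x) \;+\; \alpha_t(x)\, F_t(x).
\]

Roughly, convergence of such processes requires Robbins–Monro step sizes, \(\sum_t \alpha_t(x) = \infty\) and \(\sum_t \alpha_t^2(x) < \infty\), together with a contraction property of the expected update, \(\|\mathbb{E}[F_t(x)]\| \le \gamma \|\Delta_t\|\) for some \(\gamma < 1\), and suitably bounded variance; the paper should be consulted for the precise statement of the theorem.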
Subject Category: Operations Research