Partially Observable Markov Decision Processes over an Infinite Planning Horizon with Discounting
Technical report, 1 Jan-31 Mar 1976
UNIVERSITY OF SOUTHERN CALIFORNIA LOS ANGELES BEHAVIORAL TECHNOLOGY LABS
This is the last in a series of technical reports on mathematical approaches to optimizing instructional sequences in instructional systems. This paper deals with Markov decision processes in which the true state of the system is not known with certainty, so the state of the system is characterized by a probability vector. Each action yields an expected reward, moves the system to a new state, and produces an observable outcome. The objective is to determine an action for each probability state vector so as to maximize the total expected reward. This report treats the infinite time horizon with a discount factor, using a partial N-dimensional Maclaurin series to approximate the total optimal reward as a function of the probability state vector.
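The abstract's description of a probability state vector that changes after each action and observable outcome can be illustrated with a standard Bayes update. The following is a minimal sketch only, not the report's method: it does not implement the Maclaurin series approximation, and the names `belief_update`, `expected_reward`, `transition`, `observation`, and `reward`, along with all numerical values, are hypothetical and chosen for illustration.

```python
def belief_update(belief, transition, observation, action, outcome):
    """Bayes update of the probability state vector (illustrative sketch).

    Assumed layout (not taken from the report):
      transition[a][i][j]  = P(next state j | current state i, action a)
      observation[a][j][z] = P(outcome z | next state j, action a)
    Returns the new probability vector and the probability of the outcome.
    """
    n = len(belief)
    unnormalized = []
    for j in range(n):
        # Probability of landing in state j before seeing the outcome.
        predicted = sum(belief[i] * transition[action][i][j] for i in range(n))
        # Weight by how likely the observed outcome is in state j.
        unnormalized.append(predicted * observation[action][j][outcome])
    total = sum(unnormalized)  # probability of observing this outcome
    if total == 0.0:
        raise ValueError("outcome has zero probability under this belief and action")
    return [u / total for u in unnormalized], total


def expected_reward(belief, reward, action):
    """Expected immediate reward of an action, with reward[a][i] per true state i."""
    return sum(b * reward[action][i] for i, b in enumerate(belief))


# Hypothetical two-state, one-action, two-outcome example.
transition = [[[0.9, 0.1], [0.2, 0.8]]]
observation = [[[0.8, 0.2], [0.3, 0.7]]]
reward = [[1.0, 0.0]]
belief = [0.5, 0.5]

new_belief, outcome_prob = belief_update(belief, transition, observation, action=0, outcome=0)
immediate = expected_reward(belief, reward, action=0)
```

In this sketch the updated vector `new_belief` plays the role of the next probability state vector, so a policy in the abstract's sense would be a rule mapping each such vector to an action.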
- Statistics and Probability