Accession Number : ADA627123


Title :   Framing Reinforcement Learning from Human Reward: Reward Positivity, Temporal Discounting, Episodicity, and Performance


Descriptive Note : Journal article preprint


Corporate Author : MASSACHUSETTS INST OF TECH CAMBRIDGE MEDIA LAB


Personal Author(s) : Knox, W B ; Stone, Peter


Full Text : https://apps.dtic.mil/dtic/tr/fulltext/u2/a627123.pdf


Report Date : 29 Sep 2014


Pagination or Media Count : 49


Abstract : Several studies have demonstrated that reward from a human trainer can be a powerful feedback signal for control-learning algorithms. However, the space of algorithms for learning from such human reward has hitherto not been explored systematically. Using model-based reinforcement learning from human reward, this article investigates the problem of learning from human reward through six experiments, focusing on the relationships between reward positivity, how generally positive a trainer's reward values are; temporal discounting, the extent to which future reward is discounted in value; episodicity, whether task learning occurs in discrete learning episodes instead of one continuing session; and task performance, the agent's performance on the task the trainer intends to teach. This investigation is motivated by the observation that an agent can pursue different learning objectives, leading to different resulting behaviors. We search for learning objectives that lead the agent to behave as the trainer intends. We identify and empirically support a "positive circuits" problem with low discounting (i.e., high discount factors) for episodic, goal-based tasks that arises from an observed bias among humans towards giving positive reward, resulting in an endorsement of myopic learning for such domains. We then show that converting simple episodic tasks to be non-episodic (i.e., continuing) reduces, and in some cases resolves, issues present in episodic tasks with generally positive reward and, relatedly, enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm introduced in this article, which we call VI-TAMER, is the first algorithm to successfully learn non-myopically from reward generated by a human trainer; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task.
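

The following minimal Python sketch illustrates the "positive circuits" effect described in the abstract; the toy MDP, predicted-reward values, and discount factors are illustrative assumptions and are not taken from the article or its VI-TAMER algorithm. With generally positive predicted human reward, a myopic agent (discount factor 0) heads straight for the goal, while a non-myopic agent (discount factor 0.99) prefers to keep circling through non-terminal states rather than end the episode.

    # Toy episodic MDP: states 0 (start), 1, 2, and 3 (goal; terminal, value 0).
    # Actions: 0 = advance toward the goal, 1 = circle back toward the start.
    # All numbers below are illustrative assumptions, not values from the article.

    def next_state(s, a):
        return min(s + 1, 3) if a == 0 else max(s - 1, 0)

    def predicted_human_reward(s, a):
        # Reward positivity: the modeled trainer gives mildly positive reward for
        # any behavior and a larger reward for moving toward the goal.
        return 1.0 if a == 0 else 0.5

    def value_iteration(gamma, iters=500):
        # Compute state values under the predicted human reward; state 3 is terminal.
        V = [0.0, 0.0, 0.0, 0.0]
        for _ in range(iters):
            V = [max(predicted_human_reward(s, a) + gamma * V[next_state(s, a)]
                     for a in (0, 1))
                 for s in range(3)] + [0.0]
        return V

    def greedy_action(V, s, gamma):
        return max((0, 1), key=lambda a: predicted_human_reward(s, a)
                                         + gamma * V[next_state(s, a)])

    for gamma in (0.0, 0.99):  # myopic vs. non-myopic valuation
        V = value_iteration(gamma)
        policy = [greedy_action(V, s, gamma) for s in range(3)]
        print(f"gamma={gamma}: greedy actions in states 0-2 -> {policy}")

Running the sketch prints [0, 0, 0] for gamma=0.0 (advance in every state, so the goal is reached) and [0, 0, 1] for gamma=0.99: from the state adjacent to the goal the agent circles back, because accumulating generally positive reward indefinitely is valued above ending the episode.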


Descriptors :   *LEARNING MACHINES , *MULTIAGENT SYSTEMS , ALGORITHMS , BEHAVIOR , BIAS , CLASSIFICATION , DECISION MAKING , FEEDBACK , INFORMATION RETRIEVAL , MAN COMPUTER INTERFACE , MARKOV PROCESSES , MATHEMATICAL MODELS , NETWORK ARCHITECTURE , PROBABILITY , SEMANTICS , SIGNAL PROCESSING , TRAINING , USER NEEDS


Subject Categories : Statistics and Probability
      Computer Programming and Software
      Human Factors Engineering & Man Machine System


Distribution Statement : APPROVED FOR PUBLIC RELEASE