Model-free reinforcement learning

cosmos 30th May 2018 at 1:10am
Reinforcement learning

See more at Reinforcement learning

summary

Simple random search provides a competitive approach to reinforcement learning – "Our findings contradict the common belief that policy gradient techniques, which rely on exploration in the action space, are more sample efficient than methods based on finite-differences [25, 26]."

Prediction

comparing approaches

Evaluating the value function given a policy

Introduction: Monte Carlo model-free prediction. Just sample runs of the MDP under the policy, and average the empirical returns (discounted sums of rewards).
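A minimal sketch of first-visit Monte Carlo prediction, assuming episodes are given as lists of (state, reward) pairs sampled by running the fixed policy in the MDP (the episode format and function name are just illustrative):

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.99):
    """First-visit Monte Carlo policy evaluation.

    episodes: iterable of episodes, each a list of (state, reward) pairs
              obtained by running the fixed policy in the MDP.
    Returns a dict mapping state -> estimated value V(s).
    """
    returns_sum = defaultdict(float)   # sum of returns observed from each state
    returns_count = defaultdict(int)   # number of first visits to each state

    for episode in episodes:
        # compute the return G_t from every time step, working backwards
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G
        # record the return from the first visit to each state
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns_sum[state] += returns[t]
                returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Every-visit Monte Carlo would simply drop the `seen` check and average over all visits to a state.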

Incremental Monte Carlo update
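The incremental form of the Monte Carlo update avoids storing all returns: after each episode, for each visited state $S_t$ with return $G_t$,

$$N(S_t) \leftarrow N(S_t) + 1, \qquad V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$$

Using a fixed step size $\alpha$ in place of $1/N(S_t)$ gives a running average that can track non-stationary problems.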

Temporal difference learning
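TD(0) bootstraps from the current estimate of the next state's value instead of waiting for the full return:

$$V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$$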

Simple example comparing Monte Carlo vs TD(0)

Model-free control (tabular solutions)

intro video!

We actually need to use the action-value function to be model-free: greedy improvement over V(s) requires a model of the transition dynamics, whereas greedy improvement over Q(s,a) does not.

We are basically going to use Policy iteration with the Action-value function, with different ways to do the policy evaluation step (by sampling) and the policy improvement step (in a way that explores enough, given that sampling means we don't see everything). This is an instance of Generalized policy iteration with the Q function evaluated by sampling (model-free).

Policy improvement in the model free setting

ε-greedy exploration

Motivation: Exploration versus exploitation. We need to carry on exploring everything to make sure we understand the value of all options!

ε-greedy exploration; theorem of policy improvement by ε-greedy policy iteration
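The ε-greedy policy over m actions puts probability mass ε/m on every action and the remaining 1 − ε on the greedy action:

$$\pi(a \mid s) = \begin{cases} \varepsilon/m + 1 - \varepsilon & \text{if } a = \arg\max_{a'} Q(s, a') \\ \varepsilon/m & \text{otherwise} \end{cases}$$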

Making the policy iteration more efficient by only partial policy evaluation

Greedy in the limit with infinite exploration (GLIE)

GLIE is a condition on the exploration schedule: all state-action pairs are explored infinitely often, and the policy converges to a greedy policy. Control methods satisfying GLIE are guaranteed to converge to the optimal policy in a model-free manner.

An example is ε-greedy policy iteration with gradual decay of ε

GLIE Monte Carlo control
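A standard presentation of the GLIE Monte Carlo control updates, using first-visit counts and decaying ε as 1/k on the k-th episode:

$$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1, \qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}\big(G_t - Q(S_t, A_t)\big)$$

$$\varepsilon \leftarrow 1/k, \qquad \pi \leftarrow \varepsilon\text{-greedy}(Q)$$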

On-policy learning methods

Monte Carlo

First attempt: Policy iteration with Monte-Carlo policy evaluation. But this isn't very efficient, so we use TD learning methods instead.

Temporal difference learning methods

introduction to TD learning for control

Sarsa
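A minimal sketch of tabular Sarsa with ε-greedy action selection; it assumes an environment exposing reset() and step() with the signatures described in the docstring (these interface names are illustrative, not a specific library's API):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Sarsa. `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # Sarsa update: bootstrap from the action actually taken next
            td_target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```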

Actor-critic methods

https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node66.html

Off-policy learning methods
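The canonical off-policy method here is Q-learning, whose target bootstraps from the greedy action rather than the action the behaviour policy actually takes next:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\big)$$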


Curiosity: Curiosity-driven Exploration by Self-supervised Prediction; see also work by Schmidhuber