See more at Reinforcement learning
Simple random search provides a competitive approach to reinforcement learning – "Our findings contradict the common belief that policy gradient techniques, which rely on exploration in the action space, are more sample efficient than methods based on finite-differences [25, 26]."
Evaluating the value function given a policy
Introduction to Monte Carlo model-free prediction: just sample runs of the MDP under the policy, and average the empirical returns (discounted sums of rewards) observed from each state.
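A minimal sketch of first-visit Monte Carlo prediction in this spirit, assuming a hypothetical sample_episode(policy) helper that runs one episode of the MDP under the policy and returns a list of (state, reward) pairs (the reward being the one received after leaving that state):

```python
from collections import defaultdict

def mc_prediction(sample_episode, policy, num_episodes, gamma=0.99):
    """First-visit Monte Carlo policy evaluation (sketch).

    `sample_episode(policy)` is assumed to run one episode under `policy`
    and return [(s_0, r_1), (s_1, r_2), ...].
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode(policy)
        # Walk backwards, accumulating the discounted return G_t from each step.
        G = 0.0
        first_return = {}
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_return[state] = G  # earliest (first) visit overwrites later ones
        # Average the first-visit returns into the value estimate.
        for state, G_first in first_return.items():
            returns_sum[state] += G_first
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```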
Incremental Monte Carlo update
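Equivalently, a sketch of the incremental form (tabular V, step size alpha), so we don't have to store and re-average every past return:

```python
def incremental_mc_update(V, state, G, alpha):
    """Incremental Monte Carlo update: nudge V[state] toward the observed
    return G. With alpha = 1/N(state) this recovers the exact running mean;
    a constant alpha forgets old episodes (handy for non-stationary problems)."""
    V[state] += alpha * (G - V[state])
```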
Simple example comparing monte carlo vs TD0
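For the comparison, a sketch of the TD(0) prediction update (same tabular-V conventions as above): TD(0) bootstraps from the current estimate of the next state after every single transition, whereas the Monte Carlo update above has to wait for the complete return at the end of the episode.

```python
def td0_update(V, state, reward, next_state, alpha, gamma=0.99):
    """TD(0) update: the target is the one-step bootstrapped estimate
    reward + gamma * V[next_state], applied after every transition."""
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
```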
We actually need to use the action-value function Q(s, a) to be model-free: greedy improvement over V(s) requires a model of the MDP to do the one-step lookahead over transitions, whereas greedy improvement over Q(s, a) is just argmax_a Q(s, a), with no model needed.
We are basically going to use policy iteration with the action-value function, with different ways to do the policy evaluation step (by sampling) and the policy improvement step (in a way that explores enough, given that sampling means we don't see everything). This is an instance of generalized policy iteration with the Q function evaluated by sampling (model-free).
motivation – Exploration versus exploitation. We need to carry on exploring everything to make sure we understand the value of all options!
epsilon-greedy exploration – there is a policy-improvement theorem for epsilon-greedy policies: for any epsilon-greedy policy pi, the epsilon-greedy policy with respect to q_pi is at least as good, v_pi'(s) >= v_pi(s) for all s, so epsilon-greedy policy iteration still improves monotonically.
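A sketch of epsilon-greedy action selection over a tabular Q (assumed here to be a defaultdict keyed by (state, action) tuples, with a fixed finite action list): every action keeps probability at least epsilon/m, and the greedy action gets 1 - epsilon + epsilon/m, which is what the improvement theorem relies on.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Pick a uniformly random action with probability epsilon, otherwise
    the greedy action argmax_a Q[(state, a)]."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```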
Making policy iteration more efficient by doing only partial policy evaluation before each improvement step (e.g. evaluating from just a single episode rather than many).
GLIE (Greedy in the Limit with Infinite Exploration) is a condition on the exploration schedule: every state-action pair is explored infinitely often, and the policy converges to a greedy policy. GLIE Monte-Carlo control converges to the optimal action-value function in a model-free manner.
An example is epsilon-greedy policy iteration with a gradual decay of epsilon, e.g. epsilon_k = 1/k.
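A minimal sketch of GLIE Monte-Carlo control under these assumptions, reusing a hypothetical sample_episode(policy) helper (now returning (state, action, reward) triples) and the epsilon_k = 1/k schedule:

```python
from collections import defaultdict
import random

def glie_mc_control(sample_episode, actions, num_episodes, gamma=0.99):
    """GLIE Monte-Carlo control (sketch): evaluate Q from sampled episodes and
    improve with an epsilon-greedy policy whose epsilon decays as 1/k."""
    Q = defaultdict(float)   # action-value estimates, keyed by (state, action)
    N = defaultdict(int)     # visit counts, keyed by (state, action)

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k    # GLIE schedule: explores forever, greedy in the limit

        def policy(state):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(state, a)])

        episode = sample_episode(policy)  # [(s_0, a_0, r_1), (s_1, a_1, r_2), ...]
        # Incremental every-visit update with step size 1/N(s, a).
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```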
First attempt: policy iteration with Monte-Carlo policy evaluation – but this isn't very efficient (we have to wait for complete episodes and the returns are high-variance), so we use TD learning methods instead.
introduction to TD learning for control
https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node66.html
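The update described there, as a sketch in the same tabular conventions as above: Sarsa is on-policy TD control, bootstrapping from the action actually taken in the next state by the current (e.g. epsilon-greedy) behaviour policy.

```python
def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma=0.99):
    """Sarsa (on-policy TD control): update Q(s, a) toward
    reward + gamma * Q(s', a'), where a' was actually chosen by the policy."""
    td_target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```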
Curiosity – Curiosity-driven Exploration by Self-supervised Prediction (Pathak et al.); see also earlier work by Schmidhuber on curiosity and intrinsic motivation.