A class of Reinforcement learning algorithms. These are also known as direct search algorithms or Policy search, in contrast with algorithms where our aim is to find the optimal value function.
They directly optimize the Policy function to maximize expected reward. If the expected reward can be computed exactly, this is typically an instance of Model-based control. If the environment is unknown, or can't be integrated over, then we may approximate the expected reward with a Monte Carlo estimate (a sum over samples). But this alone doesn't let us calculate the gradients! We need a Monte Carlo estimate of the gradients themselves. This isn't as easy as in supervised learning (where the cost is a sum over i.i.d. examples), because the distribution of states depends on the policy itself. The solution to this problem is the Policy gradient theorem.
With Monte Carlo estimates of the gradient of the expected reward, we can then use a Stochastic optimization algorithm like Stochastic gradient descent (provided we parametrize the policy so that the gradients with respect to the parameters exist).
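As a sketch of the resulting score-function (likelihood-ratio) estimator, with notation assumed here (trajectories $\tau$ sampled from the policy-induced distribution $p_\theta$, return $R(\tau)$):

$$
\nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta}\!\left[R(\tau)\right]
= \mathbb{E}_{\tau \sim p_\theta}\!\left[R(\tau)\,\nabla_\theta \log p_\theta(\tau)\right]
\approx \frac{1}{N}\sum_{i=1}^{N} R\!\left(\tau^{(i)}\right)\sum_{t}\nabla_\theta \log \pi_\theta\!\left(a^{(i)}_t \mid s^{(i)}_t\right)
$$

The dynamics terms inside $\log p_\theta(\tau)$ drop out of the gradient because they do not depend on $\theta$, so only the policy's log-probabilities need to be differentiated.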
intro vid – General aim – Stochastic policy Definition
Classes of policy gradient algorithms
Sometimes called the REINFORCE algorithm; it is a form of Stochastic gradient descent
Derivation, using the Product rule
With direct policy search, rewards may be combined in ways other than by summing them
Derivation by Nando – comment on the reward function not really being needed → result
For the gradient descent, we use a Monte Carlo estimate of the gradient, which is what makes it stochastic.
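A minimal sketch of that Monte Carlo / stochastic gradient loop, assuming a toy chain environment and a tabular softmax policy (the environment and all names here are illustrative, not from these notes):

```python
# Minimal REINFORCE (score-function) sketch on a toy chain MDP (illustrative assumption).
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 2      # chain of 5 states; actions: 0 = left, 1 = right
GOAL = N_STATES - 1             # reward 1.0 when the goal state is reached
MAX_STEPS = 20
GAMMA = 0.99

def step(state, action):
    """Toy deterministic chain dynamics."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_episode(theta):
    """Roll out one episode under the softmax policy pi_theta(a|s)."""
    state, traj = 0, []
    for _ in range(MAX_STEPS):
        probs = softmax(theta[state])
        action = rng.choice(N_ACTIONS, p=probs)
        nxt, reward, done = step(state, action)
        traj.append((state, action, reward))
        state = nxt
        if done:
            break
    return traj

def reinforce(n_iters=500, lr=0.1):
    theta = np.zeros((N_STATES, N_ACTIONS))    # tabular policy parameters
    for _ in range(n_iters):
        traj = sample_episode(theta)
        # Returns-to-go G_t; rewards are summed here, but other combinations are possible.
        returns, g = [], 0.0
        for (_, _, r) in reversed(traj):
            g = r + GAMMA * g
            returns.append(g)
        returns.reverse()
        # Monte Carlo estimate of the policy gradient:
        #   grad J ≈ sum_t G_t * grad log pi_theta(a_t | s_t)
        grad = np.zeros_like(theta)
        for (s, a, _), g_t in zip(traj, returns):
            probs = softmax(theta[s])
            dlog = -probs
            dlog[a] += 1.0                     # gradient of log-softmax w.r.t. theta[s]
            grad[s] += g_t * dlog
        theta += lr * grad                     # stochastic gradient *ascent* step
    return theta

if __name__ == "__main__":
    theta = reinforce()
    print("Learned action probabilities per state:")
    print(np.array([softmax(row) for row in theta]).round(3))
```

Each iteration uses a single sampled episode, so the gradient estimate is noisy; averaging over several episodes or subtracting a baseline would reduce the variance.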
Trust region policy optimization
→ vid
Can approach it as an Inference problem, or in other ways. See comment
based on EM
based on probabilistic modeling
Hierarchical Relative Entropy Policy Search – extended version