A class of Reinforcement learning algorithms. These are also known as direct search algorithms or Policy search, in contrast with algorithms where our aim is to find the optimal value function.
They directly optimize the Policy function to maximize expected reward. If the expected reward can be computed exactly, this is typically an instance of Model-based control. If the environment is unknown, or can't be integrated over, then we may approximate the expected reward with a Monte Carlo estimate (a sum over samples). But this alone doesn't let us calculate the gradients! We need a Monte Carlo estimate of the gradients themselves. This isn't as easy as in supervised learning (where the cost is a sum over i.i.d. examples), because the distribution of states depends on the policy itself. The solution to this problem is the Policy gradient theorem.
With Monte Carlo estimates of the gradient of the expected reward, we can then use a Stochastic optimization algorithm like Stochastic gradient descent (provided we parametrize the policy so that the gradients with respect to the parameters exist).
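As a sketch of the resulting score-function (likelihood-ratio) estimator, with notation assumed here (trajectories $\tau$ sampled from the policy-induced distribution $p_\theta$, return $R(\tau)$):

$$
\nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta}\!\left[R(\tau)\right]
= \mathbb{E}_{\tau \sim p_\theta}\!\left[R(\tau)\,\nabla_\theta \log p_\theta(\tau)\right]
\approx \frac{1}{N}\sum_{i=1}^{N} R\!\left(\tau^{(i)}\right)\sum_{t}\nabla_\theta \log \pi_\theta\!\left(a^{(i)}_t \mid s^{(i)}_t\right)
$$

The dynamics terms inside $\log p_\theta(\tau)$ drop out of the gradient because they do not depend on $\theta$, so only the policy's log-probabilities need to be differentiated.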
intro vid – General aim – Stochastic policy Definition
Classes of policy gradient algorithms
Sometimes called the REINFORCE algorithm; it is a form of Stochastic gradient descent
Derivation, using the Product rule
With direct policy search, rewards may be combined in ways other than by summing them
Derivation by Nando – comment on the reward function not really being needed → result
For the gradient descent, we use a Monte Carlo estimate of the gradient, which is what makes it stochastic.
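A minimal sketch of that Monte Carlo / stochastic gradient loop, assuming a toy chain environment and a tabular softmax policy (the environment and all names here are illustrative, not from these notes):

```python
# Minimal REINFORCE (score-function) sketch on a toy chain MDP (illustrative assumption).
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 2      # chain of 5 states; actions: 0 = left, 1 = right
GOAL = N_STATES - 1             # reward 1.0 when the goal state is reached
MAX_STEPS = 20
GAMMA = 0.99

def step(state, action):
    """Toy deterministic chain dynamics."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_episode(theta):
    """Roll out one episode under the softmax policy pi_theta(a|s)."""
    state, traj = 0, []
    for _ in range(MAX_STEPS):
        probs = softmax(theta[state])
        action = rng.choice(N_ACTIONS, p=probs)
        nxt, reward, done = step(state, action)
        traj.append((state, action, reward))
        state = nxt
        if done:
            break
    return traj

def reinforce(n_iters=500, lr=0.1):
    theta = np.zeros((N_STATES, N_ACTIONS))    # tabular policy parameters
    for _ in range(n_iters):
        traj = sample_episode(theta)
        # Returns-to-go G_t; rewards are summed here, but other combinations are possible.
        returns, g = [], 0.0
        for (_, _, r) in reversed(traj):
            g = r + GAMMA * g
            returns.append(g)
        returns.reverse()
        # Monte Carlo estimate of the policy gradient:
        #   grad J ≈ sum_t G_t * grad log pi_theta(a_t | s_t)
        grad = np.zeros_like(theta)
        for (s, a, _), g_t in zip(traj, returns):
            probs = softmax(theta[s])
            dlog = -probs
            dlog[a] += 1.0                     # gradient of log-softmax w.r.t. theta[s]
            grad[s] += g_t * dlog
        theta += lr * grad                     # stochastic gradient *ascent* step
    return theta

if __name__ == "__main__":
    theta = reinforce()
    print("Learned action probabilities per state:")
    print(np.array([softmax(row) for row in theta]).round(3))
```

Each iteration uses a single sampled episode, so the gradient estimate is noisy; averaging over several episodes or subtracting a baseline would reduce the variance.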
Trust region policy optimization
→ vid
Can approach it as an Inference problem, or in other ways. See comment
based on EM
based on probabilistic modeling
Hierarchical Relative Entropy Policy Search – extended version