Policy search

cosmos 9th November 2019 at 3:31pm

aka Policy gradient method

A class of Reinforcement learning algorithms. These are also known as direct search algorithms or Policy search, in contrast with algorithms where our aim is to find the optimal value function.

They directly optimize the Policy function to maximize expected reward. If the expected reward can be computed exactly, this is typically an instance of Model-based control. If the environment is unknown, or can't be integrated over, then we may approximate the expected reward with a Monte Carlo estimate (sum over samples). But this alone doesn't let us calculate the gradients! We need a Monte Carlo estimate of the gradients themselves. This isn't as easy as in supervised learning (where the cost is a sum over i.i.d. examples) because the distribution of states depends on the policy itself. The solution to this problem is the Policy gradient theorem.

With Monte Carlo estimates of the gradient of the expected reward, we can then use a Stochastic optimization algorithm like Stochastic gradient descent (provided we parametrize the policy so that the gradients with respect to the parameters exist).
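A minimal sketch of this idea, assuming a three-armed bandit (so there are no states) and a softmax policy over the arms. The function names, the reward means, the noise scale and the learning rate are all illustrative assumptions, not something taken from this note. Each step draws one sample, forms the single-sample Monte Carlo gradient estimate `r * grad log pi(a)`, and takes a stochastic gradient step. It is written as ascent on expected reward, which is the same as descent on its negative.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def score_function_ascent(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Stochastic gradient ascent on expected reward for a softmax bandit policy.

    mean_rewards: hypothetical mean reward of each arm (an assumption for the demo).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(mean_rewards))  # policy parameters (one logit per arm)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choice(len(probs), p=probs)            # sample an action from the policy
        r = mean_rewards[a] + rng.normal(scale=0.1)    # noisy reward from the environment
        grad_log_pi = -probs                           # gradient of log softmax:
        grad_log_pi[a] += 1.0                          #   onehot(a) - probs
        theta += lr * r * grad_log_pi                  # single-sample gradient estimate
    return softmax(theta)

final_probs = score_function_ascent(np.array([0.0, 0.5, 1.0]))
print(final_probs)
```

In this sketch the policy should end up concentrating most of its probability on the arm with the highest mean reward. No baseline is subtracted, so the gradient estimate is unbiased but noisy.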

intro vid, General aim, Stochastic policy, Definition

Classes of policy gradient algorithms

REINFORCE

Sometimes called the reinforce algorithm. It is a form of Stochastic gradient descent.

Goal

Algorithm, explanation

Derivation, using the Product rule

  1. Differentiation
  2. Factor out joint probability from terms in sum
  3. Rewrite as expectation –> In expectation, the reinforce algorithm updates parameters in the direction of the gradient of the expected payout. This shows the algorithm is a Stochastic gradient descent algorithm!
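The three steps above can be written out as follows. The notation is an assumption, since the note's own derivation is only linked: $\tau$ is a trajectory, $R(\tau)$ its payout, $\pi_\theta$ the stochastic policy, and $p_\theta(\tau) = p(s_0)\prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$ the joint probability of the trajectory.

```latex
\begin{align*}
J(\theta) &= \sum_{\tau} p_\theta(\tau)\, R(\tau) \\
% 1. Differentiation (product rule applied to the product inside p_\theta)
\nabla_\theta J(\theta)
  &= \sum_{\tau} R(\tau)\, \nabla_\theta p_\theta(\tau)
   = \sum_{\tau} R(\tau)\, \sum_{t} \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\; p_\theta(\tau) \\
% 2. Factor out the joint probability from the terms in the sum
  &= \sum_{\tau} p_\theta(\tau)\, R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
% 3. Rewrite as an expectation
  &= \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
\end{align*}
```

The environment terms $p(s_0)$ and $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$, so they drop out of the gradient. This is why the estimate can be formed from sampled trajectories without a model of the environment.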

With direct policy search, rewards may be combined in ways other than by summing them.
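As a small illustration, the score assigned to a trajectory can be any function of its rewards. The three aggregators below are hypothetical examples chosen for the sketch, not ones named in this note.

```python
def total_return(rewards):
    # Plain sum of the rewards along a trajectory.
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    # Sum with later rewards weighted down by gamma**t.
    return sum(gamma**t * r for t, r in enumerate(rewards))

def worst_case_return(rewards):
    # Score a trajectory by its single worst reward.
    return min(rewards)

rewards = [1.0, 0.0, 2.0]  # made-up rewards for one trajectory
print(total_return(rewards), discounted_return(rewards, gamma=0.5), worst_case_return(rewards))
```

Any of these could stand in for the payout of a trajectory in a likelihood-ratio gradient estimate, since that estimate only needs a scalar score per sampled trajectory.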

Derivation by Nando, comment on reward function not being really needed –> result

For the gradient descent we use a Monte Carlo estimate of the gradient, which is what makes it stochastic.

Actor-critic method

Policy optimization

Proximal policy optimization

Trust region policy optimization

Pegasus

–> vid

Nando's vid

Direct policy gradient methods

Deterministic Policy Gradient Algorithms

paper

Natural policy gradient

Other variations

Can approach it as an Inference problem, or in other ways. See comment

pair-wise policy comparisons

probabilistic policy search approaches

based on EM

based on probabilistic modeling

Relative Entropy Policy Search

Hierarchical Relative Entropy Policy Search, extended version