Actor-critic method

cosmos 21st March 2018 at 7:09pm
Policy gradient method

Comic: https://hackernoon.com/intuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752

After each action selection, the critic evaluates the new state to determine whether things have gone better or worse than expected. That evaluation is the TD error:
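In the standard notation (with $R_{t+1}$ the reward, $\gamma$ the discount factor, and $V$ the critic's state-value estimate),

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$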

Note that, unlike the Bellman optimality equation for $V^*$, which contains a $\max_a$, this expression has no max. That is because the action we have just taken was already selected by following the policy, so the action selection plays the role that $\max_a$ plays in the optimality equation.

Traditional AC methods optimize the policy through policy gradients, scaling the policy gradient by the TD error, while the action-value function is updated by ordinary TD learning.
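A minimal sketch of one such update step in Python, using a tabular softmax actor and a state-value critic (the variant described above, where the TD error both updates the critic and scales the policy gradient); the state/action counts and step sizes are placeholder assumptions:

```python
import numpy as np

n_states, n_actions = 16, 4              # assumed sizes for a toy discrete task
theta = np.zeros((n_states, n_actions))  # actor: softmax policy preferences
V = np.zeros(n_states)                   # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.1, 0.5, 0.99   # illustrative step sizes

def policy(s):
    """Softmax over the action preferences for state s."""
    prefs = theta[s] - theta[s].max()    # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def actor_critic_update(s, a, r, s_next, done):
    # Critic: ordinary TD(0) update of the value estimate
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha_critic * td_error

    # Actor: policy-gradient step, scaled by the TD error
    probs = policy(s)
    grad_log_pi = -probs                 # gradient of log pi(a|s) w.r.t. theta[s]
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi
    return td_error
```

After each environment step one would call `actor_critic_update(s, a, r, s_next, done)` and then sample the next action from `policy(s_next)`.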

Read here