Actor-critic method

cosmos 21st March 2018 at 7:09pm
Policy gradient method

Comic: https://hackernoon.com/intuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752

After each action selection, the critic evaluates the new state to determine whether things have gone better or worse than expected. That evaluation is the TD error:
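In the standard notation (with $R_{t+1}$ the reward, $\gamma$ the discount factor, and $V$ the critic's state-value estimate),

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$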

Note that, unlike the Bellman optimality equation for $V^*$, which contains a $\max_a$, this expression has no max. That is because the action we have just taken was already selected by following the policy, so the action selection plays the role that $\max_a$ plays in the optimality equation.

Traditional AC methods optimize the policy through policy gradients, scaling the policy gradient by the TD error, while the action-value function is updated by ordinary TD learning.
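A minimal sketch of one such update step in Python, using a tabular softmax actor and a state-value critic (the variant described above, where the TD error both updates the critic and scales the policy gradient); the state/action counts and step sizes are placeholder assumptions:

```python
import numpy as np

n_states, n_actions = 16, 4              # assumed sizes for a toy discrete task
theta = np.zeros((n_states, n_actions))  # actor: softmax policy preferences
V = np.zeros(n_states)                   # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.1, 0.5, 0.99   # illustrative step sizes

def policy(s):
    """Softmax over the action preferences for state s."""
    prefs = theta[s] - theta[s].max()    # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def actor_critic_update(s, a, r, s_next, done):
    # Critic: ordinary TD(0) update of the value estimate
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha_critic * td_error

    # Actor: policy-gradient step, scaled by the TD error
    probs = policy(s)
    grad_log_pi = -probs                 # gradient of log pi(a|s) w.r.t. theta[s]
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi
    return td_error
```

After each environment step one would call `actor_critic_update(s, a, r, s_next, done)` and then sample the next action from `policy(s_next)`.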

Read here