Comic: https://hackernoon.com/intuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752
After each action selection, the critic evaluates the new state to determine whether things have gone better or worse than expected. That evaluation is the TD error:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$
Note that unlike the Bellman equation for $v_\pi$, which has an expectation over actions (a $\sum_a \pi(a \mid s)$ term), the TD error doesn't; that is because the action we have just taken already follows the policy, so it is a single sample drawn from $\pi$.
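To make the contrast concrete, here is the standard Bellman equation for $v_\pi$ with its explicit expectation over actions (standard textbook form, not quoted from the comic):

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]$$

versus the TD error built from a single sampled transition $(s, a, r, s')$ with $a \sim \pi(\cdot \mid s)$:

$$\delta = r + \gamma V(s') - V(s)$$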
Traditional actor-critic (AC) methods optimize the policy through policy gradients, scaling the policy gradient by the TD error, while the critic's value function is updated by ordinary TD learning.
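As a minimal sketch of that update rule, here is a tabular one-step actor-critic in Python with a softmax policy. The `env.step(a)` interface returning `(next_state, reward, done)`, the state/action counts, and the learning rates are all illustrative assumptions, not from the source:

```python
import numpy as np

def softmax(x):
    z = x - x.max()           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

n_states, n_actions = 16, 4   # assumed sizes for a small discrete task
theta = np.zeros((n_states, n_actions))  # actor: per-state action preferences
V = np.zeros(n_states)                   # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.1, 0.5, 0.99  # illustrative hyperparameters

def actor_critic_step(env, s):
    # Sample an action from the current policy pi(.|s) -- no sum over
    # actions is needed, since the sampled action is already drawn from pi.
    pi = softmax(theta[s])
    a = np.random.choice(n_actions, p=pi)
    s_next, r, done = env.step(a)  # hypothetical environment interface

    # Critic: TD error delta = r + gamma * V(s') - V(s),
    # then an ordinary TD(0) update toward the one-step target.
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]
    V[s] += alpha_critic * delta

    # Actor: policy-gradient step scaled by the TD error.
    # For a tabular softmax policy, grad of log pi(a|s) is (one_hot(a) - pi).
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log_pi

    return s_next, done
```

Scaling $\nabla \log \pi(a \mid s)$ by $\delta$ means actions that turned out better than the critic expected ($\delta > 0$) become more probable, and worse-than-expected ones less probable.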