aka action-value function
Although state-values suffice to define optimality, it is useful to also define action-values. Given a state $s$, an action $a$, and a policy $\pi$, the action-value of the pair $(s,a)$ under $\pi$ is defined by

$$Q^{\pi}(s,a) = \operatorname{E}\left[R \mid s, a, \pi\right],$$

where $R$ now stands for the random return associated with first taking action $a$ in state $s$ and following $\pi$ thereafter.
Video – Bellman equation for action-value function
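For reference alongside the video, one standard form of the Bellman equation for the action-value function is shown below; the discount factor $\gamma$, the expected immediate reward $r(s,a)$, and the transition probabilities $P(s' \mid s,a)$ are notation assumed here rather than defined earlier in these notes.

$$Q^{\pi}(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a')$$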
It is well-known from the theory of MDPs that if someone gives us $Q^{\pi^*}$ for an optimal policy $\pi^*$, we can always choose optimal actions (and thus act optimally) by simply choosing the action with the highest value at each state. The action-value function of such an optimal policy is called the optimal action-value function and is denoted by $Q^*$.
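To make that greedy-selection step concrete, here is a minimal Python sketch assuming the action-values are stored in a table `Q` indexed by `(state, action)`; the table contents and the state/action sets are illustrative placeholders, not values from the notes.

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
# In practice this would come from a learning algorithm, not be hand-written.
Q = np.array([
    [0.1, 0.5, 0.3],   # action-values in state 0
    [0.7, 0.2, 0.0],   # action-values in state 1
])

def greedy_action(Q, state):
    """Act optimally (given Q*) by picking the highest-valued action."""
    return int(np.argmax(Q[state]))

print(greedy_action(Q, state=0))  # -> 1 (0.5 is the largest value in row 0)
print(greedy_action(Q, state=1))  # -> 0 (0.7 is the largest value in row 1)
```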
Video – Approximating the Q function with a neural network (NN): define a loss function, then minimize it with gradient descent.
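A minimal PyTorch sketch of that idea follows; the network architecture, the state/action dimensions, and the squared-error loss on sampled targets are assumptions for illustration and may differ from the choices made in the video.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration: 4-dimensional states, 2 discrete actions.
STATE_DIM, N_ACTIONS = 4, 2

# A small network that maps a state to one Q-value per action.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, N_ACTIONS),
)
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-2)

# Dummy batch standing in for (state, action, target) samples from experience.
states = torch.randn(8, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (8,))
targets = torch.randn(8)  # e.g. r + gamma * max_a' Q(s', a') in Q-learning

# Loss: squared error between the predicted Q(s, a) and the target values.
pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(pred, targets)

# One gradient-descent step on the network parameters.
optimizer.zero_grad()
loss.backward()
optimizer.step()
```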
The Q function can be learned with a dynamic programming approach, but this is computationally expensive for large state spaces. The model-free reinforcement learning method of Q-learning, on the other hand, is very useful in practice. Because Q-learning is off-policy, the agent does not need to follow the optimal policy while learning it.
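A minimal sketch of the tabular Q-learning update, assuming a small discrete state and action space; the learning rate, discount factor, and epsilon-greedy exploration are illustrative choices, and the single transition `(s, a, r, s_next)` is a placeholder for experience gathered from the environment.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2          # assumed sizes for illustration
alpha, gamma, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(0)

def q_learning_update(s, a, r, s_next):
    """Off-policy update: the target uses the max over actions in s_next,
    regardless of which action the behavior policy actually takes there."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def behavior_action(s):
    """Exploratory (epsilon-greedy) behavior policy; it need not be optimal."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

# One illustrative transition; in practice these come from interacting with
# the environment.
s = 0
a = behavior_action(s)
r, s_next = 1.0, 3                  # placeholder reward and next state
q_learning_update(s, a, r, s_next)
```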