aka action-value function
Although state-values suffice to define optimality, it is useful to also define action-values. Given a state $s$, an action $a$, and a policy $\pi$, the action-value of the pair $(s,a)$ under $\pi$ is defined by

$$Q^{\pi}(s,a) = \operatorname{E}\left[R \mid s, a, \pi\right],$$

where $R$ now stands for the random return associated with first taking action $a$ in state $s$ and following $\pi$ thereafter.
Video – Bellman equation for action-value function
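For reference alongside the video, one standard form of the Bellman equation for the action-value function is shown below; the discount factor $\gamma$, the expected immediate reward $r(s,a)$, and the transition probabilities $P(s' \mid s,a)$ are notation assumed here rather than defined earlier in these notes.

$$Q^{\pi}(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a')$$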
It is well-known from the theory of MDPs that if someone gives us $Q^{\pi^*}$ for an optimal policy $\pi^*$, we can always choose optimal actions (and thus act optimally) by simply choosing the action with the highest value at each state. The action-value function of such an optimal policy is called the optimal action-value function and is denoted by $Q^*$.
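To make that greedy-selection step concrete, here is a minimal Python sketch assuming the action-values are stored in a table `Q` indexed by `(state, action)`; the table contents and the state/action sets are illustrative placeholders, not values from the notes.

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
# In practice this would come from a learning algorithm, not be hand-written.
Q = np.array([
    [0.1, 0.5, 0.3],   # action-values in state 0
    [0.7, 0.2, 0.0],   # action-values in state 1
])

def greedy_action(Q, state):
    """Act optimally (given Q*) by picking the highest-valued action."""
    return int(np.argmax(Q[state]))

print(greedy_action(Q, state=0))  # -> 1 (0.5 is the largest value in row 0)
print(greedy_action(Q, state=1))  # -> 0 (0.7 is the largest value in row 1)
```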
Video – Approximating the Q function with a neural network (NN): define a loss function, then minimize it with gradient descent.
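A minimal PyTorch sketch of that idea follows; the network architecture, the state/action dimensions, and the squared-error loss on sampled targets are assumptions for illustration and may differ from the choices made in the video.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration: 4-dimensional states, 2 discrete actions.
STATE_DIM, N_ACTIONS = 4, 2

# A small network that maps a state to one Q-value per action.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, N_ACTIONS),
)
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-2)

# Dummy batch standing in for (state, action, target) samples from experience.
states = torch.randn(8, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (8,))
targets = torch.randn(8)  # e.g. r + gamma * max_a' Q(s', a') in Q-learning

# Loss: squared error between the predicted Q(s, a) and the target values.
pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(pred, targets)

# One gradient-descent step on the network parameters.
optimizer.zero_grad()
loss.backward()
optimizer.step()
```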
The Q function can be learned with a dynamic programming approach, but this is computationally expensive for large state spaces. The model-free reinforcement learning method of Q-learning, on the other hand, is very useful in practice. Because Q-learning is off-policy, the agent does not need to follow the optimal policy while learning it.
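A minimal sketch of the tabular Q-learning update, assuming a small discrete state and action space; the learning rate, discount factor, and epsilon-greedy exploration are illustrative choices, and the single transition `(s, a, r, s_next)` is a placeholder for experience gathered from the environment.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2          # assumed sizes for illustration
alpha, gamma, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(0)

def q_learning_update(s, a, r, s_next):
    """Off-policy update: the target uses the max over actions in s_next,
    regardless of which action the behavior policy actually takes there."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def behavior_action(s):
    """Exploratory (epsilon-greedy) behavior policy; it need not be optimal."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

# One illustrative transition; in practice these come from interacting with
# the environment.
s = 0
a = behavior_action(s)
r, s_next = 1.0, 3                  # placeholder reward and next state
q_learning_update(s, a, r, s_next)
```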