See Reinforcement learning
Markov decision processes are Markov processes extended with actions and rewards. They are related to Markov reward processes: fixing a policy (i.e. a distribution over actions given the current state; see Reinforcement learning) turns an MDP into an MRP.
A Markov decision process is a 5-tuple (S,A,P⋅(⋅,⋅),R⋅(⋅,⋅),γ), where
- S is a finite set of states,
- A is a finite set of actions (alternatively, As is the finite set of actions available from state s),
- Pa(s,s′)=Pr(st+1=s′∣st=s,at=a) is the probability that action a in state s at time t will lead to state s′ at time t+1. I.e. what happens when you take an action
- Ra(s,s′) is the immediate reward (or expected immediate reward) received after transitioning from state s to state s′ due to action a. What reward you get when something happens. (This is a state-action reward; an alternative is to associate rewards with states alone.)
- γ∈[0,1] is the discount factor, which represents the difference in importance between future rewards and present rewards.
(Note: The theory of Markov decision processes does not state that S or A are finite, but the basic algorithms below assume that they are finite.)
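As a concrete illustration (not from the source), the 5-tuple can be written down directly as data. The states, actions, probabilities, and rewards below are made up, and the dictionary layout is just one possible encoding:

```python
# Minimal sketch of the 5-tuple (S, A, P, R, gamma) as plain Python data.
# All numbers here are invented for illustration.

states = ["s0", "s1"]
actions = ["a0", "a1"]
gamma = 0.9

# P[a][(s, s')] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
P = {
    "a0": {("s0", "s0"): 0.5, ("s0", "s1"): 0.5,
           ("s1", "s0"): 0.0, ("s1", "s1"): 1.0},
    "a1": {("s0", "s0"): 0.0, ("s0", "s1"): 1.0,
           ("s1", "s0"): 0.8, ("s1", "s1"): 0.2},
}

# R[a][(s, s')] = immediate (expected) reward for taking a in s and landing in s'
R = {
    "a0": {("s0", "s0"): 0.0, ("s0", "s1"): 1.0,
           ("s1", "s0"): 0.0, ("s1", "s1"): 0.0},
    "a1": {("s0", "s0"): 0.0, ("s0", "s1"): 2.0,
           ("s1", "s0"): -1.0, ("s1", "s1"): 0.0},
}
```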
Video by Andrew Ng – operational definition
Finite-horizon MDP
Intro vid. A finite-horizon MDP has a maximum time T that is considered; when that time is reached, the MDP ends.
In this case the optimal policy may be non-stationary, i.e. it may depend on the time step: near the horizon, an action with a high immediate reward can beat one that only pays off over more steps than remain.
Non-stationary MDPs
Several of the quantities in the definition of an MDP may be allowed to depend on time, such as the transition probabilities or the rewards.
This can be mapped to the previous (stationary) case by letting time be part of the state space, as in the sketch below.
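A rough sketch of that construction, assuming a hypothetical time-indexed layout P_t[t][a][(s, s')] for the transition probabilities (rewards can be augmented the same way):

```python
# Turn a non-stationary MDP with horizon T into a stationary one whose
# states are pairs (original state, time step).
def augment(P_t, states, actions, T):
    aug_states = [(s, t) for s in states for t in range(T + 1)]
    aug_P = {a: {} for a in actions}
    for a in actions:
        for s in states:
            for s2 in states:
                for t in range(T):  # no transitions out of the terminal layer t = T
                    p = P_t[t][a].get((s, s2), 0.0)
                    if p > 0.0:
                        aug_P[a][((s, t), (s2, t + 1))] = p
    return aug_states, aug_P
```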
Definition of the optimal value function for non-stationary finite-horizon case, which now depends on time
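One standard way to write this, with horizon T, time-superscripted P and R, and terminal values taken to be zero:

```latex
% Optimal value function for the non-stationary, finite-horizon case (horizon T).
% gamma is often set to 1 in the finite-horizon setting.
V_T^*(s) = 0, \qquad
V_t^*(s) = \max_{a \in A_s} \sum_{s'} P_a^t(s,s') \left[ R_a^t(s,s') + \gamma\, V_{t+1}^*(s') \right],
\quad t = T-1, \dots, 0.
```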
Value iteration for this case
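A minimal sketch of value iteration for this case, done by backward induction over time. It assumes the same hypothetical time-indexed dictionaries P_t and R_t as above, and gamma defaults to 1 (common for finite horizons):

```python
# Backward-induction value iteration for the finite-horizon, non-stationary case.
# P_t[t][a][(s, s')] and R_t[t][a][(s, s')] are assumed time-indexed dictionaries.
def finite_horizon_value_iteration(states, actions, P_t, R_t, T, gamma=1.0):
    V = {(s, T): 0.0 for s in states}   # terminal values
    policy = {}                          # optimal action per (state, time) pair
    for t in range(T - 1, -1, -1):       # sweep backwards in time
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                q = sum(
                    P_t[t][a].get((s, s2), 0.0)
                    * (R_t[t][a].get((s, s2), 0.0) + gamma * V[(s2, t + 1)])
                    for s2 in states
                )
                if q > best_q:
                    best_a, best_q = a, q
            V[(s, t)] = best_q
            policy[(s, t)] = best_a
    return V, policy
```

Each backward sweep fills in V(s, t) and the greedy action for that (state, time) pair, so the returned policy is explicitly non-stationary.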
https://en.wikipedia.org/wiki/Markov_decision_process