Markov decision process

cosmos 15th July 2017 at 7:18pm
Decision theory Reinforcement learning

See Reinforcement learning

Markov decision processes (MDPs) are Markov processes extended with actions and rewards. They are closely related to Markov reward processes (MRPs): fixing a policy (i.e. a distribution over actions given the current state; see Reinforcement learning) turns an MDP into an MRP.

A Markov decision process is a 5-tuple $(S, A, P_\cdot(\cdot,\cdot), R_\cdot(\cdot,\cdot), \gamma)$, where

  • $S$ is a finite set of states,
  • $A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
  • $P_a(s,s') = \Pr(s_{t+1}=s' \mid s_t = s, a_t=a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$, i.e. what happens when you take an action,
  • $R_a(s,s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$ due to action $a$, i.e. what reward you get when something happens. (This is called a state-action reward. An alternative is to associate rewards with states alone.)
  • $\gamma \in [0,1]$ is the discount factor, which sets how much future rewards count relative to present rewards: a reward received $k$ steps in the future is weighted by $\gamma^k$.

(Note: The theory of Markov decision processes does not require that $S$ or $A$ be finite, but the basic algorithms below assume that they are finite.)
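The claim that an MDP with a fixed policy is an MRP can be made concrete with a small sketch. Everything here is an illustrative assumption rather than part of the definition above: the nested-list layout `P[a][s][s2]` and `R[a][s][s2]`, the helper name `induced_mrp`, and the toy two-state, two-action MDP. Averaging over the policy's action probabilities gives the MRP's transition matrix and expected one-step reward.

```python
def induced_mrp(P, R, policy):
    """Collapse an MDP and a fixed stochastic policy into an MRP.

    P[a][s][s2]  : transition probability P_a(s, s2)
    R[a][s][s2]  : reward R_a(s, s2)
    policy[s][a] : probability of taking action a in state s
    Returns (P_pi, R_pi): state-to-state transition matrix and
    expected immediate reward per state.
    """
    n_actions = len(P)
    n_states = len(P[0])
    P_pi = [[0.0] * n_states for _ in range(n_states)]
    R_pi = [0.0] * n_states
    for s in range(n_states):
        for a in range(n_actions):
            for s2 in range(n_states):
                w = policy[s][a] * P[a][s][s2]
                P_pi[s][s2] += w
                R_pi[s] += w * R[a][s][s2]
    return P_pi, R_pi


# Hypothetical toy MDP: action 0 stays put, action 1 tries to switch state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.2, 0.8], [0.8, 0.2]],  # action 1
]
# Reward 1 whenever the next state is state 1, regardless of action.
R = [
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
# Uniformly random in state 0, always "stay" in state 1.
policy = [[0.5, 0.5], [1.0, 0.0]]

P_pi, R_pi = induced_mrp(P, R, policy)
```

Each row of `P_pi` still sums to 1, so the result is an ordinary Markov chain with a reward attached to each state, which is what an MRP is.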

Video by Andrew Ng; operational definition

Finite-horizon MDP

Intro video. A maximum time (the horizon) is considered. When that time is reached, the MDP ends.

In this case, the optimal policy may be non-stationary, since the best action in a state can depend on how much time remains.

Non-stationary MDPs

Several of the quantities in the definition of an MDP, such as the transition probabilities or the rewards, may be allowed to depend on time.

This can be mapped to the previous case by letting time be part of the state space.
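One way to picture this mapping is the sketch below. It assumes a finite horizon `T`, time-indexed nested lists `P_t[t][a][s][s2]` and `R_t[t][a][s][s2]`, and a hypothetical helper `augment_with_time`; none of these names come from the note itself. Each augmented state is a pair (t, s), stored at index `t * n + s`, and the layer at t = T is made absorbing with zero reward so that the augmented model is a well-formed time-independent MDP.

```python
def augment_with_time(P_t, R_t):
    """Fold time into the state of a time-dependent MDP.

    P_t[t][a][s][s2], R_t[t][a][s][s2] for t = 0..T-1.
    Returns time-independent (P, R) over augmented states (t, s),
    indexed as t * n + s for t = 0..T.
    """
    T = len(P_t)
    n_actions = len(P_t[0])
    n = len(P_t[0][0])
    size = (T + 1) * n
    P = [[[0.0] * size for _ in range(size)] for _ in range(n_actions)]
    R = [[[0.0] * size for _ in range(size)] for _ in range(n_actions)]
    for a in range(n_actions):
        for t in range(T):
            for s in range(n):
                for s2 in range(n):
                    i = t * n + s          # (t, s)
                    j = (t + 1) * n + s2   # (t + 1, s2)
                    P[a][i][j] = P_t[t][a][s][s2]
                    R[a][i][j] = R_t[t][a][s][s2]
        # Assumption: states at the horizon are absorbing with zero reward.
        for s in range(n):
            i = T * n + s
            P[a][i][i] = 1.0
    return P, R


# Hypothetical example: one action, two states, dynamics change at t = 1.
P_t = [
    [[[0.9, 0.1], [0.5, 0.5]]],  # t = 0
    [[[0.1, 0.9], [0.5, 0.5]]],  # t = 1
]
R_t = [
    [[[0.0, 0.0], [0.0, 0.0]]],  # t = 0: no reward
    [[[0.0, 1.0], [0.0, 1.0]]],  # t = 1: reward 1 for landing in state 1
]
P_aug, R_aug = augment_with_time(P_t, R_t)
```

The price of the mapping is a state space that is T + 1 times larger, which is why the time-indexed formulation is usually kept in practice.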

Definition of the optimal value function for non-stationary finite-horizon case, which now depends on time
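A hedged sketch of what such a definition usually looks like, written in the notation of the 5-tuple above. The horizon $T$, the superscript $t$ marking time-dependent transition probabilities and rewards, and the zero terminal value are assumptions made here for illustration; the linked definition may use a different convention.

```latex
% Optimal value with time index t: best expected discounted return from time t to the horizon T
V^*_t(s) = \max_{\pi} \, \mathbb{E}\left[ \sum_{k=t}^{T-1} \gamma^{\,k-t} R^{k}_{a_k}(s_k, s_{k+1}) \,\middle|\, s_t = s, \pi \right]

% Equivalent recursion (Bellman optimality equation), for t = T-1, \dots, 0
V^*_T(s) = 0, \qquad
V^*_t(s) = \max_{a \in A} \sum_{s' \in S} P^{t}_a(s, s') \left[ R^{t}_a(s, s') + \gamma \, V^*_{t+1}(s') \right]

% A corresponding optimal (non-stationary) policy
\pi^*_t(s) \in \arg\max_{a \in A} \sum_{s' \in S} P^{t}_a(s, s') \left[ R^{t}_a(s, s') + \gamma \, V^*_{t+1}(s') \right]
```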

Value iteration for this case
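In the finite-horizon setting, value iteration reduces to a single backward pass over time (backward induction): start from the value at the horizon and step back to t = 0, taking the best action at each state and time. The sketch below is a minimal illustration under stated assumptions, not the linked derivation: zero terminal value, time-indexed nested lists `P_t[t][a][s][s2]` and `R_t[t][a][s][s2]`, and a hypothetical function name. Ties between actions go to the lowest action index.

```python
def finite_horizon_value_iteration(P_t, R_t, gamma=1.0):
    """Backward induction for a finite-horizon, possibly time-dependent MDP.

    P_t[t][a][s][s2], R_t[t][a][s][s2] for t = 0..T-1.
    Returns (V, policy) with V[t][s] for t = 0..T and policy[t][s]
    for t = 0..T-1. Assumes zero value at the horizon.
    """
    T = len(P_t)
    n_actions = len(P_t[0])
    n = len(P_t[0][0])
    V = [[0.0] * n for _ in range(T + 1)]
    policy = [[0] * n for _ in range(T)]
    for t in range(T - 1, -1, -1):
        for s in range(n):
            best_a, best_q = 0, float("-inf")
            for a in range(n_actions):
                q = sum(
                    P_t[t][a][s][s2] * (R_t[t][a][s][s2] + gamma * V[t + 1][s2])
                    for s2 in range(n)
                )
                if q > best_q:  # strict: ties keep the lowest action index
                    best_a, best_q = a, q
            V[t][s] = best_q
            policy[t][s] = best_a
    return V, policy


# Hypothetical toy problem with the same dynamics at every step.
# State 0: action 0 pays 1 and stays; action 1 pays 0 and moves to state 1.
# State 1: every action pays 3 and stays in state 1.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [0.0, 1.0]],  # action 1
]
R = [
    [[1.0, 0.0], [0.0, 3.0]],  # action 0
    [[0.0, 0.0], [0.0, 3.0]],  # action 1
]
T = 3
V, policy = finite_horizon_value_iteration([P] * T, [R] * T, gamma=1.0)
```

In this toy problem the chosen action in state 0 is "move" at t = 0 and t = 1 but "stay" at t = 2, when only one step remains, so the optimal policy depends on time even though the dynamics do not.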


https://en.wikipedia.org/wiki/Markov_decision_process