See Reinforcement learning
Markov decision processes are Markov processes extended with actions and rewards. They are related to Markov reward processes: fixing a policy (i.e. a distribution over actions given the current state; see Reinforcement learning) turns an MDP into an MRP.
A Markov decision process is a 5-tuple (S,A,P⋅(⋅,⋅),R⋅(⋅,⋅),γ), where
- S is a finite set of states,
- A is a finite set of actions (alternatively, As is the finite set of actions available from state s),
- Pa(s,s′)=Pr(st+1=s′∣st=s,at=a) is the probability that action a in state s at time t will lead to state s′ at time t+1. I.e. what happens when you take an action
- Ra(s,s′) is the immediate reward (or expected immediate reward) received after transitioning from state s to state s′ due to action a. What reward you get when something happens. (This is a state-action reward; an alternative is to associate rewards with states alone.)
- γ∈[0,1] is the discount factor, which represents the difference in importance between future rewards and present rewards.
(Note: The theory of Markov decision processes does not state that S or A are finite, but the basic algorithms below assume that they are finite.)
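As a concrete illustration (not from the source), the 5-tuple can be written down directly as data. The states, actions, probabilities, and rewards below are made up, and the dictionary layout is just one possible encoding:

```python
# Minimal sketch of the 5-tuple (S, A, P, R, gamma) as plain Python data.
# All numbers here are invented for illustration.

states = ["s0", "s1"]
actions = ["a0", "a1"]
gamma = 0.9

# P[a][(s, s')] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
P = {
    "a0": {("s0", "s0"): 0.5, ("s0", "s1"): 0.5,
           ("s1", "s0"): 0.0, ("s1", "s1"): 1.0},
    "a1": {("s0", "s0"): 0.0, ("s0", "s1"): 1.0,
           ("s1", "s0"): 0.8, ("s1", "s1"): 0.2},
}

# R[a][(s, s')] = immediate (expected) reward for taking a in s and landing in s'
R = {
    "a0": {("s0", "s0"): 0.0, ("s0", "s1"): 1.0,
           ("s1", "s0"): 0.0, ("s1", "s1"): 0.0},
    "a1": {("s0", "s0"): 0.0, ("s0", "s1"): 2.0,
           ("s1", "s0"): -1.0, ("s1", "s1"): 0.0},
}
```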
Video by Andrew Ng – operational definition
Finite-horizon MDP
Intro vid. A finite-horizon MDP has a maximum time T that is considered; when that time is reached, the MDP ends.
In this case the optimal policy may be non-stationary, i.e. it may depend on the time step: near the horizon, an action with a high immediate reward can beat one that only pays off over more steps than remain.
Non-stationary MDPs
Several of the quantities in the definition of an MDP may be allowed to depend on time, such as the transition probabilities or the rewards.
This can be mapped to the previous (stationary) case by letting time be part of the state space, as in the sketch below.
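A rough sketch of that construction, assuming a hypothetical time-indexed layout P_t[t][a][(s, s')] for the transition probabilities (rewards can be augmented the same way):

```python
# Turn a non-stationary MDP with horizon T into a stationary one whose
# states are pairs (original state, time step).
def augment(P_t, states, actions, T):
    aug_states = [(s, t) for s in states for t in range(T + 1)]
    aug_P = {a: {} for a in actions}
    for a in actions:
        for s in states:
            for s2 in states:
                for t in range(T):  # no transitions out of the terminal layer t = T
                    p = P_t[t][a].get((s, s2), 0.0)
                    if p > 0.0:
                        aug_P[a][((s, t), (s2, t + 1))] = p
    return aug_states, aug_P
```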
Definition of the optimal value function for non-stationary finite-horizon case, which now depends on time
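One standard way to write this, with horizon T, time-superscripted P and R, and terminal values taken to be zero:

```latex
% Optimal value function for the non-stationary, finite-horizon case (horizon T).
% gamma is often set to 1 in the finite-horizon setting.
V_T^*(s) = 0, \qquad
V_t^*(s) = \max_{a \in A_s} \sum_{s'} P_a^t(s,s') \left[ R_a^t(s,s') + \gamma\, V_{t+1}^*(s') \right],
\quad t = T-1, \dots, 0.
```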
Value iteration for this case
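A minimal sketch of value iteration for this case, done by backward induction over time. It assumes the same hypothetical time-indexed dictionaries P_t and R_t as above, and gamma defaults to 1 (common for finite horizons):

```python
# Backward-induction value iteration for the finite-horizon, non-stationary case.
# P_t[t][a][(s, s')] and R_t[t][a][(s, s')] are assumed time-indexed dictionaries.
def finite_horizon_value_iteration(states, actions, P_t, R_t, T, gamma=1.0):
    V = {(s, T): 0.0 for s in states}   # terminal values
    policy = {}                          # optimal action per (state, time) pair
    for t in range(T - 1, -1, -1):       # sweep backwards in time
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                q = sum(
                    P_t[t][a].get((s, s2), 0.0)
                    * (R_t[t][a].get((s, s2), 0.0) + gamma * V[(s2, t + 1)])
                    for s2 in states
                )
                if q > best_q:
                    best_a, best_q = a, q
            V[(s, t)] = best_q
            policy[(s, t)] = best_a
    return V, policy
```

Each backward sweep fills in V(s, t) and the greedy action for that (state, time) pair, so the returned policy is explicitly non-stationary.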
https://en.wikipedia.org/wiki/Markov_decision_process