See Machine learning, Decision theory
Andrew Ng intro lecture – book which proves several important theorems – good GitHub repo with sources – Sutton & Barto book
Advances in reinforcement learning. See my more recent notes on https://github.com/oxai/vrai/wiki
https://www.alexirpan.com/2018/02/14/rl-hard.html
Intro lecture by David Silver – slides
Software for reinforcement learning
The next action could depend on the whole previous history; however, it is more efficient to take an action according to a summary of the history (often a sufficient statistic of the future), which is called the agent state. In fact, there are two types of states in RL.
The agent state is often chosen to be the Information state or Markov state: a state that contains all the useful information from the history needed to probabilistically predict the future (a Sufficient statistic of the future). This Markov state is used as part of the model the agent uses, which is often formalized as a Markov decision process (vid).
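For reference, the Markov (information-state) property can be written as
$$\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \ldots, S_t \right]$$
i.e. the future is independent of the past given the present.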
Fully observable environment: agent state = environment state = observation. Model = real environment.
Partially observable environment: the agent needs to learn an "agent state" and an MDP. Now environment state ≠ agent state ≠ observation.
(Fully observable) Reinforcement learning models the world as a Markov decision process. Andrew Ng intro to MDPs – operational definition – David Silver lecture
Planning -> policy evaluation (prediction) -> control
David Silver video – Optimal policy theorem!
The core problem of MDPs is to find a policy for the decision maker: a function $\pi(s)$ that specifies the action the decision maker will choose when in state $s$ (vid). vid
The goal is to choose a policy that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon (known as the state-Value function):
Value function (vid):
$$V^{\pi}(s) = \mathbb{E}\!\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \,\middle|\, s_0 = s,\ \pi \right]$$
where we choose actions $a_t = \pi(s_t)$, $\gamma$ is the discount factor and satisfies $0 \le \gamma < 1$ (for example, $\gamma = \frac{1}{1+r}$ when the discount rate is $r$); $\gamma$ is typically close to 1. This is known as the total payoff, or value function, for policy $\pi$.
Undiscounted reward
One can have RL with no discount factor ($\gamma = 1$). Approach: average-reward MDP (objective noted below).
One can, and often does, apply plain undiscounted rewards to Episodic MDPs.
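For reference, the average-reward objective mentioned above is the standard
$$\rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[ \sum_{t=1}^{T} R_t \,\middle|\, \pi \right]$$
(this exact formula is not spelled out in the linked material).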
Computing the value function. video
The Bellman expectation equation gives a set of linear constraints on the value function, which can be solved as a linear system to obtain the value function for a given policy. The most effective way to solve it exactly is iteratively (iterative policy evaluation, sketched below):
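A minimal sketch of iterative policy evaluation on a tabular MDP; the data structures (P, R, policy), gamma and the convergence threshold theta are illustrative assumptions, not taken from the linked lectures:

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iterate the Bellman expectation backup until the value function stops changing.

    P[s][a] is a list of (prob, next_state) pairs, R[s][a] is the expected reward
    for taking action a in state s, and policy[s][a] is pi(a|s).
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # One Bellman expectation backup for state s under the given policy.
            v = sum(policy[s][a] * (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                    for a in range(len(P[s])))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```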
Policy evaluation is most often used as a substep of a reinforcement learning algorithm (see next Learning algorithms section)
Algorithms to solve the prediction and control problems, i.e. for Policy evaluation and optimal control (finding optimal policies)
They can be seen in a unified manner as applied to Planning (see Sutton & Barto chapter 8). See hybrid methods below. Basically, as in Generalized policy iteration, they revolve around improving value functions via Value function backups.
Use a Model
idea – Tradeoffs. The idea is to solve the consistency equations (derived from a lookahead tree and the principle of optimality) iteratively (see Fixed-point iteration; a value-iteration sketch follows). – Summary of methods
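For concreteness, a value-iteration sketch of that fixed-point iteration on the Bellman optimality equation, using the same illustrative MDP representation as the policy-evaluation sketch above (not taken from the linked material):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Fixed-point iteration on the Bellman optimality equation.

    P[s][a] is a list of (prob, next_state) pairs and R[s][a] the expected reward.
    Returns the (approximately) optimal value function and a greedy policy.
    """
    n_states = len(P)
    V = np.zeros(n_states)

    def q_values(s):
        # One-step lookahead: expected return of each action from state s.
        return [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))]

    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q_values(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    greedy_policy = [int(np.argmax(q_values(s))) for s in range(n_states)]
    return V, greedy_policy
```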
(learning proper)
Unlike in model-based RL (planning), here we don't know the dynamics. Model-free methods are also useful even if we do know the dynamics but the state/action space is too big to be computationally tractable (with dynamic programming approaches, each backup does a full-width one-step lookahead, so the cost scales with the number of states and actions). So instead we sample the dynamics, which is more efficient, and can be done even when we don't know the environment.
Model-free prediction – Model-free control
Monte Carlo learning: full-depth lookahead (sample all the way to the end of the episode), but only along sampled trajectories.
Temporal-difference learning: one-step lookahead, then estimate the remaining return (bootstrap). There are also n-step lookahead versions, and other extensions.
TD(0). vid. A kind of Gradient descent that converges to the V(s) that satisfies the Bellman equation (a minimal sketch follows).
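A minimal TD(0) prediction sketch, assuming the classic OpenAI Gym interface (reset()/step() with a 4-tuple return) and discrete, hashable states; the step size, discount and episode count are arbitrary assumptions:

```python
from collections import defaultdict

def td0_prediction(env, policy, gamma=0.99, alpha=0.1, n_episodes=1000):
    """Estimate V^pi with the TD(0) update V(s) += alpha * (r + gamma * V(s') - V(s))."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done, _ = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])  # one-step TD target
            V[s] += alpha * (target - V[s])                    # move V(s) toward the target
            s = s_next
    return dict(V)
```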
Integrate learning and Planning processes. The most straightforward method is simply to allow both to update the same estimated Value function (see the Dyna-style sketch after these notes).
Prioritized sweeping (see chapter 8.4 of Sutton-Barto)
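A Dyna-Q-style sketch of the idea in the two notes above: a single action-value table updated both from real experience and from transitions replayed out of a learned model. Sutton & Barto chapter 8 describes the algorithm; this particular code, and the Gym-style environment with a discrete action space, are illustrative assumptions:

```python
import random
from collections import defaultdict

def dyna_q(env, n_episodes=200, n_planning=10, gamma=0.95, alpha=0.1, epsilon=0.1):
    """Tabular Dyna-Q: direct RL updates plus planning updates from a learned model."""
    Q = defaultdict(float)   # Q[(state, action)]
    model = {}               # model[(state, action)] = (reward, next_state, done)
    actions = list(range(env.action_space.n))

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s2, r, done, _ = env.step(a)
            # Learning: Q-learning update from real experience.
            target = r + (0.0 if done else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Model learning: remember what this (s, a) did.
            model[(s, a)] = (r, s2, done)
            # Planning: replay random remembered transitions into the same Q table.
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * Q[(ps2, greedy(ps2))])
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```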
Methods which attempt to compute the exact value function are called Tabular methods. The alternative, which can be applied to larger and more difficult problems, is to approximate the value function, for instance with a neural network.
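A minimal sketch of value-function approximation: a linear function of hand-designed features trained with a semi-gradient TD(0) update (a neural network would play the same role as the linear model; the features() callable, hyperparameters and Gym-style interface are placeholders):

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features, gamma=0.99, alpha=0.01, n_episodes=500):
    """Approximate V(s) ~ w . features(s); features(s) must return a length-n_features array."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s2, r, done, _ = env.step(policy(s))
            x = features(s)
            target = r + (0.0 if done else gamma * w.dot(features(s2)))
            w += alpha * (target - w.dot(x)) * x  # gradient of w.x with respect to w is x
            s = s2
    return w
```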
(aka Policy search)
Stochastic gradient descent on the parameters of a (stochastic) policy, to maximize expected payoff (a REINFORCE-style sketch follows these notes).
Policy search is usually best when the policy is a simple function of the state features (like a 'reflex'). Optimal value function approaches are better when the policy is more complicated, maybe needing some multistep reasoning, as in chess.
Policy search often works well, but it is very slow, and it is stochastic. Also, because one needs to simulate the MDP, it is most often trained using simulation.
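As referenced above, a REINFORCE-style sketch of policy search: a softmax policy over discrete actions, updated by stochastic gradient ascent on the sampled return. This is one standard instantiation of the idea, not necessarily the method the linked notes have in mind; features(), the hyperparameters and the Gym-style interface are assumptions:

```python
import numpy as np

def reinforce(env, features, n_features, n_actions, gamma=0.99, alpha=0.01, n_episodes=1000):
    """Monte Carlo policy gradient: theta += alpha * G_t * grad log pi(a_t | s_t)."""
    theta = np.zeros((n_actions, n_features))

    def probs(s):
        logits = theta.dot(features(s))
        e = np.exp(logits - logits.max())   # softmax with the usual max-shift for stability
        return e / e.sum()

    for _ in range(n_episodes):
        s, done, episode = env.reset(), False, []
        while not done:
            a = np.random.choice(n_actions, p=probs(s))
            s2, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s2
        G = 0.0
        for s, a, r in reversed(episode):   # accumulate returns backwards through the episode
            G = r + gamma * G
            x = features(s)
            grad_log = -np.outer(probs(s), x)   # d log pi(a|s) / d theta for a softmax policy
            grad_log[a] += x
            theta += alpha * G * grad_log
    return theta
```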
See here for Pegasus policy search, using "scenarios", which look like Quenched disorder
See papers from Uber AI and OpenAI: Deep GA, Evolution strategies
Evolutionary Reinforcement Learning
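A toy sketch of the OpenAI-style evolution strategies idea: perturb the policy parameters with Gaussian noise and move them along a return-weighted average of the perturbations. The evaluate_return() interface, sigma, learning rate and population size are illustrative assumptions:

```python
import numpy as np

def evolution_strategies(evaluate_return, n_params, n_iters=200, pop_size=50, sigma=0.1, alpha=0.02):
    """Black-box policy search: estimate the gradient of expected return from noisy perturbations."""
    theta = np.zeros(n_params)
    for _ in range(n_iters):
        noise = np.random.randn(pop_size, n_params)
        returns = np.array([evaluate_return(theta + sigma * eps) for eps in noise])
        advantages = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize the returns
        theta += alpha / (pop_size * sigma) * noise.T.dot(advantages)     # ES gradient estimate
    return theta
```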
A type of planning which is used as a way of encoding a policy: the action the agent takes is chosen by solving a Planning problem focused on the current state (toy sketch below).
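A toy sketch of that idea: the 'policy' is just 'solve a small planning problem from the current state at every step'. The deterministic model interface and the exhaustive depth-limited search are illustrative assumptions; in practice one would use something like MCTS or trajectory optimization:

```python
def decision_time_planning(model, actions, depth=3, gamma=0.95):
    """Return a policy that replans from the current state at every step.

    model(s, a) -> (reward, next_state) is assumed to be a known, deterministic model.
    """
    def lookahead(s, d):
        # Value of state s under an exhaustive depth-d lookahead.
        if d == 0:
            return 0.0
        return max(r + gamma * lookahead(s2, d - 1)
                   for a in actions
                   for r, s2 in [model(s, a)])

    def policy(s):
        # Solve a small planning problem focused on the current state; return the best first action.
        def q(a):
            r, s2 = model(s, a)
            return r + gamma * lookahead(s2, depth - 1)
        return max(actions, key=q)

    return policy
```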
Many applications to Robotics
https://rach0012.github.io/humanRL_website/
RL Course by David Silver course page
Deep reinforcement learning
See Nando's lectures
OpenAI Gym
Example: https://github.com/joschu/modular_rl
Pavlov.js - Reinforcement learning using Markov Decision Processes
See also Decision theory, Game theory