Reinforcement learning

cosmos 13th November 2019 at 1:23am
Machine learning Optimization

See Machine learning, Decision theory

Andrew Ng intro lecture; book which proves several important theorems; good GitHub repo with sources; Sutton & Barto book.

Advances in reinforcement learning. See my more recent notes on https://github.com/oxai/vrai/wiki

https://www.alexirpan.com/2018/02/14/rl-hard.html

Credit assignment problem

Intro lecture by David Silver (slides)

Types of RL

Free energy principle

Software for reinforcement learning

Environment, observation, and State:

The next action could depend on the whole previous history; however, it is more efficient to take an action according to a summary of the history (often a sufficient statistic of the future), which is called the agent state. In fact, there are two types of states in RL.

  • Environment state. The real state of the environment, which determines its future behaviour. Generally unobservable.
  • Agent state. The state used by the agent's model; in general, a function of the agent's history.

The agent state is often chosen to be the Information state or Markov state: a state that contains all the useful information in the history needed to probabilistically predict the future (a Sufficient statistic of the future). This Markov state is used as part of the model the agent uses, which is often formalized as a Markov decision process (vid).
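For reference, the Markov (information state) property can be written as follows (standard definition, added here for clarity):

```latex
% S_t is a Markov state iff, given it, the rest of the history adds nothing for prediction:
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]
```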

Example

Fully observable environment. Agent state = environment state = observation. Model = real environment.

Partially-observable environment. The agent needs to learn an "agent state", and an MDP. Now environment state $\neq$ agent state $\neq$ observation.

Markov decision process

(Fully observable) Reinforcement learning models the world as a Markov decision process. See: Andrew Ng intro to MDPs, operational definition, David Silver lecture.

Planning -> policy evaluation (prediction) -> control

Methods for policy evaluation (prediction)

Optimal policy problem (Optimal control)

David Silver video. Optimal policy theorem!

The core problem of MDPs is to find a policy for the decision maker: a function $\pi$ that specifies the action $\pi(s)$ that the decision maker will choose when in state $s$ (vid, vid).

The goal is to choose a policy $\pi$ that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon (known as the state-Value function):

$$V_\pi(s) := \sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1})$$

Value function (vid)

where we choose $a_t = \pi(s_t)$, and $\gamma$ is the discount factor, satisfying $0 \le \gamma < 1$. (For example, $\gamma = 1/(1+r)$ when the discount rate is $r$.) $\gamma$ is typically close to 1. This is known as the total payoff, or value function, for policy $\pi$.
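As a minimal illustration of the discounted sum (a sketch added here, not part of the original definition), the return of a finite sampled reward sequence can be computed backwards in one pass:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_t for a finite list of sampled rewards."""
    total = 0.0
    # working backwards turns the sum into repeated multiply-adds
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# e.g. discounted_return([1, 0, 0, 1], gamma=0.9) -> 1 + 0.9**3, approximately 1.729
```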

Undiscounted reward

One can have RL with no discount factor $\gamma$. Approach: average reward MDP.

One can, and often does, apply plain undiscounted rewards to Episodic MDPs.

Policy evaluation

Computing the value function. video

The Bellman expectation equation gives a set of linear constraints on the value function $V_\pi(s)$, which can be solved as a linear system to obtain the value function for a given policy. The most effective way to solve it exactly is iteratively:
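A sketch of iterative policy evaluation under assumed conventions: tabular states/actions, and dynamics stored as P[s][a] = list of (prob, next_state, reward) triples (this layout is my own choice for illustration, similar to Gym's toy-text environments):

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman expectation backup until V stops changing.

    P[s][a]      : list of (prob, next_state, reward) triples (assumed layout)
    policy[s][a] : probability of taking action a in state s
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # expected one-step backup under the policy
            v = sum(policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in range(len(P[s])))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```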

Policy evaluation is most often used as a substep of a reinforcement learning algorithm (see the Learning algorithms section below).

Learning algorithms

video

Algorithms to solve the prediction and control problems, i.e. for Policy evaluation and optimal control (finding optimal policies)

Generalized policy iteration

They can be seen in a unified manner as applications of Planning (see Sutton-Barto chapter 8). See hybrid methods below. Basically, as in Generalized policy iteration, they revolve around improving value functions by Value function backups.

Model-based methods (planning)

Use a Model

Linear programming approach

Dynamic programming for optimal value function

Idea. Tradeoffs. The idea is to solve the consistency equations (derived from a lookahead tree and the principle of optimality) iteratively (see Fixed-point iteration). Summary of methods.

Neuro-dynamic programming

Policy iteration

Value iteration
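A corresponding value iteration sketch, i.e. fixed-point iteration on the Bellman optimality equation, using the same assumed P[s][a] transition layout as the policy evaluation snippet above:

```python
import numpy as np

def value_iteration(P, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on the Bellman optimality backup, then greedy policy extraction."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # one-step lookahead over actions, keep the best backed-up value
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in range(len(P[s])))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    # the optimal policy is greedy with respect to the converged V
    policy = [max(range(len(P[s])),
                  key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in range(n_states)]
    return V, policy
```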

Model-free reinforcement learning

(learning proper)

Unlike in model-based RL (planning), we don't know the dynamics. Sampling is also useful even when we do know the dynamics but the state/action space is too big to be computationally tractable (with dynamic programming approaches, even a one-step lookahead has complexity proportional to the number of actions and states). So instead we sample the dynamics, which is more efficient and works even when we don't know the environment.

Summary

Model-free prediction, Model-free control

Monte Carlo learning

Full lookahead (complete sampled returns, no bootstrapping), but only along sampled trajectories.
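A sketch of every-visit Monte Carlo prediction with an incremental-mean update; the episode format ([(state, reward), ...], with each reward being the one received after leaving that state) is an assumption for illustration:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.99):
    """Every-visit Monte Carlo: average the sampled returns observed from each state."""
    V = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:          # episode = [(state, reward), ...]
        G = 0.0
        # walk backwards so G is the discounted return from each visited state
        for state, reward in reversed(episode):
            G = reward + gamma * G
            counts[state] += 1
            V[state] += (G - V[state]) / counts[state]   # incremental mean
    return dict(V)
```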

Temporal differences (TD)

One-step lookahead, bootstrapping the estimated return. There are also n-step lookahead versions and other extensions.

TD(0). vid. A kind of Gradient descent that converges to the V(s) satisfying the Bellman equation.
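A minimal tabular TD(0) update corresponding to the description above (V can be a dict or array; the names are mine, added as a sketch):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]        # the TD error (delta)
    V[s] += alpha * td_error           # stochastic-gradient-like step on V(s)
    return td_error
```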

Unknown state transition probabilities

  • Estimate from data.

Hybrid methods

Integrate learning and Planning processes. The most straightforward method is simply to allow both to update the same estimated Value function.

Dyna-Q
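A compact tabular Dyna-Q sketch, assuming discrete (hashable) states and the classic Gym-style reset/step interface; this is a minimal illustration rather than a reference implementation:

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, episodes=100, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Q-learning on real experience plus planning backups replayed from a learned model."""
    Q = defaultdict(float)   # Q[(state, action)]
    model = {}               # model[(state, action)] = (reward, next_state, done)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[(s, x)])
            s2, r, done, _ = env.step(a)
            # (a) direct RL: Q-learning backup from real experience
            boot = 0.0 if done else gamma * max(Q[(s2, x)] for x in range(n_actions))
            Q[(s, a)] += alpha * (r + boot - Q[(s, a)])
            # (b) model learning: remember the observed transition
            model[(s, a)] = (r, s2, done)
            # (c) planning: replay random remembered transitions through the same backup
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                pboot = 0.0 if pdone else gamma * max(Q[(ps2, x)] for x in range(n_actions))
                Q[(ps, pa)] += alpha * (pr + pboot - Q[(ps, pa)])
            s = s2
    return Q
```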

Prioritized sweeping (see chapter 8.4 of Sutton-Barto)

Value function approximation

Methods which attempt to compute the exact value function are called Tabular methods. The alternative, which can be applied to larger and more difficult problems, is to approximate the value function, for instance using a neural network.
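A sketch of semi-gradient TD(0) with a linear function approximator (the simplest non-tabular case); the feature map phi is assumed to return a fixed-length numpy vector:

```python
import numpy as np

def semi_gradient_td0(w, phi, s, r, s_next, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) for a linear value approximator V(s) = w . phi(s)."""
    td_error = r + gamma * (w @ phi(s_next)) - (w @ phi(s))
    # the gradient of V(s) with respect to w is just phi(s) for a linear approximator
    w += alpha * td_error * phi(s)
    return w
```

A neural-network version replaces phi and the dot product with a parametrized network and backpropagation, but the update has the same shape.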

Policy gradient methods

(aka Policy search)

Stochastic gradient descent on the parameters determining a (stochastic) policy, to maximize expected payoff.
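A sketch of the REINFORCE estimator for a softmax policy over discrete actions; the state-action feature map phi_sa and the episode format [(state, action, reward), ...] are assumptions for illustration:

```python
import numpy as np

def softmax_policy(theta, phi_sa, s, n_actions):
    """pi(a|s) proportional to exp(theta . phi_sa(s, a))."""
    prefs = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_update(theta, phi_sa, episode, n_actions, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a sampled episode of (state, action, reward) triples."""
    returns, G = [], 0.0
    for _, _, r in reversed(episode):         # discounted return from each step
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(episode, returns):
        probs = softmax_policy(theta, phi_sa, s, n_actions)
        # grad log pi(a|s) for a softmax policy: phi_sa(s, a) - E_pi[phi_sa(s, .)]
        expected_phi = sum(probs[b] * phi_sa(s, b) for b in range(n_actions))
        theta += alpha * G_t * (phi_sa(s, a) - expected_phi)
    return theta
```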

Application to POMDP

Optimal value function vs policy search approaches

Policy search is usually best when the policy is a simple function of the state features (like a 'reflex'). Optimal value function approaches are better when the policy is more complicated, maybe needing some multistep reasoning, as in chess.

Policy search often works well, but it is very slow and stochastic. Also, because one needs to simulate the MDP, it is most often trained using simulation.

See here for Pegasus policy search, using "scenarios", which look like Quenched disorder

Evolutionary methods

Neuroevolution

See papers from Uber AI and OpenAI, Deep GA, Evolutionary strategies

Evolutionary Reinforcement Learning


Decision-time planning

A type of planning used as a way of encoding a policy: the action the agent takes in the current state is chosen by solving a Planning problem focused on that state.


Reinforcement learning in continuous state space

Fitted value iteration

Linear quadratic regulation (LQR)

Partially-observable MDP (POMDP)

Kalman filters and LQG control

Spectral methods in reinforcement learning


Debugging RL algorithms

Applications

Many applications to Robotics

More applications


Priors in reinforcement learning

https://rach0012.github.io/humanRL_website/

RL Course by David Silver course page

Deep reinforcement learning

See Nando's lectures

OpenAI Gym

https://gym.openai.com/docs

https://github.com/openai/gym

Example: https://github.com/joschu/modular_rl
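A minimal random-agent loop, written against the classic Gym API (reset returning an observation and step returning a 4-tuple; newer Gym/Gymnasium versions changed both signatures):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()              # random policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("episode return:", total_reward)
```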

Pavlov.js - Reinforcement learning using Markov Decision Processes

https://openai.com/blog/universe/

DeepMind Lab

See also Decision theory, Game theory, Deep reinforcement learning, Multi-agent systems