See Machine learning, Decision theory
Andrew Ng intro lecture – book which proves several important theorems – good GitHub repo with sources – Sutton & Barto book
Advances in reinforcement learning. See my more recent notes on https://github.com/oxai/vrai/wiki
https://www.alexirpan.com/2018/02/14/rl-hard.html
Intro lecture by David Silver – slides
Software for reinforcement learning
The next action could depend on the whole previous history; however, it is more efficient to take an action according to a summary of the history (often a sufficient statistic of the future), which is called the agent state. In fact, there are two types of states in RL.
The agent state is often chosen to be the Information state or Markov state: a state that contains all the useful information from the history needed to probabilistically predict the future (a Sufficient statistic of the future). This Markov state is used as part of the model the agent uses, which is often formalized as a Markov decision process (vid).
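For reference, the Markov (information-state) property can be written as
$$\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \ldots, S_t \right]$$
i.e. the future is independent of the past given the present.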
Fully observable environment: agent state = environment state = observation. Model = real environment.
Partially observable environment: the agent needs to learn an "agent state" and an MDP. Now environment state ≠ agent state ≠ observation.
(Fully observable) Reinforcement learning models the world as a Markov decision process. Andrew Ng intro to MDPs – operational definition – David Silver lecture
Planning -> policy evaluation (prediction) -> control
David Silver video – Optimal policy theorem!
The core problem of MDPs is to find a policy for the decision maker: a function $\pi(s)$ that specifies the action the decision maker will choose when in state $s$ (vid). vid
The goal is to choose a policy that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon (known as the state-Value function):
Value function (vid):
$$V^{\pi}(s) = \mathbb{E}\!\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \,\middle|\, s_0 = s,\ \pi \right]$$
where we choose actions $a_t = \pi(s_t)$, $\gamma$ is the discount factor and satisfies $0 \le \gamma < 1$ (for example, $\gamma = \frac{1}{1+r}$ when the discount rate is $r$); $\gamma$ is typically close to 1. This is known as the total payoff, or value function, for policy $\pi$.
Undiscounted reward
One can have RL with no discount factor ($\gamma = 1$). Approach: average-reward MDP (objective noted below).
One can, and often does, apply plain undiscounted rewards to Episodic MDPs.
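For reference, the average-reward objective mentioned above is the standard
$$\rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[ \sum_{t=1}^{T} R_t \,\middle|\, \pi \right]$$
(this exact formula is not spelled out in the linked material).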
Computing the value function. video
The Bellman expectation equation gives a set of linear constraints on the value function, which can be solved as a linear system to obtain the value function for a given policy. The most effective way to solve it exactly is iteratively (iterative policy evaluation, sketched below):
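A minimal sketch of iterative policy evaluation on a tabular MDP; the data structures (P, R, policy), gamma and the convergence threshold theta are illustrative assumptions, not taken from the linked lectures:

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iterate the Bellman expectation backup until the value function stops changing.

    P[s][a] is a list of (prob, next_state) pairs, R[s][a] is the expected reward
    for taking action a in state s, and policy[s][a] is pi(a|s).
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # One Bellman expectation backup for state s under the given policy.
            v = sum(policy[s][a] * (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                    for a in range(len(P[s])))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```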
Policy evaluation is most often used as a substep of a reinforcement learning algorithm (see next Learning algorithms section)
Algorithms to solve the prediction and control problems, i.e. for Policy evaluation and optimal control (finding optimal policies)
They can be seen in a unified manner as applied to Planning (see Sutton & Barto chapter 8). See hybrid methods below. Basically, as in Generalized policy iteration, they revolve around improving value functions via Value function backups.
Use a Model
idea – Tradeoffs. The idea is to solve the consistency equations (derived from a lookahead tree and the principle of optimality) iteratively (see Fixed-point iteration; a value-iteration sketch follows). – Summary of methods
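For concreteness, a value-iteration sketch of that fixed-point iteration on the Bellman optimality equation, using the same illustrative MDP representation as the policy-evaluation sketch above (not taken from the linked material):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Fixed-point iteration on the Bellman optimality equation.

    P[s][a] is a list of (prob, next_state) pairs and R[s][a] the expected reward.
    Returns the (approximately) optimal value function and a greedy policy.
    """
    n_states = len(P)
    V = np.zeros(n_states)

    def q_values(s):
        # One-step lookahead: expected return of each action from state s.
        return [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))]

    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q_values(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    greedy_policy = [int(np.argmax(q_values(s))) for s in range(n_states)]
    return V, greedy_policy
```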
(learning proper)
Unlike in model-based RL (planning), here we don't know the dynamics. Model-free methods are also useful even if we do know the dynamics but the state/action space is too big to be computationally tractable (with dynamic programming approaches, each backup does a full-width one-step lookahead, so the cost scales with the number of states and actions). So instead we sample the dynamics, which is more efficient, and can be done even when we don't know the environment.
Model-free prediction – Model-free control
Monte Carlo learning: full-depth lookahead (sample all the way to the end of the episode), but only along sampled trajectories.
Temporal-difference learning: one-step lookahead, then estimate the remaining return (bootstrap). There are also n-step lookahead versions, and other extensions.
TD(0). vid. A kind of Gradient descent that converges to the V(s) that satisfies the Bellman equation (a minimal sketch follows).
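A minimal TD(0) prediction sketch, assuming the classic OpenAI Gym interface (reset()/step() with a 4-tuple return) and discrete, hashable states; the step size, discount and episode count are arbitrary assumptions:

```python
from collections import defaultdict

def td0_prediction(env, policy, gamma=0.99, alpha=0.1, n_episodes=1000):
    """Estimate V^pi with the TD(0) update V(s) += alpha * (r + gamma * V(s') - V(s))."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done, _ = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])  # one-step TD target
            V[s] += alpha * (target - V[s])                    # move V(s) toward the target
            s = s_next
    return dict(V)
```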
Integrate learning and Planning processes. The most straightforward method is simply to allow both to update the same estimated Value function (see the Dyna-style sketch after these notes).
Prioritized sweeping (see chapter 8.4 of Sutton-Barto)
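A Dyna-Q-style sketch of the idea in the two notes above: a single action-value table updated both from real experience and from transitions replayed out of a learned model. Sutton & Barto chapter 8 describes the algorithm; this particular code, and the Gym-style environment with a discrete action space, are illustrative assumptions:

```python
import random
from collections import defaultdict

def dyna_q(env, n_episodes=200, n_planning=10, gamma=0.95, alpha=0.1, epsilon=0.1):
    """Tabular Dyna-Q: direct RL updates plus planning updates from a learned model."""
    Q = defaultdict(float)   # Q[(state, action)]
    model = {}               # model[(state, action)] = (reward, next_state, done)
    actions = list(range(env.action_space.n))

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s2, r, done, _ = env.step(a)
            # Learning: Q-learning update from real experience.
            target = r + (0.0 if done else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Model learning: remember what this (s, a) did.
            model[(s, a)] = (r, s2, done)
            # Planning: replay random remembered transitions into the same Q table.
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * Q[(ps2, greedy(ps2))])
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```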
Methods which attempt to compute the exact value function are called Tabular methods. The alternative, which can be applied to larger and more difficult problems, is to approximate the value function, for instance with a neural network.
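A minimal sketch of value-function approximation: a linear function of hand-designed features trained with a semi-gradient TD(0) update (a neural network would play the same role as the linear model; the features() callable, hyperparameters and Gym-style interface are placeholders):

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features, gamma=0.99, alpha=0.01, n_episodes=500):
    """Approximate V(s) ~ w . features(s); features(s) must return a length-n_features array."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s2, r, done, _ = env.step(policy(s))
            x = features(s)
            target = r + (0.0 if done else gamma * w.dot(features(s2)))
            w += alpha * (target - w.dot(x)) * x  # gradient of w.x with respect to w is x
            s = s2
    return w
```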
(aka Policy search)
Stochastic gradient descent on the parameters of a (stochastic) policy, to maximize expected payoff (a REINFORCE-style sketch follows these notes).
Policy search is usually best when the policy is a simple function of the state features (like a 'reflex'). Optimal value function approaches are better when the policy is more complicated, maybe needing some multistep reasoning, as in chess.
Policy search often works well, but it is very slow, and it is stochastic. Also, because one needs to simulate the MDP, it is most often trained using simulation.
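As referenced above, a REINFORCE-style sketch of policy search: a softmax policy over discrete actions, updated by stochastic gradient ascent on the sampled return. This is one standard instantiation of the idea, not necessarily the method the linked notes have in mind; features(), the hyperparameters and the Gym-style interface are assumptions:

```python
import numpy as np

def reinforce(env, features, n_features, n_actions, gamma=0.99, alpha=0.01, n_episodes=1000):
    """Monte Carlo policy gradient: theta += alpha * G_t * grad log pi(a_t | s_t)."""
    theta = np.zeros((n_actions, n_features))

    def probs(s):
        logits = theta.dot(features(s))
        e = np.exp(logits - logits.max())   # softmax with the usual max-shift for stability
        return e / e.sum()

    for _ in range(n_episodes):
        s, done, episode = env.reset(), False, []
        while not done:
            a = np.random.choice(n_actions, p=probs(s))
            s2, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s2
        G = 0.0
        for s, a, r in reversed(episode):   # accumulate returns backwards through the episode
            G = r + gamma * G
            x = features(s)
            grad_log = -np.outer(probs(s), x)   # d log pi(a|s) / d theta for a softmax policy
            grad_log[a] += x
            theta += alpha * G * grad_log
    return theta
```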
See here for Pegasus policy search, using "scenarios", which look like Quenched disorder
See papers from Uber AI and OpenAI: Deep GA, Evolution strategies
Evolutionary Reinforcement Learning
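A toy sketch of the OpenAI-style evolution strategies idea: perturb the policy parameters with Gaussian noise and move them along a return-weighted average of the perturbations. The evaluate_return() interface, sigma, learning rate and population size are illustrative assumptions:

```python
import numpy as np

def evolution_strategies(evaluate_return, n_params, n_iters=200, pop_size=50, sigma=0.1, alpha=0.02):
    """Black-box policy search: estimate the gradient of expected return from noisy perturbations."""
    theta = np.zeros(n_params)
    for _ in range(n_iters):
        noise = np.random.randn(pop_size, n_params)
        returns = np.array([evaluate_return(theta + sigma * eps) for eps in noise])
        advantages = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize the returns
        theta += alpha / (pop_size * sigma) * noise.T.dot(advantages)     # ES gradient estimate
    return theta
```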
A type of planning which is used as a way of encoding a policy: the action the agent takes is chosen by solving a Planning problem focused on the current state (toy sketch below).
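A toy sketch of that idea: the 'policy' is just 'solve a small planning problem from the current state at every step'. The deterministic model interface and the exhaustive depth-limited search are illustrative assumptions; in practice one would use something like MCTS or trajectory optimization:

```python
def decision_time_planning(model, actions, depth=3, gamma=0.95):
    """Return a policy that replans from the current state at every step.

    model(s, a) -> (reward, next_state) is assumed to be a known, deterministic model.
    """
    def lookahead(s, d):
        # Value of state s under an exhaustive depth-d lookahead.
        if d == 0:
            return 0.0
        return max(r + gamma * lookahead(s2, d - 1)
                   for a in actions
                   for r, s2 in [model(s, a)])

    def policy(s):
        # Solve a small planning problem focused on the current state; return the best first action.
        def q(a):
            r, s2 = model(s, a)
            return r + gamma * lookahead(s2, depth - 1)
        return max(actions, key=q)

    return policy
```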
Many applications to Robotics
https://rach0012.github.io/humanRL_website/
RL Course by David Silver course page
Deep reinforcement learning
See Nando's lectures
OpenAI Gym
Example: https://github.com/joschu/modular_rl
Pavlov.js - Reinforcement learning using Markov Decision Processes
See also Decision theory, Game theory