aka Planning
Solving the Bellman equations
https://twitter.com/wgussml/status/1126984030090596354
World models for Atari: https://arxiv.org/pdf/1903.00374.pdf. They train the world model on trajectories the agent actually explores, rather than on random ones, and they iterate this loop (collect data, train the model, train the policy inside the model) several times; see the toy sketch below.
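A toy, self-contained sketch of that iterated loop (hedged: the actual paper uses a video-prediction world model and PPO; here a tabular count model and ε-greedy value iteration stand in for both, and `step(s, a) -> (s2, r)` is an assumed environment interface):

```python
import numpy as np

def world_model_loop(step, n_states, n_actions, start, gamma=0.9,
                     n_iters=5, n_episodes=20, horizon=30, seed=0):
    """step(s, a) -> (s2, r) is the real environment (assumed interface)."""
    rng = np.random.default_rng(seed)
    counts = np.ones((n_states, n_actions, n_states))  # smoothed transition counts
    rewards = np.zeros((n_states, n_actions))          # assume deterministic rewards
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # 1. Collect real trajectories with the CURRENT (eps-greedy) policy,
        #    so the model is trained where the agent actually goes.
        for _ in range(n_episodes):
            s = start
            for _ in range(horizon):
                greedy = rng.random() > 0.2
                a = int(Q[s].argmax()) if greedy else int(rng.integers(n_actions))
                s2, r = step(s, a)
                counts[s, a, s2] += 1
                rewards[s, a] = r
                s = s2
        # 2. Fit the world model on those on-policy trajectories.
        P_hat = counts / counts.sum(axis=2, keepdims=True)
        # 3. Improve the policy inside the learned model (cheap value iteration).
        for _ in range(100):
            Q = rewards + gamma * np.einsum("sap,p->sa", P_hat, Q.max(axis=1))
    return Q
```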
Idea: solve the consistency equations (derived from a lookahead tree and the principle of optimality) iteratively (see Fixed-point iteration), as in the sketch below.
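A minimal sketch of this fixed-point iteration (value iteration), assuming a small tabular MDP with known transition tensor `P` and reward table `R` (both illustrative names, not from these notes):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P: (S, A, S) transition probabilities; R: (S, A) expected rewards."""
    V = np.zeros(R.shape[0])
    while True:
        # One application of the Bellman optimality operator:
        # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:   # contraction => unique fixed point
            return V_new, Q.argmax(axis=1)  # optimal values and greedy policy
        V = V_new

# Tiny random MDP to exercise the sketch
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))  # 3 states, 2 actions
R = rng.standard_normal((3, 2))
V, pi = value_iteration(P, R)
```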
Extensions to dynamic programming:
There are other algorithms described in the Wikipedia page, e.g.:
Real-time dynamic programming (RTDP), which uses on-policy trajectory sampling (sketched below)
A final comment on DP methods: DP uses full-width lookahead, and it assumes we know the dynamics. Instead we can sample → leads to Model-free reinforcement learning.
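A minimal RTDP sketch under the same tabular assumptions as the value-iteration code above: instead of full sweeps, back up only the states visited along trajectories sampled from the current greedy policy:

```python
import numpy as np

def rtdp(P, R, start, gamma=0.9, n_trajectories=100, horizon=50, seed=0):
    """Back up only states visited on greedy trajectories from `start`."""
    rng = np.random.default_rng(seed)
    V = np.zeros(R.shape[0])
    for _ in range(n_trajectories):
        s = start
        for _ in range(horizon):
            q = R[s] + gamma * P[s] @ V  # Bellman backup at the visited state only
            a = int(q.argmax())
            V[s] = q[a]
            # On-policy trajectory sampling: follow the greedy action
            s = rng.choice(R.shape[0], p=P[s, a])
    return V
```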
https://deepmind.com/blog/agents-imagine-and-plan/ – Learning model-based planning from scratch
Using Generative models and Environment models, as in the toy sketch below.
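A toy sketch of the "imagine then act" idea (hedged: the actual agents learn how to construct and use imagined rollouts with neural networks; here we just score each first action by random rollouts in a learned model, with `model(s, a) -> (s2, r)` an assumed interface):

```python
import numpy as np

def plan_by_imagination(model, s0, n_actions, depth=5, gamma=0.95,
                        n_rollouts=8, seed=0):
    """Score each first action by imagined rollouts in the learned model."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(n_actions)
    for a0 in range(n_actions):
        for _ in range(n_rollouts):
            s, a, ret, disc = s0, a0, 0.0, 1.0
            for _ in range(depth):
                s, r = model(s, a)                # imagined step, no real-env cost
                ret += disc * r
                disc *= gamma
                a = int(rng.integers(n_actions))  # random continuation policy
            scores[a0] += ret / n_rollouts
    return int(scores.argmax())                   # act on the best imagined return
```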