Advances in reinforcement learning

cosmos 13th November 2019 at 1:23am
Reinforcement learning

Deep reinforcement learning

See my more recent notes on https://github.com/oxai/vrai/wiki

Curiosity-driven RL

Model-based reinforcement learning and hybrid model-based/model-free

AlphaStar

AlphaZero

https://buff.ly/2KOUbFO – MCTS approaches really seem like the best we have for AI problems beyond perception.

Advances in StarCraft RL

Theory of RL

Comparing humans with the best Reinforcement Learning algorithms – this demonstrates that our RL algorithms don't have the same priors that we humans use for learning

https://arxiv.org/abs/1807.09647


https://machinethoughts.wordpress.com/2018/04/14/rl-and-the-game-of-mathematics/

https://arxiv.org/abs/1806.02426

"Solving" Montezuma's revenge – They make the RL problem easier by mapping it to a supervised learning problem. However, that way they allow the algorithm to just solve by memorization, rather than having it generalize because of better priors. Learning from demonstrations is interesting though because it could allow the algorithms, for instance, to learn the priors from humans which already have them! However, the algorithms presented in the research discussed in this paper don't seem to do this, and just learn the particular trajectory/policy for solving this game..


https://www.youtube.com/watch?v=Ol0-c9OE3VQ Interesting. This research quite undeniably demonstrates that our RL algorithms lack the priors that we use for learning. That is bad when learning human-related tasks, but can be good in other cases (No Free Lunch!!). Most future AGIs will have rather different priors than us, and will specialize in things like quantum mechanics, relativity, and algebraic geometry. They will dream in those things, and they will find our world very unintuitive :) Perhaps there will also be superintelligences which, while being slightly worse at some things than us, will be much better at most relevant things. Also, nice channel.


Evolving Multimodal Robot Behavior via Many Stepping Stones with the Combinatorial Multi-Objective Evolutionary Algorithm

https://blog.openai.com/openai-five/

This is quite impressive. Could they beat the top pro players by this August?

"OpenAI Five plays 180 years worth of games against itself every day, learning via self-play. It trains using a scaled-up version of Proximal Policy Optimization running on 256 GPUs and 128,000 CPU cores" 256 GPUs and 128,000 CPU cores

This indicates that reinforcement learning can yield long-term planning with large but achievable scale – without fundamental advances.

Each of OpenAI Five’s networks contains a single-layer, 1024-unit LSTM that sees the current game state (extracted from Valve’s Bot API) and emits actions through several possible action heads. Each head has semantic meaning, for example, the number of ticks to delay this action, which action to select, the X or Y coordinate of this action in a grid around the unit, etc. Action heads are computed independently.
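A toy PyTorch version of that architecture (a single-layer, 1024-unit LSTM feeding several independent action heads); the observation size and head dimensions are invented for illustration:

```python
import torch
import torch.nn as nn

class FiveLikePolicy(nn.Module):
    def __init__(self, obs_dim=512, hidden=1024):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, num_layers=1, batch_first=True)
        # Independent heads, each with its own semantic meaning.
        self.delay_head = nn.Linear(hidden, 4)     # ticks to delay the action
        self.action_head = nn.Linear(hidden, 16)   # which action to select
        self.x_head = nn.Linear(hidden, 9)         # X offset in a grid around the unit
        self.y_head = nn.Linear(hidden, 9)         # Y offset in a grid around the unit

    def forward(self, obs_seq, state=None):
        out, state = self.lstm(obs_seq, state)     # obs_seq: (batch, time, obs_dim)
        h = out[:, -1]                             # hidden state at the last timestep
        heads = {
            "delay": self.delay_head(h),
            "action": self.action_head(h),
            "x": self.x_head(h),
            "y": self.y_head(h),
        }
        return heads, state
```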

"To avoid “strategy collapse”, the agent trains 80% of its games against itself and the other 20% against its past selves." This sounds similar to the "double learning" used in DQN, which enhances exploration.

To force exploration in strategy space, during training (and only during training) we randomized the properties (health, speed, start level, etc.) of the units, and it began beating humans.
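That kind of domain randomization might look like this in code (the property names follow the quote; the ranges are invented):

```python
import random

def randomize_unit(unit_defaults, training=True):
    """Perturb unit properties during training only, to force exploration in strategy space."""
    if not training:
        return dict(unit_defaults)
    return {
        "health": unit_defaults["health"] * random.uniform(0.5, 1.5),
        "speed": unit_defaults["speed"] * random.uniform(0.8, 1.2),
        "start_level": unit_defaults["start_level"] + random.randint(0, 3),
    }
```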

We postprocess each agent’s reward by subtracting the other team’s average reward to prevent the agents from finding positive-sum situations.
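In code, that postprocessing is just mean-subtraction of the opposing team's rewards (a sketch, not the actual reward pipeline):

```python
def zero_sum_rewards(team_rewards, enemy_rewards):
    """Subtract the enemy team's average reward so agents can't find positive-sum situations."""
    enemy_mean = sum(enemy_rewards) / len(enemy_rewards)
    return [r - enemy_mean for r in team_rewards]
```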

Teamwork is controlled by a hyperparameter we dubbed “team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five’s heroes should care about its individual reward function versus the average of the team’s reward functions. We anneal its value from 0 to 1 over training.
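A minimal sketch of how team spirit could blend individual and team rewards, with a linear anneal from 0 to 1 over training (the schedule function is an assumption):

```python
def team_spirit_rewards(individual_rewards, team_spirit):
    """Blend each hero's own reward with the team average, weighted by team_spirit in [0, 1]."""
    team_mean = sum(individual_rewards) / len(individual_rewards)
    return [(1 - team_spirit) * r + team_spirit * team_mean for r in individual_rewards]

def annealed_team_spirit(step, total_steps):
    # Linearly anneal from 0 (fully selfish) to 1 (fully team-oriented) over training.
    return min(1.0, step / total_steps)
```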

Our system is implemented as a general-purpose RL training system called Rapid.