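The idea that works best (as of 2016 or so) is Q-learning. The most well-known type of Q-learning is the one where we allow both the behaviour and target policies to improve: the target policy is greedy with respect to Q, while the behaviour policy is, for example, epsilon-greedy with respect to Q.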
friendly intro
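Q-learning is also known as SARSAMAX, since its update uses the max over the next state's action values where SARSA uses the action actually taken.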
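
To make the behaviour/target split concrete, here is a minimal tabular sketch, not a reference implementation. It assumes discrete states and actions, a plain dict as the Q-table, and default step size and discount chosen only for illustration; the names `epsilon_greedy` and `q_learning_update` are hypothetical. The behaviour policy picks actions epsilon-greedily, while the update bootstraps from the greedy (max) value of the next state.

```python
import random


def epsilon_greedy(q, state, n_actions, epsilon, rng=random):
    """Behaviour policy: explore with probability epsilon, else act greedily on Q."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [q.get((state, a), 0.0) for a in range(n_actions)]
    return values.index(max(values))  # ties go to the lowest action index


def q_learning_update(q, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.99):
    """One Q-learning step on the dict q, keyed by (state, action).

    The target uses the max over next-state action values (the greedy
    target policy), regardless of which action the behaviour policy
    takes next. Unseen (state, action) pairs are assumed to be 0.0.
    """
    if done:
        best_next = 0.0  # no bootstrapping past a terminal state
    else:
        best_next = max(q.get((next_state, a), 0.0) for a in range(n_actions))
    target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]
```

In a training loop you would call `epsilon_greedy` to choose the action, step the environment, then call `q_learning_update` with the observed transition. Swapping `best_next` for the value of the action actually chosen next would turn this into SARSA.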