Motivation

How can we model uncertainties and learn a policy that is robust to all of them?

How can we model the gap between simulation and the real world?

Inspired by \(H_\infty\) control methods, the authors view both modeling errors and differences between training and test scenarios as extra forces or disturbances acting on the system.

Contributions

  • Propose the idea of modeling uncertainties via an adversarial agent.
  • Train a pair of agents, a protagonist and an adversary, where the protagonist learns to fulfil the original task goals while being robust to the disruptions generated by its adversary.
  • The proposed algorithm is robust to differences in model initialization, modeling errors, and uncertainties.

Algorithm

  • state: \(s_t\)
  • actions: protagonist \(a_t^1 \sim \mu(s_t)\), adversary \(a_t^2 \sim \nu(s_t)\)
  • next state: \(s_{t+1} = P(s_t, a_t^1, a_t^2)\)
  • reward: \(r_t = r(s_t, a_t^1, a_t^2)\); the protagonist receives \(r_t^1 = r_t\), while the adversary receives \(r_t^2 = -r_t\)
  • one step of the MDP: \((s_t, a_t^1, a_t^2, r_t^1, r_t^2, s_{t+1})\) (see the step sketch below)
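
A minimal sketch of one such step, assuming a generic environment whose transition takes both actions; the names `env`, `protagonist`, `adversary`, and `sample_action` are illustrative, not from the paper:

```python
def rarl_step(env, protagonist, adversary, s_t):
    """One step of the two-player zero-sum MDP (illustrative sketch)."""
    a1 = protagonist.sample_action(s_t)    # a_t^1 ~ mu(s_t)
    a2 = adversary.sample_action(s_t)      # a_t^2 ~ nu(s_t)
    s_next, r_t = env.step(a1, a2)         # s_{t+1} = P(s_t, a_t^1, a_t^2), r_t = r(s_t, a_t^1, a_t^2)
    r1, r2 = r_t, -r_t                     # zero-sum split of the reward
    return (s_t, a1, a2, r1, r2, s_next)   # the step tuple listed above
```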

The protagonist maximizes the following cumulative reward:

\[R^1 = \mathbb{E}_{s_0 \sim \rho,\, a^1 \sim \mu(s),\, a^2 \sim \nu(s)}\left[ \sum_{t=0}^{T-1} r^1(s_t, a_t^1, a_t^2) \right].\]

Since the game is zero-sum (\(r_t^2 = -r_t^1\)), the adversary's return is \(R^2 = -R^1\), so maximizing the adversary's return is equivalent to minimizing \(R^1\). Thus,

\[R^{1*} = \min_\nu \max_\mu R^1(\mu, \nu) = \max_\mu \min_\nu R^1(\mu, \nu).\]
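
As a worked illustration of the zero-sum structure, a Monte Carlo estimate of \(R^1\) from sampled trajectories also gives the adversary's return \(R^2 = -R^1\); the step-tuple layout reuses the hypothetical sketch above:

```python
def estimate_returns(rollouts):
    """Monte Carlo estimate of R^1 (and R^2 = -R^1) from a list of trajectories.

    Each trajectory is a list of step tuples (s_t, a1, a2, r1, r2, s_next).
    """
    r1_returns = [sum(step[3] for step in traj) for traj in rollouts]  # sum of r^1 per trajectory
    R1 = sum(r1_returns) / len(r1_returns)                             # empirical E[sum_t r^1_t]
    return R1, -R1                                                     # zero-sum: R^2 = -R^1
```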

The algorithm optimizes the two agents with the following alternating procedure (a training-loop sketch follows the list):

  1. Learn the protagonist’s policy while holding the adversary’s policy fixed.
  2. Learn the adversary’s policy while holding the protagonist’s policy fixed.
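
A minimal sketch of that alternation, assuming hypothetical `rollout` and `update` helpers standing in for a standard policy-gradient optimizer; this is not the paper's exact implementation:

```python
def train_rarl(env, protagonist, adversary, n_iters, n_traj):
    """Alternating optimization of the protagonist and the adversary (illustrative sketch)."""
    for _ in range(n_iters):
        # Phase 1: adversary frozen, protagonist maximizes its return (r^1 = r).
        trajs = [rollout(env, protagonist, adversary) for _ in range(n_traj)]
        protagonist.update(trajs, reward_index=3)   # index 3 holds r^1 in the step tuple

        # Phase 2: protagonist frozen, adversary maximizes its return (r^2 = -r).
        trajs = [rollout(env, protagonist, adversary) for _ in range(n_traj)]
        adversary.update(trajs, reward_index=4)     # index 4 holds r^2 in the step tuple
    return protagonist, adversary
```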
Figure: Algorithm of RARL
