Motivation

How can we model uncertainties and learn a policy that is robust to all of them?

How can we model the gap between simulation and the real world?

Inspired by \(H_\infty\) control methods, the authors view both modeling errors and differences between training and test scenarios as extra forces or disturbances acting on the system.

Contributions

  • Propose the idea of modeling uncertainties via an adversarial agent.
  • Train a pair of agents, a protagonist and an adversary, where the protagonist learns to fulfil the original task goals while being robust to the disruptions generated by its adversary.
  • The proposed algorithm is robust to differences in model initialization, modeling errors, and uncertainties.

Algorithm

  • state: \(s_t\)
  • actions: protagonist \(a_t^1 \sim \mu(s_t)\), adversary \(a_t^2 \sim \nu(s_t)\)
  • next state: \(s_{t+1} = P(s_t, a_t^1, a_t^2)\)
  • reward: \(r_t = r(s_t, a_t^1, a_t^2)\); the protagonist receives \(r_t^1 = r_t\), while the adversary receives \(r_t^2 = -r_t\)
  • one step of the MDP: \((s_t, a_t^1, a_t^2, r_t^1, r_t^2, s_{t+1})\) (see the step sketch below)
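
A minimal sketch of one such step, assuming a generic environment whose transition takes both actions; the names `env`, `protagonist`, `adversary`, and `sample_action` are illustrative, not from the paper:

```python
def rarl_step(env, protagonist, adversary, s_t):
    """One step of the two-player zero-sum MDP (illustrative sketch)."""
    a1 = protagonist.sample_action(s_t)    # a_t^1 ~ mu(s_t)
    a2 = adversary.sample_action(s_t)      # a_t^2 ~ nu(s_t)
    s_next, r_t = env.step(a1, a2)         # s_{t+1} = P(s_t, a_t^1, a_t^2), r_t = r(s_t, a_t^1, a_t^2)
    r1, r2 = r_t, -r_t                     # zero-sum split of the reward
    return (s_t, a1, a2, r1, r2, s_next)   # the step tuple listed above
```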

The protagonist maximizes the following cumulative reward:

\[R^1 = \mathbb{E}_{s_0 \sim \rho,\, a^1 \sim \mu(s),\, a^2 \sim \nu(s)}\left[ \sum_{t=0}^{T-1} r^1(s_t, a_t^1, a_t^2) \right].\]

Since the game is zero-sum (\(r_t^2 = -r_t^1\)), the adversary's return is \(R^2 = -R^1\), so maximizing the adversary's return is equivalent to minimizing \(R^1\). Thus,

\[R^{1*} = \min_\nu \max_\mu R^1(\mu, \nu) = \max_\mu \min_\nu R^1(\mu, \nu).\]
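
As a worked illustration of the zero-sum structure, a Monte Carlo estimate of \(R^1\) from sampled trajectories also gives the adversary's return \(R^2 = -R^1\); the step-tuple layout reuses the hypothetical sketch above:

```python
def estimate_returns(rollouts):
    """Monte Carlo estimate of R^1 (and R^2 = -R^1) from a list of trajectories.

    Each trajectory is a list of step tuples (s_t, a1, a2, r1, r2, s_next).
    """
    r1_returns = [sum(step[3] for step in traj) for traj in rollouts]  # sum of r^1 per trajectory
    R1 = sum(r1_returns) / len(r1_returns)                             # empirical E[sum_t r^1_t]
    return R1, -R1                                                     # zero-sum: R^2 = -R^1
```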

The algorithm optimizes the two agents with the following alternating procedure (a training-loop sketch follows the list):

  1. Learn the protagonist’s policy while holding the adversary’s policy fixed.
  2. Learn the adversary’s policy while holding the protagonist’s policy fixed.
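
A minimal sketch of that alternation, assuming hypothetical `rollout` and `update` helpers standing in for a standard policy-gradient optimizer; this is not the paper's exact implementation:

```python
def train_rarl(env, protagonist, adversary, n_iters, n_traj):
    """Alternating optimization of the protagonist and the adversary (illustrative sketch)."""
    for _ in range(n_iters):
        # Phase 1: adversary frozen, protagonist maximizes its return (r^1 = r).
        trajs = [rollout(env, protagonist, adversary) for _ in range(n_traj)]
        protagonist.update(trajs, reward_index=3)   # index 3 holds r^1 in the step tuple

        # Phase 2: protagonist frozen, adversary maximizes its return (r^2 = -r).
        trajs = [rollout(env, protagonist, adversary) for _ in range(n_traj)]
        adversary.update(trajs, reward_index=4)     # index 4 holds r^2 in the step tuple
    return protagonist, adversary
```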
Figure: Algorithm of RARL
