Motivation

  • Adversarial attacks have been widely studied in image classification tasks.
  • Attacks:
    • Add imperceptible noise to the observations of states.
      • The observed states slightly differ from the true environment states.
  • Using RL in safety-critical applications:
    • Autonomous driving.
  • Robustness under unpredictable sensing noise is important for real-world RL agents.
  • Observation noise also contributes to the reality gap:
    • An agent that works well in simulated environments may fail in real environments due to noise in observations.
  • The weakness of a DRL agent to perturbations of state observations has two sources:
    • Vulnerability in function approximators.
      • Originates from the highly non-linear and black-box nature of neural networks.
      • Example: For Atari games, a CNN extracts features (e.g., the position and velocity of the ball in Pong) from input frames. When an attack adds noise to the observations, the CNN extracts wrong features.
    • Intrinsic weakness of the policy.
      • Even if perfect state features are extracted, a policy can still make mistakes due to its intrinsic weakness.
      • Example: In a finite-state MDP, there is no function approximation error, but the agent can still be vulnerable to small perturbations on observations (e.g. perturbing the observation to one of the neighbors of the true state).
  • To improve the robustness of RL:
    • A more robust function approximator.
      • Add a robust policy regularizer to the loss function. (indistinctive)
    • A policy aware of perturbations on observations.
  • Related works:
    • SA-MDP
      • The true state in the environment is not perturbed by the adversary.
      • Example: GPS sensor readings on a car are naturally noisy, but the ground truth location of the car is not affected by the noise.
    • Robust MDP (RMDP)
      • The worst case transition probabilities of the environment are considered.
      • This setting is essentially different from SA-MDP.
    • Robust Adversarial RL (RARL)
      • Learn an adversary online together with an agent.
      • Train an agent and an adversary under the two-player Markov game setting.
      • The adversary can change the environment states through actions directly applied to environment.
      • The goal is to improve the robustness against environment parameter changes, such as mass, length, or friction.
    • Action Robust RL
    • The learning process of the above works is still an MDP, but considering perturbations on state observations turns the problem into a POMDP.

Contributions

  • With a fixed agent policy:
    • An optimal adversary that perturbs state observations can be learned.
      • The learned adversary is much stronger than previous attacks.
    • This optimal adversary is guaranteed to yield the worst-case agent reward.
  • Propose a framework of alternating training with learned adversaries (ATLA).
    • Train an adversary online together with the agent using policy gradient, following the optimal adversarial attack framework.
    • Improve the robustness of the agent.
  • History (the past states and actions) can be important for learning a robust agent.
    • Propose to use an LSTM-based policy in the ATLA framework.
      • It is more robust than policies parameterized as regular feedforward NNs.

Method

POMDP

\[\langle S,A,O,\Omega,R,P,\gamma \rangle\]
  • \(O\) is a set of observations.
  • \(\Omega\) is a set of conditional observation probabilities \(p(o\vert s)\).
  • POMDPs typically require history-dependent optimal policies.
    • A history-dependent policy is defined as \(\pi: H\to P(A)\).
    • \(H\) is the set of histories.
      • The history \(h_t\) at time \(t\) is denoted as \((s_0,a_0,\cdots,s_{t-1},a_{t-1},s_t)\).

SA-MDP

  • \(\langle S,A,B,R,P,\gamma \rangle\).
  • \(B\) is a mapping from a state \(s\in S\) to a set of states \(B(s)\subseteq S\).
    • \(B\) limits the power of adversary: \(v(s)\in B(s)\).
  • Goal: solve for an optimal policy \(\pi^\star\) under the optimal adversary \(v^\star(\pi^\star)\) (see the max-min formulation below).
  • The SA-MDP work did not give an explicit algorithm for solving the SA-MDP, and it showed that a stationary optimal policy need not exist.
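
This max-min formulation is not written out in the source; it is an assumption consistent with the SA-MDP definitions above:

\[\pi^\star \in \arg\max_{\pi}\ \min_{v:\, v(s)\in B(s)}\ \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,a_t,s_{t+1}) \,\middle\vert\, a_t\sim\pi(\cdot\vert v(s_t)),\ s_{t+1}\sim P(\cdot\vert s_t,a_t)\right].\]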

Find the optimal adversary under a fixed policy

  • The optimal adversary leads to the worst case performance under bounded perturbation set \(B\).
    • This gives an absolute lower bound on the expected cumulative reward an agent can receive.
    • The adversary’s goal is to reduce the reward earned by the agent.

Solving for the optimal adversary \(v^\star(s)\) (itself a policy) in an MDP setting (a code sketch follows this list):

  • Environment dynamics:
    • Merge the following two parts into an MDP:
      • A fixed and stationary agent policy \(a = \pi(s_v)\), where \(s_v = \mathrm{proj}_{B(s)} (s + \Delta)\).
      • The original environment dynamics \(s' \sim P(\cdot\vert s,a)\).
  • State: \(s\), the true state.
  • Action: \(\Delta = v(s)\).
  • Reward: a weighted average of \(-R\), where the agent's action probabilities under the perturbed observation give the weights.
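
Below is a minimal sketch of this merged MDP as an environment wrapper seen from the adversary's side. It is not the paper's implementation: the class name AdversaryEnv, the classic gym reset/step API, and the choice of an \(\ell_\infty\) ball of radius eps for \(B(s)\) are all assumptions.

```python
import numpy as np
import gym


class AdversaryEnv(gym.Env):
    """Sketch of the merged MDP: the fixed agent policy and the original
    environment together form the environment in which the adversary acts."""

    def __init__(self, env, agent_policy, eps):
        self.env = env                     # original environment (true states)
        self.agent_policy = agent_policy   # fixed, stationary agent policy pi(s_v)
        self.eps = eps                     # radius of B(s), here an l-inf ball (assumption)
        self.observation_space = env.observation_space
        # The adversary's action is the perturbation Delta, same shape as the state.
        self.action_space = gym.spaces.Box(
            low=-eps, high=eps, shape=env.observation_space.shape, dtype=np.float32)

    def reset(self):
        self.s = self.env.reset()          # the adversary observes the true state s
        return self.s

    def step(self, delta):
        # s_v = proj_{B(s)}(s + Delta); for an l-inf ball this is a simple clip.
        s_v = self.s + np.clip(delta, -self.eps, self.eps)
        a = self.agent_policy(s_v)         # the agent acts on the perturbed observation
        self.s, r, done, info = self.env.step(a)  # original dynamics s' ~ P(.|s, a)
        # The per-step reward -r is a sample of the weighted average of -R above
        # when the agent's action is drawn from a stochastic policy.
        return self.s, -r, done, info
```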

Parameterize the adversary as a neural network function:

  • Any popular DRL algorithm can be used to learn an optimal adversary.
  • No existing adversarial attacks follow such a theoretically guided framework.

The algorithm is shown as follows:
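
A minimal sketch of the adversary-training loop under this framework is given below, reusing the AdversaryEnv wrapper above. The helpers adversary.sample_action and ppo_update are hypothetical placeholders for any policy network and PPO implementation; the paper trains the adversary with PPO, but any DRL algorithm could be substituted.

```python
def train_adversary(adv_env, adversary, n_iters=500, steps_per_iter=2048):
    """Learn an (approximately) optimal adversary against a fixed agent policy."""
    for _ in range(n_iters):
        trajectories = []
        s, steps = adv_env.reset(), 0
        while steps < steps_per_iter:
            delta = adversary.sample_action(s)            # perturbation Delta = v(s)
            s_next, r_adv, done, _ = adv_env.step(delta)  # r_adv = -r_agent
            trajectories.append((s, delta, r_adv, s_next, done))
            s = adv_env.reset() if done else s_next
            steps += 1
        # One policy-gradient update on the adversary network; the agent stays fixed.
        ppo_update(adversary, trajectories)
    return adversary
```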

Advantages:

  • The learned adversary is strong.
    • It does not rely on heuristics to generate perturbations.
    • The previously strongest attack (the RS attack) can successfully fool an agent into standing still, so the agent still receives a small positive reward.
    • In contrast, the learned attack can trick the agent into moving in the opposite direction of the goal, so the agent receives a large negative reward.
  • The adversary keeps improving, with the goal of making the agent receive as little reward as possible.
  • The learned adversary requires no gradient access to the policy.
    • Previous attacks are mostly gradient-based and need access to the values or gradients of the policy or value function.

Find the optimal policy under a fixed adversary

  • Based on POMDP \(\langle S,A,O,\Omega,R,P,\gamma \rangle\):
    • \(O = \bigcup_{s\in S}\{ s'\vert s'\in supp(v(\cdot\vert s)) \}\).
    • \(p(o_t = s' \vert s_t = s) = v(s_v = s' \vert s_t = s)\).
  • Use the recurrent policy gradient theorem for POMDPs.
  • Use LSTMs as function approximators for the value and policy networks.

For PPO, use the loss function:

\[L(\theta) = \mathbb{E}_{h_t}\left[\min\left[ \frac{\pi_\theta(a_t\vert h_t)}{\pi_{\theta_{old}}(a_t\vert h_t)} A_{h_t}, clip\left(\frac{\pi_\theta(a_t\vert h_t)}{\pi_{\theta_{old}}(a_t\vert h_t)}, 1-\epsilon,1+\epsilon\right)A_{h_t} \right]\right],\]

where

  • \(h_t = \{ s_v^0, a_0, \cdots, s_v^t \}\) contains the full history of perturbed states and actions.
  • \(A_{h_t}\) is a baseline advantage function modeled using an LSTM (a code sketch follows this list).
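
A minimal PyTorch sketch of a history-dependent (LSTM) actor and the clipped objective above. The layer sizes, the Gaussian action head, and the variable names are assumptions rather than the paper's architecture, and for brevity the LSTM here consumes only the observation sequence, although the history \(h_t\) also contains past actions.

```python
import torch
import torch.nn as nn


class LSTMPolicy(nn.Module):
    """History-dependent policy pi(a_t | h_t): the LSTM hidden state summarizes
    the (perturbed) observation history; a Gaussian head outputs the action."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs_seq):                   # obs_seq: [batch, T, obs_dim]
        feats, _ = self.lstm(obs_seq)
        return torch.distributions.Normal(self.mean(feats), self.log_std.exp())


def ppo_clip_loss(policy, obs_seq, actions, old_logp, advantages, eps=0.2):
    """Clipped surrogate objective on histories h_t (negated for gradient descent)."""
    dist = policy(obs_seq)
    logp = dist.log_prob(actions).sum(-1)         # sum over action dimensions
    ratio = torch.exp(logp - old_logp)            # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```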

Alternating training with learned adversaries (ATLA)

  • Train an adversary online together with the agent:
    • First keep the agent fixed and optimize the adversary (the parameters of its network).
    • Then keep the adversary fixed and optimize the agent.
  • Both adversary and agent can be updated using PPO.

Algorithm:
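
A minimal sketch of the alternating loop is given below, reusing the AdversaryEnv and train_adversary sketches above. The helper train_agent (recurrent PPO on perturbed observations), the attribute agent.act, and the iteration counts are hypothetical placeholders, not the paper's code.

```python
def atla(env, agent, adversary, n_alternations=100, eps=0.075):
    """Alternating Training with Learned Adversaries (sketch).
    Each round: (1) agent fixed, adversary learns to minimize the agent's reward;
    (2) adversary fixed, agent (LSTM policy, PPO) learns against that adversary."""
    for _ in range(n_alternations):
        # (1) Optimize the adversary against the current, fixed agent policy.
        adv_env = AdversaryEnv(env, agent_policy=agent.act, eps=eps)
        train_adversary(adv_env, adversary, n_iters=10)

        # (2) Optimize the agent under the current, fixed adversary.
        # The agent sees s_v = proj_{B(s)}(s + v(s)), not the true state s, so its
        # trajectories and advantage targets are built from perturbed observations.
        train_agent(env, agent, adversary, n_iters=10)
    return agent, adversary
```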

  • ATLA uses a strong, learned adversary that tries to find the intrinsic weaknesses of the policy.
  • To obtain a good reward, the policy must learn to defeat such an adversary.
  • Using \(s\) instead of \(s_v\) to train the advantage function and policy of the agent leads to worse performance.
    • It does not follow the theoretical framework of SA-MDP.
