Motivation

  • DRL has achieved great success in some safety-critical applications, such as autonomous driving.
  • The existence of adversarial examples in DNNs and the many successful attacks on DRL motivate us to study robust DRL algorithms.
  • Problems in state observation:
    • May contain uncertainty that naturally originates from unavoidable sensor measurement errors or equipment inaccuracy.
    • Adversarial noises.
  • Since the observations deviate from the true states:
    • They can mislead the agent into making suboptimal actions.
    • A policy not robust to such uncertainty can lead to catastrophic failures.
  • Existing works:
    • Have had limited success.
    • Lack theoretical principles.
  • To ensure safety under the worst case uncertainty, consider the adversarial setting:
    • The observation is adversarially perturbed from \(s\) to \(v(s)\).
    • The underlying true environment state \(s\), where the agent actually is, remains unchanged.
    • This setting is aligned with many adversarial attacks on state observations.
  • To improve robustness under this setting:
    • Extend existing adversarial defenses for supervised learning (adversarial training).
    • Attack the agent and generate trajectories adversarially during training, then apply any existing DRL algorithm in the hope of obtaining a robust policy.
    • Drawback:
      • For most environments, naive adversarial training can make training unstable and deteriorate agent performance.
      • Does not significantly improve robustness under strong attacks.
  • DRL agents can be brittle even without any adversarial attack, with only a small amount of noise:
    • An agent may fail occasionally but catastrophically during regular rollouts.
    • A small noise added to the environment may lead the agent to low reward.
    • In practical applications, such small noise is naturally prevalent.
    • This prohibits the use of DRL in safety-critical domains (e.g., autonomous driving).
  • Related Work:
    • Robust RL:
      • Each element of RL can contain uncertainty:
        • Observation.
          • More suitable for modeling worst-case errors in the observation process (sensor errors and equipment inaccuracy).
        • Action.
        • Transition dynamics.
        • Reward.
    • Adversarial attacks on state observations:
      • FGSM-based attacks.
        • On pixel space.
      • Black-box attacks.
      • Multi-step gradient based attacks.

Contributions

  • Formulate the perturbation on state observations as a modified MDP, called state-adversarial Markov Decision Process (SA-MDP).
    • Study its fundamental properties.
  • Based on the theory of SA-MDP, propose a theoretically principled robust policy regularizer.
    • Related to the total variation (TV) distance or KL-divergence on perturbed policies.
  • Develop a theoretically principled policy regularizer that can be applied to a large family of DRL algorithms:
    • PPO.
    • DDPG.
    • DQN.
  • Improve the robustness of PPO, DDPG, and DQN agents under a suite of strong white-box adversarial attacks on state observations.
    • The Robust SARSA (RS) attack.
    • The Maximal Action Difference (MAD) attack.
  • Robust policy noticeably improves DRL performance even without an adversary in a number of environments.
  • Code.

Method

SA-MDP

Definition

  • Adversary: \(v: S \to S\).
    • Perturb the observation of the agent.
      • The worst case noise in measurement.
      • State estimation uncertainty.
    • Assumption:
      • Stationary, Deterministic and Markovian Adversary.
      • \(v(s)\) is a deterministic function.
      • Only depend on the current state \(s\).
      • Does not change over time.
    • This assumption holds for many adversarial attacks.
    • Restrict the power of the adversary:
      • Define perturbation set \(B(s)\).
        • \(B(s)\) is a set of states, \(s \in B(s)\).
          • Usually defined as an \(L_\infty\) norm ball around \(s\) with a radius \(\epsilon\): \(B(s) = \{s_v : \lVert s - s_v \rVert_\infty \le \epsilon\}\).
          • \(\epsilon\) is also referred to as the perturbation budget.
        • Contain all allowed perturbations of the adversary \(v(s) \in B(s)\).
        • \(B(s)\) is usually a set of task-specific neighbouring states of \(s\).
          • Bound sensor measurement errors.
          • The observation remains meaningful even with perturbations.
      • Restrict the adversary to perturb a state \(s\) only to a predefined set of states.
  • Action: \(\pi(a \vert v(s))\).
    • May be sub-optimal, since it is chosen based on the perturbed observation.
    • Thus, the adversary is able to reduce the reward.
  • Next state: determined by the true state \(s\), not \(v(s)\).
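
Below is a minimal sketch (NumPy-based, names hypothetical) of the \(L_\infty\) perturbation set \(B(s)\) described above: a projection that keeps any candidate \(s_v\) inside the \(\epsilon\)-ball, and a membership check.

```python
import numpy as np

def project_to_ball(s, s_v, eps):
    """Project a candidate perturbation s_v back into B(s) = {x : ||x - s||_inf <= eps}."""
    return np.clip(s_v, s - eps, s + eps)

def in_ball(s, s_v, eps):
    """Check whether s_v is an allowed perturbation of s (i.e., s_v in B(s))."""
    return np.max(np.abs(s_v - s)) <= eps

# toy 3-dimensional state with a 0.05 perturbation budget
s = np.array([0.3, -1.2, 0.8])
eps = 0.05
s_v = project_to_ball(s, s + np.array([0.07, -0.02, 0.04]), eps)
assert in_ball(s, s_v, eps) and in_ball(s, s, eps)   # s itself is always in B(s)
```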

Bellman Equations

Define the adversarial value function as follows:

\[V_{\pi,v}(s) = \mathbb{E}_{\pi,v}\left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1}\vert s_t = s \right].\]

Action-value function:

\[Q_{\pi,v}(s,a) = \mathbb{E}_{\pi,v}\left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1}\vert s_t = s, a_t = a \right].\]

Bellman equations for fixed \(\pi\) and \(v\):

\[V_{\pi,v}(s) = \sum_{a\in A} \pi(a\vert v(s)) \sum_{s'\in S}p(s'\vert s,a)\left[ R(s,a,s') + \gamma V_{\pi,v}(s') \right],\] \[Q_{\pi,v}(s,a) = \sum_{s'\in S}p(s'\vert s,a)\left[ R(s,a,s') + \gamma \sum_{a'\in A}\pi(a'\vert v(s')) Q_{\pi,v}(s',a') \right].\]

Find the value functions under the optimal (strongest) adversary \(v^\star(\pi)\):

  • Minimize the total expected reward for a fixed policy \(\pi\).

Define the optimal adversarial value and action-value functions as:

\[V_{\pi,v*}(s) = \min_v V_{\pi,v}(s),\] \[Q_{\pi,v*}(s,a) = \min_v Q_{\pi,v}(s,a).\]

Bellman contraction for the optimal adversary:

  • Similar to the policy evaluation procedure in regular MDP.
  • In the same spirit as Bellman optimality equations.
\[V_{\pi,v*}(s) = \min_{s_v \in B(s)} \sum_{a \in A}\pi(a\vert s_v) \sum_{s'\in S}p(s'\vert s,a)\left[ R(s,a,s')+\gamma V_{\pi,v*}(s') \right].\]
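
A minimal tabular sketch (toy random MDP, purely illustrative) of this fixed-point iteration: repeatedly applying the right-hand side, with the min over \(B(s)\), evaluates \(V_{\pi,v^\star}\) for a fixed policy.

```python
import numpy as np

np.random.seed(0)
n_s, n_a, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s'|s,a), rows sum to 1
R = np.random.randn(n_s, n_a, n_s)                        # R(s,a,s')
pi = np.full((n_s, n_a), 1.0 / n_a)                       # fixed policy pi(a|s)
B = {0: [0, 1], 1: [1], 2: [1, 2]}                        # perturbation sets, s in B(s)

V = np.zeros(n_s)
for _ in range(500):                                      # converges: the operator is a contraction
    V = np.array([
        min(                                              # optimal (strongest) adversary
            sum(pi[sv, a] * P[s, a] @ (R[s, a] + gamma * V) for a in range(n_a))
            for sv in B[s]
        )
        for s in range(n_s)
    ])
```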

Find an optimal policy \(\pi^\star\) under the strongest adversary \(v^\star(\pi)\) in the SA-MDP setting:

\[V_{\pi^\star,v^\star(\pi^\star)}(s) \ge V_{\pi,v^\star(\pi)}(s), \quad \forall s \in S, \forall \pi.\]

Theorem: There exists an SA-MDP and some stochastic policy \(\pi \in \Pi_{MR}\) such that we cannot find a better deterministic policy \(\pi' \in \Pi_{MD}\) satisfying \(V_{\pi',v*(\pi')}(s) \ge V_{\pi,v*(\pi)}(s)\) for all \(s\in S\).

  • Some stochastic policies are better than any other deterministic policies in SA-MDP.
  • In contrast, in a regular MDP, for any stochastic policy we can find a deterministic policy that is at least as good.
  • Even when considering both deterministic and stochastic policies, we cannot always find an optimal one.
    • Under the optimal adversary \(v^\star\), an optimal policy \(\pi^\star \in \Pi_{MR}\) does not always exist for SA-MDP.
  • Therefore, it is difficult to find an optimal policy under the optimal adversary.

Despite that, under certain assumptions, the loss in performance due to an optimal adversary can be bounded:

  • If the differences between the action distributions under state perturbations are not too large, the performance gap between \(V_{\pi,v*(\pi)}(s)\) and \(V_\pi(s)\) can be bounded.
  • Regularizing \(D_{TV}\) during training can yield a policy that is robust under strong adversaries.
    • Since the state perturbation \(s_v\) is bounded within \(B(s)\), the resulting \(D_{TV}\) can be kept bounded.
\[\max_{s\in S}\left[ V_\pi(s) - V_{\pi,v*(\pi)}(s) \right] \le \alpha \max_{s\in S}\max_{s_v \in B(s)}D_{TV}(\pi(\cdot \vert s), \pi(\cdot \vert s_v)),\]

where \(\alpha\) is a constant that does not depend on \(\pi\):

\[\alpha = 2\left[ 1 + \frac{\gamma}{(1-\gamma)^2} \right] \max_{(s,a,s')\in S\times A \times S}\lvert R(s,a,s')\rvert.\]
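
For a quick sense of scale (illustrative numbers, not from the paper), with \(\gamma = 0.99\) and \(\max_{(s,a,s')}\lvert R(s,a,s')\rvert = 1\):

\[\alpha = 2\left[ 1 + \frac{0.99}{(1-0.99)^2} \right]\cdot 1 = 2(1 + 9900) = 19802,\]

so the constant grows quickly as \(\gamma \to 1\), but the bound still shrinks as the maximal \(D_{TV}\) term is driven toward zero.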

SA-MDP on PPO

  • Stochastic policies.

The total variation distance is not easy to compute for most distributions, so we upper bound it once more by the KL-divergence:

\[[D_{TV}(\pi(a\vert s), \pi(a\vert s_v))]^2 \le D_{KL}(\pi(a\vert s),\pi(a\vert s_v)).\]

Define the policies as multivariate Gaussian distributions:

\[\pi(a\vert s)\sim N(\mu_s,\Sigma_s), \quad \pi(a\vert s_v)\sim N(\mu_{s_v}, \Sigma_{s_v}).\]

The KL-divergence can be given as:

\[D_{KL}(\pi(a\vert s),\pi(a\vert s_v)) = \frac{1}{2} \left( \log\lvert\Sigma_{s_v}\Sigma_s^{-1}\rvert +tr(\Sigma_{s_v}^{-1}\Sigma_s) + (\mu_{s_v} - \mu_s)^T \Sigma_{s_v}^{-1}(\mu_{s_v} - \mu_s) - \lvert A \rvert \right),\]

where

  • The mean terms \(\mu_s,\mu_{s_v}\) are produced by neural networks \(\mu_{\theta}(s), \mu_{\theta}(s_v)\).
  • \(\Sigma = \Sigma_{s_v} = \Sigma_s\) is a diagonal matrix independent of state \(s\).

Regularize KL-divergence for all \(s_v \in B(s)\):

  • Lead to a smaller upper bound of the loss in performance.
  • Directly related to agent performance under optimal adversary.
  • Since \(\Sigma_{s_v} = \Sigma_s = \Sigma\), the log-determinant term vanishes and the trace term cancels with \(-\lvert A\rvert\); only the quadratic term remains, giving the robust policy regularizer:
\[\mathcal{L}_{PPO}(\theta_\mu) = \frac{1}{2}\sum_s \max_{s_v \in B(s)}\left( \mu_{\theta_\mu}(s_v) - \mu_{\theta_\mu}(s) \right)^T \Sigma^{-1} \left( \mu_{\theta_\mu}(s_v) - \mu_{\theta_\mu}(s) \right) = \frac{1}{2} \sum_s \max_{s_v \in B(s)} \mathcal{L}_s(s_v,\theta_\mu).\]
  • The \(\max_{s\in S}\) term is replaced with a more practical and optimizer-friendly summation over all states in the sampled trajectories.

Minimizing the above objective is challenging because it is a minimax problem.

  • Since \(\nabla_{s_v} \mathcal{L}_s(s_v,\theta_\mu)\vert_{s_v = s} = 0\), vanilla gradient ascent initialized at \(s_v = s\) cannot solve the inner maximization problem.
  • Propose two first-order approaches to solve it:
    • Convex relaxation based method.
      • Enables an efficient analysis of the output bounds of a neural network.
      • Leverage it as a generic optimization tool for solving minimax problems involving neural networks.
      • We can obtain an upper bound \(\bar{\mathcal{L}}_s(\theta_\mu)\) such that \(\bar{\mathcal{L}}_s(\theta_\mu) \ge \mathcal{L}_s(s_v,\theta_\mu)\) for all \(s_v\in B(s)\).
      • Computing \(\bar{\mathcal{L}}_s(\theta_\mu)\) is only a constant factor slower than computing \(\mathcal{L}_s(s,\theta_\mu)\) when an efficient relaxation is used.
    • Stochastic Gradient Langevin Dynamics (SGLD) based solver.

Then, the minimax problem can be solved by minimizing the upper bound instead:

\[\min_{\theta_\mu}\frac{1}{2}\sum_s \bar{\mathcal{L}}_s(\theta_\mu) \ge \min_{\theta_\mu}\frac{1}{2}\sum_s\max_{s_v\in B(s)}\mathcal{L}_s(s_v,\theta_\mu) = \min_{\theta_\mu}\mathcal{L}_{PPO}(\theta_\mu).\]

Then, we can add \(\mathcal{L}_{PPO}\) as part of the policy optimization objective, and solve PPO using SGD as usual.
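
A minimal PyTorch sketch (names hypothetical: mu_net is the policy's mean network, sigma_diag the diagonal of the shared \(\Sigma\)) of this regularizer with the SGLD inner-maximization solver; the convex-relaxation variant would instead compute the upper bound \(\bar{\mathcal{L}}_s(\theta_\mu)\).

```python
import torch

def sa_ppo_regularizer(mu_net, states, sigma_diag, eps, steps=10, lr=0.01, beta=1e-5):
    """Approximate 0.5 * sum_s max_{s_v in B(s)} (mu(s_v)-mu(s))^T Sigma^{-1} (mu(s_v)-mu(s))."""
    mu_s = mu_net(states).detach()                              # anchor mu_theta(s)
    s_v = states + eps * (2 * torch.rand_like(states) - 1)      # random start inside B(s)
    for _ in range(steps):                                      # SGLD: ascent + noise + projection
        s_v = s_v.detach().requires_grad_(True)
        diff = mu_net(s_v) - mu_s
        obj = (diff.pow(2) / sigma_diag).sum()
        grad, = torch.autograd.grad(obj, s_v)
        s_v = s_v + lr * grad + torch.randn_like(s_v) * (2.0 * lr * beta) ** 0.5
        s_v = torch.clamp(s_v, states - eps, states + eps)      # project back into B(s)
    diff = mu_net(s_v.detach()) - mu_net(states)                # gradients flow to theta only here
    return 0.5 * (diff.pow(2) / sigma_diag).sum()
```

The returned value is added to the PPO objective with a regularization weight and optimized with SGD as usual.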

SA-MDP on DDPG

  • Deterministic policy.

The total variation distance \(D_{TV}(\pi(\cdot\vert s), \pi(\cdot\vert s_v))\) is degenerate for a deterministic policy.

  • The densities at different states \(s\) and \(s_v\) are very likely to be completely non-overlapping, so the distance is almost always 1.

To address this issue:

  • Define a smoothed version of policy \(\bar{\pi}(a\vert s)\).
    • Add independent Gaussian noise with variance \(\sigma^2\) to each action: \(\bar{\pi}(a\vert s)\sim N(\pi(s),\sigma^2 I_{\lvert A \rvert})\).

Then, the total variation distance can be computed by:

\[D_{TV}(\bar{\pi}(\cdot\vert s), \bar{\pi}(\cdot\vert s_v)) = \sqrt{\frac{2}{\pi}}\frac{d}{\sigma} + O(d^3),\]

where \(d = \lVert \pi(s) - \pi(s_v)\rVert_2\).

Then, if we can penalize \(\sqrt{\frac{2}{\pi}}\frac{d}{\sigma}\), \(D_{TV}\) can be bounded.

Define the robust policy regularizer for DDPG as:

\[\mathcal{L}_{DDPG}(\theta_\pi) = \sqrt{\frac{2}{\pi}}\frac{1}{\sigma} \sum_s \max_{s_v \in B(s)}\lVert \pi_{\theta_\pi}(s) - \pi_{\theta_\pi}(s_v)\rVert_2.\]
  • For each state in a sampled batch of states, we need to solve a maximization problem:
    • Using SGLD.
    • Convex relaxation.

Then, during training time our goal is to keep \(\mathcal{L}_{DDPG}\) small.
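
A minimal PyTorch sketch (hypothetical helper; actor is the deterministic policy network) of \(\mathcal{L}_{DDPG}\), approximating the inner maximization with a few projected gradient ascent steps instead of SGLD or convex relaxation.

```python
import math
import torch

def sa_ddpg_regularizer(actor, states, eps, sigma=0.1, steps=5, lr=0.01):
    """Approximate sqrt(2/pi)/sigma * sum_s max_{s_v in B(s)} ||pi(s) - pi(s_v)||_2."""
    a_s = actor(states).detach()                                 # clean actions pi(s)
    s_v = states + eps * (2 * torch.rand_like(states) - 1)       # random start inside B(s)
    for _ in range(steps):                                       # projected gradient ascent
        s_v = s_v.detach().requires_grad_(True)
        dist = (actor(s_v) - a_s).norm(dim=-1).sum()
        grad, = torch.autograd.grad(dist, s_v)
        s_v = torch.clamp(s_v + lr * grad.sign(), states - eps, states + eps)
    d = (actor(states) - actor(s_v.detach())).norm(dim=-1)       # gradients flow to theta_pi
    return math.sqrt(2.0 / math.pi) / sigma * d.sum()
```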

SA-MDP on DQN

  • The action space is finite.
  • The deterministic action is determined by the maximum \(Q\) value.
    • \(\pi(a\vert s) = 1\), when \(a = \arg\max_{a'}Q(s,a')\).
    • \(\pi(a\vert s) = 0\), otherwise.

Then, the total variation distance is:

\[D_{TV}(\pi(\cdot\vert s),\pi(\cdot\vert s_v)) = \begin{cases} 0 & \text{if } \arg\max_a \pi(a\vert s) = \arg\max_a \pi(a\vert s_v), \\ 1 & \text{otherwise.} \end{cases}\]
  • We want to make the top-1 action stay unchanged after perturbation.

Define the robust policy regularizer as:

\[\mathcal{L}_{DQN}(\theta) = \sum_s \max\left\{ \max_{s_v\in B(s)} \max_{a\neq a^\star}\left[Q_\theta(s_v,a) - Q_\theta(s_v,a^\star(s))\right], -c \right\},\]

where \(a^\star(s) = \arg\max_a Q_\theta(s,a)\) and \(c\) is a small positive constant.

  • The sum is over all \(s\) in a sampled batch.
  • It is analogous to the robustness objective in classification tasks.
    • Treat \(a^\star(s)\) as the correct label.
  • The maximization can be solved using:
    • Projected Gradient Descent (PGD).
    • Convex relaxation of neural networks.

Then, add \(\mathcal{L}_{DQN}\) as part of the optimization objective.
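
A minimal PyTorch sketch (hypothetical helper; q_net maps a batch of states to \(Q\) values for all actions) of \(\mathcal{L}_{DQN}\), approximating the inner maximization with PGD rather than convex relaxation.

```python
import torch
import torch.nn.functional as F

def _worst_margin(q_net, s_v, a_star):
    """max_{a != a*(s)} Q(s_v, a) - Q(s_v, a*(s)) for each state in the batch."""
    q = q_net(s_v)
    q_star = q.gather(1, a_star.unsqueeze(1))
    mask = F.one_hot(a_star, q.shape[-1]).bool()                 # exclude a = a*(s)
    return (q - q_star).masked_fill(mask, float('-inf')).max(dim=-1).values

def sa_dqn_regularizer(q_net, states, eps, c=0.5, steps=5, lr=0.01):
    with torch.no_grad():
        a_star = q_net(states).argmax(dim=-1)                    # a*(s) from clean states
    s_v = states + eps * (2 * torch.rand_like(states) - 1)       # random start inside B(s)
    for _ in range(steps):                                       # PGD on the inner maximization
        s_v = s_v.detach().requires_grad_(True)
        grad, = torch.autograd.grad(_worst_margin(q_net, s_v, a_star).sum(), s_v)
        s_v = torch.clamp(s_v + lr * grad.sign(), states - eps, states + eps)
    return torch.clamp(_worst_margin(q_net, s_v.detach(), a_star), min=-c).sum()
```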

Adversarial Attacks under the Assumption

  • Discuss a few strong adversarial attacks under the stationary, deterministic, and Markovian adversary assumption.
  • Propose two novel attacks for DDPG and PPO that do not rely on the critic learned during training:
    • Robust SARSA (RS) attack.
    • Maximal Action Difference (MAD) attack.

Critic based attack

Many works use the gradient of \(Q(s,a)\) to provide the direction to update states adversarially in \(K\) steps:

Define the adversarial state as \(s_v = s^K\), obtained by the iteration

\[s^{k+1} = \mathrm{proj}\left[ s^k - \eta \nabla_{s^k}Q(s^0,\pi(s^k)) \right], \quad k=0,\cdots,K-1.\]

Where

  • \(proj[\cdot]\) is a projection to \(B(s)\).
  • \(\eta\) is the learning rate.
  • \(s^0\) is the state under attack.

It attempts to find a state \(s_v\) triggering an action \(\pi(s_v)\) minimizing the action-value at state \(s^0\).

  • Requires a \(Q\) function to find the best perturbed state \(s_v\) (a minimal sketch follows).
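
A minimal PyTorch sketch (hypothetical names; q_net is an \((s,a)\)-critic as in DDPG) of this critic-based attack:

```python
import torch

def critic_based_attack(q_net, actor, s0, eps, K=10, eta=0.01):
    """K projected gradient steps decreasing Q(s^0, pi(s^k)); returns s_v in B(s^0)."""
    s = s0 + eps * (2 * torch.rand_like(s0) - 1)                 # start inside B(s^0)
    for _ in range(K):
        s = s.detach().requires_grad_(True)
        q_val = q_net(s0, actor(s)).sum()                        # Q(s^0, pi(s^k))
        grad, = torch.autograd.grad(q_val, s)
        s = torch.clamp(s - eta * grad, s0 - eps, s0 + eps)      # descend Q, then proj[.]
    return s.detach()
```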

This attack has a few drawbacks:

  • Attack strength strongly depends on critic quality.
    • If \(Q\) is poorly learned, the attack fails because the gradient gives no correct update direction, e.g., when \(Q\):
      • is not robust against small perturbations;
      • has obfuscated gradients.
  • It relies on a \(Q\) function that is specific to the training process.
  • Not applicable to many actor-critic methods (TRPO, PPO) using a learned value function \(V(s)\) instead of \(Q\).
    • \(V(s_v)\) represents the value of \(s_v\) rather than the value of taking \(\pi(s_v)\) at \(s^0\).
      • \(V(s_v) = \max_a Q(s_v,a) \neq \max_a Q(s^0,a)\).

Robust SARSA (RS) attack

Since \(\pi\) is fixed during evaluation:

  • We can learn its corresponding \(Q^\pi(s,a)\) using on-policy TD algorithms without knowing the critic network used during training.
  • The robustness of \(Q^\pi(s,a)\) is very important.
    • If it is not robust against small perturbations:
      • Given a state \(s_0\), a small change in \(a\) can significantly change \(Q^\pi(s_0,a)\), which does not reflect the true action-value.
      • It cannot provide a good direction for attacks.

Use the on-policy TD-learning algorithm (SARSA) to learn \(Q^\pi(s,a)\) with an additional robustness objective to minimize:

\[L_{RS}(\theta) = \sum_{i\in [N]}\left[ r_i + \gamma Q_{RS}^\pi(s_i',a_i') - Q_{RS}^\pi(s_i,a_i) \right]^2 + \lambda_{RS} \sum_{i\in [N]}\max_{\hat{a}\in B(a_i)}(Q_{RS}^\pi(s_i,\hat{a}) - Q_{RS}^\pi(s_i,a_i))^2,\]

where

  • The first summation is the TD-loss.
  • The second summation is the robustness penalty with regularization \(\lambda_{RS}\).
  • \(N\) is the batch size; each batch contains \(N\) tuples of transitions \((s,a,r,s',a')\) sampled from agent rollouts.
  • \(B(a_i)\) is a small set near action \(a_i\).
  • The inner maximization can be solved using convex relaxation of neural networks.
  • Use \(Q_{RS}^\pi\) to perform critic-based attacks.

The strength of this attack does not depend on the quality of the critic learned during training; a minimal sketch of the objective follows.
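
This PyTorch sketch (hypothetical helper; q_rs is the robust SARSA critic being trained) approximates \(L_{RS}\), solving the inner maximization over \(B(a_i)\) with PGD instead of convex relaxation:

```python
import torch

def robust_sarsa_loss(q_rs, batch, gamma=0.99, lam=0.1, eps_a=0.05, steps=5, lr=0.01):
    """SARSA TD loss plus a robustness penalty on Q_RS."""
    s, a, r, s_next, a_next = batch                              # N transitions (s, a, r, s', a')
    with torch.no_grad():
        target = r + gamma * q_rs(s_next, a_next).squeeze(-1)
    td_loss = (target - q_rs(s, a).squeeze(-1)).pow(2).sum()

    q_sa = q_rs(s, a).detach()
    a_hat = a + eps_a * (2 * torch.rand_like(a) - 1)             # random start inside B(a)
    for _ in range(steps):                                       # PGD on the robustness penalty
        a_hat = a_hat.detach().requires_grad_(True)
        pen = (q_rs(s, a_hat) - q_sa).pow(2).sum()
        grad, = torch.autograd.grad(pen, a_hat)
        a_hat = torch.clamp(a_hat + lr * grad.sign(), a - eps_a, a + eps_a)
    robustness = (q_rs(s, a_hat.detach()) - q_rs(s, a)).pow(2).sum()
    return td_loss + lam * robustness
```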

Maximal Action Difference (MAD) attack

  • An effective attack that does not depend on a critic.
  • Find an adversarial state \(s_v\) by maximizing \(D_{KL}(\pi(\cdot\vert s),\pi(\cdot\vert s_v))\).

Let

\[L_{MAD}(s_v) = - D_{KL}(\pi(\cdot\vert s),\pi(\cdot\vert s_v)).\]

Then, find \(s_v\) by:

\[\begin{aligned} \arg\min_{s_v \in B(s)} L_{MAD}(s_v) & = \arg\max_{s_v\in B(s)} D_{KL}(\pi(\cdot\vert s),\pi(\cdot\vert s_v)) \\ & = \arg\max_{s_v\in B(s)} (\pi_\theta(s) - \pi_\theta(s_v))^T\Sigma^{-1}(\pi_\theta(s) - \pi_\theta(s_v)). \end{aligned}\]

The objective can be optimized using SGLD to find a good \(s_v\).

For DDPG, set \(\Sigma = I\).
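
A minimal PyTorch sketch (hypothetical names; policy_mean returns \(\mu_\theta(s)\) for PPO or the deterministic action for DDPG) of the MAD attack, using SGLD as described above:

```python
import torch

def mad_attack(policy_mean, s, eps, sigma_diag=None, steps=10, lr=0.01, beta=1e-5):
    """SGLD search for s_v in B(s) maximizing the (Gaussian) KL surrogate above."""
    mu_s = policy_mean(s).detach()
    if sigma_diag is None:
        sigma_diag = torch.ones_like(mu_s)                       # for DDPG, Sigma = I
    s_v = s + eps * (2 * torch.rand_like(s) - 1)                 # random start inside B(s)
    for _ in range(steps):
        s_v = s_v.detach().requires_grad_(True)
        diff = policy_mean(s_v) - mu_s
        obj = (diff.pow(2) / sigma_diag).sum()                   # (pi(s)-pi(s_v))^T Sigma^{-1} (.)
        grad, = torch.autograd.grad(obj, s_v)
        s_v = s_v + lr * grad + torch.randn_like(s_v) * (2.0 * lr * beta) ** 0.5
        s_v = torch.clamp(s_v, s - eps, s + eps)                 # project back into B(s)
    return s_v.detach()
```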

References

  • H. Zhang, H. Chen, C. Xiao, B. Li, D. Boning, and C. J. Hsieh, “Robust deep reinforcement learning against adversarial perturbations on state observations,” Neural Information Processing Systems, 2020.