Motivation

Quadrotor

  • Applications: dirty, dangerous, dull tasks.
    • Search and rescue missions.
    • Various surveillance tasks.
    • Acting as mobile nodes in communication networks.
    • Quick delivery systems.
      • Medical supplies.
  • Some serious problems:
    • How to hold algorithmically driven robots responsible for devastating actions?
    • Cheaper robots lower the bar of entry to warfare, which motivates efforts to restrict the use of autonomous systems in warfare.
  • Control:
    • Lacking in stability compared to traditional designs.
      • Smaller units have lower moments of inertia and are therefore more susceptible to complex aerodynamic effects.
    • Require intricate control design in order to guarantee stable flight.
      • Requirements on computational speed and accuracy.
        • The dynamics are quick and non-linear.
    • Control schemes:
      • Hierarchical:
        • Lower level controllers:
          • control of electrical signals to rotors.
          • attitude controller.
          • positional and altitude controller.
        • Higher level controllers:
          • path planning.
          • problem solving.
          • cooperation between humans or other units.
    • Related works:
      • PID.
      • LQR.
      • \(H_\infty\).
      • Sliding mode.
      • MPC.
      • RL:
        • Attitude control.
        • Positional control.
        • S2R problem:
          • Some works do not transfer to real hardware and instead test on notably different simulated setups.
          • Others transfer successfully to real hardware but observe some performance degradation.
          • Yet others implement methods for handling the S2R gap and see improved transfer results.
          • None of these works implemented the robust versions of the underlying RL formulation.

S2R Transfer

  • Sim-to-Reality (S2R) transfer in RL is a promising approach to the problem of costly exploration in real systems.
    • Modifying the training such that safety guarantees hold for the training policy.
      • Safe RL.
      • Modifications to the objective function or exploration policy are studied to minimize the risk of costly exploration.
    • Improving the transferability of simulator-trained policies to actual systems.
      • Transfer Learning in RL.
      • The agent's performance on environments different from the one it was trained in.
  • Problem:
    • S2R solves the costly exploration problem, but introduces robustness and generalization concerns.
    • The generalization of transferring policies from simulators to real systems.
    • Even if the agent finds an optimal policy inside the simulated environment, it is difficult to guarantee anything about the performance on the target environments.
  • Related works:
    • Making the S2R gap smaller:
      • Making the simulated environment behave more closely to the target environment.
      • Accurate parameter estimation and more precise system identification.
      • Combining rollouts in the real environment and using the trajectories as feedback to improve the model's behavior.
      • Drawbacks:
        • It is assumed that there exists a single target environment, while there may be multiple possible target environments.
        • It is difficult or costly to gather information from the target environments.
    • Consider random environments:
      • Training on several model dynamics sampled from an a priori parameter distribution, an approach which has been shown to generalize better to real systems.
      • Drawbacks:
        • The exact dynamics of the target environments are difficult to emulate.
    • Robust control:
      • Design controllers to be tolerant to misspecifications between models and real systems.
      • The controlled system in robust control is assumed to be unknown but bounded.
      • One of the most popular robust controller designs is the \(H_\infty\) controller.
        • Instead of optimizing for the average system within the bounds, it optimizes for the worst-case system.
        • Guarantees stability for all systems inside the bounds.
        • Thus, it is popular for situations where robustness is required.
  • Ideas:
    • Robust MDP.
    • RL + Robust Control.
    • Create agents with embedded uncertainty about the simulated environment.
    • Train the agent in a simple environment and test on versions of the simulator with different environment parameters.
      • Training environment.
      • Target environments.
  • Results:
    • Agents with a higher level of robustness outperformed the standard agents in these environments.
    • The added robustness increases generality and can help when transferring policies from simulators to reality.

Method

  • Goal: Formulate a training method for RL agents to bridge the S2R gap.
  • Robust MDP:
    • Uncertainty sets can be expressed in the state space instead of as probability distributions over the state space.

Framework

Model Uncertainty

  • Express the uncertainty of the transition model through some uncertainty set in which all the possible transition models lie.

Inner Problem Definition

For discrete state space:

  • The uncertain transition model can be expressed as a set of transition matrices, each consisting of finitely many transition probabilities from and to each state given an action.
  • Define the inner problem as:
\[\min_{p\in P} \mathbb{E}_p[V^\pi(s_{t+1})] = \min_{p \in P}\sum_{i=1}^{\vert S \vert} p(s_i\vert s_t,a_t) V^\pi(s_i).\]
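
As an illustration, the discrete inner problem can be solved by direct enumeration when the uncertainty set \(P\) is a finite collection of candidate transition vectors. The following is a minimal sketch; the function and variable names are illustrative, not taken from the thesis.

```python
import numpy as np

def worst_case_expected_value(P_candidates, V):
    """Discrete inner problem: pick the transition vector p(.|s_t, a_t)
    in the finite uncertainty set that minimizes E_p[V^pi(s_{t+1})].

    P_candidates: array of shape (num_models, |S|), each row a probability
                  vector over next states (illustrative representation).
    V:            array of shape (|S|,) holding the state values V^pi(s_i).
    """
    expected_values = P_candidates @ V       # E_p[V] for every candidate p
    worst = int(np.argmin(expected_values))  # index of the worst-case model
    return expected_values[worst], worst
```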

For continuous state space:

  • Remove the enumerability of the states.
  • The inner problem can be transformed into:
\[\min_{p\in P} \mathbb{E}_p[V^\pi(s_{t+1})] = \min_{p\in P}\int_S p(s\vert s_t,a_t)V^\pi(s)ds.\]

The above inner problem expectation is difficult to evaluate, since \(V^\pi\) in this setting is a highly non-linear function.

Then, make a major simplifying assumption about this problem:

  • In deterministic environments, the transition distributions in \(P\) are degenerate, and the expectation evaluation becomes trivial.
  • Let \(P^U\) be a set of transition models with degenerate distributions \(p^u\) parameterized by states \(u\).
  • Then, all probability density is located at \(u\).
\[\min_{p^u\in P^U} \mathbb{E}_{p^u}[V^\pi(s_{t+1})] = \min_{p^u\in P^U}\int_S p^u(s\vert s_t,a_t)V^\pi(s)ds = \min_{p^u\in P^U} V^\pi(u).\]

Uncertainty Set

  • View the uncertainty set as a possibility to handle unmodelled behaviors (which are difficult to quantify).
    • Wind.
    • Consider the sum of all of the unmodelled dynamics as some noise added to the transition dynamics.
  • Reasons:
    • The parameters of the system are not estimated.
    • The model is crude and a lot of uncertainty would probably come from unmodelled dynamics.
    • The uncertainty modeling becomes simpler to implement.

The deterministic model can be seen as a degenerate distribution:

  • Parameterized by the point in state space where the entirety of the probability density is located.
  • The uncertainty set can then be reinterpreted as a set over the state space instead of over the transition dynamics.

Consider an uncertainty set around each observed state \(s\) in a trajectory:

\[S_U(s) = \{ s+u\vert \forall u \in U(s) \},\]

where \(U(s)\) is an a priori asserted uncertainty set.

Let \(U(s)\) be an \(L_2\) ball in the state space:

\[U_\rho(s) = \{ u\in \mathbb{R}^S \vert \lVert u \rVert_2 < \rho \},\]

where \(\rho\) is the size of the uncertainty, which can be viewed as a radius.

Solving Inner Problem

Minimize the value function given the uncertainty set:

\[\min_{u\in U}V_\phi^\pi(s+u).\]

Algorithm:

  • Hyperparameters (a sketch of the procedure follows this list):
    • \(k\) (width): how many candidate points are computed in parallel.
    • \(n\) (depth): how many gradient steps are taken before convergence is assumed.
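
A minimal PyTorch sketch of one way to realize this procedure: \(k\) random perturbations inside the \(L_2\) ball of radius \(\rho\) are refined with \(n\) gradient steps on \(V_\phi(s+u)\) and projected back onto the ball after each step. The optimizer choice, learning rate, and initialization are assumptions, not the thesis's exact implementation.

```python
import torch

def solve_inner_problem(value_fn, s, rho, k=8, n=5, lr=0.01):
    """Approximate min_{||u||_2 < rho} V_phi(s + u) with projected gradient
    descent from k parallel starting points (width) and n gradient steps
    (depth). `value_fn` maps a batch of states to scalar values."""
    s = s.detach()
    # k random perturbations inside the L2 ball of radius rho.
    u = torch.randn(k, s.shape[-1])
    u = rho * u / u.norm(dim=-1, keepdim=True) * torch.rand(k, 1)
    u.requires_grad_(True)
    optimizer = torch.optim.Adam([u], lr=lr)
    for _ in range(n):
        optimizer.zero_grad()
        values = value_fn(s.unsqueeze(0) + u).squeeze(-1)
        values.sum().backward()          # points are independent, so summing is fine
        optimizer.step()
        with torch.no_grad():            # project back onto the rho-ball
            norms = u.norm(dim=-1, keepdim=True).clamp(min=1e-8)
            u.mul_(torch.clamp(rho / norms, max=1.0))
    with torch.no_grad():
        values = value_fn(s.unsqueeze(0) + u).squeeze(-1)
        best = values.argmin()
    return values[best].item(), u[best].detach()
```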

Robust PPO

  • Replace the training targets based on the standard Bellman equations with ones based on the robust Bellman equations.

Estimate \(\hat{R}\) and \(\hat{A}\)

  • Using truncated \(n\)-step TD to estimate.
  • Apply a decaying scheme for parameter \(\lambda\).
    • In order to reduce computation, handle episodic tasks, and reduce reliance on overly long traces.

Define the \(n\)-step robust returns for a trajectory:

\[\begin{aligned} \hat{R}_t^{\rho(1)} & = r_t + \gamma \min_{u\in U_\rho}V^\pi(s_{t+1} + u), \\ \hat{R}_t^{\rho(2)} & = r_t + \gamma r_{t+1} + \gamma^2 \min_{u\in U_\rho}V^\pi(s_{t+2} + u), \\ & \vdots \\ \hat{R}_t^{\rho(n)} & = \sum_{k=0}^{n-1}\gamma^k r_{t+k} + \gamma^n \min_{u\in U_\rho}V^\pi(s_{t+n} + u). \end{aligned}\]

Define the \(n\)-step robust advantage estimates:

\[\hat{A}_t^{\rho(n)} = \hat{R}_t^{\rho(n)} - V^\pi(s_t).\]

Using Truncated TD(\(\lambda\)) to estimate \(\hat{R}_t\):

\[\begin{aligned} \hat{R}_t^{\rho TTD(\lambda,N)} & = (1-\lambda)\hat{R}_t^{\rho(1)} + \lambda\left( (1-\lambda)\hat{R}_t^{\rho(2)} + \lambda(\cdots) \right) \\ & = (1-\lambda)\sum_{n=1}^{N-1}\lambda^{n-1}\hat{R}_t^{\rho(n)} + \lambda^{N-1}\hat{R}_t^{\rho(N)}. \end{aligned}\]

Using Truncated GAE(\(\lambda\)) to estimate \(\hat{A}_t\):

\[\hat{A}_t^{\rho TGAE(\lambda,N)} = (1-\lambda)\sum_{n=1}^{N-1}\lambda^{n-1}\hat{A}_t^{\rho(n)} + \lambda^{N-1}\hat{A}_t^{\rho(N)}.\]
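
A sketch of how the robust \(n\)-step returns and the truncated targets could be assembled from a single rollout, assuming the inner-problem solver above already provides \(\min_{u\in U_\rho}V^\pi(s_t+u)\) for every state in the trajectory (variable names are illustrative). Note that since the mixture weights sum to one, \(\hat{A}_t^{\rho TGAE(\lambda,N)}\) reduces to \(\hat{R}_t^{\rho TTD(\lambda,N)} - V^\pi(s_t)\).

```python
import numpy as np

def robust_targets(rewards, robust_values, values, gamma, lam, N):
    """Compute robust TTD(lambda, N) returns and TGAE(lambda, N) advantages.

    rewards:       r_t for t = 0..T-1
    robust_values: min_{u in U_rho} V(s_t + u) for t = 0..T (length T+1)
    values:        V(s_t)                      for t = 0..T (length T+1)
    """
    T = len(rewards)
    returns, advantages = np.zeros(T), np.zeros(T)
    for t in range(T):
        # n-step robust returns R_t^{rho(n)}, n = 1..N (truncated at episode end).
        n_max = min(N, T - t)
        R_n = np.zeros(n_max)
        discounted = 0.0
        for n in range(1, n_max + 1):
            discounted += gamma ** (n - 1) * rewards[t + n - 1]
            R_n[n - 1] = discounted + gamma ** n * robust_values[t + n]
        # Truncated TD(lambda): geometric mixture of the n-step returns.
        weights = (1 - lam) * lam ** np.arange(n_max)
        weights[-1] = lam ** (n_max - 1)   # remaining probability mass on the last term
        returns[t] = np.dot(weights, R_n)
        advantages[t] = returns[t] - values[t]
    return returns, advantages
```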

Algorithm: Robust PPO.

  • Incorporate a decaying scheme for the trace parameter \(\lambda\):
    • High \(\lambda\) greatly improves training in early iterations when the critic network has not yet converged to a useful approximator.
    • \(\lambda=0\) is needed in order to represent the correct robust equations for training targets.
    • A linear decaying scheme is used, where \(\lambda_{decay}\) controls the speed at which the trace reaches zero (see the sketch below).
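
One possible linear decay schedule is sketched below; the exact scheme used in the thesis is not reproduced here, so treat the form as an assumption.

```python
def decayed_lambda(lam_init, lam_decay, iteration):
    """Linearly reduce the trace parameter by lam_decay per training
    iteration until it reaches zero (assumed decay scheme)."""
    return max(0.0, lam_init - lam_decay * iteration)
```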

Robust TD-error to fit the value function approximator:

\[\phi_{k+1} = \arg\min_\phi\frac{1}{\vert B\vert T}\sum_{\tau\in B}\sum_{t=0}^T\left( V_\phi(s_t) - \hat{R}_t^{\rho TTD(\lambda,N)} \right)^2.\]

Robust advantage estimation to fit the policy function:

\[\theta_{k+1} = \arg\max_\theta J^{PPO}(\theta\vert \hat{A}_t^{\rho TGAE(\lambda,N)}).\]
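
The notes do not spell out \(J^{PPO}\); assuming the standard clipped-surrogate PPO objective, the two fitting steps might look as follows. The clipping parameter `eps` and the `log_prob` helper on the policy are assumptions.

```python
import torch

def critic_loss(value_fn, states, robust_returns):
    """Squared robust TD-error used to fit the value function V_phi."""
    return torch.mean((value_fn(states).squeeze(-1) - robust_returns) ** 2)

def ppo_surrogate_loss(policy, old_log_probs, states, actions, robust_adv, eps=0.2):
    """Clipped PPO surrogate with robust advantages (negated for a minimizer)."""
    log_probs = policy.log_prob(states, actions)   # assumed helper on the policy
    ratio = torch.exp(log_probs - old_log_probs)   # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.mean(torch.min(ratio * robust_adv, clipped * robust_adv))
```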

Robust PPO for UAV positional control

State Input

\[(p_t^e, v, \Theta, \omega, e_{p_t}, e_{v_t}, e_{\sigma_t}).\]
  • \(p_t^e = p_t^g - p_t\): the position error between the goal position \(p_t^g\) and the current position \(p_t\).

Action Output

\[(\phi_d,\theta_d,\dot{\psi}_d, T_d).\]
  • Remove the yaw rate from the action space by setting it to a constant \(\dot{\psi}_d = 0\).
  • Rescale the possible range for the remaining action outputs:
    • Angles \(\phi_d, \theta_d\) range: \([-10^\circ,10^\circ]\).
    • Thrust \(T_d\) range: \([10000,65535]\).
  • Normalize the actions to \([-1,1]\) (rescaling sketched below).
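
A sketch of the mapping from normalized actions to physical commands; the linear scaling and the helper name are assumptions, only the ranges come from the list above.

```python
import numpy as np

ANGLE_RANGE = (-10.0, 10.0)        # desired roll/pitch in degrees
THRUST_RANGE = (10000.0, 65535.0)  # digital thrust command range

def denormalize_action(a):
    """Map a normalized action a = (phi_d, theta_d, T_d) in [-1, 1]^3 to
    physical commands; the yaw rate is fixed to 0 (linear scaling assumed)."""
    phi_d = np.interp(a[0], [-1.0, 1.0], ANGLE_RANGE)
    theta_d = np.interp(a[1], [-1.0, 1.0], ANGLE_RANGE)
    thrust = np.interp(a[2], [-1.0, 1.0], THRUST_RANGE)
    return phi_d, theta_d, 0.0, thrust  # (phi_d, theta_d, psi_dot_d, T_d)
```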

Environment terminal time

  • We are mainly interested in the agent's behavior before it reaches the goal position.
  • Episodic tasks are therefore needed.
  • Add a time limit \(T_{limit}\).
    • In order to let the agent see initial states more often.
  • End the episode when (terminal-value handling sketched below):
    • A failure state is reached.
      • The terminal value is set to \(V(s_T) = -1\).
    • The maximum number of interactions (the time limit) has been reached.
      • Bootstrap with the critic's own estimate \(V(s_T)\), since the state is not a true terminal state.
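
A sketch of the corresponding terminal-value handling; the exact failure conditions are not specified in these notes, so the `failed` flag is an assumption.

```python
def terminal_value(value_fn, s_T, failed):
    """Terminal value for a finished episode: failures receive a fixed
    penalty of -1, while time-limit truncations bootstrap with the
    critic's estimate, since s_T is not a true terminal state."""
    return -1.0 if failed else float(value_fn(s_T))
```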

Neural Networks

  • MLP.
  • Critic network:
    • The output layer is a simple linear layer with one output.
  • Actor network:
    • Two output layers representing parameters for Gaussian distributions (a sketch follows this list):
      • Mean values \(\mu_\theta(s)\in [-1,1]\) with tanh activation function.
      • Standard deviations \(\sigma_\theta(s)\in[0,1]\) with sigmoid activation function, in order to impose limits on the variance of the distribution.
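
A minimal PyTorch sketch of the described actor and critic; the hidden-layer sizes and activations of the MLP bodies are assumptions, only the output heads follow the description above.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.value = nn.Linear(hidden, 1)   # simple linear layer with one output

    def forward(self, s):
        return self.value(self.body(s))

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, action_dim)  # tanh -> mu in [-1, 1]
        self.std_head = nn.Linear(hidden, action_dim)   # sigmoid -> sigma in (0, 1)

    def forward(self, s):
        h = self.body(s)
        mu = torch.tanh(self.mean_head(h))
        sigma = torch.sigmoid(self.std_head(h))
        return torch.distributions.Normal(mu, sigma)
```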

Regularization

Definition:

  • Regularization is a general term for modifications to objective functions that guide optimization in ill-posed problems.
  • In ML, it is widely used to combat overfitting.

Formulate the regularization to objective function as:

\[J(\theta) + \sum_{i=1}^I \beta_i l_i(\theta),\]

where \(\beta_i\) is the regularization weight to set the strength of the regularization term.

Using regularization in robust PPO (a combined sketch follows this list):

  • Entropy regularization for actor network.
    • Trade-off between exploration and exploitation.
      • Exploitation: using current knowledge to collect high rewards.
      • Exploration: searching the policy space to potentially find better policies to follow.
    • Usually resolved with heuristic methods.
    • Entropy can be seen as the uncertainty of a distribution.
      • \(H(p) = -\int_X p(x)\log p(x)\,dx\).
    • Entropy regularization: \(l_H(\theta) = \mathbb{E}_{\pi_\theta}[H(\pi_\theta(a_t\vert s_t))]\).
      • Prevent too deterministic policies.
      • Encourage a higher level of exploration in the on-policy setting.
  • \(L_1\) regularization for both actor and critic networks.
    • Tend to favor sparsity in the weight vector.
    • Reduce the model complexity.
      • Tends towards simpler models by pruning model weights that do not impact the original objective significantly.
    • Parameter weights regularization:
      • For actor network weights: \(l_{L_1}(\theta) = \lVert \theta \rVert_1\).
      • For critic network weights: \(l_{L_1}(\phi) = \lVert \phi \rVert_1\).
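
A sketch of how the entropy and \(L_1\) terms could be attached to the base PPO losses; the regularization weights and the function signature are illustrative.

```python
import torch

def regularized_losses(actor, critic, dist, actor_loss, critic_loss,
                       beta_entropy=0.01, beta_l1=1e-4):
    """Add entropy regularization (actor) and L1 weight regularization
    (actor and critic) to the base PPO losses; the beta weights are
    illustrative hyperparameters."""
    entropy = dist.entropy().sum(-1).mean()   # higher entropy -> more exploration
    l1_actor = sum(p.abs().sum() for p in actor.parameters())
    l1_critic = sum(p.abs().sum() for p in critic.parameters())
    actor_total = actor_loss - beta_entropy * entropy + beta_l1 * l1_actor
    critic_total = critic_loss + beta_l1 * l1_critic
    return actor_total, critic_total
```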

Simulation

Implementation

  • RL training: RLlib framework.
    • Distribute training across multiple processes.
    • Parallelize the agent rollouts.
  • Wrapper: OpenAI Gym API.
  • NNs: PyTorch.
  • Gradient method: Adam.

Training

  • Train multiple agents with values of the robustness radius \(\rho\in[0.001,1]\).
  • Compare these agents to the baseline of a regular PPO agent with \(\rho=0\).

Evaluation

  • Test the robustness of the policy in target environments.
    • Different to the training environment.
  • Target environments:
    • Different quadrotor mass: \(\Delta m\in [0.7,1.3]\).
    • Different motor coefficients:
      • Change the coefficients of one of the motors in the model, corresponding to a damaged or worn motor.
      • The motors are modeled with a set of polynomial coefficients mapping digital driver values to physical values \([c_T,b_T,a_T,b_\omega,a_\omega,b_\psi,a_\psi]\).
      • Multiply these coefficients by a factor: \(\Delta \lambda \in [0.3,1]\).
    • Different PID tuning:
      • PID parameters of internal attitude controller.
      • Multiply a factor \(\Delta K_p\in[0.1,1.9]\) to the proportional gain \(K_p\) of the roll and pitch controllers: \([PID_\phi.K_p, PID_\theta.K_p, PID_{\dot{\phi}}.K_p, PID_{\dot{\theta}}.K_p]\).
  • Performance metric:
    • Use the episodic average step reward measured from an episode: \(\bar{r}(\tau) = \frac{1}{T}\sum_{t=0}^{T-1}r_t\).
  • Robust performance (evaluation sketched below):
    • Remove the stochasticity in the policy by picking the mean value \(\mu_\theta\) of the policy distribution.
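
A sketch of the deterministic evaluation rollout that records the episodic average step reward; the Gym-style `env` (old 4-tuple `step` API) and the `actor` returning a `Normal` distribution are assumed from the earlier sketches.

```python
import numpy as np
import torch

def evaluate(actor, env, max_steps=1000):
    """Deterministic rollout in a target environment: act with the policy
    mean mu_theta (no sampling) and return the average step reward."""
    state, rewards, done = env.reset(), [], False
    with torch.no_grad():
        while not done and len(rewards) < max_steps:
            dist = actor(torch.as_tensor(state, dtype=torch.float32))
            action = dist.mean.numpy()                 # drop the stochasticity
            state, reward, done, _ = env.step(action)  # old Gym 4-tuple API
            rewards.append(reward)
    return float(np.mean(rewards))
```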

Results

  • Convergence seems clear for agents with lower values of \(\rho\), but asymptotic performance does taper off as the radius grows.
    • Performance dropping after \(\rho = 0.1\).
  • Robustness of the agents:
    • Running each agent in different target environments.
    • The robust agents outperform the nominal agent under the more extreme environment differences.
    • The robust agents do not suffer from an immediate performance drop in the nominal environment.
    • More robust agents perform similarly to, or slightly better than, the non-robust agent for environments close to the nominal one.
    • The robust agents reduce oscillations and static errors.
  • Static errors:
    • especially in the \(z\)-axis.
    • Static errors are well understood in control theory, where they are handled by controllers with integral action.
      • Integral action is, however, inherently non-Markovian, since it depends on the history of errors.

Future Work

  • Robust MDP for continuous control
    • Deterministic systems.
      • Moving all the uncertainty into the second order uncertainty sets (on state observations).
    • Non-deterministic systems.
      • Require measures (e.g., distances) on probability distributions.
    • Consider other uncertainty sets on state observations.
  • Robustness of other RL algorithms
    • DQN
    • DDPG
    • SAC
  • RL for UAV control
    • End-to-end control.
    • Static errors.
      • Handle via integral action.
      • Extend the state space to include the sum of error signals.
        • Increase the dimensionality of the state space and the search space for policies.

References

  • L. Bjarre, “Robust Reinforcement Learning for Quadcopter Control,” 2019.