Motivation

  • The quadrotor is a proven platform for research in control and navigation systems.
  • Most quadrotor control approaches rely on a model of the quadrotor and its dynamics, which are nonlinear and may carry inaccuracies because not all aspects of the plant’s dynamics can be modelled.
  • Intelligent controllers perform well even when the system is nonlinear and unknown; their main advantage is that no mathematical model of the plant is needed.
  • The controller acts as a black box and approximates the model from the gathered data.
  • However, it is hard to guarantee performance and fast convergence.
  • Policy gradient methods:
    • Deterministic policy gradients [1]
      • Value/advantage estimations with lower variance.
      • However, a good exploration strategy is required to explore the state space efficiently.
    • Stochastic policy gradients [2]
      • Better sample efficiency.
      • No need to use any additional exploration strategy to achieve stability.
      • Not too many episodes are required for training.

Contributions

  • Present a quadrotor control approach that uses PPO with stochastic policy gradients.
  • Verify the feasibility of applying RL methods to optimize a stochastic control policy for position control of a quadrotor, while maintaining good sampling efficiency and allowing fast convergence.
  • Implement the simulation via V-REP.

Method

RL

The goal of RL is to learn a policy \(\pi\) that maps states to probabilities of selecting each possible action. The general procedure is:

  1. Generating samples (run the policy).
  2. Fitting a model / Estimating the return.
  3. Improving the policy \(\pi\).
  4. Repeating this process until the policy has converged to the optimal \(\pi^*\).

Policy gradient:

\[g = \mathbb{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t \vert s_t) A_t \right].\]
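As a rough illustration, the surrogate below has the estimator above as its gradient. The linear-Gaussian policy and toy batch are assumptions made only to keep the snippet self-contained; they are not details from the paper.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins (assumed, not from the paper): a linear-Gaussian policy over a 1-D action.
theta = torch.zeros(4, requires_grad=True)        # policy parameters
states = torch.randn(8, 4)                        # batch of 8 sampled states s_t
advantages = torch.randn(8)                       # advantage estimates A_t

mean = states @ theta                             # action mean per state
dist = torch.distributions.Normal(mean, 1.0)      # pi_theta(. | s_t)
actions = dist.sample()                           # a_t ~ pi_theta(. | s_t)
log_probs = dist.log_prob(actions)                # log pi_theta(a_t | s_t)

# Differentiating this surrogate w.r.t. theta yields a Monte Carlo estimate of g.
surrogate = (log_probs * advantages).mean()
surrogate.backward()                              # theta.grad now holds the estimate of g
```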

PPO

  • PPO’s hyperparameters are robust for a large variety of tasks.
  • It has high performance and low computational complexity (first-order optimization).
  • Guarantee monotonic improvement.
  • For continuous action spaces, the policy network outputs the parameters of a probability distribution: the means and variances of a multivariate Gaussian (see the sketch after this list).
  • During training, actions are randomly sampled from this distribution to increase exploration.
  • When testing, the mean is taken as the action, which turns the stochastic policy into a deterministic one.
  • Generalized advantage estimation (GAE) is employed to reduce the variance of the advantage estimates.
  • To increase sample efficiency, importance sampling is used to compute the expectation from samples gathered under an old policy.
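A minimal sketch of such a Gaussian policy head in PyTorch (the layer sizes and the state-independent log-std parameterization are assumptions, not details from the paper): actions are sampled during training, and the mean is returned at test time.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian over actions: sample while training, take the mean when testing."""

    def __init__(self, state_dim: int = 18, action_dim: int = 4, hidden: int = 64):
        # state_dim 18 = 3 position error + 9 rotation matrix + 3 linear + 3 angular velocities
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        # State-independent log standard deviations (an assumed, common parameterization).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor, deterministic: bool = False):
        mean = self.mean_head(self.body(state))
        if deterministic:                      # testing: mean -> deterministic policy
            return mean, None
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()                 # training: sample to keep exploring
        return action, dist.log_prob(action).sum(-1)
```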

The details of PPO are given below.

Conservative policy iteration (CPI) objective:

\[L^{CPI}(\theta) = \mathbb{E}_t \left[ \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\theta_{old}}(a_t \vert s_t)} A_t \right].\]

Consider the probability ratio:

\[r(\theta) = \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\theta_{old}}(a_t \vert s_t)}.\]

Then, the clipped objective function is given by

\[L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left(r(\theta) A_t,\; \mathrm{clip}(r(\theta), 1-\epsilon, 1+\epsilon) A_t\right) \right].\]
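A minimal sketch of the clipped objective as a PyTorch loss term (the tensor shapes and the \(\epsilon = 0.2\) default are assumptions; the surrogate is negated because optimizers minimize):

```python
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate L^CLIP, averaged over the batch."""
    ratio = torch.exp(log_probs_new - log_probs_old)                 # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Here `log_probs_old` are recorded when the trajectories are collected under \(\pi_{\theta_{old}}\); the ratio \(r(\theta)\) is exactly the importance-sampling weight mentioned above.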

Action

  • velocities of each of the four motors (PWM ratio \(0\) to \(100\%\)).

State

  • quadrotor/target position error
  • rotation matrix
  • linear velocities
  • angular velocities

NNs are known to converge faster when the input states share a common scale, so that the network does not need to learn this scaling itself. Since the state components are expressed in different units and have different dynamic ranges, the state vector should be normalized: estimate a running mean and variance of each state component and normalize based on these estimates.
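A minimal sketch of such running normalization (a Welford-style online update in NumPy; the unit-Gaussian prior pseudo-count and the epsilon are assumptions of this sketch):

```python
import numpy as np

class RunningNormalizer:
    """Tracks a running mean/variance per state component and normalizes with them."""

    def __init__(self, dim: int):
        # Start from a unit-Gaussian prior pseudo-observation (an assumption of this sketch).
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 1.0

    def update(self, x: np.ndarray) -> None:
        """Welford-style online update with one new observation."""
        self.count += 1.0
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x: np.ndarray) -> np.ndarray:
        return (x - self.mean) / np.sqrt(self.var + 1e-8)
```

At test time the statistics are typically frozen and reused, so the policy sees inputs on the same scale it was trained on.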

Reward

\[r_t(e_{x_t}, e_{y_t}, e_{z_t}) = a - \sqrt{e_{x_t}^2 + e_{y_t}^2 + e_{z_t}^2},\]

where \(a\) is a constant used to ensure the reward is always positive.
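As a sketch (the value of \(a\) is an assumption here, chosen only so the reward stays positive inside the training volume):

```python
import math

A_OFFSET = 4.0  # assumed constant "a"; the notes only state it keeps the reward positive

def position_reward(e_x: float, e_y: float, e_z: float) -> float:
    """Reward = a minus the Euclidean distance between quadrotor and target."""
    return A_OFFSET - math.sqrt(e_x**2 + e_y**2 + e_z**2)
```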

Code

Simulation

  • V-REP.
  • Initialize the quadrotor in a random position with random orientation.

PWM \(\to\) Propeller thrust force:

\[T_r(\mathrm{pwm}) = 1.5618 \times 10^{-4}\,\mathrm{pwm}^2 + 1.0395 \times 10^{-2}\,\mathrm{pwm} + 0.13894.\]
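In code, with the coefficients exactly as given above (pwm is the duty cycle in percent; the thrust unit is not stated in these notes):

```python
def pwm_to_thrust(pwm: float) -> float:
    """Quadratic fit mapping a PWM duty cycle (0-100 %) to propeller thrust."""
    return 1.5618e-4 * pwm**2 + 1.0395e-2 * pwm + 0.13894

# Example: thrust commanded at a 50 % duty cycle.
print(pwm_to_thrust(50.0))  # ≈ 1.05 (in the model's thrust unit)
```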

Evaluation

  • Stochastic policy
    • the quadrotor flies irregularly around the setpoint due to the non-zero variance of the continuous action distribution.
  • Deterministic policy - Fixed target
    • the quadrotor stabilizes around the setpoint, with an apparent steady-state error.
    • it is also capable of recovering from harsh initial conditions (e.g., a 90-degree initial orientation), which demonstrates the policy's generalization capability.
  • Deterministic policy - Moving target
    • the quadrotor smoothly completed the trajectory (the setpoint moves at 1.5 m/s).

The stochastic policy is turned into a deterministic one by ignoring the variance of the distribution over actions and using only its mean.

Future Work

  • Design new reward signals.
  • Transfer ideas from classic control theory to reduce the observed steady-state error.
  • Consider multiple initialization strategies to improve the controller’s performance.
  • Reduce the reality gap and transfer the policy learned in simulation to real-world flight.
  • Evaluate the robustness of the controller to environment uncertainty.

References

  1. J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, “Control of a Quadrotor with Reinforcement Learning,” IEEE Robot. Autom. Lett., vol. 2, no. 4, 2017, pp. 2096–2103. 

  2. G. C. Lopes, M. Ferreira, A. Da Silva Simoes, and E. L. Colombini, “Intelligent control of a quadrotor with proximal policy optimization reinforcement learning,” in Proceedings - 15th Latin American Robotics Symposium, 6th Brazilian Robotics Symposium and 9th Workshop on Robotics in Education, LARS/SBR/WRE, 2018, pp. 509–514.