Position Control of a Quadrotor with Proximal Policy Optimization
Motivation
- The quadrotor is a proven platform for research in control and navigation systems.
- Most quadrotor control approaches rely on a model of the quadrotor and its dynamics, which are nonlinear and may carry inaccuracies because it is impossible to model every aspect of the plant's dynamics.
- Intelligent controllers perform well even when the system is nonlinear and unknown; their main advantage is that no mathematical model of the plant is needed.
- The controller acts as a black box, approximating the model from the gathered data.
- However, it is hard to guarantee performance and fast convergence.
- Policy gradient methods:
  - Deterministic policy gradients [1]
    - Value/advantage estimates with lower variance.
    - But a good exploration strategy is required to explore the state space efficiently.
  - Stochastic policy gradients [2]
    - Better sample efficiency.
    - No additional exploration strategy is needed to achieve stability.
    - Not many episodes are required for training.
Contributions
- Present a quadrotor control approach that uses PPO with stochastic policy gradients.
- Verify the feasibility of applying RL methods to optimize a stochastic control policy for position control of a quadrotor, while maintaining good sample efficiency and allowing fast convergence.
- Implement the simulation in V-REP.
Method
RL
The goal of RL is to develop a policy \(\pi\) that maps states to probabilities of selecting each possible action. Training iterates over:
- Generating samples (run the policy).
- Fitting a model / Estimating the return.
- Improving the policy \(\pi\).
- Repeating this process until the policy has converged to the optimal \(\pi^*\).
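The loop above can be sketched with a toy problem. Everything here (the one-state bandit environment, the tabular policy, the learning rate) is a made-up stand-in, not the paper's setup; it only illustrates the sample/estimate/improve cycle.

```python
import random

random.seed(0)

class ToyEnv:
    """One-state, two-action bandit: action 1 pays reward 1, action 0 pays 0."""
    def step(self, action):
        return 1.0 if action == 1 else 0.0

def run_policy(env, p1, episodes=50):
    """Step 1: generate samples under the current stochastic policy."""
    samples = []
    for _ in range(episodes):
        a = 1 if random.random() < p1 else 0
        samples.append((a, env.step(a)))
    return samples

def estimate_returns(samples):
    """Step 2: estimate the return of each action from the gathered samples."""
    totals, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for a, r in samples:
        totals[a] += r
        counts[a] += 1
    return {a: totals[a] / counts[a] if counts[a] else 0.0 for a in (0, 1)}

def improve(p1, returns, lr=0.2):
    """Step 3: shift probability mass toward the action with the higher return."""
    step = lr if returns[1] > returns[0] else -lr
    return min(1.0, max(0.0, p1 + step))

# Step 4: repeat until the policy converges (here, to always picking action 1).
p1 = 0.5
for _ in range(10):
    p1 = improve(p1, estimate_returns(run_policy(ToyEnv(), p1)))
```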
Policy gradient:
\[g = \mathbb{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t \vert s_t) A_t \right].\]

PPO
- PPO’s hyperparameters are robust across a large variety of tasks.
- It has high performance and low computational complexity (first-order optimization).
- It guarantees monotonic improvement.
- For continuous action spaces, the policy network outputs a probability distribution: the means and variances of a multivariate Gaussian.
- During training, actions are randomly sampled from this distribution to increase exploration.
- When testing, the mean is taken as the action, which turns the stochastic policy into a deterministic one.
- GAE is employed to reduce the variance of the advantage estimates.
- To increase sample efficiency, importance sampling is used to obtain the expectation over samples gathered from an old policy.
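The clipped surrogate that PPO optimizes can be sketched in pure Python. The ratio and advantage values here are made-up illustrations, not data from the paper:

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def ppo_surrogate(ratios, advantages, eps=0.2):
    """Batch average of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    terms = [min(r * a, clip(r, 1 - eps, 1 + eps) * a)
             for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

ratios = [0.5, 1.0, 1.5]       # pi_theta / pi_theta_old, per sample
advantages = [1.0, -1.0, 2.0]  # GAE advantage estimates, per sample
value = ppo_surrogate(ratios, advantages)
```

The clipping removes the incentive to push the probability ratio outside \([1-\epsilon, 1+\epsilon]\): in the third sample above, the unclipped term \(1.5 \times 2 = 3.0\) is replaced by \(1.2 \times 2 = 2.4\).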
The details of PPO are as follows.
Conservative policy iteration (CPI) objective:
\[L^{CPI}(\theta) = \mathbb{E}_t \left[ \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\theta_{old}}(a_t \vert s_t)} A_t \right].\]

Consider the probability ratio:
\[r(\theta) = \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\theta_{old}}(a_t \vert s_t)}.\]

Then, the clipped objective function is given by
\[L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left(r(\theta) A_t,\ \mathrm{clip}(r(\theta), 1-\epsilon, 1+\epsilon) A_t\right) \right].\]

Action
- velocities of each of the four motors (PWM ratio \(0\) to \(100\%\)).
State
- quadrotor/target position error
- rotation matrix
- linear velocities
- angular velocities
NNs are known to converge faster when the input states share a common scale, so the network does not need to learn this scaling itself. The components of the state vector are expressed in different units and have different dynamic ranges, so the state vector should be normalized: estimate a running mean and variance of each state component and normalize based on these estimates.
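A minimal running-statistics normalizer, using Welford's online algorithm; the paper does not specify its exact estimator beyond "running mean and variance", so this is one plausible sketch:

```python
class RunningNorm:
    """Per-component running mean/variance normalization of a state vector."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations

    def update(self, state):
        """Fold one observed state into the running statistics (Welford)."""
        self.n += 1
        for i, x in enumerate(state):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, state, eps=1e-8):
        """Return the state scaled to roughly zero mean and unit variance."""
        var = [m / max(self.n - 1, 1) for m in self.m2]
        return [(x - m) / (v + eps) ** 0.5
                for x, m, v in zip(state, self.mean, var)]
```

In use, each raw state would be passed through `update` and then `normalize` before being fed to the policy network.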
Reward
\[r_t(e_{x_t}, e_{y_t}, e_{z_t}) = a - \sqrt{e_{x_t}^2 + e_{y_t}^2 + e_{z_t}^2},\]where \(a\) is a constant used to assure the reward is always positive.
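In code, the reward is just the offset minus the Euclidean position error. The value `a = 4.0` below is an assumed example; the paper only requires \(a\) to be large enough to keep the reward positive over the reachable error range:

```python
import math

def reward(ex, ey, ez, a=4.0):
    """Constant offset minus Euclidean position error (a = 4.0 is an assumption)."""
    return a - math.sqrt(ex ** 2 + ey ** 2 + ez ** 2)
```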
Code
Simulation
- V-REP.
- Initialize the quadrotor in a random position with random orientation.
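The simulator maps commanded PWM duty ratio to propeller thrust via the quadratic fit given below; as a quick sanity check of those coefficients:

```python
def thrust(pwm):
    """Propeller thrust force for a PWM duty ratio in [0, 100] (quadratic fit)."""
    return 1.5618e-4 * pwm ** 2 + 1.0395e-2 * pwm + 0.13894

mid_thrust = thrust(50.0)  # thrust at 50% PWM
```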
PWM \(\to\) Propeller thrust force:
\[Tr(pwm) = 1.5618\times 10^{-4}\,pwm^2 + 1.0395\times 10^{-2}\,pwm + 0.13894.\]

Evaluation
- Stochastic policy
  - The quadrotor irregularly flies around the setpoint due to the non-zero variance of the continuous probability distribution over the actions.
- Deterministic policy - Fixed target
  - The quadrotor stabilizes around the setpoint, with an apparent steady-state error.
  - It is also capable of recovering from harsh initial conditions (e.g., a 90-degree tilt), which demonstrates the policy's generalization capability.
- Deterministic policy - Moving target
  - The quadrotor completed the trajectory (the setpoint moves at a velocity of 1.5 m/s) smoothly.
The stochastic policy is turned into a deterministic one by ignoring the variance of the distribution over its actions and considering only the mean value.
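That switch amounts to a single branch in action selection. A sketch with an illustrative Gaussian policy head (the means and standard deviations here are placeholders for network outputs):

```python
import random

def select_action(mean, std, training):
    """Sample from the Gaussian during training; take the mean at test time."""
    if training:
        return [random.gauss(m, s) for m, s in zip(mean, std)]
    return list(mean)  # variance ignored -> deterministic policy
```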
Future Work
- New reward signals.
- Transfer ideas from classic control theory to reduce the observed steady-state error.
- Consider multiple initialization strategies to improve the controller's performance.
- Reduce the reality gap: transfer the policy learned in simulation to real-world flight.
- Evaluate the robustness of the controller to environment uncertainty.
References
1. J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, “Control of a Quadrotor with Reinforcement Learning,” IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2096–2103, 2017.
2. G. C. Lopes, M. Ferreira, A. Da Silva Simoes, and E. L. Colombini, “Intelligent control of a quadrotor with proximal policy optimization reinforcement learning,” in Proc. 15th Latin American Robotics Symposium (LARS/SBR/WRE), 2018, pp. 509–514.