Motivation

  • Fast navigation of autonomous quadrotors:
    • Push the vehicles to more aggressive maneuvers, up to their physical limits.
    • Quadrotor competitions
      • Autonomous drone racing
        • IROS
        • NeurIPS
      • AlphaPilot challenge
        • develop autonomous systems that will eventually outperform expert human pilots.

Drone racing

  • The goal is to travel through a set of waypoints as fast as possible.
  • The key challenge is planning the minimum-time trajectory.
    • Assume perfect knowledge of the waypoints to pass is available in advance.
  • The planned trajectory must satisfy all constraints:
    • system dynamics
      • linear and angular dynamics are coupled.
      • computing minimum-time trajectories requires trading off between maximizing linear and angular accelerations.
    • limited control authority (actuator limits)
  • Sampling-based method
    • RRT\(^*\)
      • can be provably optimal in the limit of infinite samples.
      • Rely on the construction of a graph.
      • commonly combined with approaches that provide a closed-form solution for trajectories.
    • drawbacks
      • A closed-form solution between two full quadrotor states is difficult to derive analytically when considering the nonlinear system dynamics. Common workarounds:
        • Approximate the quadrotor as a point mass.
        • Represent trajectories with polynomials.
          • faster computation times.
          • but fails to account for the true actuation limits.
          • leads to suboptimal or dynamically infeasible solutions.
  • Optimization-based method
    • Solve the problem with numerical optimization, using a discrete-time state-space representation of the trajectory (i.e., discretizing the trajectory in time).
    • Enforce the system dynamics and thrust limits as constraints.
    • Require very long computation times.
    • Trajectory optimization can only achieve local optima.
    • Cannot reliably compute feasible trajectories
      • for nonlinear systems,
      • for non-convex constraints,
      • or when the initialization is poor.
  • RL-based method
    • applications
      • waypoint following
      • autonomous landing
    • Neither of these applications managed to push the platform to its physical limits.

Contributions

  • A new approach to minimum-time trajectory generation for quadrotors is presented.
    • Leverages DRL with relative gate observations as the policy inputs.
    • Uses a highly parallelized sampling scheme.
    • Maximizes a 3D path-progress reward.
    • Adaptively computes near-time-optimal trajectories for random track layouts.
  • Comparison to optimization-based methods
    • Compute extremely aggressive trajectories that are close to the time-optimal solutions provided by state-of-the-art trajectory optimization.
    • Computational advantage over methods based on trajectory optimization.
    • The learned NN policy can handle large track uncertainties efficiently, which is very challenging and computationally expensive for optimization-based approaches.
  • The first learning-based approach for tackling minimum-time quadrotor trajectory planning that is adaptive to environmental changes.
    • Generalization capability
    • adaptive to different and unseen racing environments.
    • can be deployed in unknown, dynamic, or unstructured environments.
  • Achieves speeds of up to \(17\,\mathrm{m/s}\) with a physical quadrotor.
  • Video: https://youtu.be/Hebpmadjqn8.

Method

Quadrotor Dynamics

\[\begin{aligned} \dot{p} & = v, \\ \dot{v} & = q\odot c + g, \\ \dot{q} & = \frac{1}{2} \Lambda(\omega)\cdot q, \\ \dot{\omega} & = J^{-1}(\tau - \omega\times J \omega), \end{aligned}\]

where \(p\), \(v\), \(q\), and \(\omega\) denote position, velocity, attitude quaternion, and body rates, \(c\) is the mass-normalized collective thrust vector (rotated into the world frame via \(q\odot c\)), \(g\) is gravity, \(\tau\) is the body torque, and \(J\) is the inertia matrix.

Full state of the quadrotor:

\[x = [p,q,v,\omega].\]

Control inputs:

\[u = [f_1,f_2,f_3,f_4].\]

The dynamics are integrated using a 4th-order Runge–Kutta scheme: \(f_{RK4}(x,u,\Delta t)\).
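A minimal Python sketch of the RK4 integration step (function and variable names are illustrative; the paper specifies the integrator, not an implementation):

    def rk4_step(dynamics, x, u, dt):
        # One 4th-order Runge-Kutta step f_RK4(x, u, dt); the control
        # input u is held constant over the step (zero-order hold).
        k1 = dynamics(x, u)
        k2 = dynamics(x + 0.5 * dt * k1, u)
        k3 = dynamics(x + 0.5 * dt * k2, u)
        k4 = dynamics(x + dt * k3, u)
        return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)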

Reward Function

  • Use a reward that closely approximates the true performance objective.
  • Provide feedback to the agent at every time step.
  • Inspired by progress-based rewards used in autonomous car racing.

Projection-based path progress reward

  1. Represent the track center-line by straight-line segments connecting adjacent gate centers.
  2. Project the position of the quadrotor onto the current line segment to obtain its center-line position \(p_c(t)\).
  3. Define the progress reward as the progress made along the center-line between consecutive time steps:
\[r_p(t) = s(p_c(t)) - s(p_c(t-1)),\]

where \(s(\cdot)\in\mathbb{R}\) denotes the progress along the path.
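A sketch of this projection for a single segment (names are illustrative; handling segment transitions would additionally accumulate the lengths of completed segments):

    import numpy as np

    def progress_reward(pos_prev, pos_now, seg_start, seg_end):
        # r_p(t) = s(p_c(t)) - s(p_c(t-1)) on one center-line segment.
        seg = seg_end - seg_start
        seg_len = np.linalg.norm(seg)
        direction = seg / seg_len

        def s(p):
            # Arc-length of the projection, clipped to the segment.
            return np.clip(np.dot(p - seg_start, direction), 0.0, seg_len)

        return s(pos_now) - s(pos_prev)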

Safety reward

  • Incentivize the quadrotor to fly through the gate center by penalizing small safety margins.
  • Reduce the risk of crashing.
\[r_s(t) = -f^2\left( 1 - \exp\left(-\frac{d_n^2}{2v}\right) \right),\]

where

  • \(f = \max[1-\frac{d_p}{d_{max}}, 0.0]\).
  • \(v = \max[(1-f)\frac{w_g}{6},0.05]\).
  • \(d_p\) denotes the distance of the quadrotor to the gate plane.
  • \(d_n\) denotes the distance of the quadrotor to the gate normal.
  • \(w_g\) denotes the side length of the gate.
  • \(d_{max}\) specifies a threshold on the distance to the gate plane within which the safety reward is active.
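A direct transcription of the safety reward (assuming the distances \(d_p\) and \(d_n\) are provided by the simulator):

    import numpy as np

    def safety_reward(d_p, d_n, w_g, d_max):
        # r_s(t): penalizes passing the gate far from its center.
        f = max(1.0 - d_p / d_max, 0.0)
        v = max((1.0 - f) * w_g / 6.0, 0.05)
        return -f**2 * (1.0 - np.exp(-d_n**2 / (2.0 * v)))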

Final reward function

\[r(t) = r_p(t) + a \cdot r_s(t) - b \cdot \Vert \omega_t\Vert^2 + \begin{cases} r_T & \text{if gate collision} \\ 0 & \text{else} \end{cases},\]

where

  • \(r_T = - \min\left[ \left( \frac{d_g}{w_g} \right)^2, 20.0 \right]\) is a terminal reward for penalizing gate collisions.
    • \(d_g\) is the Euclidean distance from the position of the crash to the gate center.
  • \(\Vert \omega_t\Vert^2\) is added to counteract large oscillations.
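Putting the terms together (\(a\) and \(b\) are the weighting coefficients from the equation above; their values are not restated in these notes):

    import numpy as np

    def total_reward(r_p, r_s, omega, a, b, gate_collision, d_g, w_g):
        # r(t) = r_p + a*r_s - b*||omega||^2, plus the terminal
        # penalty r_T if the episode ends with a gate collision.
        r = r_p + a * r_s - b * float(np.dot(omega, omega))
        if gate_collision:
            r -= min((d_g / w_g) ** 2, 20.0)
        return r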

Observation Space

Consists of two main components:

  • quadrotor’s state \(s_t^{quad} = [v,\dot{v},R,\omega]\in\mathbb{R}^{18}\).
    • Using a rotation matrix avoids ambiguities and discontinuities in the representation of the orientation.
  • race track \(s_t^{track} = [o_1,\alpha_1,\cdots,o_N,\alpha_N]\), containing the observations of the next \(N\) gates.
    • \(o_i = [p_r,p_\theta,p_\phi]_i\in\mathbb{R}^3\) denotes the observation for gate \(i\): the spherical coordinates of the gate center relative to the quadrotor.
    • \(\alpha_i\in\mathbb{R}\) specifies the angle between the gate normal and the vector pointing from the quadrotor to the center of gate \(i\).

Apply input normalization: z-scoring.
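A sketch of the gate observation and the z-scoring (the exact spherical-coordinate convention and the use of fixed normalization statistics are assumptions; the gate normal is assumed to be a unit vector):

    import numpy as np

    def gate_observation(quad_pos, gate_center, gate_normal):
        # Spherical coordinates (p_r, p_theta, p_phi) of the gate center
        # relative to the quadrotor, plus the angle alpha between the
        # gate normal and the quad-to-gate vector.
        d = gate_center - quad_pos
        p_r = np.linalg.norm(d)
        p_theta = np.arccos(d[2] / p_r)   # polar angle
        p_phi = np.arctan2(d[1], d[0])    # azimuth
        alpha = np.arccos(np.clip(np.dot(d / p_r, gate_normal), -1.0, 1.0))
        return np.array([p_r, p_theta, p_phi]), alpha

    def z_score(x, mean, std):
        # Input normalization applied to all policy inputs.
        return (x - mean) / (std + 1e-8)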

Action Space

\[a_t = [f_1,f_2,f_3,f_4].\]

Using individual rotor commands allows for maximum versatility and aggressive flight maneuvers.
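The network typically outputs normalized actions; a common (assumed, not specified in these notes) affine mapping to rotor thrusts:

    def action_to_thrusts(a, f_min, f_max):
        # Map a policy action a in [-1, 1]^4 to individual rotor
        # thrusts [f1, f2, f3, f4] within the actuator limits.
        return f_min + 0.5 * (a + 1.0) * (f_max - f_min)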

Network architecture

Policy training

  • PPO algorithm
    • first-order policy gradient method
    • good benchmark performance and simplicity of implementation
  • Challenges
    • high dimensionality of search space
    • complexity of the required maneuvers
    • training on complex 3D race tracks with arbitrary track lengths and numbers of gates.
  • Three key ingredients to achieve stable and fast training performance
    • Parallel sampling scheme
      • 100 environments in parallel.
      • speeds up the data collection process.
      • increases the diversity of the collected environment interactions.
      • initialization strategies can be designed such that
        • rollout trajectories cover the whole race track instead of being limited to certain parts of the track.
      • avoid overfitting to a certain track configuration.
    • Distributed initialization strategy (early stages of learning)
      • if each quadrotor is initialized at the starting position
        • data collection is restricted to a small part of the state space.
        • require a large number of update steps until the policy is capable of exploring the full track.
      • employ a distributed initialization strategy
        • ensure a uniform exploration of relevant areas of the state space.
        • randomly initialize the quadrotors in a hover state around the centers of all path segments.
      • Once the policy has learned to pass gates reliably, sample initial states from previous rollout trajectories instead.
    • Random track curriculum
      • randomly sample race tracks
      • confront the agent with a new task in each episode.
      • design a track generator
        • concatenate a list of randomly generated gate primitives.
        • definition: \(T = [G_1, \cdots, G_{j+1}], j\in[1,\infty)\).
          • \(G_{j+1} = f(G_j,\Delta p, \Delta R)\).
            • \(\Delta p\) is relative position.
            • \(\Delta R\) is the orientation between two consecutive gates.
        • Adjust the range of relative poses and total number of gates.
        • Generate race tracks of arbitrary complexity and length.
      • automatically adapt the complexity of the sampled tracks based on the agent’s performance
        • can be interpreted as an automatic curriculum tailored to the capabilities of the agent.
        • Start the training by sampling tracks with only small deviations from a straight line.
        • As the policy improves, the percentage of rollouts that result in a gate crash decreases.
        • Use the crash ratio, computed across all parallel environments, to adapt the complexity of the sampled tracks.
        • Data collection is constrained to a certain class of tracks until a specified performance threshold is reached; see the sketch below.
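A sketch of the track generator and the crash-ratio-based curriculum (the update rule, sampling ranges, threshold, and the reduction of gate orientation to a yaw angle are illustrative assumptions):

    import numpy as np

    def sample_track(complexity, rng, n_gates=8):
        # Concatenate randomly generated gate primitives: each gate pose
        # (x, y, z, yaw) is sampled relative to the previous gate, with
        # sampling ranges scaled by the current curriculum complexity.
        gates = [np.zeros(4)]
        for _ in range(n_gates - 1):
            dp = rng.uniform(-5.0, 5.0, size=3) * complexity   # relative position [m]
            dyaw = rng.uniform(-np.pi, np.pi) * complexity     # relative yaw [rad]
            gates.append(gates[-1] + np.array([*dp, dyaw]))
        return gates

    def update_complexity(complexity, crash_ratio, threshold=0.2, step=0.1):
        # Widen the track distribution only once the crash ratio across
        # all parallel environments drops below the performance threshold.
        if crash_ratio < threshold:
            complexity = min(complexity + step, 1.0)
        return complexity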

Experiments

Purpose

  1. How do the lap times achieved by the RL-based method compare to optimization-based algorithms?
  2. Can the method handle uncertainties in the track layout?
  3. Can it race on a completely unknown track?
  4. Can the planned trajectories be executed on a physical quadrotor?

Setup

  • Simulator: Flightmare
    • OpenAI gym-style racing environment.
    • allows simulating hundreds of quadrotors and race tracks in parallel.
    • collects up to 25,000 environment interactions per second during training.
  • Race tracks:
    • 2019 AlphaPilot challenge
      • long straight segments
      • hairpin curves
    • 2019 NeurIPS AirSim Game of Drones challenge
      • rapid and very large elevation changes
    • Split-S (for real-world experiments)
      • introduce two vertically stacked gates
  • Robustness test
    • use random displacements of the position and yaw angle of all gates on the AlphaPilot track.
    • Train and evaluate the policy on random tracks generated by the track generator.
  • Ablation studies
  • Execute the generated trajectory on a physical platform.

Baseline Comparison

  • Tracks: three deterministic tracks
  • Baselines
    • polynomial minimum-snap trajectory generation
      • implicit smoothness property
      • but the smoothness property prohibits the exploitation of the full actuator potential at every point in time.
      • leads to sub-optimal trajectories for racing.
    • optimization-based minimum-time trajectory generation
      • provide an asymptotic lower bound on the possibly achievable time for a given race track and quadrotor model.
    • sampling-based methods (not compared)
      • due to their simplifying assumptions on the quadrotor dynamics.
      • no existing sampling-based method takes the full quadrotor state space into account.

Handling Track Uncertainties

  • Robustness test to partially unknown track layouts.
    • specify random gate displacements on the AlphaPilot track.
    • use bounded displacements of each gate’s center position and its orientation in the horizontal plane: \([\Delta x, \Delta y, \Delta z, \Delta \alpha]\).
  • Train a policy from scratch by sampling gate layouts uniformly within the individual displacement bounds (see the sketch after this list).
  • Create 100 parallel racing environments with different track configurations for each training iteration.
  • Final model is selected based on the fastest lap time on the nominal version of the track.
  • Define two metrics to evaluate the performance of the trained policy:
    • the average lap time
    • the crash ratio
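A minimal sketch of this layout sampling (the gate-pose convention and function names are assumptions):

    import numpy as np

    def randomize_gates(nominal_gates, bounds, rng):
        # Perturb each nominal gate pose (x, y, z, yaw) with bounded
        # uniform displacements [dx, dy, dz, dalpha].
        return [np.asarray(g) + rng.uniform(-bounds, bounds)
                for g in nominal_gates]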

Learning a universal racing policy

  • Generate minimum-time trajectories for completely unseen tracks.
  • Train three different policies on random track layouts
    • differ in the training strategy
      • \(\pm\) automatic track adaptation
    • and the reward
      • \(\pm\) safety reward
  • Evaluate the performance on 100 unseen, randomly sampled tracks.
  • Choose the policy with the lowest crash ratio on the validation set as the final model.
  • Adding automatic track adaptation:
    • improves performance by reducing the task complexity at the beginning of the training process.
    • take into account the current learning progress when sampling the random tracks.
  • Adding the safety reward:
    • reduce the crash ratio.

Real-world Flight

  • Quadrotor
    • carbon-fiber frame
    • brushless DC motors
    • 5-inch propellers
    • BetaFlight controller
    • thrust-to-weight ratio of around 4
  • Reference trajectory
    • Generated by performing a deterministic policy rollout for four consecutive laps on the track.
  • The reference trajectory is tracked by a model predictive controller (MPC).
  • Enables the quadrotor to fly at speeds of up to \(17\,\mathrm{m/s}\).

Computation Time Comparison

  • Compare the computation time of the learning-based approach to the optimization-based method.
  • The computation time of the optimization-based method increases drastically with the length of the track, as the number of decision variables in the optimization problem scales accordingly.
  • The computation time of the learning-based method, in contrast, does not depend on the length of the track.
  • Thus, the method can be applied to large-scale environments.

Ablation Studies

  • Number of observed gates (1, 2, or 3)
    • Using only information about a single future gate is insufficient to achieve fast lap times and minimize the crash ratio.
  • Safety reward (\(d_{max} = \{ 0.0, 2.5, 5.0 \}\))
    • The safety reward increases the average safety margin for all configurations.
    • The safety reward consistently reduces the number of crashes on the test set.

References

  • Y. Song, M. Steinweg, E. Kaufmann, and D. Scaramuzza, “Autonomous Drone Racing with Deep Reinforcement Learning,” 2021.