Contributions

Proposes an end-to-end policy network that takes the flight scenario as input and directly outputs thrust-attitude control commands for the quadrotor.

  • Present an imitate-reinforce training framework
  • Fly through a narrow gap using an end-to-end policy network
  • Propose a robust imitation of an optimal trajectory generator using multilayer perceptrons
  • Show how reinforcement learning can improve the performance of imitation learning

Code and Results

Code: https://github.com/hku-mars/crossgap_il_rl

Results: https://www.youtube.com/watch?v=jU1qRcLdjx0

End-to-end Method vs. Traditional Pipeline

Traditional pipeline

Perception-Planning-Control:

  1. Estimate the robot state
  2. Build a map of the surrounding environment with SLAM
  3. Plan a smooth, optimal trajectory based on the map
  4. Track the generated trajectory with a low-level geometric controller

Usually applied in low-speed, static environments.

For aggressive maneuvers in cluttered, dynamic environments, such as drone racing through bushes or indoors, this pipeline becomes quite challenging: SLAM and trajectory optimization are memory- and computationally expensive, and their performance degrades during aggressive, dynamic maneuvers in non-static environments.

End-to-end method

Train a control policy that directly maps sensory inputs to control outputs.

Shorter pipeline: less computation time

Challenges:

  • Training data must be collected in trial tests
  • The trained policy network has no mathematical guarantee of stability or robustness

Method

Goal: Drone flying through a narrow gap

  • maximum speed up to \(3\,\mathrm{m/s}\)
  • orientation angle up to \(60^{\circ}\)

Planning Network

Inputs: \(17 \times 1\) vector

Outputs: \(9 \times 1\) vector comprising the planned relative position, velocity, and acceleration \(\Delta p_p, v_p, a_p\)

Loss Function:

\[Loss = w_p \Vert \Delta p_l - \Delta p_p \Vert^2 + w_v \Vert v_l - v_p \Vert^2 + w_a \Vert a_l - a_p \Vert^2 + g,\]

where \(\Delta p_l, v_l, a_l\) are the relative position, velocity, and acceleration labels generated by a conventional motion planner.
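
A minimal sketch of how the planning network and this loss could look, in PyTorch, is shown below; the hidden-layer sizes, the loss weights, and the reading of \(g\) as a weight-regularization term are assumptions for illustration, not the authors' implementation.

```python
# Minimal PyTorch sketch of the planning network and its imitation loss.
# Hidden sizes, the loss weights, and treating the `g` term as weight
# regularization are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class PlanningNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(17, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 9),   # [delta_p_p (3), v_p (3), a_p (3)]
        )

    def forward(self, x):
        return self.mlp(x)

def planning_loss(pred, label, w_p=1.0, w_v=1.0, w_a=1.0):
    """Weighted squared errors against the labels (delta_p_l, v_l, a_l)
    from a conventional motion planner; the regularization term `g` is
    delegated to the optimizer's weight decay here."""
    dp_p, v_p, a_p = pred[:, 0:3], pred[:, 3:6], pred[:, 6:9]
    dp_l, v_l, a_l = label[:, 0:3], label[:, 3:6], label[:, 6:9]
    return (w_p * (dp_l - dp_p).pow(2).sum(dim=1)
            + w_v * (v_l - v_p).pow(2).sum(dim=1)
            + w_a * (a_l - a_p).pow(2).sum(dim=1)).mean()

net = PlanningNet()
x = torch.randn(32, 17)   # batch of flight-scenario inputs
y = torch.randn(32, 9)    # labels from the conventional planner
planning_loss(net(x), y).backward()
```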

Controller Network

Inputs: \(12 \times 1\) vector consisting of the tracking errors \(e_p, e_v, e_a\) and the Euler attitude

Outputs: \(3 \times 1\) vector: \(\phi_p \in [-30^{\circ},30^{\circ}]\), \(\theta_p \in [-30^{\circ},30^{\circ}]\) (desired roll and pitch), and \(\mu_p \in [0,1]\) (normalized thrust)

Loss Function:

\[Loss = w_{thr} \vert \mu_l - \mu_p \vert + w_{eul} e_{l,p} + g,\]

where \(\mu_l\) is the labeled thrust and \(e_{l,p}\) is the error between the labeled and predicted Euler angles.
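
Below is a matching sketch for the controller network, assuming a tanh/sigmoid output scaling to obtain the bounded attitude and thrust commands and reading \(e_{l,p}\) as an L1 error on the commanded angles; both choices are illustrative assumptions.

```python
# Minimal PyTorch sketch of the controller network. The tanh/sigmoid output
# scaling to the stated command ranges and the reading of e_{l,p} as an L1
# error on the commanded angles are illustrative assumptions.
import torch
import torch.nn as nn

class ControllerNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(12, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x):
        raw = self.body(x)
        angles = 30.0 * torch.tanh(raw[:, 0:2])   # phi_p, theta_p in [-30, 30] deg
        thrust = torch.sigmoid(raw[:, 2:3])       # mu_p in [0, 1]
        return torch.cat([angles, thrust], dim=1)

def controller_loss(pred, label, w_thr=1.0, w_eul=1.0):
    """|mu_l - mu_p| on thrust plus an Euler-angle error term; as above,
    the `g` regularization term is left to the optimizer."""
    ang_err = (label[:, 0:2] - pred[:, 0:2]).abs().sum(dim=1)   # e_{l,p} (assumed L1)
    thr_err = (label[:, 2] - pred[:, 2]).abs()
    return (w_thr * thr_err + w_eul * ang_err).mean()
```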

End-to-end Planning and Control

Inputs:

  • \(17 \times 1\) vector for planning network
  • \(12 \times 1\) vector of current state

The tracking errors fed to the controller network are computed from the planned (subscript \(p\)) and current (subscript \(c\)) quantities:

  • \(e_p = \Delta p_p - \Delta p_c\).
  • \(e_v = v_p - v_c\).
  • \(e_a = a_p - a_c\).

Outputs: \(3 \times 1\) vector, \(\phi_p, \theta_p, \mu_p\)
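
The sketch below illustrates how the two networks could be cascaded at run time, with the controller input assembled from the planning output, the current state, and the Euler attitude; `PlanningNet` and `ControllerNet` refer to the illustrative sketches above.

```python
# Sketch of the cascaded end-to-end step: the planning network's output is
# compared with the current relative position, velocity, and acceleration to
# form the 9-D tracking error, which is stacked with the 3-D Euler attitude
# as the controller input. PlanningNet / ControllerNet refer to the sketches
# above; the state layout is an assumption.
import torch

def end_to_end_step(plan_net, ctrl_net, scenario, dp_c, v_c, a_c, euler):
    plan = plan_net(scenario)                   # 9-D plan (dp_p, v_p, a_p)
    dp_p, v_p, a_p = plan[:, 0:3], plan[:, 3:6], plan[:, 6:9]
    e_p, e_v, e_a = dp_p - dp_c, v_p - v_c, a_p - a_c
    ctrl_in = torch.cat([e_p, e_v, e_a, euler], dim=1)   # 12-D controller input
    return ctrl_net(ctrl_in)                    # phi_p, theta_p, mu_p
```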

RL to improve the performance of Imitation Learning

Reward Function:

(1) Negative reward term:

\[R_{neg}(t) = - \left( w_w \Vert w(t) \Vert + w_\alpha \bigg\Vert \frac{dw(t)}{dt} \bigg\Vert + w_j \bigg\Vert \frac{da(t)}{dt} \bigg\Vert \right) \Delta t + C,\]

where \(w(t)\) and \(a(t)\) are the angular velocity and linear acceleration, and \(C\) is the collision penalty: if the drone collides with anything (e.g., a wall or the ground), \(C\) is set to \(10^9\).

(2) Positive reward term: a positive reward is given when the drone is close to the center of the gap

\[R_{pos}(t) = (w_r \max (0, d_a - \Vert p_c - p(t) \Vert))\Delta t + S,\]

where \(d_a\) is the activation distance of the positive reward and \(S\) is a one-time bonus granted the first time the UAV receives a positive reward.
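
For concreteness, a small NumPy sketch of both reward terms follows; the weights, the activation distance \(d_a\), and the penalty/bonus magnitudes are placeholders, and the collision penalty is applied with a negative sign so that a crash lowers the return, which is an interpretation of its description as a penalty.

```python
# NumPy sketch of the two reward terms. All weights, the activation distance
# d_a, and the penalty/bonus magnitudes are placeholders; the collision
# penalty is subtracted so that a crash lowers the return (an interpretation
# of C being a penalty of magnitude 1e9).
import numpy as np

def negative_reward(w, w_dot, a_dot, dt, collided,
                    w_w=1.0, w_alpha=1.0, w_j=1.0, collision_penalty=1e9):
    smoothness = (w_w * np.linalg.norm(w)
                  + w_alpha * np.linalg.norm(w_dot)
                  + w_j * np.linalg.norm(a_dot)) * dt
    return -smoothness - (collision_penalty if collided else 0.0)

def positive_reward(p, p_center, dt, first_time,
                    w_r=1.0, d_a=0.5, one_time_bonus=100.0):
    proximity = w_r * max(0.0, d_a - np.linalg.norm(p_center - p)) * dt
    return proximity + (one_time_bonus if first_time else 0.0)
```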