Using SAC for Low-Level UAV Control

Motivation

The UAV platforms are naturally unstable systems for which many different control approaches have been proposed.
The classic and modern control algorithms require knowledge of the robot's dynamics.
- In fact, the dynamics model is non-linear and may be inaccurate due to the model’s inability to capture all aspects of the vehicle’s dynamic behavior.
Also, the traditional methods may be insufficient to cope with changing conditions, unforeseen situations, and complex-stochastic environments required for the new generation of UAVs.
As the robots employed become more complex or are naturally unstable, like humanoids or drones, the harder is. If quicker modeling is wanted, it would be more beneficial to learn control policies.
Model-free RL has been successfully used for controlling drones without any prior knowledge of the robot model.
High-level control:
- Navigation.
- Autonomous landing.
- Target tracking.
Low-level control:
- MPC-GPS
  - Computationally expensive.
- GymFC
  - Compared PID with PPO, TRPO and DDPG.
  - Focusing only on the propellers’ thrust and the agent’s angular velocities.
- Hwangbo
  - Use a PD controller to help the training phase.
  - Employ a model-free deterministic policy gradient approach, which requires an expensive exploration strateegy.
- PPO for position control
  - A stochastic policy for UAV control.
  - However, on-policy methods are still sampling inefficient than off-policy methods such as SAC.

Present a framework to train the SAC algorithm to low-level control of a quadrotor in a go-to-target task.
SAC can not only learn a robust policy, but it can cope with unseen scenarios.
- Better sample efficiency.
- Robustness and generalization capabilities.
Video.
- Training go-to-target task.
- Tracking a trajectory.
- Random start.
Code.

Actor-critic method.
Off-policy that allows reusing previously collected data.
Maximum entropy for stability and exploration.
- Add an entropy bonus to be maximized through the trajectory, encouraging exploration.
Excellent convergence property, needing fewer samples to reach good policies and finding policies with a higher reward.

\[J(\pi) = \sum_{t=0}^T \mathbb{E}_{(s_t,a_t) \sim \tau} \left[ r(s_t,a_t) + \alpha \mathbb{H}(\pi(\cdot \vert s_t)) \right],\]

where \(\alpha\) controls the optimal policy stochasticity, and the entropy is measured by:

\[\mathbb{H}(P) = \mathbb{E}_{x\sim P}[-\log P(x)].\]

Drone dropped in the air in different positions and orientations.
Formulate the problem as a sequential decision-making problem under an RL framework.

State information given by the simulator.
- In real world, the agent observes the environment through its sensors.
Relative position: \((x,y,z)\).
Relative orientation: \((\phi, \theta, \psi)\).
Relative linear velocities: \((\dot{x}, \dot{y}, \dot{z})\).
Relative angular velocities: \((\dot{\phi}, \dot{\theta}, \dot{\psi})\).
Rotation matrix: \((R_{11}, R_{12}, R_{13}, R_{21}, R_{22}, R_{23}, R_{31}, R_{32}, R_{33})\).
- Although the rotation matrix has some redundant information about the agent’s state, it does not contain discontinuities, and it helps prevent perceptual aliasing by removing similar representations for distinct states.
The actions taken in the previous step \(a_{t-1}^n\) for all \(n\) motors.
- Added to represent the system dependency on the last action and to help it to infer higher-order models.

Map the action to the propeller thrust force by¹:

\[Tr(pwm) = 1.5618\times 10^{-4}\times pwm^2 + 1.0395\times 10^{-2} \times pwm + 0.13894.\]

Using positions, linear velocities, orientation, and angular velocities.
Death penalty: high penalty and episode termination if the agent gets too far from the target position (6.5m).
- It is important especially at the beginning of the training phase when the drone practically just fall in the ground.
Alive bonus: a bonus given for each time step the drone is still inside the radius of the interest.
Consider the success of getting close to the target and robustness and stability.
- Distance reward.
- Zeroing the angular velocity when the drone is at the target location.

\[r_t(s) = r_{alive} - 1.0 \Vert \epsilon_t(s) \Vert - 0.05 \Vert \dot{\phi} \Vert - 0.05 \Vert\dot{\theta}\Vert - 0.1 \Vert\dot{\psi}\Vert,\]

where

\(r_{alive} = 1.5\) is a constant, used to assure the drone earns a reward for flying inside a limited region.
- This term helps to improve sample efficiency and the training speed.
\(\epsilon_t\) is the position error, euclidean distance between the target position and the position at timestep \(t\).
Apply a higher penalty to the \(\dot{\psi}\) since it was the angular velocity component mostly responsible for the vibration (震动) in the drone.

Different start poses.
\([x,y] \in (-1.5,-1.0,-0.5,0.0,1.0,1.5)\).
\(z \in (1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2)\).
\([\phi,\theta,\psi] \in (-44.69,-36.1,-26.93,-9.17,0.0,9.17,26.93,36.1,44.69)\).

Go-to-target task.
Tracking a trajectory with different velocities.
- a straight line.
- a square.
- a sinusoidal.
Robustness test:
- Harsh untrained initial conditions.

G. M. Barros and E. L. Colombini, “Using Soft Actor-Critic for Low-Level UAV Control,” Oct. 2020.

Andres Hernandez, Cosmin Copot, Robin De Keyser, Tudor Vlas, and Ioan Nascu. Identification and path following control of an ar. drone quadrotor. In System Theory, Control and Computing (ICSTCC), 2013 17th International Conference, pages 583–588. IEEE, 2013. ↩