Using SAC for Low-Level UAV Control
Motivation
- UAV platforms are naturally unstable systems for which many different control approaches have been proposed.
- Classic and modern control algorithms require knowledge of the robot's dynamics.
  - In practice, the dynamics model is non-linear and may be inaccurate, since it cannot capture all aspects of the vehicle's dynamic behavior.
- Traditional methods may also be insufficient to cope with the changing conditions, unforeseen situations, and complex stochastic environments required for the new generation of UAVs.
- The more complex or naturally unstable the robot (e.g., humanoids or drones), the harder it is to model. When faster development is desired, it is more beneficial to learn control policies directly.
- Model-free RL has been successfully used for controlling drones without any prior knowledge of the robot model.
- High-level control:
- Navigation.
- Autonomous landing.
- Target tracking.
- Low-level control:
- MPC-GPS
- Computationally expensive.
- GymFC
- Compared PID with PPO, TRPO and DDPG.
- Focusing only on the propellers’ thrust and the agent’s angular velocities.
- Hwangbo et al.
  - Use a PD controller to help the training phase.
  - Employ a model-free deterministic policy gradient approach, which requires an expensive exploration strategy.
- PPO for position control
  - A stochastic policy for UAV control.
  - However, on-policy methods are still less sample-efficient than off-policy methods such as SAC.
Contributions
- Present a framework to train the SAC algorithm for low-level control of a quadrotor in a go-to-target task.
- SAC can not only learn a robust policy, but it can also cope with unseen scenarios.
- Better sample efficiency.
- Robustness and generalization capabilities.
- Video.
- Training go-to-target task.
- Tracking a trajectory.
- Random start.
- Code.
Method
Soft Actor-Critic (SAC)
- Actor-critic method.
- Off-policy: allows reusing previously collected data.
- Maximum entropy for stability and exploration.
  - Adds an entropy bonus to be maximized over the trajectory, encouraging exploration.
- Excellent convergence properties, needing fewer samples to reach good policies and finding policies with a higher reward.
- The maximum entropy objective augments the expected return:
\[\pi^{*} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t,a_t)\sim\rho_{\pi}}\big[r(s_t,a_t) + \alpha\,\mathbb{H}(\pi(\cdot\mid s_t))\big],\]
where \(\alpha\) controls the optimal policy stochasticity, and the entropy is measured by:
\[\mathbb{H}(P) = \mathbb{E}_{x\sim P}[-\log P(x)].\]
Problem Formulation
- The drone is dropped in the air at different positions and orientations.
- Formulate the problem as a sequential decision-making problem under an RL framework.
State
- State information given by the simulator.
- In the real world, the agent would observe the environment through its sensors.
- Relative position: \((x,y,z)\).
- Relative orientation: \((\phi, \theta, \psi)\).
- Relative linear velocities: \((\dot{x}, \dot{y}, \dot{z})\).
- Relative angular velocities: \((\dot{\phi}, \dot{\theta}, \dot{\psi})\).
- Rotation matrix: \((R_{11}, R_{12}, R_{13}, R_{21}, R_{22}, R_{23}, R_{31}, R_{32}, R_{33})\).
  - Although the rotation matrix carries some redundant information about the agent's state, it does not contain discontinuities, and it helps prevent perceptual aliasing by removing similar representations for distinct states.
- The actions taken in the previous step, \(a_{t-1}^n\), for all \(n\) motors.
  - Added to represent the system's dependency on the last action and to help it infer higher-order models.
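Concatenating the components listed above gives a 25-dimensional observation (3 + 3 + 3 + 3 + 9 + 4). A minimal sketch of assembling it (the function name is mine, not from the paper):

```python
import numpy as np

def build_state(rel_pos, rel_euler, lin_vel, ang_vel, rot_matrix, prev_action):
    """Concatenate the observation components listed above into one vector.

    rel_pos, rel_euler, lin_vel, ang_vel: length-3 sequences.
    rot_matrix: 3x3 rotation matrix, flattened row-wise.
    prev_action: the 4 motor commands from the previous step.
    """
    return np.concatenate([
        np.asarray(rel_pos, dtype=float),
        np.asarray(rel_euler, dtype=float),
        np.asarray(lin_vel, dtype=float),
        np.asarray(ang_vel, dtype=float),
        np.asarray(rot_matrix, dtype=float).reshape(9),
        np.asarray(prev_action, dtype=float),
    ])

# Hovering at the target with motors idle: 3 + 3 + 3 + 3 + 9 + 4 = 25 entries.
s = build_state([0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], np.eye(3), [0, 0, 0, 0])
```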
Action
- \([a_1,a_2,a_3,a_4]\): PWM values of each motor.
- Range: [-100, 100].
- Map the action to the propeller thrust force by¹:
\[Tr(pwm) = 1.5618\times 10^{-4}\times pwm^2 + 1.0395\times 10^{-2}\times pwm + 0.13894.\]
Reward
- Using positions, linear velocities, orientation, and angular velocities.
- Death penalty: a high penalty and episode termination if the agent gets too far from the target position (6.5 m).
  - Especially important at the beginning of training, when the drone practically just falls to the ground.
- Alive bonus: a bonus given for each time step the drone stays inside the radius of interest.
- The reward considers both the success of getting close to the target and robustness and stability:
  - Distance reward.
  - Zeroing the angular velocity when the drone is at the target location.
where
- \(r_{alive} = 1.5\) is a constant, used to ensure the drone earns a reward for flying inside a limited region.
  - This term helps improve sample efficiency and training speed.
- \(\epsilon_t\) is the position error, the Euclidean distance between the target position and the drone's position at timestep \(t\).
- Apply a higher penalty to \(\dot{\psi}\), since it was the angular velocity component mostly responsible for vibration in the drone.
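These notes give the reward's structure but not its exact coefficients, so the sketch below uses placeholder weights; only \(r_{alive} = 1.5\) and the 6.5 m termination radius come from the text, and the extra-weighted \(\dot{\psi}\) penalty reflects the bullet above:

```python
# Placeholder weights -- only R_ALIVE and the 6.5 m cutoff are from the notes.
R_ALIVE = 1.5
DEATH_PENALTY = -10.0   # assumed magnitude
W_DIST = 1.0            # assumed distance-penalty weight
W_ANG = 0.05            # assumed weight on phi_dot, theta_dot
W_YAW_RATE = 0.5        # assumed (heavier) weight on psi_dot

def reward(pos_error, ang_vel, done_radius=6.5):
    """Return (reward, done) for one timestep.

    pos_error: Euclidean distance to the target (epsilon_t).
    ang_vel: (phi_dot, theta_dot, psi_dot) angular velocities.
    """
    if pos_error > done_radius:  # death penalty: terminate far from the target
        return DEATH_PENALTY, True
    p, q, r = ang_vel
    ang_cost = W_ANG * (abs(p) + abs(q)) + W_YAW_RATE * abs(r)
    return R_ALIVE - W_DIST * pos_error - ang_cost, False
```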
Simulation
- CoppeliaSim simulator.
- PyRep: a C++ plugin with a Python wrapper that serves as the simulator API.
  - Speeds up the simulation by 20x.
- Environment: corresponds to the MDP and the general parameters modeled for each experiment.
- Parrot AR.Drone 2.0 model: returns the agent's sensor readings.
- Dimensions.
- Mass.
- Moments of inertia.
- Velocity-thrust function.
- Turn off the motors' internal PID controllers.
- Batch size: 4000.
- Buffer size: 1000000.
- Discount factor \(\gamma\): 0.99.
- Learning rate: 0.0001.
- Actor network: (64, 64), with tanh as the activation function.
- Value network: (256, 256), with ReLU.
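A numpy sketch of the two network shapes above (random weights and forward pass only; not the authors' implementation, and the 25-dimensional observation size is my count of the state components listed earlier):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random (weight, bias) pairs for a fully connected net with the given layer sizes."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x, act):
    """Apply `act` after every layer except the linear output layer."""
    for W, b in layers[:-1]:
        x = act(x @ W + b)
    W, b = layers[-1]
    return x @ W + b

obs_dim, act_dim = 25, 4                 # obs_dim assumed from the state list above
actor = mlp([obs_dim, 64, 64, act_dim])  # (64, 64) hidden layers, tanh
value = mlp([obs_dim, 256, 256, 1])      # (256, 256) hidden layers, ReLU

x = np.zeros(obs_dim)
a = forward(actor, x, np.tanh)                       # 4 motor outputs
v = forward(value, x, lambda z: np.maximum(z, 0.0))  # scalar value
```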
Start State
- Different start poses.
- \([x,y] \in (-1.5,-1.0,-0.5,0.0,1.0,1.5)\).
- \(z \in (1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2)\).
- \([\phi,\theta,\psi] \in (-44.69,-36.1,-26.93,-9.17,0.0,9.17,26.93,36.1,44.69)\).
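A small sketch of drawing an initial pose from these discrete grids (uniform sampling is my assumption; angles in degrees):

```python
import random

# Discrete start-pose grids from the notes above.
XY = (-1.5, -1.0, -0.5, 0.0, 1.0, 1.5)
Z = (1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2)
ANGLES = (-44.69, -36.1, -26.93, -9.17, 0.0, 9.17, 26.93, 36.1, 44.69)

def sample_start_pose(rng=random):
    """Draw one start pose ([x, y, z], [phi, theta, psi]) from the grids above."""
    pos = [rng.choice(XY), rng.choice(XY), rng.choice(Z)]
    ori = [rng.choice(ANGLES) for _ in range(3)]
    return pos, ori

pos, ori = sample_start_pose()
```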
Target State
- Position: \([x,y,z] = [0,0,1.7]\).
- Orientation: \([\phi, \theta, \psi] = [0,0,0]\).
Evaluation
- Go-to-target task.
- Tracking a trajectory with different velocities.
- a straight line.
- a square.
- a sinusoid.
- Robustness test:
- Harsh untrained initial conditions.
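The notes name the three reference-trajectory shapes but not their parameters; a hedged sketch with placeholder speed, amplitude, and height (the target hover height of 1.7 m is from the Target State section):

```python
import math

def line(t, speed=0.5):
    """Straight line along x at constant height (speed assumed)."""
    return (speed * t, 0.0, 1.7)

def square(t, side=1.0, speed=0.5):
    """Perimeter of a square of the given side, traversed at constant speed."""
    s = (speed * t) % (4 * side)
    edge, d = int(s // side), s % side
    corners = [(0, 0), (side, 0), (side, side), (0, side)]
    x0, y0 = corners[edge]
    x1, y1 = corners[(edge + 1) % 4]
    return (x0 + (x1 - x0) * d / side, y0 + (y1 - y0) * d / side, 1.7)

def sinusoid(t, amp=0.5, freq=0.5, speed=0.5):
    """Sinusoidal sweep in y while advancing in x (all parameters assumed)."""
    return (speed * t, amp * math.sin(2 * math.pi * freq * t), 1.7)
```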
References
- G. M. Barros and E. L. Colombini, “Using Soft Actor-Critic for Low-Level UAV Control,” Oct. 2020.
- A. Hernandez, C. Copot, R. De Keyser, T. Vlas, and I. Nascu, "Identification and path following control of an AR.Drone quadrotor," in Proc. 17th International Conference on System Theory, Control and Computing (ICSTCC), pp. 583–588, IEEE, 2013. ↩