Motivation

The advantage of reinforcement learning over optimization approaches and guided policy search methods is that it needs no predefined controller structure, which would both limit the agent's performance and cost additional human effort.

Although reinforcement learning has been used in robotics for many decades, it has largely been confined to higher-level decisions rather than low-level actuator commands.

Contributions

  • Map states directly to actuator commands (rotor thrusts)
  • Present a new learning algorithm built on deterministic policy optimization using natural gradient descent
    • value estimates from on-policy samples have lower variance
    • a deterministic policy lets the policy gradient be written in a simpler form
    • a stochastic policy can lead to poor and unpredictable performance
  • Demonstrate the trained policy both in simulation and on a real quadrotor
  • The computation time of the policy is only \(7\,\mu s\) per time step

Method

Network Structure

  • Value network
    • Input: State, an 18-dimensional vector (rotation matrix, position, linear velocity, angular velocity)
      • 9D rotation matrix: represents the rotation; a quaternion has the pitfall that two values represent the same rotation (\(q\) and \(-q\)), as the sketch after this list verifies
      • 3D position error (position minus target position)
      • 3D linear velocity
      • 3D angular velocity
    • Output: Value function
  • Policy network
    • Input: State, an 18-dimensional vector (rotation matrix, position, linear velocity, angular velocity)
    • Output: Action, 4-dimensional vector (rotor thrust)
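
The double-cover pitfall is easy to check numerically: \(q\) and \(-q\) map to the same rotation matrix, so the matrix is an unambiguous input representation. A minimal standalone sketch (the conversion below is the standard formula, not the quatToRotMat helper from the paper's code):

#include <Eigen/Dense>
#include <iostream>

// Standard (w, x, y, z) quaternion to rotation-matrix formula.
Eigen::Matrix3d rotMatFromQuat(const Eigen::Vector4d &q) {
  const double w = q(0), x = q(1), y = q(2), z = q(3);
  Eigen::Matrix3d R;
  R << 1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
       2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
       2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y);
  return R;
}

int main() {
  Eigen::Vector4d q = Eigen::Vector4d::Random().normalized();
  // q and -q encode the same rotation, so this difference is ~0:
  std::cout << (rotMatFromQuat(q) - rotMatFromQuat(-q)).norm() << std::endl;
}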

UAV Model for Simulation

\[J^T T = M a + h,\]

where,

  • \(J\) is the stack of Jacobian matrices of the rotor centers.
  • \(T\) is the vector of thrust forces.
  • \(M\) is the inertia matrix.
  • \(a\) is the generalized acceleration.
  • \(h\) is the Coriolis and gravity effect.

The model ignores all drag forces acting on the body and uses a simple floating-body model with four thrust forces acting on it.
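
Rearranging the equation of motion gives the generalized acceleration directly, \(a = M^{-1}(J^T T - h)\). A minimal sketch of this update (it treats each thrust as a scalar along its rotor axis, making \(J\) a \(4 \times 6\) stack; the dimensions and names are assumptions for illustration, not the paper's code):

#include <Eigen/Dense>

// a = M^{-1} (J^T T - h): generalized acceleration of the floating body.
Eigen::Matrix<double, 6, 1> generalizedAccel(
    const Eigen::Matrix<double, 4, 6> &J,   // stacked rotor Jacobians
    const Eigen::Vector4d &T,               // rotor thrusts
    const Eigen::Matrix<double, 6, 6> &M,   // inertia matrix
    const Eigen::Matrix<double, 6, 1> &h) { // Coriolis + gravity terms
  return M.ldlt().solve(J.transpose() * T - h);  // solves M a = J^T T - h
}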

Simulation Process

The output motor thrusts have two parts; the sum of the two controllers is used as the command:

  • A simple Proportional and Derivative (PD) controller for attitude with low gains,
  • The learning policy.

Why use a PD controller?

  • The simulation sometimes becomes unstable when the angular velocity gets very high.
  • This is because the algorithm initially explores a large region of the state space where we cannot simulate accurately.
  • The gradient observed when the quadrotor is upside-down is very small and discontinuous, so gradient-based algorithms take a very long time to learn.
  • The PD controller alone is insufficient and incurs very high costs.
  • The quadrotor simply flies away, since it is initialized with a high velocity.

The PD controller is used to stabilize the learning process, but the final controller does not rely on it: the learned policy manifests much more sophisticated behaviors.

PD controller:

\[\tau_b = k_p R^T q_{euler} + k_d R^T w,\]

where,

  • \(\tau_b\) is the virtual torque produced on the main body as a result of the thrust forces.
  • \(k_p,k_d\) are the controller gains.
  • \(R\) is the rotation matrix.
  • \(q_{euler}\) is the orientation of the UAV expressed as a rotation (Euler) vector.
  • \(w\) is the angular velocity.
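
A direct transcription of this control law as a sketch (the rotation vector \(q_{euler}\) is recovered from the unit quaternion as angle times axis, using the standard \(\sin(\theta/2)\) normalization; note the released code below divides by \(\sin\theta\) instead):

#include <Eigen/Dense>
#include <cmath>

// tau_b = kp * R^T * q_euler + kd * R^T * w, where q_euler = angle * axis
// is the rotation vector of the unit quaternion (w, x, y, z).
Eigen::Vector3d pdTorque(const Eigen::Vector4d &quat,
                         const Eigen::Matrix3d &R,
                         const Eigen::Vector3d &angVel,
                         double kp, double kd) {
  const double angle = 2.0 * std::acos(quat(0));  // rotation angle
  Eigen::Vector3d qEuler = Eigen::Vector3d::Zero();
  if (angle > 1e-6)
    qEuler = angle * quat.tail(3) / std::sin(angle / 2.0);  // angle * axis
  return kp * (R.transpose() * qEuler) + kd * (R.transpose() * angVel);
}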

Algorithm

Value Function Training

\[v_i = \sum_{t=i}^{T-1}\gamma^{t-i}r_t^p + \gamma^{T-i}V(s_T \vert \eta),\]

where \(\eta\) denotes the parameters of the approximated value function and \(T\) is the length of the trajectory.
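
Each target \(v_i\) is therefore the discounted sum of on-policy rewards over the remainder of the trajectory, bootstrapped with the current value estimate at the terminal state, and all targets can be computed in one backward pass. A minimal sketch (names are illustrative):

#include <Eigen/Dense>

// v_i = sum_{t=i}^{T-1} gamma^{t-i} r_t + gamma^{T-i} V(s_T | eta),
// computed for every i with a single backward recursion.
Eigen::VectorXd valueTargets(const Eigen::VectorXd &rewards,  // r_0 .. r_{T-1}
                             double terminalValue,            // V(s_T | eta)
                             double gamma) {
  const int T = static_cast<int>(rewards.size());
  Eigen::VectorXd v(T);
  double acc = terminalValue;
  for (int i = T - 1; i >= 0; --i) {
    acc = rewards(i) + gamma * acc;  // v_i = r_i + gamma * v_{i+1}
    v(i) = acc;
  }
  return v;
}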

Policy Optimization

Reward Function

\[r_t = 4 \times 10^{-3} \Vert p_t \Vert + 2 \times 10^{-4} \Vert a_t \Vert + 3 \times 10^{-4} \Vert \omega_t \Vert + 5 \times 10^{-4} \Vert v_t \Vert,\]

where \(p_t\), \(a_t\), \(\omega_t\) and \(v_t\) are the position, action (motor thrust), angular velocities and linear velocities, respectively.
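
Despite being written as \(r_t\), this is a cost that the agent minimizes. A one-to-one transcription as a sketch (function and argument names are illustrative); note that the step() function further below uses somewhat different coefficients and a square root on the position term:

#include <Eigen/Dense>

// Per-step cost: weighted magnitudes of position error, action,
// angular velocity and linear velocity (weights from the equation above).
double stepCost(const Eigen::Vector3d &p, const Eigen::Vector4d &a,
                const Eigen::Vector3d &w, const Eigen::Vector3d &v) {
  return 4e-3 * p.norm() + 2e-4 * a.norm() +
         3e-4 * w.norm() + 5e-4 * v.norm();
}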

Code

Some of the code:

Network forward function

Learned policy \(+\) PD controller \(\to\) Thrust \(\to\) Rotor speed.

The output of the PD controller is \((\tau_\phi, \tau_\theta, \tau_\psi)\); the rotor thrusts can then be obtained from:

\[\left[ \begin{matrix} \tau_\phi \\ \tau_\theta \\ \tau_\psi \\ mg \end{matrix}\right] = \left[ \begin{matrix} 0 & 0 & d & -d \\ -d & d & 0 & 0 \\ \frac{C_M}{C_T} & \frac{C_M}{C_T} & - \frac{C_M}{C_T} & - \frac{C_M}{C_T} \\ 1 & 1 & 1 & 1 \end{matrix}\right] \left[ \begin{matrix} T_1 \\ T_2 \\ T_3 \\ T_4 \end{matrix}\right].\]
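
Solving for the rotor thrusts amounts to inverting this fixed mixing matrix once; this is what transsThrust2GenForceInv precomputes in the code below. A sketch (d is the arm length and cMcT the ratio \(C_M/C_T\); names are illustrative):

#include <Eigen/Dense>

// Invert the fixed mixing matrix to recover the four rotor thrusts from
// the generalized force (tau_phi, tau_theta, tau_psi, total thrust mg).
Eigen::Vector4d genForceToThrust(const Eigen::Vector4d &genForce,
                                 double d,       // arm length
                                 double cMcT) {  // C_M / C_T
  Eigen::Matrix4d mix;
  mix <<    0,     0,     d,    -d,
           -d,     d,     0,     0,
         cMcT,  cMcT, -cMcT, -cMcT,
            1,     1,     1,     1;
  return mix.inverse() * genForce;  // T_1 .. T_4
}
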
void forward(Eigen::Vector4d &quat,
             Eigen::Vector3d &position,
             Eigen::Vector3d &angVel,
             Eigen::Vector3d &linVel,
             Eigen::Vector3d &targetPosition,
             Eigen::Vector4d &action) {

  // Assemble the 18-D state: rotation matrix columns, scaled position
  // error, angular velocity and linear velocity.
  Eigen::Matrix3d R;
  quatToRotMat(quat, R);
  Eigen::Matrix<double, 18, 1> state;
  state << R.col(0), R.col(1), R.col(2),
      (position - targetPosition + positionOffset) * positionScale_,
      angVel * angVelScale_,
      linVel * linVelScale_;

  // Policy network: two 64-unit tanh hidden layers, linear output.
  Eigen::Matrix<double, 64, 1> l1o = (w1 * state + b1).array().tanh().matrix();
  Eigen::Matrix<double, 64, 1> l2o = (w2 * l1o + b2).array().tanh().matrix();
  action = w3 * l2o + b3;

  // PD attitude feedback with low gains (angle-axis error in body frame).
  double angle = 2.0 * std::acos(quat(0));
  double kp_rot = -0.2, kd_rot = -0.06;

  Eigen::Vector3d B_torque;
  if (angle > 1e-6)
    B_torque = kp_rot * angle * (R.transpose() * quat.tail(3))
        / std::sin(angle) + kd_rot * (R.transpose() * angVel);
  else
    B_torque = kd_rot * (R.transpose() * angVel);
  B_torque(2) *= 0.15;  // lower yaw gain

  // PD torques plus hover thrust, mapped to rotor thrusts, plus the
  // (scaled) policy action; thrusts are clipped to stay positive.
  Eigen::Vector4d genForce;
  genForce << B_torque, mass_ * 9.81;
  Eigen::Vector4d thrust = transsThrust2GenForceInv * genForce + 2.0 * action;
  thrust = thrust.cwiseMax(1e-6);

  // Rotor speed from thrust: T = c_T * speed^2 with c_T = 8.5486e-6.
  action = (thrust.array() / 8.5486e-6).sqrt();
}
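
A hypothetical call site, to make the in/out convention explicit: action receives the policy's raw output internally but is overwritten with the commanded rotor speeds before the function returns (all values here are made up for illustration):

Eigen::Vector4d quat(1.0, 0.0, 0.0, 0.0);       // identity orientation
Eigen::Vector3d position(0.0, 0.0, 1.0);
Eigen::Vector3d angVel(0.0, 0.0, 0.0), linVel(0.0, 0.0, 0.0);
Eigen::Vector3d targetPosition(0.0, 0.0, 1.5);  // hover 0.5 m above
Eigen::Vector4d action;                         // filled by forward()
forward(quat, position, angVel, linVel, targetPosition, action);
// action now holds the four commanded rotor speeds.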

Step and Reward function

void step(const Action &action_t,
          State &state_tp1,
          TerminationType &termType,
          Dtype &costOUT) {

  // Current state from q_ (quaternion + position) and u_ (velocities).
  orientation = q_.head(4);                        // quadrotor orientation
  double angle = 2.0 * std::acos(q_(0));           // rotation angle
  position = q_.tail(3);                           // quadrotor position
  R_ = Math::MathFunc::quatToRotMat(orientation);  // rotation matrix

  v_I_ = u_.tail(3);             // linear velocity
  w_I_ = u_.head(3);             // angular velocity
  w_B_ = R_.transpose() * w_I_;  // body rates

  // Control input from the learned action.
  Action actionGenForce = transsThrust2GenForce * actionScale_ * action_t;
  B_torque = actionGenForce.segment(0, 3);  // moment input
  B_force << 0.0, 0.0, actionGenForce(3);   // thrust vector in {B}

  // Control input from PD stabilization.
  double kp_rot = -0.2, kd_rot = -0.06;
  Torque fbTorque_b;
  if (angle > 1e-6)
    fbTorque_b = kp_rot * angle * (R_.transpose() * orientation.tail(3))
        / std::sin(angle) + kd_rot * (R_.transpose() * u_.head(3));
  else
    fbTorque_b = kd_rot * (R_.transpose() * u_.head(3));
  fbTorque_b(2) *= 0.15;  // lower yaw gain

  B_torque += fbTorque_b;      // sum of torque inputs
  B_force(2) += mass_ * 9.81;  // sum of thrust inputs

  // Clip inputs: convert to per-rotor values (not exactly thrusts),
  // enforce a minimum, then convert back to generalized force.
  Action genForce;
  genForce << B_torque, B_force(2);
  Action thrust = transsThrust2GenForceInv * genForce;
  thrust = thrust.array().cwiseMax(1e-8);
  genForce = transsThrust2GenForce * thrust;
  B_torque = genForce.segment(0, 3);
  B_force(2) = genForce(3);

  // Rigid-body dynamics: linear and angular acceleration from the inputs.
  du_.tail(3) = (R_ * B_force) / mass_ + gravity_;
  du_.head(3) = R_ * (inertiaInv_ * (B_torque - w_B_.cross(inertia_ * w_B_)));

  // Integrate states (semi-implicit Euler; quaternion updated via boxplus).
  u_ += du_ * this->controlUpdate_dt_;
  w_I_ = u_.head(3);
  w_IXdt_ = w_I_ * this->controlUpdate_dt_;
  orientation = Math::MathFunc::boxplusI_Frame(orientation, w_IXdt_);
  Math::MathFunc::normalizeQuat(orientation);
  q_.head(4) = orientation;
  q_.tail(3) += u_.tail(3) * this->controlUpdate_dt_;
  w_B_ = R_.transpose() * u_.head(3);

  if (std::isnan(orientation.norm())) {
    std::cout << "u_ " << u_.transpose() << std::endl;
    std::cout << "du_ " << du_.transpose() << std::endl;
    std::cout << "B_torque " << B_torque.transpose() << std::endl;
    std::cout << "orientation " << orientation.transpose() << std::endl;
    std::cout << "state_tp1 " << q_.transpose() << std::endl;
    std::cout << "action_t " << action_t.transpose() << std::endl;
  }

  // Clip angular velocities to +-20 and linear velocities to +-5.
  u_(0) = clip(u_(0), -20.0, 20.0);
  u_(1) = clip(u_(1), -20.0, 20.0);
  u_(2) = clip(u_(2), -20.0, 20.0);
  u_(3) = clip(u_(3), -5.0, 5.0);
  u_(4) = clip(u_(4), -5.0, 5.0);
  u_(5) = clip(u_(5), -5.0, 5.0);

  getState(state_tp1);

  // Cost: position, actuation, angular- and linear-velocity terms.
  costOUT = 0.004 * std::sqrt(q_.tail(3).norm()) +
      0.00005 * action_t.norm() +
      0.00005 * u_.head(3).norm() +
      0.00005 * u_.tail(3).norm();

  // visualization
  if (this->visualization_ON_) {
    updateVisualizationFrames();
    visualizer_.drawWorld(visualizeFrame, position, orientation);
    double waitTime = std::max(0.0, this->controlUpdate_dt_ / realTimeRatio
        - watch.measure("sim", true));
    watch.start("sim");
    usleep(waitTime * 1e6);
  }
}

void stepSim(const Action &action_t) {
  // Here the action is rotor speeds: thrust T = c_T * speed^2, c_T = 8.5486e-6.
  Action thrust = action_t.array().square() * 8.5486e-6;
  thrust = thrust.array().cwiseMax(1e-8);  // enforce a minimum thrust
  Eigen::Matrix<Dtype, 4, 1> genforce = transsThrust2GenForce * thrust;
  B_torque = genforce.segment(0, 3);
  B_force << 0.0, 0.0, genforce(3);

  orientation = q_.head(4);
  position = q_.tail(3);
  R_ = Math::MathFunc::quatToRotMat(orientation);

  // Rigid-body dynamics: compute in body coordinates, transform to world.
  du_.tail(3) = (R_ * B_force) / mass_ + gravity_;
  w_B_ = R_.transpose() * u_.head(3);
  du_.head(3) = R_ * (inertiaInv_ * (B_torque - w_B_.cross(inertia_ * w_B_)));

  // Integrate states (semi-implicit Euler; quaternion updated via boxplus).
  u_ += du_ * this->controlUpdate_dt_;
  w_I_ = u_.head(3);
  w_IXdt_ = w_I_ * this->controlUpdate_dt_;
  orientation = Math::MathFunc::boxplusI_Frame(orientation, w_IXdt_);
  Math::MathFunc::normalizeQuat(orientation);
  q_.head(4) = orientation;
  q_.tail(3) += u_.tail(3) * this->controlUpdate_dt_;
  w_B_ = R_.transpose() * u_.head(3);

  if (std::isnan(orientation.norm())) {
    std::cout << "u_ " << u_.transpose() << std::endl;
    std::cout << "du_ " << du_.transpose() << std::endl;
    std::cout << "B_torque " << B_torque.transpose() << std::endl;
    std::cout << "orientation " << orientation.transpose() << std::endl;
    std::cout << "state_tp1 " << q_.transpose() << std::endl;
    std::cout << "action_t " << action_t.transpose() << std::endl;
  }

  // visualization
  if (this->visualization_ON_) {
    updateVisualizationFrames();
    visualizer_.drawWorld(visualizeFrame, position, orientation);
    double waitTime = std::max(0.0, this->controlUpdate_dt_ / realTimeRatio
        - watch.measure("sim", true));
    watch.start("sim");
    usleep(waitTime * 1e6);
  }
}
