Relation to Model-free Policy Optimization Algorithms

Model-free policy optimization algorithms were described earlier; this section relates GPS to them.

  • Direct Policy Search: learning complex, nonlinear policies with standard policy gradient methods (see the sketch after this list)
    • Requires a huge number of samples and iterations
    • Can be disastrously prone to poor local optima
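
A minimal likelihood-ratio (REINFORCE) gradient ascent on a one-dimensional toy problem illustrates the approach and why it is sample-hungry; the reward function, learning rate, and batch size below are illustrative assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Toy reward with a single peak at a = 2.
    return -(a - 2.0) ** 2

theta, sigma = 0.0, 1.0        # Gaussian policy N(theta, sigma^2)
lr = 0.05

for it in range(500):
    a = rng.normal(theta, sigma, size=64)       # sample actions from the policy
    r = reward(a)
    baseline = r.mean()                         # baseline for variance reduction
    # Likelihood-ratio gradient: grad log pi(a) = (a - theta) / sigma^2.
    grad = np.mean((r - baseline) * (a - theta) / sigma**2)
    theta += lr * grad

print(f"learned mean action: {theta:.2f}")      # approaches 2.0
```

Even this one-parameter problem consumes tens of thousands of samples; with a complex, nonlinear policy the estimator's variance and the multimodal objective make both bullet points above bite much harder.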

Guided Policy Search (GPS)

  • Use Differential Dynamic Programming (DDP) to generate “guiding samples” that assist the policy search by exploring high-reward regions.
  • An importance-sampled variant of the likelihood ratio estimator incorporates these guiding samples directly into the policy search (see the estimator below).
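
In standard self-normalized form, such an estimator can be written as follows (notation assumed here: ζ_i are sampled trajectories, q is the distribution they were drawn from, e.g. a DDP-derived guiding controller, and r(ζ) is the trajectory return):

$$
\hat{J}(\theta) \;=\; \frac{1}{Z(\theta)} \sum_{i=1}^{m} \frac{\pi_\theta(\zeta_i)}{q(\zeta_i)}\, r(\zeta_i),
\qquad
Z(\theta) \;=\; \sum_{i=1}^{m} \frac{\pi_\theta(\zeta_i)}{q(\zeta_i)} .
$$

Because the weights depend only on trajectory probabilities, samples drawn from the guiding controllers can be reused at every policy update without re-running the system.
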
GPS Algorithm Pseudocode
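
The sketch below condenses the GPS outer loop onto a one-dimensional toy problem; it is schematic, not the paper's exact algorithm. The guiding distribution is a hand-picked Gaussian standing in for a DDP solution, per-sample importance weights use each sample's own sampler density (a simplification of the fused estimator), and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def reward(a):
    return -(a - 2.0) ** 2                     # toy objective, optimum at a = 2

def log_gauss(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Step 1: guiding distribution. In GPS this comes from DDP; here a fixed
# Gaussian near the optimum stands in for it.
mu_g, sig_g = 2.0, 0.5
S_a = rng.normal(mu_g, sig_g, 128)             # guiding samples
S_logq = log_gauss(S_a, mu_g, sig_g)           # density under each sample's sampler

theta, sigma, lr = 0.0, 1.0, 0.05              # policy is N(theta, sigma^2)
best_theta, best_J = theta, -np.inf

for k in range(100):
    # Step 2: one ascent step on the self-normalized importance-sampled return.
    logw = log_gauss(S_a, theta, sigma) - S_logq
    w = np.exp(logw - logw.max())
    w /= w.sum()
    r = reward(S_a)
    J = np.sum(w * r)
    theta += lr * np.sum(w * (r - J) * (S_a - theta) / sigma**2)

    # Step 3: append on-policy samples so the estimator tracks pi_theta.
    new_a = rng.normal(theta, sigma, 32)
    S_a = np.concatenate([S_a, new_a])
    S_logq = np.concatenate([S_logq, log_gauss(new_a, theta, sigma)])

    # Step 4: crude bookkeeping of the best policy seen so far.
    if J > best_J:
        best_J, best_theta = J, theta

print(f"best policy mean: {best_theta:.2f}")   # should settle near 2.0
```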

Differential Dynamic Programming (DDP)
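
DDP optimizes a trajectory by alternating a backward pass over a local quadratic model of the cost-to-go with a forward rollout. For linear dynamics and quadratic cost the backward pass reduces to the LQR Riccati recursion, sketched below; a full DDP implementation would additionally linearize nonlinear dynamics around the current nominal trajectory and line-search the forward pass. The dynamics matrices in the example are illustrative.

```python
import numpy as np

def lqr_backward(A, B, Q, R, T):
    """Finite-horizon LQR backward pass: the recursion DDP applies to
    its local linear-quadratic models. Returns feedback gains per step."""
    K = []
    V = Q.copy()                                # terminal value Hessian
    for _ in range(T):
        Quu = R + B.T @ V @ B                   # curvature in the action
        Qux = B.T @ V @ A                       # action-state coupling
        k = -np.linalg.solve(Quu, Qux)          # optimal feedback: u = k @ x
        K.append(k)
        V = Q + A.T @ V @ A + Qux.T @ k         # Riccati value update
    return K[::-1]

# Toy double-integrator example (illustrative numbers).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
gains = lqr_backward(A, B, np.eye(2), np.eye(1), T=50)
print(gains[0])                                 # gain applied at the first step
```

In GPS, running this kind of solver from a few initial states yields controllers whose rollouts supply the guiding samples.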

Importance Sampling
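
The core trick is estimating an expectation under one distribution using samples from another, by reweighting with the density ratio. A minimal, self-contained demonstration (the distributions and target quantity are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_gauss(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Goal: E_pi[x^2] for pi = N(2, 1), using samples drawn only from q = N(0, 1).
x = rng.normal(0.0, 1.0, 100_000)               # samples from q
w = np.exp(log_gauss(x, 2.0, 1.0) - log_gauss(x, 0.0, 1.0))
estimate = np.sum(w * x**2) / np.sum(w)         # self-normalized estimator

print("IS estimate :", estimate)                # close to 5.0
print("ground truth:", 2.0**2 + 1.0)            # E[x^2] = mu^2 + sigma^2
```

The estimator degrades when pi and q overlap poorly (the weights concentrate on a few samples), which is why GPS needs guiding distributions that already cover the high-reward regions the policy should visit.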

References

  • Levine, S. and Koltun, V. “Guided Policy Search.” ICML 2013.