[PhD Thesis Presentation] ‐ Mr. Paavo Parmas – “Total stochastic gradient algorithms and applications to model-based reinforcement learning”
Presenter: Mr. Paavo Parmas
Supervisor: Prof. Kenji Doya
Unit: Neural Computation
Title: Total stochastic gradient algorithms and applications to model-based reinforcement learning
Optimizing via stochastic gradients is a powerful and flexible technique used throughout machine learning, reinforcement learning, control, operations research, and related fields. In many of these applications, the gradients are estimated through a stochastic sampling process, and the learning performance hinges on the accuracy of the estimated gradients. This thesis develops a collection of novel statistical algorithms that improve gradient estimation accuracy. The need for such algorithms was motivated by a model-based reinforcement learning (MBRL) scenario, where I observed that chaotic properties of the dynamics caused the gradient variance to explode when using standard gradient estimation techniques, such as reparameterization gradients. The new techniques sometimes improve the accuracy by a factor of 10^6 or more. The methods rely both on new gradient estimators and on algorithms that take advantage of the graph structure of the computations to combine estimators in a statistically principled way. While the work started as an attempt to solve a specific problem related to MBRL, the proposed solutions are general and applicable to any other stochastic computation graph. The problems with chaos have recently also been observed in other tasks, such as meta-learning and protein folding software, and my solutions may prove useful in those domains as well.
The main contributions are 1) an MBRL framework called PIPPS, which is similar to the PILCO algorithm but lifts all of its restrictions by replacing the cumbersome moment-matching computations with a particle sampling approach, while achieving the same learning performance with no downsides; 2) the total propagation algorithm, a replacement for backpropagation that prevents the exploding gradient problem by combining gradient estimators in the backward pass; 3) the probabilistic computation graph framework, an intuitive visual method for reasoning about total gradients on graphs; 4) new policy gradient estimators derived using the probabilistic computation graph framework; and 5) some theoretical discussion of control variates for gradients, as well as a unified theory of reparameterization and likelihood ratio gradient estimators. The research has so far led to two publications (ICML’18 and NeurIPS’18), and also includes as yet unpublished work. I hope that this work could lead towards new software frameworks that go beyond backpropagation and implement more advanced methods for estimating gradients.
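
As a rough illustration of the two estimator families named above, the following sketch estimates the gradient of E[x^2] with respect to the mean of a Gaussian, once with a reparameterization estimator and once with a likelihood ratio estimator, and then combines them by inverse-variance weighting. This is a toy example, not the thesis algorithm: the objective, the function names (gradient_samples, combine), and the choice of inverse-variance weighting as the combination rule are all assumptions made here for illustration.

```python
import random
import statistics


def gradient_samples(mu, sigma, n, seed=0):
    """Per-sample estimates of d/dmu E[x^2] for x ~ N(mu, sigma^2).

    Hypothetical helper for illustration. The two estimators use
    independent noise draws so that their errors are uncorrelated.
    """
    rng = random.Random(seed)
    rp, lr = [], []
    for _ in range(n):
        # Reparameterization: x = mu + sigma * eps, so dx/dmu = 1
        # and the per-sample gradient is f'(x) = 2 * x.
        x = mu + sigma * rng.gauss(0.0, 1.0)
        rp.append(2.0 * x)

        # Likelihood ratio: f(x) * d log p(x) / dmu,
        # where d log p(x) / dmu = (x - mu) / sigma^2 = eps / sigma.
        eps = rng.gauss(0.0, 1.0)
        x = mu + sigma * eps
        lr.append(x * x * eps / sigma)
    return rp, lr


def combine(rp, lr):
    """Inverse-variance weighted average of the two sample means.

    Assumed combination rule: for independent unbiased estimators,
    weighting each by the other's variance minimizes the variance of
    the weighted sum. Weights are estimated from the same samples,
    which is a simplification.
    """
    var_rp = statistics.variance(rp)
    var_lr = statistics.variance(lr)
    w_rp = var_lr / (var_rp + var_lr)
    estimate = w_rp * statistics.mean(rp) + (1.0 - w_rp) * statistics.mean(lr)
    return estimate, w_rp


if __name__ == "__main__":
    rp, lr = gradient_samples(mu=1.0, sigma=0.5, n=20000)
    estimate, w_rp = combine(rp, lr)
    print("combined estimate:", estimate)
    print("weight on reparameterization estimator:", w_rp)
```

Since E[x^2] = mu^2 + sigma^2, the exact gradient in this toy problem is 2 * mu, which gives a reference value to compare against. In this smooth, low-dimensional setting the reparameterization estimator has the lower variance and receives most of the weight; the abstract's point is that in chaotic dynamics the situation can reverse, which is why a variance-aware combination is useful.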