Analysis of centralized and distributed temporal difference learning
Abstract
Temporal difference learning with linear function approximation is a popular method for obtaining a low-dimensional approximation of the value function of a policy in a Markov Decision Process. This thesis provides an interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a fuller explanation of why temporal difference learning works, this interpretation also yields improved convergence times.
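For reference, the iteration in question is the standard TD(0) update with linear function approximation; the display below is a minimal sketch of that update, assuming a feature map φ, step-size α_t, and discount factor γ (notation not fixed in this abstract). The thesis interprets the expected direction of this update as a splitting of the gradient of an appropriately chosen function, which the abstract does not specify.

\[
\theta_{t+1} = \theta_t + \alpha_t \left( r_t + \gamma\, \phi(s_{t+1})^\top \theta_t - \phi(s_t)^\top \theta_t \right) \phi(s_t)
\]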
This thesis also provides a new non-asymptotic analysis of distributed TD(0) with linear function approximation. The approach relies on "one-shot averaging," where N agents run local copies of TD(0) and average the outcomes only once at the very end. This thesis considers two models: one in which the agents interact with an environment they can observe and whose transitions depend on all of their actions, which is referred to as the global state model; and one in which each agent runs a local copy of an identical Markov Decision Process, which is called the local state model. In the global state model, this thesis shows that the convergence rate of the distributed one-shot averaging method matches the known convergence rate of TD(0), which improves on the state of the art. In the local state model, this thesis demonstrates a version of the linear speedup phenomenon, where the convergence time of the distributed process is a factor of N faster than the convergence time of TD(0). As far as we are aware, this is the first result rigorously showing a benefit from parallelism for temporal difference methods.
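The one-shot averaging scheme itself is simple to state: each agent runs TD(0) independently with no communication, and the N final parameter vectors are averaged in a single round at the end. The Python sketch below illustrates this idea; the trajectory representation, the features function, and all default parameter values are assumptions made for illustration, not notation from the thesis.

```python
import numpy as np

def td0_linear(trajectory, features, alpha=0.05, gamma=0.95):
    """Plain TD(0) with linear function approximation on one sampled trajectory.

    `trajectory` is a list of (state, reward, next_state) tuples and `features`
    maps a state to a 1-D numpy feature vector; both are assumptions of this
    sketch rather than definitions taken from the thesis.
    """
    theta = np.zeros_like(features(trajectory[0][0]), dtype=float)
    for s, r, s_next in trajectory:
        phi, phi_next = features(s), features(s_next)
        # Standard TD(0) step: move theta along the temporal difference error.
        td_error = r + gamma * phi_next @ theta - phi @ theta
        theta += alpha * td_error * phi
    return theta

def one_shot_averaging(trajectories, features, **kwargs):
    """Each of the N agents runs TD(0) locally; parameters are averaged once at the end."""
    local_params = [td0_linear(traj, features, **kwargs) for traj in trajectories]
    # The only communication step: a single average over the N final iterates.
    return np.mean(local_params, axis=0)
```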