Some statistical contributions to deep learning
Abstract
This dissertation makes two statistical contributions to the topic of deep learning. In the first part, we study sparse Bayesian deep recurrent neural networks (RNNs) with applications to language modeling. RNNs for natural language processing tasks typically lead to deep learning models with very large matrix parameters, which are computationally challenging to fit and use. In this work, we develop a Bayesian approach for RNN models that leads to sparse parameter estimates. Our experiments on language modeling show that we can obtain highly sparsified models with only a moderate loss in performance.
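As a point of reference only (the dissertation's specific prior and inference scheme are not spelled out in this abstract), a common route to sparse weight estimates is MAP training under a Laplace prior, which amounts to adding an l1 penalty on the weight matrices. Below is a minimal PyTorch sketch of one training step; the architecture, sizes, and the penalty scale lam are all hypothetical.

    # Minimal sketch: MAP training of an RNN language model under a
    # Laplace prior on the weights, i.e. an l1-penalized likelihood.
    # All sizes, the LSTM choice, and the scale lam are hypothetical.
    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 1000, 64, 128

    class RNNLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens):
            hidden, _ = self.rnn(self.embed(tokens))
            return self.head(hidden)

    model = RNNLM()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    lam = 1e-4  # Laplace prior scale; larger values give sparser weights

    tokens = torch.randint(0, vocab_size, (8, 20))  # toy next-token batch
    opt.zero_grad()
    logits = model(tokens[:, :-1])
    nll = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    # The Laplace log-prior contributes an l1 term to the MAP objective,
    # which drives many weight-matrix entries toward zero.
    l1 = sum(p.abs().sum() for p in model.parameters())
    (nll + lam * l1).backward()
    opt.step()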
The second part of this dissertation revolves around energy-based models (EBMs), parametric statistical models that have been widely studied in applications. EBMs define an unnormalized probability density function through an energy function. A defining feature of EBMs is that their normalizing constants are intractable, which is the main challenge in fitting these models and generating new samples. In this work, we make contributions toward addressing the computational challenges in EBMs. Inspired by Nijkamp et al. (2019), who introduced a class of approximations based on short-run MCMC, we take another look at these generative approximations. We show that when applied to EBMs built from deep neural networks as energy functions, the short-run MCMC framework corresponds to a kernel maximum mean discrepancy (MMD) estimator of the approximating model, using the neural tangent kernel of the deep neural network. We demonstrate that the idea applies broadly, including to fast estimation of high-dimensional Gaussian graphical models under an ℓ1-norm penalty and to filtering problems.
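To fix notation (these are standard definitions, not specific to this dissertation): an EBM with energy function E_θ specifies the density

    \[
    p_\theta(x) = \frac{\exp\{-E_\theta(x)\}}{Z(\theta)},
    \qquad
    Z(\theta) = \int \exp\{-E_\theta(x)\}\,dx,
    \]

where Z(θ) is the intractable normalizing constant. Short-run MCMC in the sense of Nijkamp et al. (2019) draws approximate samples by running only a small, fixed number K of Langevin steps from a fixed initial distribution,

    \[
    x_{k+1} = x_k - \frac{s^2}{2}\,\nabla_x E_\theta(x_k) + s\,\varepsilon_k,
    \qquad
    \varepsilon_k \sim \mathcal{N}(0, I), \quad k = 0, \dots, K-1,
    \]

with step size s, rather than running the chain to convergence.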
In the last part of this dissertation, we take another look at EBMs by exploring two different methods that fit these models approximately. The first uses a minimax scheme, while the second is based on a path sampling representation. Our simulation studies show that the variational path sampling method works well on low-dimensional data but requires further work to scale to higher dimensions, while the minimax scheme does not perform well. We provide related discussions and potential future directions at the end of this dissertation.
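For context, the path sampling representation referred to above is presumably of the classical form due to Gelman and Meng (1998): for a family of densities p_t(x) = exp{-E_t(x)}/Z(t), t ∈ [0, 1], interpolating between a tractable reference (t = 0) and the target (t = 1), the log-ratio of normalizing constants satisfies

    \[
    \log \frac{Z(1)}{Z(0)}
    = -\int_0^1 \mathbb{E}_{p_t}\!\left[\frac{\partial E_t(x)}{\partial t}\right] dt,
    \]

so the intractable log-normalizing constant can be estimated by averaging ∂E_t/∂t over samples drawn along a discretized path in t.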