Statistical methods for certain large, complex data challenges
MetadataShow full item record
Big data concerns large-volume, complex, growing data sets, and it provides us opportunities as well as challenges. This thesis focuses on statistical methods for several specific large, complex data challenges - each involving representation of data with complex format, utilization of complicated information, and/or intensive computational cost. The first problem we work on is hypothesis testing for multilayer network data, motivated by an example in computational biology. We show how to represent the complex structure of a multilayer network as a single data point within the space of supra-Laplacians and then develop a central limit theorem and hypothesis testing theories for multilayer networks in that space. We develop both global and local testing strategies for mean comparison and investigate sample size requirements. The methods were applied to the motivating computational biology example and compared with the classic Gene Set Enrichment Analysis(GSEA). More biological insights are found in this comparison. The second problem is the source detection problem in epidemiology, which is one of the most important issues for control of epidemics. Ideally, we want to locate the sources based on all history data. However, this is often infeasible, because the history data is complex, high-dimensional and cannot be fully observed. Epidemiologists have recognized the crucial role of human mobility as an important proxy to a complete history, but little in the literature to date uses this information for source detection. We recast the source detection problem as identifying a relevant mixture component in a multivariate Gaussian mixture model. Human mobility within a stochastic PDE model is used to calibrate the parameters. The capability of our method is demonstrated in the context of the 2000-2002 cholera outbreak in the KwaZulu-Natal province. The third problem is about multivariate time series imputation, which is a classic problem in statistics. To address the common problem of low signal-to-noise ratio in high-dimensional multivariate time series, we propose models based on state-space models which provide more precise inference of missing values by clustering multivariate time series components in a nonparametric way. The models are suitable for large-scale time series due to their efficient parameter estimation.