Methods for size estimation of hidden population using large-scale health data

Embargo Date
2027-02-04
Abstract
Accurately estimating the size of hidden populations is essential for effective policy-making, but a traditional census is typically not feasible. Data-driven approaches that use existing sources are needed to produce reliable prevalence estimates. Capture-recapture methods, originally used in ecology to estimate population size, have been advanced in epidemiology over the past two decades to improve disease prevalence estimates, and they can relax unrealistic assumptions of naive models. Despite these developments, conventional capture-recapture methods remain difficult to apply when estimating prevalence in subpopulations across spatial units. There is also a need to estimate prevalence in socioeconomically stratified groups in order to identify higher-risk groups and uncover potential health disparities. Conventional approaches rely heavily on stand-alone stratified analysis, which may be less effective when subgroups display similar or more intricate patterns of correlated healthcare engagement. To address these challenges, we first articulated the fundamental concepts behind the capture-recapture method by comparing it to a similar approach, the multiplier benchmark method. We then focused on the capture-recapture structure and proposed a Bayesian hierarchical spatial capture-recapture model that estimates individual detection probabilities and spatial variation in opioid use disorder (OUD) prevalence. The proposed model enables estimation of population structure, from coarse summaries down to finer-scale components, using a spatially explicit smoothing process model based on areal adjacency. Finally, we extended the proposed model to incorporate the correlation structure between socioeconomically stratified subpopulations. We applied the extended model to the Massachusetts Public Health Data Warehouse to evaluate the efficiency of the proposed method compared to traditional methods.
For each project, we used simulation studies to investigate the performance of the proposed estimators under varying circumstances and to determine which method may be more effective in different scenarios. Our evaluation found that the proposed methods could accurately estimate area-specific and group-specific prevalence with lower bias and variance. These methods address the issue of data sparsity in subpopulations and account for an underlying structure that more accurately reflects what occurs during ecological and data collection processes. The methods developed in this dissertation provide a powerful tool for accurately estimating the disease burden in hidden subpopulations, which is essential for targeted interventions and effective public health policies.
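As a rough illustration of the two ideas the abstract compares, the sketch below computes a textbook two-source capture-recapture estimate (Lincoln-Petersen, with Chapman's bias-corrected variant) alongside a multiplier benchmark estimate. This is not the Bayesian hierarchical spatial model proposed in the dissertation. The function names and all counts are hypothetical and chosen only to make the arithmetic easy to follow, and the capture-recapture estimators assume the two data sources are independent.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Two-source capture-recapture estimate: N = n1 * n2 / m.

    n1, n2: individuals observed in source 1 and source 2.
    m: individuals observed in both sources (the overlap).
    Assumes the sources capture individuals independently.
    """
    if m <= 0:
        raise ValueError("need at least one individual seen in both sources")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's bias-corrected variant; defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def multiplier(benchmark_count: int, proportion_in_benchmark: float) -> float:
    """Multiplier benchmark estimate: N = benchmark count / proportion.

    benchmark_count: size of a fully enumerated benchmark group
        (for example, people appearing in one registry).
    proportion_in_benchmark: estimated share of the hidden population
        that belongs to the benchmark group, taken from a separate survey.
    """
    if not 0 < proportion_in_benchmark <= 1:
        raise ValueError("proportion must be in (0, 1]")
    return benchmark_count / proportion_in_benchmark


# Hypothetical counts, for illustration only.
n1, n2, m = 200, 150, 30
print(lincoln_petersen(n1, n2, m))
print(chapman(n1, n2, m))
print(multiplier(250, 0.25))
```

Both approaches scale an observed count by an estimated capture fraction. They differ in where that fraction comes from: capture-recapture derives it from the overlap between sources, while the multiplier method takes it from an external estimate of the proportion.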
Description
2023
License
Attribution-NonCommercial-NoDerivatives 4.0 International