Analyzing massive datasets with missing entries: models and algorithms
MetadataShow full item record
We initiate a systematic study of computational models to analyze algorithms for massive datasets with missing or erased entries and study the relationship of our models with existing algorithmic models for large datasets. We focus on algorithms whose inputs are naturally represented as functions, codewords, or graphs. First, we generalize the property testing model, one of the most widely studied models of sublinear-time algorithms, to account for the presence of adversarially erased function values. We design efficient erasure-resilient property testing algorithms for several fundamental properties of real-valued functions such as monotonicity, Lipschitz property, convexity, and linearity. We then investigate the problems of local decoding and local list decoding of codewords containing erasures. We show that, in some cases, these problems are strictly easier than the corresponding problems of decoding codewords containing errors. Moreover, we use this understanding to show a separation between our erasure-resilient property testing model and the (error) tolerant property testing model. The philosophical message of this separation is that errors occurring in large datasets are, in general, harder to deal with, than erasures. Finally, we develop models and notions to reason about algorithms that are intended to run on large graphs with missing edges. While running algorithms on large graphs containing several missing edges, it is desirable to output solutions that are close to the solutions output when there are no missing edges. With this motivation, we define average sensitivity, a robustness metric for graph algorithms. We discuss various useful features of our definition and design approximation algorithms with good average sensitivity bounds for several optimization problems on graphs. We also define a model of erasure-resilient sublinear-time graph algorithms and design an efficient algorithm for testing connectivity of graphs.
RightsAttribution 4.0 International