Memorization and privacy in learning: fundamental limits and new algorithms

Abstract
Data analysts often work with records pertaining to specific individuals; the results of their analysis, by design, depend on the details of the individuals' data. This dependence may occur in surprising ways and have negative consequences. In this dissertation, we investigate the dependence of learning algorithms on individual inputs through the lenses of memorization and privacy. We reveal fundamental limits on when memorization is inescapable, experimentally demonstrate the vulnerability of conventional learning methods, and design new algorithms with rigorous privacy guarantees.

We present learning tasks, stylized variants of image classification and sequence prediction, and prove that any high-accuracy learner for these tasks must encode a huge amount of information about the training data. We show that much of this stored information is irrelevant to the learning task, taking the form of noise or useless features. For some tasks, we prove a further result: high-accuracy learners must memorize almost the entirety of a large number of input examples. We identify similar phenomena in a streaming setting, where examples arrive one at a time. For natural sparse hypothesis classes, such as sparse linear classification over a polynomial kernel, we prove that nontrivial algorithms with few samples require large memory. This lower bound smoothly decreases as the sample size increases: with a longer stream, less memorization is required.

Moving beyond theorems, we investigate whether this stored information can be extracted by an attacker. We show that standard classifiers may be vulnerable to data reconstruction attacks. In our experiments, the vulnerability is extreme: an attacker can recover high-fidelity copies of a large fraction of the training data.

Finally, we approach data privacy from the other direction: how can we prevent individual data points from having a large effect on the output of an analysis? We design new differentially private algorithms for statistical estimation. Given samples from a subgaussian distribution, our algorithms accurately estimate the distribution's mean and covariance. These algorithms have low overhead relative to nonprivate estimators, requiring similar sample sizes and computational resources.
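To make the privacy guarantee concrete, the sketch below illustrates the textbook Gaussian mechanism for differentially private mean estimation: clip each sample to bound any individual's influence, then add noise calibrated to that bound. This is a minimal baseline under assumed parameters (clip_bound, eps, delta), not the dissertation's algorithm.

import numpy as np

def private_mean(x, eps, delta, clip_bound):
    # (eps, delta)-differentially private mean via the Gaussian mechanism.
    # x: (n, d) array; each row is one individual's record. Clipping each
    # row to L2 norm <= clip_bound caps the effect of replacing one record
    # on the empirical mean at 2 * clip_bound / n (the L2 sensitivity).
    n, d = x.shape
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    clipped = x * np.minimum(1.0, clip_bound / np.maximum(norms, 1e-12))
    sensitivity = 2.0 * clip_bound / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped.mean(axis=0) + np.random.normal(0.0, sigma, size=d)

# Example: estimate the mean of subgaussian (here, Gaussian) data.
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=1.0, size=(10_000, 5))
print(private_mean(samples, eps=1.0, delta=1e-5, clip_bound=10.0))

This baseline exposes the tension the dissertation's estimators address: a small clip_bound biases the estimate, while a large one inflates the noise, so achieving low overhead relative to nonprivate estimation requires more careful calibration than a fixed clipping bound.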
Date
2024
License
Attribution 4.0 International