The differential geometric structure in supervised learning of classifiers
In this thesis, we study the overfitting problem in supervised learning of classifiers from a geometric perspective. As with many inverse problems, learning a classification function from a given set of example-label pairs is an ill-posed problem, i.e., there exist infinitely many classification functions that can correctly predict the class labels for all training examples. Among them, according to Occam's razor, simpler functions are favored since they are less overfitted to training examples and are therefore expected to perform better on unseen examples. The standard technique to enforce Occam's razor is to introduce a regularization scheme, which penalizes some type of complexity of the learned classification function. Some widely used regularization techniques are functional norm-based (Tikhonov) techniques, ensemble-based techniques, early stopping techniques, etc. However, there is important geometric information in the learned classification function that is closely related to overfitting, and has been overlooked by previous methods. In this thesis, we study the complexity of a classification function from a new geometric perspective. In particular, we investigate the differential geometric structure in the submanifold corresponding to the estimator of the class probability P(y|x), based on the observation that overfitting produces rapid local oscillations and hence large mean curvature of this submanifold. We also show that our geometric perspective of supervised learning is naturally related to an elastic model in physics, where our complexity measure is a high dimensional extension of the surface energy in physics. This study leads to a new geometric regularization approach for supervised learning of classifiers. In our approach, the learning process can be viewed as a submanifold fitting problem that is solved by a mean curvature flow method. In particular, our approach finds the submanifold by iteratively fitting the training examples in a curvature or volume decreasing manner. Our technique is unified for both binary and multiclass classification, and can be applied to regularize any classification function that satisfies two requirements: firstly, an estimator of the class probability can be obtained; secondly, first and second derivatives of the class probability estimator can be calculated. For applications, where we apply our regularization technique to standard loss functions for classification, our RBF-based implementation compares favorably to widely used regularization methods for both binary and multiclass classification. We also design a specific algorithm to incorporate our regularization technique into the standard forward-backward training of deep neural networks. For theoretical analysis, we establish Bayes consistency for a specific loss function under some mild initialization assumptions. We also discuss the extension of our approach to situations where the input space is a submanifold, rather than a Euclidean space.