Privacy-preserving smart-room visual analytics
The proliferation of sensors in living spaces in the last few years has led to the concept of a smart room of the future: an environment that allows intelligent interaction with its occupants, be it a living room or a conference room. Among the promised benefits of future smart rooms are improved energy efficiency, health benefits, and increased productivity. To realize such benefits, accurate and reliable localization of occupants and recognition of their poses, activities, and facial expressions are crucial. Extensive research has been performed to date in these areas, primarily using video cameras. However, with increasing concerns about privacy, the use of standard video cameras seems ill-suited for smart spaces; alternative sensing modalities and visual analytics techniques that preserve privacy are urgently needed. Motivated by this demand, this thesis aims to develop image and video analysis methodologies that protect occupants' (visual) privacy while preserving utility for an inference task. We propose two distinct methodologies to accomplish this. In the first one, we address privacy concerns by degrading the spatial resolution of images and videos to the point where they no longer provide visual utility to eavesdroppers. We have conducted proof-of-concept studies for the problems of head pose estimation, indoor occupant localization, and human action recognition at extremely low resolutions (eLR), that is, lower than 16×16 pixels. For head pose estimation from a single image at resolutions as low as 10×10 pixels or even 3×3 pixels, we developed an estimation algorithm using a classical data-driven approach. For occupant localization based on data from overhead-mounted single-pixel visible-light sensors, we developed both coarse- and fine-grained estimation algorithms using classical machine learning techniques.
For action recognition from eLR visual data, motivated by the success of deep learning in computer vision, we developed multiple two-stream Convolutional Neural Networks (ConvNets) that fuse spatial and temporal information. In particular, we proposed a novel semi-coupled, filter-sharing network that leverages high-resolution videos to train an eLR ConvNet. We demonstrated that practically useful inference performance can be achieved at eLR. While the use of eLR data can mitigate visual privacy concerns, it can also significantly limit utility compared to full-resolution data. Thus, as our second methodology, in addition to developing inference methods for eLR data, we took advantage of recent advancements in representation learning to design an identity-invariant data representation that also permits synthesis of utility-equivalent, realistic full-resolution data with a different identity. To this end, we proposed two novel models tailored for 2D images. We tested our models on a number of visual analytics tasks, such as recognizing facial expressions, estimating head pose, and estimating illumination conditions. A thorough evaluation of the proposed approaches under various threat scenarios demonstrates that they strike a balance between preservation of privacy and data utility. As an additional benefit, our approach enables expression- and head-pose-preserving face morphing.
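As an illustration of the first methodology, the following minimal Python sketch reduces a grayscale image to an extremely low resolution by block averaging. The abstract does not state how the thesis produces eLR data, so the averaging scheme, the function name `downsample_to_elr`, and the default 10×10 target are assumptions made only for this example.

```python
import numpy as np


def downsample_to_elr(image: np.ndarray, out_size: int = 10) -> np.ndarray:
    """Block-average a 2D grayscale image down to out_size x out_size pixels.

    Hypothetical helper for illustration; not the method used in the thesis.
    """
    if image.ndim != 2:
        raise ValueError("expected a 2D grayscale image")
    height, width = image.shape
    # Size of the pixel block that collapses into one eLR pixel.
    block_h, block_w = height // out_size, width // out_size
    if block_h == 0 or block_w == 0:
        raise ValueError("image is smaller than the requested eLR size")
    # Crop so that both dimensions divide evenly into blocks.
    cropped = image[: block_h * out_size, : block_w * out_size].astype(np.float64)
    # Reshape into (out_size, block_h, out_size, block_w) and average each block.
    blocks = cropped.reshape(out_size, block_h, out_size, block_w)
    return blocks.mean(axis=(1, 3))


if __name__ == "__main__":
    # Synthetic 120x160 frame standing in for a camera image.
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 256, size=(120, 160))
    elr = downsample_to_elr(frame, out_size=10)
    print(elr.shape)
```

Averaging over large blocks discards the fine spatial detail that makes a face or scene recognizable to a human viewer, while coarse intensity patterns remain available to an inference algorithm.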