Efficient deep learning for image and video understanding

Date
2024
Abstract
Deep neural networks have excelled in computer vision tasks such as classifying a photo into a fixed set of categories or writing a short caption. However, significant challenges remain in scaling up deep learning given the limited model, computation, and data resources available in real applications. In this thesis, we focus on three major challenges in popular computer vision frameworks: 1) memory efficiency, 2) training-data efficiency, and 3) computation efficiency. The challenge of memory efficiency arises from the large model sizes required to achieve high accuracy, especially since each narrow task demands its own fine-tuned network. To reduce memory, previous work explored solving multiple tasks with a single network; however, designing such multi-task learning (MTL) systems still requires deciding which parameters to share across tasks. It is also important to maximize data efficiency (i.e., minimize the data cost) for computer vision tasks. We focus mainly on improving data efficiency in MTL and in recent Vision-Language Foundation Models (VLFMs). Despite many works on MTL, how to optimally allocate the labeling budget across tasks for maximum cost efficiency remains an unresolved problem. Foundation models likewise suffer from poor data efficiency: VLFMs demonstrate superior multi-task performance but require billions of training examples during pretraining, and attempts to reproduce them using smaller public datasets often lead to reduced accuracy. Moreover, even though VLFMs exhibit impressive transferability, adapting them to new tasks remains challenging because data annotations are limited in downstream tasks. Last but not least, the poor computational efficiency of large models is concerning, especially when applying them to video understanding.
To enhance computational efficiency, previous studies have introduced numerous methods to modify network design, yet they largely overlook the bit-width of weights and activations. This oversight is significant because bit-width plays a crucial role in the computational cost of inference. To address these challenges, our research develops several targeted approaches. In Part I, to address memory efficiency in the general multi-task setting, we propose an adaptive parameter-sharing scheme that learns the network's sharing pattern from the given task data. In Part II, we first formulate the budget allocation problem in multi-task learning and design a task-adaptive budget allocation algorithm to further optimize data-annotation efficiency. Then, to address the prohibitive pretraining and fine-tuning costs associated with VLFMs, we introduce a novel distillation mechanism, "DIME-FM", and a prompt-learning method, "DualCoOp", to largely reduce the amount of data needed in the pretraining and fine-tuning phases, respectively. In Part III, we propose an adaptive quantization framework that chooses the optimal bit-width for each video frame by balancing performance and computational efficiency, reducing computation by over 50% without loss of accuracy. Together, these contributions improve the efficiency of deep learning for image and video understanding across three dimensions.
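To give a flavor of the per-frame adaptive quantization idea, the sketch below pairs standard uniform symmetric quantization with a toy bit-width policy: frames that change little from their predecessor are processed at a cheap bit-width, while frames with large changes fall back to a higher precision. This is a minimal illustration under assumed details — the function names, the 4/8-bit choices, and the frame-difference threshold are all hypothetical, not the thesis's actual algorithm.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric per-tensor quantization to `bits` bits.
    A standard scheme, not necessarily the one used in the thesis."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.abs(x).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest quantization level, then map back to float.
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def choose_bitwidth(prev_frame: np.ndarray, frame: np.ndarray,
                    budget_bits=(4, 8), threshold=0.05) -> int:
    """Toy per-frame policy (hypothetical): if the frame changed little
    since the previous one, use the cheap bit-width; otherwise use the
    expensive one. A learned policy would replace this heuristic."""
    low, high = budget_bits
    diff = float(np.mean(np.abs(frame - prev_frame)))
    return low if diff < threshold else high
```

A static video segment would then run mostly at 4 bits, and only frames with significant motion would pay for 8-bit inference — the source of the computation savings the abstract describes.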
License
Attribution 4.0 International