Vision and language understanding with localized evidence
MetadataShow full item record
Enabling machines to solve computer vision tasks with natural language components can greatly improve human interaction with computers. In this thesis, we address vision and language tasks with deep learning methods that explicitly localize relevant visual evidence. Spatial evidence localization in images enhances the interpretability of the model, while temporal localization in video is necessary to remove irrelevant content. We apply our methods to various vision and language tasks, including visual question answering, temporal activity detection, dense video captioning and cross-modal retrieval. First, we tackle the problem of image question answering, which requires the model to predict answers to questions posed about images. We design a memory network with a question-guided spatial attention mechanism which assigns higher weights to regions that are more relevant to the question. The visual evidence used to derive the answer can be shown by visualizing the attention weights in images. We then address the problem of localizing temporal evidence in videos. For most language/vision tasks, only part of the video is relevant to the linguistic component, so we need to detect these relevant events in videos. We propose an end-to-end model for temporal activity detection, which can detect arbitrary length activities by coordinate regression with respect to anchors and contains a proposal stage to filter out background segments, saving computation time. We further extend activity category detection to event captioning, which can express richer semantic meaning compared to a class label. This derives the problem of dense video captioning, which involves two sub-problems: localizing distinct events in long video and generating captions for the localized events. We propose an end-to-end hierarchical captioning model with vision and language context modeling in which the captioning training affects the activity localization. Lastly, the task of text-to-clip video retrieval requires one to localize the specified query instead of detecting and captioning all events. We propose a model based on the early fusion of words and visual features, outperforming standard approaches which embed the whole sentence before performing late feature fusion. Furthermore, we use queries to regulate the proposal network to generate query related proposals. In conclusion, our proposed visual localization mechanism applies across a variety of vision and language tasks and achieves state-of-the-art results. Together with the inference module, our work can contribute to solving other tasks such as video question answering in future research.