Describing and retrieving visual content using natural language
MetadataShow full item record
Modern deep learning methods have boosted research progress in visual recognition and text understanding but it is a non-trivial task to unite these advances from both disciplines. In this thesis, we develop models and techniques that allow us to connect natural language and visual content enabling automatic video subtitling, visual grounding, and text-based image search. Such models could be useful in a wide range of applications in robotics and human-computer interaction bridging the gap in vision and language understanding. First, we develop a model that generates natural language descriptions of the main activities and scenes depicted in short videos. While previous methods were constrained to a predefined list of objects, actions, or attributes, our model learns to generate descriptions directly from raw pixels. The model exploits available audio information and the video’s category (e.g., cooking, movie, education) to generate more relevant and coherent sentences. Then, we introduce a technique for visual grounding of generated sentences using the same video description model. Our approach allows for explaining the model’s prediction by localizing salient video regions for corresponding words in the generated sentence. Lastly, we address the problem of image retrieval. Existing cross-modal retrieval methods work by learning a common embedding space for different modalities using parallel data such as images and their accompanying descriptions. Instead, we focus on the case when images are connected by relative annotations: given the context set as an image and its metadata, the user can specify desired semantic changes using natural language instructions. The model needs to capture distinctive visual differences between image pairs as described by the user. Our approach enables interactive image search such that the natural language feedback significantly improves the efficacy of image retrieval. We show that the proposed methods advance the state-of-the-art for video captioning and image retrieval tasks in terms of both accuracy and interpretability.