From coarse to fine-grained concept based discrimination for phrase detection
Abstract
Phrase detection is a vision-and-language task in which the goal is to determine whether a phrase is relevant to an image and, if so, to localize it. The task has many important downstream applications, such as assistive robotics, where a robot needs to detect and localize an object in an environment. However, training discriminative phrase detection models is difficult because of two main challenges: 1) sparse training labels: ground-truth regions are not exhaustively labeled with all applicable phrases, which makes determining negative phrases challenging; and 2) a heavily imbalanced training distribution: a small subgroup of the phrases constitutes most of the training data. We address these problems through a novel coarse discrimination method, Negative Coarse Concepts (NCC). The method groups visually coherent phrases into concepts and uses those concepts as negative samples, which exposes the model to a wide and diverse portion of the distribution while minimizing the chance of false negatives. Furthermore, we supplement the coarse discrimination method with a fine-grained module (FGM), which discriminates between mutually exclusive tokens of diverse groups such as colors and sizes. Finally, we combine these two novel modules into one model, CFCD-Net, which improves the state-of-the-art performance of phrase detection models on the Flickr30K Entities and RefCOCO+ datasets by 1.5-2 points. We further demonstrate how these improvements translate directly into improvements on the downstream task of Binary Image Selection (BISON).
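The idea of using grouped concepts as negatives can be illustrated with a minimal sketch. This is not the authors' implementation: the toy embeddings, the greedy grouping rule, the function names, and both thresholds are assumptions chosen only to show how concept-level negatives can avoid likely false negatives (here, "an animal" is not used as a negative for "a dog").

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def group_into_concepts(embeddings, threshold=0.9):
    """Greedy grouping (an assumed stand-in for the paper's grouping step):
    a phrase joins the first concept whose centroid is similar enough,
    otherwise it starts a new concept."""
    concepts = []  # each concept is a list of phrases
    for phrase, vec in embeddings.items():
        for members in concepts:
            center = centroid([embeddings[m] for m in members])
            if cosine(vec, center) >= threshold:
                members.append(phrase)
                break
        else:
            concepts.append([phrase])
    return concepts

def negative_concepts(phrase, embeddings, concepts, max_sim=0.5):
    """Concepts that do not contain the phrase and whose centroid is
    dissimilar to it; related concepts are skipped to limit false negatives."""
    vec = embeddings[phrase]
    negatives = []
    for members in concepts:
        if phrase in members:
            continue
        center = centroid([embeddings[m] for m in members])
        if cosine(vec, center) < max_sim:
            negatives.append(members)
    return negatives

# Hypothetical toy embeddings; a real system would use learned phrase features.
EMBEDDINGS = {
    "a dog": [1.0, 0.0, 0.0],
    "brown puppy": [0.95, 0.1, 0.0],
    "a car": [0.0, 1.0, 0.0],
    "red truck": [0.1, 0.95, 0.0],
    "an animal": [0.7, 0.0, 0.7],
}

concepts = group_into_concepts(EMBEDDINGS)
negatives = negative_concepts("a dog", EMBEDDINGS, concepts)
print(concepts)
print(negatives)
```

With these toy vectors, the vehicle concept is selected as a negative for "a dog", while the related "an animal" concept is excluded because its similarity to the phrase exceeds the assumed `max_sim` threshold.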
License
Attribution 4.0 International