Learning and evaluating multimodal representations for digital domains
Abstract
Digital domains such as mobile apps and webpages have become fundamental to everyday life. Humans perform many tasks on their phones and online, such as reading recipes, booking calendar events, viewing images, and shopping for food or clothes. A prerequisite to building artificially intelligent models that aid in these tasks is learning embeddings, i.e., representations, of mobile app and webpage data. In this thesis, we make four contributions.

(1) We curate multimodal app and webpage datasets. Digital domains capture four modalities: image, text, structure, and action. We contribute the first multimodal app dataset with all app modalities and language annotations, and the first multimodal webpage dataset to retain structure alongside all image and text content in a unified webpage sample.

(2) We define new tasks to evaluate app and webpage understanding. Using our new app dataset, we define an instruction-following benchmark that requires mapping a high-level natural language user goal to a sequence of low-level actions. We also define a novel feasibility classification task, in which we predict which user requests can be satisfied in the app environment. Using our new webpage dataset, we define three generation-style tasks: webpage description generation, section summarization, and contextual image captioning. These evaluate webpage understanding at the global, regional, and local level, respectively.

(3) We evaluate the importance of each data modality. With our new benchmarks, we determine the impact of each modality on downstream task performance. We find images to be useful for classifying whether a user command is actually satisfiable in an app environment, and key to correcting over-reliance on text information. For our webpage benchmarks, contextual text and images aid all tasks: they help image captions retain knowledge-based detail, and they help page descriptions and section summaries stay topically relevant and specific.
(4) We propose new methods for learning multimodal representations of digital domains. Utilizing all available modalities, we contribute a novel attention scheme that makes use of webpage structure, separating the most salient content for each task. Results demonstrate that our multimodal encoder is both more performant and more computationally efficient. For mobile app representations, we propose using text descriptions and action sequences to learn embeddings that encode both global and local features while being significantly more data efficient. We outperform prior work on a suite of app understanding tasks while using only publicly available data.
2024
License
Attribution 4.0 International