Efficient vision and language models for autonomous systems
Abstract
The transition of vision-based systems, such as autonomous vehicles, from controlled laboratory environments to real-world deployment poses significant challenges due to constraints such as limited data availability and limited computational resources. Current approaches often send data to remote cloud servers for processing, which is energy-inefficient. To address these challenges, this thesis proposes novel methodologies for reliable and efficient inference in dynamic scenarios.

First, large-scale vision-based models are evaluated within the CARLA autonomous driving simulator. We introduce a 'switch' policy, trained using reinforcement learning, that offloads inference between a local model and a cloud model. We evaluate the effectiveness of this policy using a newly introduced metric, the Ecological Navigation Score (ENS), which accounts for route deviations, collisions, and energy consumption, all critical factors in assessing the effectiveness of a driving agent.

Second, we present a novel dataset for assessing the performance of large language models (LLMs) and vision-language models (VLMs) on autonomous driving tasks. The dataset comprises questions from learner's license examinations, covering various question formats and including both text-only questions and image-text pairs. We evaluate small-scale LLMs on this dataset to analyze their performance, aiming for a passing score of 80% or higher on the driving test. Through this comprehensive approach, the thesis addresses the pressing need for robust and efficient vision-based systems in dynamic real-world environments, contributing to advances in autonomous driving research and technology.
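As a rough illustration of the offloading idea described above, the sketch below shows how a learned switch policy might route each frame between a local and a cloud model at inference time. Everything here is an assumption for illustration: the names (switch_policy, local_model, cloud_model), the confidence-threshold heuristic, and the relative energy costs are hypothetical stand-ins, not the policy trained in the thesis.

# Hypothetical sketch of a "switch" policy that decides, per frame, whether
# to run a small local model or offload to a larger cloud model. All names
# and costs are illustrative assumptions, not the thesis implementation.

import random

LOCAL_ENERGY_COST = 1.0    # assumed relative energy cost of local inference
CLOUD_ENERGY_COST = 5.0    # assumed extra cost of transmitting to the cloud

def switch_policy(confidence: float, threshold: float = 0.8) -> str:
    """Toy heuristic: offload to the cloud when the local model is uncertain.
    The reinforcement-learning-trained policy would replace this rule."""
    return "local" if confidence >= threshold else "cloud"

def local_model(frame):
    # Stand-in for a small on-board perception model.
    return "steer_left", random.random()

def cloud_model(frame):
    # Stand-in for a large server-side model.
    return "steer_left"

def run_inference(frame):
    """Run the local model, then decide whether to offload this frame."""
    local_pred, confidence = local_model(frame)
    if switch_policy(confidence) == "local":
        return local_pred, LOCAL_ENERGY_COST
    return cloud_model(frame), LOCAL_ENERGY_COST + CLOUD_ENERGY_COST

if __name__ == "__main__":
    prediction, energy = run_inference(frame=None)
    print(prediction, energy)

In the thesis, the fixed threshold in switch_policy would instead be a policy learned with reinforcement learning to trade prediction quality against energy consumption.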
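The abstract does not give a formal definition of the Ecological Navigation Score. As a loose sketch only, assuming a multiplicative structure similar to common driving benchmarks, such a score might take the form

\[
\mathrm{ENS} = R \cdot \prod_{i} p_i^{\,n_i} \cdot \left(1 - \lambda\,\frac{E}{E_{\max}}\right)
\]

where R is the fraction of the route completed, p_i is a penalty coefficient for infraction type i (e.g., collisions or route deviations) incurred n_i times, E is the energy consumed, E_max a normalizing energy budget, and λ a weighting factor. All symbols here are assumptions for illustration, not the metric's actual definition.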
2024