Towards a scalable and generalized end-to-end policy for autonomous driving

Abstract
The field of autonomous driving has witnessed a significant surge in the adoption of end-to-end frameworks, where raw sensor input is directly mapped to vehicle control signals. In contrast to the modular pipeline, which follows a sequential order of perception, prediction, and planning, the end-to-end approach offers the advantage of jointly optimizing features for both perception and planning. However, it still faces several critical challenges, including scalability, interpretability, and robustness. Achieving scalable and generalized end-to-end policies remains a fundamental challenge due to limitations in data diversity, distributional bias, and generalization to novel environments. Prior approaches, such as X-World (Zhang et al., 2021a) and ASSISTER (Huang et al., 2022), demonstrate suboptimal performance when deployed in diverse real-world settings, highlighting the need for more robust and scalable learning frameworks. In this work, we introduce a series of improvements across representation learning, policy learning, and structured sensorimotor training to enhance generalization and scalability in end-to-end driving policies.

First, we focus on learning scalable representations via pre-training. We propose NeMo (Huang et al., 2024), a neural volumetric world modeling approach that can be pre-trained in a self-supervised manner on image reconstruction and occupancy prediction tasks, benefiting scalable training and deployment paradigms such as imitation learning. We also introduce XVO (Lai et al., 2023), a semi-supervised learning method pre-trained with multi-modal auxiliary tasks, i.e., segmentation, flow, depth, and audio prediction, to facilitate generalized representations for monocular visual odometry (VO) across diverse datasets and settings. These techniques improve spatiotemporal scene understanding across diverse conditions.

Second, we enhance policy learning through diverse supervision. To enrich supervision with new perspectives and maneuvers, we introduce LbW (Zhang and Ohn-Bar, 2021), which enables learning a driving policy from the demonstrations of other, non-ego vehicles in a given scene without requiring full knowledge of either the state or the expert actions. We further propose SelfD (Zhang et al., 2022b), an iterative semi-supervised training framework for learning scalable driving from large-scale unlabeled online monocular data. Both methods leverage diverse supervision to mitigate long-tailed maneuver distributions and domain shift, advancing the development of safe and scalable autonomous vehicles.

Finally, we introduce structured learning strategies for sensorimotor agents. We present CaT (Zhang et al., 2023a), a novel distillation scheme in which the student is trained with richer supervision in feature space and optimized via a student-paced coaching mechanism. We also introduce FeD (Zhang et al., 2024), which leverages advances in Large Language Models (LLMs) to provide corrective, fine-grained feedback on the underlying reasons behind driving prediction failures, improving robustness in complex driving scenarios.

Through extensive evaluation across multiple benchmarks, we demonstrate that our approaches significantly improve generalization, robustness, and scalability in end-to-end autonomous driving policies.
Our findings highlight the potential of representation pre-training, diverse supervision, and structured learning to bridge the gap between simulation and real-world deployment, advancing the field toward truly scalable and adaptable autonomous driving systems.
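
To make the pre-training idea concrete, below is a minimal PyTorch-style sketch of a joint image-reconstruction and occupancy-prediction objective in the spirit of NeMo. The module shapes, the source of the occupancy targets (e.g., voxelized LiDAR), and the loss weighting are illustrative assumptions, not the published architecture.

```python
# Minimal sketch of a joint self-supervised pre-training objective in the
# spirit of NeMo: an image encoder is trained to both reconstruct its input
# and predict a coarse 3D occupancy volume. All module shapes, names, and
# loss weights here are illustrative assumptions, not the published model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VolumetricPretrainer(nn.Module):
    def __init__(self, feat_dim: int = 32, grid: int = 16):
        super().__init__()
        # Shared encoder: image -> latent feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Head 1: decode the latent back to RGB (image reconstruction task).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat_dim, 3, 4, stride=2, padding=1),
        )
        # Head 2: lift the latent to a grid^3 occupancy logit volume.
        self.occ_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, grid ** 3),
        )
        self.grid = grid

    def forward(self, img):
        z = self.encoder(img)
        recon = self.decoder(z)
        occ = self.occ_head(z).view(-1, self.grid, self.grid, self.grid)
        return recon, occ

def pretrain_loss(model, img, occ_target, w_occ=0.5):
    """Self-supervised loss: L1 reconstruction + BCE occupancy."""
    recon, occ_logits = model(img)
    loss_recon = F.l1_loss(recon, img)
    loss_occ = F.binary_cross_entropy_with_logits(occ_logits, occ_target)
    return loss_recon + w_occ * loss_occ

# Usage on dummy data:
model = VolumetricPretrainer()
img = torch.rand(2, 3, 64, 64)
occ = (torch.rand(2, 16, 16, 16) > 0.5).float()  # e.g., from LiDAR voxelization
loss = pretrain_loss(model, img, occ)
loss.backward()
```

Once pre-trained this way, the encoder can be reused as the perception backbone for downstream imitation learning, which is the scalability benefit the abstract refers to.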
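The coaching idea in CaT can likewise be sketched as a distillation step in which a privileged teacher supplies both action and feature targets, while per-sample weights follow the student's current competence. The gating rule, the exponential weighting, and the toy network shapes below are simplified stand-ins for the paper's actual mechanism.

```python
# Illustrative sketch of feature-space distillation with a student-paced
# coaching weight, loosely in the spirit of CaT. The teacher is assumed to
# be a privileged network (e.g., with ground-truth scene inputs); the
# weighting rule is a hypothetical simplification of the coaching scheme.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, in_dim: int, feat_dim: int = 64, act_dim: int = 2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, act_dim)

    def forward(self, x):
        f = self.body(x)
        return f, self.head(f)

def distill_step(student, teacher, obs, priv_obs, tau=1.0):
    with torch.no_grad():
        t_feat, t_action = teacher(priv_obs)  # privileged targets
    s_feat, s_action = student(obs)

    # Per-sample imitation loss on the action output.
    loss_act = F.l1_loss(s_action, t_action, reduction="none").mean(dim=-1)
    # Per-sample feature matching: richer supervision than actions alone.
    loss_feat = F.mse_loss(s_feat, t_feat, reduction="none").mean(dim=-1)

    # Student-paced coaching: samples the student currently finds too hard
    # (large action error) are down-weighted so training follows its pace.
    with torch.no_grad():
        w = torch.exp(-loss_act / tau)
    return (w * (loss_act + loss_feat)).mean()

# Usage on dummy data: student sees raw observations, teacher sees richer ones.
student, teacher = Net(in_dim=16), Net(in_dim=24)
obs, priv = torch.rand(8, 16), torch.rand(8, 24)
loss = distill_step(student, teacher, obs, priv)
loss.backward()
```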
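Finally, the iterative semi-supervised loop behind SelfD can be illustrated as confidence-filtered pseudo-labeling on an unlabeled pool. The discrete maneuver classes, the confidence threshold, and the number of rounds are hypothetical simplifications; the actual method operates on large-scale online monocular video.

```python
# Minimal sketch of an iterative pseudo-labeling loop in the spirit of
# SelfD: a model trained on a small labeled set repeatedly pseudo-labels
# an unlabeled pool and retrains on the confident subset. The maneuver
# classes and threshold below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_training_round(model, opt, labeled, unlabeled, conf_thresh=0.9):
    # 1) Pseudo-label the unlabeled pool, keeping only confident predictions.
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(unlabeled), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf > conf_thresh
    # 2) Retrain on labeled data plus the confident pseudo-labeled subset.
    model.train()
    x = torch.cat([labeled[0], unlabeled[keep]])
    y = torch.cat([labeled[1], pseudo[keep]])
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage on dummy data: 4 hypothetical maneuver classes, 32-dim features.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
labeled = (torch.rand(16, 32), torch.randint(0, 4, (16,)))
unlabeled = torch.rand(128, 32)
for _ in range(3):  # a few self-training rounds
    self_training_round(model, opt, labeled, unlabeled)
```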