Advancing assistive systems with multimodal machine learning
Abstract
Assistive systems play a critical role in supporting individuals with disabilities by enhancing perception, decision-making, and interaction in complex real-world environments. Recent advances in machine learning, particularly in multimodal machine learning (MML), have led to significant progress in assistive systems for navigation, instruction following, and educational support, primarily in controlled settings and task-specific scenarios. Despite these advances, current assistive systems face fundamental challenges in scalability and generalization because they are limited in data diversity and scale and rely heavily on task-specific model designs. Many existing models are trained on small, specialized datasets that are expensive to collect and difficult to extend to new users, environments, or tasks.
In this dissertation, we address these limitations by framing assistive systems development as an end-to-end multimodal machine learning pipeline, with a focus on applications for people with visual impairments and students with learning disabilities. We present a series of works that systematically address challenges in data collection, representation learning, task-oriented modeling, and deployment efficiency.
We begin by addressing challenges in data collection, as multimodal data in assistive settings are often scarce, incomplete, or biased. Existing benchmarks show that models frequently fail to recognize accessibility-related objects, including wheelchairs and canes, because these objects are underrepresented in training data. To mitigate these limitations, we conduct real-world studies and develop a simulation environment that supports four benchmark challenges, each designed to provide more diverse and informative multimodal observations for assistive technology research.
Building on this enriched data foundation, we next investigate representation learning through multimodal pre-training. We demonstrate that pre-training plays a central role in mitigating missing-data issues, and that incorporating auxiliary tasks during multimodal pre-training significantly improves both generalization and downstream performance.
With robust representations in place, we then develop task-specific models for assistive navigation for people with visual impairments. The proposed models, ASSISTER and TIMELI, leverage multimodal features to align visual observations with user goals and temporal context. This alignment enables the generation of interpretable navigation instructions and supports proactive, context-aware decision-making in dynamic real-world environments.
Finally, to enable deployment in real-world assistive systems, we introduce a local–cloud collaborative framework, UniLCD, which dynamically balances computation between on-device and cloud resources. This design explicitly addresses deployment and inference efficiency under the computational and latency constraints of portable edge devices.
Overall, this dissertation presents a series of contributions across the assistive systems pipeline, demonstrating that an end-to-end multimodal machine learning approach can effectively address key challenges in data, modeling, and deployment. The findings highlight the potential of deployment-aware, data-efficient multimodal learning to enable robust and scalable assistive technologies for real-world use.
Description
2026
License
Attribution 4.0 International