4th DriveX Workshop In conjunction with CVPR 2026

Foundation Models for Autonomous Driving

A premier forum uniting academic, industry, and standards communities to shape the next generation of cooperative, foundation-model-driven autonomous driving and intelligent transportation systems.

Wednesday, June 3, 2026
Denver, Colorado, USA
Colorado Convention Center, Room 207
Curated keynote lineup from academia & industry
Focus on real-world V2X datasets & benchmarks
Safety, robustness, and trustworthy autonomy

Introduction

The 4th edition of the DriveX Workshop focuses on how foundation models and cooperative systems can redefine perception, prediction, planning, and decision-making for autonomous driving and intelligent transportation infrastructure.

Traditional single-vehicle pipelines have achieved impressive progress in 3D detection and tracking, yet they remain constrained by limited viewpoints, occlusions, and domain shifts. Cooperative driving systems, powered by V2X communication and roadside/edge intelligence, extend sensing range, enrich scene context, and enable shared representations across vehicles and infrastructure.

In parallel, foundation models, including vision, vision-language, and multi-modal large models, unlock powerful generalization capabilities: open-vocabulary understanding, scalable pretraining, zero-shot adaptation, and interpretable reasoning about complex road scenes. Emerging end-to-end and agentic systems such as large driving models promise unified perception-to-control frameworks but raise new questions in trustworthiness, reliability, calibration, and evaluation at urban scale.

DriveX 2026 convenes researchers and practitioners from computer vision, robotics, communications, transportation, AI safety, and policy to:

Topics of Interest

Schedule (Tentative)

Time Session
08:00 – 08:05 Opening Remarks – Welcome & Workshop Overview
08:05 – 08:20 Opening Keynote – Dr. Walter Zimmer

Dr. Walter Zimmer

University of California, Los Angeles (UCLA), USA & Technical University of Munich (TUM), Germany

Abstract

Autonomous driving in urban environments is fundamentally limited by the range, occlusions, and failure modes of vehicle-only perception. This opening keynote shows research advances in cooperative roadside–vehicle perception by fusing multi-modal data from onboard sensors and intelligent roadside infrastructure via V2X communication to extend situational awareness beyond line of sight. The proposed methods improve real-time 3D object detection and tracking in dense traffic and are supported by new large-scale, multi-modal datasets for benchmarking cooperative perception in real-world urban settings. By integrating recent advances in foundation models, including vision-language models, the work further enables semantic understanding of complex traffic scenes, laying the groundwork for AI-driven urban digital twins and safer, more efficient intelligent transportation systems.

Speaker Bio

Dr. rer. nat. Walter Zimmer is a postdoctoral researcher at the University of California, Los Angeles (UCLA) and a guest researcher at the Technical University of Munich (TUM). He received his Ph.D. from TUM in 2025. His research focuses on cooperative autonomous driving, 3D perception, and 3D foundation models. He has authored over 40 publications at top venues such as CVPR, ICCV, ECCV, ICML, NeurIPS, and T-PAMI. Dr. Zimmer previously worked as an Autonomous Systems Engineer at the startup STTech and as a research assistant at Siemens AG. His work has earned multiple awards, including the IEEE ITSS Best Student Paper Award 2023 and the IEEE ITSS Best Dissertation Award 2025.

08:20 – 08:40 Keynote 1 – Dr. Balajee Kannan

Dr. Balajee Kannan

Motional, USA

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

08:40 – 09:00 Keynote 2 – Prof. Dr. Jiaqi Ma

Prof. Dr. Jiaqi Ma

University of California Los Angeles (UCLA), USA

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Dr. Jiaqi Ma is Director of the FHWA/UCLA Center of Excellence on New Mobility and Automated Vehicles, Professor at the UCLA Samueli School of Engineering, Director of the UCLA Mobility Lab, and Associate Director of the UCLA Institute of Transportation Studies. He has led and managed numerous transportation research projects funded by the U.S. Department of Transportation, National Science Foundation, state Departments of Transportation, and other federal, state, and local agencies. His research spans automated driving, mobility digital twins, multimodal sensing, cooperative perception and decision-making, robotics, spatial data mining, simulation, and reasoning.

09:00 – 09:20 Keynote 3 – Dr. Mingxing Tan

Dr. Mingxing Tan

Waymo, USA

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

09:20 – 09:40 Keynote 5 – Dr. Jamie Shotton

Dr. Jamie Shotton

Wayve, Vancouver

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

09:40 – 10:00 Keynote 4 – Mustafa Bal

Mustafa Bal

Nomadic AI, USA

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

10:00 – 10:50 Poster Session I (ExHall A) & Coffee Break
10:00 – 10:30 Nomadic AI – Live Demo
10:50 – 11:10 Keynote 6 – Prof. Dr. Marco Pavone

Prof. Dr. Marco Pavone

Stanford University & NVIDIA, USA

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

11:10 – 11:30 Keynote 7 – Dr. Phil Duan

Dr. Phil Duan

Tesla, USA

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

11:30 – 12:00 Panel Discussion I: Industry Track
12:00 – 13:00 Lunch Break & Networking
13:00 – 13:20 Keynote 8 (Challenge) – Dr. Tony Qi

Dr. Xuewei (Tony) Qi

Motional AD LLC, USA

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

13:20 – 13:40 Keynote 9 – Prof. Dr. Manmohan Chandraker

Prof. Dr. Manmohan Chandraker

UCSD & NEC Labs, USA

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

13:40 – 14:00 Keynote 10 – Prof. Dr. Sharon Li

Prof. Dr. Sharon Li

University of Wisconsin-Madison, USA

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

14:00 – 14:20 Keynote 11 – Prof. Dr. Holger Caesar

Prof. Dr. Holger Caesar

Delft University of Technology, Netherlands

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

14:20 – 14:40 Keynote 12 – Prof. Dr. Daniel Cremers

Prof. Dr. Daniel Cremers

Technical University of Munich, Germany

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

14:40 – 15:00 Keynote 13 – Prof. Dr. Angela Dai

Prof. Dr. Angela Dai

Technical University of Munich, Germany

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

15:00 – 16:00 Poster Session II & Coffee Break
16:00 – 16:30 Panel Discussion II: Academic Track
16:30 – 16:40 Oral Paper Presentation 1: Calibration-Free View-Agnostic Monocular 3D Object Detection for Urban Scenes

Dr. Mehmet Kerem Turkcan

Columbia University, USA

Abstract

Cooperative vehicle-to-everything (V2X) perception requires 3D object detection across heterogeneous cameras whose intrinsic parameters may be unavailable, imprecise, or drifting. We present UrbanOmniDetect, a calibration-free monocular 3D object detection framework that predicts ordered 2D projections of 3D bounding box vertices from a single RGB image. By formulating 3D detection as keypoint regression within a backbone-agnostic single-stage architecture, a single model generalizes across ego-vehicle, infrastructure, and aerial viewpoints without camera intrinsics or scene priors. We construct the UrbanOmniView dataset by unifying KITTI, DAIR-V2X, and high-fidelity Unreal Engine 5 synthetic data (4K, ray-traced) spanning ground-level, traffic-surveillance, and drone perspectives. A homography-based bird's-eye-view head maps predicted ground-contact keypoints to a top-down plane, enforcing geometric consistency without camera parameters. We experiment with YOLO11 backbone variants at multiple scales and augmented feature pyramid levels. On the KITTI benchmark, our best model achieves AP_3D = 30.71 (Moderate) and AP_BEV = 35.19 at IoU >= 0.7, outperforming calibration-dependent baselines on the Moderate and Hard splits, with an mAP_50:95 of 0.751 and 10 ms inference on an A100 GPU. Calibration-dependent baselines degrade catastrophically under small intrinsic perturbations, whereas our formulation is invariant by construction. UrbanOmniDetect provides a deployment-ready framework for autonomous driving, drone surveillance, and V2X cooperative perception.
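As a reader aid, the idea of mapping ground-contact keypoints to a top-down plane with a homography can be sketched as follows. This is a minimal illustration and not the UrbanOmniDetect implementation: the function name `apply_homography`, the matrix `H_EXAMPLE`, and the keypoint values are all hypothetical, and the abstract does not state how the homography is obtained.

```python
def apply_homography(H, points):
    """Map 2D image points (u, v) to plane coordinates with a 3x3 homography H."""
    mapped = []
    for u, v in points:
        x = H[0][0] * u + H[0][1] * v + H[0][2]
        y = H[1][0] * u + H[1][1] * v + H[1][2]
        w = H[2][0] * u + H[2][1] * v + H[2][2]
        if abs(w) < 1e-12:
            raise ValueError("point maps to infinity under H")
        mapped.append((x / w, y / w))  # perspective division
    return mapped


# Hypothetical image-to-ground homography (illustrative values only).
H_EXAMPLE = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.005, 1.0],
]

# Hypothetical ground-contact keypoints of one box, in pixels.
ground_contacts = [(100.0, 200.0), (300.0, 200.0)]
bev_points = apply_homography(H_EXAMPLE, ground_contacts)
```

Because the mapping is a single 3x3 matrix, no camera intrinsics appear in the computation; the perspective division by `w` is what distinguishes a homography from an affine map.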

Speaker Bio

Speaker bio will be announced closer to the workshop date.

16:40 – 16:50 Oral Paper Presentation 2: 4-D Radar Meets LiDAR and Camera: Cooperative Perception under Adverse Weather

Melih Yazgan

Columbia University, USA

Abstract

Cooperative perception is critical for autonomous driving but remains highly fragile when cameras and LiDAR degrade in adverse weather. We address this critical limitation by elevating 4D imaging radar to a first-class modality within collaborative frameworks and introducing the first Doppler-guided spatial attention mechanism for multi-agent fusion. Our approach extends two representative backbones: (i) radar substitutes LiDAR to form a radar-camera pipeline, and (ii) radar complements LiDAR to form a LiDAR-radar pipeline. A Doppler-derived mask dynamically emphasizes moving objects while preserving static context, significantly enhancing robustness in cluttered and low-visibility scenes. To support comprehensive evaluation, we release radar-augmented benchmarks (OPV2V-R and Adver-City-R) featuring physics-based LiDAR degradation. Experiments demonstrate that substituting LiDAR with radar nearly doubles baseline detection accuracy in fog, while our Doppler-guided attention provides the essential refinement needed to achieve high precision. Furthermore, our LiDAR-radar fusion equipped with this attention mechanism achieves state-of-the-art robustness under heavy rain and fog. Additional validation on the real-world TruckScenes dataset confirms that our Doppler-guided radar modules transfer effectively beyond simulation, firmly establishing 4D radar as a primary modality for all-weather collaborative perception.
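One way to picture a mask that emphasizes moving objects while preserving static context is a soft per-cell weight that never drops below 1. The sketch below is an assumption made for illustration only, not the authors' Doppler-guided attention: the functions `doppler_mask` and `apply_mask`, the tanh weighting, and the parameters `alpha` and `v_scale` are all hypothetical.

```python
import math


def doppler_mask(radial_speeds, alpha=1.0, v_scale=0.5):
    """Return one weight per cell in [1, 1 + alpha).

    Static cells (speed 0) keep weight 1, so their features are preserved;
    faster cells approach 1 + alpha. alpha and v_scale are hypothetical knobs.
    """
    return [1.0 + alpha * math.tanh(abs(v) / v_scale) for v in radial_speeds]


def apply_mask(features, mask):
    """Scale each cell's feature vector by its mask weight."""
    return [[w * f for f in cell] for cell, w in zip(features, mask)]


# Hypothetical per-cell radial speeds (m/s) and two-channel features.
speeds = [0.0, 0.1, 5.0]
features = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]

mask = doppler_mask(speeds)
weighted = apply_mask(features, mask)
```

Using a multiplicative weight with a floor of 1 means static structure is passed through unchanged rather than suppressed, which is the property the abstract attributes to its mask.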

Speaker Bio

Speaker bio will be announced closer to the workshop date.

16:50 – 17:00 Oral Paper Presentation 3: Rethinking Intermediate Module Utilization in V2X End-to-End Autonomous Driving

Yiming Kan

Technical University of Munich (TUM), Germany

Abstract

End-to-end autonomous driving has progressed rapidly, with vehicle-side models relying on perception or ego status. UniV2X has extended this paradigm to the Vehicle-to-Everything (V2X) domain, where the broader perceptual scope of V2X offers a more revealing context for revisiting the effective utilization of intermediate modules. Prior work has examined the utility of intermediate modules in vehicle-side models, with studies suggesting that historical trajectories or current ego status alone may suffice for achieving competitive performance on open-loop datasets. Our paper aims to revisit this assumption in the V2X setting. Using the UniV2X model as the baseline and the V2X-Seq dataset as the testbed, we examine the contribution of intermediate modules to the final planning output and explore the extent to which their utility is fully realized. Our study reveals that current end-to-end models tend to underutilize the guidance provided by intermediate modules to the planning stage, reflecting a lack of planning-oriented design. To address this issue, we propose Optimized Multi-Experts Guided Autonomous Driving (OMEGA), a functional integration mechanism that explicitly improves the contribution of intermediate modules to the planning process. Experimental results demonstrate that our approach significantly enhances the functional contribution of each intermediate component. Our findings suggest that performance limitations are not due to the lack of new modules but stem from the underutilization of existing ones, urging a reconsideration of current end-to-end design practices.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

17:00 – 17:10 Oral Paper Presentation 4: DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather

Christof Leitgeb

Graz University of Technology, Austria

Abstract

Reliable and weather-robust perception systems are essential for safe autonomous driving and typically employ multi-modal sensor configurations to achieve comprehensive environmental awareness. While recent automotive FMCW Radar-based approaches have achieved remarkable performance on detection tasks in adverse weather conditions, they exhibit limitations in resolving fine-grained spatial details, which are particularly critical for detecting smaller and vulnerable road users (VRUs). Furthermore, existing research has not adequately addressed VRU detection in adverse weather datasets such as K-Radar. We present DinoRADE, a Radar-centered detection pipeline that processes dense Radar tensors and aggregates vision features around transformed reference points in the camera perspective via deformable cross-attention. Vision features are provided by a DINOv3 Vision Foundation Model. We present a comprehensive performance evaluation on the K-Radar dataset in all weather conditions and are among the first to report detection performance individually for five object classes. Additionally, we compare our method with existing single-class detection approaches and outperform recent Radar-camera approaches by 12.1%. Code and trained models will be made publicly available.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

17:10 – 17:20 Oral Paper Presentation 5: GaussianDet3D: Bridging Gaussian Splatting and Sparse LiDAR Detection for Multi-View 3D Object Detection

Malaz Tamim

Technical University of Munich (TUM), Germany

Abstract

Abstract will be announced closer to the workshop date.

Speaker Bio

Speaker bio will be announced closer to the workshop date.

17:20 – 17:30 Paper Awards Ceremony – Best Paper, Runner-Up, Best Application Paper, Best Poster & Best Keynote
17:30 – 17:40 Challenge Winner Presentation
17:40 – 17:50 Challenge Awards Ceremony
17:50 – 18:00 Closing Remarks & Group Photo
19:00 – 21:00 Workshop Reception & Networking

Final schedule, room allocation, and speaker order will be announced closer to the workshop date.

Paper Track

DriveX 2026 invites high-quality contributions on foundation models, cooperative perception, large driving models, and related topics outlined above.

We welcome:

Submissions must follow the official CVPR 2026 style: LaTeX or Typst.

📘 Archival Track (Proceedings Track)

📗 Non-Archival Track

Accepted Papers

DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather

Christof Leitgeb, Thomas Puchleitner, Max Peter Ronecker, Daniel Watzenig

V2X-UniPool: Unifying Multimodal Perception and Knowledge Reasoning for Autonomous Driving

Xuewen Luo, Fengze Yang, Ding Fan, Xiangbo Gao, Bo Yu, Zihao Li, Zhengzhong Tu, Yang Zhou, Chenxi Liu

Rethinking Intermediate Module Utilization in V2X End-to-End Autonomous Driving

Yiming Kan, Huilin Yin, Daniel Watzenig

DRIVEXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

Mingzhe Tao, Ruiping Liu, Junwei Zheng, Yufan Chen, Kedi Ying, M. Saquib Sarfraz, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

IAO-SLAM: Real-time Illumination-Aware Object SLAM for Robust Perception in Low-Light Environments

Pengju Zhen, Huilin Yin, Linchuan Zhang, Xin Su

Calibration-Free View-Agnostic Monocular 3D Object Detection for Urban Scenes

Mehmet Kerem Turkcan, Devika Gumaste, Zoran Kostic

Importance-Driven 3D Gaussian SLAM for Efficient Mapping and Communication-Aware Sharing

Zhaolin Yang, Huilin Yin, Linchuan Zhang, Mingyu Liu, Alois Knoll

4-D Radar Meets LiDAR and Camera: Cooperative Perception under Adverse Weather

Melih Yazgan, Iramm Hamdard, Qiyuan Wu, Svetlana Pavlitska, J. Marius Zöllner

CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Cheng, Sikai Chen, Bin Ran

GaussianDet3D: Bridging Gaussian Splatting and Sparse LiDAR Detection for Multi-View 3D Object Detection

Malaz Tamim, Wenzhao Zheng, Johannes Michael Meier, Daniel Cremers, Kurt Keutzer

Paper Awards

Challenge Awards

Challenge winners will receive award certificates and are invited to present their results at the workshop.

DriveX Grand Challenge

🔥 All tracks are now open

Competition Timeline

Top-performing teams will be invited to present at the workshop and will receive cash prizes ($300) and award certificates. Detailed rules, baselines, and submission instructions are available on the official challenge page.

Organizers

Invited Program Committee

Wesley Maia

UC Merced

Bo Yang

UCLA

Dr. Camila Correa-Jullian

UCLA

Prof. Jiachen Li

UC Riverside

Afnan Alofi

Nourah Bint Abdulrahman University

Peizheng Li

Uni Tübingen

Marc Unzueta

Cruise

Kianna Ng

UC Merced

Dr. Xu Han

UCLA

Wei Cao

Uni. of Illinois at Urbana-Champaign

Angel Martinez-Sanchez

UC Merced

Prof. Ziran Wang

Purdue Uni.

Qiyuan Wu

Cornell Uni.

Erika Maquiling

UC Merced

Parthib Roy

UC Merced

Zhenzhen Liu

Cornell Uni.

Kunlin Cai

UCLA

Markus Gross

Fraunhofer IVI & TUM

Prof. Hang Qiu

UC Riverside

Dr. Katie Z Luo

Stanford

Cheng Perng Phoo

Waymo

Zhenghao Peng

UCLA

Shiyu Jin

UC Berkeley

Johnson Liu

UCLA

Haoxuan Ma

UCLA

Yifan Liu

UCLA

Sponsors