A premier forum uniting academic, industry, and standards communities to explore advances in Foundation Models and 3D Perception in Cooperative Autonomous Driving (CAD).
The 7th edition of the full-day DriveX workshop brings together leading researchers and practitioners to discuss cutting-edge developments in large language models (LLMs), vision-language models (VLMs), and vision-language-action models (VLAs), and their applications to autonomous driving systems. Topics include 3D object detection, semantic segmentation, sensor fusion, V2X communication, and cooperative perception.
We explore methods to enhance scene understanding, perception accuracy, dataset curation, and novelty detection. By uniting experts across the perception, V2X, and foundation model domains, the workshop aims to foster innovation in cooperative autonomous driving and intelligent transportation systems, addressing critical challenges in multi-modal sensor fusion and vehicle-infrastructure coordination that leverage both onboard and roadside sensing capabilities.
This year, we expand our focus with the addition of V2X applications, exploring real-world vehicle-to-infrastructure connectivity that extends beyond collaborative perception. The workshop provides a platform for discussing V2X for localization, tolling, road safety, monitoring, and data analytics, bridging the gap between theoretical advances and practical deployment in intelligent transportation systems. Through keynote presentations, panel discussions, paper presentations, and challenge tracks, DriveX 2026 creates a comprehensive forum for advancing the state of the art in foundation model-driven cooperative autonomous driving.
| Start | End | Program | Speaker | Affiliation |
|---|---|---|---|---|
| 09:00 | 09:10 | Introduction | | |
| 09:10 | 09:30 | Keynote Presentation 1 | | |
| 09:30 | 09:50 | Keynote Presentation 2 | | |
| 09:50 | 10:10 | Keynote Presentation 3 | | |
| 10:10 | 10:25 | Coffee Break | | |
| 10:25 | 10:45 | Keynote Presentation 4 | | |
| 10:45 | 11:05 | Keynote Presentation 5 | | |
| 11:05 | 11:25 | Keynote Presentation 6 | | |
| 11:25 | 12:00 | Academic Panel Discussion | | |
| 12:00 | 13:00 | Lunch | | |
| 13:00 | 13:20 | Keynote Presentation 7 | | |
| 13:20 | 13:40 | Keynote Presentation 8 | | |
| 13:40 | 14:00 | Keynote Presentation 9 | | |
| 14:00 | 14:20 | Keynote Presentation 10 | | |
| 14:20 | 15:00 | Industry Panel Discussion | | |
| 15:00 | 15:15 | Coffee Break & Poster Session | | |
| 15:15 | 17:15 | Paper Presentations (Oral) | | |
| 17:15 | 17:30 | Competition Winner Presentation & Awards Ceremony | | |
| 17:35 | 17:45 | Best Paper Presentation & Awards Ceremony | | |
| 17:45 | 17:55 | Final Remarks & Summary | | |
| 17:55 | 18:00 | Group Photo | | |
| 19:00 | 21:00 | Social Mixer (Networking & Dinner) | | |
Final schedule, room allocation, and speaker order will be announced closer to the workshop date.
V2I-Based Cooperative Perception
Infrastructure–vehicle fusion using TUMTraf-V2X. This track focuses on cooperative 3D detection and tracking with infrastructure-mounted LiDAR, radar, and cameras, emphasizing occlusion handling, long-range awareness, and reliability under real-world conditions.
Natural Language Instruction for Human Interaction and Vision-Language Navigation
Built upon doScenes, participants design models for natural language instruction following and vision-language navigation, facilitating research on human-vehicle instruction interactions.
Multi-Agent Reasoning
Using MDrive, teams explore a cooperative driving benchmark for end-to-end, closed-loop multi-agent systems.
Competition Timeline
Top-performing teams will be invited to present at the workshop and will receive cash prizes ($100) and award certificates. Detailed rules, baselines, and submission instructions are available on the official challenge page.
University of California, Riverside
Technical University of Munich
University of California, Los Angeles
The University of Hong Kong
The University of Sydney
The University of Sydney
The University of Sydney
Waymo
University of California, Los Angeles
DriveX 2026 welcomes sponsorship from industry, startups, and institutions interested in foundation models, cooperative perception, simulation, and large-scale autonomous driving systems.
For sponsorship opportunities, please contact: wz@ucla.edu.