LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

1University of California Berkeley, 2NEC Labs America, 3University of California San Diego
ICCV 2025

Abstract

Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that depend on domain-specific guidance functions, LangTraj incorporates language conditioning during training, facilitating more intuitive traffic simulation control. We propose a novel closed-loop training strategy for diffusion models, explicitly tailored to enhance stability and realism during closed-loop simulation. To support language-conditioned simulation, we develop Inter-Drive, a large-scale dataset with diverse and interactive labels for training language-conditioned diffusion models. Our dataset is built upon a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, ensuring rich and varied supervision. Validated on the Waymo Motion Dataset, LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing.

Figure

INTERDRIVE Dataset

We introduce a large-scale dataset of human annotations focused on interactive behavior to support behavior analysis and language-to-simulation research. The dataset includes 125,000 Waymo scenes annotated with 209 distinct agent-agent interaction subtypes, capturing fine-grained multi-agent dynamics across diverse driving scenarios. We also develop an automatic annotation pipeline for labeling single-agent behaviors.
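An automatic single-agent labeling pipeline can be sketched roughly as follows: derive coarse behavior tags from the trajectory's heading and speed profile. This is an illustrative sketch only; the function name and thresholds are hypothetical and not taken from the paper.

```python
import numpy as np

def label_single_agent_behavior(xy, headings, dt=0.1,
                                turn_thresh=np.pi / 6, speed_thresh=1.0):
    """Assign coarse behavior tags to one agent's trajectory.

    xy:       (T, 2) array of positions in meters.
    headings: (T,) array of yaw angles in radians.
    Returns a list of tags, e.g. ["turn-left", "decelerate"].
    Thresholds are illustrative, not the paper's.
    """
    tags = []
    # Net heading change over the clip decides turning behavior.
    dtheta = np.arctan2(np.sin(headings[-1] - headings[0]),
                        np.cos(headings[-1] - headings[0]))
    if dtheta > turn_thresh:
        tags.append("turn-left")
    elif dtheta < -turn_thresh:
        tags.append("turn-right")

    # Average speed in the first vs. last third decides longitudinal behavior.
    speeds = np.linalg.norm(np.diff(xy, axis=0), axis=1) / dt
    n = max(1, len(speeds) // 3)
    dv = speeds[-n:].mean() - speeds[:n].mean()
    if dv > speed_thresh:
        tags.append("accelerate")
    elif dv < -speed_thresh:
        tags.append("decelerate")
    return tags
```

In a real pipeline these rule-based tags would be mapped to natural-language templates before being paired with scenes for training.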

Figure
Figure

Language-Conditioned Scene Diffusion Model

Figure

  • LangTraj is a scene-level diffusion backbone that conditions directly on natural language via a BERT-based text encoder. We apply LoRA for efficient end-to-end training and summarize each sentence with the <cls> token to reduce computational cost.
  • To mitigate compounding errors during autoregressive rollout, we develop a closed-loop training strategy for diffusion models that improves realism. As shown below, the model first generates multiple denoised trajectory candidates via forward diffusion from the ground truth; the candidate closest to the ground truth is then selected and executed, so the model experiences its own distribution during training.
  • Illustration of Closed-loop Training of diffusion models
    Figure
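The candidate-selection step of closed-loop training can be sketched as below. The `denoise` callable stands in for the learned diffusion model, and the noise scale and candidate count are illustrative choices, not the paper's hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def closed_loop_step(denoise, gt_traj, k=8, noise_scale=0.5):
    """One closed-loop training rollout step (sketch).

    1. Forward-diffuse the ground-truth trajectory and denoise it K times,
       producing K candidate futures.
    2. Pick the candidate closest to the ground truth (minimum average
       displacement) and return it as the trajectory actually executed,
       so later steps are conditioned on the model's own samples.
    """
    candidates = []
    for _ in range(k):
        noised = gt_traj + noise_scale * rng.standard_normal(gt_traj.shape)
        candidates.append(denoise(noised))
    # Average displacement error of each candidate w.r.t. ground truth.
    ade = [np.linalg.norm(c - gt_traj, axis=-1).mean() for c in candidates]
    best = int(np.argmin(ade))
    return candidates[best], ade[best]
```

Executing the selected sample, rather than the ground truth, is what exposes the model to its own rollout distribution during training.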

    Comparison to ProSim

    We compare our language-conditioned simulation results with the ProSim baseline across different text prompts. For fairness, we use ProSim’s text format for the ProSim model and LangTraj’s text format for our model to describe the same behaviors, ensuring each model receives in-distribution prompts.

    ProSim


    LangTraj (Our Method)


    Qualitatively, LangTraj produces more realistic and better instruction-following behaviors for language inputs, especially in interactive scenarios. Note that ProSim supports other conditioning modalities (goal points, sketches); here we focus specifically on language input.


    Classifier-Free Guidance (CFG) Comparison

    Text Input:
    A3: passes through an intersection without stopping, turning left.
    A12: slows down and lets A3 pass at the intersection, turning right.

    CFG Weight: 0.0
    CFG Weight: 0.5
    CFG Weight: 1.0

    We observe that increasing the classifier-free guidance weight more reliably enforces the text condition in certain scenarios, improving instruction-following even in challenging cases at the cost of a minor realism tradeoff.
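The standard classifier-free guidance combination underlying this comparison can be written as below; we assume the common parameterization in which a weight of 0 recovers plain conditional sampling, matching the sliders above (the paper's exact formulation may differ).

```python
import numpy as np

def cfg_denoise(eps_cond, eps_uncond, w):
    """Classifier-free guidance combination of two model outputs.

    eps_cond:   model prediction with the text prompt.
    eps_uncond: model prediction with the prompt dropped (null condition).
    w:          guidance weight; w = 0 recovers conditional sampling,
                larger w pushes samples further toward the text condition.
    """
    return eps_cond + w * (eps_cond - eps_uncond)
```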

    Text-Conditioned Safety-Critical Simulation

    We demonstrate that direct language conditioning can be extended beyond the training distribution: by adding guidance functions at sampling time, LangTraj generates safety-critical scenarios from text inputs.
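One common form of such a guidance function is a differentiable cost whose gradient is applied to the denoised trajectory at each diffusion step; for example, a proximity cost that steers one agent toward another to induce near-collisions. The sketch below is illustrative and its cost and step size are hypothetical, not the paper's.

```python
import numpy as np

def distance_cost_grad(traj_a, traj_b):
    """Gradient of a pairwise-proximity cost (mean squared distance
    between two agents over time); descending it pulls the agents
    together, encouraging safety-critical behavior."""
    return 2.0 * (traj_a - traj_b) / len(traj_a)

def guided_update(x_a, x_b, step=0.1):
    """One guidance step on agent A's denoised trajectory: descend the
    distance cost so A is steered toward agent B. In practice this is
    applied at each denoising step, on top of the language-conditioned
    model output."""
    return x_a - step * distance_cost_grad(x_a, x_b)
```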


    More Text-Conditioned Simulation Qualitative Results

    Merge/Yield Behavior

    Intersection negotiation

    Yield/Pass Behavior

    Pedestrian Crossing

    Discussion and Future Directions

    1. Flexible controllability: LangTraj provides a flexible diffusion model that can combine guidance and direct conditioning for versatile controllability.
    2. Realism: Sampling multiple times from the diffusion model can occasionally produce off-road behaviors even with closed-loop training, showing a gap compared to tokenized models.
    3. Evaluation: Designing metrics for instruction-following beyond minADE remains an open challenge.

    BibTeX

    @InProceedings{Chang_2025_ICCV,
      author    = {Chang, Wei-Jer and Zhan, Wei and Tomizuka, Masayoshi and Chandraker, Manmohan and Pittaluga, Francesco},
      title     = {LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation},
      booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      month     = {October},
      year      = {2025},
      pages     = {26622-26631}
    }