LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

1University of California Berkeley, 2NEC Labs America, 3University of California San Diego
ICCV 2025

Abstract

Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that depend on domain-specific guidance functions, LangTraj incorporates language conditioning during training, facilitating more intuitive traffic simulation control. We propose a novel closed-loop training strategy for diffusion models, explicitly tailored to enhance stability and realism during closed-loop simulation. To support language-conditioned simulation, we develop InterDrive, a large-scale dataset with diverse and interactive labels for training language-conditioned diffusion models. Our dataset is built upon a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, ensuring rich and varied supervision. Validated on the Waymo Motion Dataset, LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing.

Figure

INTERDRIVE Dataset

We introduce a large-scale dataset of human-annotated interaction labels focusing on interactive behavior to support behavior analysis and language-to-simulation research. The dataset includes 125,000 Waymo scenes annotated with 209 distinct agent-agent interaction subtypes, capturing fine-grained multi-agent dynamics in diverse driving scenarios. We also develop an automatic annotation pipeline for labeling single-agent behaviors.
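As an illustration of how such an automatic pipeline might work, here is a minimal rule-based labeler for single-agent behaviors; the thresholds, label names, and input format are assumptions for exposition, not the actual InterDrive heuristics:

```python
# Hypothetical sketch of a rule-based single-agent behavior labeler.
# Thresholds and label names are illustrative assumptions.

def label_single_agent(speeds, headings, dt=0.1,
                       stop_speed=0.5, accel_thresh=1.5, turn_thresh=0.3):
    """Assign a coarse behavior label from a speed/heading profile."""
    if max(speeds) < stop_speed:
        return "stationary"
    # Mean acceleration over the window (m/s^2)
    accel = (speeds[-1] - speeds[0]) / (dt * (len(speeds) - 1))
    # Net heading change over the window (rad)
    yaw_change = headings[-1] - headings[0]
    if abs(yaw_change) > turn_thresh:
        return "turn-left" if yaw_change > 0 else "turn-right"
    if accel < -accel_thresh:
        return "decelerate"
    if accel > accel_thresh:
        return "accelerate"
    return "keep-speed"

# Example: a smoothly decelerating profile over 2 seconds
speeds = [10.0 - 0.4 * i for i in range(20)]
print(label_single_agent(speeds, [0.0] * 20))  # "decelerate"
```

Labels produced this way can then be mapped to natural-language templates for training the language-conditioned model.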

Figure
Figure

Language-Conditioned Scene Diffusion Model

Figure

  • LangTraj is a scene-level diffusion backbone that conditions directly on natural language via a BERT-based encoder. We apply LoRA for efficient end-to-end training and summarize each sentence via its <cls> token to reduce computational cost.
  • To mitigate compounding errors during autoregressive rollout, we develop a closed-loop training strategy for diffusion models that improves realism. As shown below, the model first generates multiple denoised trajectory candidates using forward diffusion from the ground truth. The candidate closest to the ground truth is then selected and executed, so the model experiences its own output distribution during training.
  • Illustration of Closed-loop Training of diffusion models
    Figure
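The select-and-execute loop above can be sketched as follows. Here `denoise` is only a stand-in for the diffusion sampler (a dummy that perturbs the ground truth), and the segment shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(history, gt_segment):
    """Placeholder sampler (assumption): the real model would condition
    on `history` and denoise from a noised copy of `gt_segment`."""
    return gt_segment + 0.1 * rng.standard_normal(gt_segment.shape)

def closed_loop_rollout(gt_segments, num_candidates=8):
    """Per segment, sample K candidates and execute the one closest to
    ground truth, feeding executed states back as the next history."""
    history = []
    for gt in gt_segments:
        candidates = [denoise(history, gt) for _ in range(num_candidates)]
        # Average displacement error of each candidate vs. ground truth
        ades = [np.linalg.norm(c - gt, axis=-1).mean() for c in candidates]
        best = candidates[int(np.argmin(ades))]
        history.append(best)  # the model sees its own outputs, not GT
    return history

gt_segments = rng.standard_normal((4, 10, 2))  # 4 segments x 10 steps x (x, y)
rollout = closed_loop_rollout(gt_segments)
```

Selecting the candidate nearest the ground truth keeps the rollout anchored to realistic behavior while still exposing the model to its own sampling errors at each step.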

Comparison to ProSim

We compare our language-conditioned simulation results with the ProSim baseline across different text prompts. For fairness, we use ProSim’s text format for the ProSim model and LangTraj’s text format for our model to describe the same behaviors, ensuring each model receives in-distribution prompts.

ProSim

LangTraj (Our Method)

Qualitatively, LangTraj produces more realistic and better instruction-following behaviors, especially in interactive scenarios.

Text-Conditioned Safety-Critical Simulation

We show that direct language conditioning composes with test-time guidance functions, extending generation beyond the training distribution to produce safety-critical scenarios from text inputs.
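One common way to realize such guidance is to nudge each denoising step down the gradient of a differentiable cost. The sketch below is a hedged illustration with an analytic proximity cost that pulls one agent toward another; the linear update and cost are assumptions, not LangTraj's actual guidance functions:

```python
import numpy as np

def proximity_cost_grad(traj_a, traj_b):
    """Gradient w.r.t. traj_a of the cost 0.5 * sum_t ||a_t - b_t||^2."""
    return traj_a - traj_b

def guided_sample(init_a, traj_b, steps=50, guidance_weight=0.05):
    """Illustrative guided sampling loop: the diffusion denoising update
    is omitted; only the cost-gradient guidance term is applied."""
    x = init_a.copy()
    for _ in range(steps):
        x = x - guidance_weight * proximity_cost_grad(x, traj_b)
    return x

rng = np.random.default_rng(0)
ego = rng.standard_normal((10, 2)) * 5.0   # 10 timesteps, (x, y)
other = np.zeros((10, 2))                  # target agent held at origin
out = guided_sample(ego, other)
# After guidance, the ego trajectory is much closer to the other agent,
# steering the sample toward a near-collision configuration.
```

In the full system, this guidance term would be added inside each denoising step, so language conditioning shapes the overall behavior while the cost gradient pushes the scene toward the safety-critical region.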



More Text-Conditioned Simulation Qualitative Results

Merge/Yield Behavior

Intersection Negotiation

Yield/Pass Behavior

Pedestrian Crossing

Discussion and Future Directions

1. Flexible controllability: LangTraj is a flexible diffusion model that can combine guidance and direct conditioning for versatile control.
2. Realism: Sampling multiple times from the diffusion model can occasionally produce off-road behaviors even with closed-loop training, showing a gap relative to tokenized models.
3. Evaluation: Designing metrics for instruction following beyond minADE remains an open challenge.

BibTeX

@misc{chang2025langtrajdiffusionmodeldataset,
      title = {LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation},
      author = {Wei-Jer Chang and Wei Zhan and Masayoshi Tomizuka and Manmohan Chandraker and Francesco Pittaluga},
      year = {2025},
      eprint = {2504.11521},
      archivePrefix = {arXiv},
      primaryClass = {cs.LG},
      url = {https://arxiv.org/abs/2504.11521}
}