LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

1University of California Berkeley, 2NEC Labs America, 3University of California San Diego
ICCV 2025

Abstract

Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that depend on domain-specific guidance functions, LangTraj incorporates language conditioning during training, facilitating more intuitive traffic simulation control. We propose a novel closed-loop training strategy for diffusion models, explicitly tailored to enhance stability and realism during closed-loop simulation. To support language-conditioned simulation, we develop Inter-Drive, a large-scale dataset with diverse and interactive labels for training language-conditioned diffusion models. Our dataset is built upon a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, ensuring rich and varied supervision. Validated on the Waymo Motion Dataset, LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing.

Figure

INTERDRIVE Dataset

We introduce a large-scale dataset of human annotations focused on interactive behavior to support behavior analysis and language-to-simulation research. The dataset includes 125,000 Waymo scenes annotated with 209 distinct agent-agent interaction subtypes, capturing fine-grained multi-agent dynamics across diverse driving scenarios. We also develop an automatic annotation pipeline for labeling single-agent behaviors.
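An automatic single-agent labeling pipeline can be sketched roughly as follows: derive coarse behavior tags from the trajectory's heading and speed profile. This is an illustrative sketch only; the function name and thresholds are hypothetical and not taken from the paper.

```python
import numpy as np

def label_single_agent_behavior(xy, headings, dt=0.1,
                                turn_thresh=np.pi / 6, speed_thresh=1.0):
    """Assign coarse behavior tags to one agent's trajectory.

    xy:       (T, 2) array of positions in meters.
    headings: (T,) array of yaw angles in radians.
    Returns a list of tags, e.g. ["turn-left", "decelerate"].
    Thresholds are illustrative, not the paper's.
    """
    tags = []
    # Net heading change over the clip decides turning behavior.
    dtheta = np.arctan2(np.sin(headings[-1] - headings[0]),
                        np.cos(headings[-1] - headings[0]))
    if dtheta > turn_thresh:
        tags.append("turn-left")
    elif dtheta < -turn_thresh:
        tags.append("turn-right")

    # Average speed in the first vs. last third decides longitudinal behavior.
    speeds = np.linalg.norm(np.diff(xy, axis=0), axis=1) / dt
    n = max(1, len(speeds) // 3)
    dv = speeds[-n:].mean() - speeds[:n].mean()
    if dv > speed_thresh:
        tags.append("accelerate")
    elif dv < -speed_thresh:
        tags.append("decelerate")
    return tags
```

In a real pipeline these rule-based tags would be mapped to natural-language templates before being paired with scenes for training.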

Figure
Figure

Language-Conditioned Scene Diffusion Model

Figure

  • LangTraj is a scene-level diffusion backbone that conditions directly on natural language via a BERT-based text encoder. We apply LoRA for efficient end-to-end training and summarize each sentence with the <cls> token to reduce computational cost.
  • To mitigate compounding errors during autoregressive rollout, we develop a closed-loop training strategy for diffusion models that improves realism. As shown below, the model first generates multiple denoised trajectory candidates via forward diffusion from the ground truth; the candidate closest to the ground truth is then selected and executed, so the model experiences its own distribution during training.
  • Illustration of Closed-loop Training of diffusion models
    Figure
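The candidate-selection step of closed-loop training can be sketched as below. The `denoise` callable stands in for the learned diffusion model, and the noise scale and candidate count are illustrative choices, not the paper's hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def closed_loop_step(denoise, gt_traj, k=8, noise_scale=0.5):
    """One closed-loop training rollout step (sketch).

    1. Forward-diffuse the ground-truth trajectory and denoise it K times,
       producing K candidate futures.
    2. Pick the candidate closest to the ground truth (minimum average
       displacement) and return it as the trajectory actually executed,
       so later steps are conditioned on the model's own samples.
    """
    candidates = []
    for _ in range(k):
        noised = gt_traj + noise_scale * rng.standard_normal(gt_traj.shape)
        candidates.append(denoise(noised))
    # Average displacement error of each candidate w.r.t. ground truth.
    ade = [np.linalg.norm(c - gt_traj, axis=-1).mean() for c in candidates]
    best = int(np.argmin(ade))
    return candidates[best], ade[best]
```

Executing the selected sample, rather than the ground truth, is what exposes the model to its own rollout distribution during training.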

    Comparison to ProSim

    We compare our language-conditioned simulation results with the ProSim baseline across different text prompts. For fairness, we use ProSim’s text format for the ProSim model and LangTraj’s text format for our model to describe the same behaviors, ensuring each model receives in-distribution prompts.

    ProSim


    LangTraj (Our Method)


    Qualitatively, LangTraj produces more realistic and better instruction-following behaviors for language inputs, especially in interactive scenarios. Note that ProSim supports other conditioning modalities (goal points, sketches); here we focus specifically on language input.


    Classifier-Free Guidance (CFG) Comparison

    Text Input:
    A3: passes through an intersection without stopping, turning left.
    A12: slows down and lets A3 pass at the intersection, turning right.

    CFG Weight: 0.0
    CFG Weight: 0.5
    CFG Weight: 1.0

    We observe that increasing the classifier-free guidance weight more reliably enforces the text condition in certain scenarios, improving instruction-following even in challenging cases at the cost of a minor realism tradeoff.
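The standard classifier-free guidance combination underlying this comparison can be written as below; we assume the common parameterization in which a weight of 0 recovers plain conditional sampling, matching the sliders above (the paper's exact formulation may differ).

```python
import numpy as np

def cfg_denoise(eps_cond, eps_uncond, w):
    """Classifier-free guidance combination of two model outputs.

    eps_cond:   model prediction with the text prompt.
    eps_uncond: model prediction with the prompt dropped (null condition).
    w:          guidance weight; w = 0 recovers conditional sampling,
                larger w pushes samples further toward the text condition.
    """
    return eps_cond + w * (eps_cond - eps_uncond)
```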

    Text-Conditioned Safety-Critical Simulation

    We demonstrate that direct language conditioning can be extended beyond the training distribution: by adding guidance functions at sampling time, LangTraj generates safety-critical scenarios from text inputs.
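One common form of such a guidance function is a differentiable cost whose gradient is applied to the denoised trajectory at each diffusion step; for example, a proximity cost that steers one agent toward another to induce near-collisions. The sketch below is illustrative and its cost and step size are hypothetical, not the paper's.

```python
import numpy as np

def distance_cost_grad(traj_a, traj_b):
    """Gradient of a pairwise-proximity cost (mean squared distance
    between two agents over time); descending it pulls the agents
    together, encouraging safety-critical behavior."""
    return 2.0 * (traj_a - traj_b) / len(traj_a)

def guided_update(x_a, x_b, step=0.1):
    """One guidance step on agent A's denoised trajectory: descend the
    distance cost so A is steered toward agent B. In practice this is
    applied at each denoising step, on top of the language-conditioned
    model output."""
    return x_a - step * distance_cost_grad(x_a, x_b)
```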


    More Text-Conditioned Simulation Qualitative Results

    Merge/Yield Behavior

    Intersection negotiation

    Yield/Pass Behavior

    Pedestrian Crossing

    Discussion and Future Directions

    1. Flexible controllability: LangTraj provides a flexible diffusion model that can combine guidance and direct conditioning for versatile controllability.
    2. Realism: Sampling multiple times from the diffusion model can occasionally produce off-road behaviors even with closed-loop training, showing a gap compared to tokenized models.
    3. Evaluation: Designing metrics for instruction-following beyond minADE remains an open challenge.

    BibTeX

    @InProceedings{Chang_2025_ICCV,
      author    = {Chang, Wei-Jer and Zhan, Wei and Tomizuka, Masayoshi and Chandraker, Manmohan and Pittaluga, Francesco},
      title     = {LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation},
      booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      month     = {October},
      year      = {2025},
      pages     = {26622-26631}
    }