Katie Luo

Contacts:

Email: katieluo at stanford dot edu

Katie Luo is a Postdoctoral Researcher in the ASL group at Stanford University, working with Prof. Marco Pavone. She obtained her Ph.D. at Cornell University, advised by Prof. Kilian Q. Weinberger and Prof. Bharath Hariharan. Her research interests lie in visual understanding of the world, including 3D perception and multi-modal learning that combines visual data with other sensory inputs to improve environmental understanding.

Awards

Nvidia Graduate Student Fellowship, 2023
American Association of University Women (AAUW) Dissertation Fellowship, 2024

ASL Publications

L. Wild, K. Z. Luo, and M. Pavone, “Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding,” arXiv preprint arXiv:2605.20942, 2026.

Abstract: Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs, including large, closed-source models, struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.
```
@article{WildLuoEtAl2026,
  author = {Wild, L. and Luo, K. Z. and Pavone, M.},
  title = {Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding},
  journal = {arXiv preprint arXiv:2605.20942},
  year = {2026},
  url = {https://arxiv.org/abs/2605.20942},
  owner = {katieluo},
  timestamp = {2026-06-09}
}
```
J. Dao, M. Ganai, Y. Abukhadra, A. Sridhar, M. N. Azadani, K. Luo, C. Barrett, J. Wu, C. Finn, and M. Pavone, “DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?,” 2026. (Submitted)

Abstract: Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success–cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model’s success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at jadee-dao.github.io/direct/.
```
@article{DaoGanaiEtAl2026,
  author = {Dao, J. and Ganai, M. and Abukhadra, Y. and Sridhar, A. and Azadani, M. N. and Luo, K. and Barrett, C. and Wu, J. and Finn, C. and Pavone, M.},
  title = {DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?},
  year = {2026},
  keywords = {sub},
  note = {Submitted},
  owner = {mganai},
  timestamp = {2026-06-16},
  url = {https://arxiv.org/abs/2606.12402}
}
```
M. Ganai, K. Luo, J. Frey, C. Barrett, and M. Pavone, “Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning,” in Robotics: Science and Systems, Sydney, Australia, 2026.

Abstract: Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.
```
@inproceedings{GanaiLuoEtAl2026,
  author = {Ganai, M. and Luo, K. and Frey, J. and Barrett, C. and Pavone, M.},
  title = {Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning},
  booktitle = {{Robotics: Science and Systems}},
  year = {2026},
  address = {Sydney, Australia},
  owner = {mganai},
  url = {https://arxiv.org/abs/2602.08167},
  timestamp = {2026-04-26}
}
```