Milan Ganai

Contacts:

Email: mganai at stanford dot edu

Milan Ganai is a PhD student in the Department of Computer Science advised by Professors Marco Pavone and Clark Barrett. His research interests lie at the intersection of safe AI and robotics, concentrating on developing generalizable physical reasoning capabilities for autonomous systems to reliably adapt to novel environments. Prior to Stanford, he received his BS in Computer Science, summa cum laude with highest distinction, and MS in Computer Science at UC San Diego, where he was a Jacobs School Scholar and Regents Scholar. He performed research at the intersection of control and reinforcement learning under Professors Sicun Gao and Sylvia Herbert and has interned at Amazon Web Services.

ASL Publications

M. N. Azadani, Y. Wang, Y. Zhu, L. Chen, M. Ganai, S. Sedwards, M. Pavone, and K. Czarnecki, “VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence,” 2026. (Submitted)

Abstract: Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their answers. It also includes hallucination-aware examples where no valid visual evidence exists. To support this enhanced evaluation, we introduce GROVE, a unified evaluation metric that enforces joint correctness by combining textual accuracy and grounding quality via a per-sample geometric mean, ensuring neither dimension can compensate for deficiencies in the other. Comprehensive experiments across grounding-aware models and hybrid pipelines with general-purpose MLLMs reveal that even the strongest systems achieve limited performance under GROVE, highlighting a substantial gap between answer accuracy and visual evidence alignment.
```
@article{AzadaniWangEtAl2026,
  author = {Azadani, M. N. and Wang, Y. and Zhu, Y. and Chen, L. and Ganai, M. and Sedwards, S. and Pavone, M. and Czarnecki, K.},
  title = {VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence},
  year = {2026},
  keywords = {sub},
  note = {Submitted},
  owner = {mganai},
  timestamp = {2026-06-15},
  url = {https://arxiv.org/abs/2605.20676}
}
```
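The VISTAQA abstract describes GROVE as combining textual accuracy and grounding quality through a per-sample geometric mean, so that neither dimension can compensate for the other. The following is a minimal Python sketch of that idea only, not the authors' implementation: the function names, the assumption that both scores lie in [0, 1], and the averaging of per-sample scores into one benchmark number are all illustrative assumptions that the abstract does not specify.

```python
import math
from typing import Iterable, Tuple


def joint_score(answer_score: float, grounding_score: float) -> float:
    """Geometric mean of two per-sample scores.

    Assumption: both scores are already normalized to [0, 1]
    (for example, an answer-correctness score and a mask-overlap score).
    """
    for value in (answer_score, grounding_score):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    # A zero in either dimension drives the joint score to zero,
    # so a correct answer cannot make up for missing evidence.
    return math.sqrt(answer_score * grounding_score)


def benchmark_score(samples: Iterable[Tuple[float, float]]) -> float:
    """Average the per-sample joint scores.

    Assumption: a plain mean over samples; the abstract does not
    state how per-sample values are aggregated.
    """
    scores = [joint_score(a, g) for a, g in samples]
    if not scores:
        raise ValueError("at least one sample is required")
    return sum(scores) / len(scores)
```

Under these assumptions, a sample with a fully correct answer and no grounding overlap scores 0, whereas an arithmetic mean of the same two values would give it 0.5.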
J. Dao, M. Ganai, Y. Abukhadra, A. Sridhar, M. N. Azadani, K. Luo, C. Barrett, J. Wu, C. Finn, and M. Pavone, “DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?,” 2026. (Submitted)

Abstract: Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success–cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model’s success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at jadee-dao.github.io/direct/.
```
@article{DaoGanaiEtAl2026,
  author = {Dao, J. and Ganai, M. and Abukhadra, Y. and Sridhar, A. and Azadani, M. N. and Luo, K. and Barrett, C. and Wu, J. and Finn, C. and Pavone, M.},
  title = {DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?},
  year = {2026},
  keywords = {sub},
  note = {Submitted},
  owner = {mganai},
  timestamp = {2026-06-16},
  url = {https://arxiv.org/abs/2606.12402}
}
```
M. Ganai, K. Luo, J. Frey, C. Barrett, and M. Pavone, “Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning,” in Robotics: Science and Systems, Sydney, Australia, 2026.

Abstract: Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.
```
@inproceedings{GanaiLuoEtAl2026,
  author = {Ganai, M. and Luo, K. and Frey, J. and Barrett, C. and Pavone, M.},
  title = {Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning},
  booktitle = {{Robotics: Science and Systems}},
  year = {2026},
  address = {Sydney, Australia},
  owner = {mganai},
  url = {https://arxiv.org/abs/2602.08167},
  timestamp = {2026-04-26}
}
```
K. A. Christensen, A. G. Tufte, A. Gusev, R. Sinha, M. Ganai, O. A. Alsos, M. Pavone, and M. Steinert, “Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models,” Ocean Engineering, 2026.

Abstract: The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable safe-keeping policy. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., a diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision–language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast–slow anomaly pipeline with a short-horizon, human-overridable fallback makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained vision–language model bridge that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert→bridge→operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The bridge policy outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as a semantic fallback “bridge policy” compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird’s-eye-view perception and short-horizon replanning.
```
@article{ChristensenTufteEtAl2026,
  author = {Christensen, K. A. and Tufte, A. G. and Gusev, A. and Sinha, R. and Ganai, M. and Alsos, O. A. and Pavone, M. and Steinert, M.},
  title = {Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models},
  journal = {Ocean Engineering},
  year = {2026},
  timestamp = {2026-02-12},
  url = {https://arxiv.org/abs/2512.24470},
  owner = {mganai}
}
```
M. Ganai, R. Sinha, C. Agia, D. Morton, L. Di Lillo, and M. Pavone, “Real-Time Out-of-Distribution Failure Prevention via Multi-Modal Reasoning,” in Conf. on Robot Learning, Seoul, Korea, 2025.

Abstract: While foundation models offer promise toward improving robot safety in out-of-distribution (OOD) scenarios, how to effectively harness their generalist knowledge for real-time, dynamically feasible response remains a crucial problem. We present FORTRESS, a joint reasoning and planning framework that generates semantically safe fallback strategies to prevent safety-critical, OOD failures. At a low frequency under nominal operation, FORTRESS uses multi-modal foundation models to anticipate possible failure modes and identify safe fallback sets. When a runtime monitor triggers a fallback response, FORTRESS rapidly synthesizes plans to fallback goals while inferring and avoiding semantically unsafe regions in real time. By bridging open-world, multi-modal reasoning with dynamics-aware planning, we eliminate the need for hard-coded fallbacks and human safety interventions. FORTRESS outperforms on-the-fly prompting of slow reasoning models in safety classification accuracy on synthetic benchmarks and real-world ANYmal robot data, and further improves system safety and planning success in simulation and on quadrotor hardware for urban navigation.
```
@inproceedings{GanaiSinhaEtAl2025,
  author = {Ganai, M. and Sinha, R. and Agia, C. and Morton, D. and Di Lillo, L. and Pavone, M.},
  title = {Real-Time Out-of-Distribution Failure Prevention via Multi-Modal Reasoning},
  booktitle = {{Conf. on Robot Learning}},
  year = {2025},
  month = sep,
  address = {Seoul, Korea},
  owner = {mganai},
  url = {https://arxiv.org/abs/2505.10547},
  timestamp = {2025-06-08},
  note = {oral}
}
```