Vision-Language Models for Navigation and Manipulation (VLMNM)

Full-day hybrid workshop at ICRA 2024

Friday, May 17, 2024 (Japan Standard Time, JST), Yokohama, Japan


With the rising capabilities of LLMs and VLMs, the past two years have seen a surge of research using VLMs for navigation and manipulation. By fusing visual interpretation with natural language processing, these models are poised to redefine how robotic systems interact with both their environment and their human counterparts. The relevance of this topic cannot be overstated: as the frontier of human-robot interaction expands, so does the need for robots to comprehend and operate within complex environments using naturalistic instructions. Our workshop will reflect the state-of-the-art advancements in this domain by featuring a diverse set of speakers: from senior academics to early-career researchers, from industry researchers to companies producing mobile manipulation platforms, and from researchers who are enthusiastic about using VLMs for robotics to those who have reservations about it. We aim for this event to be a catalyst for originality and diversity at ICRA 2024. We believe that, amid a sea of workshops, ours will provide unique perspectives that push the boundaries of what is achievable in robot navigation and manipulation.

In this workshop, we plan to discuss:

Call for Papers

We invite submissions including but not limited to the following topics.

Submissions should have up to 4 pages of technical content, with no limit on references and appendices. Submissions should follow the ICRA double-column format, with the template available here. We encourage authors to upload videos, code, or data as supplementary material (due on the same day as the paper). In line with the main conference, our workshop uses a single-blind review process. We welcome both unpublished, original contributions and recently published relevant work. Accepted papers will be presented as posters or oral talks and, with the authors' consent, made public via the workshop's OpenReview page. We strongly encourage at least one author to present on-site during the workshop. The workshop will feature a Best Paper Award.

Important Dates: Submissions will be accepted through OpenReview. Submissions will not be public during the review process; only accepted papers will be made public.


Are you going to record the talks and post them later on YouTube?
We’re going to post the talks of speakers who permit us to. But we will NOT post the recordings of the panel discussion, the debate, or the open discussion at the end.
Can I present remotely if my paper is accepted as a poster or a spotlight talk?
We will allow it, but we strongly encourage you to present in person or to find a colleague who can present on your behalf.

Tentative Schedule

Time (JST) | Event | Description | Time (PDT, 1 day earlier)
8:30 - 8:50 | Coffee and Pastries | | 15:30 - 15:50
8:50 - 9:00 | Introduction by the Organizing Committee | | 15:50 - 16:00
9:00 - 9:20 | LLM-State: Adaptive State Representation for Long-Horizon Task Planning in the Open World | Prof. David Hsu, National University of Singapore | 16:00 - 16:20
9:20 - 9:40 | Limitations of Using LLMs for Planning | Prof. Subbarao Kambhampati, Arizona State University | 16:20 - 16:40
9:40 - 10:00 | LLM + Formal Planner | Prof. Chuchu Fan, Massachusetts Institute of Technology | 16:40 - 17:00
10:00 - 10:30 | Panel: Bridging the Gap between Research & Industry | Prof. Charlie Kemp (Hello Robot); Takafumi Watanabe (Preferred Networks Robotics); Dr. Mohit Shridhar (Dyson Robot Learning Lab) | 17:00 - 17:30
10:30 - 11:00 | Coffee Break and Poster Session | 30 minutes | 17:30 - 18:00
11:00 - 11:20 | LLM for Mobile Manipulation | Prof. Yuke Zhu, University of Texas at Austin | 18:00 - 18:20
11:20 - 11:40 | Mobile Manipulation, Multi-Agent Coordination, Long-Horizon Tasks | Prof. Jeannette Bohg, Stanford University | 18:20 - 18:40
11:40 - 12:00 | Mobile Manipulation | Prof. Roberto Martín-Martín, University of Texas at Austin | 18:40 - 19:00
12:00 - 14:00 | Lunch Break | 2 hours | 19:00 - 21:00
14:00 - 14:30 | Spotlight Talks | Six 5-minute talks | 21:00 - 21:30
14:30 - 15:30 | Poster Session and Coffee Break | 1 hour | 21:30 - 22:30
15:30 - 15:50 | LLM and VLM for Manipulation | Dr. Andy Zeng, Google DeepMind | 22:30 - 22:50
15:50 - 16:10 | Foundation Models of and for Navigation | Dhruv Shah, University of California, Berkeley | 22:50 - 23:10
16:10 - 16:30 | VLM for Mobile Manipulation / ProgPrompt 2 (tentative) | Prof. Dieter Fox, University of Washington / NVIDIA | 23:10 - 23:30
16:30 - 16:50 | BEHAVIOR Benchmark | Dr. Ruohan Zhang, Stanford University | 23:30 - 23:50
16:50 - 17:20 | Debate: Are LLMs, VLMs, and Large Foundation Models the Only Things We Need? | | 23:50 - 00:20
17:20 - 17:50 | Moderated Open Discussion: What's Down the Horizon? / The 100 Billion Proposal | All in-person attendees | 00:20 - 00:50
17:50 - 18:00 | Awards Ceremony and Closing Remarks | | 00:50 - 01:00


Organizers

Chris Paxton
FAIR, Meta

Fei Xia
Google DeepMind

Karmesh Yadav
Georgia Tech

Nur Muhammad Mahi Shafiullah
New York University

Naoki Wake
Microsoft Research

Weiyu Liu
Stanford University

Yujin Tang
Sakana AI

Zhutian Yang
MIT, NVIDIA Research


For further information or questions, please contact us at vlm-navigation-manipulation-workshop [AT] gmail [DOT] com
