Vision-Language Models for Navigation and Manipulation (VLMNM)

Full-day hybrid workshop at ICRA 2024, Room 315, Yokohama (Japan)

Friday, May 17, 9 am - 5 pm (JST)

Recordings of invited talks can be found on our YouTube channel.

With the rising capabilities of LLMs and VLMs, the past two years have seen a surge in research using VLMs for navigation and manipulation. By fusing visual interpretation with natural language processing, these models are poised to redefine how robotic systems interact with both their environment and their human counterparts. As the frontier of human-robot interaction expands, so does the need for robots to comprehend and operate within complex environments using naturalistic instructions. Our workshop will reflect the state of the art in this domain by featuring a diverse set of speakers: senior academics and early-career researchers, industry researchers and companies producing mobile manipulation platforms, and researchers who are enthusiastic about using VLMs for robotics as well as those who have reservations about it. We aim for this event to be a catalyst for originality and diversity at ICRA 2024, and we believe it will offer unique perspectives that push the boundaries of what is achievable in robot navigation and manipulation.

In this workshop, we plan to discuss:

All accepted workshop papers are available on OpenReview.

Poster presenters: please bring a poster. We will provide double-sided A0 portrait poster boards.

Final Schedule

Time (JST) Event Description Time (PDT)
(May 16)
8:30 - 8:50 Coffee and Pastries
Poster presenters set up posters
16:30 - 16:50
8:50 - 9:00 Introduction 16:50 - 17:00
9:10 - 9:35 LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks [Recording]

Prof. Subbarao Kambhampati | Arizona State University
17:10 - 17:35
9:35 - 10:00 LLM-based Task and Motion Planning for Robots [Recording]

Prof. Chuchu Fan | Massachusetts Institute of Technology
17:35 - 18:00
10:00 - 10:20 On the Challenges and Opportunities of Policy Learning for Mobile Manipulation [Recording]

Prof. Jeannette Bohg | Stanford University
18:00 - 18:20
10:25 - 10:45 Coffee Break and Poster Session 20 Mins 18:25 - 18:45
11:00 - 11:20 LLM-State: Adaptive State Representation for Long-Horizon Task Planning in the Open World [Recording]

Prof. David Hsu | National University of Singapore
19:00 - 19:20
11:20 - 11:40 LLMs for System 1 Generalization [Recording]

Prof. Yuke Zhu | University of Texas at Austin
19:20 - 19:40
11:40 - 12:00 Panel: Bridging the Gap between Research & Industry
Moderator: Naoki Wake, Microsoft Research

Chris Paxton | Hello Robot

Takafumi Watanabe | Preferred Robotics Inc.

Dr. Mohit Shridhar | Dyson Robot Learning Lab

Prof. Lerrel Pinto | NYU Courant

19:40 - 20:00
12:00 - 12:20 Demo: A Chat with Kachaka, a Home Robot

Takafumi Watanabe, Kenichi Hidai | Preferred Robotics Inc.
20:00 - 20:20
12:40 - 13:30 Lunch Break 50 min 20:40 - 21:30
13:30 - 13:50 Language as Bridge for Sim2Real [Recording]

Prof. Roberto Martín-Martín | University of Texas at Austin
21:30 - 21:50
13:50 - 14:10 Foundation Models of and for Navigation [Recording]

Dhruv Shah | University of California, Berkeley
21:50 - 22:10
14:15 - 14:35 BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation [Recording]

Dr. Ruohan Zhang | Stanford University
22:15 - 22:35
14:35 - 15:05 Spotlight Talks (six)
  • RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation
    Ruihai Wu | Peking University
  • MOSAIC: A Modular System for Assistive and Interactive Cooking
    Kushal Kedia | Cornell University
  • CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models
    Haoxu Huang | Shanghai Qizhi Institute
  • Deploying and Evaluating LLMs to Program Service Mobile Robots
    Zichao Hu | UT Austin
  • Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation
    Daniel Honerkamp | University of Freiburg
  • Language Models as Zero-Shot Trajectory Generators
    Norman Di Palo | Imperial College London

22:35 - 23:05
15:10 - 16:00 Coffee Break and a Longer Poster Session 1 Hour 23:10 - 24:00
16:00 - 16:40 Debate: Are Large Foundation Models the most important research topic for the next 5 years? And various other questions.
Moderator: Nur Muhammad Mahi Shafiullah, New York University

Prof. Roberto Martín-Martín | UT Austin

Dhruv Shah | Berkeley

Ted Xiao | Google DeepMind

Lin Shao

GPT-4o * | OpenAI

*The organizers may or may not be serious about this special guest.

00:00 - 00:40
16:40 - 16:55 Moderated Open Discussion:
What’s Down the Horizon? / The 1 Billion Dollar Proposal
All in-person attendees 00:40 - 00:55
16:55 - 17:00 Best Paper Awards Ceremony and Closing Remarks 00:55 - 01:00



Are you going to record the talks and post them later on YouTube?
We will post on YouTube the talks of speakers who give us permission. We will NOT post recordings of the panel discussion, the debate, or the closing open discussion.
Can I present remotely if my paper is accepted as a poster or a spotlight talk?
We will play a pre-recorded version of your spotlight talk, and we strongly encourage you to find a colleague who can present your poster in person.

Call for Papers

We invite submissions including but not limited to the following topics:

Submissions should have up to 4 pages of technical content, with no limit on references/appendices. Submissions are suggested to follow the ICRA double-column format with the template available here. We encourage authors to upload videos, code, or data as supplementary material (due on the same day as the paper). Following the main conference, our workshop will use a single-blind review process. We welcome both unpublished, original contributions and recently published relevant works. Accepted papers will be presented as posters or orals and made public via the workshop’s OpenReview page with the authors’ consent. We strongly encourage at least one of the authors to present on-site during the workshop. Our workshop will feature a Best Paper Award.
Important Dates:


Chris Paxton
FAIR, Meta

Fei Xia
Google DeepMind

Karmesh Yadav
Georgia Tech

Nur Muhammad Mahi Shafiullah
New York University

Naoki Wake
Microsoft Research

Weiyu Liu
Stanford University

Yujin Tang
Sakana AI

Zhutian Yang
MIT, NVIDIA Research


For further information or questions, please contact vlm-navigation-manipulation-workshop [AT] googlegroups [DOT] com

Access Map