Introduction

ChatGPT is powerful but ungrounded. The future of Foundation Models will be embodied agents that proactively take actions, endlessly explore the world, and continuously self-improve. What does it take? See our Twitter post for a blueprint of this future.

Massively Multitask Benchmarking Suite

MineDojo is a new framework built on the popular Minecraft game for embodied agent research. MineDojo features a simulation suite with thousands of open-ended, language-prompted tasks, in which AI agents can freely explore a procedurally generated 3D world with diverse terrains to roam, materials to mine, tools to craft, structures to build, and wonders to discover.

Open-ended Exploration in Overworld, The Nether, and The End

Fight against an Ender dragon
Scoop a bucket of lava
Explore an ocean monument
Find a desert pyramid

Wide Variety of Terrains, Weathers, and Items

Equip different levels of armor
Build a Nether Portal and enter it
Visit an end city
Traverse different terrains

Diverse and Creative Tool Usage

Encircle llamas with fences
Play fireball with a ghast
Grow wheat
Block damage with a shield

Internet-scale Knowledge Base

Minecraft has more than 100M active players, who have collectively generated an enormous wealth of data. MineDojo features a massive database collected automatically from the internet. AI agents can learn from this treasure trove of knowledge to harvest actionable insights, acquire diverse skills, develop complex strategies, and discover interesting objectives to pursue. All our databases are open-access and available to download today!

MineCLIP

We propose a conceptually simple method to learn a Minecraft-playing agent from in-the-wild YouTube videos. It is far from solving the game, but it is a baby step towards our vision of an “embodied GPT-3” that takes the right actions given any language prompt.

Since our YouTube dataset has time-aligned narration, we are able to train a video-language contrastive model called MineCLIP. Intuitively, it learns to associate a video with the text that describes the activity in that video. MineCLIP outputs a correlation score in the range [0, 1].
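To make the contrastive idea concrete, here is a minimal sketch of one common formulation: a symmetric InfoNCE loss of the kind used by CLIP-style models. Whether MineCLIP uses exactly this loss is an assumption. The toy 2-D embeddings, the `temperature` value, and all function names are hypothetical and chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(video_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired (video, text) embeddings.

    Assumption: pair i in video_embs matches pair i in text_embs; all other
    combinations in the batch are treated as negatives.
    """
    v = [normalize(e) for e in video_embs]
    t = [normalize(e) for e in text_embs]
    n = len(v)
    # logits[i][j]: temperature-scaled cosine similarity of video i and text j.
    logits = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Video-to-text direction: row i should assign high probability to column i.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video direction: column j should assign high probability to row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (v2t + t2v) / 2

videos = [[1.0, 0.0], [0.0, 1.0]]
# Captions that line up with their videos versus captions that are swapped.
aligned = contrastive_loss(videos, [[1.0, 0.1], [0.1, 1.0]])
shuffled = contrastive_loss(videos, [[0.1, 1.0], [1.0, 0.1]])
print(aligned < shuffled)
```

The loss is small when each video embedding sits closest to its own narration and large when the pairing is scrambled, which is the sense in which such a model "learns to associate" video with text.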

How do we use MineCLIP for training? Given a text prompt, the agent interacts with the Minecraft sim and generates a video, which can be fed to MineCLIP to compute correlation with the prompt. The higher the correlation, the more the agent’s behavior is on the right track.
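The reward idea can be sketched in a few lines, under stated assumptions. Here the per-frame embeddings and the prompt embedding are toy vectors standing in for the outputs of a video-language model, the score is a cosine similarity rescaled to [0, 1], and the sliding `window` over recent frames is a hypothetical choice. None of these details are taken from the actual MineCLIP implementation.

```python
import math

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero length.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def score_01(video_emb, prompt_emb):
    # Assumed mapping: rescale cosine similarity from [-1, 1] to [0, 1].
    return (cosine(video_emb, prompt_emb) + 1.0) / 2.0

def rollout_rewards(frame_embs, prompt_emb, window=2):
    """Reward at step t is the score of the mean embedding of the last `window` frames."""
    rewards = []
    for t in range(len(frame_embs)):
        clip = frame_embs[max(0, t - window + 1): t + 1]
        mean = [sum(col) / len(clip) for col in zip(*clip)]
        rewards.append(score_01(mean, prompt_emb))
    return rewards

prompt = [1.0, 0.0]  # hypothetical embedding of the text prompt
# Hypothetical frame embeddings from a rollout that drifts toward the prompt.
frames = [[0.2, 1.0], [0.6, 0.6], [1.0, 0.1]]
rewards = rollout_rewards(frames, prompt)
print([round(r, 3) for r in rewards])
```

In this toy rollout the reward rises as the recent frames move closer to the prompt, so a reinforcement learning algorithm maximizing it would be pushed toward behavior that matches the prompt.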

Team

Email lead developers. * Equal contribution. † Equal advising.

Check out our paper!

@inproceedings{fan2022minedojo,
  title = {MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge},
  author = {Linxi Fan and Guanzhi Wang and Yunfan Jiang and Ajay Mandlekar and Yuncong Yang and Haoyi Zhu and Andrew Tang and De-An Huang and Yuke Zhu and Anima Anandkumar},
  booktitle = {Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year = {2022},
  url = {https://openreview.net/forum?id=rc8o_j8I8PX}
}

Building Open-Ended Embodied Agents with Internet-Scale Knowledge

💫✨NeurIPS 2022 Outstanding Paper Award✨💫

Team

Jim (Linxi) Fan

Guanzhi Wang*

Yunfan Jiang*

Ajay Mandlekar

Yuncong Yang

Haoyi Zhu

Andrew Tang

De-An Huang

Yuke Zhu†

Anima Anandkumar†