Introduction
ChatGPT is powerful but ungrounded. The future of foundation models will be embodied agents that proactively take actions, endlessly explore the world, and continuously self-improve. What does it take? See our Twitter post for a blueprint of this future.
Massively Multitask Benchmarking Suite
MineDojo is a new framework built on the popular Minecraft game for embodied agent research. MineDojo features a simulation suite with thousands of open-ended and language-prompted tasks, where AI agents can freely explore a procedurally generated 3D world with diverse terrains to roam, materials to mine, tools to craft, structures to build, and wonders to discover.
Open-ended Exploration in Overworld, The Nether, and The End
- Fight against an Ender dragon
- Scoop a bucket of lava
- Explore an ocean monument
- Find a desert pyramid
Wide Variety of Terrains, Weathers, and Items
- Equip different levels of armor
- Build a Nether Portal and enter it
- Visit an end city
- Traverse different terrains
Diverse and Creative Tool Usage
Internet-scale Knowledge Base
Minecraft has more than 100M active players, who have collectively generated an enormous wealth of data. MineDojo features a massive database collected automatically from the internet. AI agents can learn from this treasure trove of knowledge to harvest actionable insights, acquire diverse skills, develop complex strategies, and discover interesting objectives to pursue. All our databases are open-access and available to download today!
MineCLIP
We propose a conceptually simple method to learn a Minecraft-playing agent from in-the-wild YouTube videos. It is far from solving the game, but it is a baby step towards our vision of an “embodied GPT-3” that takes the right actions given any language prompt.
Since our YouTube dataset has time-aligned narration, we are able to train a video-language contrastive model called MineCLIP. Intuitively, it learns to associate a video with the text that describes the video activity. MineCLIP computes a correlation score in the range [0, 1].
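To make "contrastive" concrete, here is a minimal sketch of a symmetric InfoNCE loss of the kind commonly used for CLIP-style models. It is an illustration only: the page does not spell out MineCLIP's exact objective, and the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions. Each video embedding is pulled towards its own narration embedding and pushed away from the other narrations in the batch.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE; video_embs[i] and text_embs[i] form a matched pair.

    The temperature is an assumed hyperparameter, not a value from the source.
    """
    logits = [[cosine(v, t) / temperature for t in text_embs] for v in video_embs]
    transposed = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy batch: two clips and their narrations as hypothetical 2-D embeddings.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(videos, aligned_text)
loss_shuffled = info_nce(videos, shuffled_text)
```

In this toy batch the loss is lower when each clip is paired with its own narration than when the narrations are swapped, which is the signal a contrastive model is trained on.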
How do we use MineCLIP for training? Given a text prompt, the agent interacts with the Minecraft sim and generates a video, which can be fed to MineCLIP to compute correlation with the prompt. The higher the correlation, the more the agent’s behavior is on the right track.
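The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MineDojo implementation: frames and the prompt are represented by hypothetical precomputed embeddings, the recent frames are mean-pooled into a clip embedding, and cosine similarity is rescaled to [0, 1] to stand in for the correlation score. The helper names and the window size are invented for the example.

```python
import math
from collections import deque


def correlation_score(video_emb, text_emb):
    """Cosine similarity rescaled from [-1, 1] to [0, 1] (an assumed mapping)."""
    dot = sum(a * b for a, b in zip(video_emb, text_emb))
    norm_v = math.sqrt(sum(a * a for a in video_emb))
    norm_t = math.sqrt(sum(b * b for b in text_emb))
    if norm_v == 0.0 or norm_t == 0.0:
        return 0.5  # neutral score when an embedding carries no direction
    return 0.5 * (1.0 + dot / (norm_v * norm_t))


def rollout_rewards(frame_embs, prompt_emb, window=4):
    """Per-step reward: score of the most recent `window` frames against the prompt."""
    recent = deque(maxlen=window)  # sliding window over the agent's latest frames
    rewards = []
    for emb in frame_embs:
        recent.append(emb)
        # Mean-pool the frames in the window into one clip embedding (an assumption).
        clip_emb = [sum(f[i] for f in recent) / len(recent) for i in range(len(emb))]
        rewards.append(correlation_score(clip_emb, prompt_emb))
    return rewards


# Hypothetical rollout whose frames drift towards the prompt direction.
prompt = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
rewards = rollout_rewards(frames, prompt, window=2)
```

The reward rises as the agent's recent frames align with the prompt, so a reinforcement learning algorithm can use it in place of a hand-written, task-specific reward function.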
Team
- Jim (Linxi) Fan
- Guanzhi Wang*
- Yunfan Jiang*
- Ajay Mandlekar
- Yuncong Yang
- Haoyi Zhu
- Andrew Tang
- De-An Huang
- Yuke Zhu†
- Anima Anandkumar†
Email lead developers. * Equal contribution. † Equal advising.
Check out our paper!
@inproceedings{fan2022minedojo,
title = {MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge},
author = {Linxi Fan and Guanzhi Wang and Yunfan Jiang and Ajay Mandlekar and Yuncong Yang and Haoyi Zhu and Andrew Tang and De-An Huang and Yuke Zhu and Anima Anandkumar},
booktitle = {Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year = {2022},
url = {https://openreview.net/forum?id=rc8o_j8I8PX}
}
MineDojo team ©2022