DeepMind Generalist AI Agent for 3D Virtual Environments


Updated: April 30 2024 16:37


Google DeepMind has made significant strides in AI research, particularly in the realm of video games. From mastering Atari games to achieving human-grandmaster level performance in StarCraft II, the company has consistently pushed the boundaries of what AI can accomplish. Now, they have introduced SIMA (Scalable Instructable Multiworld Agent), a generalist AI agent capable of understanding and following natural-language instructions across a variety of 3D virtual environments.

For SIMA, DeepMind collected a large and diverse dataset of gameplay from both curated research environments and commercial video games. This dataset is used to train agents to follow open-ended language instructions, taking pixels as input and producing keyboard-and-mouse actions as output. Agents are then evaluated on their behavior across a broad range of skills.


Partnering with Game Developers


DeepMind uses more than ten 3D environments in SIMA, consisting of commercial video games and research environments. These environments differ widely in their visual observations and in the affordances they offer. Yet, because they are all 3D environments, basic aspects of 3D embodied interaction, such as navigation, are shared. Commercial video games offer richer interactions and higher visual fidelity, while research environments serve as a useful testbed for probing agent capabilities.


DeepMind runs instances of each game in a secure Google Cloud environment, using hardware-accelerated rendering to a virtual display. This display is streamed to a browser for human gameplay, or to a remote agent client process during evaluation. Brief descriptions of the commercial games used:

  • Goat Simulator 3: A third-person game where the player is a goat in a world with exaggerated physics. The player can complete quests, most of which involve wreaking havoc. The goat is able to lick, headbutt, climb, drive, equip a wide range of visual and functional items, and perform various other actions. Throughout the course of the game, the goat unlocks new abilities, such as the ability to fly.

  • Hydroneer: A first-person mining and base building sandbox where the player is tasked with digging for gold and other resources to turn a profit and enhance their mining operation. To do this, they must build and upgrade their set-ups and increase the complexity and levels of automation until they have a fully automated mining system. Players can also complete quests from non-player characters to craft bespoke objects and gain extra money. Hydroneer requires careful planning and managing of resources.

  • No Man’s Sky: A first- or third-person survival game where the player seeks to explore a galaxy full of procedurally-generated planets. This involves flying between planets to gather resources, trade, build bases, and craft items that are needed to upgrade their equipment and spaceship while surviving a hazardous environment. No Man’s Sky includes a large amount of visual diversity—which poses important challenges for agent perception—and rich interactions and skills.

  • Satisfactory: A first-person, open-world exploration and factory building game, in which players attempt to build a space elevator on an alien planet. This requires building increasingly complex production chains to extract natural resources and convert them into industrial goods, tools, and structures—whilst navigating increasingly hostile areas of a large open environment.

  • Teardown: A first-person, sandbox–puzzle game in a fully destructible voxel world where players are tasked with completing heists to gain money, acquiring better tools, and undertaking even more high-risk heists. Each heist is a unique scenario in one of a variety of locations where players must assess the situation, plan the execution of their mission, avoid triggering alarms, and escape before a timer expires. Teardown involves planning and using the environment to one’s advantage to complete the tasks with precision and speed.

  • Valheim: A third-person survival and sandbox game in a world inspired by Norse mythology. Players must explore various biomes, gather resources, hunt animals, build shelter, craft equipment, sail the oceans and defeat mythological monsters to advance in the game—while surviving challenges like hunger and cold.

  • Wobbly Life: A third-person, open-world sandbox game where the player can explore the world, unlock secrets, and complete various jobs to earn money and buy items, leading up to buying their own house. They must complete these jobs whilst contending with the rag-doll physics of their characters and competing against the clock. The jobs require timing, planning, and precision to be completed. The world is extensive and varied, with a diverse range of interactive objects.


Training Methodology


The SIMA agent maps visual observations and language instructions to keyboard-and-mouse actions (Figure 4). Given the complexity of this undertaking (the high dimensionality of the input and output spaces, and the breadth of possible instructions over long timescales), DeepMind predominantly focuses on training the agent to perform instructions that can be completed in less than approximately 10 seconds. Breaking tasks into simpler sub-tasks enables their reuse across different settings and entirely different environments, given an appropriate sequence of instructions from the user.
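To make this mapping concrete, here is a minimal Python sketch of what such an interface could look like. It is not taken from the SIMA report: the names `Action`, `run_instruction`, `run_sequence`, `toy_policy`, and `blank_frame` are all hypothetical, and the toy policy is a hand-written stand-in for a trained agent. It only illustrates two ideas from the paragraph above: a policy takes a frame plus an instruction and returns a keyboard-and-mouse action, and longer behaviours come from chaining short instructions.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical types for illustration only; none come from the SIMA report.
Frame = List[List[Tuple[int, int, int]]]  # an RGB image as rows of pixels


@dataclass(frozen=True)
class Action:
    """One keyboard-and-mouse action for a single time step."""
    keys: FrozenSet[str] = frozenset()
    mouse_dx: float = 0.0
    mouse_dy: float = 0.0
    click: bool = False


Policy = Callable[[Frame, str], Action]


def run_instruction(policy: Policy, get_frame: Callable[[], Frame],
                    instruction: str, steps: int) -> List[Action]:
    """Query the policy once per step while one short instruction is active."""
    return [policy(get_frame(), instruction) for _ in range(steps)]


def run_sequence(policy: Policy, get_frame: Callable[[], Frame],
                 instructions: List[str], steps_each: int) -> List[Action]:
    """Chain several short instructions to build a longer behaviour."""
    trace: List[Action] = []
    for text in instructions:
        trace.extend(run_instruction(policy, get_frame, text, steps_each))
    return trace


def toy_policy(frame: Frame, instruction: str) -> Action:
    """Stand-in for a trained agent: reacts only to words in the instruction."""
    if "forward" in instruction:
        return Action(keys=frozenset({"w"}))
    if "chop" in instruction:
        return Action(click=True)
    return Action()


def blank_frame() -> Frame:
    """Return a tiny all-black 3x4 image as a placeholder observation."""
    return [[(0, 0, 0)] * 4 for _ in range(3)]
```

For example, `run_sequence(toy_policy, blank_frame, ["go forward", "chop the tree"], 2)` produces four actions: two steps holding `w`, then two clicks. A real agent would of course condition on the pixels as well as the text.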

SIMA Agent Architecture




The SIMA agent is capable of performing a range of language-instructed tasks across diverse 3D virtual environments. Representative, visually salient examples of its capabilities include basic navigation and tool use.

Evaluating SIMA's Performance


The current version of SIMA was evaluated across 600 basic skills, spanning navigation, object interaction, and menu use. The tasks were designed to be completed within about 10 seconds. The results showed that SIMA agents trained on a set of nine 3D games significantly outperformed specialized agents trained on individual games. Impressively, an agent trained on all but one game performed nearly as well on the unseen game as an agent trained specifically on it, highlighting SIMA's ability to generalize beyond its training.
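The held-out-game comparison described above follows a leave-one-out pattern, which can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `leave_one_out_splits` and `relative_performance` are hypothetical, the report's actual evaluation pipeline is not shown here, and expressing a score as a percentage of a specialist's score is an assumed way of comparing agents.

```python
from typing import List, Tuple


def leave_one_out_splits(games: List[str]) -> List[Tuple[List[str], str]]:
    """For each game, pair the list of remaining training games with it."""
    return [
        ([g for g in games if g != held_out], held_out)
        for held_out in games
    ]


def relative_performance(score: float, specialist_score: float) -> float:
    """Express an agent's score as a percentage of a specialist's score."""
    if specialist_score <= 0:
        raise ValueError("specialist_score must be positive")
    return 100.0 * score / specialist_score


if __name__ == "__main__":
    # Game names come from the list earlier in this post; any subset works.
    games = ["Goat Simulator 3", "Hydroneer", "No Man's Sky"]
    for train_games, held_out in leave_one_out_splits(games):
        print(f"train on {train_games}, evaluate on {held_out}")
```

Each split trains on every game except one and evaluates on the game left out, so the held-out score measures transfer to an environment the agent never saw during training.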


SIMA's performance heavily relies on language. In a control test where the agent was not given any language training or instructions, it behaved in an appropriate but aimless manner, such as gathering resources instead of following specific instructions. This demonstrates the crucial role of natural language in guiding the AI agent's actions.


Agents exhibit varying degrees of performance across the diverse skills that were evaluated, performing some skills reliably and others with more limited success. Skill categories are grouped into clusters derived from the evaluation tasks.
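A per-category breakdown like this can be computed by grouping task outcomes by skill category. The Python sketch below assumes each evaluation result is recorded as a (category, succeeded) pair; that record format and the function name `success_rate_by_category` are assumptions made for illustration, and the outcomes in the usage example are made-up placeholders, not SIMA results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_category(
    results: Iterable[Tuple[str, bool]]
) -> Dict[str, float]:
    """Return the fraction of successful tasks for each skill category."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for category, succeeded in results:
        totals[category] += 1
        wins[category] += int(succeeded)
    return {category: wins[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Placeholder outcomes for illustration only; not real evaluation data.
    placeholder_results = [
        ("navigation", True),
        ("navigation", False),
        ("menu use", True),
    ]
    print(success_rate_by_category(placeholder_results))
```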

Future Directions

Google DeepMind's research with SIMA is building towards more general AI systems and agents that can understand and safely carry out a wide range of tasks in a helpful manner, both online and in the real world. As SIMA is exposed to more training worlds and incorporates more advanced models, it is expected to become more generalizable and versatile. The ultimate goal is to improve SIMA's understanding and ability to act on higher-level language instructions to achieve more complex goals.

SIMA represents a significant milestone in AI research, demonstrating the potential for generalist, language-driven AI agents that can understand and operate across diverse virtual environments. By leveraging partnerships with game developers and innovative training methodologies, Google DeepMind has created an AI agent that can follow natural-language instructions to complete tasks in a variety of settings. As research continues, SIMA and other agent research could pave the way for more helpful AI assistants in both virtual and real-world environments.

Full Report: Scaling Instructable Agents Across Many Simulated Worlds

