DeepMind Genie: The DALL·E 1 Stage of Playable 3D Environment Generation
From a single frame,
Fantasy realms breathe and grow,
Interactive play.
Google DeepMind has presented a new generative model that offers a glimpse of AI-based worldbuilding with interactive, playable affordances. In a nutshell, Genie can generate action-controllable platformer video games from 2D images, videos, or even sketches.
Genie was trained on large datasets of internet videos of platformer games and learned playable character controls without any action labels or text prompts. Remarkably, the trained model can be prompted with a single image, photo, or even a sketch to create a new playable environment with coherent interactive elements (e.g. a movable character exploring a fantasy platformer world).
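The interaction loop this implies can be sketched in a few lines. Below is a minimal, purely illustrative mock-up, assuming three components in the spirit of the Genie paper: a frame tokenizer, a small discrete latent-action vocabulary, and a dynamics model that rolls the world forward one step per chosen action. All class names and the stub logic here are hypothetical stand-ins, not DeepMind's code; in Genie these components are learned transformers.

```python
import numpy as np

# Small discrete action vocabulary learned without labels (the paper uses 8).
NUM_LATENT_ACTIONS = 8

class StubTokenizer:
    """Maps frames to a discrete token grid and back (toy stand-in)."""
    def encode(self, frame: np.ndarray) -> np.ndarray:
        # Downsample 128x128x3 -> 8x8 grid of coarse intensity tokens.
        return (frame[::16, ::16].mean(axis=-1) // 16).astype(np.int64)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        # Upsample tokens back to a 128x128x3 "rendered" frame.
        return np.kron(tokens * 16, np.ones((16, 16)))[..., None].repeat(3, -1)

class StubDynamics:
    """Predicts next-frame tokens from the current tokens plus a latent action.
    Toy stand-in: shifts the grid to fake camera/character movement."""
    def step(self, tokens: np.ndarray, action: int) -> np.ndarray:
        shift = action - NUM_LATENT_ACTIONS // 2
        return np.roll(tokens, shift, axis=1)

def play(prompt_frame: np.ndarray, actions: list[int]) -> list[np.ndarray]:
    """Prompt with one image, then step the world with latent actions."""
    tok, dyn = StubTokenizer(), StubDynamics()
    tokens = tok.encode(prompt_frame)       # single image is the whole prompt
    frames = []
    for a in actions:
        tokens = dyn.step(tokens, a)        # autoregressive world-model step
        frames.append(tok.decode(tokens))   # render the predicted frame
    return frames

if __name__ == "__main__":
    sketch = np.random.randint(0, 255, (128, 128, 3)).astype(np.float64)
    rollout = play(sketch, actions=[5, 5, 6, 2])
    print(len(rollout), rollout[0].shape)   # 4 frames of shape (128, 128, 3)
```

The point of the sketch is the control flow, not the models: one image in, then a frame per discrete action out, with no action labels ever seen during training.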
DeepMind emphasized that Genie is a general method rather than a niche game-genre generator: it can work across multiple domains, including training embodied generalist agents, and early tests recreated consistent movement of real-life robotic arms from video alone.
Today, Genie might seem an interesting but not that impressive development; however, it is an important stepping stone on the path to procedural generation of multi-purpose interactive environments. We see it, or rather its 3D-capable successors, as components of future design pipelines. Imagine synthetic video sequences generated at scale by OpenAI Sora being used as prompts to trigger the generation of playable worlds. Likewise, volumetric video scans of real-world locations could turn physical places into playable environments.

Remember the first GAN imagery from 3-5 years ago? It offered far-from-perfect, low-resolution, glitchy images, aesthetically stuck in the uncanny valley, with distorted body shapes and unearthly, sometimes creepy objects. Today, style transfer, upscaling, and text-to-image diffusion tools like Midjourney, DALL·E, and Stable Diffusion are user-friendly everyday instruments, and we anticipate similarly rapid progress in the generation of playable 3D worlds.

Even at its current stage, Genie learns, without supervision, both basic game mechanics and aesthetic effects such as parallax scrolling, which adds a sense of depth to a 2D platformer (see the sketch after this paragraph). In other words, we are at the DALL·E 1 stage of image-to-game generation. The next major steps for similar models are to improve the framerate and to support physics-based interactions; further down the road, they could create explorable 3D environments based either on photo/video documentation of real-life locations or on synthetically generated images depicting non-existent locations and characters.
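For readers unfamiliar with parallax: in a 2D platformer, background layers scroll more slowly than the foreground, creating an illusion of depth. Here is a minimal illustration of the classic hand-coded technique itself; this is not Genie's learned implementation, and the layer names and scroll factors are our own example.

```python
# Classic parallax scrolling: each layer moves at a fraction of the camera
# speed, so slower layers read as farther away. Genie learns to reproduce
# this effect from video; traditional games compute it explicitly.

LAYERS = {            # scroll factor per layer (0 = fixed sky, 1 = foreground)
    "sky": 0.0,
    "mountains": 0.25,
    "trees": 0.6,
    "platforms": 1.0,
}

def layer_offsets(camera_x: float) -> dict[str, float]:
    """Horizontal draw offset of each layer for a given camera position."""
    return {name: -camera_x * factor for name, factor in LAYERS.items()}

if __name__ == "__main__":
    for cam_x in (0, 100, 200):
        print(cam_x, layer_offsets(cam_x))
    # As the camera moves right, "platforms" shifts fully, "mountains"
    # only a quarter as far, and "sky" not at all: the depth illusion.
```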
Recommended Reads