Controllable 3D scene generation has extensive applications in virtual reality and interior design, where the generated scenes should exhibit high realism and geometric controllability. Scene graphs provide a suitable data representation that facilitates these applications.
However, current graph-based methods for scene generation are constrained to text-based inputs, and their limited adaptability to flexible user inputs hinders precise control over object geometry.
To address this issue, we propose MMGDreamer, a dual-branch diffusion model for scene generation that incorporates a novel Mixed-Modality Graph, a visual enhancement module, and a relation predictor.
The Mixed-Modality Graph allows object nodes to integrate textual and visual modalities, with relationships between nodes left optional. It improves adaptability to flexible user inputs and enables fine-grained control over the geometry of objects in the generated scenes.
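For illustration, the following is a minimal sketch of how such a mixed-modality graph could be represented, assuming each object node carries an optional text embedding and/or an optional visual embedding and that relationship edges may be left unspecified; all field names and dimensions here are illustrative rather than the paper's actual data format.

```python
# Minimal sketch of a mixed-modality scene graph (illustrative only).
from dataclasses import dataclass, field
from typing import Optional, List, Tuple
import torch

@dataclass
class ObjectNode:
    category: str                              # e.g. "chair"
    text_emb: Optional[torch.Tensor] = None    # text-modality feature, if provided
    visual_emb: Optional[torch.Tensor] = None  # visual-modality feature (e.g. from an image crop)

@dataclass
class MixedModalityGraph:
    nodes: List[ObjectNode] = field(default_factory=list)
    # (subject_idx, relation_label, object_idx); relations are optional and may be sparse
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

# Example: a table specified by text only, a chair specified by an image feature,
# and the relationship between them left for the relation predictor to infer.
graph = MixedModalityGraph(
    nodes=[
        ObjectNode("table", text_emb=torch.randn(512)),
        ObjectNode("chair", visual_emb=torch.randn(512)),
    ],
    edges=[],  # no relation given; to be completed later
)
```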
The visual enhancement module enriches the visual fidelity of text-only nodes by constructing their visual representations from text embeddings.
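As a rough illustration only, one plausible realization is a small learned mapper from the text embedding space into the visual embedding space; the sketch below assumes such an MLP-based mapper and does not reproduce the module's actual implementation.

```python
# Sketch of a visual enhancement step: predict a pseudo visual embedding for a
# text-only node from its text embedding (assumed MLP mapper, not the paper's exact module).
import torch
import torch.nn as nn

class VisualEnhancer(nn.Module):
    def __init__(self, text_dim: int = 512, visual_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.mapper = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, visual_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Returns a constructed visual representation for a text-only node.
        return self.mapper(text_emb)

enhancer = VisualEnhancer()
pseudo_visual = enhancer(torch.randn(1, 512))  # fill in the missing visual modality
```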
Furthermore, our relation predictor leverages node representations to infer absent relationships between
nodes, resulting in more coherent scene layouts.
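Conceptually, such a predictor can be sketched as a classifier over pairs of node representations; the example below assumes missing relationships are scored from the concatenated features of the two endpoint nodes, with placeholder dimensions and relation vocabulary.

```python
# Sketch of a relation predictor over node pairs (dimensions and class count are placeholders).
import torch
import torch.nn as nn

class RelationPredictor(nn.Module):
    def __init__(self, node_dim: int = 512, num_relations: int = 16):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * node_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_relations),
        )

    def forward(self, subj: torch.Tensor, obj: torch.Tensor) -> torch.Tensor:
        # Logits over relation labels for a (subject, object) node pair.
        return self.classifier(torch.cat([subj, obj], dim=-1))

predictor = RelationPredictor()
logits = predictor(torch.randn(1, 512), torch.randn(1, 512))
predicted_relation = logits.argmax(dim=-1)  # fill an absent edge with the top-scoring label
```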
Extensive experiments demonstrate that MMGDreamer provides superior control over object geometry and achieves state-of-the-art scene generation performance.
Overview of MMGDreamer. Our pipeline consists of the Latent Mixed-Modality Graph, the Graph Enhancement Module, and the Dual-Branch Diffusion Model. During inference, MMGDreamer starts from the Latent Mixed-Modality Graph, which is enhanced by the Visual Enhancement Module and the Relation Predictor to form a Visual-Enhanced Graph and a Mixed-Enhanced Graph. The Mixed-Enhanced Graph is then fed into the Graph Encoder E_g of the Dual-Branch Diffusion Model for relationship modeling, using a triplet-GCN-structured module integrated with an echo mechanism. Subsequently, the Layout Branch (C.2) and the Shape Branch (C.3) use denoisers conditioned on the nodes' latent representations to generate layouts and shapes, respectively. The final output is a synthesized 3D indoor scene in which the generated shapes are seamlessly integrated into the generated layout.
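For context, the sketch below shows one plausible triplet-GCN message-passing layer that updates node and edge features over (subject, predicate, object) triplets; it is a simplified illustration under our own assumptions and omits the echo mechanism and both diffusion branches.

```python
# Sketch of one triplet-GCN message-passing layer: a shared MLP processes each
# (subject, predicate, object) triplet, and messages are mean-aggregated onto nodes.
import torch
import torch.nn as nn

class TripletGCNLayer(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, 3 * dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, node_feats, edge_feats, edge_index):
        # node_feats: (N, dim); edge_feats: (E, dim); edge_index: (E, 2) LongTensor of (subj, obj) indices.
        subj = node_feats[edge_index[:, 0]]
        obj = node_feats[edge_index[:, 1]]
        msg = self.edge_mlp(torch.cat([subj, edge_feats, obj], dim=-1))
        msg_s, new_edge, msg_o = msg.chunk(3, dim=-1)

        # Mean-aggregate incoming messages; isolated nodes keep their original features.
        agg = torch.zeros_like(node_feats)
        count = node_feats.new_zeros(node_feats.size(0), 1)
        agg.index_add_(0, edge_index[:, 0], msg_s)
        agg.index_add_(0, edge_index[:, 1], msg_o)
        count.index_add_(0, edge_index[:, 0], node_feats.new_ones(edge_index.size(0), 1))
        count.index_add_(0, edge_index[:, 1], node_feats.new_ones(edge_index.size(0), 1))
        new_nodes = self.node_mlp(node_feats + agg / count.clamp(min=1))
        return new_nodes, new_edge

layer = TripletGCNLayer(dim=512)
nodes, edges = layer(torch.randn(4, 512), torch.randn(3, 512),
                     torch.tensor([[0, 1], [1, 2], [2, 3]]))
```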
@article{yang2025mmgdreamer,
  title={MMGDreamer: Mixed-Modality Graph for Geometry-Controllable 3D Indoor Scene Generation},
  author={Yang, Zhifei and Lu, Keyang and Zhang, Chao and Qi, Jiaxing and Jiang, Hanqi and Ma, Ruifei and Yin, Shenglin and Xu, Yifan and Xing, Mingzhe and Xiao, Zhen and others},
  journal={arXiv preprint arXiv:2502.05874},
  year={2025}
}