RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D



Abstract

Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of a geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normal maps, leading to instability in optimization. In this paper, recognizing that normal and depth information effectively describe scene geometry and can be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with generalizable image-to-depth and normal prior models. To alleviate the mixed illumination effects in the generated materials, we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines, our models significantly enhance the detail richness, achieving state-of-the-art results.

Methodology

We introduce a generalizable Normal-Depth diffusion model that is trained on the LAION-2B dataset, with normal and depth maps predicted by MiDaS, and then fine-tuned on a synthetic dataset. Our model can be integrated with both DMTet and NeRF representations to improve geometry generation. To alleviate the ambiguity in appearance modeling, we propose an albedo diffusion model that imposes a data-driven prior on the albedo component.
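To make the training data described above more concrete, here is a minimal sketch of how a predicted normal map and depth map might be combined into one target for a joint Normal-Depth model. The 4-channel layout, the rescaling of depth to [-1, 1], and the helper names `normalize_depth`, `unit_normal`, and `pack_normal_depth` are assumptions made for illustration; they are not taken from the released code.

```python
import math

def normalize_depth(depth):
    """Rescale a relative depth map (list of rows) to [-1, 1]."""
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # avoid division by zero on constant maps
    return [[2.0 * (d - lo) / scale - 1.0 for d in row] for row in depth]

def unit_normal(n):
    """Renormalize one (x, y, z) normal to unit length."""
    length = math.sqrt(sum(c * c for c in n)) or 1.0
    return tuple(c / length for c in n)

def pack_normal_depth(normals, depth):
    """Stack per-pixel normals (3 channels) and depth (1 channel) into an
    H x W x 4 target. The 4-channel layout is an assumption of this sketch."""
    depth_n = normalize_depth(depth)
    return [
        [unit_normal(n) + (d,) for n, d in zip(n_row, d_row)]
        for n_row, d_row in zip(normals, depth_n)
    ]

# Toy 1x2 "image". In practice the inputs would come from pretrained
# normal and depth estimators run on each captioned training image.
normals = [[(0.0, 0.0, 2.0), (3.0, 0.0, 4.0)]]
depth = [[1.0, 3.0]]
target = pack_normal_depth(normals, depth)
```

Each pixel of `target` then carries a unit normal plus a normalized depth value, which a diffusion model could be trained to denoise jointly, conditioned on the image caption.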

architecture
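The abstract refers to score-distillation sampling (SDS) as the mechanism that lets a diffusion prior guide geometry. The sketch below illustrates the commonly used form of the SDS gradient, `weight * (predicted_noise - true_noise)`, on a toy problem. It is not the authors' implementation: `make_toy_prior` is a hypothetical stand-in for a trained diffusion model, and the rendered values are optimized directly, whereas a real pipeline would backpropagate this gradient through a renderer into DMTet or NeRF parameters.

```python
import math
import random

def sds_gradient(x, predict_noise, alpha_bar, weight=1.0, rng=random):
    """SDS-style gradient for a rendered map x (flat list of floats).

    Noise the render, ask the diffusion prior to predict that noise, and
    return weight * (predicted_noise - true_noise), skipping the U-Net Jacobian.
    """
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = [a * xi + s * ei for xi, ei in zip(x, eps)]
    eps_hat = predict_noise(x_t, alpha_bar)
    return [weight * (eh - e) for eh, e in zip(eps_hat, eps)]

def make_toy_prior(target):
    """Hypothetical stand-in for a diffusion model whose data distribution
    is the single point `target`; its optimal noise prediction is closed-form."""
    def predict_noise(x_t, alpha_bar):
        a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        return [(xt - a * t) / s for xt, t in zip(x_t, target)]
    return predict_noise

target = [0.0, 0.0, 1.0]   # e.g. one surface normal the prior "prefers"
x = [0.8, -0.5, 0.2]       # current rendered value being optimized
prior = make_toy_prior(target)
rng = random.Random(0)
for _ in range(300):
    grad = sds_gradient(x, prior, alpha_bar=0.5, rng=rng)
    x = [xi - 0.1 * g for xi, g in zip(x, grad)]
```

In this toy setting the sampled noise cancels out and the gradient reduces to a pull toward `target`, so `x` converges to it. With a prior trained on normal and depth maps rather than RGB images, the same update would steer rendered normals and depth toward the distribution the prior has learned.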

Video

For viewers in China, where access to the video above may be restricted, we also provide a link to the video on bilibili.

Gallery Results of Ours (Sphere)

a DSLR photo of a cake covered in colorful frosting with a slice being taken out, high resolution
A crocheted doll wearing a crown, 4K, HD
A statue of angel, 3d asset
Fire-breathing Phoenix, mythical bird, engulfed in flames, rebirth and renewal, 3d asset
a DSLR photo of an origami motorcycle
mini China town, highly detailed, 3d asset
a DSLR photo of a knight chopping wood
A punk rock squirrel in a studded leather jacket shouting into a microphone while standing on a stump and holding a beer
a squirrel dressed up like a Victorian woman
a turtle standing on its hind legs, wearing a top hat
a DSLR photo of an astronaut standing on the surface of mars
a DSLR photo of edible typewriter made out of vegetables
flying Dragon, highly detailed, breathing fire, 3d asset
a tiger wearing sunglasses and a leather jacket, riding a motorcycle
Panda samurai, anthropomorphic panda in samurai armour, soldier, game asset
Ninja Assassin, stealthy operative, high-tech weaponry
the leaning tower of Pisa, aerial view
a DSLR photo of the Statue of Liberty, aerial view

Gallery Results of Ours (NeRF)

a Christmas tree with donuts as decorations
a confused beagle sitting at a desk working on homework
a fox holding a videogame controller
a group of dogs playing poker, 3d asset
a group of squirrels rowing crew
a gummy bear driving a convertible
a human skeleton relaxing in a lounge chair
a humanoid robot sitting on a chair drinking a cup of coffee
a panda wearing a chefs hat and kneading bread dough on a countertop
a squirrel dressed like Henry VIII king of England
a tiger waiter at a fancy restaurant
a wide angle zoomed out DSLR photo of a skiing penguin wearing a puffy jacket
an origami hippo in a river
Majestic Peacock Throne, golden opulence, feathers adorned with jewels, royal symbolism, 3D asset
two raccoons playing poker
Army Jacket, 3D scan
Humoristic san goku body mixed with wild boar head running, 4K, HD
An intricate complex with steam-powered machinery, twisting pipes, and brick warehouses, shrouded in a foggy, industrial atmosphere, 8K, blender 3d

Text to Normal-Depth

Samples from our Normal-Depth diffusion model trained on the LAION-2B dataset.

text-to-nd

BibTeX

If you find our approach helpful, you may consider citing our work.

@inproceedings{qiu2024richdreamer,
  title={RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D},
  author={Qiu, Lingteng and Chen, Guanying and Gu, Xiaodong and Zuo, Qi and Xu, Mutian and Wu, Yushuang and Yuan, Weihao and Dong, Zilong and Bo, Liefeng and Han, Xiaoguang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={9914--9925},
  year={2024}
}