RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D

Abstract

Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normal maps, leading to instability in optimization. In this paper, recognizing that normal and depth information effectively describes scene geometry and can be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with generalizable image-to-depth and normal prior models. In an attempt to alleviate the mixed illumination effects in the generated materials, we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines, our models significantly enhance the detail richness, achieving state-of-the-art results.

Methodology

We introduce a generalizable Normal-Depth diffusion model that is trained on the LAION-2B dataset with normals and depth predicted by MiDaS, followed by fine-tuning on a synthetic dataset. Our model can be incorporated with the DMTet and NeRF representations to enhance geometry generation. To alleviate the ambiguity in appearance modeling, we propose an albedo diffusion model to impose a data-driven prior on the albedo component.
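To make the role of the Normal-Depth prior in geometry optimization more concrete, the following is a minimal toy sketch of a score-distillation sampling (SDS) update applied to a 4-channel normal-depth map. It is not the RichDreamer implementation. The names `sds_grad`, `make_toy_denoiser`, and `optimize` are hypothetical, the weighting w = 1 - alpha_bar is one common choice rather than the one confirmed by the paper, and the "denoiser" is an analytic stand-in that assumes the clean data is a single fixed target map. In a real pipeline the gradient would be backpropagated through a renderer into DMTet or NeRF parameters; here the map itself is optimized directly.

```python
import numpy as np


def sds_grad(x, eps_pred_fn, alpha_bar, rng):
    """One SDS gradient estimate with respect to a rendered map x.

    x is assumed to have shape (H, W, 4): normal xyz plus depth.
    """
    eps = rng.standard_normal(x.shape)
    # Forward diffusion: noise the rendered map to noise level alpha_bar.
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = eps_pred_fn(x_t, alpha_bar)
    # Assumed weighting w(t) = 1 - alpha_bar (a common choice).
    w = 1.0 - alpha_bar
    # SDS skips the denoiser Jacobian and uses the noise residual directly.
    return w * (eps_pred - eps)


def make_toy_denoiser(target):
    """Hypothetical stand-in for a trained Normal-Depth diffusion model.

    It returns the exact noise under the assumption that the clean
    sample is `target`, which is what an ideal denoiser would predict
    if the data distribution were a single point.
    """
    def eps_pred_fn(x_t, alpha_bar):
        return (x_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)
    return eps_pred_fn


def optimize(x0, target, steps=200, lr=0.5, seed=0):
    """Plain gradient descent on the map using SDS gradient estimates."""
    rng = np.random.default_rng(seed)
    eps_pred_fn = make_toy_denoiser(target)
    x = x0.copy()
    for _ in range(steps):
        # Sample a random noise level at every step, as SDS does.
        alpha_bar = rng.uniform(0.02, 0.98)
        x -= lr * sds_grad(x, eps_pred_fn, alpha_bar, rng)
    return x


if __name__ == "__main__":
    # Target: flat surface facing +z, with zero depth.
    target = np.zeros((8, 8, 4))
    target[..., 2] = 1.0
    x0 = np.random.default_rng(1).standard_normal(target.shape)
    x = optimize(x0, target)
    print("max abs error:", np.abs(x - target).max())
```

With this toy denoiser the SDS residual reduces to a positive multiple of (x - target), so each step pulls the map toward the target. The point of the sketch is only that the prior operates on normal-depth maps rather than RGB images, which is the distribution mismatch the abstract describes.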

Video

For viewers in China, where access to the video may be restricted, we also provide a link to the video on bilibili.

Gallery Results of Ours (Sphere)

a DSLR photo of a cake covered in colorful frosting with a slice being taken out, high resolution

A crocheted doll wearing a crown, 4K, HD

A statue of angel, 3d asset

Fire-breathing Phoenix, mythical bird, engulfed in flames, rebirth and renewal, 3d asset

a DSLR photo of an origami motorcycle

mini China town, highly detailed, 3d asset

a DSLR photo of a knight chopping wood

A punk rock squirrel in a studded leather jacket shouting into a microphone while standing on a stump and holding a beer

a squirrel dressed up like a Victorian woman

a turtle standing on its hind legs, wearing a top hat

a DSLR photo of an astronaut standing on the surface of mars

a DSLR photo of edible typewriter made out of vegetables

flying Dragon, highly detailed, breathing fire, 3d asset

a tiger wearing sunglasses and a leather jacket, riding a motorcycle

Panda samurai, anthropomorphic panda in samurai armour, soldier, game asset

Ninja Assassin, stealthy operative, high-tech weaponry

the leaning tower of Pisa, aerial view

a DSLR photo of the Statue of Liberty, aerial view

Gallery Results of Ours (NeRF)

a Christmas tree with donuts as decorations

a confused beagle sitting at a desk working on homework

a fox holding a videogame controller

a group of dogs playing poker, 3d asset

a group of squirrels rowing crew

a gummy bear driving a convertible

a human skeleton relaxing in a lounge chair

a humanoid robot sitting on a chair drinking a cup of coffee

a panda wearing a chefs hat and kneading bread dough on a countertop

a squirrel dressed like Henry VIII king of England

a tiger waiter at a fancy restaurant

a wide angle zoomed out DSLR photo of a skiing penguin wearing a puffy jacket

an origami hippo in a river

Majestic Peacock Throne, golden opulence, feathers adorned with jewels, royal symbolism, 3D asset

two raccoons playing poker

Army Jacket, 3D scan

Humoristic san goku body mixed with wild boar head running, 4K, HD

An intricate complex with steam-powered machinery, twisting pipes, and brick warehouses, shrouded in a foggy, industrial atmosphere, 8K, blender 3d

Text to Normal-Depth

Sampling results of our Normal-Depth diffusion model trained on the LAION-2B dataset.

BibTeX

If you find our approach helpful, you may consider citing our work.

@inproceedings{qiu2024richdreamer,
  title={Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d},
  author={Qiu, Lingteng and Chen, Guanying and Gu, Xiaodong and Zuo, Qi and Xu, Mutian and Wu, Yushuang and Yuan, Weihao and Dong, Zilong and Bo, Liefeng and Han, Xiaoguang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={9914--9925},
  year={2024}
}