High-Fidelity 3D Textured Shapes Generation by Sparse Encoding and Adversarial Decoding

¹Institute of Intelligent Computing, Alibaba Group   ²SSE, CUHKSZ   *Equal Contribution

Abstract

3D vision is inherently characterized by sparse spatial structures, which calls for a generation paradigm tailored to this sparsity. A second challenge is the amount of training data: relying on limited 3D data inevitably hurts generalization. To address both issues, we design a 3D generation framework that keeps most of the building blocks of StableDiffusion, with minimal adaptations for textured shape generation. We introduce a Sparse Encoding Module for detail preservation and an Adversarial Decoding Module for better shape recovery. Moreover, we clean up the data and build a benchmark on the largest available 3D dataset (Objaverse). We drop the notion of a `specific class' and treat 3D textured shape generation as an open-vocabulary problem. We first validate our network design on ShapeNetV2 (55K samples) on single-class unconditional generation and multi-class conditional generation tasks, and then report metrics on the processed G-Objaverse (200K samples) for the image-conditional generation task. Extensive experiments demonstrate that our proposal outperforms SOTA methods and takes a further step towards open-vocabulary 3D generation.

Method


Sparse3D differs from StableDiffusion only in a few specific components. At the input stage, a dense point cloud (1M to 4M points) is voxelized at a resolution of 1000^3 and fed to a Sparse Convolutional Network to extract coordinate-based features. To align the large number of sparse features with a dense representation, we project the features onto pixel grids and average all features that fall into the same cell. After translating dense 3D point clouds into 2D feature maps, we finetune the pre-trained StableDiffusion for 2D feature-map generation. Meanwhile, we decode the feature maps into explicit meshes through a differentiable mesh extraction layer (FlexiCubes) and optimize the variational autoencoder with a rendering-based reconstruction penalty. Since our output contains RGB renderings in a domain similar to natural images, we further apply an N-layer discriminator for adversarial finetuning to enhance texture quality once the reconstruction loss converges.
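As a concrete illustration of the sparse-to-dense alignment step, the sketch below averages all sparse features that project into the same pixel cell of the 2D feature map. It is a minimal simplification written for this page rather than the released implementation; the function name splat_sparse_features, the integer (u, v) pixel coordinates, and the tensor shapes are assumptions.

import torch

def splat_sparse_features(coords_uv, feats, grid_hw):
    # coords_uv: (N, 2) integer pixel coordinates of the projected sparse features.
    # feats:     (N, C) coordinate-based features from the sparse convolutional backbone.
    # grid_hw:   (H, W) resolution of the dense 2D feature map.
    H, W = grid_hw
    N, C = feats.shape
    flat_idx = coords_uv[:, 1] * W + coords_uv[:, 0]  # linear index of each pixel cell

    summed = torch.zeros(H * W, C, dtype=feats.dtype, device=feats.device)
    counts = torch.zeros(H * W, 1, dtype=feats.dtype, device=feats.device)
    summed.index_add_(0, flat_idx, feats)             # accumulate features per cell
    counts.index_add_(0, flat_idx, torch.ones(N, 1, dtype=feats.dtype, device=feats.device))

    dense = summed / counts.clamp(min=1.0)            # mean over the features in each cell
    return dense.view(H, W, C).permute(2, 0, 1)       # (C, H, W) feature map for the diffusion backbone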

Dataset Overview


We manually split Objaverse into 10 general classes, as the color bands depict. Note that we do not use the "Building && Outdoor" and "Poor Quality" classes, since we empirically find that they harm model convergence; further analysis is provided in the appendix. Splitting Objaverse into general classes lets us build a benchmark on top of it, which we construct by shuffling the data within each class and splitting it into fixed proportions for training, validation, and testing.
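For reference, a benchmark split of this kind can be produced with a few lines of Python; the sketch below shuffles each general class independently and cuts it into train/validation/test subsets. The 90/5/5 ratios are placeholders, not the proportions used for G-Objaverse.

import random

def split_benchmark(class_to_ids, ratios=(0.90, 0.05, 0.05), seed=0):
    # class_to_ids: dict mapping a general-class name to a list of object IDs.
    # ratios:       (train, val, test) fractions; placeholder values.
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for cls, ids in class_to_ids.items():
        ids = list(ids)
        rng.shuffle(ids)
        n_train = int(len(ids) * ratios[0])
        n_val = int(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits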

Unconditional Results


Exported textured shapes from our single-class unconditional generation models. All models are exported with the UV unwrapper xatlas and rendered with Mitsuba.
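For readers who want to reproduce the export step, a minimal sketch using the xatlas Python bindings and trimesh is shown below; the file names are placeholders and the Mitsuba rendering setup is omitted.

import trimesh
import xatlas

mesh = trimesh.load_mesh("generated_shape.obj")  # placeholder path to an exported mesh
vmapping, indices, uvs = xatlas.parametrize(mesh.vertices, mesh.faces)
# vmapping maps the re-indexed (possibly duplicated) vertices back to the original ones.
xatlas.export("generated_shape_uv.obj", mesh.vertices[vmapping], indices, uvs)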

Conditional Results


Qualitative conditional generation results on ShapeNetV2. For each conditioning image, the first row shows the result of our method and the second row shows the result of TexturedLAS. The last four columns show shapes retrieved from 3DILG and 3DS2Vec using OpenShape.

Open-Vocabulary Results


Qualitative comparison of various image-to-3D methods. Depending on each method's setting, the image is treated either as a condition or as the direct input.

Text-to-Image-to-3D Results


Visualization of the text-to-image-to-3D pipeline using off-the-shelf SDXL.
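The first stage of this pipeline can be reproduced with the diffusers implementation of SDXL; the sketch below generates the conditioning image, while the call into our 3D generator is left as a placeholder since it is not a released interface.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(prompt="a wooden rocking chair, studio lighting, plain background").images[0]
image.save("condition.png")

# The saved image is then used as the condition for the image-conditional 3D model,
# e.g. mesh = sparse3d.generate(image)  # hypothetical entry point, not a public API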