We provide a pretrained model for estimating normals of human body images. Given an input RGB(A) image, the model estimates high-quality normal maps.
Our approach uses an encoder-decoder network to learn pixel-aligned features from which normal maps are estimated. To enhance the network's perception of fine detail, we introduce ray direction features at the per-pixel level.
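To make the idea concrete, here is a minimal sketch of how per-pixel ray direction features could be built from pinhole camera intrinsics and concatenated with pixel-aligned features. The intrinsics, feature shapes, and function names are illustrative assumptions, not the model's actual configuration.

```python
import torch

def ray_direction_map(H, W, fx, fy, cx, cy):
    """Unit ray direction for each pixel in camera space, shape (3, H, W)."""
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Back-project pixel centers through the pinhole camera model.
    dirs = torch.stack(
        [(xs + 0.5 - cx) / fx, (ys + 0.5 - cy) / fy, torch.ones_like(xs)],
        dim=0,
    )
    return dirs / dirs.norm(dim=0, keepdim=True)

# Hypothetical example: augment pixel-aligned features with ray directions.
feats = torch.randn(1, 64, 512, 512)  # placeholder feature map
rays = ray_direction_map(512, 512, fx=1000.0, fy=1000.0, cx=256.0, cy=256.0)
augmented = torch.cat([feats, rays.unsqueeze(0)], dim=1)  # (1, 67, 512, 512)
```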
The figure below illustrates the network architecture: skip connections and ray direction embeddings help each pixel capture fine normal details, and the linear layers after the decoder learn a better upsampling function, retaining as much detail as possible (see the sketch after the figure).
Figure: Network architecture.
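One plausible reading of the "linear layers as a learned upsampling function" idea is sketched below: per-pixel linear layers expand each decoder feature into an r×r patch of sub-pixel normals, which a pixel shuffle rearranges to full resolution. The layer sizes, class name, and upsampling factor are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LinearUpsampleHead(nn.Module):
    """Per-pixel linear layers predicting an r x r patch of normals per feature,
    rearranged to full resolution (a learned upsampling function)."""
    def __init__(self, in_ch=64, hidden=128, scale=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_ch, hidden),
            nn.GELU(),
            nn.Linear(hidden, 3 * scale * scale),  # 3 normal components per sub-pixel
        )
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, feats):                       # feats: (B, C, h, w)
        x = feats.permute(0, 2, 3, 1)               # apply the MLP per pixel
        x = self.mlp(x).permute(0, 3, 1, 2)         # (B, 3*r*r, h, w)
        normals = self.shuffle(x)                   # (B, 3, h*r, w*r)
        return nn.functional.normalize(normals, dim=1)  # unit-length normals

head = LinearUpsampleHead()
print(head(torch.randn(1, 64, 128, 128)).shape)     # torch.Size([1, 3, 512, 512])
```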
We have collected a set of publicly accessible 3D human models, including THuman, 2K2K, and CustomHuman, and have additionally purchased high-quality human models from the corresponding websites. These models are processed with TIDE, a high-performance ray-tracing renderer, to produce multi-view RGB/normal data pairs. The normal data is defined in the camera coordinate system, where the X, Y, and Z axes point left, down, and front, respectively. The following video shows some rendered examples.
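As a small worked example of this camera-space convention, the snippet below maps unit normals in [-1, 1] to the common (n+1)/2 RGB image encoding and back. The encoding itself is a standard assumption, not stated explicitly by this document.

```python
import numpy as np

def normal_to_rgb(normals):
    """Encode camera-space unit normals in [-1, 1] as uint8 RGB."""
    return ((normals + 1.0) * 0.5 * 255.0).round().clip(0, 255).astype(np.uint8)

def rgb_to_normal(rgb):
    """Decode a normal map image back to unit normals."""
    n = rgb.astype(np.float32) / 255.0 * 2.0 - 1.0
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# A normal pointing straight at the camera (+Z, "front") encodes as bluish.
print(normal_to_rgb(np.array([0.0, 0.0, 1.0])))  # [128 128 255]
```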
In addition, to deal with the domain gap, we have captured about 20,000 high-quality normal results from real-world images. The dataset comprises roughly 55,000 pairs of images and corresponding normal maps in total, of which the rendered portion is publicly available. If you'd like to access the data, please contact the authors.
The model has been deployed on ModelScope, and the inference code is available in the ModelScope GitHub repository.
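A minimal usage sketch with the ModelScope pipeline API is shown below. The task name, model ID, and output key are placeholders; please consult the model card and repository for the actual identifiers.

```python
from modelscope.pipelines import pipeline

# NOTE: task name and model ID are hypothetical placeholders; check the
# ModelScope model card for the real identifiers.
normal_estimator = pipeline(
    task='human-normal-estimation',             # hypothetical task name
    model='damo/cv_human_normal-estimation',    # hypothetical model ID
)

result = normal_estimator('input_rgba.png')     # path to an RGB(A) image
normal_map = result['normals']                  # assumed output key
```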