ByteDance Launches DiffPortrait3D: Pioneering Single-Image Multi-Angle 3D Portrait Generation Technology

2023-12-29

Recently, large language models (LLMs) have gained popularity in the artificial intelligence (AI) community thanks to their impressive capabilities and performance. These models have been applied across subfields of AI, including natural language processing, natural language generation, and computer vision. While computer vision, and diffusion models in particular, has received significant attention, generating high-fidelity, view-consistent novel views from limited input remains a challenge.

To address this challenge, a recent study by a research team at ByteDance proposed DiffPortrait3D, a conditional diffusion model designed to create realistic, view-consistent novel views from a single in-the-wild portrait. Given one unconstrained two-dimensional (2D) portrait, DiffPortrait3D can reconstruct a three-dimensional (3D) facial representation.

The model preserves the identity and expression of the subject while generating realistic facial details under new camera angles. The main innovation of this approach lies in its zero-shot capability, allowing it to generalize to a wide range of facial portraits, including those with unposed camera angles, extreme facial expressions, and various artistic styles, without the need for time-consuming optimization or fine-tuning procedures.

The foundation of DiffPortrait3D is a generative prior obtained from 2D diffusion models pre-trained on large image datasets, which serves as the rendering backbone of the model. Denoising is guided by a disentangled attention-based control mechanism that handles appearance and camera pose separately. The appearance context of the reference image is injected into the self-attention layers of the frozen UNet, the core component of the denoising process.
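To make the idea concrete, here is a minimal sketch (not ByteDance's released code) of how reference-image features can be injected into a self-attention layer: the reference tokens are concatenated into the keys and values so the frozen layer can borrow appearance from the reference portrait. Module and dimension names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RefInjectedSelfAttention(nn.Module):
    """Self-attention whose keys/values also include reference-image tokens."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (batch, tokens, d_model) features of the view being denoised
        # ref: (batch, ref_tokens, d_model) features of the reference portrait
        kv = torch.cat([x, ref], dim=1)          # let queries attend to the reference
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out

# Toy usage: 2 images, 64 spatial tokens each, 320-dim features.
layer = RefInjectedSelfAttention(d_model=320, n_heads=8)
x, ref = torch.randn(2, 64, 320), torch.randn(2, 64, 320)
print(layer(x, ref).shape)  # torch.Size([2, 64, 320])
```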

DiffPortrait3D alters the rendering view with a dedicated conditional control module. This module interprets the camera pose from a condition image of another subject captured from the same view, enabling the model to synthesize consistent facial features from different perspectives.
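A hedged sketch of such a control branch, in the spirit of ControlNet-style conditioning: a small convolutional encoder maps the condition image to a residual that is added to the UNet features, with a zero-initialized projection so the control signal is blended in gradually. Layer sizes and names are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class PoseControlBranch(nn.Module):
    def __init__(self, out_channels: int = 320):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(128, out_channels, kernel_size=3, stride=2, padding=1),
        )
        # Zero-initialized 1x1 projection: training starts as an identity mapping.
        self.zero_proj = nn.Conv2d(out_channels, out_channels, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, unet_feat: torch.Tensor, cond_image: torch.Tensor) -> torch.Tensor:
        # unet_feat:  (B, C, H, W) features inside the frozen denoising UNet
        # cond_image: (B, 3, 8H, 8W) condition image seen from the target camera view
        return unet_feat + self.zero_proj(self.encoder(cond_image))

branch = PoseControlBranch(out_channels=320)
feat, cond = torch.randn(1, 320, 32, 32), torch.randn(1, 3, 256, 256)
print(branch(feat, cond).shape)  # torch.Size([1, 320, 32, 32])
```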

To further enhance view consistency, the team proposes a trainable cross-view attention module. This module is particularly helpful in scenarios where extreme facial expressions or unposed camera views make synthesis difficult.
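As a rough illustration, cross-view attention can be implemented by letting the tokens of every generated view attend jointly to the tokens of all views, so details stay coherent along the camera trajectory. The module below is a simplified assumption for demonstration only.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, n_views, tokens, d_model)
        b, v, t, d = views.shape
        flat = views.reshape(b, v * t, d)      # merge all views into one sequence
        out, _ = self.attn(flat, flat, flat)   # every token attends to every view
        return out.reshape(b, v, t, d)

module = CrossViewAttention(d_model=320, n_heads=8)
views = torch.randn(1, 4, 64, 320)             # 4 candidate camera views
print(module(views).shape)                     # torch.Size([1, 4, 64, 320])
```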

Additionally, a 3D-aware noise generation mechanism ensures robustness during inference. This step enhances the overall stability and realism of the synthesized images. The team evaluated DiffPortrait3D on challenging multi-view and in-the-wild benchmarks, demonstrating state-of-the-art results both qualitatively and quantitatively. The approach proves effective at single-image 3D portrait synthesis, producing realistic, high-quality facial reconstructions across diverse artistic styles and settings.
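One way such 3D-aware initialization can work, sketched under the assumption of a standard DDPM schedule: instead of starting from pure Gaussian noise, the initial latent is built by noising a coarse render of the target view (the hypothetical `coarse_render` below) to the final diffusion step, so all views start from a shared 3D layout. The exact recipe in the paper may differ.

```python
import torch

def make_3d_aware_noise(coarse_render: torch.Tensor, alpha_bar_T: float = 0.0047) -> torch.Tensor:
    # coarse_render: (B, C, H, W) latent of a rough novel-view reconstruction
    # alpha_bar_T:   cumulative alpha at the final timestep of the noise schedule
    eps = torch.randn_like(coarse_render)
    return (alpha_bar_T ** 0.5) * coarse_render + ((1.0 - alpha_bar_T) ** 0.5) * eps

coarse = torch.randn(1, 4, 64, 64)   # stand-in for a coarse novel-view latent
x_T = make_3d_aware_noise(coarse)
print(x_T.shape)                     # torch.Size([1, 4, 64, 64])
```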

The team shared their main contributions:

  • Introduced a zero-shot approach that creates 3D-consistent novel views from a single portrait by extending 2D Stable Diffusion.
  • Demonstrated impressive results in novel-view portrait synthesis, supporting a wide range of appearances, expressions, poses, and styles without tedious fine-tuning.
  • Used a disentangled control scheme that handles appearance and camera viewpoint independently, allowing efficient camera manipulation without affecting the subject's expression or identity.
  • Combined a cross-view attention module with a 3D-aware noise generation technique to provide long-range 3D view consistency.