UrbanGIRAFFE: A Controllable 3D-Aware Generative Model for Realistic Urban Scene Synthesis

2023-11-20

Researchers from Zhejiang University have proposed UrbanGIRAFFE, a 3D-aware generative model for realistic urban image synthesis that offers control over both camera poses and scene content. Generative Adversarial Networks (GANs) have proven effective at producing controllable, realistic images, but existing conditional methods restrict the synthesis condition to semantic segmentation maps or 2D layouts and focus on object-centric scenes, so they struggle with complex, unbounded urban scenes and offer little support for free camera viewpoint control or scene editing.

To address these challenges, UrbanGIRAFFE adopts a compositional and controllable strategy built on a coarse 3D panoptic prior, which comprises the distribution of uncountable stuff (e.g., road and vegetation) and the layouts of countable objects. The model decomposes the scene into stuff, objects, and the sky, enabling diverse control such as large camera movements, stuff editing, and object manipulation. A conditional stuff generator takes semantic voxel grids as the stuff prior, integrating coarse semantic and geometric information, while an object layout prior makes it possible to learn an object generator from cluttered scenes. The model is trained end to end with adversarial and reconstruction losses, and sampling positions are restricted using ray-voxel and ray-box intersection strategies to reduce the number of sampling points required.

In comprehensive evaluations, UrbanGIRAFFE surpasses a range of 2D and 3D baselines on both synthetic and real datasets, demonstrating strong controllability and fidelity. Qualitative results on the KITTI-360 dataset show that it outperforms GIRAFFE in background modeling and achieves better stuff editing and camera viewpoint control. An ablation study on KITTI-360 confirms the effectiveness of its architectural components, including the reconstruction loss, the object discriminator, and the object modeling strategy.

Overall, UrbanGIRAFFE tackles the difficult task of controllable 3D-aware image synthesis, enabling diverse manipulation of camera viewpoint, semantic layout, and objects within urban scenes. By leveraging a 3D panoptic prior to decompose the scene into stuff, objects, and the sky, it facilitates compositional generative modeling and marks progress for 3D-aware generative models in complex, unbounded scenes.
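To make the scene decomposition described above more concrete, the sketch below shows, in schematic PyTorch, how per-ray outputs from a stuff generator, an object generator, and a sky generator could be combined with standard alpha compositing. The module names, signatures, and the 64-dimensional feature size are illustrative assumptions for a minimal sketch, not the authors' actual implementation.

```python
# A minimal sketch, assuming hypothetical generator modules and pre-computed
# per-ray samples; names and shapes are illustrative, not the paper's code.
import torch

def render_ray(samples_xyz, deltas, semantics, latent, view_dir,
               stuff_generator, object_generator, sky_generator, obj_mask):
    """Composite stuff, object, and sky contributions along a single ray.

    samples_xyz : (N, 3) sorted sample positions along the ray
    deltas      : (N,)   spacing between consecutive samples
    semantics   : (N, C) semantics looked up from the coarse voxel grid
    obj_mask    : (N,)   True where a sample falls inside an object box
    """
    n = samples_xyz.shape[0]
    sigma = torch.zeros(n)
    feats = torch.zeros(n, 64)

    # Stuff samples are conditioned on the semantic voxel grid; object samples
    # are generated from latents inside their bounding boxes.
    s_sigma, s_feat = stuff_generator(samples_xyz[~obj_mask], semantics[~obj_mask], latent)
    o_sigma, o_feat = object_generator(samples_xyz[obj_mask], latent)
    sigma[~obj_mask], feats[~obj_mask] = s_sigma, s_feat
    sigma[obj_mask], feats[obj_mask] = o_sigma, o_feat

    # Standard alpha compositing of densities and features along the ray.
    alpha = 1.0 - torch.exp(-sigma * deltas)                       # (N,)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha]), 0)[:-1]
    weights = alpha * trans                                        # (N,)
    ray_feat = (weights[:, None] * feats).sum(dim=0)

    # Light not absorbed by stuff or objects is filled in by the sky.
    residual = torch.prod(1.0 - alpha)
    return ray_feat + residual * sky_generator(view_dir, latent)
```

In the full model, such per-ray features would still have to be decoded into images and trained jointly with the adversarial and reconstruction losses mentioned above; the snippet only illustrates how the three scene components can be composited along a ray.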
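The ray-voxel and ray-box intersection strategy can be illustrated with the standard slab test against an axis-aligned box, after which samples are drawn only inside the intersected interval rather than along the full ray. This is a generic sketch of the idea, not the authors' implementation; the box coordinates and sample count are arbitrary.

```python
# A generic slab-test sketch for intersecting a ray with an axis-aligned box;
# ray directions are assumed to have no exactly-zero components for brevity.
import numpy as np

def ray_aabb_intersect(origin, direction, box_min, box_max):
    """Return (t_near, t_far) where the ray overlaps the box, or None."""
    inv_dir = 1.0 / direction
    t0 = (box_min - origin) * inv_dir
    t1 = (box_max - origin) * inv_dir
    t_near = max(np.minimum(t0, t1).max(), 0.0)   # clamp to in front of camera
    t_far = np.maximum(t0, t1).min()
    return (t_near, t_far) if t_near < t_far else None

# Sample only inside the intersected interval instead of along the whole ray,
# which keeps the number of sampling points per ray small.
hit = ray_aabb_intersect(np.zeros(3), np.array([0.3, 0.1, 1.0]),
                         np.array([-1.0, -1.0, 2.0]), np.array([1.0, 1.0, 4.0]))
if hit is not None:
    t_near, t_far = hit
    ts = np.linspace(t_near, t_far, num=16)       # 16 samples inside the box
```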
Future directions include integrating a semantic voxel generator to sample entirely novel scenes and exploring light control by separating lighting from environment color, which would provide finer-grained control over scene generation. The authors also emphasize the importance of the reconstruction loss for maintaining fidelity while generating diverse results, particularly for infrequently encountered semantic categories, and note that applying a moving average to the generator weights during inference further improves the quality of generated images.
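As a rough illustration of that moving-average trick, the snippet below keeps an exponential moving average (EMA) copy of the generator weights during training and uses that copy at inference; the decay value and the stand-in module are illustrative assumptions, not the authors' configuration.

```python
# A minimal EMA sketch; the decay value and the stand-in generator are
# assumptions for illustration only.
import copy
import torch

def update_ema(ema_model, model, decay=0.999):
    """Blend the live generator weights into the moving-average copy."""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

generator = torch.nn.Linear(64, 64)       # stand-in for the real generator
ema_generator = copy.deepcopy(generator)

for step in range(100):                   # stand-in for the training loop
    # ... forward pass, adversarial + reconstruction losses, optimizer step ...
    update_ema(ema_generator, generator)

# At inference time, images are generated with `ema_generator`, whose averaged
# weights typically yield smoother, higher-quality outputs than the live model.
```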