LiDAR Diffusion

Towards Realistic Scene Generation with LiDAR Diffusion Models


Haoxi Ran, Carnegie Mellon University
Vitor Guizilini, Toyota Research Institute
Yue Wang, University of Southern California
CVPR 2024

LiDAR Diffusion supports controllable generation from a variety of conditions, including semantic maps, camera views, bounding boxes, and text.

Comparisons on Unconditional Generation

Semantic-Map-to-LiDAR

LiDAR Diffusion supports image-based conditioning (e.g., semantic maps) via channel-wise concatenation, as sketched below.
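
As a rough illustration of this conditioning scheme, the sketch below channel-concatenates a semantic map with the noisy latent before denoising. This is a minimal sketch under assumed shapes; the module name, channel counts, and layers are hypothetical, not the paper's actual architecture.

import torch
import torch.nn as nn

class ConcatConditionedDenoiser(nn.Module):
    # Minimal sketch: an image-like condition (e.g., a semantic map) is
    # concatenated with the noisy latent along the channel axis before
    # denoising. All sizes here are illustrative assumptions.
    def __init__(self, latent_ch=4, cond_ch=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_ch + cond_ch, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_ch, 3, padding=1),
        )

    def forward(self, z_t, cond_map):
        # z_t: noisy latent (B, latent_ch, H, W)
        # cond_map: condition resized to latent resolution (B, cond_ch, H, W)
        return self.net(torch.cat([z_t, cond_map], dim=1))

z_t = torch.randn(1, 4, 32, 128)   # range-image-shaped latent
sem = torch.randn(1, 3, 32, 128)   # semantic map at latent resolution
noise_pred = ConcatConditionedDenoiser()(z_t, sem)

Because the condition shares the latent's spatial layout, concatenation gives pixel-aligned control, which is why it suits dense maps rather than global signals.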

Camera-to-LiDAR

LiDAR Diffusion accepts token-based conditioning through a cross-attention mechanism, as sketched below.
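
For non-spatial conditions, the sketch below shows the standard cross-attention pattern in which flattened latent tokens attend to condition tokens (e.g., camera features). The block, dimensions, and token counts are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    # Minimal sketch of token-based conditioning: latent tokens (queries)
    # attend to condition tokens (keys/values) via cross-attention.
    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, z_tokens, cond_tokens):
        # z_tokens: flattened latent (B, H*W, dim); cond_tokens: (B, N, dim)
        attended, _ = self.attn(query=self.norm(z_tokens),
                                key=cond_tokens, value=cond_tokens)
        return z_tokens + attended  # residual update

z = torch.randn(1, 32 * 128, 64)   # flattened latent tokens
c = torch.randn(1, 77, 64)         # condition tokens (e.g., camera features)
out = CrossAttnBlock()(z, c)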

Zero-Shot Text-to-LiDAR

LiDAR Diffusion can be controlled by text, zero-shot, through a pretrained CLIP-based Camera-to-LiDAR LiDM; a sketch of the idea follows.
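
The zero-shot transfer works because CLIP embeds images and text in a shared space: an LiDM conditioned on CLIP image embeddings of camera views can accept CLIP text embeddings at inference without retraining. Below is a minimal sketch using the OpenAI clip package; the lidm object and its sampling interface are hypothetical.

import torch
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Encode a text prompt into CLIP's shared image-text embedding space.
tokens = clip.tokenize(["a car driving down a busy highway"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(tokens)  # (1, 512)

# Hypothetical interface: an LiDM trained on CLIP *image* embeddings can
# consume the text embedding zero-shot, since both live in one space.
# lidar_scene = lidm.sample(cond=text_emb)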

LiDAR Synthesis Quality with Different Scaling Factors

Reconstruction (val) with Different Scaling Factors

Perceptual metrics: Fréchet Range Image Distance (FRID) and Fréchet Sparse Volume Distance (FSVD). Through a comprehensive study, we demonstrate both the effectiveness and the efficiency of hybrid (curve-wise + patch-wise) encoding for LiDAR compression.
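
Both metrics follow the FID recipe: fit a Gaussian to deep features of real and generated scenes (from a range-image network for FRID, a sparse-volume network for FSVD) and compute the Fréchet distance between them. Below is a minimal sketch of that distance; the feature extractors themselves are not shown.

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    # Fréchet distance between Gaussians fit to two (N, D) feature sets,
    # the computation underlying FID-style metrics such as FRID and FSVD.
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):          # drop tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# e.g. frechet_distance(real_feats, gen_feats) with (N, D) numpy arrays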



Abstract

Diffusion models (DMs) excel in photo-realistic image synthesis, but their adaptation to LiDAR scene generation poses a substantial hurdle. This is primarily because DMs operating in the point space struggle to preserve the curve-like patterns and 3D geometry of LiDAR scenes, which consumes much of their representation power. In this paper, we propose LiDAR Diffusion Models (LiDMs) to generate LiDAR-realistic scenes from a latent space tailored to capture the realism of LiDAR scenes by incorporating geometric priors into the learning pipeline. Our method targets three major desiderata: pattern realism, geometry realism, and object realism. Specifically, we introduce curve-wise compression to simulate real-world LiDAR patterns, point-wise coordinate supervision to learn scene geometry, and patch-wise encoding for full 3D object context. With these three core designs, our method achieves competitive performance on unconditional LiDAR generation in the 64-beam scenario and state-of-the-art performance on conditional LiDAR generation, while maintaining high efficiency compared to point-based DMs (up to 107x faster). Furthermore, by compressing LiDAR scenes into a latent space, we enable DMs to be controlled with various conditions such as semantic maps, camera views, and text prompts.
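
To picture the hybrid encoding on a range image: curve-wise compression downsamples only along the horizontal (scan-line) axis, preserving the beam-wise curve structure, while patch-wise encoding then downsamples both axes to capture object-level context. The sketch below is schematic; kernel sizes, strides, and channel widths are illustrative guesses, not the paper's configuration.

import torch
import torch.nn as nn

class HybridRangeEncoder(nn.Module):
    # Schematic sketch of curve-wise then patch-wise compression of a
    # LiDAR range image. All hyperparameters are illustrative assumptions.
    def __init__(self):
        super().__init__()
        # Curve-wise: compress horizontally only, keeping each scan line.
        self.curve = nn.Conv2d(1, 32, kernel_size=(1, 4), stride=(1, 4))
        # Patch-wise: compress both axes for fuller 3D object context.
        self.patch = nn.Conv2d(32, 64, kernel_size=2, stride=2)

    def forward(self, range_img):
        # range_img: (B, 1, 64, 1024), e.g. a 64-beam scan as a range image
        return self.patch(torch.relu(self.curve(range_img)))

z = HybridRangeEncoder()(torch.randn(1, 1, 64, 1024))
print(z.shape)  # torch.Size([1, 64, 32, 128])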

Citation

@inproceedings{ran2024towards,
    title={Towards Realistic Scene Generation with LiDAR Diffusion Models},
    author={Ran, Haoxi and Guizilini, Vitor and Wang, Yue},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2024}
}