DiffRefine: Diffusion-based Proposal Specific Point Cloud Densification for Cross-Domain Object Detection

Sangyun Shin1, Yuhang He2, Xinyu Hou1, Samuel Hodgson1, Andrew Markham1, Niki Trigoni1

1University of Oxford    2Microsoft Research

ICCV 2025 (Highlight) · Domain Adaptive 3D Detection · Diffusion · Point Clouds

Abstract

We propose DiffRefine, a diffusion-based module that densifies sparse object points inside box proposals to improve second-stage refinement in 3D object detection under domain shift. Motivated by the observation that proposals often localize well but score low in objectness, DiffRefine iteratively generates points on object surfaces to reinforce the missing features that lead to false negatives. Our approach performs differentiable 3D generation on voxel grids and conditions on spatial context to prevent hallucinated geometry. Experiments on KITTI, nuScenes, and Waymo show consistent improvements across cross-domain settings, especially for distant objects where point sparsity is severe.

Figure 1: DiffRefine performs proposal-specific generation to boost detection performance.

Method Overview

Figure 2: Overall pipeline. Box proposals → point extraction & size-agnostic voxelization → diffusion-based densification (with differentiable warping) → second-stage refinement.

Key Ideas

  • Proposal-specific generation: operate only inside candidate boxes (object-centric) rather than the whole cloud.
  • Diffusion for point densification: treat sparse surface points as noisy samples; denoise to recover dense structure.
  • Size-agnostic voxelization: normalize points to a canonical box view to reduce object-size variance across domains (see the sketch after this list).
  • Spatial-context conditioning: fuse neighborhood features to curb false positives during generation.
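
To make the size-agnostic voxelization concrete, here is a minimal sketch: points inside a proposal are moved into the box frame, normalized by the box size, and rasterized into a fixed-resolution occupancy grid. The function name voxelize_proposal and the grid resolution are illustrative assumptions, not the paper's exact interface.

import math
import torch

def voxelize_proposal(points, center, size, yaw, grid=16):
    """Rasterize points inside one box proposal into a canonical occupancy grid.

    points: (N, 3) xyz in the sensor frame; center: (3,) box center;
    size: (3,) box length/width/height; yaw: heading angle in radians.
    (Hypothetical sketch, not the paper's implementation.)
    """
    # Move into the box frame: translate to the center, undo the heading.
    p = points - center
    c, s = math.cos(-yaw), math.sin(-yaw)
    rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    p = p @ rot.T
    # Normalize by box size so every object occupies the same [0, 1)^3 cube,
    # removing object-size variance across domains.
    p = p / size + 0.5
    inside = ((p >= 0) & (p < 1)).all(dim=1)
    idx = (p[inside] * grid).long().clamp(0, grid - 1)
    occ = torch.zeros(grid, grid, grid)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return occ

Because every proposal is mapped to the same canonical cube, the generator sees a size-normalized view of objects regardless of the source domain.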

Differentiable Object-Point Generation

Figure 3: Differentiable generation on voxel grids: (a) sparse input points, (b) generation target, (c) offset prediction, (d) generation via differentiable warping across diffusion steps.
Figure 4: Generated object points across denoising steps (e.g., car and pedestrian classes).

Implementation sketch: given proposal voxels, the diffusion model predicts offsets to the nearest occupied voxels and warps points progressively, providing a differentiable path for learning to densify object surfaces.
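
A hedged toy sketch of this progressive, differentiable warping follows. OffsetNet, its MLP architecture, and the uniform step size are stand-in assumptions; the paper's model predicts offsets on voxel grids with richer conditioning.

import torch
import torch.nn as nn

class OffsetNet(nn.Module):
    """Toy per-point offset predictor conditioned on the denoising step."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 3),  # xyz offset per point
        )

    def forward(self, pts, t):
        # Append the normalized step index as a conditioning feature.
        t_feat = torch.full_like(pts[..., :1], t)
        return self.mlp(torch.cat([pts, t_feat], dim=-1))

def densify(pts, net, steps=6):
    """Warp sparse/noisy points toward the object surface step by step."""
    for k in reversed(range(steps)):
        # Each warp is a plain tensor op, so gradients from the second-stage
        # refinement loss can flow back through every denoising step.
        pts = pts + net(pts, k / steps) / steps
    return pts

Because each warp is an ordinary tensor operation, the downstream refinement loss can backpropagate through all denoising steps, which is what makes joint training of generation and refinement possible.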

Spatial Context & Refinement

Figure 6: The spatial context feature reduces hallucinations by correlating generated points with neighborhood structure, improving refinement over a no-context variant.

We fuse a spatial context feature from a BEV encoder with the generated object points via a cross-attention-style correlation. This steers densification toward plausible geometry and helps the second-stage refinement reject false positives.
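
As a minimal sketch of this conditioning, the snippet below fuses generated point features with flattened BEV context via standard cross-attention. The module name, the shapes, and the use of nn.MultiheadAttention are our assumptions; the paper describes a cross-attention-style correlation rather than this exact layer.

import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Fuse generated point features with BEV neighborhood context."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_feat, bev_feat):
        # point_feat: (B, N, C) features of generated points (queries);
        # bev_feat:   (B, M, C) flattened BEV cells around the proposal.
        ctx, _ = self.attn(point_feat, bev_feat, bev_feat)
        # Residual fusion ties generated points to neighborhood structure,
        # discouraging geometry that is inconsistent with the scene.
        return self.norm(point_feat + ctx)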

  • Ablations show the largest performance drop when the spatial context feature is removed.
  • Joint training of generation and refinement further boosts accuracy.

Results

Figure 5: Qualitative comparisons (e.g., Waymo → nuScenes, nuScenes → KITTI); DiffRefine reduces false negatives for distant, sparse objects.
Figure 7: Runtime vs. number of denoising steps; performance improves up to ~6 steps.
Figure 8: Object distance vs. number of surface points and AP; DiffRefine helps particularly for distant objects suffering from sparsity.
Table 1: Quantitative results on domain adaptation scenarios (e.g., Waymo→KITTI, nuScenes→KITTI, Waymo→nuScenes) with SECOND-IoU and PointPillars.

BibTeX

@inproceedings{shin2025diffrefine,
  title     = {DiffRefine: Diffusion-based Proposal Specific Point Cloud Densification for Cross-Domain Object Detection},
  author    = {Shin, Sangyun and He, Yuhang and Hou, Xinyu and Hodgson, Samuel and Markham, Andrew and Trigoni, Niki},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}