Existing LSS methods are trained and evaluated on point clouds drawn from the same domain (left). We focus on studying LSS under domain shifts, where the test samples are drawn from a different data distribution (right). Our paper aims to address the generalization aspect of this task.
The ability to deploy robots that can operate safely in diverse environments is crucial for developing embodied intelligent agents. As a community, we have made tremendous progress in within-domain \textit{LiDAR semantic segmentation}. However, do these methods generalize \textit{across} domains?
To answer this question, we design the first experimental setup for studying domain generalization (DG) for LiDAR semantic segmentation (DG-LSS). Our results confirm a significant performance gap when methods are evaluated in a cross-domain setting: for example, a model trained on the source dataset (SemanticKITTI) obtains 26.53 mIoU on the target data, compared to 48.49 mIoU obtained by the model trained on the target domain (nuScenes).
To tackle this gap, we propose the first method specifically designed for DG-LSS, which obtains 34.88 mIoU on the target domain, outperforming all baselines. Our method augments a sparse-convolutional encoder-decoder 3D segmentation network with an additional, dense 2D convolutional decoder that learns to classify a bird's-eye view of the point cloud. This simple auxiliary task encourages the 3D network to learn features that are robust to shifts in sensor placement and resolution, and that transfer across domains.
With this work, we aim to inspire the community to develop and evaluate future models in such cross-domain conditions.
We encode our input LiDAR scan $P_j$ using the 3D backbone $g^{3D}$ to learn the feature representations $F^{3D}$ of the occupied voxels. (Upper branch: main task) We apply a sparse segmentation head on $F^{3D}$ and supervise it with the 3D semantic labels $\mathcal{Y}_j^{3D}$. (Lower branch: auxiliary task) We project those features along the height axis to obtain dense 2D bird's-eye-view (BEV) features $F^{BEV}$, and apply several 2D convolutional layers to learn the BEV representation. We supervise the BEV auxiliary task with the BEV projection of the semantic labels, $\mathcal{Y}_j^{BEV}$. We train jointly on both losses, $\mathcal{L}^{3D}$ and $\mathcal{L}^{BEV}$.
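The snippet below is a minimal PyTorch sketch of this dual-branch idea, not the authors' released LiDOG code: the names \texttt{bev\_project}, \texttt{BEVDecoder}, \texttt{joint\_loss}, the grid size, and the weight \texttt{lambda\_bev} are illustrative assumptions, and the sparse 3D logits, voxel coordinates, and BEV labels are taken as given inputs rather than produced by a full sparse-convolutional backbone. It pools the per-voxel features of each BEV cell along the height axis into a dense map, classifies that map with a small 2D decoder, and sums the two cross-entropy losses.

\begin{verbatim}
# Minimal sketch (assumed names, not the official LiDOG implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


def bev_project(feats, coords, grid_hw):
    """Average per-voxel features that fall into the same BEV cell.

    feats:  (N, C) features of occupied voxels from the 3D backbone.
    coords: (N, 3) non-negative integer voxel coordinates (x, y, z),
            with x < grid_hw[0] and y < grid_hw[1].
    Returns a dense (C, H, W) BEV feature map.
    """
    H, W = grid_hw
    C = feats.shape[1]
    # Flatten (x, y) into a single cell index; the z (height) axis is dropped.
    cell = coords[:, 0] * W + coords[:, 1]
    summed = feats.new_zeros(H * W, C).index_add_(0, cell, feats)
    counts = feats.new_zeros(H * W).index_add_(
        0, cell, torch.ones_like(cell, dtype=feats.dtype))
    dense = summed / counts.clamp(min=1).unsqueeze(1)
    return dense.t().reshape(C, H, W)


class BEVDecoder(nn.Module):
    """A few 2D convolutional layers that classify the projected BEV features."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, bev_feats):  # (B, C, H, W) -> (B, num_classes, H, W)
        return self.net(bev_feats)


def joint_loss(logits_3d, labels_3d, feats_3d, coords, bev_decoder,
               labels_bev, grid_hw, lambda_bev=1.0, ignore_index=-1):
    """Total loss = L_3D + lambda_bev * L_BEV (both plain cross-entropy here)."""
    loss_3d = F.cross_entropy(logits_3d, labels_3d, ignore_index=ignore_index)
    bev_feats = bev_project(feats_3d, coords, grid_hw).unsqueeze(0)  # add batch dim
    loss_bev = F.cross_entropy(bev_decoder(bev_feats),
                               labels_bev.unsqueeze(0),
                               ignore_index=ignore_index)
    return loss_3d + lambda_bev * loss_bev
\end{verbatim}

Empty BEV cells can simply carry the \texttt{ignore\_index} label so they contribute nothing to the auxiliary loss; any height-axis pooling (mean, as above, or max) serves the same purpose of summarizing the scene layout in 2D.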
@inproceedings{saltori2023walking,
title={Walking Your LiDOG: A Journey Through Multiple Domains for LiDAR Semantic Segmentation},
author={Saltori, Cristiano and Osep, Aljosa and Ricci, Elisa and Leal-Taix{\'e}, Laura},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={196--206},
year={2023}
}
This project was partially funded by the Sofja Kovalevskaja Award of the Humboldt Foundation, the EU ISFP project PRECRISIS (ISFP-2022-TFI-AG-PROTECT-02-101100539), the PRIN project LEGO-AI (Prot. 2020TA3K9N), and the MUR PNRR project FAIR - Future AI Research (PE00000013) funded by NextGenerationEU. It was carried out in the Vision and Learning joint laboratory of FBK-UNITN and used the CINECA and NVIDIA-AI TC clusters to run part of the experiments. We thank T. Meinhardt and I. Elezi for their feedback on this manuscript. The authors of this work take full responsibility for its content.