TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs

Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems. Since the computation pattern is sparse and irregular, specialized high-performance kernels are required. Existing GPU libr...


Bibliographic Details
Main Authors:	Tang, Haotian; Yang, Shang; Liu, Zhijian; Hong, Ke; Yu, Zhongming; Li, Xiuyu; Dai, Guohao; Wang, Yu; Han, Song
Format:	Article
Language:	English
Published:	ACM | 56th Annual IEEE/ACM International Symposium on Microarchitecture 2024
Online Access:	https://hdl.handle.net/1721.1/153260

_version_	1811086313184559104
author	Tang, Haotian; Yang, Shang; Liu, Zhijian; Hong, Ke; Yu, Zhongming; Li, Xiuyu; Dai, Guohao; Wang, Yu; Han, Song
author_sort	Tang, Haotian
collection	MIT
description	Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems. Since the computation pattern is sparse and irregular, specialized high-performance kernels are required. Existing GPU libraries offer two dataflow types for sparse convolution. The gather-GEMM-scatter dataflow is easy to implement but not optimal in performance, while the dataflows with overlapped computation and memory access (e.g. implicit GEMM) are highly performant but have very high engineering costs. In this paper, we introduce TorchSparse++, a new GPU library that achieves the best of both worlds. We create a highly efficient Sparse Kernel Generator that generates performant sparse convolution kernels at less than one-tenth of the engineering cost of the current state-of-the-art system. On top of this, we design the Sparse Autotuner, which extends the design space of existing sparse convolution libraries and searches for the best dataflow configurations for training and inference workloads. Consequently, TorchSparse++ achieves 2.9×, 3.3×, 2.2×, and 1.7× measured end-to-end speedup on an NVIDIA A100 GPU over state-of-the-art MinkowskiEngine, SpConv 1.2, TorchSparse, and SpConv v2 in inference; and is 1.2-1.3× faster than SpConv v2 in mixed precision training across seven representative autonomous driving benchmarks. It also seamlessly supports graph convolutions, achieving 2.6-7.6× faster inference speed compared with state-of-the-art graph deep learning libraries. Our code is publicly released at https://github.com/mit-han-lab/torchsparse.
first_indexed	2024-09-23T13:24:11Z
format	Article
id	mit-1721.1/153260
institution	Massachusetts Institute of Technology
language	English
last_indexed	2024-09-23T13:24:11Z
publishDate	2024
publisher	ACM | 56th Annual IEEE/ACM International Symposium on Microarchitecture
record_format	dspace
spelling	mit-1721.1/153260 2024-01-03T03:29:23Z 2024-01-02T19:51:01Z 2024-01-02T19:51:01Z 2023-10-28 2024-01-01T08:47:54Z Article http://purl.org/eprint/type/ConferencePaper 979-8-4007-0329-4 https://hdl.handle.net/1721.1/153260 Tang, Haotian, Yang, Shang, Liu, Zhijian, Hong, Ke, Yu, Zhongming et al. 2023. "TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs." PUBLISHER_CC en https://doi.org/10.1145/3613424.3614303 Creative Commons Attribution https://creativecommons.org/licenses/by/4.0/ The author(s) application/pdf ACM | 56th Annual IEEE/ACM International Symposium on Microarchitecture
title	TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs
url	https://hdl.handle.net/1721.1/153260
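
The description above contrasts the gather-GEMM-scatter dataflow with fused dataflows such as implicit GEMM. As a rough illustration only, the gather-GEMM-scatter idea can be sketched in plain Python. This is not TorchSparse++ code or its API: the function names (`build_kernel_maps`, `matmul`, `sparse_conv`), the dictionary-based weight layout, and the assumption that output coordinates default to the input coordinates are all hypothetical choices made for this sketch.

```python
from itertools import product


def build_kernel_maps(in_coords, out_coords, kernel_size=3):
    """For each kernel offset, list the (input_index, output_index) pairs it connects."""
    index_of = {c: i for i, c in enumerate(in_coords)}
    half = kernel_size // 2
    dims = len(in_coords[0])
    maps = {}
    for offset in product(range(-half, half + 1), repeat=dims):
        pairs = []
        for j, q in enumerate(out_coords):
            p = tuple(a + b for a, b in zip(q, offset))
            i = index_of.get(p)  # only occupied sites contribute
            if i is not None:
                pairs.append((i, j))
        if pairs:
            maps[offset] = pairs
    return maps


def matmul(a, b):
    """Dense matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sparse_conv(in_coords, in_feats, weights, out_coords=None, kernel_size=3):
    """Hypothetical gather-GEMM-scatter sparse convolution.

    weights maps each kernel offset to a C_in x C_out matrix (an assumed layout).
    """
    out_coords = in_coords if out_coords is None else out_coords
    c_out = len(next(iter(weights.values()))[0])
    out = [[0.0] * c_out for _ in out_coords]
    maps = build_kernel_maps(in_coords, out_coords, kernel_size)
    for offset, pairs in maps.items():
        gathered = [in_feats[i] for i, _ in pairs]    # gather input rows
        partial = matmul(gathered, weights[offset])   # one dense GEMM per offset
        for (_, j), row in zip(pairs, partial):       # scatter-accumulate to outputs
            for c, v in enumerate(row):
                out[j][c] += v
    return out


if __name__ == "__main__":
    coords = [(0, 0), (0, 1), (2, 2)]
    feats = [[1.0], [2.0], [4.0]]
    ones = {(dx, dy): [[1.0]] for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
    print(sparse_conv(coords, feats, ones))  # [[3.0], [3.0], [4.0]]
```

The three phases run as separate steps, one dense product per kernel offset, which is what makes this dataflow simple to write; the fused alternatives named in the description overlap the memory access with the computation instead.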
