Swin Unet3D: a three-dimensional medical image segmentation network combining vision transformer and convolution

Bibliographic Details
Main Authors: Yimin Cai, Yuqing Long, Zhenggong Han, Mingkun Liu, Yuchen Zheng, Wei Yang, Liming Chen
Format: Article
Language: English
Published: BMC 2023-02-01
Series: BMC Medical Informatics and Decision Making
Subjects: Deep learning; Medical image segmentation; 3D Swin Transformer; Brain tumor
Online Access:https://doi.org/10.1186/s12911-023-02129-z
collection DOAJ
description Abstract Background Semantic segmentation of brain tumors plays a critical role in clinical treatment, especially for three-dimensional (3D) magnetic resonance imaging, which is widely used in clinical practice. Automatic segmentation of the 3D structure of brain tumors can quickly help physicians understand tumor properties such as shape and size, improving the efficiency of preoperative planning and the odds of successful surgery. In past decades, 3D convolutional neural networks (CNNs) have dominated automatic segmentation methods for 3D medical images, and these networks have achieved good results. However, to limit the number of parameters, practitioners generally keep 3D convolutional kernels no larger than $$7 \times 7 \times 7$$, which leaves CNNs with a limited ability to learn long-distance dependencies. The Vision Transformer (ViT) is very good at learning long-distance dependencies in images, but it has a large number of parameters and, when training data are insufficient, its early layers fail to learn local dependencies. In the image segmentation task, however, the ability of these early layers to learn local dependencies has a large impact on model performance. Methods This paper proposes the Swin Unet3D model, which formulates voxel segmentation of medical images as a sequence-to-sequence prediction. The feature-extraction sub-module of the model is designed as a parallel structure of convolution and ViT branches, so that every layer of the model can adequately learn both global and local dependencies in the image. Results On the Brats2021 validation dataset, the proposed model achieves Dice coefficients of 0.840, 0.874, and 0.911 on the ET, TC, and WT channels, respectively.
On the Brats2018 validation dataset, the model achieves Dice coefficients of 0.716, 0.761, and 0.874 on the corresponding channels. Conclusion We propose a new segmentation model that combines the advantages of the Vision Transformer and convolution and achieves a better balance between the number of model parameters and segmentation accuracy. The code is available at https://github.com/1152545264/SwinUnet3D.
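The Dice coefficients reported above compare a predicted tumor mask against the ground-truth mask, computed separately for each channel (ET, TC, WT). As a minimal illustration only (the record does not describe the evaluation pipeline, and BraTS tooling adds further conventions), the standard Dice formula over binary 3D masks can be sketched as:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) between two binary 3D masks (1 = tumor voxel).

    eps guards against division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Toy 4x4x4 volumes: identical masks score 1.0.
a = np.zeros((4, 4, 4))
a[1:3, 1:3, 1:3] = 1
b = a.copy()
print(dice_coefficient(a, b))  # 1.0
```

In practice each of the three channels is binarized and scored independently with this formula, giving the three per-channel values quoted in the abstract.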
id doaj.art-3278ee11233246a9988c144703167f24
institution Directory Open Access Journal
issn 1472-6947
affiliations:
Yimin Cai: School of Medical, Guizhou University
Yuqing Long: School of Stomatology, ZunYi Medical University
Zhenggong Han: Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University
Mingkun Liu: School of Medical, Guizhou University
Yuchen Zheng: School of Medical, Guizhou University
Wei Yang: School of Medical, Guizhou University
Liming Chen: Guiyang Dental Hospital (Dental Hospital of Guizhou University), Guizhou University
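The abstract describes the feature-extraction sub-module as a parallel structure of convolution (local dependencies) and ViT (global dependencies). The sketch below illustrates that idea only; the tiny shapes, single-head full attention over all voxels, random projections, and additive fusion are assumptions for demonstration, not the paper's actual design, which uses 3D Swin window attention:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_branch(x: np.ndarray, k: int = 3) -> np.ndarray:
    """Naive single-channel 3D convolution with zero padding -- the local branch."""
    kernel = rng.standard_normal((k, k, k)) * 0.1
    pad = k // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    D, H, W = x.shape
    for d in range(D):
        for h in range(H):
            for w in range(W):
                out[d, h, w] = np.sum(xp[d:d + k, h:h + k, w:w + k] * kernel)
    return out

def attention_branch(x: np.ndarray, dim: int = 8) -> np.ndarray:
    """Single-head self-attention over all flattened voxels -- the global branch."""
    n = x.size
    tokens = x.reshape(n, 1)                      # each voxel is a 1-dim token
    Wq, Wk, Wv = (rng.standard_normal((1, dim)) * 0.1 for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(dim)               # (n, n) pairwise attention scores
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
    out = attn @ v                                # (n, dim)
    return out.mean(axis=-1).reshape(x.shape)     # collapse back to one channel

def parallel_block(x: np.ndarray) -> np.ndarray:
    """Fuse the local and global branches; addition is one simple fusion choice."""
    return conv_branch(x) + attention_branch(x)

x = rng.standard_normal((4, 4, 4))
y = parallel_block(x)
print(y.shape)  # (4, 4, 4)
```

The key property this demonstrates is that every layer sees both a small receptive field (the kernel) and the full volume (the attention map) at once, which is the stated motivation for the parallel design.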
topic Deep learning
Medical image segmentation
3D Swin Transformer
Brain tumor
url https://doi.org/10.1186/s12911-023-02129-z