A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application

Abstract Background Using visual, biological, and electronic health records data as the sole input source, pretrained convolutional neural networks and conventional machine learning methods have been heavily employed for the identification of various malignancies. Initially, a series of preprocessin...

Bibliographic Details
Main Authors: Mpho Mokoatle, Vukosi Marivate, Darlington Mapiye, Riana Bornman, Vanessa. M. Hayes
Format: Article
Language: English
Published: BMC 2023-03-01
Series: BMC Bioinformatics
Subjects: Cancer detection; DNA; Machine learning; SentenceBert; SimCSE
Online Access: https://doi.org/10.1186/s12859-023-05235-x
author Mpho Mokoatle
Vukosi Marivate
Darlington Mapiye
Riana Bornman
Vanessa. M. Hayes
collection DOAJ
description Abstract. Background: Pretrained convolutional neural networks and conventional machine learning methods have been heavily employed to identify various malignancies, using visual, biological, and electronic health records data as the sole input source. Typically, a series of preprocessing and image segmentation steps is first performed to extract region-of-interest features from noisy features. The extracted features are then passed to several machine learning and deep learning methods for the detection of cancer. Methods: This work provides a review of all the methods that have been applied to develop machine learning algorithms that detect cancer. Because there are more than 100 types of cancer, this study examines only research on the four most common and prevalent cancers worldwide: lung, breast, prostate, and colorectal cancer. Next, using the state-of-the-art sentence transformers SBERT (2019) and the unsupervised SimCSE (2021), this study proposes a new methodology for detecting cancer. The method requires raw DNA sequences of matched tumor/normal pairs as the only input. The learnt DNA representations retrieved from SBERT and SimCSE are then passed to machine learning algorithms (XGBoost, Random Forest, LightGBM, and CNNs) for classification. As far as we are aware, SBERT and SimCSE transformers have not previously been applied to represent DNA sequences in cancer detection settings. Results: The XGBoost model was the best-performing classifier, with the highest overall accuracy of 73 ± 0.13% using SBERT embeddings and 75 ± 0.12% using SimCSE embeddings. In light of these findings, it can be concluded that incorporating sentence representations from SimCSE’s sentence transformer only marginally improved the performance of machine learning models.
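The pipeline shape described in the abstract (raw DNA sequence, then a sentence-style representation, then a classifier) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the record does not say how sequences are turned into "sentences", so overlapping k-mers are assumed; `toy_embed` is a bag-of-k-mers stand-in for the SBERT/SimCSE encoders; and the nearest-neighbour `predict` is a stand-in for XGBoost and the other classifiers. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def kmer_sentence(seq, k=3):
    """Turn a DNA string into a space-separated 'sentence' of overlapping k-mers.

    Assumption: k-mer tokenisation is one common way to feed DNA to text models;
    the catalog record does not state which scheme the paper uses.
    """
    seq = seq.upper()
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def toy_embed(sentence):
    """Stand-in for an SBERT/SimCSE encoder: a sparse bag-of-k-mers count vector."""
    return Counter(sentence.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)  # Counter yields 0 for missing keys
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def predict(query_vec, labelled_vecs):
    """Stand-in classifier: return the label of the most similar training vector."""
    best_label, _ = max(labelled_vecs, key=lambda pair: cosine(query_vec, pair[1]))
    return best_label


# Hypothetical toy data, not taken from the study.
training = [
    ("tumor", toy_embed(kmer_sentence("ACGTACGT"))),
    ("normal", toy_embed(kmer_sentence("GGCCGGCC"))),
]
label = predict(toy_embed(kmer_sentence("ACGTACGA")), training)
print(label)  # prints: tumor
```

In a real setting the stand-ins would be replaced by a trained sentence encoder and a gradient-boosted or tree-based classifier; the sketch only shows how the three stages connect.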
first_indexed 2024-04-09T21:35:24Z
format Article
id doaj.art-07b74d23f28a472db760ff1b38bfeeac
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-04-09T21:35:24Z
publishDate 2023-03-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-07b74d23f28a472db760ff1b38bfeeac 2023-03-26T11:18:44Z eng BMC BMC Bioinformatics 1471-2105 2023-03-01 241125 10.1186/s12859-023-05235-x
A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application
Mpho Mokoatle (Department of Computer Science, University of Pretoria)
Vukosi Marivate (Department of Computer Science, University of Pretoria)
Darlington Mapiye (CapeBio TM Technologies)
Riana Bornman (School of Health Systems and Public Health, University of Pretoria)
Vanessa. M. Hayes (School of Medical Sciences, The University of Sydney)
title A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application
topic Cancer detection
DNA
Machine learning
SentenceBert
SimCSE
url https://doi.org/10.1186/s12859-023-05235-x