A Novel Two-Fold Loss Function for Data Clustering and Reconstruction: Application to Document Analysis

In the midst of the ongoing COVID-19 pandemic, there has been a surge in scientific literature aimed at understanding the virus and its impact. However, it has become challenging for a researcher to deal with thousands of articles published daily. This paper proposes a novel deep-learning architectu...


Bibliographic Details
Main Authors:	Mebarka Allaoui, Mohammed Lamine Kherfi, Oussama Aiadi, Samir Brahim Belhaouari
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	Clustering; COVID-19; deep learning; dimensionality reduction; document organization; topic modeling
Online Access:	https://ieeexplore.ieee.org/document/10242111/

_version_	1797685347393470464
author	Mebarka Allaoui, Mohammed Lamine Kherfi, Oussama Aiadi, Samir Brahim Belhaouari
author_facet	Mebarka Allaoui, Mohammed Lamine Kherfi, Oussama Aiadi, Samir Brahim Belhaouari
author_sort	Mebarka Allaoui
collection	DOAJ
description	Amid the ongoing COVID-19 pandemic, there has been a surge in scientific literature aimed at understanding the virus and its impact. However, it has become challenging for researchers to keep up with the thousands of articles published daily. This paper proposes a novel deep-learning architecture that organizes a large dataset of COVID-19-related scientific literature and provides a clear overview of the current state of knowledge. The proposed model rests on two main ideas intended to ensure robustness and efficiency. First, we train a denoising autoencoder on both clean and noisy data so that the model balances preserving the underlying structure of the data against generalizing to new, unseen data. Second, and this is the cornerstone of the proposed architecture, we train the autoencoder with a two-fold objective function that jointly incorporates data reconstruction and clustering. This combination avoids distorting the latent space and improves the model's efficiency. Afterward, we use Latent Dirichlet Allocation (LDA) to analyze the documents' topics. For computational efficiency, instead of feeding LDA the whole document dataset, we feed it the clusters produced in the dimensionality reduction and clustering phase and count the frequency of topics in each cluster. The model was trained on a large public corpus of COVID-19-related articles and assessed with a set of evaluation metrics. Experimental results indicate that the proposed model outperforms several recent studies.
first_indexed	2024-03-12T00:43:51Z
format	Article
id	doaj.art-195fbe2d447a442c8c734b53db7bd87b
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-03-12T00:43:51Z
publishDate	2023-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-195fbe2d447a442c8c734b53db7bd87b; 2023-09-14T23:00:33Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11, pp. 96923-96938; DOI 10.1109/ACCESS.2023.3312622; IEEE document 10242111; A Novel Two-Fold Loss Function for Data Clustering and Reconstruction: Application to Document Analysis; Mebarka Allaoui (0, https://orcid.org/0000-0002-1175-6087); Mohammed Lamine Kherfi (1); Oussama Aiadi (2, https://orcid.org/0000-0002-4102-1735); Samir Brahim Belhaouari (3, https://orcid.org/0000-0003-2336-0490); affiliations, in author order: Department of Computer Science and Information Technologies, University Kasdi Merbah Ouargla (UKMO), Ouargla, Algeria; National Higher School of Artificial Intelligence, Algiers, Algeria; Department of Computer Science and Information Technologies, University Kasdi Merbah Ouargla (UKMO), Ouargla, Algeria; Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar; https://ieeexplore.ieee.org/document/10242111/; Clustering; COVID-19; deep learning; dimensionality reduction; document organization; topic modeling
title	A Novel Two-Fold Loss Function for Data Clustering and Reconstruction: Application to Document Analysis
topic	Clustering; COVID-19; deep learning; dimensionality reduction; document organization; topic modeling
url	https://ieeexplore.ieee.org/document/10242111/
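
The two-fold objective summarized in the description (a reconstruction term combined with a clustering term, computed for a denoising autoencoder) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not give the form of either term, so the mean-squared reconstruction error, the nearest-centroid clustering term, the `weight` parameter, the function name `two_fold_loss`, and the toy linear encoder and decoder are all assumptions chosen only for illustration.

```python
import numpy as np

def two_fold_loss(x_clean, x_recon, z, centroids, weight=0.1):
    """Reconstruction term plus weighted clustering term (both forms assumed)."""
    # Reconstruction: mean squared error between the clean input and the decoder output.
    recon = np.mean((x_clean - x_recon) ** 2)
    # Clustering: mean squared distance from each latent point to its nearest centroid.
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    cluster = np.mean(d2.min(axis=1))
    # Return the combined loss and the hard cluster assignment of each point.
    return recon + weight * cluster, d2.argmin(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                    # six toy "documents", four features each
x_noisy = x + 0.1 * rng.normal(size=x.shape)   # corrupted copy fed to the encoder
W = rng.normal(size=(4, 2))                    # hypothetical linear encoder weights
z = x_noisy @ W                                # latent codes
x_hat = z @ W.T                                # hypothetical tied-weight linear decoder
centroids = z[:2].copy()                       # two centroids seeded from the first two codes

loss, labels = two_fold_loss(x, x_hat, z, centroids)
print(loss, labels)
```

In this sketch the reconstruction is compared against the clean input while the encoder sees the noisy copy, which is the usual denoising setup; the resulting `labels` are the kind of per-document cluster assignment that could then be used to group documents before topic analysis.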
