Classification of Full Text Biomedical Documents: Sections Importance Assessment

The exponential growth of documents on the web makes it very hard for researchers to be aware of the relevant work being done within the scientific community. The task of efficiently retrieving information has therefore become an important research topic. The objective of this study is to test how t...


Bibliographic Details
Main Authors:	Carlos Adriano Oliveira Gonçalves, Rui Camacho, Célia Talma Gonçalves, Adrián Seara Vieira, Lourdes Borrajo Diz, Eva Lorenzo Iglesias
Format:	Article
Language:	English
Published:	MDPI AG 2021-03-01
Series:	Applied Sciences
Subjects:	full text classification; preprocessing techniques; section weighing scheme; information retrieval
Online Access:	https://www.mdpi.com/2076-3417/11/6/2674

collection	DOAJ
description	The exponential growth of documents on the web makes it very hard for researchers to be aware of the relevant work being done within the scientific community. The task of efficiently retrieving information has therefore become an important research topic. The objective of this study is to test how the efficiency of text classification changes if different weights are assigned in advance to the sections that compose the documents. The proposal takes into account the place (section) where terms are located in the document, and each section has a weight that can be modified depending on the corpus. To carry out the study, an extended version of the OHSUMED corpus with full documents has been created. Through the use of WEKA, we compared the use of abstracts only with that of full texts, as well as the use of section weighing combinations, to assess their significance in the scientific article classification process using SMO (Sequential Minimal Optimization), the WEKA implementation of the Support Vector Machine (SVM) algorithm. The experimental results show that the proposed combinations of preprocessing techniques and feature selection achieve promising results for the task of full text scientific document classification. We also have evidence to conclude that datasets enriched with text from certain sections achieve better results than those using only titles and abstracts.
first_indexed	2024-03-10T13:10:10Z
id	doaj.art-f16425566605407f96e0ead99ae874d1
institution	Directory of Open Access Journals
issn	2076-3417
last_indexed	2024-03-10T13:10:10Z
doi	10.3390/app11062674
volume	11
issue	6
article_number	2674
affiliation	Carlos Adriano Oliveira Gonçalves: Computer Science Department, University of Vigo, Escuela Superior de Ingeniería Informática, 32004 Ourense, Spain
affiliation	Rui Camacho: Faculdade de Engenharia da Universidade do Porto, LIAAD-INESC TEC, 4200-465 Porto, Portugal
affiliation	Célia Talma Gonçalves: ISCAP—P.PORTO, CEOS.PP, LIACC, Campus da FEUP, 4369-00 Porto, Portugal
affiliation	Adrián Seara Vieira: Computer Science Department, University of Vigo, Escuela Superior de Ingeniería Informática, 32004 Ourense, Spain
affiliation	Lourdes Borrajo Diz: Computer Science Department, University of Vigo, Escuela Superior de Ingeniería Informática, 32004 Ourense, Spain
affiliation	Eva Lorenzo Iglesias: Computer Science Department, University of Vigo, Escuela Superior de Ingeniería Informática, 32004 Ourense, Spain


Similar Items