FLAG-PDFe: Features Oriented Metadata Extraction Framework for Scientific Publications

The unprecedented growth of the research publications in diversified domains has overwhelmed the research community. It requires a cumbersome process to extract this enormous information by manually analyzing these research documents. To automatically extract content of a document in a structured wa...

Bibliographic Details
Main Authors:	Muhammad Waqas Ahmed, Muhammad Tanvir Afzal
Format:	Article
Language:	English
Published:	IEEE 2020-01-01
Series:	IEEE Access
Subjects:	Machine learning; research article; metadata extraction; text patterns; document structure analysis
Online Access:	https://ieeexplore.ieee.org/document/9102282/

_version_	1818853393950572544
author	Muhammad Waqas Ahmed, Muhammad Tanvir Afzal
author_facet	Muhammad Waqas Ahmed, Muhammad Tanvir Afzal
author_sort	Muhammad Waqas Ahmed
collection	DOAJ
description	The unprecedented growth of research publications in diversified domains has overwhelmed the research community. Extracting this enormous body of information by manually analyzing research documents is a cumbersome process. To automatically extract the content of a document in a structured way, metadata and content must be annotated. The scientific community has been focusing on automatic extraction of content by forming different heuristics and applying different machine learning techniques. One of the renowned conference organizers, ESWC, organizes a state-of-the-art challenge to extract metadata such as authors, affiliations, countries in affiliations, supplementary material, sections, tables, figures, funding agencies, and EU-funded projects from PDF files of research articles. We propose a feature-centric technique that can be used to extract the logical layout structure of articles from publishers with diversified composition styles. To extract unique metadata from a research article placed in a logical layout structure, we have developed a novel four-stage approach, “FLAG-PDFe”. The approach is built upon distinct and generic features based on the textual and geometric information from the raw content of research documents. At the first stage, the distinct features are used to identify different physical layout components of an individual article. Since research journals follow their own publishing styles and layout formats, we develop generic features to handle these diversified publishing patterns. We employ support vector classification (SVC) in the third stage to extract the logical layout structure (LLS), i.e., the sections of an article, after performing a comprehensive evaluation of generic features and machine learning models. Finally, we apply heuristics on the LLS to extract the desired metadata of an article. The outcomes of the study are obtained using the gold standard data set. The results yield a recall of 0.877, a precision of 0.928, and an F-measure of 0.897. Our approach achieves a 16% gain in F-measure compared to the best approach of the ESWC challenge.
first_indexed	2024-12-19T07:36:07Z
format	Article
id	doaj.art-268fb64774804b74a014664c85078515
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-12-19T07:36:07Z
publishDate	2020-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-268fb64774804b74a014664c85078515; 2022-12-21T20:30:35Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2020-01-01; vol. 8; pp. 99458-99469; DOI 10.1109/ACCESS.2020.2997907; article no. 9102282; FLAG-PDFe: Features Oriented Metadata Extraction Framework for Scientific Publications; Muhammad Waqas Ahmed (https://orcid.org/0000-0002-4563-8951), Department of Computer Science, Capital University of Science and Technology, Islamabad, Pakistan; Muhammad Tanvir Afzal, Department of Computer Science, Capital University of Science and Technology, Islamabad, Pakistan; https://ieeexplore.ieee.org/document/9102282/; Machine learning; research article; metadata extraction; text patterns; document structure analysis
title	FLAG-PDFe: Features Oriented Metadata Extraction Framework for Scientific Publications
title_sort	flag pdfe features oriented metadata extraction framework for scientific publications
topic	Machine learning; research article; metadata extraction; text patterns; document structure analysis
url	https://ieeexplore.ieee.org/document/9102282/

Similar Items