QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION

In this paper, we propose a pipeline aimed at automatically extracting tables from heterogeneous Web sources, such as HTML pages, PDF files and images. Table extraction is one of the actively developing areas of Information Extraction, for which many applications, libraries and frameworks are currently...

Bibliographic Details
Main Authors:	A. B. Nugumanova, K. S. Apayev, Y. M. Baiburin, M. Mansurova, A. G. Ospan
Format:	Article
Language:	English
Published:	Al-Farabi Kazakh National University 2022-06-01
Series:	Вестник КазНУ. Серия математика, механика, информатика
Subjects:	web-tables; table extraction; table recognition; table understanding; knowledge base population
Online Access:	https://bm.kaznu.kz/index.php/kaznu/article/view/1086/664

_version_	1797937219097329664
author	A. B. Nugumanova, K. S. Apayev, Y. M. Baiburin, M. Mansurova, A. G. Ospan
collection	DOAJ
description	In this paper, we propose a pipeline aimed at automatically extracting tables from heterogeneous Web sources, such as HTML pages, PDF files and images. Table extraction is one of the actively developing areas of Information Extraction, for which many applications, libraries and frameworks are currently being developed. Nevertheless, most of these tools focus on a single specific task, for example, only on recognizing tables presented as images. We propose combining these tasks into a single pipeline that supports the full cycle of table processing, starting with the stages of search, recognition and extraction and ending with the stages of semantic analysis and interpretation, i.e. table understanding. Understanding tables and populating knowledge bases (knowledge graphs) with the meaningful information contained in these tables is the ultimate goal of our design. The first part of the work presents methods for detecting tables on web pages and in PDF documents, as well as for automatically detecting the attributes and values of objects. The second part presents the architecture of the Qurma tool and its structure. The results show the implementation of the parser for the Almaty-Ust-Kamenogorsk air search theme.
first_indexed	2024-04-10T18:41:15Z
format	Article
id	doaj.art-ee7b118681994b5fb54d66abba67b374
institution	Directory Open Access Journal
issn	1563-0277 2617-4871
language	English
last_indexed	2024-04-10T18:41:15Z
publishDate	2022-06-01
publisher	Al-Farabi Kazakh National University
record_format	Article
series	Вестник КазНУ. Серия математика, механика, информатика
spelling	doaj.art-ee7b118681994b5fb54d66abba67b374 2023-02-01T14:36:11Z eng Al-Farabi Kazakh National University Вестник КазНУ. Серия математика, механика, информатика 1563-0277 2617-4871 2022-06-01 114 2 91 100 https://doi.org/10.26577/JMMCS.2022.v114.i2.08 QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION A. B. Nugumanova 0 https://orcid.org/0000-0001-5522-4421 K. S. Apayev 1 https://orcid.org/0000-0001-9292-4785 Y. M. Baiburin M. Mansurova 2 https://orcid.org/0000-0002-9680-2758 A. G. Ospan 3 https://orcid.org/0000-0002-1860-6997 Sarsen Amanzholov East Kazakhstan University D. Serikbayev East Kazakhstan Technical University Al-Farabi Kazakh National University, Kazakhstan, Almaty Al-Farabi Kazakh National University, Kazakhstan, Almaty In this paper, we propose a pipeline aimed at automatically extracting tables from heterogeneous Web sources, such as HTML pages, PDF files and images. Table extraction is one of the actively developing areas of Information Extraction, for which many applications, libraries and frameworks are currently being developed. Nevertheless, most of these tools focus on a single specific task, for example, only on recognizing tables presented as images. We propose combining these tasks into a single pipeline that supports the full cycle of table processing, starting with the stages of search, recognition and extraction and ending with the stages of semantic analysis and interpretation, i.e. table understanding. Understanding tables and populating knowledge bases (knowledge graphs) with the meaningful information contained in these tables is the ultimate goal of our design. The first part of the work presents methods for detecting tables on web pages and in PDF documents, as well as for automatically detecting the attributes and values of objects. The second part presents the architecture of the Qurma tool and its structure. The results show the implementation of the parser for the Almaty-Ust-Kamenogorsk air search theme. https://bm.kaznu.kz/index.php/kaznu/article/view/1086/664 web-tables; table extraction; table recognition; table understanding; knowledge base population
title	QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION
topic	web-tables; table extraction; table recognition; table understanding; knowledge base population
url	https://bm.kaznu.kz/index.php/kaznu/article/view/1086/664

Similar Items